Symptoms
DNS propagation delays are among the most operationally disruptive issues in infrastructure work. You have updated a record on your authoritative nameserver — changed an A record, swapped an MX endpoint, removed a deprecated CNAME — and yet hours later, behavior is inconsistent. Some users reach the new destination; others are still hitting the old one. The discrepancy is hard to reproduce because it depends entirely on which resolver a client happens to be using.
Common symptoms include:
- Running dig solvethenetwork.com @8.8.8.8 returns the new IP, but dig solvethenetwork.com @1.1.1.1 still returns the old one
- Users in different geographic regions or on different ISPs receive different IP addresses for the same hostname
- After a server migration, the decommissioned host continues receiving production traffic
- TLS certificate issuance via ACME fails intermittently because the ACME challenge DNS record resolves to the wrong IP from the CA's resolver
- Email delivery bounces or defers because MX changes haven't reached the recipient's mail server resolver
- Monitoring and health checks report the new record while end-user support tickets report the old behavior
- A newly created subdomain returns NXDOMAIN from some resolvers even though it exists on the authoritative server
Propagation delays are rarely caused by a single failure. They are typically a combination of several independently-caching systems — each operating on its own TTL clock — that must all expire their cached data before the change is universally visible. Diagnosing the problem requires interrogating each layer of the resolution stack in isolation.
Root Cause 1: TTL Too High on the Old Record
Why It Happens
The Time-to-Live (TTL) field on every DNS resource record instructs recursive resolvers how long they are permitted to serve that record from cache before they must re-query the authoritative server. If a record carried a TTL of 86400 (24 hours) at the time a resolver last fetched it, that resolver is fully compliant with RFC 1035 when it continues serving the cached answer for the next 24 hours — even after you've changed the record on the authoritative server. This is not a bug; it is DNS behaving exactly as designed.
The failure mode is almost always a process failure, not a technical one. An operator makes a record change and simultaneously lowers the TTL in the same zone file edit. By the time any resolver sees the new TTL value, it has already cached the old record under the old TTL, so the lower TTL has no effect on the current cache cycle. It only benefits the next re-fetch — which still won't happen until the original high TTL has expired.
How to Identify It
Query the authoritative nameserver directly to see the TTL currently published in the zone:
dig @ns1.solvethenetwork.com solvethenetwork.com A +norecurse
;; ANSWER SECTION:
solvethenetwork.com. 86400 IN A 10.10.1.75
Then query a public recursive resolver. The TTL in the answer will count down from the value at the time of caching:
dig @8.8.8.8 solvethenetwork.com A
;; ANSWER SECTION:
solvethenetwork.com. 82341 IN A 10.10.1.50
A remaining TTL of 82341 means the resolver cached this entry approximately 4059 seconds ago (86400 − 82341) and will continue serving the stale answer for another 22.8 hours. The authoritative server has the new record (10.10.1.75) but the resolver still has the old one (10.10.1.50) locked in for nearly a full day.
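This arithmetic is easy to script when you are comparing many resolvers. A minimal sketch (the helper name and example values are illustrative, not a standard tool):

```shell
#!/bin/sh
# Given the TTL published on the authoritative server and the TTL
# observed in a recursive resolver's answer, report how long ago the
# resolver cached the record and how much longer it will serve it.
cache_age() {
  auth_ttl=$1       # authoritative TTL, e.g. 86400
  observed_ttl=$2   # TTL seen at the resolver, e.g. 82341
  age=$((auth_ttl - observed_ttl))
  echo "cached ${age}s ago; stale for another ${observed_ttl}s"
}

cache_age 86400 82341   # cached 4059s ago; stale for another 82341s
```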
How to Fix It
The only correct procedure is to lower the TTL well in advance of any planned record change — before the change window, not during it. The lead time must equal at least one full current TTL cycle so that all resolvers that were holding the record under the old TTL have had a chance to refresh and pick up the new low TTL.
; Step 1: 24+ hours before the change window, lower the TTL
solvethenetwork.com. 300 IN A 10.10.1.50
; Step 2: After one full TTL cycle, make the record change
solvethenetwork.com. 300 IN A 10.10.1.75
; Step 3: After confirming propagation, restore TTL to operational value
solvethenetwork.com. 86400 IN A 10.10.1.75
If you are already past the point of no return — the record has changed but the old high-TTL version is cached everywhere — there is no shortcut. You must wait. You can monitor progress by querying multiple public resolvers every few minutes and watching the TTL value count down toward zero. When it hits zero, that resolver will re-query and pick up the new record.
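That monitoring step can be wrapped in a small loop over several public resolvers. A sketch, assuming an illustrative resolver list; the watch_propagation and extract_ttl helpers are not standard tools:

```shell
#!/bin/sh
# Poll several public resolvers and print each answer with its remaining
# TTL, so you can watch cached entries count down toward expiry.

# Extract the TTL (second field) from a dig +noall +answer line such as:
#   solvethenetwork.com. 82341 IN A 10.10.1.50
extract_ttl() {
  echo "$1" | awk '{print $2}'
}

watch_propagation() {
  domain=$1
  for resolver in 8.8.8.8 1.1.1.1 9.9.9.9; do
    answer=$(dig "@$resolver" "$domain" A +noall +answer | head -n1)
    printf '%-12s ttl=%-8s %s\n' "$resolver" "$(extract_ttl "$answer")" "$answer"
  done
}

# Example invocation (requires network access):
# watch_propagation solvethenetwork.com
```

Run it under watch or in a cron loop to see which resolvers have refreshed and which are still counting down.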
Root Cause 2: Negative Cache Not Expired
Why It Happens
RFC 2308 defines negative caching: when a resolver queries for a record that does not exist (NXDOMAIN response) or queries for a record type that has no entries for a name (NOERROR/NODATA), the resolver caches that negative result. The duration of that negative cache is governed by the MINIMUM field of the zone's SOA record — commonly called the negative TTL.
This creates two distinct problem scenarios. First, if you create a new hostname or record type that previously did not exist, any resolver that already queried and received an NXDOMAIN response will refuse to re-query until the negative TTL expires — even though the record now exists. Second, if a record was briefly absent during a zone reload, BIND restart, or accidental deletion, resolvers may have cached an NXDOMAIN during that window. Those resolvers will continue returning NXDOMAIN until the negative cache expires, regardless of how quickly you restore the record.
How to Identify It
Inspect the SOA record to determine the negative TTL value (the last field in the SOA RDATA):
dig solvethenetwork.com SOA +short
ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. 2024040601 3600 900 604800 300
The seventh field (300) is the negative TTL in seconds. To confirm that a remote resolver has cached a negative response, query it and inspect the status and authority section:
dig @8.8.8.8 api.solvethenetwork.com A
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 44271
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; AUTHORITY SECTION:
solvethenetwork.com. 287 IN SOA ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. 2024040601 3600 900 604800 300
Status NXDOMAIN combined with a SOA record in the authority section is the definitive signature of a cached negative response. The TTL on the SOA entry (287) is the remaining seconds before this negative cache expires and the resolver will try again.
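If you want to script that check, the remaining negative-cache TTL can be pulled straight out of the authority section. A sketch; the helper names are illustrative:

```shell
#!/bin/sh
# Estimate when a resolver will retry after caching an NXDOMAIN: the
# remaining negative-cache time is the TTL on the SOA record in the
# authority section of its response.

# Parse the TTL (second field) from an authority-section SOA line.
soa_ttl() {
  echo "$1" | awk '$4 == "SOA" {print $2}'
}

# Query a resolver and report the remaining negative-cache seconds.
negative_ttl_remaining() {
  line=$(dig "@$1" "$2" A +noall +authority | head -n1)
  soa_ttl "$line"
}

# Example invocation (requires network access):
# negative_ttl_remaining 8.8.8.8 api.solvethenetwork.com
```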
How to Fix It
Reduce the negative TTL in the SOA record for zones subject to frequent changes. Edit the zone file on sw-infrarunbook-01 and set the seventh SOA field to 60 seconds:
$TTL 86400
@ IN SOA ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. (
2024040602 ; Serial — increment after every change
3600 ; Refresh
900 ; Retry
604800 ; Expire
60 ) ; Negative TTL — reduced from 300 to 60
After editing, increment the serial and reload the zone:
rndc reload solvethenetwork.com
server: reloading zone 'solvethenetwork.com/IN': success
For records that were just added and have already been negatively cached, you must wait for the negative TTL to expire on each affected resolver. You cannot remotely flush external caches. The only workaround for end users who cannot wait is to point them at a resolver with a fresher cache, or have them flush their local OS-level cache.
Root Cause 3: ISP Resolver Caching Stale Data
Why It Happens
ISP-operated recursive resolvers are shared infrastructure serving potentially hundreds of thousands of subscribers. Some of these resolvers implement aggressive caching strategies that deliberately extend TTL values beyond what the authoritative server advertises. This behavior — sometimes called TTL stretching — reduces the resolver's upstream query volume and improves perceived response times for subscribers. It is technically non-compliant with RFC 1035 but is common enough in the wild to be a regular source of propagation complaints.
Even when an ISP resolver faithfully honors TTL values, its query volume matters: a heavily used resolver may receive thousands of queries for solvethenetwork.com per minute. Those queries do not extend the cache lifetime; a compliant resolver expires the entry when its TTL countdown reaches zero, no matter how often it is served in the meantime. But busy resolvers often enable prefetching (re-fetching popular entries shortly before they expire) or serve-stale behavior (RFC 8767, answering from expired data while a refresh is in flight), and either feature can keep a popular record visible past its nominal TTL.
How to Identify It
The telltale sign is divergence between well-known public resolvers and a specific ISP's resolver. Query each vantage point and compare:
# Ground truth — authoritative server
dig @ns1.solvethenetwork.com solvethenetwork.com A +norecurse +short
10.10.1.75
# Google Public DNS
dig @8.8.8.8 solvethenetwork.com A +short
10.10.1.75
# Cloudflare DNS
dig @1.1.1.1 solvethenetwork.com A +short
10.10.1.75
# ISP-assigned resolver (RFC 1918 address seen via DHCP)
dig @192.168.100.1 solvethenetwork.com A +short
10.10.1.50 # still returning old IP
If the authoritative server and multiple public resolvers all return the new record but a specific ISP resolver returns the old one, and the record's published TTL has long since expired, ISP-side TTL stretching or aggressive caching is the cause.
How to Fix It
You have no direct mechanism to flush a third-party ISP's resolver cache. Your practical options are:
- Wait: Even ISP resolvers with aggressive caching eventually expire entries. Most will re-query within hours even if they ignore the published TTL.
- Redirect users temporarily: Instruct affected users to configure their system DNS to 8.8.8.8 or 1.1.1.1 until the ISP cache clears.
- Contact the ISP NOC: Some ISPs will flush their resolver cache for a specific domain on request. Success rate is variable, but worth attempting for critical migrations.
- Proxy old to new: If traffic volume is critical, keep the old server running with a reverse proxy or redirect rule pointing to the new server while the ISP cache drains.
The most effective long-term mitigation is strict pre-migration TTL reduction. If the record's TTL was already at 300 seconds for 24 hours before the change, even a misbehaving ISP resolver will re-fetch within 5 minutes of each subscriber's query cycle.
Root Cause 4: Authoritative Server Not Updated
Why It Happens
Production DNS deployments almost universally involve a primary authoritative server and one or more secondary servers. Changes applied to the primary do not automatically appear on secondaries. Secondaries must either receive a DNS NOTIFY message from the primary and then pull an IXFR or AXFR, or they must poll the primary periodically (every Refresh interval from the SOA) and detect a serial number change. If the NOTIFY is lost, the secondary's poll interval is long, or the transfer itself fails silently, the secondary will continue serving the old zone data indefinitely.
This is particularly insidious because a zone's NS records list all authoritative servers equally. Recursive resolvers typically round-robin across listed nameservers or select them by latency. This means some resolvers will hit the updated primary and get the correct answer, while others hit the stale secondary and get the old answer. The result is non-deterministic: the same query from two different resolvers returns different answers, and neither client can explain the discrepancy.
How to Identify It
Query each listed authoritative server directly and compare SOA serials and record values:
# Check SOA serial on primary
dig @ns1.solvethenetwork.com solvethenetwork.com SOA +short
ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. 2024040602 3600 900 604800 300
# Check SOA serial on secondary
dig @ns2.solvethenetwork.com solvethenetwork.com SOA +short
ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. 2024040601 3600 900 604800 300
The serial mismatch (2024040602 vs 2024040601) confirms the secondary is serving a stale zone version. Verify by querying the specific record that was changed on each server:
dig @ns1.solvethenetwork.com www.solvethenetwork.com A +short
10.10.1.75
dig @ns2.solvethenetwork.com www.solvethenetwork.com A +short
10.10.1.50 # old IP — secondary is stale
How to Fix It
On sw-infrarunbook-01 (the primary), force a NOTIFY to all configured secondaries:
rndc notify solvethenetwork.com
zone 'solvethenetwork.com' is now notified
Watch the BIND log for confirmation that the transfer completed cleanly:
tail -f /var/log/named/named.log
06-Apr-2024 14:22:10.341 notify: info: zone solvethenetwork.com/IN: sending notifies (serial 2024040602)
06-Apr-2024 14:22:10.502 xfer-in: info: transfer of 'solvethenetwork.com/IN' from 10.10.1.10#53: Transfer completed: 1 messages, 14 records, 476 bytes, 0.001 secs (476000 bytes/sec)
If NOTIFY fails, manually trigger a retransfer from the secondary:
# Run on the secondary nameserver
rndc retransfer solvethenetwork.com
If zone transfers are being blocked, verify the primary's named.conf allows the secondary's IP under allow-transfer, and confirm that TCP/53 is open between the servers at the firewall level. Zone transfers always use TCP; a firewall that only permits UDP/53 will silently block all transfers.
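For reference, a minimal sketch of what the relevant named.conf stanza on the primary might look like. The key name, secret, file path, and secondary address below are placeholders, not values from this runbook:

```
// named.conf on the primary: restrict zone transfers to the known
// secondary and require a TSIG key.
key "xfer-key" {
    algorithm hmac-sha256;
    secret "base64-secret-goes-here==";   // placeholder secret
};

zone "solvethenetwork.com" {
    type primary;
    file "/etc/bind/zones/solvethenetwork.com.db";
    allow-transfer { key "xfer-key"; };
    also-notify { 10.10.1.20; };          // placeholder secondary IP
};
```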
Root Cause 5: Partial Zone Transfer
Why It Happens
A zone transfer — whether a full AXFR or an incremental IXFR — can fail midway through due to a network interruption, a TCP session timeout, a firewall stateful table overflow, or a TSIG authentication failure on a single packet in a multi-packet sequence. When this occurs, the secondary may apply only part of the zone changes, leaving the zone in an internally inconsistent state: some records reflect the new version, others still carry old values. Critically, the secondary's SOA serial may or may not reflect the intended zone version depending on exactly when in the transfer process the failure occurred.
IXFR transfers are more susceptible to this failure mode than full AXFR transfers. An IXFR conveys only the diff — the sequence of deletions and additions between two zone serial numbers. If that diff is complex (many records changed in a single serial increment) or if the transfer spans multiple TCP segments, a mid-transfer reset can leave the secondary in an indeterminate state that is difficult to detect by serial comparison alone.
How to Identify It
The warning sign of a partial transfer is matching serial numbers across servers accompanied by mismatched record data. Start with serial comparison:
dig @ns1.solvethenetwork.com solvethenetwork.com SOA +short
ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. 2024040605 3600 900 604800 300
dig @ns2.solvethenetwork.com solvethenetwork.com SOA +short
ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. 2024040605 3600 900 604800 300
Serials match — but now compare specific records that were part of the update:
dig @ns1.solvethenetwork.com app.solvethenetwork.com A +short
10.10.2.20
dig @ns2.solvethenetwork.com app.solvethenetwork.com A +short
10.10.1.80 # old value — partial transfer left this record unchanged
Matching serials with divergent record data is the definitive fingerprint of a partial zone transfer. Confirm by checking the secondary's BIND log for IXFR errors:
grep -i "ixfr\|xfer\|transfer\|failed" /var/log/named/named.log
06-Apr-2024 12:14:33.110 xfer-in: error: transfer of 'solvethenetwork.com/IN' from 10.10.1.10#53: IXFR failed: unexpected end of input
06-Apr-2024 12:14:33.112 xfer-in: info: transfer of 'solvethenetwork.com/IN' from 10.10.1.10#53: retrying AXFR
06-Apr-2024 12:14:33.891 xfer-in: error: transfer of 'solvethenetwork.com/IN' from 10.10.1.10#53: transfer failed: timed out
How to Fix It
Force the secondary to discard its current zone state and perform a fresh full AXFR. The safest method is to remove the zone journal file (which tracks IXFR history) and retrigger the transfer:
# On the affected secondary nameserver
systemctl stop named
rm /var/cache/bind/solvethenetwork.com.jnl
systemctl start named
# Then explicitly request retransfer
rndc retransfer solvethenetwork.com
Confirm a clean completion in the logs:
06-Apr-2024 14:30:01.221 xfer-in: info: transfer of 'solvethenetwork.com/IN' from 10.10.1.10#53: Transfer completed: 4 messages, 52 records, 2041 bytes, 0.004 secs
To reduce the frequency of partial IXFR failures for critical zones, add request-ixfr no; to the secondary's zone configuration to always use full AXFR. This increases transfer bandwidth but eliminates incremental transfer failures as a failure mode entirely. Alternatively, deploy TSIG for authenticated zone transfers — TSIG covers the entire transfer stream at the message level, providing integrity verification and making partial transfer corruption detectable immediately.
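On the secondary side, that option lives in the zone block. A sketch with placeholder paths and primary address, and assuming a TSIG key named xfer-key has already been defined:

```
// Secondary zone block forcing full AXFR transfers.
zone "solvethenetwork.com" {
    type secondary;
    primaries { 10.10.1.10 key "xfer-key"; };  // placeholder primary IP
    file "/var/cache/bind/solvethenetwork.com.db";
    request-ixfr no;   // always pull a full AXFR, never incremental
};
```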
Root Cause 6: Multiple Authoritative Servers Running Different Zone Versions
Why It Happens
Some organizations operate DNS in a stealth primary configuration where the authoritative servers listed in the parent zone's NS delegation are all secondaries, with the actual primary hidden from public view. Changes flow: operator edits primary → primary notifies all secondaries → secondaries transfer. If a deployment pipeline applies changes to some secondaries but not others — due to a partial Ansible run, a failed configuration management job, or a network partition — the visible authoritative servers diverge silently.
How to Identify It
Enumerate all NS records and query each in turn:
dig solvethenetwork.com NS +short
ns1.solvethenetwork.com.
ns2.solvethenetwork.com.
for ns in ns1.solvethenetwork.com ns2.solvethenetwork.com; do
printf "%-35s" "$ns:"
dig @$ns solvethenetwork.com A +short
done
ns1.solvethenetwork.com: 10.10.1.75
ns2.solvethenetwork.com: 10.10.1.50
How to Fix It
Re-apply the full zone update to all out-of-sync authoritative servers. If you use a configuration management system, ensure all DNS nodes are included in the target inventory and that the run completes successfully on every node. After applying changes, always run the serial comparison check across all listed NS records as part of your post-change verification before closing the change ticket.
Root Cause 7: OS and Browser DNS Cache
Why It Happens
End-user operating systems maintain a local DNS cache independent of any upstream resolver. On Linux, systemd-resolved or nscd caches records locally. On macOS, mDNSResponder handles caching. Web browsers implement their own separate DNS cache on top of the OS cache — Chromium-based browsers cache positive responses for up to 60 seconds by default, and this is not controlled by the record's TTL.
How to Identify It
If a specific user reports stale resolution but querying the same upstream resolver from a different machine returns the correct answer, the problem is local to that user's machine:
# Check systemd-resolved cache statistics
resolvectl statistics
Current Transactions: 0
Total Transactions: 5112
Current Cache Size: 94
Cache Hits: 4803
Cache Misses: 309
How to Fix It
Flush the local OS cache and the browser cache:
# Linux — systemd-resolved
resolvectl flush-caches
# Linux — nscd
nscd -i hosts
# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
For browser caches, navigate to the internal DNS cache management page and flush it directly:
# Chromium-based browsers
chrome://net-internals/#dns
# Firefox
about:networking#dns
Prevention
Most DNS propagation issues are preventable with disciplined operational processes. The following practices eliminate the majority of propagation-related incidents:
- Pre-lower TTLs before every record change. Reduce the target record's TTL to 300 seconds or less at least one full TTL cycle before the planned change. Never change the record value and TTL simultaneously in the same edit.
- Keep the negative TTL low. Set the SOA MINIMUM field to 60–300 seconds for all zones where new records may be added. A high negative TTL (e.g., 86400) causes newly created records to be invisible from previously-queried resolvers for an entire day.
- Alert on SOA serial mismatches. Implement monitoring that queries all authoritative nameservers for each zone's SOA serial every 5 minutes and alerts if any server lags behind the primary by more than one serial increment. This catches failed transfers before they affect users.
- Authenticate zone transfers with TSIG. Deploy TSIG keys between primary and secondary servers. TSIG authenticates every DNS message in a transfer sequence, making partial or corrupted transfers detectable immediately.
- Verify across all authoritative servers after every change. Before closing any DNS change, query every NS record listed for the zone and confirm the expected value and serial are returned by each server.
- Use a DNS change runbook. Standardize a five-step procedure: lower TTL → wait one TTL cycle → make the record change → verify across all authoritatives and multiple resolvers → restore TTL.
- Audit parent zone NS delegation regularly. Decommissioned nameservers left in parent-zone NS delegation will serve stale or empty responses to resolvers that happen to query them. Audit and clean up NS records whenever a nameserver is retired.
- Document rollback values. Before making any DNS change, record the current record value and TTL. If rollback is needed, time-to-restore matters — have the previous zone file state committed to version control and ready to re-apply immediately.
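The serial-mismatch check described above can be sketched as a small script. The zone and nameserver names are examples; the helper names are not standard tools:

```shell
#!/bin/sh
# Compare SOA serials across all authoritative servers for a zone and
# flag any mismatch.

# Return 0 if every serial in the argument list is identical, 1 otherwise.
serials_match() {
  first=$1
  for s in "$@"; do
    [ "$s" = "$first" ] || return 1
  done
  return 0
}

check_zone() {
  zone=$1; shift
  serials=""
  for ns in "$@"; do
    # The SOA serial is the third field of the +short SOA answer
    serial=$(dig "@$ns" "$zone" SOA +short | awk '{print $3}')
    echo "$ns: $serial"
    serials="$serials $serial"
  done
  if serials_match $serials; then
    echo "OK: all serials match"
  else
    echo "ALERT: serial mismatch on $zone"
  fi
}

# Example invocation (requires network access):
# check_zone solvethenetwork.com ns1.solvethenetwork.com ns2.solvethenetwork.com
```

Wire the ALERT branch into your monitoring system to catch failed transfers before they affect users.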
Frequently Asked Questions
Q: What is DNS propagation and why does it take time?
A: DNS propagation is the process by which a change made on an authoritative nameserver becomes visible to all recursive resolvers worldwide. It takes time because resolvers cache records for the duration of the record's TTL. Until that TTL expires and the resolver re-queries the authoritative server, it continues serving the cached answer. There is no broadcast mechanism in DNS — each resolver independently decides when to refresh its cache.
Q: How do I check whether a DNS change has propagated to a specific resolver?
A: Query that resolver directly using dig's @server syntax. For example: dig @8.8.8.8 solvethenetwork.com A +short. Compare the result to your authoritative server: dig @ns1.solvethenetwork.com solvethenetwork.com A +norecurse +short. If the answers differ, the resolver is still serving cached data.
Q: What TTL value should I normally set on A records?
A: For stable infrastructure records that rarely change, 3600–86400 seconds is reasonable. For records on hosts that are subject to migrations or failover, 300–600 seconds is more appropriate. As a rule: the higher the TTL, the longer propagation takes when a change is needed. Match your TTL to your operational risk tolerance for change propagation latency.
Q: Can I force external resolvers to clear their cache for my domain?
A: For most public resolvers, no — you cannot directly flush their caches on demand. Google Public DNS and Cloudflare's 1.1.1.1 each offer a public web-based cache flush tool for individual names. For ISP resolvers, you can call the NOC and request a manual flush. In all cases, the most reliable solution is to have a low TTL in place before making the change so the propagation window is short by design.
Q: What is the difference between AXFR and IXFR, and which should I use?
A: AXFR (full zone transfer) transfers the complete zone file. IXFR (incremental zone transfer) transfers only the diff between two serial numbers. IXFR is more bandwidth-efficient for large zones but is more complex and can fail in ways that leave the secondary in a partially-updated state. For small-to-medium zones or for critical zones where correctness is paramount, AXFR is the safer choice. You can force AXFR-only behavior on a secondary by setting request-ixfr no; in the zone block in named.conf.
Q: How does negative caching affect newly created DNS records?
A: If a resolver queried for a hostname before that hostname's record was created and received an NXDOMAIN response, the resolver caches that negative result for the duration of the zone's negative TTL (SOA MINIMUM field). Even after the record is created, that resolver will return NXDOMAIN until the negative cache entry expires. This is a common cause of confusion when provisioning new services — the record exists on the authoritative server but is invisible from resolvers that pre-cached the NXDOMAIN.
Q: Why do some users see the new record immediately after a change while others do not?
A: Different users use different resolvers. A user querying a resolver that hadn't previously cached the record will get the new answer immediately. A user querying a resolver that cached the old answer under a high TTL will continue seeing the old answer until that TTL expires. The geographic and ISP diversity of resolvers in use by your user base directly determines the spread of propagation lag you observe.
Q: How do I verify that my BIND secondary performed a complete and correct zone transfer?
A: Check three things: (1) compare the SOA serial on the primary and secondary using dig @primary solvethenetwork.com SOA +short and dig @secondary solvethenetwork.com SOA +short — they should match; (2) query several recently-changed records on both servers and compare the answers; (3) review the BIND transfer log at /var/log/named/named.log for the most recent xfer-in entry and confirm it completed without errors and with a reasonable record count.
Q: What does it mean when my SOA serials match but record values differ between primary and secondary?
A: This is the signature of a partial zone transfer. The secondary received enough of the transfer to update its serial to match the primary's, but the transfer was interrupted before all record changes were applied. The secondary's zone data is now internally inconsistent. The fix is to delete the zone journal file on the secondary, restart BIND, and allow a clean full AXFR to complete.
Q: How can I test DNS propagation across multiple global resolvers at once?
A: You can script a check across multiple known public resolvers using a loop:
for resolver in 8.8.8.8 1.1.1.1 9.9.9.9 208.67.222.222; do
printf "%-18s" "$resolver:"
dig @$resolver solvethenetwork.com A +short
done
This gives you a snapshot of resolver agreement at a point in time and lets you identify which resolvers are still serving stale data.
Q: Is there any way to speed up propagation after a mistake has already been made with a high TTL?
A: Once a high-TTL record is cached by a resolver, you cannot force that resolver to expire it early. Your practical options are: (1) wait for the TTL to naturally expire; (2) run the old and new services in parallel so traffic going to either destination is handled correctly; (3) instruct users to switch to a public resolver like 8.8.8.8 which may have a fresher cache or will at least refresh as soon as the current TTL expires; (4) lower the TTL immediately so that the next re-fetch cycle (whenever it occurs) will result in a short new cache duration. Option 4 shortens the propagation tail even if it does not fix the current cache cycle.
Q: How does TSIG help prevent zone transfer issues?
A: TSIG (Transaction Signature, RFC 2845) adds a cryptographic MAC to each DNS message in a zone transfer sequence. This serves two purposes: authentication (the secondary can verify the transfer is coming from an authorized source) and integrity verification (any corruption or truncation of the transfer stream is immediately detectable). Without TSIG, a partial transfer that happens to terminate cleanly at a message boundary may not generate any log error, leaving the secondary silently inconsistent. With TSIG, any message that fails MAC verification causes the transfer to be rejected and logged, triggering a retry.
