Symptoms
DNS resolution failures surface across every layer of your infrastructure simultaneously. When DNS breaks, applications stall mid-connection, SSH sessions hang for 30 seconds on hostname lookups before timing out, monitoring systems start firing cascade alerts, and users report that "the internet is down" when the network itself is perfectly healthy. Recognizing the symptom pattern before running any diagnostic commands saves significant time.
Common symptoms include:
- Name resolution timeout: ping sw-infrarunbook-01.solvethenetwork.com hangs for several seconds, then returns ping: sw-infrarunbook-01.solvethenetwork.com: Name or service not known
- SERVFAIL responses: dig returns status: SERVFAIL even for names that definitively exist in the zone
- NXDOMAIN on existing records: Records that exist in the zone return NXDOMAIN, usually because the client is querying a nameserver that does not know the internal zone, or because a resolver is still serving a cached negative answer
- REFUSED responses: The resolver explicitly rejects queries, often because recursion is disabled or an ACL excludes the client subnet
- Partial resolution failures: Internal hostnames fail while external names resolve correctly (or vice versa), pointing to a split-horizon, forwarder, or zone loading issue
- Intermittent failures: Some queries succeed while others fail for the same name, which is common when a round-robin resolver pool has one unhealthy member
- Application-level errors: ERR_NAME_NOT_RESOLVED in browsers, getaddrinfo ENOTFOUND in Node.js logs, java.net.UnknownHostException in JVM applications
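The symptom-to-cause mapping above can be scripted as a first triage step. The sketch below is a minimal, hypothetical helper (dig_status and likely_causes are illustrative names, not standard tools). It assumes the header format dig prints in the examples throughout this runbook and only parses text, so it needs nothing beyond bash, grep, and sed.

```bash
#!/usr/bin/env bash
# Hypothetical triage helpers: extract the response code from dig output
# (read on stdin) and map it to the root-cause sections of this runbook.

dig_status() {
  local text
  text=$(cat)
  # A timeout prints no header line at all, so report it explicitly.
  if grep -q 'no servers could be reached' <<< "$text"; then
    echo "TIMEOUT"
    return 0
  fi
  # Header example: ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23017
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p' <<< "$text" | head -n 1
}

likely_causes() {
  case "$1" in
    TIMEOUT)  echo "resolver down or port 53 blocked (Root Causes 1, 3)" ;;
    REFUSED)  echo "recursion disabled or ACL excludes client (Root Cause 4)" ;;
    SERVFAIL) echo "broken zone, DNSSEC failure, or bad forwarder (Root Causes 5, 7, 8)" ;;
    NXDOMAIN) echo "wrong nameserver or stale cache (Root Causes 2, 6)" ;;
    NOERROR)  echo "DNS answered; check the client-side resolver path" ;;
    *)        echo "unrecognized dig output" ;;
  esac
}
```

Typical use: likely_causes "$(dig @192.168.1.10 solvethenetwork.com A | dig_status)". The mapping is a starting point for which section to read first, not a diagnosis.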
Root Cause 1: Resolver Down
Why It Happens
The resolver — whether a BIND instance, Unbound, dnsmasq, or a network appliance — can crash, become unreachable, or exhaust system resources. Common triggers include OOM kills when cache memory grows unbounded, a runaway query flood that spikes CPU and causes the process watchdog to kill it, a corrupt configuration that causes the daemon to exit on reload, or a network change that isolates the resolver host from its clients. This is the first thing to rule out because it is the most complete failure mode: nothing resolves.
How to Identify It
From a client host, run a direct query against the configured resolver and observe the response:
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 solvethenetwork.com A
; <<>> DiG 9.18.12 <<>> @192.168.1.10 solvethenetwork.com A
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
A timeout with no servers could be reached is the clearest possible sign that the resolver is not responding. Confirm whether the host is reachable at all before logging into it:
infrarunbook-admin@sw-infrarunbook-01:~$ ping -c 3 192.168.1.10
PING 192.168.1.10 (192.168.1.10) 56(84) bytes of data.
64 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.412 ms
64 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.388 ms
64 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.401 ms
The host responds to ICMP but DNS is not answering. Log into the resolver and check the service state:
infrarunbook-admin@sw-infrarunbook-01:~$ systemctl status named
● named.service - BIND Domain Name Server
Loaded: loaded (/lib/systemd/system/named.service; enabled)
Active: failed (Result: exit-code) since Fri 2026-04-04 09:12:44 UTC; 18min ago
Process: 2341 ExecStart=/usr/sbin/named -f -u bind (code=exited, status=1/FAILURE)
Main PID: 2341 (code=exited, status=1/FAILURE)
Apr 04 09:12:44 sw-infrarunbook-01 named[2341]: loading configuration from '/etc/bind/named.conf'
Apr 04 09:12:44 sw-infrarunbook-01 named[2341]: /etc/bind/named.conf.options:12: unknown option 'forwarders-policy'
Apr 04 09:12:44 sw-infrarunbook-01 named[2341]: loading configuration: unexpected token
If the process was killed by the OOM killer rather than a config error, check kernel logs:
infrarunbook-admin@sw-infrarunbook-01:~$ journalctl -k | grep -i "oom\|killed"
Apr 04 08:55:10 sw-infrarunbook-01 kernel: Out of memory: Killed process 2219 (named) score 312 total-vm:1048576kB
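The two failure signatures above (a configuration error logged by named at startup, and a kernel OOM kill) can be told apart mechanically. The hypothetical named_failure_reason helper below is a sketch that assumes the log wording shown in these examples; other kernel or BIND versions may phrase the messages differently, so adjust the patterns if it reports "unknown" on a real failure.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify why named stopped, from log text on stdin.
# Feed it kernel and unit logs together, for example:
#   { journalctl -k; journalctl -u named; } | named_failure_reason
# The patterns assume the message wording shown in this runbook.

named_failure_reason() {
  local text
  text=$(cat)
  if grep -qE 'Killed process [0-9]+ \(named\)' <<< "$text"; then
    echo "oom"        # kernel OOM killer terminated named
  elif grep -qE 'named\[[0-9]+]: .*(unknown option|unexpected token)' <<< "$text"; then
    echo "config"     # named refused to start on a bad configuration
  else
    echo "unknown"
  fi
}
```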
How to Fix It
Address the configuration error, validate it, then restart the service:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkconf /etc/bind/named.conf
/etc/bind/named.conf.options:12: unknown option 'forwarders-policy'
# Remove or correct the invalid directive on line 12 of named.conf.options
infrarunbook-admin@sw-infrarunbook-01:~$ sudo nano /etc/bind/named.conf.options
infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkconf /etc/bind/named.conf
# No output means clean
infrarunbook-admin@sw-infrarunbook-01:~$ sudo systemctl start named
infrarunbook-admin@sw-infrarunbook-01:~$ systemctl status named
● named.service - BIND Domain Name Server
Active: active (running) since Fri 2026-04-04 09:31:02 UTC; 3s ago
If the OOM killer is the culprit, reduce the resolver cache size in named.conf.options:
options {
max-cache-size 256m;
max-cache-ttl 3600;
};
Root Cause 2: Wrong Nameserver Configured
Why It Happens
A client pointed at the wrong nameserver will receive incorrect or empty answers. This arises from stale DHCP leases pushing an old resolver IP, a manually misconfigured /etc/resolv.conf that survived a rebuild, a broken systemd-resolved stub listener, or a cloud metadata service returning resolver addresses from a different VPC or subnet. When the queried server has no knowledge of the internal zone, it answers NXDOMAIN, either on its own or after forwarding the query to upstream public resolvers that know nothing about your private namespace.
How to Identify It
Check what nameserver the system is actually using at this moment:
infrarunbook-admin@sw-infrarunbook-01:~$ cat /etc/resolv.conf
# This file is managed by man:systemd-resolved(8).
nameserver 172.16.0.254
search solvethenetwork.com
infrarunbook-admin@sw-infrarunbook-01:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 172.16.0.254
DNS Servers: 172.16.0.254
Now query that server directly and compare the result against the known authoritative resolver:
infrarunbook-admin@sw-infrarunbook-01:~$ dig @172.16.0.254 sw-infrarunbook-01.solvethenetwork.com A
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 44312
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 sw-infrarunbook-01.solvethenetwork.com A
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19872
;; ANSWER SECTION:
sw-infrarunbook-01.solvethenetwork.com. 300 IN A 192.168.1.50
The authoritative resolver at 192.168.1.10 returns the correct answer. The wrong server at 172.16.0.254 returns NXDOMAIN because it has no knowledge of the internal zone.
How to Fix It
On systemd-networkd managed hosts, update the DNS setting in the network unit file:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo nano /etc/systemd/network/10-eth0.network
[Network]
DNS=192.168.1.10
DNS=192.168.1.11
infrarunbook-admin@sw-infrarunbook-01:~$ sudo systemctl restart systemd-networkd
infrarunbook-admin@sw-infrarunbook-01:~$ resolvectl status
Current DNS Server: 192.168.1.10
DNS Servers: 192.168.1.10 192.168.1.11
If /etc/resolv.conf is managed manually and needs direct correction, first clear the immutable attribute (in case it was locked with chattr +i), then rewrite the file:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo chattr -i /etc/resolv.conf
infrarunbook-admin@sw-infrarunbook-01:~$ sudo tee /etc/resolv.conf <<'EOF'
nameserver 192.168.1.10
nameserver 192.168.1.11
search solvethenetwork.com
EOF
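The check above can be folded into a provisioning or post-change test. This is a minimal sketch with hypothetical function names; it assumes the expected primary resolver is 192.168.1.10, as in this runbook's examples. On hosts where systemd-resolved runs in stub mode, /etc/resolv.conf lists only 127.0.0.53, so inspect resolvectl status there instead.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: read nameserver entries from a resolv.conf-style
# file and confirm the first one is the resolver you expect.

resolv_nameservers() {
  local file=${1:-/etc/resolv.conf}
  # Print the address following each "nameserver" keyword; comments and
  # other directives (search, options) are ignored.
  awk '$1 == "nameserver" { print $2 }' "$file"
}

primary_resolver_ok() {
  # Usage: primary_resolver_ok EXPECTED_IP [FILE]
  local expected=$1 file=${2:-/etc/resolv.conf}
  local first
  first=$(resolv_nameservers "$file" | head -n 1)
  if [ "$first" = "$expected" ]; then
    echo "ok: first nameserver is $first"
  else
    echo "mismatch: first nameserver is ${first:-<none>}, expected $expected"
    return 1
  fi
}
```

Run as primary_resolver_ok 192.168.1.10; a non-zero exit means the host is still pointed somewhere else.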
Root Cause 3: Firewall Blocking UDP/TCP Port 53
Why It Happens
DNS traffic runs on port 53. UDP carries standard queries; TCP carries zone transfers and any response too large for UDP (more than 512 bytes, or more than the EDNS0 buffer size the client advertises), which the server signals by setting the TC (truncated) flag so the client retries over TCP. Overly restrictive firewall rules, whether applied on the resolver host itself, on an intermediate network appliance, or in a cloud security group, can silently drop DNS packets in one or both directions. A particularly common scenario is a security hardening script that blocks all UDP by default and only allows specific application UDP ports, forgetting to include 53. Another is a stateful firewall that permits outbound queries but blocks the return UDP datagrams because they arrive on unexpected ports or from unexpected source IPs.
How to Identify It
Test both UDP and TCP connectivity to port 53 directly, bypassing the resolver library entirely:
infrarunbook-admin@sw-infrarunbook-01:~$ nc -zuv 192.168.1.10 53
Connection to 192.168.1.10 53 port [udp/domain] succeeded!
infrarunbook-admin@sw-infrarunbook-01:~$ nc -zv 192.168.1.10 53
nc: connect to 192.168.1.10 port 53 (tcp) failed: Connection timed out
UDP appears to succeed but TCP is blocked. Treat the UDP result as a hint rather than proof: UDP is connectionless, so nc -zu reports success unless an ICMP port-unreachable comes back. A TCP block causes silent failures for large DNS responses (DNSSEC, long TXT records, large ANY responses). Confirm with dig over TCP:
infrarunbook-admin@sw-infrarunbook-01:~$ dig +tcp @192.168.1.10 solvethenetwork.com ANY
;; connection timed out; no servers could be reached
On the resolver host, inspect the active firewall rules:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo iptables -L INPUT -n -v
Chain INPUT (policy DROP)
pkts bytes target prot opt in out source destination
0 0 ACCEPT udp -- * * 0.0.0.0/0 192.168.1.10 udp dpt:53
0 0 DROP tcp -- * * 0.0.0.0/0 192.168.1.10 tcp dpt:53
The DROP rule on TCP port 53 is explicit. If using nftables instead:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo nft list ruleset | grep -B2 -A2 "port 53"
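On hosts with many rules, scanning the listing by eye is error-prone. The hypothetical dns_block_rules helper below reads iptables -L INPUT -n -v output and prints the protocol of every explicit DROP or REJECT rule for destination port 53. It is a text-parsing sketch that assumes the column layout shown above (target in the third column, protocol in the fourth). It does not evaluate rule order or the chain's default policy, so a policy DROP with no matching ACCEPT still blocks traffic that it will not report.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list protocols with an explicit DROP/REJECT on port 53.
# Usage: sudo iptables -L INPUT -n -v | dns_block_rules
# Assumes the column layout of `iptables -L -n -v`: pkts bytes target prot ...

dns_block_rules() {
  awk '($3 == "DROP" || $3 == "REJECT") && ($0 ~ /dpt:53$/ || $0 ~ /dpt:53 /) { print $4 }'
}
```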
How to Fix It
Insert rules to permit both UDP and TCP on port 53 and persist them:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo iptables -I INPUT -p tcp --dport 53 -j ACCEPT
infrarunbook-admin@sw-infrarunbook-01:~$ sudo iptables -I INPUT -p udp --dport 53 -j ACCEPT
# Persist across reboots:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo netfilter-persistent save
run-parts: executing /usr/share/netfilter-persistent/plugins.d/15-ip4tables save
run-parts: executing /usr/share/netfilter-persistent/plugins.d/25-ip6tables save
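For nftables environments, use insert rather than add: nft add appends to the end of the chain, where an earlier drop rule would still win, while nft insert places the accept rules first. Runtime nft rules do not survive a reboot, so also add them to your persistent ruleset (typically /etc/nftables.conf on Debian-based systems):
infrarunbook-admin@sw-infrarunbook-01:~$ sudo nft insert rule inet filter input ip daddr 192.168.1.10 tcp dport 53 accept
infrarunbook-admin@sw-infrarunbook-01:~$ sudo nft insert rule inet filter input ip daddr 192.168.1.10 udp dport 53 accept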
Verify the fix resolves both protocols:
infrarunbook-admin@sw-infrarunbook-01:~$ dig +tcp @192.168.1.10 solvethenetwork.com SOA
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6712
;; ANSWER SECTION:
solvethenetwork.com. 86400 IN SOA ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. 2026040401 3600 900 604800 300
Root Cause 4: Recursion Disabled
Why It Happens
BIND distinguishes between authoritative queries (the server holds the zone and answers from local data) and recursive queries (the server looks up external names on behalf of the client, following the delegation chain from the root). When recursion no; is set globally, a common hardening step for authoritative-only nameservers, any client that needs the resolver to look up external names will receive an explicit REFUSED. The same result occurs when recursion is enabled globally but the allow-recursion ACL does not include the querying client's subnet. This failure mode is especially disruptive because the resolver is healthy and the zone is loaded correctly; only recursive queries fail.
How to Identify It
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 cloudflare.com A
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 55023
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
The message WARNING: recursion requested but not available is the definitive indicator. Internal zone names may still resolve if the server is authoritative for them, but all external lookups will fail. Confirm by checking the BIND configuration:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo grep -n "recursion\|allow-recursion" /etc/bind/named.conf.options
8: recursion no;
9: allow-recursion { none; };
Or, if recursion is enabled but the ACL is too narrow:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo grep -n "allow-recursion" /etc/bind/named.conf.options
8: allow-recursion { 10.10.10.0/24; };
# A client on 192.168.1.0/24 is not in this ACL and will receive REFUSED
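To see why a client on 192.168.1.0/24 is refused by an ACL of 10.10.10.0/24, the prefix match BIND performs on address-match lists can be reproduced for IPv4. This hypothetical sketch handles plain addresses, CIDR prefixes, and the any and none keywords only. It ignores named ACLs such as localnets, negated entries, and IPv6, and it treats every entry as an allow.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: IPv4 prefix matching in the style of an
# allow-recursion list. Ignores negation, named ACLs, and IPv6.

ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo "$(( (a << 24) | (b << 16) | (c << 8) | d ))"
}

ip_in_cidr() {
  local ip=$1 net=${2%/*} bits=${2#*/}
  # A bare address such as 127.0.0.1 is treated as a /32.
  [ "$bits" = "$2" ] && bits=32
  local mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$ip") & mask) == ($(ip_to_int "$net") & mask) ))
}

recursion_allowed() {
  # Usage: recursion_allowed CLIENT_IP ACL_ENTRY [ACL_ENTRY ...]
  local ip=$1 entry
  shift
  for entry in "$@"; do
    case "$entry" in
      any)  return 0 ;;
      none) continue ;;
    esac
    ip_in_cidr "$ip" "$entry" && return 0
  done
  return 1
}
```

With the narrow ACL above, recursion_allowed 192.168.1.50 10.10.10.0/24 returns non-zero, which matches the REFUSED the client sees.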
How to Fix It
Edit /etc/bind/named.conf.options to enable recursion and specify the authorized client subnets:
options {
directory "/var/cache/bind";
recursion yes;
allow-recursion {
192.168.0.0/16;
10.0.0.0/8;
172.16.0.0/12;
127.0.0.1;
};
forwarders {
8.8.8.8;
1.1.1.1;
};
forward only;
};
Validate the configuration and reload without restarting:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkconf
# No output = clean
infrarunbook-admin@sw-infrarunbook-01:~$ sudo rndc reload
server reload successful
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 cloudflare.com A
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31881
;; ANSWER SECTION:
cloudflare.com. 299 IN A 104.16.132.229
Security note: Never set allow-recursion { any; }; on a resolver with a public IP. Open resolvers are abused for DNS amplification DDoS attacks. Always restrict recursion to known RFC 1918 ranges or trusted client prefixes.
Root Cause 5: Broken Zone File
Why It Happens
Zone file syntax errors are one of the most common causes of resolution failure for internal names. A missing trailing dot on a fully qualified domain name, a syntax error introduced during a manual edit (such as a missing parenthesis in the SOA record or a malformed field), a record type used incorrectly, or a corrupted dynamic DNS journal file can all prevent BIND from loading the zone. (A serial number that was not incremented does not stop the zone from loading; it stops secondaries from picking up the change.) When a zone fails to load, BIND returns SERVFAIL rather than NXDOMAIN for every name in that zone, because the server knows it should be authoritative but cannot serve the data.
How to Identify It
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 sw-infrarunbook-01.solvethenetwork.com A
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23017
infrarunbook-admin@sw-infrarunbook-01:~$ sudo journalctl -u named --since "10 min ago"
Apr 04 09:45:12 sw-infrarunbook-01 named[3104]: zone solvethenetwork.com/IN: loading from master file /etc/bind/zones/db.solvethenetwork.com failed: not at top of zone
Apr 04 09:45:12 sw-infrarunbook-01 named[3104]: zone solvethenetwork.com/IN: not loaded due to errors.
Use named-checkzone for a detailed report with line numbers:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkzone solvethenetwork.com /etc/bind/zones/db.solvethenetwork.com
/etc/bind/zones/db.solvethenetwork.com:22: solvethenetwork.com: not at top of zone
zone solvethenetwork.com/IN: not loaded due to errors.
Inspect the zone file at and around line 22:
infrarunbook-admin@sw-infrarunbook-01:~$ sed -n '18,25p' /etc/bind/zones/db.solvethenetwork.com
$ORIGIN solvethenetwork.com.
$TTL 300
@ IN SOA ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. (
2026040401 ; serial
3600 ; refresh
900 ; retry
604800 ; expire
300 ) ; minimum TTL
@ IN NS ns1.solvethenetwork.com.
; Missing trailing dot on the CNAME target:
www IN CNAME solvethenetwork.com
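The CNAME target solvethenetwork.com has no trailing dot, so BIND treats it as a relative name and appends the origin, expanding it to solvethenetwork.com.solvethenetwork.com. instead of the zone apex. On its own, a relative CNAME target usually still loads; it just points the alias at a name that does not exist. The not at top of zone error itself is raised when the SOA record ends up on a name below the zone apex, most often because a $ORIGIN directive is missing its own trailing dot. Check every FQDN in the file for its trailing dot, including $ORIGIN lines.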
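Missing trailing dots are easy to overlook in review. The hypothetical find_relative_targets sketch below flags CNAME, NS, and MX targets and $ORIGIN values that contain a dot but do not end with one, the usual signature of an intended FQDN written as a relative name. It is a heuristic that assumes one record per line (SOA continuation lines are effectively skipped) and is meant to complement named-checkzone, not replace it.

```bash
#!/usr/bin/env bash
# Hypothetical lint: flag likely FQDNs written without a trailing dot.
# Usage: find_relative_targets < /etc/bind/zones/db.solvethenetwork.com
# Heuristic: a name that contains a dot but does not end in one was
# probably meant to be fully qualified. Prints "LINE: text".

find_relative_targets() {
  awk '
    { sub(/;.*/, "") }                     # strip comments
    NF == 0 { next }
    $1 == "$ORIGIN" {
      if ($2 ~ /\./ && $2 !~ /\.$/) print NR ": " $0
      next
    }
    {
      # If the line starts with whitespace, the owner name is omitted.
      start = ($0 ~ /^[ \t]/) ? 1 : 2
      for (i = start; i < NF; i++) {
        t = toupper($i)
        if (t == "CNAME" || t == "NS" || t == "MX") {
          if ($NF ~ /\./ && $NF !~ /\.$/) print NR ": " $0
          break
        }
      }
    }'
}
```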
How to Fix It
Add the trailing dot and increment the serial number:
; Before:
www IN CNAME solvethenetwork.com
; After:
www IN CNAME solvethenetwork.com.
; Also increment serial from 2026040401 to 2026040402
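The date-based serial convention used here (YYYYMMDDnn, as in 2026040401) is easy to get wrong by hand. This hypothetical next_serial sketch returns the next serial: it jumps to today's date with nn = 00 if that is newer, and otherwise increments the current value. It assumes the YYYYMMDDnn convention; after 99 changes in one day it keeps counting, which stays monotonic but runs into the next day's range.

```bash
#!/usr/bin/env bash
# Hypothetical helper: next YYYYMMDDnn serial.
# Usage: next_serial CURRENT [TODAY]; TODAY defaults to today's UTC date.

next_serial() {
  local current=$1
  local today=${2:-$(date -u +%Y%m%d)}
  local candidate=$(( 10#$today * 100 ))   # today's first serial, nn = 00
  if (( current >= candidate )); then
    echo $(( current + 1 ))                # already on or past today: bump nn
  else
    echo "$candidate"                      # first change of a new day
  fi
}
```

For the change above, next_serial 2026040401 20260404 prints 2026040402.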
Validate then reload the zone without a full named restart:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkzone solvethenetwork.com /etc/bind/zones/db.solvethenetwork.com
zone solvethenetwork.com/IN: loaded serial 2026040402
OK
infrarunbook-admin@sw-infrarunbook-01:~$ sudo rndc reload solvethenetwork.com
zone reload queued
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 sw-infrarunbook-01.solvethenetwork.com A
;; ANSWER SECTION:
sw-infrarunbook-01.solvethenetwork.com. 300 IN A 192.168.1.50
Root Cause 6: Stale Cache or Long Negative TTL
Why It Happens
DNS resolvers cache both positive responses (the record exists) and negative responses (NXDOMAIN: the record does not exist). If a record was deleted or changed but the old TTL has not expired, clients continue receiving the stale answer. Conversely, if a name was briefly missing (or was queried before it was created) and the NXDOMAIN was cached, clients keep receiving NXDOMAIN until the negative TTL expires, long after the issue is resolved. Per RFC 2308, that negative TTL is the lesser of the SOA record's own TTL and its minimum field.
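The RFC 2308 rule above can be checked against a live zone. This hypothetical negative_ttl sketch reads one SOA answer line in dig's format and prints the lesser of the SOA record's TTL and its minimum field. Query the authoritative server rather than a cache, because a cached SOA TTL counts down.

```bash
#!/usr/bin/env bash
# Hypothetical helper: negative-caching TTL per RFC 2308.
# Usage: dig +noall +answer @192.168.1.10 solvethenetwork.com SOA | negative_ttl
# Expects: owner TTL class SOA mname rname serial refresh retry expire minimum

negative_ttl() {
  awk '$4 == "SOA" {
    ttl = $2 + 0
    minimum = $NF + 0
    print (ttl < minimum ? ttl : minimum)
    exit
  }'
}
```

For the SOA record shown in Root Cause 3 (TTL 86400, minimum 300), it prints 300.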
How to Identify It
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 sw-infrarunbook-01.solvethenetwork.com A
;; ANSWER SECTION:
sw-infrarunbook-01.solvethenetwork.com. 287 IN A 192.168.1.50
# TTL of 287 (counting down from 300) means this is a cached response
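# Query the zone's authoritative server (ns1, the primary named in the SOA) directly,
# not the caching resolver; +norecurse asks it to answer from its own data only:
infrarunbook-admin@sw-infrarunbook-01:~$ dig @ns1.solvethenetwork.com sw-infrarunbook-01.solvethenetwork.com A +norecurse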
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 91204
The resolver has a positive cached answer but the authoritative server now returns NXDOMAIN, confirming a stale positive cache entry.
How to Fix It
# Flush a specific name from the BIND cache:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo rndc flushname sw-infrarunbook-01.solvethenetwork.com
# Flush all cached data at and below a name (rndc flush takes a view name, not a zone):
infrarunbook-admin@sw-infrarunbook-01:~$ sudo rndc flushtree solvethenetwork.com
# Nuclear option: flush the entire resolver cache:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo rndc flush
# For systemd-resolved clients:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo resolvectl flush-caches
Root Cause 7: DNSSEC Validation Failure
Why It Happens
When DNSSEC validation is enabled on a resolver, any break in the chain of trust causes the resolver to return SERVFAIL instead of the actual answer: a DS record in the parent zone that no longer matches any of the zone's DNSKEY records, or RRSIG signatures that have expired or were made with a key that is no longer published. This occurs after a KSK/ZSK rollover that was not properly coordinated with the parent zone, after zone signing configuration changes, or when the resolver host's system clock is skewed enough to fall outside a signature's validity window.
How to Identify It
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 solvethenetwork.com A +dnssec
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 9901
# Re-query with +cd (checking disabled) to bypass validation:
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 solvethenetwork.com A +dnssec +cd
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2213
;; ANSWER SECTION:
solvethenetwork.com. 300 IN A 192.168.1.100
The +cd flag bypasses DNSSEC validation and returns a result, confirming the data exists but validation is failing. Check the resolver logs for the specific validation error:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo journalctl -u named | grep -i "dnssec\|bogus\|validation"
Apr 04 10:02:31 sw-infrarunbook-01 named[3104]: validating solvethenetwork.com/DNSKEY: no valid signature found (DS)
How to Fix It
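For internal zones where DNSSEC is not operationally required, exempt that domain from validation. dnssec-validation is a server-wide setting (options or view) and is not accepted inside a zone statement; the per-domain exemption is validate-except in named.conf.options:
options {
validate-except { "solvethenetwork.com"; };
};
For a temporary exemption while the zone is being repaired, sudo rndc nta solvethenetwork.com installs a negative trust anchor that expires on its own (after one hour by default).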
For production zones, re-sign the zone and re-publish the DS record in the parent to restore the chain of trust. Check the system clock skew first:
infrarunbook-admin@sw-infrarunbook-01:~$ timedatectl status
Local time: Fri 2026-04-04 10:05:11 UTC
Universal time: Fri 2026-04-04 10:05:11 UTC
NTP service: active
NTP synchronized: yes
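The clock check above answers half the question; the other half is whether this host's clock falls inside the validity window of the signature that is failing. This hypothetical rrsig_window_check sketch only compares timestamps. Take the inception and expiration values from the RRSIG covering the failing record (for example in dig +dnssec +cd output, where the RRSIG presentation format lists the expiration before the inception). It assumes YYYYMMDDHHMMSS timestamps, which is what dig prints.

```bash
#!/usr/bin/env bash
# Hypothetical helper: is NOW inside an RRSIG validity window?
# Usage: rrsig_window_check INCEPTION EXPIRATION [NOW]
# All values are UTC timestamps in YYYYMMDDHHMMSS form. Fixed-width
# digit strings compare correctly as integers.

rrsig_window_check() {
  local inception=$1 expiration=$2
  local now=${3:-$(date -u +%Y%m%d%H%M%S)}
  if (( now < inception )); then
    echo "not yet valid: clock is behind the signature inception"
    return 1
  elif (( now > expiration )); then
    echo "expired: clock is past the signature expiration, or the zone was not re-signed"
    return 1
  fi
  echo "within validity window"
}
```

For example, rrsig_window_check 20260401000000 20260501000000 20260404100511 reports that 10:05:11 UTC on 4 April 2026 is within the window; these timestamps are illustrative, not taken from a real zone.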
Root Cause 8: Forwarder Misconfiguration
Why It Happens
Most internal resolvers are configured to forward queries for unknown zones to upstream servers. If the forwarder IP is wrong, the upstream is unreachable or returning errors, or the forward only; directive prevents fallback to root hints when the forwarder fails, all recursive lookups for external names will return SERVFAIL, even though the resolver daemon itself is running correctly and internal zones are loading fine.
How to Identify It
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 cloudflare.com A
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 77231
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 sw-infrarunbook-01.solvethenetwork.com A
;; ANSWER SECTION:
sw-infrarunbook-01.solvethenetwork.com. 300 IN A 192.168.1.50
Internal names resolve; external names return SERVFAIL. This asymmetry almost always points to a broken forwarder. Check the configuration:
infrarunbook-admin@sw-infrarunbook-01:~$ sudo grep -A5 "forwarders" /etc/bind/named.conf.options
forwarders {
172.16.99.99; # This host does not exist on the network
};
forward only;
infrarunbook-admin@sw-infrarunbook-01:~$ dig @172.16.99.99 cloudflare.com A
;; connection timed out; no servers could be reached
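With several forwarders configured, each one should be tested individually, as done above for 172.16.99.99. The sketch below uses hypothetical helpers: list_forwarders extracts the addresses from the first forwarders { } block of an options file, and probe_forwarders queries each with a short timeout. It does not handle port or tls modifiers inside the block, and it relies on dig exiting with status 9 when no reply arrives, so it detects unreachable forwarders rather than forwarders that answer with errors.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: list the addresses in the first forwarders { } block
# of a BIND options file, then check whether each one answers at all.
# Does not handle "port" or "tls" modifiers inside the block.

list_forwarders() {
  awk '
    /forwarders[ \t]*[{]/ { inblock = 1; sub(/.*forwarders[ \t]*[{]/, "") }
    inblock {
      line = $0
      sub(/\/\/.*|[#].*/, "", line)        # strip // and hash comments
      stop = index(line, "}")
      if (stop) line = substr(line, 1, stop - 1)
      n = split(line, parts, "[; \t]+")
      for (i = 1; i <= n; i++) if (parts[i] != "") print parts[i]
      if (stop) exit
    }'
}

probe_forwarders() {
  # Usage: probe_forwarders < /etc/bind/named.conf.options
  # A non-zero dig exit here means no reply, not an error response.
  local fw
  list_forwarders | while read -r fw; do
    if dig @"$fw" +time=2 +tries=1 cloudflare.com A > /dev/null 2>&1; then
      echo "$fw: answering"
    else
      echo "$fw: no reply"
    fi
  done
}
```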
How to Fix It
infrarunbook-admin@sw-infrarunbook-01:~$ sudo nano /etc/bind/named.conf.options
forwarders {
8.8.8.8;
8.8.4.4;
};
forward only;
infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkconf && sudo rndc reload
server reload successful
infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 cloudflare.com A
;; ANSWER SECTION:
cloudflare.com. 299 IN A 104.16.132.229
Prevention
The majority of DNS resolution failures are preventable through configuration discipline, pre-deployment validation, redundancy, and active monitoring. Adopt the following practices to avoid outages before they happen:
- Automate zone file validation in CI/CD. Run named-checkconf and named-checkzone as blocking gates in every pipeline that touches DNS configuration. A zone file error that passes peer review but fails named-checkzone should never reach production.
- Deploy a minimum of two resolvers per site. A single-resolver environment means any service restart, host reboot, or configuration reload failure causes a total DNS outage. Configure both IPs in DHCP scope options and in each host's /etc/resolv.conf or network unit file.
- Monitor resolver health with synthetic probes. Use the DNS prober in Prometheus's blackbox_exporter to query a known-good name every 30 seconds and alert on SERVFAIL, NXDOMAIN, or response times above 200 ms.
- Keep negative TTLs short. Set the SOA minimum field to 300–600 seconds. With a value of 3600, an NXDOMAIN cached during a misconfiguration can persist for up to an hour after the fix is applied.
- Restrict recursion to specific subnets. Always use allow-recursion with explicit RFC 1918 ranges. Never use allow-recursion { any; }; on any server reachable from outside your network perimeter.
- Version-control all zone files. Store zone files in git with a commit per change. Each commit message should include the change reason and the new serial number, enabling rapid rollback and providing an audit trail.
- Test both UDP and TCP port 53 after every firewall change. Zone transfers and any response larger than the client's EDNS0 buffer size (common with DNSSEC) require TCP. A post-change test of both protocols should be a mandatory step in every firewall change procedure.
- Synchronize resolver system clocks with NTP. DNSSEC signatures have validity windows. Alert if clock drift on resolver hosts exceeds five seconds, and never let resolvers run without an NTP source.
- Use rndc zonestatus as a post-deploy check. After any zone reload, sudo rndc zonestatus solvethenetwork.com immediately confirms the loaded serial and zone state without log scraping.
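The CI/CD gate from the first practice above can be a few lines of shell. The sketch below is hypothetical: dns_ci_gate, the CHECKCONF/CHECKZONE overrides, and the zone:file argument format are illustrative choices, not a standard interface. It runs named-checkconf on the main configuration and named-checkzone -q on each listed zone, and returns non-zero if any check fails so the pipeline stops.

```bash
#!/usr/bin/env bash
# Hypothetical CI gate: fail if named.conf or any listed zone file does not
# validate. CHECKCONF and CHECKZONE can be overridden (for example in tests).

CHECKCONF=${CHECKCONF:-named-checkconf}
CHECKZONE=${CHECKZONE:-named-checkzone}

dns_ci_gate() {
  # Usage: dns_ci_gate NAMED_CONF ZONE:FILE [ZONE:FILE ...]
  local conf=$1
  shift
  local failed=0 pair zone file
  if ! "$CHECKCONF" "$conf"; then
    echo "FAIL: $conf" >&2
    failed=1
  fi
  for pair in "$@"; do
    zone=${pair%%:*}
    file=${pair#*:}
    # -q: quiet mode, report only through the exit status
    if ! "$CHECKZONE" -q "$zone" "$file"; then
      echo "FAIL: $zone ($file)" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

In a pipeline step: dns_ci_gate /etc/bind/named.conf solvethenetwork.com:/etc/bind/zones/db.solvethenetwork.com. It checks every input before failing, so one run reports all broken files rather than only the first.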
Frequently Asked Questions
Q: What is the difference between SERVFAIL and NXDOMAIN?
A: NXDOMAIN (Non-Existent Domain) means the authoritative server was successfully consulted and confirmed that the name does not exist in the zone. SERVFAIL means the resolver encountered an error trying to answer — the zone may exist, but the resolver could not retrieve or validate the data. SERVFAIL always warrants infrastructure investigation. NXDOMAIN usually means either the record is genuinely absent or the resolver is querying the wrong nameserver for the zone.
Q: Why does dig return the correct answer but ping or curl fails to resolve the same name?
A: dig queries the nameserver directly and bypasses the system resolver library entirely. Applications like ping and curl use getaddrinfo(), which reads /etc/nsswitch.conf and may route queries through /etc/hosts, mDNS, or systemd-resolved's stub listener at 127.0.0.53 before reaching the DNS server configured in /etc/resolv.conf. Check the hosts: line in /etc/nsswitch.conf and run resolvectl status to see the effective resolver used by the stub.
Q: How do I flush the DNS cache on a Linux host without restarting the resolver?
A: On systemd-resolved clients: sudo resolvectl flush-caches. On systems running nscd: sudo systemctl restart nscd. On the BIND resolver itself: sudo rndc flush clears the entire cache, sudo rndc flushtree solvethenetwork.com clears cached data at and below that name, and sudo rndc flushname sw-infrarunbook-01.solvethenetwork.com clears a single name.
Q: Why does an authoritative nameserver return REFUSED for some queries?
A: Authoritative-only nameservers run with recursion no; by design. They answer queries only for zones they are configured to host. Queries for any other name, including names on the internet, return REFUSED. Make sure clients that need general recursive resolution are pointed at a recursive resolver, not directly at an authoritative nameserver.
Q: How can I confirm whether DNS traffic is going over UDP or TCP?
A: Force TCP in dig with the +tcp flag: dig +tcp @192.168.1.10 solvethenetwork.com. Without it, dig uses UDP and switches to TCP only if the UDP answer comes back truncated. To capture both protocols in real time on the resolver, run sudo tcpdump -i eth0 -nn port 53: UDP queries appear as single datagrams, while TCP queries establish a three-way handshake before the DNS payload.
Q: What does a missing trailing dot in a zone file actually cause?
A: Without a trailing dot, BIND treats the name as relative and appends the current $ORIGIN. In a zone file with $ORIGIN solvethenetwork.com., a CNAME target written as solvethenetwork.com (no dot) expands to solvethenetwork.com.solvethenetwork.com., a completely different and almost certainly non-existent name. Always use trailing dots on FQDNs inside zone file records.
Q: How do I verify that my resolver is not an open resolver?
A: Query your resolver from a host outside your network for a name it is not authoritative for: dig @<resolver-public-IP> cloudflare.com A. If it returns a valid answer, your resolver is accepting recursive queries from the internet and is an open resolver. Immediately restrict allow-recursion in named.conf.options to RFC 1918 ranges and reload BIND.
Q: Secondary nameservers are not picking up zone changes. Where should I look?
A: Start with the serial number: secondaries will not initiate a transfer if the primary's serial is equal to or lower than what they already hold. Then verify that the TSIG key configuration matches on both primary and secondary, that allow-transfer on the primary includes the secondary's IP, and that firewall rules permit TCP port 53 between primary and secondary. Run dig @192.168.1.11 solvethenetwork.com SOA against the secondary to compare its loaded serial with the primary's.
Q: Why do SSH connections hang for 30 seconds before succeeding or failing?
A: When UseDNS yes is set in sshd_config, SSH performs a reverse DNS (PTR) lookup on the connecting client's IP address. If the resolver is unreachable or the nameservers for the reverse zone do not respond, sshd waits for the lookup to time out before continuing. Set UseDNS no in /etc/ssh/sshd_config on servers where reverse DNS is unreliable, or add PTR records for the client subnets your users connect from.
Q: How do I test DNS from inside a container that has no dig or nslookup installed?
A: Use getent hosts sw-infrarunbook-01.solvethenetwork.com. It resolves the name through the C library's NSS path, the same one applications use, so it honors /etc/nsswitch.conf and /etc/resolv.conf inside the container (getent ahosts goes through getaddrinfo() specifically). You can also run cat /etc/resolv.conf to see which resolver the container is using, and curl --resolve sw-infrarunbook-01.solvethenetwork.com:80:192.168.1.50 http://sw-infrarunbook-01.solvethenetwork.com/ to bypass DNS entirely for connectivity tests.
Q: What is the fastest way to confirm BIND loaded a zone update correctly?
A: Run sudo rndc zonestatus solvethenetwork.com immediately after the reload. It reports the loaded serial, zone type (primary/secondary), file path, and whether the zone is active or in an error state, which is faster and more reliable than tailing log files.
Q: Can a correct zone file cause SERVFAIL if the serial number is not incremented?
A: Not SERVFAIL, but stale data. BIND reloads the zone file when instructed via rndc reload, regardless of serial number. However, secondary nameservers use the serial to decide whether to request a zone transfer from the primary. If you update a zone file but do not increment the serial, secondary servers will not fetch the new data, and clients querying a secondary will keep getting the old answers. Always increment the serial on every zone file change.
