InfraRunBook

    DNS Resolution Failure Troubleshooting

    DNS
    Published: Apr 4, 2026
    Updated: Apr 4, 2026

    A step-by-step guide to diagnosing and resolving DNS resolution failures, covering resolver outages, misconfigured nameservers, firewall blocks on port 53, recursion misconfigurations, and broken BIND zone files.

    Symptoms

    DNS resolution failures surface across every layer of your infrastructure simultaneously. When DNS breaks, applications stall mid-connection, SSH sessions hang for 30 seconds on hostname lookups before timing out, monitoring systems start firing cascade alerts, and users report that "the internet is down" when the network itself is perfectly healthy. Recognizing the symptom pattern before running any diagnostic commands saves significant time.

    Common symptoms include:

    • Name resolution timeout: ping sw-infrarunbook-01.solvethenetwork.com hangs for several seconds, then returns ping: sw-infrarunbook-01.solvethenetwork.com: Name or service not known
    • SERVFAIL responses: dig returns status: SERVFAIL even for names that definitely exist in the zone
    • NXDOMAIN on existing records: Records that exist in the zone return NXDOMAIN, indicating the client is querying a server or view that does not hold the zone, or that a cached negative answer has not yet expired
    • REFUSED responses: The resolver explicitly rejects queries, often because recursion is disabled or an ACL excludes the client subnet
    • Partial resolution failures: Internal hostnames fail while external names resolve correctly (or vice versa), pointing to a split-horizon, forwarder, or zone loading issue
    • Intermittent failures: Some queries succeed while others fail for the same name, which is common when a round-robin resolver pool has one unhealthy member
    • Application-level errors: ERR_NAME_NOT_RESOLVED in browsers, getaddrinfo ENOTFOUND in Node.js logs, java.net.UnknownHostException in JVM applications
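
    The symptom list above can be turned into a first-pass triage helper. The following is a minimal sketch: the function names dig_status and status_hint are hypothetical, and the status-to-cause mapping is a rough heuristic drawn from this article rather than an exhaustive diagnosis.

```shell
# Read dig output on stdin and print the first response status (e.g. SERVFAIL).
# Prints nothing if no status line is present (for example, on a timeout).
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Map a status code to the area of this runbook worth checking first.
status_hint() {
    case "$1" in
        SERVFAIL) echo "check zone loading, DNSSEC validation, and forwarders" ;;
        NXDOMAIN) echo "check which nameserver is queried and for stale negative cache" ;;
        REFUSED)  echo "check recursion and allow-recursion ACLs" ;;
        NOERROR)  echo "resolver answered; check the client resolver path" ;;
        "")       echo "no response; check resolver process and port 53 reachability" ;;
        *)        echo "unrecognized status: $1" ;;
    esac
}
```

    Typical use would be status_hint "$(dig @192.168.1.10 solvethenetwork.com A | dig_status)", run from an affected client.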

    Root Cause 1: Resolver Down

    Why It Happens

    The resolver — whether a BIND instance, Unbound, dnsmasq, or a network appliance — can crash, become unreachable, or exhaust system resources. Common triggers include OOM kills when cache memory grows unbounded, a runaway query flood that spikes CPU and causes the process watchdog to kill it, a corrupt configuration that causes the daemon to exit on reload, or a network change that isolates the resolver host from its clients. This is the first thing to rule out because it is the most complete failure mode: nothing resolves.

    How to Identify It

    From a client host, run a direct query against the configured resolver and observe the response:

    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 solvethenetwork.com A
    
    ; <<>> DiG 9.18.12 <<>> @192.168.1.10 solvethenetwork.com A
    ; (1 server found)
    ;; global options: +cmd
    ;; connection timed out; no servers could be reached

    A timeout with "no servers could be reached" is the clearest possible sign the resolver is not responding. Confirm whether the host is reachable at all before logging into it:

    infrarunbook-admin@sw-infrarunbook-01:~$ ping -c 3 192.168.1.10
    PING 192.168.1.10 (192.168.1.10) 56(84) bytes of data.
    64 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.412 ms
    64 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.388 ms
    64 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.401 ms

    The host responds to ICMP but DNS is not answering. Log into the resolver and check the service state:

    infrarunbook-admin@sw-infrarunbook-01:~$ systemctl status named
    ● named.service - BIND Domain Name Server
         Loaded: loaded (/lib/systemd/system/named.service; enabled)
         Active: failed (Result: exit-code) since Fri 2026-04-04 09:12:44 UTC; 18min ago
        Process: 2341 ExecStart=/usr/sbin/named -f -u bind (code=exited, status=1/FAILURE)
       Main PID: 2341 (code=exited, status=1/FAILURE)
    
    Apr 04 09:12:44 sw-infrarunbook-01 named[2341]: loading configuration from '/etc/bind/named.conf'
    Apr 04 09:12:44 sw-infrarunbook-01 named[2341]: /etc/bind/named.conf.options:12: unknown option 'forwarders-policy'
    Apr 04 09:12:44 sw-infrarunbook-01 named[2341]: loading configuration: unexpected token

    If the process was killed by the OOM killer rather than a config error, check kernel logs:

    infrarunbook-admin@sw-infrarunbook-01:~$ journalctl -k | grep -i "oom\|killed"
    Apr 04 08:55:10 sw-infrarunbook-01 kernel: Out of memory: Killed process 2219 (named) score 312 total-vm:1048576kB
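
    To separate the two failure classes above quickly, the log lines can be classified with a small filter. This is a sketch only: classify_named_failure is a hypothetical name, and the patterns are taken from the sample log lines in this section, so real messages may differ between BIND and kernel versions.

```shell
# Read journal lines on stdin and print a coarse failure class for named.
classify_named_failure() {
    awk '
        /Out of memory: Killed process [0-9]+ \(named\)/ { oom = 1 }
        /unknown option|loading configuration: /         { cfg = 1 }
        END {
            if (oom)      print "oom-kill"
            else if (cfg) print "config-error"
            else          print "unknown"
        }'
}
```

    For example: { journalctl -u named -b; journalctl -k -b; } | classify_named_failure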

    How to Fix It

    Address the configuration error, validate it, then restart the service:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkconf /etc/bind/named.conf
    /etc/bind/named.conf.options:12: unknown option 'forwarders-policy'
    
    # Correct the typo in named.conf.options — remove or fix the invalid directive
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo nano /etc/bind/named.conf.options
    
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkconf /etc/bind/named.conf
    # No output means clean
    
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo systemctl start named
    infrarunbook-admin@sw-infrarunbook-01:~$ systemctl status named
    ● named.service - BIND Domain Name Server
         Active: active (running) since Fri 2026-04-04 09:31:02 UTC; 3s ago

    If the OOM killer is the culprit, reduce the resolver cache size in named.conf.options:

    options {
        max-cache-size 256m;
        max-cache-ttl 3600;
    };

    Root Cause 2: Wrong Nameserver Configured

    Why It Happens

    A client pointed at the wrong nameserver will receive incorrect or empty answers. This arises from stale DHCP leases pushing an old resolver IP, a manually misconfigured /etc/resolv.conf that survived a rebuild, a broken systemd-resolved stub listener, or a cloud metadata service returning resolver addresses from a different VPC or subnet. When the queried server has no knowledge of the internal zone, it returns NXDOMAIN (if it has no forwarder configured for the zone) or silently forwards to upstream public resolvers that also have no knowledge of your private namespace.

    How to Identify It

    Check what nameserver the system is actually using at this moment:

    infrarunbook-admin@sw-infrarunbook-01:~$ cat /etc/resolv.conf
    # This file is managed by man:systemd-resolved(8).
    nameserver 172.16.0.254
    search solvethenetwork.com
    
    infrarunbook-admin@sw-infrarunbook-01:~$ resolvectl status
    Global
           Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
    resolv.conf mode: uplink
    Current DNS Server: 172.16.0.254
           DNS Servers: 172.16.0.254

    Now query that server directly and compare the result against the known authoritative resolver:

    infrarunbook-admin@sw-infrarunbook-01:~$ dig @172.16.0.254 sw-infrarunbook-01.solvethenetwork.com A
    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 44312
    
    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 sw-infrarunbook-01.solvethenetwork.com A
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19872
    ;; ANSWER SECTION:
    sw-infrarunbook-01.solvethenetwork.com. 300 IN A 192.168.1.50

    The authoritative resolver at 192.168.1.10 returns the correct answer. The wrong server at 172.16.0.254 returns NXDOMAIN because it has no knowledge of the internal zone.
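
    The comparison above can be checked mechanically across many hosts. The sketch below assumes the expected resolvers are 192.168.1.10 and 192.168.1.11, as used in this article; list_nameservers and nameservers_ok are hypothetical helper names.

```shell
# Print the nameserver addresses from a resolv.conf-style file, one per line.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "$1"
}

# Succeed only if every configured nameserver appears in the expected list.
# Usage: nameservers_ok FILE EXPECTED_IP [EXPECTED_IP...]
nameservers_ok() {
    file=$1; shift
    for ns in $(list_nameservers "$file"); do
        case " $* " in
            *" $ns "*) ;;
            *) echo "unexpected nameserver: $ns" >&2; return 1 ;;
        esac
    done
}
```

    For example: nameservers_ok /etc/resolv.conf 192.168.1.10 192.168.1.11. On hosts where systemd-resolved runs in stub mode, /etc/resolv.conf lists 127.0.0.53, so point the check at /run/systemd/resolve/resolv.conf instead.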

    How to Fix It

    On systemd-networkd managed hosts, update the DNS setting in the network unit file:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo nano /etc/systemd/network/10-eth0.network
    
    [Network]
    DNS=192.168.1.10
    DNS=192.168.1.11
    
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo systemctl restart systemd-networkd
    infrarunbook-admin@sw-infrarunbook-01:~$ resolvectl status
    Current DNS Server: 192.168.1.10
           DNS Servers: 192.168.1.10 192.168.1.11

    If /etc/resolv.conf is managed manually and needs direct correction:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo chattr -i /etc/resolv.conf
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo tee /etc/resolv.conf <<'EOF'
    nameserver 192.168.1.10
    nameserver 192.168.1.11
    search solvethenetwork.com
    EOF

    Root Cause 3: Firewall Blocking UDP/TCP Port 53

    Why It Happens

    DNS traffic runs on port 53, using UDP for standard queries and TCP for zone transfers and for any response too large for UDP: more than 512 bytes without EDNS0, or more than the buffer size the client advertises with EDNS0 (commonly 1232 or 4096 bytes). Overly restrictive firewall rules, whether applied on the resolver host itself, on an intermediate network appliance, or in a cloud security group, can silently drop DNS packets in one or both directions. A particularly common scenario is a security hardening script that locks down all UDP by default and only whitelists specific application UDP ports, forgetting to include 53. Another scenario is a stateful firewall that permits outbound queries but blocks the return UDP datagrams because they arrive on unexpected ports or with unexpected source IPs.

    How to Identify It

    Test both UDP and TCP connectivity to port 53 directly, bypassing the resolver library entirely:

    infrarunbook-admin@sw-infrarunbook-01:~$ nc -zuv 192.168.1.10 53
    Connection to 192.168.1.10 53 port [udp/domain] succeeded!
    
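    infrarunbook-admin@sw-infrarunbook-01:~$ nc -zv 192.168.1.10 53
    nc: connect to 192.168.1.10 port 53 (tcp) failed: Connection timed out

    UDP appears to succeed but TCP is blocked. Treat the UDP result with caution: nc reports success for UDP unless an ICMP port-unreachable message comes back, so only a real query confirms that UDP service works. A timeout (rather than "Connection refused") on TCP is what a firewall DROP rule produces. Blocked TCP causes silent failures for large DNS responses (DNSSEC, long TXT records, large ANY responses), because truncated UDP answers cannot be retried over TCP. Confirm with dig over TCP:

    infrarunbook-admin@sw-infrarunbook-01:~$ dig +tcp @192.168.1.10 solvethenetwork.com ANY
    ;; communications error to 192.168.1.10#53: timed out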

    On the resolver host, inspect the active firewall rules:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo iptables -L INPUT -n -v
    Chain INPUT (policy DROP)
     pkts bytes target  prot  opt in  out  source        destination
        0     0 ACCEPT  udp   --  *   *    0.0.0.0/0     192.168.1.10   udp dpt:53
        0     0 DROP    tcp   --  *   *    0.0.0.0/0     192.168.1.10   tcp dpt:53

    The DROP rule on TCP port 53 is explicit. If using nftables instead:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo nft list ruleset | grep -B2 -A2 "port 53"
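
    When the rule listing is long, the port 53 blocks can be pulled out with a filter. This is a minimal sketch that assumes the column layout of iptables -L INPUT -n -v shown above (target in column 3, protocol in column 4); find_dns_blocks is a hypothetical name.

```shell
# Read `iptables -L INPUT -n -v` output on stdin and print every rule that
# drops or rejects traffic to destination port 53, as "<proto> <target>".
find_dns_blocks() {
    awk '/dpt:53([^0-9]|$)/ && ($3 == "DROP" || $3 == "REJECT") { print $4, $3 }'
}
```

    For example: sudo iptables -L INPUT -n -v | find_dns_blocks. An empty result does not prove port 53 is open, since a default DROP policy with no matching ACCEPT rule blocks traffic without an explicit rule.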

    How to Fix It

    Insert rules to permit both UDP and TCP on port 53 and persist them:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo iptables -I INPUT -p tcp --dport 53 -j ACCEPT
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo iptables -I INPUT -p udp --dport 53 -j ACCEPT
    
    # Persist across reboots:
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo netfilter-persistent save
    run-parts: executing /usr/share/netfilter-persistent/plugins.d/15-ip4tables save
    run-parts: executing /usr/share/netfilter-persistent/plugins.d/25-ip6tables save

    For nftables environments:

    # "insert" places the rule at the top of the chain, ahead of any existing drop rule
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo nft insert rule inet filter input ip daddr 192.168.1.10 tcp dport 53 accept
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo nft insert rule inet filter input ip daddr 192.168.1.10 udp dport 53 accept

    Verify the fix resolves both protocols:

    infrarunbook-admin@sw-infrarunbook-01:~$ dig +tcp @192.168.1.10 solvethenetwork.com SOA
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6712
    ;; ANSWER SECTION:
    solvethenetwork.com. 86400 IN SOA ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. 2026040401 3600 900 604800 300

    Root Cause 4: Recursion Disabled

    Why It Happens

    BIND distinguishes between authoritative queries (the server holds the zone and answers from local data) and recursive queries (the server looks up external names on behalf of the client, following the delegation chain from root). When recursion no; is set globally, a common hardening step for authoritative-only nameservers, any client that needs the resolver to look up external names will receive an explicit REFUSED. The same result occurs when recursion is enabled globally but the allow-recursion ACL does not include the querying client's subnet. This failure mode is especially disruptive because the resolver is healthy and the zone is loaded correctly; only recursive queries fail.

    How to Identify It

    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 cloudflare.com A
    ;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 55023
    ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
    ;; WARNING: recursion requested but not available

    The message "WARNING: recursion requested but not available" is the definitive indicator. Internal zone names may still work if the server is authoritative for them, but all external lookups will fail. Confirm by checking the BIND configuration:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo grep -n "recursion\|allow-recursion" /etc/bind/named.conf.options
    8:  recursion no;
    9:  allow-recursion { none; };

    Or, if recursion is enabled but the ACL is too narrow:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo grep -n "allow-recursion" /etc/bind/named.conf.options
    8:  allow-recursion { 10.10.10.0/24; };
    
    # A client on 192.168.1.0/24 is not in this ACL and will receive REFUSED
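
    Whether a given client falls inside an allow-recursion entry is plain CIDR arithmetic. The sketch below shows that check for IPv4 only; ip_to_int and ip_in_cidr are hypothetical helper names, and octets with leading zeros are not handled.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if IPv4 address $1 falls inside CIDR block $2 (e.g. 10.10.10.0/24).
ip_in_cidr() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}
```

    For example, ip_in_cidr 192.168.1.25 10.10.10.0/24 fails, which matches the REFUSED response described above, while ip_in_cidr 192.168.1.25 192.168.0.0/16 succeeds.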

    How to Fix It

    Edit /etc/bind/named.conf.options to enable recursion and specify the authorized client subnets:

    options {
        directory "/var/cache/bind";
    
        recursion yes;
        allow-recursion {
            192.168.0.0/16;
            10.0.0.0/8;
            172.16.0.0/12;
            127.0.0.1;
        };
    
        forwarders {
            8.8.8.8;
            1.1.1.1;
        };
        forward only;
    };

    Validate the configuration and reload without restarting:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkconf
    # No output = clean
    
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo rndc reload
    server reload successful
    
    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 cloudflare.com A
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31881
    ;; ANSWER SECTION:
    cloudflare.com. 299 IN A 104.16.132.229

    Security note: Never set allow-recursion { any; }; on a resolver with a public IP. Open resolvers are abused for DNS amplification DDoS attacks. Always restrict recursion to known RFC 1918 ranges or trusted client prefixes.

    Root Cause 5: Broken Zone File

    Why It Happens

    Zone file syntax errors are one of the most common causes of resolution failure for internal names. A missing trailing dot on a fully qualified domain name, a syntax error introduced during a manual edit (a missing parenthesis in the SOA record, stray leading whitespace that changes a record's owner name), a record type used incorrectly, or a corrupted dynamic DNS journal file can all prevent BIND from loading the zone or cause it to serve the wrong data. A serial number that was not incremented does not block loading, but it stops secondaries from picking up the change. When a zone fails to load, all queries for names within that zone return SERVFAIL rather than NXDOMAIN, because the server knows it should be authoritative but cannot serve the data.

    How to Identify It

    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 sw-infrarunbook-01.solvethenetwork.com A
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23017
    
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo journalctl -u named --since "10 min ago"
    Apr 04 09:45:12 sw-infrarunbook-01 named[3104]: zone solvethenetwork.com/IN: loading from master file /etc/bind/zones/db.solvethenetwork.com failed: not at top of zone
    Apr 04 09:45:12 sw-infrarunbook-01 named[3104]: zone solvethenetwork.com/IN: not loaded due to errors.

    Use named-checkzone for a detailed report with line numbers:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkzone solvethenetwork.com /etc/bind/zones/db.solvethenetwork.com
    /etc/bind/zones/db.solvethenetwork.com:22: solvethenetwork.com: not at top of zone
    zone solvethenetwork.com/IN: not loaded due to errors.

    Inspect the zone file at and around line 22:

    infrarunbook-admin@sw-infrarunbook-01:~$ sed -n '18,25p' /etc/bind/zones/db.solvethenetwork.com
    $ORIGIN solvethenetwork.com.
    $TTL 300
    @   IN  SOA  ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. (
                    2026040401  ; serial
                    3600        ; refresh
                    900         ; retry
                    604800      ; expire
                    300 )       ; minimum TTL
    @   IN  NS   ns1.solvethenetwork.com.
    ; Missing trailing dot on the CNAME target:
    www IN  CNAME  solvethenetwork.com

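    The CNAME target solvethenetwork.com without a trailing dot is treated as a relative name, expanding to solvethenetwork.com.solvethenetwork.com., which sits inside the zone but is not the intended target. A relative target like this usually loads without complaint and silently points at the wrong name, so review every right-hand-side FQDN for a trailing dot. The "not at top of zone" error itself is reported when a record that must sit at the zone apex (typically the SOA) has a different owner name, so also confirm that the $ORIGIN line and the SOA owner name match the zone name passed to named-checkzone.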
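
    Missing trailing dots are easy to miss by eye, so a simple lint pass can help. The following is a sketch under stated assumptions: it only inspects CNAME, NS, and PTR records, it treats any target that contains a dot but does not end in one as suspicious, and find_relative_targets is a hypothetical name. It complements named-checkzone rather than replacing it.

```shell
# Print "<line>: <target>" for CNAME, NS, and PTR records whose target contains
# a dot but does not end in one. Such targets are expanded relative to $ORIGIN.
find_relative_targets() {
    awk '
        /^[ \t]*;/ { next }                  # skip comment-only lines
        {
            for (i = 1; i < NF; i++) {
                if ($i == "CNAME" || $i == "NS" || $i == "PTR") {
                    target = $(i + 1)
                    if (target ~ /\./ && target !~ /\.$/)
                        print NR ": " target
                }
            }
        }' "$@"
}
```

    For example: find_relative_targets /etc/bind/zones/db.solvethenetwork.com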

    How to Fix It

    Add the trailing dot and increment the serial number:

    ; Before:
    www IN  CNAME  solvethenetwork.com
    
    ; After:
    www IN  CNAME  solvethenetwork.com.
    
    ; Also increment serial from 2026040401 to 2026040402
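
    Serial bumps in the YYYYMMDDNN style used above can be computed rather than typed by hand. This is a minimal sketch; next_serial is a hypothetical helper, and it assumes the first revision of a day is numbered 01, as in the example serial 2026040401.

```shell
# Print the next YYYYMMDDNN serial given the current serial and, optionally,
# today's date as YYYYMMDD. If the current serial is already at or beyond
# today's first revision, it is simply incremented by one.
next_serial() {
    current=$1
    today=${2:-$(date +%Y%m%d)}
    first=$(( today * 100 + 1 ))
    if [ "$current" -lt "$first" ]; then
        echo "$first"
    else
        echo $(( current + 1 ))
    fi
}
```

    For example, next_serial 2026040401 20260404 prints 2026040402, and next_serial 2026040401 20260405 prints 2026040501.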

    Validate then reload the zone without a full named restart:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkzone solvethenetwork.com /etc/bind/zones/db.solvethenetwork.com
    zone solvethenetwork.com/IN: loaded serial 2026040402
    OK
    
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo rndc reload solvethenetwork.com
    zone reload queued
    
    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 sw-infrarunbook-01.solvethenetwork.com A
    ;; ANSWER SECTION:
    sw-infrarunbook-01.solvethenetwork.com. 300 IN A 192.168.1.50

    Root Cause 6: Stale Cache or Long Negative TTL

    Why It Happens

    DNS resolvers cache both positive responses (the record exists) and negative responses (NXDOMAIN, meaning the name does not exist). If a record was deleted or changed but the old TTL has not expired, clients continue receiving the stale answer. Conversely, if a record was temporarily missing and NXDOMAIN was cached with a long negative TTL (controlled by the SOA minimum field per RFC 2308), clients receive NXDOMAIN long after the issue is resolved.
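
    Under RFC 2308, the negative-caching TTL is the smaller of the SOA record's own TTL and its minimum field. The sketch below computes it from a dig SOA answer line; negative_ttl is a hypothetical name, and it assumes the single-line SOA format that dig prints.

```shell
# Read a dig SOA answer line on stdin and print the negative-caching TTL,
# i.e. the smaller of the SOA record's own TTL and its MINIMUM field.
negative_ttl() {
    awk '$4 == "SOA" {
        ttl = $2 + 0
        min = $NF + 0
        print (ttl < min ? ttl : min)
        exit
    }'
}
```

    For example: dig +noall +answer @192.168.1.10 solvethenetwork.com SOA | negative_ttl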

    How to Identify It

    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 sw-infrarunbook-01.solvethenetwork.com A
    ;; ANSWER SECTION:
    sw-infrarunbook-01.solvethenetwork.com. 287 IN A 192.168.1.50
    
    # A TTL of 287 (counting down from 300) means this is a cached response
    # Query the zone's authoritative server directly with +norecurse to compare:
    infrarunbook-admin@sw-infrarunbook-01:~$ dig @ns1.solvethenetwork.com sw-infrarunbook-01.solvethenetwork.com A +norecurse
    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 41204

    The resolver has a positive cached answer but the authoritative server now returns NXDOMAIN, confirming a stale positive cache entry.

    How to Fix It

    # Flush a specific name from the BIND cache:
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo rndc flushname sw-infrarunbook-01.solvethenetwork.com
    
    # Flush a name and everything beneath it (rndc flush <name> would be read as a view name):
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo rndc flushtree solvethenetwork.com
    
    # Nuclear option: flush the entire resolver cache:
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo rndc flush
    
    # For systemd-resolved clients:
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo resolvectl flush-caches

    Root Cause 7: DNSSEC Validation Failure

    Why It Happens

    When DNSSEC validation is enabled on a resolver, any mismatch between the zone's RRSIG signatures and the DS records published in the parent zone causes the resolver to return SERVFAIL instead of the actual answer. This occurs after a KSK/ZSK rollover that was not properly coordinated with the parent zone, after zone signing configuration changes, or when the resolver host's system clock is skewed enough to fall outside a signature's validity window.

    How to Identify It

    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 solvethenetwork.com A +dnssec
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 9901
    
    # Re-query with +cd (checking disabled) to bypass validation:
    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 solvethenetwork.com A +dnssec +cd
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2213
    ;; ANSWER SECTION:
    solvethenetwork.com. 300 IN A 192.168.1.100

    The +cd flag bypasses DNSSEC validation and returns a result, confirming the data exists but validation is failing. Check resolver logs for the specific validation error:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo journalctl -u named | grep -i "dnssec\|bogus\|validation"
    Apr 04 10:02:31 sw-infrarunbook-01 named[3104]: validating solvethenetwork.com/DNSKEY: no valid signature found (DS)
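
    The two-query comparison above reduces to a small decision rule. The following sketch encodes it; dnssec_verdict is a hypothetical name and the verdict strings are illustrative labels, not BIND output.

```shell
# Compare the status of a normal query with the status of the same query
# sent with +cd (checking disabled), and print a coarse verdict.
dnssec_verdict() {
    normal=$1
    with_cd=$2
    if [ "$normal" = "SERVFAIL" ] && [ "$with_cd" = "NOERROR" ]; then
        echo "validation-failure"
    elif [ "$normal" = "SERVFAIL" ]; then
        echo "servfail-not-dnssec"
    else
        echo "no-dnssec-problem-seen"
    fi
}
```

    The second verdict means the failure persists even with validation bypassed, so the cause is more likely a broken zone or forwarder than DNSSEC.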

    How to Fix It

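    For internal zones where DNSSEC is not operationally required, exempt that specific zone from validation. The dnssec-validation statement is not accepted inside a zone block; on BIND 9.14 and later, list the domain in validate-except in the options block instead:

    options {
        validate-except { "solvethenetwork.com"; };
    };

    zone "solvethenetwork.com" IN {
        type forward;
        forwarders { 192.168.1.10; };
        forward only;
    };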

    For production zones, re-sign the zone and re-publish the DS record in the parent to restore the chain of trust. Check the system clock skew first:

    infrarunbook-admin@sw-infrarunbook-01:~$ timedatectl status
                   Local time: Fri 2026-04-04 10:05:11 UTC
               Universal time: Fri 2026-04-04 10:05:11 UTC
         NTP service: active
    NTP synchronized: yes

    Root Cause 8: Forwarder Misconfiguration

    Why It Happens

    Most internal resolvers are configured to forward queries for unknown zones to upstream servers. If the forwarder IP is wrong, the upstream is unreachable, the upstream is returning errors, or the forward only; directive prevents fallback to root hints when the forwarder fails, all recursive lookups for external names will return SERVFAIL, even though the resolver daemon itself is running correctly and internal zones are loading fine.

    How to Identify It

    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 cloudflare.com A
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27231
    
    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 sw-infrarunbook-01.solvethenetwork.com A
    ;; ANSWER SECTION:
    sw-infrarunbook-01.solvethenetwork.com. 300 IN A 192.168.1.50

    Internal names resolve; external names return SERVFAIL. This asymmetry almost always points to a broken forwarder. Check the configuration:

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo grep -A5 "forwarders" /etc/bind/named.conf.options
    forwarders {
        172.16.99.99;   # This host does not exist on the network
    };
    forward only;
    
    infrarunbook-admin@sw-infrarunbook-01:~$ dig @172.16.99.99 cloudflare.com A
    ;; connection timed out; no servers could be reached
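
    With several forwarders configured, each one needs the same direct test. The sketch below extracts the addresses so they can be probed in a loop; list_forwarders is a hypothetical name, and it only handles the simple forwarders { ip; ip; }; layout shown above.

```shell
# Print each forwarder address found in a named.conf-style file, one per line.
# Comments introduced by # or // are stripped before parsing.
list_forwarders() {
    sed -e 's/#.*//' -e 's,//.*,,' "$1" |
    awk '
        /forwarders[ \t]*[{]/ { inside = 1 }
        inside {
            line = $0
            sub(/.*[{]/, "", line)           # drop everything up to the opening brace
            n = split(line, parts, /[ \t;}]+/)
            for (i = 1; i <= n; i++)
                if (parts[i] ~ /^[0-9a-fA-F:.]+$/) print parts[i]
            if ($0 ~ /[}]/) inside = 0
        }'
}
```

    For example: for ip in $(list_forwarders /etc/bind/named.conf.options); do dig +time=2 +tries=1 @"$ip" cloudflare.com A +short || echo "$ip unreachable"; done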

    How to Fix It

    infrarunbook-admin@sw-infrarunbook-01:~$ sudo nano /etc/bind/named.conf.options
    
    forwarders {
        8.8.8.8;
        8.8.4.4;
    };
    forward only;
    
    infrarunbook-admin@sw-infrarunbook-01:~$ sudo named-checkconf && sudo rndc reload
    server reload successful
    
    infrarunbook-admin@sw-infrarunbook-01:~$ dig @192.168.1.10 cloudflare.com A
    ;; ANSWER SECTION:
    cloudflare.com. 299 IN A 104.16.132.229

    Prevention

    The majority of DNS resolution failures are preventable through configuration discipline, pre-deployment validation, redundancy, and active monitoring. Adopt the following practices to avoid outages before they happen:

    • Automate zone file validation in CI/CD. Run named-checkconf and named-checkzone as blocking gates in every pipeline that touches DNS configuration. A zone file error that passes peer review but fails named-checkzone should never reach production.
    • Deploy a minimum of two resolvers per site. A single-resolver environment means any service restart, host reboot, or configuration reload failure causes a total DNS outage. Configure both IPs in DHCP scope options and in each host's /etc/resolv.conf or network unit file.
    • Monitor resolver health with synthetic probes. Use Prometheus's blackbox_exporter DNS probe module to query a known-good name every 30 seconds and alert on SERVFAIL, NXDOMAIN, or response time exceeding 200 ms.
    • Keep negative TTLs short. Set the SOA minimum field to 300–600 seconds. A value of 3600 means a misconfiguration that causes NXDOMAIN caching takes a full hour to self-heal after the fix is applied.
    • Restrict recursion to named subnets. Always use allow-recursion with explicit RFC 1918 ranges. Never use allow-recursion { any; }; on any server reachable from outside your network perimeter.
    • Version-control all zone files. Store zone files in git with a commit per change. Each commit message should include the change reason and the new serial number, enabling rapid rollback and providing an audit trail.
    • Test both UDP and TCP port 53 after every firewall change. Zone transfers require TCP, and so does any response that is truncated over UDP, such as large DNSSEC or TXT answers. A post-change test of both protocols should be a mandatory step in every firewall change procedure.
    • Synchronize resolver system clocks with NTP. DNSSEC signatures have validity windows. Alert if clock drift on resolver hosts exceeds five seconds, and never let resolvers run without an NTP source.
    • Use rndc zonestatus as a post-deploy check. After any zone reload, sudo rndc zonestatus solvethenetwork.com immediately confirms the loaded serial and zone state without log scraping.
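
    The CI/CD validation gate from the first bullet could look roughly like the sketch below. It assumes zone files are named db.<zone>, the layout used throughout this article, and that named-checkzone is installed on the CI runner; zone_from_path and check_zones are hypothetical names.

```shell
# Derive the zone name from a file named db.<zone>.
zone_from_path() {
    name=$(basename "$1")
    echo "${name#db.}"
}

# Validate every db.* file in a directory; return non-zero if any zone fails.
check_zones() {
    status=0
    for f in "$1"/db.*; do
        [ -e "$f" ] || continue            # directory contains no zone files
        named-checkzone "$(zone_from_path "$f")" "$f" || status=1
    done
    return "$status"
}
```

    In a pipeline step, named-checkconf && check_zones /etc/bind/zones would fail the build on the first configuration or zone error instead of letting it reach the resolver.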

    Frequently Asked Questions

    Q: What is the difference between SERVFAIL and NXDOMAIN?

    A: NXDOMAIN (Non-Existent Domain) means the authoritative server was successfully consulted and confirmed that the name does not exist in the zone. SERVFAIL means the resolver encountered an error trying to answer — the zone may exist, but the resolver could not retrieve or validate the data. SERVFAIL always warrants infrastructure investigation. NXDOMAIN usually means either the record is genuinely absent or the resolver is querying the wrong nameserver for the zone.

    Q: Why does dig return the correct answer but ping or curl fails to resolve the same name?

    A: dig queries the nameserver directly and bypasses the system resolver library entirely. Applications like ping and curl use getaddrinfo(), which reads /etc/nsswitch.conf and may route queries through /etc/hosts, mDNS, or systemd-resolved's stub listener at 127.0.0.53 before reaching the DNS server configured in /etc/resolv.conf. Check the hosts: line in /etc/nsswitch.conf and run resolvectl status to see the effective resolver used by the stub.

    Q: How do I flush the DNS cache on a Linux host without restarting the resolver?

    A: On systemd-resolved clients: sudo resolvectl flush-caches. On systems running nscd: sudo systemctl restart nscd. On the BIND resolver itself: sudo rndc flush clears the entire cache, sudo rndc flushtree solvethenetwork.com clears that name and everything beneath it, and sudo rndc flushname sw-infrarunbook-01.solvethenetwork.com clears a single name.

    Q: Why does an authoritative nameserver return REFUSED for some queries?

    A: Authoritative-only nameservers run with recursion no; by design. They answer queries only for zones they are configured to host. Queries for any other name, including names on the internet, return REFUSED. Make sure clients that need general recursive resolution are pointed at a recursive resolver, not directly at an authoritative nameserver.

    Q: How can I confirm whether DNS traffic is going over UDP or TCP?

    A: Force TCP in dig with the +tcp flag: dig +tcp @192.168.1.10 solvethenetwork.com. Without it, dig uses UDP and retries over TCP only if the reply comes back truncated. To capture both protocols in real time on the resolver, run sudo tcpdump -i eth0 -nn port 53. UDP queries show as single datagrams, while TCP queries establish a three-way handshake before the DNS payload.

    Q: What does a missing trailing dot in a zone file actually cause?

    A: Without a trailing dot, BIND treats the name as relative and appends the current $ORIGIN. In a zone file with $ORIGIN solvethenetwork.com., a CNAME target written as solvethenetwork.com (no dot) expands to solvethenetwork.com.solvethenetwork.com., a completely different and almost certainly non-existent name. Always use trailing dots on FQDNs inside zone file records.

    Q: How do I verify that my resolver is not an open resolver?

    A: Query your resolver from a host outside your network for a name it should not be authoritative for: dig @<resolver-public-IP> cloudflare.com A. If it returns a valid answer, your resolver is accepting recursive queries from the internet and is an open resolver. Immediately restrict allow-recursion in named.conf.options to RFC 1918 ranges and reload BIND.

    Q: Secondary nameservers are not picking up zone changes. Where should I look?

    A: Start with the serial number: secondaries will not initiate a transfer if the primary's serial is equal to or lower than what they already hold. Then verify that the TSIG key configuration matches on both primary and secondary, that allow-transfer on the primary includes the secondary's IP, and that firewall rules permit TCP port 53 between primary and secondary. Run dig @192.168.1.11 solvethenetwork.com SOA against the secondary to compare its loaded serial with the primary's.

    Q: Why do SSH connections hang for 30 seconds before succeeding or failing?

    A: SSH performs a reverse DNS (PTR) lookup on the connecting client's IP when UseDNS yes is set in sshd_config. If the resolver is slow or no PTR record exists for the client's IP, the lookup times out and delays the connection. Set UseDNS no in /etc/ssh/sshd_config on servers where reverse DNS is unreliable, or add PTR records for the client subnets your users connect from.

    Q: How do I test DNS from inside a container that has no dig or nslookup installed?

    A: Use getent hosts sw-infrarunbook-01.solvethenetwork.com. It resolves through the system's NSS stack, respecting /etc/nsswitch.conf and /etc/resolv.conf inside the container; getent ahosts exercises getaddrinfo() specifically. You can also use cat /etc/resolv.conf to see which resolver the container is using, and curl --resolve sw-infrarunbook-01.solvethenetwork.com:80:192.168.1.50 http://sw-infrarunbook-01.solvethenetwork.com/ to bypass DNS entirely for connectivity tests.

    Q: What is the fastest way to confirm BIND loaded a zone update correctly?

    A: Run sudo rndc zonestatus solvethenetwork.com immediately after the reload. It reports the loaded serial, zone type (primary/secondary), file path, and whether the zone is active or in an error state, which is faster and more reliable than tailing log files.

    Q: Can a correct zone file cause SERVFAIL if the serial number is not incremented?

    A: Not SERVFAIL, but stale data. BIND reloads the zone file when instructed via rndc reload, regardless of serial number. However, secondary nameservers use the serial to decide whether to request a zone transfer from the primary. If you update a zone file but do not increment the serial, secondary servers will not fetch the new data, and clients querying the secondary will keep getting the old answers. Always increment the serial on every zone file change.
