InfraRunBook

    DNS SERVFAIL Debugging

    DNS
    Published: Apr 12, 2026
    Updated: Apr 13, 2026

    A practical field guide to diagnosing DNS SERVFAIL errors, covering recursive resolver failures, DNSSEC validation breakdowns, unreachable authoritatives, forwarder outages, and zone data misconfigurations.


    Symptoms

    You're staring at a SERVFAIL and the clock is ticking. Maybe a monitoring alert fired, maybe a developer is messaging you because their app can't resolve anything, or maybe you ran a quick dig and got back a response code you didn't expect. DNS SERVFAIL — RCODE 2 — means the server encountered a failure it couldn't recover from. Unlike NXDOMAIN, which tells you the name doesn't exist, SERVFAIL tells you something broke in the resolution process itself. That distinction matters enormously when you're trying to debug.

    The symptoms usually look like one of these:

    $ dig solvethenetwork.com @192.168.1.53
    
    ; <<>> DiG 9.18.24 <<>> solvethenetwork.com @192.168.1.53
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 48291
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

    Or in BIND's query log:

    12-Apr-2026 08:14:32.412 client @0x7f3a8c003a20 192.168.10.45#51234 (solvethenetwork.com): query failed (SERVFAIL) for solvethenetwork.com/IN/A at query.c:9316

    Or surfacing in application logs as:

    SERVFAIL resolving 'mail.solvethenetwork.com/MX/IN': 192.168.1.53#53

    End users see "DNS_PROBE_FINISHED_SERVFAIL" in Chrome or "Server not found" in Firefox. The key thing to internalize: SERVFAIL is your resolver saying it tried and failed — not that the name doesn't exist. The cause could be anywhere in the resolution chain, from a misconfigured local resolver all the way out to a broken authoritative server on the far side of the internet. Your job is to work through that chain systematically.
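    When you're triaging from a pile of saved dig output or logs, it helps to pull the status field out mechanically rather than eyeball it. A minimal sketch — the helper name is mine, and the sample header is illustrative:

```shell
# dig_status: read dig output on stdin, print the RCODE (SERVFAIL, NXDOMAIN, ...)
dig_status() {
  grep -o 'status: [A-Z]*' | head -n1 | cut -d' ' -f2
}

# sample header line like the one shown above
echo ';; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 48291' | dig_status
```

    Feeding it SERVFAIL vs NXDOMAIN headers immediately tells you which debugging path you're on.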


    Root Cause 1: Recursive Resolver Failing

    This is always the first thing I check. A recursive resolver that's overloaded, misconfigured, or has exhausted a critical resource will throw SERVFAILs for everything — not just specific zones. If you're seeing blanket SERVFAIL responses across unrelated domains, the resolver itself is the problem, not any particular zone.

    Why does this happen? Common triggers include running out of file descriptors (the resolver can't open new UDP or TCP sockets to contact upstream servers), hitting the recursive-clients limit in BIND, or a resolver that was misconfigured for authoritative-only work but is still receiving recursive queries from clients.

    Start by checking whether the failure is universal or isolated to specific domains:

    $ dig google.com @192.168.1.53 +short
    ;; connection timed out; no servers could be reached
    
    $ dig cloudflare.com @192.168.1.53 +short
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 19234
    
    $ dig internal.solvethenetwork.com @192.168.1.53 +short
    10.10.1.50

    Internal zones resolving while external domains fail points directly at the recursion stack. Check BIND's current status and recursive client usage:

    $ rndc status
    version: BIND 9.18.24 (Stable Release)
    running on sw-infrarunbook-01: Linux x86_64 6.1.0
    ...
    recursive clients: 998/900/1000
    tcp clients: 287/300

    That format is current/softlimit/hardlimit. If the first number is near the hard limit, you're dropping queries. Also check system-level file descriptors:

    sw-infrarunbook-01:~# cat /proc/$(pgrep named)/limits | grep "open files"
    Max open files            4096                 4096                 files
    
    sw-infrarunbook-01:~# ls /proc/$(pgrep named)/fd | wc -l
    4087
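    If you'd rather have monitoring parse that current/softlimit/hardlimit triple than eyeball it, shell parameter expansion is enough. A sketch — the function name and the 95% warning threshold are arbitrary choices of mine:

```shell
# rc_check: parse a 'recursive clients: CUR/SOFT/HARD' line from rndc status
# and warn when current usage is within 5% of the hard limit
rc_check() {
  line=$1
  triple=${line#*: }       # -> '998/900/1000'
  cur=${triple%%/*}        # -> '998'
  hard=${triple##*/}       # -> '1000'
  if [ "$cur" -ge $(( hard * 95 / 100 )) ]; then
    echo "WARN ${cur}/${hard}"
  else
    echo "OK ${cur}/${hard}"
  fi
}

rc_check 'recursive clients: 998/900/1000'
```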

    That's nearly exhausted. Raise the limits in /etc/security/limits.conf (on systemd hosts, use a LimitNOFILE= override on the named unit instead — limits.conf doesn't apply to systemd-started services) and BIND's startup environment file, then restart:

    # /etc/default/named (Debian/Ubuntu) or /etc/sysconfig/named (RHEL)
    OPTIONS="-u bind -n 4"
    
    # /etc/security/limits.conf — covers non-systemd starts; under systemd,
    # set LimitNOFILE=65535 in a drop-in for the named unit instead
    bind    soft    nofile  65535
    bind    hard    nofile  65535

    In named.conf, also tune the client limits:

    options {
        recursive-clients 2000;
        tcp-clients 500;
    };

    Root Cause 2: DNSSEC Validation Failure

    In my experience, DNSSEC is responsible for more unexpected SERVFAILs than almost anything else — and it's especially insidious because the zone might resolve perfectly from a non-validating resolver, making it look like a resolver problem when it's actually a signing problem.

    When a DNSSEC-validating resolver (BIND with dnssec-validation auto;, or Unbound with validation enabled) can't verify the chain of trust for a response, it returns SERVFAIL rather than an unvalidated record. This happens when a zone's RRSIG has expired, when the DS record in the parent zone doesn't match the current DNSKEY, or when a key rollover occurred but the new DS record hasn't been published at the registrar yet.

    The diagnostic trick here is comparing dig +dnssec against dig +cd (checking disabled). If the query succeeds with +cd but fails without it, DNSSEC validation is your culprit:

    # This FAILS — validation is running and rejecting the response
    $ dig solvethenetwork.com @192.168.1.53 +dnssec
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 33812
    
    # This SUCCEEDS — validation bypassed with +cd
    $ dig solvethenetwork.com @192.168.1.53 +cd
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44201
    ;; ANSWER SECTION:
    solvethenetwork.com.  300  IN  A  203.0.113.10
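    That two-query comparison is easy to script as a canary. A sketch — classify() and its label strings are my own naming; in a live check the two arguments would come from parsing the status of a normal dig run and a +cd run:

```shell
# classify: map (status of normal query, status of +cd query) to a diagnosis
classify() {
  if [ "$1" = "SERVFAIL" ] && [ "$2" = "NOERROR" ]; then
    echo "dnssec-validation-failure"   # fails only when validating
  elif [ "$1" = "SERVFAIL" ]; then
    echo "servfail-not-dnssec"         # fails either way: look elsewhere
  else
    echo "resolving-ok"
  fi
}

classify SERVFAIL NOERROR
```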

    Now check whether the RRSIG is still valid:

    $ dig solvethenetwork.com @8.8.8.8 +dnssec +multiline
    ;; ANSWER SECTION:
    solvethenetwork.com. 300 IN A 203.0.113.10
    solvethenetwork.com. 300 IN RRSIG A 8 2 300 (
                    20260101000000 20251202000000 12345 solvethenetwork.com.
                    Km8fT2nYq... )
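    The expiry timestamp is plain YYYYMMDDHHMMSS in UTC, so the comparison scripts easily. A sketch assuming bash and GNU date; the reference time is hardcoded to the incident timestamp from the logs above purely for illustration — a live check would use date -u +%s:

```shell
# rrsig_expired: compare an RRSIG expiry timestamp against a reference epoch
rrsig_expired() {
  ts=$1; ref=$2
  # reshape YYYYMMDDHHMMSS into a string GNU date can parse
  exp=$(date -u -d "${ts:0:4}-${ts:4:2}-${ts:6:2} ${ts:8:2}:${ts:10:2}:${ts:12:2}" +%s)
  if [ "$ref" -ge "$exp" ]; then echo "expired"; else echo "valid"; fi
}

now=$(date -u -d '2026-04-12 08:14:32' +%s)   # sample 'now' from the incident
rrsig_expired 20260101000000 "$now"
```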

    The first timestamp is the expiry. If it's in the past, zone signing has lapsed. Compare the DS record in the parent against your current DNSKEY:

    $ dig DS solvethenetwork.com @a.gtld-servers.net +norecurse
    ;; ANSWER SECTION:
    solvethenetwork.com.  3600  IN  DS  12345 8 2 A1B2C3D4E5F6...
    
    $ dig DNSKEY solvethenetwork.com @192.168.2.10
    ;; ANSWER SECTION:
    solvethenetwork.com.  3600  IN  DNSKEY  257 3 8 AwEAAb9xZ...

    Hash the DNSKEY and compare to the DS. A mismatch means your registrar has a stale DS record. The fix depends on your signing setup: if you're using inline signing in BIND, run rndc sign solvethenetwork.com to force a re-sign. If keys have rolled, update the DS record at your registrar. As an emergency measure while you resolve the signing issue, you can temporarily disable DNSSEC validation — but treat it as a stopgap, not a solution:

    // named.conf — emergency only, revert after fixing signing
    options {
        dnssec-validation no;
    };

    Root Cause 3: Upstream Authoritative Unreachable

    When your recursive resolver can't reach the authoritative name servers for a zone — because they're down, firewalled, or the glue records point to decommissioned IPs — it can't get an answer and returns SERVFAIL. This is distinct from the slow-authoritative case: here the connection is refused outright or the packets never reach a live server, rather than a live server answering too slowly.

    This happens after botched migrations where old nameservers were decommissioned before the TTL expired on NS records, or when a firewall rule accidentally blocks outbound UDP/TCP port 53 from your resolver host.

    Walk the delegation manually to verify each hop:

    $ dig NS solvethenetwork.com @a.root-servers.net +norecurse
    ;; AUTHORITY SECTION:
    solvethenetwork.com.  172800  IN  NS  ns1.solvethenetwork.com.
    solvethenetwork.com.  172800  IN  NS  ns2.solvethenetwork.com.
    
    # Check glue records
    $ dig A ns1.solvethenetwork.com @a.root-servers.net
    ;; ADDITIONAL SECTION:
    ns1.solvethenetwork.com.  3600  IN  A  203.0.113.5
    
    # Now try to reach that authoritative directly
    $ dig solvethenetwork.com @203.0.113.5 +time=3 +tries=1
    ;; connection timed out; no servers could be reached

    Confirm from the resolver host itself:

    sw-infrarunbook-01:~# nc -zvu 203.0.113.5 53
    nc: connect to 203.0.113.5 port 53 (udp) failed: Connection refused
    
    sw-infrarunbook-01:~# traceroute -n 203.0.113.5
    traceroute to 203.0.113.5, 30 hops max
     1  10.0.0.1     0.8ms
     2  * * *
     3  * * *

    Check for iptables rules blocking outbound port 53 from the resolver:

    sw-infrarunbook-01:~# iptables -L OUTPUT -nv | grep ":53"
        0     0 DROP       udp  --  *      *       0.0.0.0/0  0.0.0.0/0  udp dpt:53

    There's your block. Remove that rule and make the change persistent through your firewall management tool. If the authoritative server is genuinely down and you control it, bring it back. If it's a third-party provider, update your NS records to point to working servers and ensure the glue is correct at the registrar.


    Root Cause 4: Timeout from Authoritative

    Different from unreachable — the server is there, packets are routing, but it's not responding within your resolver's query timeout window. BIND has a default timeout for upstream responses, and if the authoritative is overloaded or experiencing latency spikes, your resolver gives up and SERVFAILs the client.

    I've seen this with authoritative servers that are under DDoS, with anycast DNS providers where the nearest node is temporarily degraded, and with under-resourced authoritative servers that queue up during traffic bursts. The symptom is intermittent SERVFAIL — queries sometimes succeed and sometimes don't — rather than consistent failure.

    Enable BIND's query logging and watch for timeout messages:

    $ rndc querylog on
    
    $ tail -f /var/log/named/queries.log | grep -E 'timed out|SERVFAIL'
    12-Apr-2026 09:22:14.811 resolver: fetch: solvethenetwork.com/A: timed out
    12-Apr-2026 09:22:14.812 client @0x7f3a... 192.168.10.45#51234: query failed (SERVFAIL)

    Confirm the latency by querying the authoritative directly with a generous timeout:

    $ dig solvethenetwork.com @203.0.113.5 +stats +time=10
    ;; Query time: 4971 msec
    ;; SERVER: 203.0.113.5#53(203.0.113.5)
    ;; WHEN: Sat Apr 12 09:22:19 UTC 2026
    
    # Compare against a healthy secondary:
    $ dig solvethenetwork.com @203.0.113.6 +stats +time=10
    ;; Query time: 22 msec
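    You can turn that comparison into a sweep over every listed authoritative. An offline sketch using the two timings captured above — in a real run the server/milliseconds pairs would be parsed from the Query time line of each dig +stats invocation, and the 1000 ms threshold is my own choice:

```shell
# slow_servers: flag any authoritative whose measured query time exceeds
# a threshold; input is 'server milliseconds' pairs, one per line
slow_servers() {
  threshold=1000
  while read -r server ms; do
    if [ "$ms" -gt "$threshold" ]; then
      echo "SLOW $server ${ms}ms"
    fi
  done
}

# sample measurements from the two queries above
slow_servers <<'EOF'
203.0.113.5 4971
203.0.113.6 22
EOF
```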

    Nearly 5 seconds from one authoritative, 22 ms from the other. BIND's default resolver query timeout is 10 seconds, but that budget covers multiple retries — each individual query attempt times out sooner. The short-term fix is to tune BIND's resolver timeout in named.conf:

    options {
        resolver-query-timeout 15000;  // milliseconds, default is 10000
    };

    The real fix is addressing the authoritative side: reduce load on the slow server, add capacity, or ensure your NS records include multiple geographically distributed servers so a slow node doesn't dominate. BIND tries authoritative servers based on measured RTT history, so a consistently slow server will eventually be deprioritized — but not until it's caused a run of SERVFAILs first.


    Root Cause 5: Forwarder Down

    Many corporate resolver setups use forwarders — your internal BIND instance handles authoritative answers for internal zones but forwards external queries to a central resolver (maybe 192.168.1.100 or a dedicated DNS appliance). When that forwarder is down or unreachable, every forwarded query returns SERVFAIL.

    This is particularly painful because it looks like "the internet is broken" to end users, while your internal zones resolve perfectly. The resolver is healthy. The zone is fine. It's just the forwarding chain that's broken.

    Check named.conf for the forwarder configuration:

    options {
        forwarders {
            192.168.1.100;
            192.168.1.101;
        };
        forward only;
    };

    That forward only; directive is the critical detail. It means if every configured forwarder is unreachable, BIND won't fall back to iterative resolution — it simply fails. Test each forwarder directly:

    sw-infrarunbook-01:~# dig google.com @192.168.1.100 +time=3 +tries=1
    ;; connection timed out; no servers could be reached
    
    sw-infrarunbook-01:~# dig google.com @192.168.1.101 +time=3 +tries=1
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9988
    ;; ANSWER SECTION:
    google.com.  299  IN  A  142.250.80.46

    First forwarder is dead, second is alive. But BIND tries forwarders in order and will time out on the dead one before successfully reaching the live one. Depending on client timeout settings, this causes either slow responses or outright SERVFAILs. Remove the dead forwarder from named.conf and reload:

    sw-infrarunbook-01:~# rndc reconfig
    server reload successful

    Structurally, if you have flexibility in your forwarding policy, change forward only to forward first. This makes BIND attempt forwarders but fall back to iterative resolution if they all fail:

    options {
        forwarders {
            192.168.1.100;
            192.168.1.101;
        };
        forward first;  // falls back to iterative if forwarders fail
    };

    Whether that's appropriate depends on your security posture — some environments use forward only deliberately to ensure all external DNS traffic flows through a controlled, logging forwarder. In that case, forwarder availability is critical infrastructure and needs monitoring to match.


    Root Cause 6: Zone Data Misconfiguration

    Sometimes the authoritative server is up, reachable, and otherwise healthy — but the zone file itself is broken. A syntax error, a missing SOA record, an invalid RDATA value, or the classic missing trailing dot on an NS or MX record will cause BIND to refuse to load the zone entirely. Queries for names in that zone then return SERVFAIL from any resolver that follows the delegation to it.

    BIND logs a clear error when a zone fails to load, but it's easy to miss if you're not watching:

    sw-infrarunbook-01:~# journalctl -u named | grep -E 'failed|error' | tail -20
    Apr 12 08:30:01 sw-infrarunbook-01 named[3421]: dns_rdata_fromtext: /etc/bind/zones/solvethenetwork.com.zone:67: near 'mailsolvethenetwork.com.': bad name (missing dot ?)
    Apr 12 08:30:01 sw-infrarunbook-01 named[3421]: zone solvethenetwork.com/IN: loading from master file failed: bad owner name (check-names)
    Apr 12 08:30:01 sw-infrarunbook-01 named[3421]: zone solvethenetwork.com/IN: not loaded due to errors.

    Always validate with named-checkzone before applying changes:

    # Clean zone:
    sw-infrarunbook-01:~# named-checkzone solvethenetwork.com /etc/bind/zones/solvethenetwork.com.zone
    zone solvethenetwork.com/IN: loaded serial 2026041201
    OK
    
    # Broken zone (missing trailing dot on MX record):
    sw-infrarunbook-01:~# named-checkzone solvethenetwork.com /etc/bind/zones/solvethenetwork.com.zone
    /etc/bind/zones/solvethenetwork.com.zone:67: near 'mail.solvethenetwork.com': not a valid name
    zone solvethenetwork.com/IN: loading from master file /etc/bind/zones/solvethenetwork.com.zone failed: not a valid name
    zone solvethenetwork.com/IN: not loaded due to errors.

    Fix the zone file (add the trailing dot: mail.solvethenetwork.com.), verify with named-checkzone, then reload just that zone:

    sw-infrarunbook-01:~# rndc reload solvethenetwork.com
    zone reload up-to-date
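    Since the missing trailing dot is such a recurring offender, a crude lint can catch it before named-checkzone even runs. A sketch — it assumes the common owner/TTL/class/type record layout and only checks MX/NS/CNAME targets, so treat it as a heuristic, not a validator:

```shell
# check_dots: flag MX/NS/CNAME targets that lack a trailing dot; a relative
# name silently gets the zone origin appended, which is rarely what you meant
check_dots() {
  awk '$4 == "MX"                  { t = $6 }
       $4 == "NS" || $4 == "CNAME" { t = $5 }
       t != "" && t !~ /\.$/       { print "missing dot: " t }
                                   { t = "" }'
}

check_dots <<'EOF'
@    300  IN  MX     10  mail.solvethenetwork.com
www  300  IN  CNAME  web.solvethenetwork.com.
EOF
```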

    Root Cause 7: Lame Delegation

    A lame delegation occurs when the parent zone delegates to a nameserver that isn't actually authoritative for the zone. The resolver follows the delegation, queries the listed NS, and gets back either a REFUSED response or a non-authoritative answer it can't use — so it SERVFAILs the client.

    This happens after NS record migrations where the new server hasn't been configured with the zone yet, or when a secondary nameserver's zone transfer silently fails and it stops serving the zone. Check whether each listed NS is actually serving the zone authoritatively — you want to see the aa (authoritative answer) flag in the response:

    # Query NS directly without recursion
    $ dig solvethenetwork.com @ns2.solvethenetwork.com +norec
    ;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 55123
    ;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
    
    # Note: no 'aa' flag, REFUSED — this is a lame delegation
    
    # A properly authoritative response looks like this:
    ;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 3
    ;; ANSWER SECTION:
    solvethenetwork.com.  300  IN  A  203.0.113.10
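    Checking the aa bit is also scriptable if you're sweeping a list of nameservers. A sketch — the function name is mine, and in a live check the flags line would be grepped out of a dig +norec run against each listed NS:

```shell
# check_aa: does a dig ';; flags:' line carry the 'aa' bit?
check_aa() {
  case "$1" in
    *" aa "*|*" aa;"*) echo "authoritative" ;;
    *)                 echo "LAME" ;;
  esac
}

check_aa ';; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1'
check_aa ';; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 3'
```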

    The aa flag must be present for a valid authoritative response. To fix a lame delegation, either configure the zone on the listed nameserver or update the NS records to remove the lame server. If it's a secondary that lost its zone, check whether zone transfers are completing:

    sw-infrarunbook-01:~# rndc retransfer solvethenetwork.com
    
    # Then check:
    sw-infrarunbook-01:~# rndc zonestatus solvethenetwork.com
    name: solvethenetwork.com
    type: secondary
    files: /var/cache/bind/solvethenetwork.com.jnl
    serial: 2026041201
    nodes: 24
    last loaded: Sat, 12 Apr 2026 08:00:00 GMT
    secure: yes
    status: transfer in progress

    Prevention

    Prevention is about building observability and resilience into your DNS infrastructure before the 3 AM page hits. The most impactful things I've implemented over the years:

    Monitor SERVFAIL rates, not just availability. A resolver that's up but returning 5% SERVFAILs looks perfectly healthy to a ping check. Enable BIND's statistics channel and scrape it with a metrics collector:

    options {
        statistics-channels {
            inet 127.0.0.1 port 8080 allow { 127.0.0.1; 192.168.1.0/24; };
        };
    };

    Use Prometheus's bind_exporter to collect these metrics and alert when the SERVFAIL rate climbs above a baseline. Even a 1% SERVFAIL rate on a busy resolver is thousands of failed queries per minute — worth investigating before users start noticing.
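    Rates come from counter deltas between scrapes, not raw totals. A sketch of the arithmetic — the function name and the sample counter values are illustrative, not real exporter output:

```shell
# servfail_rate: integer percentage of queries that SERVFAILed between two scrapes
servfail_rate() {
  # $1/$2 = SERVFAIL counter before/after, $3/$4 = total query counter before/after
  echo $(( ($2 - $1) * 100 / ($4 - $3) ))
}

# sample counters taken one scrape interval apart:
# SERVFAILs 120 -> 184, total queries 10000 -> 11500
servfail_rate 120 184 10000 11500
```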

    Run named-checkzone in your deployment pipeline. Before any zone file change reaches a live server, validate it automatically. This single gate eliminates the entire class of zone data misconfiguration SERVFAILs. If your team uses Git for zone management (and they should), add a pre-commit hook or CI check that runs named-checkzone on every changed zone file.

    Test DNSSEC validation separately from resolution. Set up a monitoring check that runs both dig solvethenetwork.com +dnssec and dig solvethenetwork.com +cd every five minutes, and alerts if the validating query fails while the non-validating one succeeds. That's your DNSSEC canary. Also monitor RRSIG expiry dates directly — most inline-signing setups auto-resign, but verify the signatures aren't silently lapsing by checking the expiry timestamp periodically.

    Never use forward only without monitoring your forwarders. If your environment requires forced forwarding for security reasons, treat forwarder availability as a critical infrastructure metric. Monitor each forwarder with a synthetic DNS check, not just a ping. A forwarder that's reachable but returning SERVFAILs of its own is just as bad as one that's down.

    Always run at least two authoritative nameservers per zone. A single authoritative nameserver is a single point of failure for that zone. Two NS records pointing to geographically separated servers dramatically reduces the blast radius of hardware failures and upstream connectivity issues. This is DNS fundamentals, but I still see single-NS zones in production more often than I'd like.

    Build a delegation map for zones you own. Keep a runbook entry for each zone that lists the authoritative servers, their IPs, the registrar where NS and DS records are managed, and the signing configuration. When SERVFAIL hits during an incident, you don't want to spend the first 20 minutes figuring out where the zone even lives or who has access to update the DS record.

    SERVFAIL is almost always diagnosable in under ten minutes with the right methodology. Is the failure universal or zone-specific? Does +cd resolve it? Can I reach the authoritative directly? Is the zone actually loaded and serving the aa flag? Is my forwarder alive? Work through those questions systematically, one hop at a time from client to root, and you'll find the cause. The real skill isn't memorizing every possible failure mode upfront — it's building the muscle memory to follow the resolution chain and let the evidence tell you where it broke.


    Frequently Asked Questions

    What is the difference between SERVFAIL and NXDOMAIN in DNS?

    NXDOMAIN (RCODE 3) means the queried name definitively does not exist in the zone. SERVFAIL (RCODE 2) means the resolver encountered an error during resolution — the name may well exist, but something in the resolution chain failed. SERVFAIL requires investigating the resolution chain; NXDOMAIN means you should check whether the record was ever created or was accidentally deleted.

    How can I tell if a SERVFAIL is caused by DNSSEC validation?

    Run the same query twice: once normally and once with the +cd (checking disabled) flag. If the query with +cd returns NOERROR and the query without it returns SERVFAIL, DNSSEC validation is the cause. You can also look for BIND log entries referencing 'validation' or 'RRSIG' errors in the query log.

    Why does 'forward only' in BIND cause more severe outages than 'forward first'?

    With 'forward only', BIND will not fall back to iterative resolution if all configured forwarders are unreachable or unresponsive — it returns SERVFAIL immediately. With 'forward first', BIND attempts the forwarders first but falls back to performing its own iterative resolution if they fail. 'forward only' is appropriate when you need to guarantee all external queries flow through a specific resolver for logging or filtering, but it requires robust forwarder monitoring.

    What does a lame delegation look like in a dig response?

    When you query a lame nameserver directly with +norec (no recursion), you'll get either a REFUSED status code or a response that lacks the 'aa' (authoritative answer) flag. A legitimate authoritative server will always return the aa flag. Missing aa combined with REFUSED or an empty answer section is the telltale sign of a lame delegation.

    How do I check if BIND is running out of recursive client slots?

    Run 'rndc status' and look at the 'recursive clients' line. The format is current/softlimit/hardlimit. If the current value is approaching the hard limit, BIND is dropping queries and returning SERVFAIL. Increase the limit in named.conf with 'recursive-clients 2000;' under the options block, and also verify system file descriptor limits aren't the underlying bottleneck.
