Symptoms
You're staring at a SERVFAIL and the clock is ticking. Maybe a monitoring alert fired, maybe a developer is messaging you because their app can't resolve anything, or maybe you ran a quick dig and got back a response code you didn't expect. DNS SERVFAIL (RCODE 2) means the server encountered a failure it couldn't recover from. Unlike NXDOMAIN, which tells you the name doesn't exist, SERVFAIL tells you something broke in the resolution process itself. That distinction matters enormously when you're trying to debug.
The symptoms usually look like one of these:
$ dig solvethenetwork.com @192.168.1.53
; <<>> DiG 9.18.24 <<>> solvethenetwork.com @192.168.1.53
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 48291
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
Or in BIND's query log:
12-Apr-2026 08:14:32.412 client @0x7f3a8c003a20 192.168.10.45#51234 (solvethenetwork.com): query failed (SERVFAIL) for solvethenetwork.com/IN/A at query.c:9316
Or surfacing in application logs as:
SERVFAIL resolving 'mail.solvethenetwork.com/MX/IN': 192.168.1.53#53
End users see "DNS_PROBE_FINISHED_SERVFAIL" in Chrome or "Server not found" in Firefox. The key thing to internalize: SERVFAIL is your resolver saying it tried and failed — not that the name doesn't exist. The cause could be anywhere in the resolution chain, from a misconfigured local resolver all the way out to a broken authoritative server on the far side of the internet. Your job is to work through that chain systematically.
Root Cause 1: Recursive Resolver Failing
This is always the first thing I check. A recursive resolver that's overloaded, misconfigured, or has exhausted a critical resource will throw SERVFAILs for everything — not just specific zones. If you're seeing blanket SERVFAIL responses across unrelated domains, the resolver itself is the problem, not any particular zone.
Why does this happen? Common triggers include running out of file descriptors (the resolver can't open new UDP or TCP sockets to contact upstream servers), hitting the recursive-clients limit in BIND, or a server configured for authoritative-only work that is still receiving recursive queries from clients.
Start by checking whether the failure is universal or isolated to specific domains:
$ dig google.com @192.168.1.53 +short
;; connection timed out; no servers could be reached
$ dig cloudflare.com @192.168.1.53
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 19234
$ dig internal.solvethenetwork.com @192.168.1.53 +short
10.10.1.50
Internal zones resolving while external domains fail points directly at the recursion stack. Check BIND's current status and recursive client usage:
$ rndc status
version: BIND 9.18.24 (Stable Release)
running on sw-infrarunbook-01: Linux x86_64 6.1.0
...
recursive clients: 998/900/1000
tcp clients: 287/300
That format is current/softlimit/hardlimit. If the first number is near the hard limit, you're dropping queries. Also check system-level file descriptors:
sw-infrarunbook-01:~# cat /proc/$(pgrep named)/limits | grep "open files"
Max open files 4096 4096 files
sw-infrarunbook-01:~# ls /proc/$(pgrep named)/fd | wc -l
4087
That's nearly exhausted. Raise the limits in /etc/security/limits.conf and BIND's startup environment file, then restart:
# /etc/default/named (Debian/Ubuntu) or /etc/sysconfig/named (RHEL)
OPTIONS="-u bind -n 4"
# /etc/security/limits.conf (note: systemd-managed services ignore this file;
# under systemd, set LimitNOFILE=65535 in a drop-in for the named unit instead)
bind soft nofile 65535
bind hard nofile 65535
In named.conf, also tune the client limits:
options {
    recursive-clients 2000;
    tcp-clients 500;
};
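The rndc status check above lends itself to automation. A minimal sketch, assuming the current/soft limit/hard limit format shown earlier (the 90% warning threshold is an arbitrary choice, not a BIND default):

```python
import re

def recursive_client_pressure(status_text):
    """Parse the 'recursive clients: current/soft/hard' line from
    `rndc status` output and return (current, hard_limit, ratio)."""
    m = re.search(r"recursive clients:\s*(\d+)/(\d+)/(\d+)", status_text)
    if m is None:
        return None
    current, _soft, hard = (int(g) for g in m.groups())
    return current, hard, current / hard

# In production you'd feed this the captured output of `rndc status`.
status = "recursive clients: 998/900/1000\ntcp clients: 287/300\n"
current, hard, ratio = recursive_client_pressure(status)
if ratio > 0.9:  # arbitrary warning threshold
    print(f"WARNING: {current}/{hard} recursive clients in use")
```

Wire this into whatever check framework you already run; the point is alerting before the current count touches the hard limit, not after.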
Root Cause 2: DNSSEC Validation Failure
In my experience, DNSSEC is responsible for more unexpected SERVFAILs than almost anything else — and it's especially insidious because the zone might resolve perfectly from a non-validating resolver, making it look like a resolver problem when it's actually a signing problem.
When a DNSSEC-validating resolver (BIND with dnssec-validation auto;, or Unbound with validation enabled) can't verify the chain of trust for a response, it returns SERVFAIL rather than an unvalidated record. This happens when a zone's RRSIG has expired, when the DS record in the parent zone doesn't match the current DNSKEY, or when a key rollover occurred but the new DS record hasn't been published at the registrar yet.
The diagnostic trick here is comparing dig +dnssec against dig +cd (checking disabled). If the query succeeds with +cd but fails without it, DNSSEC validation is your culprit:
# This FAILS — validation is running and rejecting the response
$ dig solvethenetwork.com @192.168.1.53 +dnssec
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 33812
# This SUCCEEDS — validation bypassed with +cd
$ dig solvethenetwork.com @192.168.1.53 +cd
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44201
;; ANSWER SECTION:
solvethenetwork.com. 300 IN A 203.0.113.10
Now check whether the RRSIG is still valid:
$ dig solvethenetwork.com @8.8.8.8 +dnssec +multiline
;; ANSWER SECTION:
solvethenetwork.com. 300 IN A 203.0.113.10
solvethenetwork.com. 300 IN RRSIG A 8 2 300 (
20260101000000 20251202000000 12345 solvethenetwork.com.
Km8fT2nYq... )
The first timestamp is the signature expiration, the second the inception. If the expiration is in the past, zone signing has lapsed. Compare the DS record in the parent against your current DNSKEY:
$ dig DS solvethenetwork.com @a.gtld-servers.net +norecurse
;; ANSWER SECTION:
solvethenetwork.com. 3600 IN DS 12345 8 2 A1B2C3D4E5F6...
$ dig DNSKEY solvethenetwork.com @192.168.2.10
;; ANSWER SECTION:
solvethenetwork.com. 3600 IN DNSKEY 257 3 8 AwEAAb9xZ...
Hash the DNSKEY and compare it to the DS. A mismatch means your registrar has a stale DS record. The fix depends on your signing setup: if you're using inline signing in BIND, run rndc sign solvethenetwork.com to force a re-sign. If keys have rolled, update the DS record at your registrar. As an emergency measure while you resolve the signing issue, you can temporarily disable DNSSEC validation, but treat it as a stopgap, not a solution:
// named.conf: emergency only, revert after fixing signing
options {
    dnssec-validation no;
};
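The RRSIG timestamps in dig output are YYYYMMDDHHMMSS in UTC, so checking for lapsed signatures takes a few lines. A sketch using the example record values from this section (ignoring RFC 4034 serial-number wraparound for simplicity):

```python
from datetime import datetime, timezone

def rrsig_is_valid(expiration, inception, now=None):
    """Return True if `now` falls inside the RRSIG validity window.
    Timestamps use dig's presentation format YYYYMMDDHHMMSS (UTC)."""
    fmt = "%Y%m%d%H%M%S"
    exp = datetime.strptime(expiration, fmt).replace(tzinfo=timezone.utc)
    inc = datetime.strptime(inception, fmt).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return inc <= now <= exp

# The RRSIG shown above: expires 2026-01-01, incepted 2025-12-02.
print(rrsig_is_valid("20260101000000", "20251202000000",
                     now=datetime(2025, 12, 15, tzinfo=timezone.utc)))  # True
```

Run a check like this against every signed zone you own and alert a few days before expiration, not at it.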
Root Cause 3: Upstream Authoritative Unreachable
When your recursive resolver can't reach the authoritative name servers for a zone — because they're down, firewalled, or the glue records point to decommissioned IPs — it can't get an answer and returns SERVFAIL. This is different from a timeout in that the connection is outright refused or the servers simply don't respond at all.
This happens after botched migrations where old nameservers were decommissioned before the TTL expired on NS records, or when a firewall rule accidentally blocks outbound UDP/TCP port 53 from your resolver host.
Walk the delegation manually to verify each hop:
$ dig NS solvethenetwork.com @a.gtld-servers.net +norecurse
;; AUTHORITY SECTION:
solvethenetwork.com. 172800 IN NS ns1.solvethenetwork.com.
solvethenetwork.com. 172800 IN NS ns2.solvethenetwork.com.
# Check glue records
$ dig A ns1.solvethenetwork.com @a.gtld-servers.net +norecurse
;; ADDITIONAL SECTION:
ns1.solvethenetwork.com. 3600 IN A 203.0.113.5
# Now try to reach that authoritative directly
$ dig solvethenetwork.com @203.0.113.5 +time=3 +tries=1
;; connection timed out; no servers could be reached
Confirm from the resolver host itself:
sw-infrarunbook-01:~# nc -zvu 203.0.113.5 53
nc: connect to 203.0.113.5 port 53 (udp) failed: Connection refused
sw-infrarunbook-01:~# traceroute -n 203.0.113.5
traceroute to 203.0.113.5, 30 hops max
1 10.0.0.1 0.8ms
2 * * *
3 * * *
Check for iptables rules blocking outbound port 53 from the resolver:
sw-infrarunbook-01:~# iptables -L OUTPUT -nv | grep ":53"
0 0 DROP udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:53
There's your block. Remove that rule and make the change persistent through your firewall management tool. If the authoritative server is genuinely down and you control it, bring it back. If it's a third-party provider, update your NS records to point to working servers and ensure the glue is correct at the registrar.
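Reachability checks like the nc probe above can be made more faithful by sending a real DNS query and seeing whether anything comes back. A minimal stdlib sketch (the server IP and zone name below are the example values from this section):

```python
import socket
import struct

def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (QTYPE 1 = A, class IN)."""
    header = struct.pack(">HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode()
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, 1)

def probe(server, name, timeout=3.0):
    """Return the RCODE of the response, or None if nothing usable came back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_query(name), (server, 53))
        data, _ = sock.recvfrom(4096)
        return data[3] & 0x0F  # low 4 bits of header byte 3 = RCODE
    except (socket.timeout, ConnectionRefusedError):
        return None  # timed out, or ICMP port unreachable
    finally:
        sock.close()

# probe("203.0.113.5", "solvethenetwork.com")  # None means no answer at all
```

A None here from the resolver host, combined with a working response from elsewhere, points at a network path or firewall problem rather than the nameserver itself.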
Root Cause 4: Timeout from Authoritative
Different from unreachable — the server is there, packets are routing, but it's not responding within your resolver's query timeout window. BIND has a default timeout for upstream responses, and if the authoritative is overloaded or experiencing latency spikes, your resolver gives up and SERVFAILs the client.
I've seen this with authoritative servers that are under DDoS, with anycast DNS providers where the nearest node is temporarily degraded, and with under-resourced authoritative servers that queue up during traffic bursts. The symptom is intermittent SERVFAIL — queries sometimes succeed and sometimes don't — rather than consistent failure.
Enable BIND's query logging and watch for timeout messages:
$ rndc querylog on
$ tail -f /var/log/named/queries.log | grep -E 'timed out|SERVFAIL'
12-Apr-2026 09:22:14.811 resolver: fetch: solvethenetwork.com/A: timed out
12-Apr-2026 09:22:14.812 client @0x7f3a... 192.168.10.45#51234: query failed (SERVFAIL)
Confirm the latency by querying the authoritative directly with a generous timeout:
$ dig solvethenetwork.com @203.0.113.5 +stats +time=10
;; Query time: 4971 msec
;; SERVER: 203.0.113.5#53(203.0.113.5)
;; WHEN: Sat Apr 12 09:22:19 UTC 2026
# Compare against a healthy secondary:
$ dig solvethenetwork.com @203.0.113.6 +stats +time=10
;; Query time: 22 msec
Nearly 5 seconds from one authoritative, 22 ms from the other. BIND's default resolver query timeout is 10 seconds, but that budget covers multiple retries; an individual query attempt gets less. The short-term fix is to tune BIND's resolver timeout in named.conf:
options {
    resolver-query-timeout 15000; // milliseconds, default is 10000
};
The real fix is addressing the authoritative side: reduce load on the slow server, add capacity, or ensure your NS records include multiple geographically distributed servers so a slow node doesn't dominate. BIND tries authoritative servers based on measured RTT history, so a consistently slow server will eventually be deprioritized — but not until it's caused a run of SERVFAILs first.
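Comparing authoritatives like this is easy to script: scrape the Query time line out of dig output for each server and flag anything near the resolver's per-attempt budget. A sketch (the 800 ms threshold is an arbitrary choice, and in practice you'd capture the dig output via subprocess rather than hardcode it):

```python
import re

def query_time_ms(dig_output):
    """Extract the ';; Query time: N msec' value from dig output."""
    m = re.search(r";; Query time: (\d+) msec", dig_output)
    return int(m.group(1)) if m else None

def slow_servers(results, threshold_ms=800):
    """`results` maps server IP -> dig output; return the slow ones."""
    return [ip for ip, out in results.items()
            if (t := query_time_ms(out)) is not None and t > threshold_ms]

results = {
    "203.0.113.5": ";; Query time: 4971 msec\n",
    "203.0.113.6": ";; Query time: 22 msec\n",
}
print(slow_servers(results))  # ['203.0.113.5']
```

Run it periodically against every NS in the delegation and you'll spot a degrading node before it starts causing SERVFAILs.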
Root Cause 5: Forwarder Down
Many corporate resolver setups use forwarders — your internal BIND instance handles authoritative answers for internal zones but forwards external queries to a central resolver (maybe 192.168.1.100 or a dedicated DNS appliance). When that forwarder is down or unreachable, every forwarded query returns SERVFAIL.
This is particularly painful because it looks like "the internet is broken" to end users, while your internal zones resolve perfectly. The resolver is healthy. The zone is fine. It's just the forwarding chain that's broken.
Check named.conf for the forwarder configuration:
options {
    forwarders {
        192.168.1.100;
        192.168.1.101;
    };
    forward only;
};
That forward only; directive is the critical detail. It means that if every configured forwarder is unreachable, BIND won't fall back to iterative resolution; it simply fails. Test each forwarder directly:
sw-infrarunbook-01:~# dig google.com @192.168.1.100 +time=3 +tries=1
;; connection timed out; no servers could be reached
sw-infrarunbook-01:~# dig google.com @192.168.1.101 +time=3 +tries=1
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9988
;; ANSWER SECTION:
google.com. 299 IN A 142.250.80.46
First forwarder is dead, second is alive. But BIND tries forwarders in order and will time out on the dead one before successfully reaching the live one. Depending on client timeout settings, this causes either slow responses or outright SERVFAILs. Remove the dead forwarder from named.conf and reload:
sw-infrarunbook-01:~# rndc reload
server reload successful
Structurally, if you have flexibility in your forwarding policy, change forward only to forward first. This makes BIND attempt forwarders but fall back to iterative resolution if they all fail:
options {
    forwarders {
        192.168.1.100;
        192.168.1.101;
    };
    forward first; // falls back to iterative if forwarders fail
};
Whether that's appropriate depends on your security posture; some environments use forward only deliberately to ensure all external DNS traffic flows through a controlled, logging forwarder. In that case, forwarder availability is critical infrastructure and needs monitoring to match.
Root Cause 6: Zone Data Misconfiguration
Sometimes the authoritative server is up, reachable, and otherwise healthy — but the zone file itself is broken. A syntax error, a missing SOA record, an invalid RDATA value, or the classic missing trailing dot on an NS or MX record will cause BIND to refuse to load the zone entirely. Queries for names in that zone then return SERVFAIL from any resolver that follows the delegation to it.
BIND logs a clear error when a zone fails to load, but it's easy to miss if you're not watching:
sw-infrarunbook-01:~# journalctl -u named | grep -E 'failed|error' | tail -20
Apr 12 08:30:01 sw-infrarunbook-01 named[3421]: dns_rdata_fromtext: /etc/bind/zones/solvethenetwork.com.zone:67: near 'mailsolvethenetwork.com.': bad name (missing dot ?)
Apr 12 08:30:01 sw-infrarunbook-01 named[3421]: zone solvethenetwork.com/IN: loading from master file failed: bad owner name (check-names)
Apr 12 08:30:01 sw-infrarunbook-01 named[3421]: zone solvethenetwork.com/IN: not loaded due to errors.
Always validate with named-checkzone before applying changes:
# Clean zone:
sw-infrarunbook-01:~# named-checkzone solvethenetwork.com /etc/bind/zones/solvethenetwork.com.zone
zone solvethenetwork.com/IN: loaded serial 2026041201
OK
# Broken zone (missing trailing dot on MX record):
sw-infrarunbook-01:~# named-checkzone solvethenetwork.com /etc/bind/zones/solvethenetwork.com.zone
/etc/bind/zones/solvethenetwork.com.zone:67: near 'mail.solvethenetwork.com': not a valid name
zone solvethenetwork.com/IN: loading from master file /etc/bind/zones/solvethenetwork.com.zone failed: not a valid name
zone solvethenetwork.com/IN: not loaded due to errors.
Fix the zone file (add the trailing dot: mail.solvethenetwork.com.), verify with named-checkzone, then reload just that zone:
sw-infrarunbook-01:~# rndc reload solvethenetwork.com
zone reload up-to-date
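The same check works as a pre-deployment gate. A sketch that shells out to named-checkzone and classifies the result (the function names are ours; in CI you'd iterate over every changed zone file):

```python
import subprocess

def checkzone_passed(output):
    """Classify named-checkzone output: True only on a clean load."""
    return "not loaded due to errors" not in output and "OK" in output

def zone_is_valid(zone, path):
    """Run named-checkzone (must be on PATH) and gate on its result."""
    proc = subprocess.run(["named-checkzone", zone, path],
                          capture_output=True, text=True)
    # named-checkzone exits non-zero when the zone fails to parse,
    # so the return code alone is usually enough; the text check is belt
    # and suspenders.
    return proc.returncode == 0 and checkzone_passed(proc.stdout)
```

Fail the pipeline whenever zone_is_valid returns False and the broken-zone class of SERVFAILs never reaches a live server.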
Root Cause 7: Lame Delegation
A lame delegation occurs when the parent zone delegates to a nameserver that isn't actually authoritative for the zone. The resolver follows the delegation, queries the listed NS, and gets back either a REFUSED response or a non-authoritative answer it can't use — so it SERVFAILs the client.
This happens after NS record migrations where the new server hasn't been configured with the zone yet, or when a secondary nameserver's zone transfer silently fails and it stops serving the zone. Check whether each listed NS is actually serving the zone authoritatively; you want to see the aa (authoritative answer) flag in the response:
# Query NS directly without recursion
$ dig solvethenetwork.com @ns2.solvethenetwork.com +norec
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 55123
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
# Note: no 'aa' flag, REFUSED — this is a lame delegation
# A properly authoritative response looks like this:
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 3
;; ANSWER SECTION:
solvethenetwork.com. 300 IN A 203.0.113.10
The aa flag must be present for a valid authoritative response. To fix a lame delegation, either configure the zone on the listed nameserver or update the NS records to remove the lame server. If it's a secondary that lost its zone, check whether zone transfers are completing:
sw-infrarunbook-01:~# rndc retransfer solvethenetwork.com
# Then check:
sw-infrarunbook-01:~# rndc zonestatus solvethenetwork.com
name: solvethenetwork.com
type: secondary
files: /var/cache/bind/solvethenetwork.com.jnl
serial: 2026041201
nodes: 24
last loaded: Sat, 12 Apr 2026 08:00:00 GMT
secure: yes
status: transfer in progress
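The aa bit lives in the flags field of the DNS header, so a lame-delegation check only needs to parse the first bytes of a raw response. A sketch of the bit test, run here on a crafted header (in practice you'd feed it the bytes returned by the nameserver):

```python
import struct

def response_flags(packet):
    """Decode the header flags of a raw DNS response."""
    flags, = struct.unpack_from(">H", packet, 2)
    return {
        "qr": bool(flags & 0x8000),    # this is a response
        "aa": bool(flags & 0x0400),    # authoritative answer
        "rcode": flags & 0x000F,       # 0 NOERROR, 5 REFUSED, ...
    }

def is_lame(packet):
    """Lame delegation: the server answered, but not authoritatively,
    or refused outright (RCODE 5)."""
    f = response_flags(packet)
    return not f["aa"] or f["rcode"] == 5

# Crafted response header: QR set, AA clear, RCODE 5 (REFUSED)
refused = struct.pack(">HHHHHH", 0xD743, 0x8005, 1, 0, 0, 1)
print(is_lame(refused))  # True
```

Loop this over every NS listed in the delegation and you have an automated lame-delegation monitor for zones you own.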
Prevention
Prevention is about building observability and resilience into your DNS infrastructure before the 3 AM page hits. The most impactful things I've implemented over the years:
Monitor SERVFAIL rates, not just availability. A resolver that's up but returning 5% SERVFAILs looks perfectly healthy to a ping check. Enable BIND's statistics channel and scrape it with a metrics collector:
// statistics-channels is a top-level statement, not inside options
statistics-channels {
    inet 127.0.0.1 port 8080 allow { 127.0.0.1; 192.168.1.0/24; };
};
Use the Prometheus bind_exporter to collect these metrics and alert when the SERVFAIL rate climbs above a baseline. Even a 1% SERVFAIL rate on a busy resolver is thousands of failed queries per minute, worth investigating before users start noticing.
Run named-checkzone in your deployment pipeline. Before any zone file change reaches a live server, validate it automatically. This single gate eliminates the entire class of zone data misconfiguration SERVFAILs. If your team uses Git for zone management (and they should), add a pre-commit hook or CI check that runs named-checkzone on every changed zone file.
Test DNSSEC validation separately from resolution. Set up a monitoring check that runs both dig solvethenetwork.com +dnssec and dig solvethenetwork.com +cd every five minutes, and alerts if the validating query fails while the non-validating one succeeds. That's your DNSSEC canary. Also monitor RRSIG expiry dates directly: most inline-signing setups auto-resign, but verify the signatures aren't silently lapsing by checking the expiry timestamp periodically.
Never use forward only without monitoring your forwarders. If your environment requires forced forwarding for security reasons, treat forwarder availability as a critical infrastructure metric. Monitor each forwarder with a synthetic DNS check, not just a ping. A forwarder that's reachable but returning SERVFAILs of its own is just as bad as one that's down.
Always run at least two authoritative nameservers per zone. A single authoritative nameserver is a single point of failure for that zone. Two NS records pointing to geographically separated servers dramatically reduces the blast radius of hardware failures and upstream connectivity issues. This is DNS fundamentals, but I still see single-NS zones in production more often than I'd like.
Build a delegation map for zones you own. Keep a runbook entry for each zone that lists the authoritative servers, their IPs, the registrar where NS and DS records are managed, and the signing configuration. When SERVFAIL hits during an incident, you don't want to spend the first 20 minutes figuring out where the zone even lives or who has access to update the DS record.
SERVFAIL is almost always diagnosable in under ten minutes with the right methodology. Is the failure universal or zone-specific? Does +cd resolve it? Can I reach the authoritative directly? Is the zone actually loaded and serving with the aa flag set? Is my forwarder alive? Work through those questions systematically, one hop at a time from client to root, and you'll find the cause. The real skill isn't memorizing every possible failure mode upfront; it's building the muscle memory to follow the resolution chain and let the evidence tell you where it broke.
