InfraRunBook

    What Happens When DNS Fails (Real Impact Explained)

    DNS
    Published: Apr 2, 2026
    Updated: Apr 2, 2026

    A technical deep-dive into DNS failure modes, their cascading impact across authentication, monitoring, and CI/CD, and step-by-step recovery procedures for infrastructure engineers.


    Why DNS Failures Are Uniquely Destructive

    Most infrastructure components fail in isolation. A web server crash takes down a web server. A database timeout affects only queries against that database. DNS is fundamentally different. DNS is a shared dependency that sits beneath every layer of your stack—application connectivity, service discovery, certificate validation, authentication, logging, monitoring, and alerting. When DNS fails, it does not fail in isolation. It fails everywhere, simultaneously, and the symptoms look different in every system that depends on it.

    At solvethenetwork.com, an internal DNS failure during a routine resolver migration triggered a cascade that took down LDAP authentication, broke the internal certificate authority's OCSP responder, silenced Prometheus scraping, and prevented SSH from resolving jump host names—all within 90 seconds of the resolver at 10.0.1.10 going dark. The root cause was a single misconfigured named.conf include path. The blast radius looked like a multi-system catastrophe.

    Understanding DNS failure modes is not optional for infrastructure engineers. It is a prerequisite for incident response competency. This article dissects the specific ways DNS can fail, how each failure type propagates through dependent systems, and how to diagnose and recover from each scenario systematically.


    The DNS Resolution Chain: Where Things Break

    To understand DNS failure, you must first internalize the full resolution chain. When an application on sw-infrarunbook-01 wants to reach api.solvethenetwork.com, the following sequence executes:

    1. The stub resolver on sw-infrarunbook-01 checks /etc/hosts for a static entry
    2. The stub resolver sends a query to the configured recursive resolver at 10.0.1.10
    3. The recursive resolver checks its local cache for a valid, non-expired answer
    4. On a cache miss, the recursive resolver begins iterative resolution, querying a root nameserver
    5. The root refers the resolver to the .com TLD nameservers
    6. The TLD nameservers refer the resolver to solvethenetwork.com's authoritative nameservers
    7. The authoritative nameserver returns the A or AAAA record for the query
    8. The recursive resolver caches the result for the record's TTL duration and returns it to sw-infrarunbook-01
    9. The application receives the IP address and initiates a TCP connection

    A failure at any link in this chain produces a resolution failure at the application layer. The error message, timing, and observable behavior differ dramatically depending on where in the chain the break occurs. This is why DNS incidents are notoriously difficult to triage: the failure source is often three or four hops removed from the symptom.


    Type 1: Recursive Resolver Unavailability

    This is the most impactful single-point failure mode and the one that produces the most immediate, widespread symptoms. If the recursive resolver at 10.0.1.10 goes offline—whether due to a process crash, a network partition, an ACL change blocking port 53, or a firewall rule update—every host pointing to it for name resolution loses the ability to resolve any hostname entirely.

    The stub resolver on an affected host will attempt to reach the resolver, wait for the configured timeout (typically 2–5 seconds per attempt), retry the configured number of times, and then return either a timeout error or SERVFAIL to the calling application. Most Linux systems configure this behavior in /etc/resolv.conf. A standard configuration at solvethenetwork.com:

    nameserver 10.0.1.10
    nameserver 10.0.1.11
    search solvethenetwork.com
    options timeout:2 attempts:3 rotate

    With this configuration, if 10.0.1.10 is down, any query that tries it first must wait out the full 2-second timeout before the stub resolver moves on to 10.0.1.11—and because the rotate option alternates the starting nameserver, roughly half of all queries pay that penalty until the primary is restored. For high-frequency service discovery systems—Kubernetes, Consul, or any microservice mesh—this latency budget is catastrophic. Health checks time out. Circuit breakers trip. Cascading failures begin within seconds.
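    The latency arithmetic can be sanity-checked directly from the resolv.conf values. A minimal sketch of the worst-case delay budget, assuming glibc's documented behavior of trying every nameserver on every attempt (the values mirror the example configuration above):

```shell
# Worst-case stub-resolver delay implied by the resolv.conf options,
# assuming every nameserver times out on every attempt.
timeout=2      # options timeout:2
attempts=3     # options attempts:3
nameservers=2  # two nameserver lines
worst_case=$(( timeout * attempts * nameservers ))
echo "worst-case resolution delay: ${worst_case}s"
```

    Twelve seconds is far beyond most health-check and connect timeouts, which is why a dead primary resolver surfaces as application-level failures rather than slow DNS.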

    On systemd-based hosts, systemd-resolved manages DNS and must itself be operational. It exposes a local stub listener at 127.0.0.53. If the process crashes, DNS fails even if your external resolvers are fully healthy, because the local listener is gone. Inspect its state with:

    systemctl status systemd-resolved
    resolvectl status
    resolvectl query api.solvethenetwork.com

    Type 2: Authoritative Nameserver Failure

    When the authoritative nameservers for solvethenetwork.com go offline, recursive resolvers can initially continue serving answers—but only from their cache. Once cached records reach their TTL boundary and expire, any resolver that attempts to refresh the answer receives no response from the authoritative server. Without a valid answer to cache, the resolver returns SERVFAIL to all clients asking for that name.

    This failure mode is delayed and gradual, which makes it deeply confusing during an active incident. Records with long TTLs (3600s or 86400s) continue resolving for hours or even days. Records with short TTLs (60s or 300s) begin failing within minutes. The result is an incident where some services work and others do not, with no obvious pattern—sending engineers down false trails while the actual authoritative failure sits waiting to be found.

    You can check the current remaining TTL of any record to estimate how long cached answers will continue to be served before failures begin:

    dig @10.0.1.10 api.solvethenetwork.com A +noall +answer
    ;; ANSWER SECTION:
    api.solvethenetwork.com.  287  IN  A  10.0.2.50

    The 287 is the remaining TTL in seconds. When it hits zero on a given resolver, any subsequent query from a client served by that resolver will fail until authoritative service is restored. Multiple resolvers cache records independently, so the failure onset will be staggered across your infrastructure.
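    Extracting that countdown is scriptable. A minimal sketch that parses the remaining TTL out of a captured dig answer line (the sample line below mirrors the output shown above):

```shell
# Parse the remaining TTL (field 2) from a captured dig answer line.
# The sample line mirrors the dig output shown above.
answer='api.solvethenetwork.com.  287  IN  A  10.0.2.50'
ttl=$(echo "$answer" | awk '{print $2}')
echo "cached answer survives for another ${ttl}s on this resolver"
```

    Running this against each of your resolvers gives a rough per-resolver deadline for restoring authoritative service.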


    Type 3: Zone Expiry on Secondary Nameservers

    Secondary (replica) authoritative nameservers hold zone data copied from the primary via zone transfers. Each zone's SOA record contains an expire value—the maximum time a secondary will continue to serve zone data after losing contact with the primary. When this timer elapses without a successful zone transfer completing, the secondary stops answering authoritatively for that zone entirely and returns SERVFAIL for all queries it receives.

    A typical SOA record for the solvethenetwork.com zone on sw-infrarunbook-01:

    $TTL 3600
    @  IN  SOA  ns1.solvethenetwork.com. infrarunbook-admin.solvethenetwork.com. (
               2024040201  ; Serial
               3600        ; Refresh - how often secondary checks for updates
               900         ; Retry - how often secondary retries a failed refresh
               604800      ; Expire - max time to serve data without contact
               300 )       ; Negative cache TTL

    With an expire value of 604800 (one week), a secondary will serve potentially stale data for up to seven days before going silent. This is intentional—trading freshness for availability during extended primary outages. But if the primary is unreachable for longer than the expire window and the secondary's zone data ages out, that nameserver stops responding for your zone entirely. Any resolver that selects it from your NS record set will receive SERVFAIL.

    Monitor zone transfer health proactively using rndc:

    rndc zonestatus solvethenetwork.com
    zone solvethenetwork.com/IN: type secondary; serial 2024040201;
      next refresh: Thu, 03 Apr 2026 14:30:00 GMT
      expires: Thu, 10 Apr 2026 14:00:00 GMT
      last refresh: successful

    Compare serial numbers across all nameservers to detect silent replication failures before they become outages:

    for ns in ns1.solvethenetwork.com ns2.solvethenetwork.com; do
      echo -n "$ns serial: "
      dig @$ns solvethenetwork.com SOA +short | awk '{print $3}'
    done
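    To turn that loop into an actionable check, a sketch that flags divergence given the collected serials (the values here are hypothetical, with ns2 one transfer behind ns1; in practice populate the list from the dig loop above):

```shell
# Sample serials as collected from each nameserver; hypothetical values
# where ns2 lags ns1 by one zone transfer.
serials="2024040201
2024040198"
# Count distinct serials; more than one means replication is lagging.
unique=$(printf '%s\n' "$serials" | sort -u | wc -l)
if [ "$unique" -gt 1 ]; then
  echo "WARNING: serial divergence detected; check zone transfers"
fi
```

    Wiring this into a five-minute cron or monitoring check catches silent replication failures long before the SOA expire timer turns them into an outage.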

    Type 4: DNSSEC Validation Failure

    DNSSEC adds cryptographic signatures to DNS responses and establishes a verifiable chain of trust from the root zone down to individual records. When a validating resolver receives a signed response, it must verify every signature in the chain. If any link is broken—an expired RRSIG, a missing DS record after a zone delegation change, a key rollover that was not coordinated with the parent zone, or a misconfigured NSEC chain—the validating resolver returns SERVFAIL to the client even though the authoritative server is healthy and returning correct data.

    DNSSEC failures are among the most confusing DNS incidents because:

    • Non-validating resolvers continue working normally—internal resolvers may succeed while external public resolvers fail
    • The authoritative server appears completely healthy when queried directly
    • Standard dig queries without +dnssec may show clean answers that mask the validation failure
    • Users and monitoring systems see SERVFAIL with no obvious indication that DNSSEC is the root cause

    To check whether DNSSEC validation is succeeding on the internal resolver at 10.0.1.10:

    dig @10.0.1.10 solvethenetwork.com A +dnssec
    
    ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

    The ad (Authenticated Data) flag in the response confirms DNSSEC validation succeeded end-to-end. If you get SERVFAIL and the ad flag is absent, validation is failing. Check RRSIG record expiry directly:

    dig @10.0.1.10 solvethenetwork.com RRSIG +dnssec +noall +answer
    solvethenetwork.com.  3600  IN  RRSIG  A 13 2 3600 (
        20260410120000 20260326120000 12345 solvethenetwork.com.
        [base64 signature data] )

    The two timestamps are signature expiry and inception respectively. If the current timestamp is past the expiry value, every validating resolver worldwide will reject responses for your zone. This is a full-zone outage for all clients using validating resolvers—which includes most modern public resolvers and ISP resolvers.
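    The RRSIG timestamp format (YYYYMMDDHHMMSS, interpreted as UTC) is easy to misread during an incident. A sketch that converts both timestamps to epoch seconds and prints the signature validity window, assuming GNU date is available; the timestamps are the sample values shown above:

```shell
# Convert an RRSIG timestamp (YYYYMMDDHHMMSS, UTC) to epoch seconds.
# Requires GNU date.
to_epoch() {
  local t=$1
  date -u -d "${t:0:4}-${t:4:2}-${t:6:2} ${t:8:2}:${t:10:2}:${t:12:2}" +%s
}
expiry=20260410120000     # signature expiration (sample value above)
inception=20260326120000  # signature inception (sample value above)
window_days=$(( ( $(to_epoch "$expiry") - $(to_epoch "$inception") ) / 86400 ))
echo "signature validity window: ${window_days} days"
```

    Comparing `to_epoch "$expiry"` against `date -u +%s` in the same way gives the remaining lifetime, which is the number your RRSIG-expiry alerting should track.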


    Type 5: Negative Caching and NXDOMAIN Propagation

    When a resolver queries for a name that does not exist, the authoritative server returns NXDOMAIN. Resolvers cache this negative response for the duration specified in the SOA record's minimum TTL field (the last value in the SOA—typically 300 to 3600 seconds). This negative caching is valuable under normal operations but becomes a failure mode in specific scenarios.
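    Per RFC 2308, the effective negative-cache TTL is actually the smaller of the SOA record's own TTL and its minimum field. A quick sketch using the values from the example SOA above:

```shell
# Effective negative-cache TTL per RFC 2308: the smaller of the SOA
# record's own TTL and its minimum field. Values from the example SOA.
soa_ttl=3600     # $TTL on the zone / SOA record
soa_minimum=300  # last field of the SOA RDATA
neg_ttl=$(( soa_ttl < soa_minimum ? soa_ttl : soa_minimum ))
echo "NXDOMAIN answers for this zone are cached for up to ${neg_ttl}s"
```

    That 300-second figure is the worst-case delay clients can experience after a missing record is published, unless the negative cache entry is flushed explicitly.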

    Common negative caching failure patterns at solvethenetwork.com:

    • A new DNS record is added to the zone but a resolver already holds a cached
      NXDOMAIN
      that has not expired—clients on that resolver continue to receive
      NXDOMAIN
      even after the record is published
    • A deployment automation script briefly deletes and recreates a record during a zone update—any resolver that queried during the deletion window caches
      NXDOMAIN
      for up to the SOA minimum TTL
    • A service is deployed before its DNS record is created—clients fail, resolvers cache the failure, and even after the record is added clients must wait for the negative cache to expire

    To force immediate resolution after a record is added or restored, flush the negative cache entry from the resolver:

    # Flush a specific name from BIND's cache
    rndc flushname api.solvethenetwork.com
    
    # Flush the entire resolver cache (use with caution in production)
    rndc flush
    
    # Flush cache on a systemd-resolved host
    resolvectl flush-caches

    Type 6: Split-Horizon Misconfiguration

    Split-horizon DNS (also called split-brain DNS) serves different answers for the same query depending on the source of the request. Internal clients receive RFC 1918 addresses for solvethenetwork.com hosts, while external clients receive public IPs. This is standard practice for enterprises with both internal services and public-facing infrastructure. When a split-horizon configuration breaks, the failure mode depends on which direction the misconfiguration goes.

    A standard BIND view configuration on sw-infrarunbook-01:

    acl internal_clients {
        10.0.0.0/8;
        172.16.0.0/12;
        192.168.0.0/16;
    };
    
    view "internal" {
        match-clients { internal_clients; };
        zone "solvethenetwork.com" {
            type primary;
            file "/etc/bind/zones/solvethenetwork.com.internal";
        };
    };
    
    view "external" {
        match-clients { any; };
        zone "solvethenetwork.com" {
            type primary;
            file "/etc/bind/zones/solvethenetwork.com.external";
        };
    };

    If the match-clients ACL is misconfigured—for example, if a subnet is missing from the internal ACL after a network expansion to a new 10.0.5.0/24 range—hosts on that subnet are silently routed to the external view. They receive public IP addresses that may be unreachable from inside the network, or they receive records pointing to a load balancer that uses host-based routing and returns a different TLS certificate than expected. Applications fail with TLS certificate errors or connection timeouts, and engineers spend time investigating the application layer while the DNS misconfiguration goes unnoticed.
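    Before blaming the application layer, it is worth checking whether a client's address actually falls inside the internal ACL. A minimal bash sketch of that check, hardcoding the three internal_clients ranges from the view configuration above (the client address is a hypothetical example; note that against the broad /8 ACL shown the new range is in fact covered—the incident scenario arises when the ACL enumerates narrower subnets instead):

```shell
# Check whether a client IP falls inside the internal_clients ACL
# (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16).
ip_to_int() {
  # Convert dotted-quad to a 32-bit integer.
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}
in_cidr() {  # usage: in_cidr IP NETWORK PREFIXLEN
  local ip net mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "$2")
  mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}
client=10.0.5.23  # hypothetical host on the new subnet
if in_cidr "$client" 10.0.0.0 8 || in_cidr "$client" 172.16.0.0 12 \
    || in_cidr "$client" 192.168.0.0 16; then
  verdict="internal"
else
  verdict="external"
fi
echo "$client is served by the $verdict view"
```

    Cross-checking this verdict against the answer the client actually receives (internal RFC 1918 address vs public IP) localizes a split-horizon misconfiguration in one step.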


    The Cascade: What Actually Breaks Across the Stack

    A DNS failure does not just mean web browsing stops working. The following is a realistic impact map for a production environment at solvethenetwork.com during a full recursive resolver outage:

    • Authentication and identity: Kerberos KDC discovery uses DNS SRV records (_kerberos._tcp.solvethenetwork.com). LDAP clients resolve directory server hostnames at connection time. When these fail, Active Directory joins, SSH PAM lookups, sudo authentication, and application OAuth flows all collapse simultaneously.
    • TLS and PKI: OCSP responders and CRL distribution points are resolved by DNS at certificate validation time. When certificate validation fails because the OCSP responder hostname cannot be resolved, HTTPS connections are rejected at the TLS handshake—the web server is fully operational but unreachable.
    • Email delivery: MTA-to-MTA mail delivery requires MX record lookups. Inbound mail queues at sending servers. Outbound mail from sw-infrarunbook-01 cannot deliver to external domains. The mail queue grows silently until delivery timeout windows are reached and bounce messages are generated.
    • Kubernetes and service mesh: CoreDNS is the resolver inside Kubernetes clusters. If CoreDNS becomes degraded or if the upstream resolver it depends on fails, pod-to-pod communication using service names breaks. Health checks fail, backends are drained from load balancers, and rolling deployments stall with pods unable to pass readiness probes.
    • Monitoring and observability: Prometheus scrapes targets by hostname. If Prometheus cannot resolve sw-infrarunbook-01.solvethenetwork.com, it marks the target as down and stops collecting metrics—exactly when you need metrics most. Alerting rules that depend on those metrics stop firing, creating blind spots during the outage.
    • Log shipping: Fluentd, Logstash, and syslog forwarders resolve their destination aggregator by hostname at connection time and periodically during reconnects. When resolution fails, log agents buffer locally until configured disk limits are reached, at which point they either drop logs or cause application I/O pressure by blocking on write operations.
    • CI/CD pipelines: Package managers—apt, dnf, pip, npm, cargo—resolve repository mirror hostnames via DNS. Builds fail at the dependency fetch stage, producing error messages that look like repository failures or network issues rather than DNS failures. Engineers investigating build failures often do not check DNS first.
    • Backup and replication: Database replication, object storage sync, and backup agents resolve peer addresses by hostname. They fail silently and create data protection gaps that may not surface until the next restore test or disaster recovery drill.

    Step-by-Step DNS Failure Diagnosis

    When an incident is reported and DNS is a possible cause, follow this diagnostic sequence on sw-infrarunbook-01 or any affected host. Work from the bottom of the stack upward.

    Step 1: Confirm DNS is the failure layer, not network or application

    # Connect directly by IP, bypassing DNS entirely
    curl -o /dev/null -s -w "%{http_code}" http://10.0.2.50/healthz
    
    # If IP works but hostname fails, DNS is the culprit
    curl -o /dev/null -s -w "%{http_code}" http://api.solvethenetwork.com/healthz

    Step 2: Check resolver reachability on port 53

    # Attempt a minimal query with a short timeout
    dig @10.0.1.10 . SOA +time=2 +tries=1
    
    # If this times out, the resolver is down or port 53 is blocked
    nc -uzv 10.0.1.10 53

    Step 3: Verify which resolver the host is actually using

    cat /etc/resolv.conf
    resolvectl status | grep -A5 "DNS Servers"

    Step 4: Query the authoritative server directly, bypassing the resolver cache

    # Identify authoritative nameservers
    dig solvethenetwork.com NS +short
    
    # Query authoritative directly to isolate resolver vs authoritative failure
    dig @ns1.solvethenetwork.com api.solvethenetwork.com A +noall +answer

    Step 5: Check DNSSEC validation status

    dig @10.0.1.10 api.solvethenetwork.com A +dnssec
    # Look for the "ad" flag and presence of RRSIG records
    # SERVFAIL without ad flag = DNSSEC validation failure

    Step 6: Check SOA serial consistency across all nameservers

    for ns in ns1.solvethenetwork.com ns2.solvethenetwork.com; do
      echo -n "$ns serial: "
      dig @$ns solvethenetwork.com SOA +short | awk '{print $3}'
    done
    # Mismatched serials indicate zone transfer failure

    Step 7: Check BIND service status and logs on sw-infrarunbook-01

    systemctl status named
    journalctl -u named --since "1 hour ago" --no-pager
    tail -100 /var/log/named/named.log

    Recovery Procedures by Failure Type

    Resolver down: Immediately update /etc/resolv.conf on affected hosts to promote the secondary resolver at 10.0.1.11 to the primary position. For fleet-wide remediation, push the configuration change via your configuration management tooling. Do not wait for the primary to come back before doing this—every second of degraded resolver latency is generating cascading failures across dependent systems. Investigate the primary resolver's failure separately while the fleet operates on the secondary.
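    The promotion itself is a small file edit. A sketch that reorders the nameserver entries, shown here against a sample copy rather than the live /etc/resolv.conf (push the real change through configuration management; assumes GNU sed):

```shell
# Build a sample copy of the resolv.conf shown earlier.
cat > resolv.conf.sample <<'EOF'
nameserver 10.0.1.10
nameserver 10.0.1.11
search solvethenetwork.com
options timeout:2 attempts:3 rotate
EOF
# Reorder so 10.0.1.11 is queried first; keep 10.0.1.10 as fallback.
sed -i \
  -e '/^nameserver 10\.0\.1\.10$/d' \
  -e 's/^nameserver 10\.0\.1\.11$/nameserver 10.0.1.11\nnameserver 10.0.1.10/' \
  resolv.conf.sample
head -2 resolv.conf.sample
```

    Keeping the dead resolver as the second entry preserves automatic failback once it recovers, without a second fleet-wide change.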

    Authoritative server failure: Confirm the secondary nameserver is still serving valid data by querying it directly and comparing the returned serial with your expected value. Identify the authoritative records with the shortest TTLs—those are the ones that will fail first as resolver caches expire. Prioritize restoring authoritative service or adjusting delegation before those TTLs expire. If restoration will take longer than the shortest TTL, consider temporarily increasing TTLs on critical records from a surviving nameserver to buy time.

    Zone expiry: Restore primary-secondary network connectivity, then trigger a manual zone transfer to refresh the secondary immediately:

    rndc retransfer solvethenetwork.com
    rndc zonestatus solvethenetwork.com

    If the primary is permanently lost and you only have a copy of the zone on the secondary, promote the secondary to primary immediately by changing its zone type configuration and updating your NS glue records at the domain registrar.

    DNSSEC validation failure: Re-sign the zone using BIND's inline signing commands. This requires access to the Zone Signing Key (ZSK) on the signing server:

    rndc sign solvethenetwork.com
    rndc loadkeys solvethenetwork.com
    
    # Verify new RRSIGs are published with future expiry
    dig @ns1.solvethenetwork.com solvethenetwork.com RRSIG +dnssec +short

    If the signing key itself has been lost or compromised, initiate an emergency key rollover. This requires coordination with your parent zone registrar to update the DS record. Until the new DS record propagates, validating resolvers will continue to return SERVFAIL.


    Prevention: Building DNS Resilience

    The most important prevention measure is geographic and topological diversity in nameserver placement. Never run both authoritative nameservers on the same subnet, the same physical rack, or the same availability zone. The probability of losing both 10.0.1.10 and a secondary on a completely separate network segment simultaneously is orders of magnitude lower than losing two servers on the same switch.

    Additional resilience measures for solvethenetwork.com's DNS infrastructure:

    • Monitor SOA serial consistency across all authoritative nameservers every five minutes and alert on divergence
    • Alert on RRSIG expiry with a minimum seven-day warning window to allow time for key rotation without emergency pressure
    • Set up synthetic DNS monitoring from at least two external vantage points that are independent of your internal infrastructure
    • Keep named.conf, zone files, and resolver configurations in version control and deploy all changes through peer-reviewed automation
    • Use TSIG keys for all zone transfers to authenticate replication and prevent unauthorized zone data enumeration
    • Document and rehearse your DNS recovery runbook quarterly—teams that have never practiced recovery are slow and error-prone during actual incidents
    • Set a realistic SOA expire value: long enough to survive extended primary outages, short enough that secondary nameservers do not serve dangerously stale data indefinitely

    Frequently Asked Questions

    Q: What does SERVFAIL mean in DNS and how is it different from other error codes?

    A: SERVFAIL (response code 2) means the server encountered an internal error and could not complete the query—but this tells you nothing about why. It is returned when a resolver cannot reach authoritative servers, when DNSSEC validation fails, when a zone has expired on a secondary, or when the resolver itself is misconfigured. NXDOMAIN (code 3) is fundamentally different: it means the name definitively does not exist according to an authoritative source. REFUSED (code 5) means the server received the query but has a policy-based reason to decline answering—typically an ACL blocking the client's IP. When you see SERVFAIL during an incident, treat it as a signal that something is broken in the resolution chain, not a description of what specifically is broken.

    Q: How do I confirm DNS is the cause of an outage rather than a network or application issue?

    A: Test connectivity to affected services using their IP addresses directly, bypassing DNS. If connecting by IP succeeds but connecting by hostname fails, DNS is the failure layer. This single test often saves 15–20 minutes of misdirected investigation. Use curl with an explicit IP and a Host header for HTTP services, or use ssh with an IP address to bypass DNS for remote access. If even IP connectivity fails, the issue is network-layer, not DNS.

    Q: Why does a DNS failure take down authentication systems like SSH and LDAP?

    A: Many authentication protocols use DNS for service discovery and endpoint resolution. Kerberos uses SRV records to locate Key Distribution Centers. LDAP clients resolve directory server hostnames when establishing connections. PAM modules performing reverse DNS lookups for logging or access control will block or fail if DNS is unavailable. SSH itself may perform reverse DNS lookups on connecting client IPs. Some of these behaviors can be disabled in configuration, but in a default enterprise setup, DNS unavailability reliably cascades into authentication failure within seconds.

    Q: What is the difference between NXDOMAIN and SERVFAIL?

    A: NXDOMAIN is an authoritative answer meaning the queried name genuinely does not exist in the zone. It comes from an authoritative nameserver and is a definitive negative response. SERVFAIL is a failure signal meaning the resolver was unable to obtain or validate an answer—it says nothing about whether the name exists. NXDOMAIN is expected and cached normally. SERVFAIL indicates a broken resolution path and should always be investigated. A common mistake during incidents is treating SERVFAIL as if it means the same thing as NXDOMAIN—it does not, and acting on that assumption delays finding the real cause.

    Q: How long does DNS recovery take after an outage is resolved?

    A: It depends on the failure type. Resolver-level failures can be mitigated in seconds by redirecting clients to a secondary resolver. Resolver cache poisoning from stale or incorrect data clears as TTLs expire—anywhere from seconds to 24 hours depending on record TTLs. Authoritative server failures resolved before TTL expiry are transparent to clients. NS record changes at the registrar level (needed if you change your authoritative nameservers) take up to 48 hours to propagate globally due to TLD nameserver TTLs. DNSSEC key rollover with DS record updates also follows the parent zone's TTL schedule. The fastest mitigation path is almost always at the resolver layer.

    Q: What is DNS TTL and how does it affect the severity of an outage?

    A: TTL (Time to Live) is the number of seconds a resolver should cache a DNS record before querying for a fresh copy. Short TTLs (60–300 seconds) allow rapid propagation of DNS changes but increase query volume against authoritative servers and reduce the grace period when those servers fail. Long TTLs (3600–86400 seconds) reduce load and provide a longer availability buffer during authoritative failures, but they mean DNS changes take much longer to propagate. During an active outage, long TTLs are your friend if the data is still correct—resolvers continue serving cached answers. They are your enemy if you need to rapidly update a record to restore service, because clients will keep using the old cached value until the TTL expires.

    Q: Can DNS fail partially, affecting only some clients or some record types?

    A: Yes, and partial failure is one of the most difficult diagnostic scenarios in DNS. Different records have different TTLs and expire at different times on different resolvers. DNSSEC failures affect only clients using validating resolvers. Split-horizon misconfigurations affect only clients on specific subnets. Negative caching failures affect only clients whose resolvers queried during a brief window when a record was absent. Zone transfer lag means secondary nameservers may have stale data while the primary has the correct answer. During any DNS incident, always query multiple resolvers from multiple source IPs to establish whether the failure is universal or scoped to a specific resolver, subnet, or client configuration.

    Q: Why does DNSSEC make DNS failures harder to diagnose?

    A: DNSSEC validation occurs invisibly inside the resolver. When it fails, the resolver returns SERVFAIL—the same code returned for every other resolution failure. There is no application-visible signal that DNSSEC specifically is the issue. To detect it, you must explicitly compare results between a validating resolver and a non-validating one, or use dig with the +dnssec flag and check for the ad flag and RRSIG records. The problem is compounded by the fact that internal resolvers are often non-validating while public resolvers validate—so internal testing passes while external users are completely blocked. Always test from both an internal resolver and an external validating resolver when investigating unexplained SERVFAIL responses.

    Q: What is the DNS resolution order on Linux and how can it affect failure behavior?

    A: Linux systems follow the order defined in /etc/nsswitch.conf. The typical setting is hosts: files dns, meaning /etc/hosts is checked first before any DNS query is issued. Entries in /etc/hosts always win over DNS—which can be both a lifesaver during DNS outages (add a static override to restore critical service access) and a source of confusion when stale static entries override correct DNS data. On systemd-based systems, systemd-resolved handles the actual DNS queries and adds an mDNS and LLMNR layer. The resolve entry in nsswitch.conf routes through systemd-resolved's local socket at 127.0.0.53. If systemd-resolved is stopped, even correct DNS server configuration in /etc/resolv.conf may not help if nsswitch.conf routes through it.

    Q: How do I flush DNS cache on Linux during an incident?

    A: The correct command depends on what is managing DNS resolution. For systemd-resolved, run resolvectl flush-caches. For a local BIND instance acting as a caching resolver, run rndc flush to flush all caches or rndc flushname api.solvethenetwork.com to flush a specific name. For nscd (Name Service Cache Daemon), run nscd -i hosts. For dnsmasq, send SIGHUP: kill -HUP $(pidof dnsmasq). On hosts running nothing local, there is no client-side cache to flush—the cache lives in the resolver at 10.0.1.10 or 10.0.1.11. Note that flushing a resolver cache during an active authoritative outage just causes immediate SERVFAIL—do not flush until the underlying cause is resolved.

    Q: What monitoring should be in place to detect DNS failures before users report them?

    A: Synthetic DNS monitoring is essential: a monitoring agent external to your infrastructure should query all authoritative nameservers for critical records every 60 seconds and alert if any return SERVFAIL, NXDOMAIN for records that should exist, or incorrect answers. Pair this with SOA serial consistency checks comparing all nameservers every five minutes—divergence indicates zone transfer failure. For DNSSEC, monitor RRSIG expiry dates with alerts at 14 days and 7 days before expiry. Monitor resolver availability by querying known-stable records (like the root SOA) against each resolver every 30 seconds. Finally, track resolver query latency as a metric—a spike in mean latency often precedes a full resolver failure and gives you an early warning window.

    Q: What are TSIG keys and why do they matter for DNS security and resilience?

    A: TSIG (Transaction SIGnature) keys are shared-secret HMAC-based keys used to authenticate DNS messages, most importantly zone transfers. Without TSIG authentication on zone transfers, any host that can reach your authoritative server on port 53 can request a complete copy of your zone data—every hostname, IP address, internal service name, and mail record in the zone. This is a significant reconnaissance capability for an attacker. TSIG also ensures that zone transfer data has not been tampered with in transit. Configure TSIG on sw-infrarunbook-01 by generating a key with tsig-keygen, adding it to named.conf on both primary and secondary, and setting allow-transfer to require the key. Zone transfers that do not present the correct HMAC are rejected, protecting both confidentiality and integrity of your zone data.

    Frequently Asked Questions

    What does SERVFAIL mean in DNS and how is it different from other error codes?

    SERVFAIL (code 2) means the resolver encountered an internal error and could not complete the query. It differs from NXDOMAIN (code 3), which means the name definitively does not exist, and REFUSED (code 5), which means the server has a policy-based reason to decline answering. SERVFAIL is a signal that something is broken in the resolution chain and requires investigation.

    How do I confirm DNS is the cause of an outage rather than a network or application issue?

    Test connectivity to affected services using their IP addresses directly, bypassing DNS entirely. If connecting by IP succeeds but connecting by hostname fails, DNS is the failure layer. This single test eliminates network and application layers from the investigation immediately.
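
    This triage step can be sketched as a small helper; the function name and return values are illustrative, and it assumes you already know one good IP for the service:

```python
import socket

def dns_or_network(hostname, known_ip, port, timeout=3.0):
    """Classify a failure as 'dns', 'network', or 'ok'.

    Step 1: connect to the known IP directly, bypassing DNS entirely.
    Step 2: ask the system resolver to resolve the hostname.
    """
    try:
        with socket.create_connection((known_ip, port), timeout=timeout):
            ip_reachable = True
    except OSError:
        ip_reachable = False

    try:
        socket.getaddrinfo(hostname, port)
        name_resolves = True
    except socket.gaierror:
        name_resolves = False

    if ip_reachable and not name_resolves:
        return "dns"       # IP works, name does not: DNS is the failure layer
    if not ip_reachable:
        return "network"   # can't reach the IP either: look below DNS
    return "ok"
```

    The same logic is what you do by hand with curl against an IP versus a hostname; encoding it keeps the triage order consistent under incident pressure.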

    Why does a DNS failure take down authentication systems like SSH and LDAP?

    Many authentication protocols rely on DNS for service discovery. Kerberos uses SRV records to locate KDC servers. LDAP clients resolve directory server hostnames at connection time. PAM modules perform reverse DNS lookups for logging and access control. In a default enterprise configuration, DNS unavailability cascades into authentication failure within seconds.

    What is the difference between NXDOMAIN and SERVFAIL?

    NXDOMAIN is an authoritative answer meaning the queried name does not exist in the zone—it is a definitive, expected negative response. SERVFAIL indicates a resolution failure and says nothing about whether the name exists. Confusing them during an incident causes misdirected investigation and delays finding the real cause.

    How long does DNS recovery take after an outage is resolved?

    Resolver-level mitigations take seconds. Stale cached data clears as TTLs expire, from seconds to 24 hours depending on configured TTL values. NS record changes at the registrar take up to 48 hours to propagate globally. DNSSEC key rollovers requiring DS record updates follow the parent zone TTL schedule. The fastest mitigation is always at the resolver layer.

    What is DNS TTL and how does it affect the severity of an outage?

    TTL is the number of seconds a resolver caches a record before refreshing it. Short TTLs allow rapid propagation of changes but reduce the availability buffer when authoritative servers fail. Long TTLs extend the grace window during authoritative outages but slow down recovery when you need to update a record quickly. Matching TTL values to your recovery time objectives is part of DNS resilience planning.
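
    The trade-off can be made concrete with a toy model of one cached record; the class, the 300-second TTL, and the timestamps are illustrative assumptions, not values from a real resolver:

```python
class CachedRecord:
    """Toy model of a single cached DNS record in a resolver."""
    def __init__(self, value, ttl, cached_at):
        self.value = value
        self.expires = cached_at + ttl  # served from cache until this instant

    def fresh(self, now):
        return now < self.expires

# A record with a 300-second TTL, cached at t=0.
rec = CachedRecord("203.0.113.10", ttl=300, cached_at=0)

# Availability buffer: if the authoritative servers die at t=10,
# clients behind this resolver still get answers until t=300.
assert rec.fresh(now=299)

# Recovery delay: if the record was wrong and you publish a fix at t=10,
# this resolver keeps serving the old value until t=300 as well.
assert not rec.fresh(now=300)
```

    The same number is both your grace window and your propagation delay, which is why TTLs should be chosen against your recovery time objectives rather than left at defaults.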

    Can DNS fail partially, affecting only some clients or some record types?

    Yes. Different records have different TTLs and expire at different times on different resolvers. DNSSEC failures affect only validating resolvers. Split-horizon misconfigurations affect only specific subnets. Negative caching failures affect only clients whose resolvers queried during a brief absence window. Always test from multiple resolvers and source IPs before concluding a failure is universal.

    Why does DNSSEC make DNS failures harder to diagnose?

    DNSSEC validation happens invisibly inside the resolver. When it fails, the resolver returns SERVFAIL—the same error code as every other resolution failure. Internal non-validating resolvers may succeed while external validating resolvers fail. Always compare results between an internal resolver and an external validating resolver, and use dig with the +dnssec flag to look for the ad flag and RRSIG records.

    What is the DNS resolution order on Linux and how does it affect failure behavior?

    Linux follows the order in /etc/nsswitch.conf, typically checking /etc/hosts before DNS. Entries in /etc/hosts bypass DNS entirely and can be used as emergency overrides during outages. On systemd-based systems, systemd-resolved adds an mDNS and LLMNR layer. If systemd-resolved stops, DNS resolution fails even if external resolvers are healthy, because the local stub at 127.0.0.53 is unavailable.
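
    The /etc/hosts escape hatch looks like this in practice; the IP address below is an illustrative placeholder, so use the service's verified real address, and remove the line once DNS is restored:

```
# /etc/hosts: emergency DNS bypass during an outage.
# 203.0.113.10 is a placeholder; substitute the service's known-good IP.
203.0.113.10   api.solvethenetwork.com
```

    Because nsswitch.conf consults files before dns by default, this entry takes effect immediately for every process that does a fresh lookup.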

    How do I flush DNS cache on Linux during an incident?

    For systemd-resolved, run resolvectl flush-caches. For BIND acting as a local caching resolver, run rndc flush or rndc flushname for a specific name. For nscd, run nscd -i hosts. For dnsmasq, send SIGHUP. Do not flush until the underlying authoritative failure is resolved—flushing during an active outage causes immediate SERVFAIL for all names that were previously cached.
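
    Collected in one place, with the hostname in the rndc flushname example being illustrative:

```
resolvectl flush-caches                    # systemd-resolved
rndc flush                                 # BIND caching resolver, entire cache
rndc flushname api.solvethenetwork.com     # BIND, a single name
nscd -i hosts                              # nscd hosts cache
pkill -HUP dnsmasq                         # dnsmasq clears its cache on SIGHUP
```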

    Related Articles