InfraRunBook

    Kubernetes DNS Not Resolving Inside Cluster

    Kubernetes
    Published: Apr 6, 2026
    Updated: Apr 6, 2026

    A step-by-step troubleshooting runbook for Kubernetes DNS failures inside the cluster, covering CoreDNS crashes, ConfigMap errors, NetworkPolicy blocks, ndots settings, and search domain mismatches with real CLI commands and fixes.


    DNS resolution failures inside a Kubernetes cluster are among the most disruptive and difficult-to-diagnose problems an infrastructure engineer will face. A pod that cannot resolve service names or external hostnames will fail silently or throw cryptic connection errors, making the root cause hard to pinpoint at first glance. This runbook walks through every common reason DNS stops working inside a cluster, complete with real diagnostic commands, actual error output, and step-by-step remediation for each cause.

    Symptoms

    When DNS resolution breaks inside a Kubernetes cluster, pods typically surface one or more of the following symptoms:

    • Application pods crash-loop with dial tcp: lookup <service>: no such host or similar resolver errors
    • nslookup or dig inside a pod returns SERVFAIL, NXDOMAIN, or times out entirely
    • Services reachable by IP continue to work, but hostname-based connections fail
    • kubectl exec into a debug pod shows connection timed out; no servers could be reached
    • Intermittent DNS failures under load, suggesting resource exhaustion rather than complete failure
    • External DNS works but in-cluster service discovery does not, or vice versa
    • New pods fail to resolve immediately after deployment while older pods are unaffected

    Start every DNS investigation with a quick baseline test from inside an affected pod:

    kubectl run dns-debug --image=busybox:1.36 --restart=Never -it --rm -- nslookup kubernetes.default

    If this times out or returns SERVFAIL, CoreDNS itself is the problem. If it returns a valid answer but your application's target hostname does not resolve, the issue is more likely related to search domains, ndots settings, or ConfigMap forwarding rules.
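
    The triage above can be sketched as a small shell helper that classifies captured resolver output. The category labels are assumptions for this sketch, not kubectl terminology:

```shell
# Illustrative triage helper: classify resolver output captured from a pod.
classify_dns_output() {
  local out="$1"
  case "$out" in
    *"connection timed out"*|*"no servers could be reached"*)
      echo "coredns-unreachable" ;;  # CoreDNS down, port 53 blocked, or kube-proxy broken
    *SERVFAIL*)
      echo "coredns-error" ;;        # Corefile or upstream problem inside CoreDNS
    *NXDOMAIN*)
      echo "name-not-found" ;;       # wrong name, search domain, or ndots issue
    *)
      echo "resolved" ;;
  esac
}

# Feed it output captured from the baseline test, e.g.:
#   out=$(kubectl run dns-debug --image=busybox:1.36 --restart=Never -it --rm \
#     -- nslookup kubernetes.default 2>&1)
#   classify_dns_output "$out"
```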


    Root Cause 1: CoreDNS Pod Not Running

    Why It Happens

    CoreDNS runs as a Deployment inside the kube-system namespace. If the pods crash, get evicted due to node memory pressure, or are accidentally deleted, all in-cluster DNS resolution stops immediately. A bad ConfigMap update that causes CoreDNS to panic on startup, a broken container image reference, or a node that has run out of resources can all leave the cluster without any DNS pods. Because CoreDNS is a shared infrastructure component, a single Deployment outage affects every workload in every namespace simultaneously.

    How to Identify It

    Check the state of the CoreDNS Deployment and its pods:

    kubectl get pods -n kube-system -l k8s-app=kube-dns
    
    NAME                       READY   STATUS             RESTARTS   AGE
    coredns-5d78c9869d-4xkzp   0/1     CrashLoopBackOff   8          12m
    coredns-5d78c9869d-r9p2l   0/1     CrashLoopBackOff   8          12m

    Pull the logs from a crashing pod to understand what is going wrong:

    kubectl logs -n kube-system coredns-5d78c9869d-4xkzp --previous
    
    [ERROR] plugin/errors: 2 SERVFAIL (incomplete response) : no upstream is available
    plugin/reload: Running configuration md5 = 3a7bc4f1d9820e13
    [FATAL] Failed to initialize server: listen tcp 0.0.0.0:53: bind: address already in use

    Also verify the Deployment itself is reporting healthy conditions:

    kubectl describe deployment coredns -n kube-system | grep -A5 "Conditions:"
    
    Conditions:
      Type             Status  Reason
      ----             ------  ------
      Available        False   MinimumReplicasUnavailable
      Progressing      True    ReplicaSetUpdated

    How to Fix It

    If the pods are in CrashLoopBackOff due to a bad configuration, roll back the ConfigMap first (see Root Cause 2). If the pods are simply missing or evicted, scale the Deployment back up:

    kubectl scale deployment coredns -n kube-system --replicas=2

    If the container image cannot be pulled, verify the image reference and update it to a valid tag:

    kubectl set image deployment/coredns -n kube-system \
      coredns=registry.k8s.io/coredns/coredns:v1.11.1

    After making changes, watch the rollout until it completes:

    kubectl rollout status deployment/coredns -n kube-system
    
    Waiting for deployment "coredns" rollout to finish: 1 out of 2 new replicas have been updated...
    deployment "coredns" successfully rolled out

    Confirm DNS is restored by re-running the baseline nslookup test from a pod in an affected namespace.


    Root Cause 2: CoreDNS ConfigMap Misconfigured

    Why It Happens

    CoreDNS reads its configuration from a ConfigMap named coredns in the kube-system namespace. A typo in the Corefile, a missing plugin directive, a malformed forward address, or an incorrect zone block will prevent CoreDNS from loading its configuration — causing pods to crash or silently drop queries. This ConfigMap is often edited manually during cluster customisation to add custom stub zones, change upstream resolvers, or tune caching, making it a frequent source of breakage. Pointing the forward directive at the CoreDNS service IP itself creates an infinite loop and is one of the most common self-inflicted failures.

    How to Identify It

    Inspect the current ConfigMap:

    kubectl get configmap coredns -n kube-system -o yaml
    
    apiVersion: v1
    data:
      Corefile: |
        .:53 {
            errors
            health {
               lameduck 5s
            }
            ready
            kubernetes cluster.local in-addr.arpa ip6.arpa {
               pods insecure
               fallthrough in-addr.arpa ip6.arpa
               ttl 30
            }
            prometheus :9153
            forward . 10.96.0.10 {
               max_concurrent 1000
            }
            cache 30
            loop
            reload
            loadbalance
        }
    kind: ConfigMap

    In the output above, the forward directive points to 10.96.0.10 — which is the CoreDNS service itself. Every external query will loop back into CoreDNS, and the loop plugin will detect this and crash the pod. Another common mistake is removing the cluster.local zone from the kubernetes block, which stops all in-cluster service discovery.

    Test the Corefile syntax before applying it by running CoreDNS with the -dns.port 0 flag to perform a dry-run parse:

    docker run --rm -v $(pwd)/Corefile:/Corefile \
      registry.k8s.io/coredns/coredns:v1.11.1 \
      -conf /Corefile -dns.port 0 2>&1
    
    .:53
    CoreDNS-1.11.1
    linux/amd64, go1.21.1, 1b5f4a0

    If there is a syntax error, CoreDNS will print it and exit non-zero before you have applied anything.
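
    A complementary pre-apply guard can grep a candidate Corefile for the self-forwarding mistake described above. This is a sketch; the 10.96.0.10 value is an assumption and should be replaced with your cluster's actual kube-dns ClusterIP:

```shell
# Sketch: reject a Corefile whose forward directive targets the cluster's own
# DNS service IP. CLUSTER_DNS_IP here is an assumption for illustration.
CLUSTER_DNS_IP="10.96.0.10"

corefile_forwards_to_self() {
  # Succeeds (exit 0) if any forward line mentions the cluster DNS IP.
  grep -Eq "forward[[:space:]].*${CLUSTER_DNS_IP}" "$1"
}

# Usage in CI, before kubectl apply:
#   if corefile_forwards_to_self Corefile; then
#     echo "Corefile forwards to CoreDNS itself; aborting" >&2
#     exit 1
#   fi
```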

    How to Fix It

    Edit the ConfigMap and restore a known-good Corefile:

    kubectl edit configmap coredns -n kube-system

    A safe baseline Corefile for a standard cluster with upstream resolvers at 10.0.0.1:

    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . 10.0.0.1
        cache 30
        loop
        reload
        loadbalance
    }

    To forward queries for solvethenetwork.com to an internal resolver at 10.0.0.53, add a dedicated stub zone block above the catch-all:

    solvethenetwork.com:53 {
        errors
        forward . 10.0.0.53
        cache 30
    }
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . 10.0.0.1
        cache 30
        loop
        reload
        loadbalance
    }

    After saving, CoreDNS detects the change via the reload plugin without requiring a pod restart. Verify it picked up the new configuration:

    kubectl logs -n kube-system -l k8s-app=kube-dns | grep reload
    
    [INFO] Reloading
    [INFO] plugin/reload: Running configuration md5 = d41d8cd98f00b204
    [INFO] Reloading complete

    Root Cause 3: NetworkPolicy Blocking Port 53

    Why It Happens

    Kubernetes NetworkPolicy objects restrict pod-to-pod and pod-to-external traffic. A strict default-deny egress policy applied to application namespaces will block pods from reaching the CoreDNS service on port 53 over both UDP and TCP. Engineers often add default-deny NetworkPolicies to tighten namespace security without realising they have also cut off DNS. The symptom is pods that can reach other pods by direct IP address but cannot resolve any hostname — not even kubernetes.default — because every DNS query is silently dropped before it reaches CoreDNS.

    How to Identify It

    List NetworkPolicies in the affected namespace and look for any that restrict egress:

    kubectl get networkpolicy -n production
    
    NAME            POD-SELECTOR   AGE
    default-deny    <none>         3d
    allow-internal  app=api        3d

    Inspect the default-deny policy:

    kubectl get networkpolicy default-deny -n production -o yaml
    
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny
      namespace: production
    spec:
      podSelector: {}
      policyTypes:
      - Ingress
      - Egress

    This policy blocks all egress traffic from every pod in the namespace, including DNS queries to CoreDNS. Confirm it is the cause by running a debug pod in a namespace without NetworkPolicies and verifying DNS works there:

    kubectl run dns-debug --image=busybox:1.36 --restart=Never \
      -it --rm -n kube-system -- nslookup kubernetes.default
    
    Server:    10.96.0.10
    Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
    
    Name:      kubernetes.default
    Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

    DNS works in kube-system but fails in production — the NetworkPolicy is blocking port 53 egress.

    How to Fix It

    Add an explicit egress rule allowing UDP and TCP on port 53 targeted at the kube-system namespace where CoreDNS runs. Using a namespaceSelector is more portable than hardcoding the CoreDNS service CIDR:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-dns-egress
      namespace: production
    spec:
      podSelector: {}
      policyTypes:
      - Egress
      egress:
      - ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
        to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system

    Apply it and verify resolution is restored:

    kubectl apply -f allow-dns-egress.yaml
    
    networkpolicy.networking.k8s.io/allow-dns-egress created
    
    kubectl exec -it <pod-name> -n production -- nslookup kubernetes.default
    
    Server:    10.96.0.10
    Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
    
    Name:      kubernetes.default
    Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

    Always include this DNS egress rule in any namespace that uses a default-deny egress policy. Treat it as mandatory boilerplate alongside the policy itself.


    Root Cause 4: ndots Misconfigured

    Why It Happens

    The ndots option in /etc/resolv.conf controls how many dots must appear in a query name before the resolver treats it as a fully qualified domain name (FQDN) and sends it as-is, without appending search domains. Kubernetes sets ndots:5 by default, meaning a hostname like api.solvethenetwork.com (only two dots) will first be tried with each search domain appended before the bare name is attempted. This is intentional: it ensures short in-cluster service names like payments resolve correctly by appending .production.svc.cluster.local automatically.

    The problem arises when someone overrides ndots to a low value such as 1 in an attempt to reduce DNS lookup latency for external FQDNs. With ndots:1, the resolver treats any name containing at least one dot as an FQDN and sends it directly without appending search domains. That breaks every in-cluster lookup whose name contains a dot, most notably cross-namespace names of the form <service>.<namespace>.
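
    The candidate ordering the resolver derives from ndots and the search list can be sketched as follows. This is an illustration of glibc-style behaviour, not a real resolver:

```shell
# Illustrative sketch of glibc-style search expansion: print the names actually
# sent to the nameserver, in order, for a given query, ndots value, and search list.
expand_query() {
  local name="$1" ndots="$2"; shift 2
  local dots="${name//[^.]/}"          # keep only the dots so we can count them
  if [ "${#dots}" -ge "$ndots" ]; then
    echo "$name."                      # enough dots: tried as an absolute name first
    for d in "$@"; do echo "$name.$d."; done
  else
    for d in "$@"; do echo "$name.$d."; done
    echo "$name."                      # absolute form tried last
  fi
}

# With the default ndots:5, a dotted in-cluster name goes through the search
# list first; with ndots:1 it is sent straight upstream as an absolute name:
expand_query payments.production 5 production.svc.cluster.local svc.cluster.local cluster.local
expand_query payments.production 1 production.svc.cluster.local svc.cluster.local cluster.local
```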

    How to Identify It

    Inspect /etc/resolv.conf inside an affected pod:

    kubectl exec -it <pod-name> -n production -- cat /etc/resolv.conf
    
    search production.svc.cluster.local svc.cluster.local cluster.local
    nameserver 10.96.0.10
    options ndots:1

    With ndots:1, the cross-namespace name payments.production (one dot) is sent upstream as the absolute name payments.production. instead of being expanded through the search list. Confirm the behaviour with dig, passing +search so dig applies the search list and ndots the way the pod's resolver does:

    kubectl exec -it <pod-name> -n production -- dig +search payments.production
    
    ;; QUESTION SECTION:
    ;payments.production.           IN      A
    
    ;; AUTHORITY SECTION:
    .                       86399   IN      SOA     a.root-servers.net. ...
    
    ;; Query time: 234 msec
    ;; SERVER: 10.96.0.10#53(10.96.0.10)

    The query reached CoreDNS, which forwarded it upstream to the root servers rather than resolving it as an in-cluster service. With the default ndots:5, the resolver would expand the name through the search list and resolve payments.production.svc.cluster.local, getting an immediate answer.

    How to Fix It

    Set the correct ndots value in the pod's dnsConfig. The Kubernetes default of 5 is correct for most workloads. If external DNS resolution latency is a concern, a value of 2 is a reasonable compromise — it still forces short names through the search list while allowing typical FQDNs (which have three or more dots) to bypass it:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: api-deployment
      namespace: production
    spec:
      template:
        spec:
          dnsConfig:
            options:
            - name: ndots
              value: "5"
          containers:
          - name: api
            image: nginx:1.25

    For an already-running Deployment, patch it directly:

    kubectl patch deployment api-deployment -n production --type=json \
      -p='[{"op":"add","path":"/spec/template/spec/dnsConfig","value":{"options":[{"name":"ndots","value":"5"}]}}]'

    Confirm the resolv.conf inside the restarted pods reflects the correct value:

    kubectl exec -it <pod-name> -n production -- cat /etc/resolv.conf
    
    search production.svc.cluster.local svc.cluster.local cluster.local
    nameserver 10.96.0.10
    options ndots:5

    Root Cause 5: Search Domain Wrong

    Why It Happens

    Kubernetes automatically populates the search line in each pod's /etc/resolv.conf based on the pod's namespace and the cluster domain. The expected entries are <namespace>.svc.cluster.local, svc.cluster.local, and cluster.local. This cluster domain is configured via the kubelet's --cluster-domain flag (or equivalent clusterDomain field in the kubelet config file). If the cluster was bootstrapped with a non-standard domain such as cluster.internal, but CoreDNS was left configured with cluster.local, the search domains injected into pods will not match the zone CoreDNS is authoritative for, and all in-cluster service lookups will silently fail. A misconfigured Helm chart or operator can also override the dnsConfig.searches field, replacing the correct search domains with something invalid.

    How to Identify It

    Check the search domains in the pod and compare them to what CoreDNS serves:

    kubectl exec -it <pod-name> -n production -- cat /etc/resolv.conf
    
    search production.svc.cluster.internal svc.cluster.internal cluster.internal
    nameserver 10.96.0.10
    options ndots:5

    Now check the CoreDNS ConfigMap zone declaration:

    kubectl get configmap coredns -n kube-system \
      -o jsonpath='{.data.Corefile}' | grep kubernetes
    
            kubernetes cluster.local in-addr.arpa ip6.arpa {

    The pod is searching cluster.internal but CoreDNS is authoritative only for cluster.local. Every short service name lookup appends the wrong suffix and receives NXDOMAIN. Confirm the mismatch by testing a fully-qualified lookup, which bypasses the search list entirely:

    kubectl exec -it <pod-name> -n production -- \
      nslookup kubernetes.default.svc.cluster.local
    
    Server:    10.96.0.10
    Address 1: 10.96.0.10
    
    Name:      kubernetes.default.svc.cluster.local
    Address 1: 10.96.0.1

    The FQDN resolves correctly, but nslookup kubernetes.default (which appends .production.svc.cluster.internal) fails. Search domain mismatch is confirmed.
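
    The comparison can be scripted. The sketch below checks the cluster domain (the last entry on the pod's search line) against the zone from the CoreDNS kubernetes block; both are passed in as plain strings captured beforehand:

```shell
# Sketch: detect a cluster-domain mismatch between a pod's resolv.conf search
# line and the CoreDNS zone. Inputs are plain strings, e.g. captured via the
# kubectl commands shown earlier in this section.
search_domain_matches() {
  local resolv_search="$1" coredns_zone="$2"
  local cluster_domain="${resolv_search##* }"  # last entry is the cluster domain
  [ "$cluster_domain" = "$coredns_zone" ]
}

# Example with the mismatched values from this section:
if ! search_domain_matches \
    "production.svc.cluster.internal svc.cluster.internal cluster.internal" \
    "cluster.local"; then
  echo "search domain mismatch"
fi
```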

    How to Fix It

    The correct fix depends on where the mismatch originated. First, check the kubelet configuration on the affected node:

    ssh infrarunbook-admin@sw-infrarunbook-01 \
      "cat /var/lib/kubelet/config.yaml | grep clusterDomain"
    
    clusterDomain: cluster.internal

    If the kubelet is using the wrong domain, update it on every node and restart the kubelet. This is a cluster-wide change requiring a rolling node update:

    ssh infrarunbook-admin@sw-infrarunbook-01 \
      "sudo sed -i 's/cluster.internal/cluster.local/' \
      /var/lib/kubelet/config.yaml && sudo systemctl restart kubelet"

    For an immediate workaround without touching the kubelet, override the search domains directly in the affected pod's dnsConfig:

    spec:
      dnsConfig:
        searches:
        - production.svc.cluster.local
        - svc.cluster.local
        - cluster.local

    After applying the patch and restarting pods, confirm the resolv.conf search line is correct and that short-name resolution succeeds without using a FQDN.


    Root Cause 6: CoreDNS Resource Exhaustion

    Why It Happens

    Under heavy load — many pods, high service churn, or applications that issue a high volume of external DNS lookups per second — CoreDNS can exhaust its CPU or memory limits and begin dropping queries or responding with SERVFAIL. Kubernetes resource limits set too low for the cluster's actual scale are the most common cause. Because CoreDNS is a shared service, a single overloaded instance affects every workload simultaneously, making the problem appear cluster-wide and unrelated to any single application.

    How to Identify It

    Check real-time resource consumption for CoreDNS pods:

    kubectl top pods -n kube-system -l k8s-app=kube-dns
    
    NAME                       CPU(cores)   MEMORY(bytes)
    coredns-5d78c9869d-4xkzp   490m         98Mi
    coredns-5d78c9869d-r9p2l   495m         95Mi

    Both pods are running at 490–495m against a 500m CPU limit and are being throttled. Confirm the current limits:

    kubectl get deployment coredns -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[0].resources}'
    
    {"limits":{"memory":"170Mi","cpu":"500m"},"requests":{"cpu":"100m","memory":"70Mi"}}

    How to Fix It

    Increase the resource limits and add a third replica to spread load across more pods:

    kubectl patch deployment coredns -n kube-system --type=json \
      -p='[
        {"op":"replace","path":"/spec/replicas","value":3},
        {"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/cpu","value":"1000m"},
        {"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"256Mi"}
      ]'

    A practical scaling guideline for CoreDNS is one replica per 500 pods in the cluster, with a minimum of two replicas for high availability. Always place replicas on different nodes using a podAntiAffinity rule.
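
    The guideline reduces to simple arithmetic; a minimal sketch:

```shell
# Sketch of the guideline above: max(2, ceil(pod_count / 500)) CoreDNS replicas.
coredns_replicas() {
  local pods="$1"
  local r=$(( (pods + 499) / 500 ))  # ceiling division by 500
  if [ "$r" -lt 2 ]; then r=2; fi    # never fewer than two for availability
  echo "$r"
}

coredns_replicas 300    # small cluster: floor of 2 replicas
coredns_replicas 2600   # larger cluster: 6 replicas
```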


    Root Cause 7: kube-proxy or iptables Rules Broken

    Why It Happens

    The CoreDNS service (typically at 10.96.0.10) is a ClusterIP service. Traffic to it is intercepted and redirected by kube-proxy via iptables or IPVS rules. If kube-proxy has crashed or if iptables rules have been flushed — which can happen during node maintenance, OS upgrades, or when firewall management tools conflict with Kubernetes — pods cannot reach the DNS service IP even when CoreDNS pods are perfectly healthy. The symptom is DNS timing out at the network level with no log output from CoreDNS at all, because the packets never arrive.

    How to Identify It

    Check whether kube-proxy is running on affected nodes:

    kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
    
    NAME               READY   STATUS    RESTARTS   AGE   NODE
    kube-proxy-9xzpk   0/1     Error     5          6m    sw-infrarunbook-01
    kube-proxy-m4qr2   1/1     Running   0          6d    10.0.1.22

    Check whether the iptables DNAT rule for the DNS service exists on the affected node:

    ssh infrarunbook-admin@sw-infrarunbook-01 \
      "sudo iptables -t nat -L KUBE-SERVICES | grep 10.96.0.10"
    
    # No output — rule is missing

    A node with a healthy kube-proxy would show:

    KUBE-SVC-ERIFXISQEP7F7OF4  udp  --  anywhere  10.96.0.10  \
      /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain

    How to Fix It

    kube-proxy runs as a DaemonSet. Delete the failing pod and let it be recreated automatically, which will re-sync all iptables rules:

    kubectl delete pod kube-proxy-9xzpk -n kube-system
    
    pod "kube-proxy-9xzpk" deleted

    Wait for the new pod to become ready and verify the iptables rule is restored:

    kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
    
    NAME               READY   STATUS    RESTARTS   AGE   NODE
    kube-proxy-7bnjq   1/1     Running   0          30s   sw-infrarunbook-01
    
    ssh infrarunbook-admin@sw-infrarunbook-01 \
      "sudo iptables -t nat -L KUBE-SERVICES | grep 10.96.0.10"
    
    KUBE-SVC-ERIFXISQEP7F7OF4  udp  --  anywhere  10.96.0.10  \
      /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain

    Prevention

    • Monitor CoreDNS with Prometheus. CoreDNS exposes metrics on port 9153. Alert on elevated coredns_dns_responses_total{rcode="SERVFAIL"} rates and high coredns_dns_request_duration_seconds p99 latency. Catch problems before they affect workloads.
    • Always include DNS egress rules in NetworkPolicy templates. Standardise on a base NetworkPolicy Helm chart that includes the port 53 UDP and TCP egress allowance for every namespace. Treat it as required boilerplate alongside the default-deny policy, never as an optional add-on.
    • Validate Corefile changes before applying. Use the CoreDNS binary's -conf flag in a CI pipeline step to syntax-check Corefile changes against a running CoreDNS container before any kubectl apply. A broken Corefile applied to production takes down DNS cluster-wide.
    • Set realistic resource limits for your cluster scale. Revisit CoreDNS CPU and memory limits whenever cluster pod count grows significantly. One CoreDNS replica per 500 pods with at least 500m CPU per replica is a safe baseline under typical load.
    • Use ndots:5 by default; only override with intent and testing. If you reduce ndots for external DNS performance, test thoroughly with both short in-cluster service names and fully-qualified external names before rolling out cluster-wide.
    • Pin cluster domain before first workload; never change it live. Changing clusterDomain after cluster creation requires coordinated kubelet restarts across every node — a high-risk operation. Decide on cluster.local or a custom domain at bootstrap time.
    • Run DNS smoke tests in CI/CD pipelines. Add a post-deployment step that launches a busybox pod and resolves a known in-cluster service name. Fail the pipeline if resolution takes more than 200ms or returns an error, catching NetworkPolicy and ConfigMap regressions before they reach production.
    • Use PodTopologySpread for CoreDNS. Ensure CoreDNS replicas are spread across multiple nodes so a single node failure does not take down all DNS pods simultaneously.
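
    For the CI/CD smoke test's latency gate, the check that parses dig output and enforces a millisecond budget can be sketched as:

```shell
# Sketch: fail a CI step if the dig-reported query time exceeds a budget in ms.
# Takes the raw dig output as the first argument and the budget as the second.
dns_latency_ok() {
  local dig_output="$1" budget_ms="$2"
  local ms
  # Extract the number from the ";; Query time: N msec" line.
  ms=$(printf '%s\n' "$dig_output" | sed -n 's/^;; Query time: \([0-9][0-9]*\) msec.*/\1/p')
  [ -n "$ms" ] && [ "$ms" -le "$budget_ms" ]
}

# Usage in a pipeline step (dig run inside a debug pod as shown elsewhere):
#   out=$(kubectl exec dns-debug -- dig kubernetes.default.svc.cluster.local +stats)
#   dns_latency_ok "$out" 200 || { echo "DNS too slow" >&2; exit 1; }
```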

    Frequently Asked Questions

    Q: How do I quickly test if DNS is working inside a specific pod without modifying it?

    A: Use kubectl exec directly: kubectl exec -it <pod-name> -n <namespace> -- nslookup kubernetes.default. If the pod's image does not include nslookup, use cat /etc/resolv.conf to check the nameserver and search configuration, then test connectivity with wget -qO- --timeout=3 http://kubernetes.default to verify both resolution and routing in one step.

    Q: What is the default CoreDNS service IP in a Kubernetes cluster?

    A: By default it is the tenth IP in the service CIDR. For a cluster with service CIDR 10.96.0.0/12, the DNS service is at 10.96.0.10. You can confirm it with kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'.

    Q: Why does DNS work for some pods but not others in the same namespace?

    A: The most common causes are per-pod NetworkPolicy selectors that allow DNS egress only for pods with specific labels, or pods that were scheduled before a Corefile reload took effect and have a stale nameserver entry. Compare the /etc/resolv.conf contents and pod labels between working and failing pods — differences there will point to the cause.

    Q: What is the difference between NXDOMAIN and SERVFAIL?

    A: NXDOMAIN means the DNS server authoritative for the zone confirmed the name does not exist — typically a typo in the service name, wrong namespace, or the service was never created. SERVFAIL means the DNS server encountered an error while trying to resolve the query — typically an unreachable upstream, a loop in the Corefile, or a broken forward rule. SERVFAIL indicates a CoreDNS infrastructure problem; NXDOMAIN indicates the name itself is wrong.

    Q: Is it safe to restart CoreDNS pods during production hours?

    A: Yes, as long as you have more than one replica. A rolling restart replaces pods one at a time while keeping the remaining replicas serving queries. Run kubectl rollout restart deployment/coredns -n kube-system and monitor with kubectl rollout status deployment/coredns -n kube-system. The restart completes with zero DNS downtime for a correctly sized deployment.

    Q: Why does external DNS resolution work but in-cluster service discovery does not?

    A: This pattern means CoreDNS is running and forwarding queries upstream correctly, but the kubernetes plugin is not resolving in-cluster names. Check that the kubernetes block in the Corefile specifies the correct zone (cluster.local) and that the pod's /etc/resolv.conf search line includes svc.cluster.local. A mismatch between the kubelet's clusterDomain and the CoreDNS zone is the most common cause.

    Q: How many CoreDNS replicas should I run?

    A: A minimum of two replicas for high availability on any production cluster. For larger clusters, scale at roughly one replica per 500 pods. Use a PodTopologySpread constraint or podAntiAffinity rule to ensure replicas land on different nodes, preventing a single node failure from eliminating all DNS capacity simultaneously.

    Q: Can I use dnsPolicy: None to fully control DNS in a pod?

    A: Yes. Setting dnsPolicy: None disables all automatic DNS configuration and requires you to supply the full dnsConfig block with explicit nameservers, search domains, and ndots options. This is useful for pods that must bypass CoreDNS entirely and query an external resolver directly, but any mistake in the dnsConfig will leave the pod with no working DNS at all — there is no fallback.

    Q: How do I debug slow DNS resolution without disrupting running pods?

    A: Deploy a dedicated debug pod: kubectl run dns-debug --image=tutum/dnsutils --restart=Never -it --rm -- bash. Then use dig with timing stats: dig @10.96.0.10 kubernetes.default.svc.cluster.local +stats. The ;; Query time: line shows actual latency. Anything above 5ms for in-cluster names under normal conditions suggests CoreDNS is throttled or the iptables path has a problem.

    Q: Can I add custom DNS entries for in-cluster names that do not correspond to real Services?

    A: Yes. Use the CoreDNS hosts plugin to add static A records directly in the Corefile, or deploy an internal-only zone using the file plugin pointing to a zone file stored in a ConfigMap. The hosts plugin is simpler for a small number of static overrides; the file plugin scales better for many records and supports full zone management including PTR records.

    Q: After fixing DNS, how do I verify that all existing pods have picked up the change?

    A: For ConfigMap and CoreDNS-level fixes, existing pods automatically benefit because their nameserver IP (10.96.0.10) does not change — only the CoreDNS behaviour does. For fixes that required changing the pod's own dnsConfig or for ndots and search domain corrections, pods must be restarted to receive the updated /etc/resolv.conf. Rolling restart the affected Deployments with kubectl rollout restart deployment/<name> -n <namespace>.

    Q: How do I find which pods are currently failing DNS resolution cluster-wide without checking each one individually?

    A: Deploy a DaemonSet running a DNS check loop on every node, or use a tool such as kubectl-debug on sampled nodes. A faster triage approach is to check CoreDNS error metrics directly: kubectl exec -n kube-system <coredns-pod> -- wget -qO- http://localhost:9153/metrics | grep coredns_dns_responses_total and filter for rcode="SERVFAIL" or rcode="NXDOMAIN". A spike in either counter cluster-wide confirms the scope of the problem before you start checking individual pods.
