DNS resolution failures inside a Kubernetes cluster are among the most disruptive and difficult-to-diagnose problems an infrastructure engineer will face. A pod that cannot resolve service names or external hostnames will fail silently or throw cryptic connection errors, making the root cause hard to pinpoint at first glance. This runbook walks through every common reason DNS stops working inside a cluster, complete with real diagnostic commands, actual error output, and step-by-step remediation for each cause.
Symptoms
When DNS resolution breaks inside a Kubernetes cluster, pods typically surface one or more of the following symptoms:
- Application pods crash-loop with `dial tcp: lookup <service>: no such host` or similar resolver errors
- `nslookup` or `dig` inside a pod returns `SERVFAIL`, `NXDOMAIN`, or times out entirely
- Services reachable by IP continue to work, but hostname-based connections fail
- `kubectl exec` into a debug pod shows `connection timed out; no servers could be reached`
- Intermittent DNS failures under load, suggesting resource exhaustion rather than complete failure
- External DNS works but in-cluster service discovery does not, or vice versa
- New pods fail to resolve immediately after deployment while older pods are unaffected
Start every DNS investigation with a quick baseline test from inside an affected pod:
kubectl run dns-debug --image=busybox:1.36 --restart=Never -it --rm -- nslookup kubernetes.default
If this times out or returns SERVFAIL, CoreDNS itself is the problem. If it returns a valid answer but your application's target hostname does not resolve, the issue is more likely related to search domains, ndots settings, or ConfigMap forwarding rules.
Root Cause 1: CoreDNS Pod Not Running
Why It Happens
CoreDNS runs as a Deployment inside the `kube-system` namespace. If the pods crash, get evicted due to node memory pressure, or are accidentally deleted, all in-cluster DNS resolution stops immediately. A bad ConfigMap update that causes CoreDNS to panic on startup, a broken container image reference, or a node that has run out of resources can all leave the cluster without any DNS pods. Because CoreDNS is a shared infrastructure component, a single Deployment outage affects every workload in every namespace simultaneously.
How to Identify It
Check the state of the CoreDNS Deployment and its pods:
kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-5d78c9869d-4xkzp 0/1 CrashLoopBackOff 8 12m
coredns-5d78c9869d-r9p2l 0/1 CrashLoopBackOff 8 12m
Pull the logs from a crashing pod to understand what is going wrong:
kubectl logs -n kube-system coredns-5d78c9869d-4xkzp --previous
[ERROR] plugin/errors: 2 SERVFAIL (incomplete response) : no upstream is available
plugin/reload: Running configuration md5 = 3a7bc4f1d9820e13
[FATAL] Failed to initialize server: listen tcp 0.0.0.0:53: bind: address already in use
Also verify the Deployment itself is reporting healthy conditions:
kubectl describe deployment coredns -n kube-system | grep -A5 "Conditions:"
Conditions:
Type Status Reason
---- ------ ------
Available False MinimumReplicasUnavailable
Progressing True ReplicaSetUpdated
How to Fix It
If the pods are in CrashLoopBackOff due to a bad configuration, roll back the ConfigMap first (see Root Cause 2). If the pods are simply missing or evicted, scale the Deployment back up:
kubectl scale deployment coredns -n kube-system --replicas=2
If the container image cannot be pulled, verify the image reference and update it to a valid tag:
kubectl set image deployment/coredns -n kube-system \
coredns=registry.k8s.io/coredns/coredns:v1.11.1
After making changes, watch the rollout until it completes:
kubectl rollout status deployment/coredns -n kube-system
Waiting for deployment "coredns" rollout to finish: 1 out of 2 new replicas have been updated...
deployment "coredns" successfully rolled out
Confirm DNS is restored by re-running the baseline nslookup test from a pod in an affected namespace.
Root Cause 2: CoreDNS ConfigMap Misconfigured
Why It Happens
CoreDNS reads its configuration from a ConfigMap named `coredns` in the `kube-system` namespace. A typo in the `Corefile`, a missing plugin directive, a malformed forward address, or an incorrect zone block will prevent CoreDNS from loading its configuration — causing pods to crash or silently drop queries. This ConfigMap is often edited manually during cluster customisation to add custom stub zones, change upstream resolvers, or tune caching, making it a frequent source of breakage. Pointing the `forward` directive at the CoreDNS service IP itself creates an infinite loop and is one of the most common self-inflicted failures.
How to Identify It
Inspect the current ConfigMap:
kubectl get configmap coredns -n kube-system -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . 10.96.0.10 {
            max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
In the output above, the `forward` directive points to `10.96.0.10` — which is the CoreDNS service itself. Every external query will loop back into CoreDNS, and the `loop` plugin will detect this and crash the pod. Another common mistake is removing the `cluster.local` zone from the `kubernetes` block, which stops all in-cluster service discovery.
Test the Corefile syntax before applying it by running CoreDNS with the `-dns.port 0` flag to perform a dry-run parse:
docker run --rm -v $(pwd)/Corefile:/Corefile \
registry.k8s.io/coredns/coredns:v1.11.1 \
-conf /Corefile -dns.port 0 2>&1
.:53
CoreDNS-1.11.1
linux/amd64, go1.21.1, 1b5f4a0
If there is a syntax error, CoreDNS will print it and exit non-zero before you have applied anything.
How to Fix It
Edit the ConfigMap and restore a known-good Corefile:
kubectl edit configmap coredns -n kube-system
A safe baseline Corefile for a standard cluster with upstream resolvers at `10.0.0.1`:
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . 10.0.0.1
    cache 30
    loop
    reload
    loadbalance
}
To forward queries for `solvethenetwork.com` to an internal resolver at `10.0.0.53`, add a dedicated stub zone block above the catch-all:
solvethenetwork.com:53 {
    errors
    forward . 10.0.0.53
    cache 30
}

.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . 10.0.0.1
    cache 30
    loop
    reload
    loadbalance
}
After saving, CoreDNS detects the change via the `reload` plugin without requiring a pod restart. Verify it picked up the new configuration:
kubectl logs -n kube-system -l k8s-app=kube-dns | grep reload
[INFO] Reloading
[INFO] plugin/reload: Running configuration md5 = d41d8cd98f00b204
[INFO] Reloading complete
Root Cause 3: NetworkPolicy Blocking Port 53
Why It Happens
Kubernetes NetworkPolicy objects restrict pod-to-pod and pod-to-external traffic. A strict default-deny egress policy applied to application namespaces will block pods from reaching the CoreDNS service on port 53 over both UDP and TCP. Engineers often add default-deny NetworkPolicies to tighten namespace security without realising they have also cut off DNS. The symptom is pods that can reach other pods by direct IP address but cannot resolve any hostname — not even `kubernetes.default` — because every DNS query is silently dropped before it reaches CoreDNS.
How to Identify It
List NetworkPolicies in the affected namespace and look for any that restrict egress:
kubectl get networkpolicy -n production
NAME POD-SELECTOR AGE
default-deny <none> 3d
allow-internal app=api 3d
Inspect the default-deny policy:
kubectl get networkpolicy default-deny -n production -o yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
This policy blocks all egress traffic from every pod in the namespace, including DNS queries to CoreDNS. Confirm it is the cause by running a debug pod in a namespace without NetworkPolicies and verifying DNS works there:
kubectl run dns-debug --image=busybox:1.36 --restart=Never \
-it --rm -n kube-system -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
DNS works in `kube-system` but fails in `production` — the NetworkPolicy is blocking port 53 egress.
How to Fix It
Add an explicit egress rule allowing UDP and TCP on port 53 targeted at the `kube-system` namespace where CoreDNS runs. Using a namespaceSelector is more portable than hardcoding the CoreDNS service CIDR:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
    to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
Apply it and verify resolution is restored:
kubectl apply -f allow-dns-egress.yaml
networkpolicy.networking.k8s.io/allow-dns-egress created
kubectl exec -it <pod-name> -n production -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
Always include this DNS egress rule in any namespace that uses a default-deny egress policy. Treat it as mandatory boilerplate alongside the policy itself.
Root Cause 4: ndots Misconfigured
Why It Happens
The `ndots` option in `/etc/resolv.conf` controls how many dots must appear in a query name before the resolver treats it as a fully qualified domain name (FQDN) and sends it as-is, without appending search domains. Kubernetes sets `ndots:5` by default, meaning a hostname like `api.solvethenetwork.com` (only two dots) will first be tried with each search domain appended before the bare name is attempted. This is intentional: it ensures short in-cluster service names like `payments` resolve correctly by appending `.production.svc.cluster.local` automatically.
The problem arises when someone overrides `ndots` to a low value such as 1 in an attempt to reduce DNS lookup latency for external FQDNs. With `ndots:1`, the resolver treats any name containing at least one dot as fully qualified and sends it directly without appending search domains. Single-label names still work, but dotted in-cluster shorthand such as `payments.production` stops resolving, because it is sent upstream as-is instead of being expanded to `payments.production.svc.cluster.local`.
How to Identify It
Inspect `/etc/resolv.conf` inside an affected pod:
kubectl exec -it <pod-name> -n production -- cat /etc/resolv.conf
search production.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:1
With `ndots:1`, a single-label name such as `payments` (zero dots) still goes through the search list, so it usually keeps resolving. Any name containing at least one dot, however (for example the cross-namespace form `payments.production`), meets the threshold and is sent upstream as-is, bypassing the search list entirely. Confirm the behaviour with `dig`, passing `+search` because dig ignores the resolv.conf search list by default:
kubectl exec -it <pod-name> -n production -- dig +search payments.production
;; QUESTION SECTION:
;payments.production. IN A
;; AUTHORITY SECTION:
. 86399 IN SOA a.root-servers.net. ...
;; Query time: 234 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
The query reached CoreDNS, which forwarded it upstream to the root servers rather than resolving it as an in-cluster service. With the default `ndots:5`, the resolver would work through the search list, reach `payments.production.svc.cluster.local`, and get an immediate answer.
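The expansion logic the resolver applies can be sketched as a small bash function. This is a simplified, illustrative model of glibc-style search handling, not real tooling; the function name `expand_query` is hypothetical and it assumes bash:

```shell
# Print the candidate names a resolver would try for a given name,
# ndots threshold, and search-domain list (simplified model).
expand_query() {
  local name=$1 ndots=$2
  shift 2
  local dots=${name//[^.]/}   # keep only the dots so we can count them
  # Enough dots: the name is tried as an absolute query first.
  if [ "${#dots}" -ge "$ndots" ]; then
    echo "${name}."
  fi
  # Search domains are appended in resolv.conf order.
  local domain
  for domain in "$@"; do
    echo "${name}.${domain}."
  done
  # Too few dots: the bare name is only tried last.
  if [ "${#dots}" -lt "$ndots" ]; then
    echo "${name}."
  fi
}

# With ndots:1, payments.production goes upstream as-is first:
expand_query payments.production 1 \
  production.svc.cluster.local svc.cluster.local cluster.local
```

With an ndots of 5, the search domains come first instead, so `payments.production.svc.cluster.local.` is tried (and answered by CoreDNS) before the bare name is ever sent upstream.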
How to Fix It
Set the correct `ndots` value in the pod's `dnsConfig`. The Kubernetes default of 5 is correct for most workloads. If external DNS resolution latency is a concern, a value of 2 is a reasonable compromise — it still forces single-label names through the search list while allowing typical external FQDNs (which have two or more dots) to bypass it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
  namespace: production
spec:
  template:
    spec:
      dnsConfig:
        options:
        - name: ndots
          value: "5"
      containers:
      - name: api
        image: nginx:1.25
For an already-running Deployment, patch it directly:
kubectl patch deployment api-deployment -n production --type=json \
-p='[{"op":"add","path":"/spec/template/spec/dnsConfig","value":{"options":[{"name":"ndots","value":"5"}]}}]'
Confirm the resolv.conf inside the restarted pods reflects the correct value:
kubectl exec -it <pod-name> -n production -- cat /etc/resolv.conf
search production.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
Root Cause 5: Search Domain Wrong
Why It Happens
Kubernetes automatically populates the `search` line in each pod's `/etc/resolv.conf` based on the pod's namespace and the cluster domain. The expected entries are `<namespace>.svc.cluster.local`, `svc.cluster.local`, and `cluster.local`. This cluster domain is configured via the kubelet's `--cluster-domain` flag (or the equivalent `clusterDomain` field in the kubelet config file). If the cluster was bootstrapped with a non-standard domain such as `cluster.internal`, but CoreDNS was left configured with `cluster.local`, the search domains injected into pods will not match the zone CoreDNS is authoritative for, and all in-cluster service lookups will silently fail. A misconfigured Helm chart or operator can also override the `dnsConfig.searches` field, replacing the correct search domains with something invalid.
How to Identify It
Check the search domains in the pod and compare them to what CoreDNS serves:
kubectl exec -it <pod-name> -n production -- cat /etc/resolv.conf
search production.svc.cluster.internal svc.cluster.internal cluster.internal
nameserver 10.96.0.10
options ndots:5
Now check the CoreDNS ConfigMap zone declaration:
kubectl get configmap coredns -n kube-system \
-o jsonpath='{.data.Corefile}' | grep kubernetes
kubernetes cluster.local in-addr.arpa ip6.arpa {
The pod is searching `cluster.internal` but CoreDNS is authoritative only for `cluster.local`. Every short service name lookup appends the wrong suffix and receives NXDOMAIN. Confirm the mismatch by testing the fully qualified name, which resolves even when the search domains are wrong:
kubectl exec -it <pod-name> -n production -- \
nslookup kubernetes.default.svc.cluster.local
Server: 10.96.0.10
Address 1: 10.96.0.10
Name: kubernetes.default.svc.cluster.local
Address 1: 10.96.0.1
The FQDN resolves correctly, but `nslookup kubernetes.default` (which appends `.production.svc.cluster.internal`) fails. Search domain mismatch is confirmed.
How to Fix It
The correct fix depends on where the mismatch originated. First, check the kubelet configuration on the affected node:
ssh infrarunbook-admin@sw-infrarunbook-01 \
"cat /var/lib/kubelet/config.yaml | grep clusterDomain"
clusterDomain: cluster.internal
If the kubelet is using the wrong domain, update it on every node and restart the kubelet. This is a cluster-wide change requiring a rolling node update:
ssh infrarunbook-admin@sw-infrarunbook-01 \
"sudo sed -i 's/cluster.internal/cluster.local/' \
/var/lib/kubelet/config.yaml && sudo systemctl restart kubelet"
For an immediate workaround without touching the kubelet, override the search domains directly in the affected pod's `dnsConfig`:
spec:
  dnsConfig:
    searches:
    - production.svc.cluster.local
    - svc.cluster.local
    - cluster.local
After applying the patch and restarting pods, confirm the resolv.conf search line is correct and that short-name resolution succeeds without using a FQDN.
Root Cause 6: CoreDNS Resource Exhaustion
Why It Happens
Under heavy load — many pods, high service churn, or applications that issue a high volume of external DNS lookups per second — CoreDNS can exhaust its CPU or memory limits and begin dropping queries or responding with SERVFAIL. Kubernetes resource limits set too low for the cluster's actual scale are the most common cause. Because CoreDNS is a shared service, a single overloaded instance affects every workload simultaneously, making the problem appear cluster-wide and unrelated to any single application.
How to Identify It
Check real-time resource consumption for CoreDNS pods:
kubectl top pods -n kube-system -l k8s-app=kube-dns
NAME CPU(cores) MEMORY(bytes)
coredns-5d78c9869d-4xkzp 490m 98Mi
coredns-5d78c9869d-r9p2l 495m 95Mi
Both pods are running at 490–495m against a 500m CPU limit and are being throttled. Confirm the current limits:
kubectl get deployment coredns -n kube-system \
-o jsonpath='{.spec.template.spec.containers[0].resources}'
{"limits":{"memory":"170Mi","cpu":"500m"},"requests":{"cpu":"100m","memory":"70Mi"}}
How to Fix It
Increase the resource limits and add a third replica to spread load across more pods:
kubectl patch deployment coredns -n kube-system --type=json \
-p='[
{"op":"replace","path":"/spec/replicas","value":3},
{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/cpu","value":"1000m"},
{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"256Mi"}
]'
A practical scaling guideline for CoreDNS is one replica per 500 pods in the cluster, with a minimum of two replicas for high availability. Always place replicas on different nodes using a `podAntiAffinity` rule.
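The anti-affinity fragment might look like the sketch below, patched into the CoreDNS Deployment's pod template. This is a sketch, not a verified manifest: the `k8s-app: kube-dns` label matches the selector used elsewhere in this runbook, but confirm it against your own cluster before applying:

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  k8s-app: kube-dns
              topologyKey: kubernetes.io/hostname
```

Preferred (rather than required) anti-affinity lets the scheduler still place a second replica on a shared node if the cluster temporarily has only one schedulable node.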
Root Cause 7: kube-proxy or iptables Rules Broken
Why It Happens
The CoreDNS service (typically at `10.96.0.10`) is a ClusterIP service. Traffic to it is intercepted and redirected by kube-proxy via iptables or IPVS rules. If kube-proxy has crashed or if iptables rules have been flushed — which can happen during node maintenance, OS upgrades, or when firewall management tools conflict with Kubernetes — pods cannot reach the DNS service IP even when CoreDNS pods are perfectly healthy. The symptom is DNS timing out at the network level with no log output from CoreDNS at all, because the packets never arrive.
How to Identify It
Check whether kube-proxy is running on affected nodes:
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
NAME READY STATUS RESTARTS AGE NODE
kube-proxy-9xzpk 0/1 Error 5 6m sw-infrarunbook-01
kube-proxy-m4qr2 1/1 Running 0 6d 10.0.1.22
Check whether the iptables DNAT rule for the DNS service exists on the affected node:
ssh infrarunbook-admin@sw-infrarunbook-01 \
"sudo iptables -t nat -L KUBE-SERVICES | grep 10.96.0.10"
# No output — rule is missing
A node with a healthy kube-proxy would show:
KUBE-SVC-ERIFXISQEP7F7OF4 udp -- anywhere 10.96.0.10 \
/* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
How to Fix It
kube-proxy runs as a DaemonSet. Delete the failing pod and let it be recreated automatically, which will re-sync all iptables rules:
kubectl delete pod kube-proxy-9xzpk -n kube-system
pod "kube-proxy-9xzpk" deleted
Wait for the new pod to become ready and verify the iptables rule is restored:
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
NAME READY STATUS RESTARTS AGE NODE
kube-proxy-7bnjq 1/1 Running 0 30s sw-infrarunbook-01
ssh infrarunbook-admin@sw-infrarunbook-01 \
"sudo iptables -t nat -L KUBE-SERVICES | grep 10.96.0.10"
KUBE-SVC-ERIFXISQEP7F7OF4 udp -- anywhere 10.96.0.10 \
/* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
Prevention
- Monitor CoreDNS with Prometheus. CoreDNS exposes metrics on port 9153. Alert on elevated `coredns_dns_responses_total{rcode="SERVFAIL"}` rates and high `coredns_dns_request_duration_seconds` p99 latency to catch problems before they affect workloads.
- Always include DNS egress rules in NetworkPolicy templates. Standardise on a base NetworkPolicy Helm chart that includes the port 53 UDP and TCP egress allowance for every namespace. Treat it as required boilerplate alongside the default-deny policy, never as an optional add-on.
- Validate Corefile changes before applying. Use the CoreDNS binary's `-conf` flag in a CI pipeline step to syntax-check Corefile changes against a running CoreDNS container before any `kubectl apply`. A broken Corefile applied to production takes down DNS cluster-wide.
- Set realistic resource limits for your cluster scale. Revisit CoreDNS CPU and memory limits whenever cluster pod count grows significantly. One CoreDNS replica per 500 pods with at least 500m CPU per replica is a safe baseline under typical load.
- Use ndots:5 by default; only override with intent and testing. If you reduce ndots for external DNS performance, test thoroughly with both short in-cluster service names and fully-qualified external names before rolling out cluster-wide.
- Pin the cluster domain before the first workload; never change it live. Changing `clusterDomain` after cluster creation requires coordinated kubelet restarts across every node — a high-risk operation. Decide on `cluster.local` or a custom domain at bootstrap time.
- Run DNS smoke tests in CI/CD pipelines. Add a post-deployment step that launches a busybox pod and resolves a known in-cluster service name. Fail the pipeline if resolution takes more than 200ms or returns an error, catching NetworkPolicy and ConfigMap regressions before they reach production.
- Use PodTopologySpread for CoreDNS. Ensure CoreDNS replicas are spread across multiple nodes so a single node failure does not take down all DNS pods simultaneously.
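The smoke-test step above can be sketched as a short Kubernetes Job that a pipeline applies and then waits on. The Job name and deadline here are illustrative assumptions, not fixed conventions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: dns-smoke-test          # hypothetical name
spec:
  backoffLimit: 0               # one attempt; any failure fails the Job
  activeDeadlineSeconds: 30     # treat slow resolution as a failure
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: check
        image: busybox:1.36
        # nslookup exits non-zero if the in-cluster name fails to resolve
        command: ["nslookup", "kubernetes.default"]
```

The pipeline then gates on `kubectl wait --for=condition=complete job/dns-smoke-test --timeout=60s`, which exits non-zero if resolution failed or the deadline was exceeded.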
Frequently Asked Questions
Q: How do I quickly test if DNS is working inside a specific pod without modifying it?
A: Use `kubectl exec` directly: `kubectl exec -it <pod-name> -n <namespace> -- nslookup kubernetes.default`. If the pod's image does not include nslookup, use `cat /etc/resolv.conf` to check the nameserver and search configuration, then test with `wget -qO- --timeout=3 https://kubernetes.default` — a certificate or connection error still proves the name resolved, while a bad address error means DNS itself failed.
Q: What is the default CoreDNS service IP in a Kubernetes cluster?
A: By default it is the tenth IP in the service CIDR. For a cluster with service CIDR `10.96.0.0/12`, the DNS service is at `10.96.0.10`. You can confirm it with `kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'`.
Q: Why does DNS work for some pods but not others in the same namespace?
A: The most common causes are per-pod NetworkPolicy selectors that allow DNS egress only for pods with specific labels, or pods that were scheduled before a Corefile reload took effect and have a stale nameserver entry. Compare the `/etc/resolv.conf` contents and pod labels between working and failing pods — differences there will point to the cause.
Q: What is the difference between NXDOMAIN and SERVFAIL?
A: NXDOMAIN means the DNS server authoritative for the zone confirmed the name does not exist — typically a typo in the service name, wrong namespace, or the service was never created. SERVFAIL means the DNS server encountered an error while trying to resolve the query — typically an unreachable upstream, a loop in the Corefile, or a broken forward rule. SERVFAIL indicates a CoreDNS infrastructure problem; NXDOMAIN indicates the name itself is wrong.
Q: Is it safe to restart CoreDNS pods during production hours?
A: Yes, as long as you have more than one replica. A rolling restart replaces pods one at a time while keeping the remaining replicas serving queries. Run `kubectl rollout restart deployment/coredns -n kube-system` and monitor with `kubectl rollout status deployment/coredns -n kube-system`. The restart completes with zero DNS downtime for a correctly sized deployment.
Q: Why does external DNS resolution work but in-cluster service discovery does not?
A: This pattern means CoreDNS is running and forwarding queries upstream correctly, but the `kubernetes` plugin is not resolving in-cluster names. Check that the `kubernetes` block in the Corefile specifies the correct zone (`cluster.local`) and that the pod's `/etc/resolv.conf` search line includes `svc.cluster.local`. A mismatch between the kubelet's `clusterDomain` and the CoreDNS zone is the most common cause.
Q: How many CoreDNS replicas should I run?
A: A minimum of two replicas for high availability on any production cluster. For larger clusters, scale at roughly one replica per 500 pods. Use a `PodTopologySpread` constraint or `podAntiAffinity` rule to ensure replicas land on different nodes, preventing a single node failure from eliminating all DNS capacity simultaneously.
Q: Can I use dnsPolicy: None to fully control DNS in a pod?
A: Yes. Setting `dnsPolicy: None` disables all automatic DNS configuration and requires you to supply the full `dnsConfig` block with explicit nameservers, search domains, and ndots options. This is useful for pods that must bypass CoreDNS entirely and query an external resolver directly, but any mistake in the `dnsConfig` will leave the pod with no working DNS at all — there is no fallback.
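A minimal sketch of such a pod spec. The resolver address and search domain are placeholder assumptions — substitute your own:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: external-resolver-pod   # hypothetical name
spec:
  dnsPolicy: None               # disable all automatic DNS configuration
  dnsConfig:
    nameservers:
    - 10.0.0.53                 # assumed external resolver
    searches:
    - example.internal          # assumed internal search domain
    options:
    - name: ndots
      value: "1"
  containers:
  - name: app
    image: nginx:1.25
```

Note that this pod can no longer resolve in-cluster service names at all, since its queries never reach CoreDNS.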
Q: How do I debug slow DNS resolution without disrupting running pods?
A: Deploy a dedicated debug pod: `kubectl run dns-debug --image=tutum/dnsutils --restart=Never -it --rm -- bash`. Then use `dig` with timing stats: `dig @10.96.0.10 kubernetes.default.svc.cluster.local +stats`. The `;; Query time:` line shows actual latency. Anything above 5ms for in-cluster names under normal conditions suggests CoreDNS is throttled or the iptables path has a problem.
Q: Can I add custom DNS entries for in-cluster names that do not correspond to real Services?
A: Yes. Use the CoreDNS `hosts` plugin to add static A records directly in the Corefile, or deploy an internal-only zone using the `file` plugin pointing to a zone file stored in a ConfigMap. The `hosts` plugin is simpler for a small number of static overrides; the `file` plugin scales better for many records and supports full zone management including PTR records.
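A minimal sketch of the `hosts` approach. The hostnames and addresses below are hypothetical placeholders, and `fallthrough` lets non-matching queries continue to the rest of the plugin chain:

```
.:53 {
    hosts {
        10.0.0.40 legacy-db.example.internal
        10.0.0.41 reports.example.internal
        fallthrough
    }
    # ...remainder of the standard Corefile (kubernetes, forward, cache, ...)
}
```

Because the `hosts` block sits before `forward` in the chain, the static entries answer immediately while everything else resolves as before.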
Q: After fixing DNS, how do I verify that all existing pods have picked up the change?
A: For ConfigMap and CoreDNS-level fixes, existing pods automatically benefit because their nameserver IP (`10.96.0.10`) does not change — only the CoreDNS behaviour does. For fixes that required changing the pod's own `dnsConfig`, or for ndots and search domain corrections, pods must be restarted to receive the updated `/etc/resolv.conf`. Rolling restart the affected Deployments with `kubectl rollout restart deployment/<name> -n <namespace>`.
Q: How do I find which pods are currently failing DNS resolution cluster-wide without checking each one individually?
A: Deploy a DaemonSet running a DNS check loop to every node, or use a tool such as `kubectl-debug` on sampled nodes. A faster triage approach is to check CoreDNS error metrics directly: `kubectl exec -n kube-system <coredns-pod> -- wget -qO- http://localhost:9153/metrics | grep coredns_dns_responses_total` and filter for `rcode="SERVFAIL"` or `rcode="NXDOMAIN"`. A spike in either counter cluster-wide confirms the scope of the problem before you start checking individual pods.
