DNS resolution failures inside a Kubernetes cluster are among the most disruptive and difficult-to-diagnose problems an infrastructure engineer will face. A pod that cannot resolve service names or external hostnames will fail silently or throw cryptic connection errors, making the root cause hard to pinpoint at first glance. This runbook walks through every common reason DNS stops working inside a cluster, complete with real diagnostic commands, actual error output, and step-by-step remediation for each cause.
Symptoms
When DNS resolution breaks inside a Kubernetes cluster, pods typically surface one or more of the following symptoms:
- Application pods crash-loop with `dial tcp: lookup <service>: no such host` or similar resolver errors
- `nslookup` or `dig` inside a pod returns `SERVFAIL`, `NXDOMAIN`, or times out entirely
- Services reachable by IP continue to work, but hostname-based connections fail
- `kubectl exec` into a debug pod shows `connection timed out; no servers could be reached`
- Intermittent DNS failures under load, suggesting resource exhaustion rather than complete failure
- External DNS works but in-cluster service discovery does not, or vice versa
- New pods fail to resolve immediately after deployment while older pods are unaffected
Start every DNS investigation with a quick baseline test from inside an affected pod:
kubectl run dns-debug --image=busybox:1.36 --restart=Never -it --rm -- nslookup kubernetes.default
If this times out or returns SERVFAIL, CoreDNS itself is the problem. If it returns a valid answer but your application's target hostname does not resolve, the issue is more likely related to search domains, ndots settings, or ConfigMap forwarding rules.
Root Cause 1: CoreDNS Pod Not Running
Why It Happens
CoreDNS runs as a Deployment inside the `kube-system` namespace. If the pods crash, get evicted due to node memory pressure, or are accidentally deleted, all in-cluster DNS resolution stops immediately. A bad ConfigMap update that causes CoreDNS to panic on startup, a broken container image reference, or a node that has run out of resources can all leave the cluster without any DNS pods. Because CoreDNS is a shared infrastructure component, a single Deployment outage affects every workload in every namespace simultaneously.
How to Identify It
Check the state of the CoreDNS Deployment and its pods:
kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-5d78c9869d-4xkzp 0/1 CrashLoopBackOff 8 12m
coredns-5d78c9869d-r9p2l 0/1 CrashLoopBackOff 8 12m
Pull the logs from a crashing pod to understand what is going wrong:
kubectl logs -n kube-system coredns-5d78c9869d-4xkzp --previous
[ERROR] plugin/errors: 2 SERVFAIL (incomplete response) : no upstream is available
plugin/reload: Running configuration md5 = 3a7bc4f1d9820e13
[FATAL] Failed to initialize server: listen tcp 0.0.0.0:53: bind: address already in use
Also verify the Deployment itself is reporting healthy conditions:
kubectl describe deployment coredns -n kube-system | grep -A5 "Conditions:"
Conditions:
Type Status Reason
---- ------ ------
Available False MinimumReplicasUnavailable
Progressing True ReplicaSetUpdated
How to Fix It
If the pods are in CrashLoopBackOff due to a bad configuration, roll back the ConfigMap first (see Root Cause 2). If the pods are simply missing or evicted, scale the Deployment back up:
kubectl scale deployment coredns -n kube-system --replicas=2
If the container image cannot be pulled, verify the image reference and update it to a valid tag:
kubectl set image deployment/coredns -n kube-system \
coredns=registry.k8s.io/coredns/coredns:v1.11.1
After making changes, watch the rollout until it completes:
kubectl rollout status deployment/coredns -n kube-system
Waiting for deployment "coredns" rollout to finish: 1 out of 2 new replicas have been updated...
deployment "coredns" successfully rolled out
Confirm DNS is restored by re-running the baseline nslookup test from a pod in an affected namespace.
Root Cause 2: CoreDNS ConfigMap Misconfigured
Why It Happens
CoreDNS reads its configuration from a ConfigMap named `coredns` in the `kube-system` namespace. A typo in the `Corefile`, a missing plugin directive, a malformed forward address, or an incorrect zone block will prevent CoreDNS from loading its configuration — causing pods to crash or silently drop queries. This ConfigMap is often edited manually during cluster customisation to add custom stub zones, change upstream resolvers, or tune caching, making it a frequent source of breakage. Pointing the `forward` directive at the CoreDNS service IP itself creates an infinite loop and is one of the most common self-inflicted failures.
How to Identify It
Inspect the current ConfigMap:
kubectl get configmap coredns -n kube-system -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . 10.96.0.10 {
            max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
In the output above, the `forward` directive points to `10.96.0.10` — which is the CoreDNS service itself. Every external query will loop back into CoreDNS, and the `loop` plugin will detect this and crash the pod. Another common mistake is removing the `cluster.local` zone from the `kubernetes` block, which stops all in-cluster service discovery.
Test the Corefile syntax before applying it by running CoreDNS with the `-dns.port 0` flag to perform a dry-run parse:
docker run --rm -v $(pwd)/Corefile:/Corefile \
registry.k8s.io/coredns/coredns:v1.11.1 \
-conf /Corefile -dns.port 0 2>&1
.:53
CoreDNS-1.11.1
linux/amd64, go1.21.1, 1b5f4a0
If there is a syntax error, CoreDNS will print it and exit non-zero before you have applied anything.
How to Fix It
Edit the ConfigMap and restore a known-good Corefile:
kubectl edit configmap coredns -n kube-system
A safe baseline Corefile for a standard cluster with upstream resolvers at `10.0.0.1`:
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . 10.0.0.1
    cache 30
    loop
    reload
    loadbalance
}
To forward queries for `solvethenetwork.com` to an internal resolver at `10.0.0.53`, add a dedicated stub zone block above the catch-all:
solvethenetwork.com:53 {
    errors
    forward . 10.0.0.53
    cache 30
}

.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . 10.0.0.1
    cache 30
    loop
    reload
    loadbalance
}
After saving, CoreDNS detects the change via the `reload` plugin without requiring a pod restart. Verify it picked up the new configuration:
kubectl logs -n kube-system -l k8s-app=kube-dns | grep reload
[INFO] Reloading
[INFO] plugin/reload: Running configuration md5 = d41d8cd98f00b204
[INFO] Reloading complete
Root Cause 3: NetworkPolicy Blocking Port 53
Why It Happens
Kubernetes NetworkPolicy objects restrict pod-to-pod and pod-to-external traffic. A strict default-deny egress policy applied to application namespaces will block pods from reaching the CoreDNS service on port 53 over both UDP and TCP. Engineers often add default-deny NetworkPolicies to tighten namespace security without realising they have also cut off DNS. The symptom is pods that can reach other pods by direct IP address but cannot resolve any hostname — not even `kubernetes.default` — because every DNS query is silently dropped before it reaches CoreDNS.
How to Identify It
List NetworkPolicies in the affected namespace and look for any that restrict egress:
kubectl get networkpolicy -n production
NAME POD-SELECTOR AGE
default-deny <none> 3d
allow-internal app=api 3d
Inspect the default-deny policy:
kubectl get networkpolicy default-deny -n production -o yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
This policy blocks all egress traffic from every pod in the namespace, including DNS queries to CoreDNS. Confirm it is the cause by running a debug pod in a namespace without NetworkPolicies and verifying DNS works there:
kubectl run dns-debug --image=busybox:1.36 --restart=Never \
-it --rm -n kube-system -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
DNS works in `kube-system` but fails in `production` — the NetworkPolicy is blocking port 53 egress.
How to Fix It
Add an explicit egress rule allowing UDP and TCP on port 53 targeted at the `kube-system` namespace where CoreDNS runs. Using a namespaceSelector is more portable than hardcoding the CoreDNS service CIDR:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
    to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
Apply it and verify resolution is restored:
kubectl apply -f allow-dns-egress.yaml
networkpolicy.networking.k8s.io/allow-dns-egress created
kubectl exec -it <pod-name> -n production -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
Always include this DNS egress rule in any namespace that uses a default-deny egress policy. Treat it as mandatory boilerplate alongside the policy itself.
Root Cause 4: ndots Misconfigured
Why It Happens
The `ndots` option in `/etc/resolv.conf` controls how many dots must appear in a query name before the resolver treats it as a fully qualified domain name (FQDN) and sends it as-is, without appending search domains. Kubernetes sets `ndots:5` by default, meaning a hostname like `api.solvethenetwork.com` (only two dots) will first be tried with each search domain appended before the bare name is attempted. This is intentional: it ensures short in-cluster service names like `payments` resolve correctly by appending `.production.svc.cluster.local` automatically.
The problem arises when someone overrides `ndots` to a low value such as 1 in an attempt to reduce DNS lookup latency for external FQDNs. With `ndots:1`, the resolver treats any name containing at least one dot as fully qualified and sends it directly without appending search domains. Single-label names still work, but dotted in-cluster shorthand such as `payments.production` stops resolving, because it is sent upstream as-is instead of being expanded to `payments.production.svc.cluster.local`.
How to Identify It
Inspect `/etc/resolv.conf` inside an affected pod:
kubectl exec -it <pod-name> -n production -- cat /etc/resolv.conf
search production.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:1
With `ndots:1`, a single-label name such as `payments` (zero dots) still goes through the search list, so it usually keeps resolving. Any name containing at least one dot, however (for example the cross-namespace form `payments.production`), meets the threshold and is sent upstream as-is, bypassing the search list entirely. Confirm the behaviour with `dig`, passing `+search` because dig ignores the resolv.conf search list by default:
kubectl exec -it <pod-name> -n production -- dig +search payments.production
;; QUESTION SECTION:
;payments.production. IN A
;; AUTHORITY SECTION:
. 86399 IN SOA a.root-servers.net. ...
;; Query time: 234 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
The query reached CoreDNS, which forwarded it upstream to the root servers rather than resolving it as an in-cluster service. With the default `ndots:5`, the resolver would work through the search list, reach `payments.production.svc.cluster.local`, and get an immediate answer.
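The expansion logic the resolver applies can be sketched as a small bash function. This is a simplified, illustrative model of glibc-style search handling, not real tooling; the function name `expand_query` is hypothetical and it assumes bash:

```shell
# Print the candidate names a resolver would try for a given name,
# ndots threshold, and search-domain list (simplified model).
expand_query() {
  local name=$1 ndots=$2
  shift 2
  local dots=${name//[^.]/}   # keep only the dots so we can count them
  # Enough dots: the name is tried as an absolute query first.
  if [ "${#dots}" -ge "$ndots" ]; then
    echo "${name}."
  fi
  # Search domains are appended in resolv.conf order.
  local domain
  for domain in "$@"; do
    echo "${name}.${domain}."
  done
  # Too few dots: the bare name is only tried last.
  if [ "${#dots}" -lt "$ndots" ]; then
    echo "${name}."
  fi
}

# With ndots:1, payments.production goes upstream as-is first:
expand_query payments.production 1 \
  production.svc.cluster.local svc.cluster.local cluster.local
```

With an ndots of 5, the search domains come first instead, so `payments.production.svc.cluster.local.` is tried (and answered by CoreDNS) before the bare name is ever sent upstream.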
How to Fix It
Set the correct `ndots` value in the pod's `dnsConfig`. The Kubernetes default of 5 is correct for most workloads. If external DNS resolution latency is a concern, a value of 2 is a reasonable compromise — it still forces single-label names through the search list while allowing typical external FQDNs (which have two or more dots) to bypass it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
  namespace: production
spec:
  template:
    spec:
      dnsConfig:
        options:
        - name: ndots
          value: "5"
      containers:
      - name: api
        image: nginx:1.25
For an already-running Deployment, patch it directly:
kubectl patch deployment api-deployment -n production --type=json \
-p='[{"op":"add","path":"/spec/template/spec/dnsConfig","value":{"options":[{"name":"ndots","value":"5"}]}}]'
Confirm the resolv.conf inside the restarted pods reflects the correct value:
kubectl exec -it <pod-name> -n production -- cat /etc/resolv.conf
search production.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
Root Cause 5: Search Domain Wrong
Why It Happens
Kubernetes automatically populates the `search` line in each pod's `/etc/resolv.conf` based on the pod's namespace and the cluster domain. The expected entries are `<namespace>.svc.cluster.local`, `svc.cluster.local`, and `cluster.local`. This cluster domain is configured via the kubelet's `--cluster-domain` flag (or the equivalent `clusterDomain` field in the kubelet config file). If the cluster was bootstrapped with a non-standard domain such as `cluster.internal`, but CoreDNS was left configured with `cluster.local`, the search domains injected into pods will not match the zone CoreDNS is authoritative for, and all in-cluster service lookups will silently fail. A misconfigured Helm chart or operator can also override the `dnsConfig.searches` field, replacing the correct search domains with something invalid.
How to Identify It
Check the search domains in the pod and compare them to what CoreDNS serves:
kubectl exec -it <pod-name> -n production -- cat /etc/resolv.conf
search production.svc.cluster.internal svc.cluster.internal cluster.internal
nameserver 10.96.0.10
options ndots:5
Now check the CoreDNS ConfigMap zone declaration:
kubectl get configmap coredns -n kube-system \
-o jsonpath='{.data.Corefile}' | grep kubernetes
kubernetes cluster.local in-addr.arpa ip6.arpa {
The pod is searching `cluster.internal` but CoreDNS is authoritative only for `cluster.local`. Every short service name lookup appends the wrong suffix and receives NXDOMAIN. Confirm the mismatch by testing the fully qualified name, which resolves even when the search domains are wrong:
kubectl exec -it <pod-name> -n production -- \
nslookup kubernetes.default.svc.cluster.local
Server: 10.96.0.10
Address 1: 10.96.0.10
Name: kubernetes.default.svc.cluster.local
Address 1: 10.96.0.1
The FQDN resolves correctly, but `nslookup kubernetes.default` (which appends `.production.svc.cluster.internal`) fails. Search domain mismatch is confirmed.
How to Fix It
The correct fix depends on where the mismatch originated. First, check the kubelet configuration on the affected node:
ssh infrarunbook-admin@sw-infrarunbook-01 \
"cat /var/lib/kubelet/config.yaml | grep clusterDomain"
clusterDomain: cluster.internal
If the kubelet is using the wrong domain, update it on every node and restart the kubelet. This is a cluster-wide change requiring a rolling node update:
ssh infrarunbook-admin@sw-infrarunbook-01 \
"sudo sed -i 's/cluster.internal/cluster.local/' \
/var/lib/kubelet/config.yaml && sudo systemctl restart kubelet"
For an immediate workaround without touching the kubelet, override the search domains directly in the affected pod's `dnsConfig`:
spec:
  dnsConfig:
    searches:
    - production.svc.cluster.local
    - svc.cluster.local
    - cluster.local
After applying the patch and restarting pods, confirm the resolv.conf search line is correct and that short-name resolution succeeds without using a FQDN.
Root Cause 6: CoreDNS Resource Exhaustion
Why It Happens
Under heavy load — many pods, high service churn, or applications that issue a high volume of external DNS lookups per second — CoreDNS can exhaust its CPU or memory limits and begin dropping queries or responding with SERVFAIL. Kubernetes resource limits set too low for the cluster's actual scale are the most common cause. Because CoreDNS is a shared service, a single overloaded instance affects every workload simultaneously, making the problem appear cluster-wide and unrelated to any single application.
How to Identify It
Check real-time resource consumption for CoreDNS pods:
kubectl top pods -n kube-system -l k8s-app=kube-dns
NAME CPU(cores) MEMORY(bytes)
coredns-5d78c9869d-4xkzp 490m 98Mi
coredns-5d78c9869d-r9p2l 495m 95Mi
Both pods are running at 490–495m against a 500m CPU limit and are being throttled. Confirm the current limits:
kubectl get deployment coredns -n kube-system \
-o jsonpath='{.spec.template.spec.containers[0].resources}'
{"limits":{"memory":"170Mi","cpu":"500m"},"requests":{"cpu":"100m","memory":"70Mi"}}
How to Fix It
Increase the resource limits and add a third replica to spread load across more pods:
kubectl patch deployment coredns -n kube-system --type=json \
-p='[
{"op":"replace","path":"/spec/replicas","value":3},
{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/cpu","value":"1000m"},
{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"256Mi"}
]'
A practical scaling guideline for CoreDNS is one replica per 500 pods in the cluster, with a minimum of two replicas for high availability. Always place replicas on different nodes using a `podAntiAffinity` rule.
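The anti-affinity fragment might look like the sketch below, patched into the CoreDNS Deployment's pod template. This is a sketch, not a verified manifest: the `k8s-app: kube-dns` label matches the selector used elsewhere in this runbook, but confirm it against your own cluster before applying:

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  k8s-app: kube-dns
              topologyKey: kubernetes.io/hostname
```

Preferred (rather than required) anti-affinity lets the scheduler still place a second replica on a shared node if the cluster temporarily has only one schedulable node.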
Root Cause 7: kube-proxy or iptables Rules Broken
Why It Happens
The CoreDNS service (typically at `10.96.0.10`) is a ClusterIP service. Traffic to it is intercepted and redirected by kube-proxy via iptables or IPVS rules. If kube-proxy has crashed or if iptables rules have been flushed — which can happen during node maintenance, OS upgrades, or when firewall management tools conflict with Kubernetes — pods cannot reach the DNS service IP even when CoreDNS pods are perfectly healthy. The symptom is DNS timing out at the network level with no log output from CoreDNS at all, because the packets never arrive.
How to Identify It
Check whether kube-proxy is running on affected nodes:
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
NAME READY STATUS RESTARTS AGE NODE
kube-proxy-9xzpk 0/1 Error 5 6m sw-infrarunbook-01
kube-proxy-m4qr2 1/1 Running 0 6d 10.0.1.22
Check whether the iptables DNAT rule for the DNS service exists on the affected node:
ssh infrarunbook-admin@sw-infrarunbook-01 \
"sudo iptables -t nat -L KUBE-SERVICES | grep 10.96.0.10"
# No output — rule is missing
A node with a healthy kube-proxy would show:
KUBE-SVC-ERIFXISQEP7F7OF4 udp -- anywhere 10.96.0.10 \
/* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
How to Fix It
kube-proxy runs as a DaemonSet. Delete the failing pod and let it be recreated automatically, which will re-sync all iptables rules:
kubectl delete pod kube-proxy-9xzpk -n kube-system
pod "kube-proxy-9xzpk" deleted
Wait for the new pod to become ready and verify the iptables rule is restored:
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
NAME READY STATUS RESTARTS AGE NODE
kube-proxy-7bnjq 1/1 Running 0 30s sw-infrarunbook-01
ssh infrarunbook-admin@sw-infrarunbook-01 \
"sudo iptables -t nat -L KUBE-SERVICES | grep 10.96.0.10"
KUBE-SVC-ERIFXISQEP7F7OF4 udp -- anywhere 10.96.0.10 \
/* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
Prevention
- Monitor CoreDNS with Prometheus. CoreDNS exposes metrics on port 9153. Alert on elevated `coredns_dns_responses_total{rcode="SERVFAIL"}` rates and high `coredns_dns_request_duration_seconds` p99 latency to catch problems before they affect workloads.
- Always include DNS egress rules in NetworkPolicy templates. Standardise on a base NetworkPolicy Helm chart that includes the port 53 UDP and TCP egress allowance for every namespace. Treat it as required boilerplate alongside the default-deny policy, never as an optional add-on.
- Validate Corefile changes before applying. Use the CoreDNS binary's `-conf` flag in a CI pipeline step to syntax-check Corefile changes against a running CoreDNS container before any `kubectl apply`. A broken Corefile applied to production takes down DNS cluster-wide.
- Set realistic resource limits for your cluster scale. Revisit CoreDNS CPU and memory limits whenever cluster pod count grows significantly. One CoreDNS replica per 500 pods with at least 500m CPU per replica is a safe baseline under typical load.
- Use ndots:5 by default; only override with intent and testing. If you reduce ndots for external DNS performance, test thoroughly with both short in-cluster service names and fully-qualified external names before rolling out cluster-wide.
- Pin the cluster domain before the first workload; never change it live. Changing `clusterDomain` after cluster creation requires coordinated kubelet restarts across every node — a high-risk operation. Decide on `cluster.local` or a custom domain at bootstrap time.
- Run DNS smoke tests in CI/CD pipelines. Add a post-deployment step that launches a busybox pod and resolves a known in-cluster service name. Fail the pipeline if resolution takes more than 200ms or returns an error, catching NetworkPolicy and ConfigMap regressions before they reach production.
- Use PodTopologySpread for CoreDNS. Ensure CoreDNS replicas are spread across multiple nodes so a single node failure does not take down all DNS pods simultaneously.
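The smoke-test step above can be sketched as a short Kubernetes Job that a pipeline applies and then waits on. The Job name and deadline here are illustrative assumptions, not fixed conventions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: dns-smoke-test          # hypothetical name
spec:
  backoffLimit: 0               # one attempt; any failure fails the Job
  activeDeadlineSeconds: 30     # treat slow resolution as a failure
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: check
        image: busybox:1.36
        # nslookup exits non-zero if the in-cluster name fails to resolve
        command: ["nslookup", "kubernetes.default"]
```

The pipeline then gates on `kubectl wait --for=condition=complete job/dns-smoke-test --timeout=60s`, which exits non-zero if resolution failed or the deadline was exceeded.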
Frequently Asked Questions
Q: How do I quickly test if DNS is working inside a specific pod without modifying it?
A: Use `kubectl exec` directly: `kubectl exec -it <pod-name> -n <namespace> -- nslookup kubernetes.default`. If the pod's image does not include nslookup, use `cat /etc/resolv.conf` to check the nameserver and search configuration, then test with `wget -qO- --timeout=3 https://kubernetes.default` — a certificate or connection error still proves the name resolved, while a bad address error means DNS itself failed.
Q: What is the default CoreDNS service IP in a Kubernetes cluster?
A: By default it is the tenth IP in the service CIDR. For a cluster with service CIDR `10.96.0.0/12`, the DNS service is at `10.96.0.10`. You can confirm it with `kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'`.
Q: Why does DNS work for some pods but not others in the same namespace?
A: The most common causes are per-pod NetworkPolicy selectors that allow DNS egress only for pods with specific labels, or pods that were scheduled before a Corefile reload took effect and have a stale nameserver entry. Compare the `/etc/resolv.conf` contents and pod labels between working and failing pods — differences there will point to the cause.
Q: What is the difference between NXDOMAIN and SERVFAIL?
A: NXDOMAIN means the DNS server authoritative for the zone confirmed the name does not exist — typically a typo in the service name, wrong namespace, or the service was never created. SERVFAIL means the DNS server encountered an error while trying to resolve the query — typically an unreachable upstream, a loop in the Corefile, or a broken forward rule. SERVFAIL indicates a CoreDNS infrastructure problem; NXDOMAIN indicates the name itself is wrong.
Q: Is it safe to restart CoreDNS pods during production hours?
A: Yes, as long as you have more than one replica. A rolling restart replaces pods one at a time while keeping the remaining replicas serving queries. Run `kubectl rollout restart deployment/coredns -n kube-system` and monitor with `kubectl rollout status deployment/coredns -n kube-system`. The restart completes with zero DNS downtime for a correctly sized deployment.
Q: Why does external DNS resolution work but in-cluster service discovery does not?
A: This pattern means CoreDNS is running and forwarding queries upstream correctly, but the `kubernetes` plugin is not resolving in-cluster names. Check that the `kubernetes` block in the Corefile specifies the correct zone (`cluster.local`) and that the pod's `/etc/resolv.conf` search line includes `svc.cluster.local`. A mismatch between the kubelet's `clusterDomain` and the CoreDNS zone is the most common cause.
Q: How many CoreDNS replicas should I run?
A: A minimum of two replicas for high availability on any production cluster. For larger clusters, scale at roughly one replica per 500 pods. Use a `PodTopologySpread` constraint or `podAntiAffinity` rule to ensure replicas land on different nodes, preventing a single node failure from eliminating all DNS capacity simultaneously.
Q: Can I use dnsPolicy: None to fully control DNS in a pod?
A: Yes. Setting `dnsPolicy: None` disables all automatic DNS configuration and requires you to supply the full `dnsConfig` block with explicit nameservers, search domains, and ndots options. This is useful for pods that must bypass CoreDNS entirely and query an external resolver directly, but any mistake in the `dnsConfig` will leave the pod with no working DNS at all — there is no fallback.
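A minimal sketch of such a pod spec. The resolver address and search domain are placeholder assumptions — substitute your own:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: external-resolver-pod   # hypothetical name
spec:
  dnsPolicy: None               # disable all automatic DNS configuration
  dnsConfig:
    nameservers:
    - 10.0.0.53                 # assumed external resolver
    searches:
    - example.internal          # assumed internal search domain
    options:
    - name: ndots
      value: "1"
  containers:
  - name: app
    image: nginx:1.25
```

Note that this pod can no longer resolve in-cluster service names at all, since its queries never reach CoreDNS.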
Q: How do I debug slow DNS resolution without disrupting running pods?
A: Deploy a dedicated debug pod: `kubectl run dns-debug --image=tutum/dnsutils --restart=Never -it --rm -- bash`. Then use `dig` with timing stats: `dig @10.96.0.10 kubernetes.default.svc.cluster.local +stats`. The `;; Query time:` line shows actual latency. Anything above 5ms for in-cluster names under normal conditions suggests CoreDNS is throttled or the iptables path has a problem.
Q: Can I add custom DNS entries for in-cluster names that do not correspond to real Services?
A: Yes. Use the CoreDNS `hosts` plugin to add static A records directly in the Corefile, or deploy an internal-only zone using the `file` plugin pointing to a zone file stored in a ConfigMap. The `hosts` plugin is simpler for a small number of static overrides; the `file` plugin scales better for many records and supports full zone management including PTR records.
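A minimal sketch of the `hosts` approach. The hostnames and addresses below are hypothetical placeholders, and `fallthrough` lets non-matching queries continue to the rest of the plugin chain:

```
.:53 {
    hosts {
        10.0.0.40 legacy-db.example.internal
        10.0.0.41 reports.example.internal
        fallthrough
    }
    # ...remainder of the standard Corefile (kubernetes, forward, cache, ...)
}
```

Because the `hosts` block sits before `forward` in the chain, the static entries answer immediately while everything else resolves as before.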
Q: After fixing DNS, how do I verify that all existing pods have picked up the change?
A: For ConfigMap and CoreDNS-level fixes, existing pods automatically benefit because their nameserver IP (`10.96.0.10`) does not change — only the CoreDNS behaviour does. For fixes that required changing the pod's own `dnsConfig`, or for ndots and search domain corrections, pods must be restarted to receive the updated `/etc/resolv.conf`. Rolling restart the affected Deployments with `kubectl rollout restart deployment/<name> -n <namespace>`.
Q: How do I find which pods are currently failing DNS resolution cluster-wide without checking each one individually?
A: Deploy a DaemonSet running a DNS check loop to every node, or use a tool such as `kubectl-debug` on sampled nodes. A faster triage approach is to check CoreDNS error metrics directly: `kubectl exec -n kube-system <coredns-pod> -- wget -qO- http://localhost:9153/metrics | grep coredns_dns_responses_total` and filter for `rcode="SERVFAIL"` or `rcode="NXDOMAIN"`. A spike in either counter cluster-wide confirms the scope of the problem before you start checking individual pods.
