Symptoms
When a Kubernetes Service becomes unreachable, engineers typically encounter one or more of the following indicators across application logs, kubectl output, and in-cluster connectivity tests:
- Running `curl http://web-frontend.production.svc.cluster.local` from inside a Pod returns `curl: (6) Could not resolve host` or `curl: (7) Failed to connect to port 8080 after 0 ms`
- `kubectl exec` into a debug Pod and attempting to reach the Service ClusterIP yields `Connection timed out` or `No route to host`
- Application logs show repeated entries such as `dial tcp 10.96.45.12:8080: i/o timeout` or `connect: connection refused`
- `kubectl get endpoints <service-name>` returns an empty addresses list or shows `<none>`
- Ingress controllers return HTTP 502 or 503 errors for requests targeting the backend Service
- Inter-service communication inside the cluster fails intermittently — some requests succeed while others time out — suggesting partial routing failure on specific nodes
- A freshly deployed application that passed staging smoke tests cannot be reached in production despite identical manifests
Each of these symptoms points to a different layer in Kubernetes networking. The sections below dissect the most common root causes, how to identify them with precision, and how to resolve them permanently.
Root Cause 1: Label Selector Mismatch
Why It Happens
A Kubernetes Service does not hold a static list of Pod IP addresses. Instead, it uses a label selector to dynamically discover backing Pods and build an Endpoints object. The endpoints controller watches for Pods whose labels match the selector and writes their IPs and ports into the Endpoints resource. When the labels on your Pods do not exactly match the selector defined in the Service spec — including key name, value, and case — the Endpoints object remains empty and the Service forwards no traffic anywhere. This is one of the most common misconfigurations, particularly after renaming labels during a refactor or copying manifests between projects without updating selectors.
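The matching rule the endpoints controller applies is a plain subset test: every key/value pair in the Service selector must appear, exactly and case-sensitively, in the Pod's labels. A minimal Python sketch of that rule (the label values are illustrative, taken from the mismatch example in this section):

```python
def matches_selector(selector: dict, pod_labels: dict) -> bool:
    """A Pod backs a Service only if every selector key/value pair
    appears exactly (case-sensitive) in the Pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())

# The Service wants tier=frontend; the Pod carries tier=ui.
selector = {"app": "web-frontend", "tier": "frontend"}
pod_labels = {"app": "web-frontend", "tier": "ui"}

print(matches_selector(selector, pod_labels))  # False -> empty Endpoints
print(matches_selector(selector, {"app": "web-frontend", "tier": "frontend"}))  # True
```

Note the asymmetry: extra labels on the Pod are harmless, but a single missing or differing selector key excludes the Pod entirely.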
How to Identify It
Start by inspecting the Endpoints object for the affected Service:
```
kubectl get endpoints web-frontend -n production
NAME           ENDPOINTS   AGE
web-frontend   <none>      14m
```

A `<none>` value confirms no Pods matched the selector. Now compare the Service selector against the labels on running Pods:
```
kubectl get svc web-frontend -n production -o jsonpath='{.spec.selector}'
{"app":"web-frontend","tier":"frontend"}

kubectl get pods -n production --show-labels
NAME                           READY   STATUS    LABELS
web-frontend-7d9f4b8c6-xk2rp   1/1     Running   app=web-frontend,tier=ui
```

The Service expects `tier=frontend` but the Pod carries `tier=ui`. The mismatch leaves the Endpoints list empty and the Service completely dark.
How to Fix It
Option A — patch the Deployment template labels so all future Pods carry the correct label and trigger a rollout:
```
kubectl patch deployment web-frontend -n production \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/template/metadata/labels/tier","value":"frontend"}]'

kubectl rollout status deployment web-frontend -n production
deployment "web-frontend" successfully rolled out
```

Option B — for an immediate hotfix on an existing Pod, patch the label in place:

```
kubectl label pod web-frontend-7d9f4b8c6-xk2rp tier=frontend --overwrite -n production
```

Verify that the Endpoints are now populated:

```
kubectl get endpoints web-frontend -n production
NAME           ENDPOINTS          AGE
web-frontend   10.244.1.15:8080   16m
```

Root Cause 2: kube-proxy Not Running
Why It Happens
`kube-proxy` is the component responsible for maintaining the network rules — either `iptables` chains or `IPVS` virtual servers — that implement Service virtual IPs on every node. It runs as a DaemonSet, meaning one instance per node. If a node's kube-proxy Pod crashes and fails to restart (due to resource exhaustion, a broken container image, a missing kernel module, or a taint mismatch preventing scheduling), that node loses the ability to route Service traffic. Requests originating from Pods on the affected node, or routed to Pods running on it, will silently time out.
How to Identify It
Check the kube-proxy DaemonSet and the status of each Pod across nodes:
```
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
NAME               READY   STATUS             RESTARTS   AGE   NODE
kube-proxy-7xmrq   1/1     Running            0          2d    sw-infrarunbook-01
kube-proxy-9plbt   0/1     CrashLoopBackOff   14         45m   node-worker-02
```

The Pod on `node-worker-02` is in `CrashLoopBackOff`. Inspect its logs for the underlying error:

```
kubectl logs kube-proxy-9plbt -n kube-system --previous
E0405 11:23:17.445001       1 proxier.go:1689] Failed to execute iptables-restore: exit status 2
E0405 11:23:17.445123       1 run.go:74] "command failed" err="exit status 2"
F0405 11:23:17.445201       1 server.go:490] "Error running ProxyServer" err="failed to run Proxier: ..."
```

How to Fix It
Delete the failing Pod to trigger DaemonSet recreation on the node:
```
kubectl delete pod kube-proxy-9plbt -n kube-system
```

If it continues to crash, describe the Pod to surface node-level events:

```
kubectl describe pod kube-proxy-9plbt -n kube-system
Events:
  Warning  BackOff  40s  kubelet  Back-off restarting failed container
  Warning  Failed   45m  kubelet  Error: failed to create containerd task: OCI runtime exec failed
  Warning  Failed   45m  kubelet  Error response from daemon: No such image: registry.k8s.io/kube-proxy:v1.29.0
```

In this case the image cannot be pulled. Verify connectivity to the registry from the node, or pre-pull the correct image. Once the underlying problem is resolved, confirm the Pod is running and that the iptables rules have been rewritten on the node:

```
iptables-save | grep -c KUBE
287
```

Root Cause 3: iptables Rules Corrupted or Flushed
Why It Happens
Even when kube-proxy is running and healthy, its iptables rules can be silently wiped. This occurs when a security tool, a firewall management daemon, or an administrator runs `iptables -F` on the node. Kernel upgrades that reset netfilter state, Docker daemon restarts on older setups, or competing iptables managers such as `firewalld` or `ufw` can purge or conflict with the `KUBE-*` chains that kube-proxy writes. The result is that Service ClusterIPs become black holes — TCP packets reach the node's network interface but are never DNAT-translated to a real Pod IP, so they are dropped silently.
How to Identify It
SSH to the affected node and check whether the KUBE chains exist:
```
ssh infrarunbook-admin@sw-infrarunbook-01
iptables -L KUBE-SERVICES -n --line-numbers 2>&1 | head -5
iptables: No chain/target/match by that name.
```

On a healthy node the output should list Service DNAT entries:

```
Chain KUBE-SERVICES (2 references)
num  target                      prot  opt  source     destination
1    KUBE-SVC-XGLOHA7QRQ3V22RZ   tcp   --   0.0.0.0/0  10.96.0.1
2    KUBE-SVC-NPX46M4PTMTKRN6Y   tcp   --   0.0.0.0/0  10.96.0.10
```

Also check whether firewalld is active on the node, which is incompatible with kube-proxy:

```
systemctl is-active firewalld
active
```

How to Fix It
Disable and stop firewalld on all Kubernetes nodes — it must not coexist with kube-proxy:
```
systemctl stop firewalld
systemctl disable firewalld
```

Then force kube-proxy to rewrite all KUBE-* chains by restarting the DaemonSet or deleting the Pod on the affected node:

```
kubectl rollout restart daemonset kube-proxy -n kube-system
kubectl rollout status daemonset kube-proxy -n kube-system
daemon set "kube-proxy" successfully rolled out
```

Alternatively, target only the affected node:

```
kubectl delete pod -n kube-system -l k8s-app=kube-proxy \
  --field-selector spec.nodeName=sw-infrarunbook-01
```

Confirm the chains are restored:

```
iptables-save | grep "KUBE-SERVICES" | wc -l
14
```

Root Cause 4: CoreDNS Failure
Why It Happens
Kubernetes Services are routinely accessed by DNS name — for example `api.production.svc.cluster.local` — rather than by ClusterIP. CoreDNS is the in-cluster authoritative DNS resolver that translates these short names and fully qualified names to ClusterIPs. If CoreDNS Pods are down, in `CrashLoopBackOff`, or if the CoreDNS ConfigMap (the Corefile) has been accidentally modified with a syntax error or an incorrect upstream, DNS resolution fails cluster-wide. Applications that rely on DNS-based service discovery will report `could not resolve host` even though the underlying Service, Endpoints, and kube-proxy rules are perfectly healthy.
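The names involved follow a fixed scheme, which is worth keeping in mind when reading resolver errors. A short Python sketch of how a Service's canonical name is formed and why bare short names like `api` still resolve (via the search path kubelet writes into each Pod's `/etc/resolv.conf`):

```python
def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Canonical DNS name for a ClusterIP Service, as served by CoreDNS:
    <service>.<namespace>.svc.<cluster-domain>"""
    return f"{service}.{namespace}.svc.{cluster_domain}"

print(service_fqdn("api", "production"))  # api.production.svc.cluster.local

# A Pod in "production" gets a search path like this, so the short name
# "api" is expanded and tried against CoreDNS in order:
search = ["production.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print([f"api.{domain}" for domain in search])
```

This also explains why cross-namespace calls must use at least `api.<namespace>`: the first search-path expansion only covers the caller's own namespace.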
How to Identify It
Check CoreDNS Pod status in the kube-system namespace:
```
kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS             RESTARTS   AGE
coredns-5d78c9869d-4xm7g   0/1     CrashLoopBackOff   8          22m
coredns-5d78c9869d-9wqkf   1/1     Running            0          2d
```

Run a DNS resolution test from a temporary debug Pod:

```
kubectl run dns-test --image=busybox:1.36 --restart=Never -it --rm -- nslookup kubernetes.default
;; connection timed out; no servers could be reached
```

Inspect the logs of the failing CoreDNS Pod:

```
kubectl logs coredns-5d78c9869d-4xm7g -n kube-system
[ERROR] plugin/errors: 2 SERVFAIL for kubernetes.default.svc.cluster.local. A
[FATAL] Failed to initialize server: open /etc/coredns/Corefile: no such file or directory
```

Also inspect the CoreDNS ConfigMap for corruption:

```
kubectl get configmap coredns -n kube-system -o yaml
```

How to Fix It
If the Corefile ConfigMap has been damaged, restore it to a valid minimal configuration:
```
kubectl edit configmap coredns -n kube-system
# Ensure the data.Corefile key contains:
.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
```

Then restart the CoreDNS Deployment so it picks up the fixed ConfigMap:

```
kubectl rollout restart deployment coredns -n kube-system
kubectl rollout status deployment coredns -n kube-system
deployment "coredns" successfully rolled out
```

Verify DNS resolution is restored:

```
kubectl run dns-test --image=busybox:1.36 --restart=Never -it --rm -- nslookup kubernetes.default
Server:     10.96.0.10
Address 1:  10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:       kubernetes.default
Address 1:  10.96.0.1 kubernetes.default.svc.cluster.local
```

Root Cause 5: NetworkPolicy Blocking Traffic
Why It Happens
Kubernetes `NetworkPolicy` resources implement firewall rules at the Pod level using the cluster's CNI plugin (Calico, Cilium, Weave, and others). A critical behavior to understand: once any NetworkPolicy selects a Pod as its target, all traffic not explicitly permitted by that policy — or by any other policy targeting that Pod — is denied. Engineers commonly apply a default-deny policy to a namespace for security hardening, then forget to create matching allow rules for their application's traffic. The result is that inter-service calls that worked before the policy was applied are dropped at the CNI layer, often with no log entry at the application level — only connection timeouts.
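The "selected means isolated" rule is the part that surprises people, so here is a deliberately simplified model of the ingress decision in Python. It only models pod-selector rules — real policies also match namespaces, IP blocks, and ports — and all names are illustrative:

```python
def ingress_allowed(pod_labels: dict, peer_labels: dict, policies: list) -> bool:
    """Simplified NetworkPolicy ingress decision: a Pod selected by no
    policy accepts everything; once any policy selects it, traffic must
    be whitelisted by at least one rule of an applicable policy."""
    def selects(selector: dict, labels: dict) -> bool:
        return all(labels.get(k) == v for k, v in selector.items())

    applicable = [p for p in policies if selects(p["podSelector"], pod_labels)]
    if not applicable:
        return True  # unselected Pods are non-isolated
    return any(
        selects(rule, peer_labels)
        for p in applicable
        for rule in p.get("allowFrom", [])
    )

deny_all = {"podSelector": {}, "allowFrom": []}  # empty selector: selects every Pod
allow_fe = {"podSelector": {"app": "api"}, "allowFrom": [{"app": "web-frontend"}]}

api, frontend = {"app": "api"}, {"app": "web-frontend"}
print(ingress_allowed(api, frontend, [deny_all]))            # False: deny-all wins
print(ingress_allowed(api, frontend, [deny_all, allow_fe]))  # True: any allow suffices
```

Note that policies are additive-allow only: there is no "deny rule" to remove, you can only add a policy whose rules permit the traffic.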
How to Identify It
List all NetworkPolicies in the affected namespace:
```
kubectl get networkpolicy -n production
NAME                   POD-SELECTOR   AGE
default-deny-all       <none>         3d
allow-ingress-to-api   app=api        3d
```

Describe the deny policy to confirm it targets all Pods and blocks all traffic:

```
kubectl describe networkpolicy default-deny-all -n production
Name:         default-deny-all
Namespace:    production
Pod Selector: <none> (Selects all Pods in namespace)
Policy Types: Ingress, Egress
Allowing ingress traffic:
  <none> (Selected pods are isolated for ingress connectivity)
Allowing egress traffic:
  <none> (Selected pods are isolated for egress connectivity)
```

Confirm the connection is being dropped with a direct connectivity test:

```
kubectl exec -it debug-pod -n production -- curl -v --max-time 5 http://10.96.45.12:8080
*   Trying 10.96.45.12:8080...
* connect to 10.96.45.12 port 8080 failed: Connection timed out
curl: (28) Connection timed out after 5001 milliseconds
```

With Calico installed you can also inspect policy enforcement decisions:

```
calicoctl get networkpolicy -n production -o yaml | grep -A10 selector
```

How to Fix It
Create an explicit ingress allow policy that permits the required source Pod to reach the destination Pod on the correct port:
```
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web-frontend
      ports:
        - protocol: TCP
          port: 8080
EOF
```

If a default-deny egress policy is also present, add a matching egress allow from the source Pod:

```
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-egress-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-frontend
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 8080
  policyTypes:
    - Egress
EOF
```

Retest connectivity and confirm you now receive an HTTP response:

```
kubectl exec -it debug-pod -n production -- curl -s -o /dev/null -w "%{http_code}" http://10.96.45.12:8080
200
```

Root Cause 6: Service Port or TargetPort Misconfiguration
Why It Happens
A Service exposes a `port` — the port clients use when reaching the Service — and a `targetPort` — the port on the Pod where the application actually listens. If `targetPort` does not match the container's listening port, kube-proxy still forwards traffic (it happily creates DNAT rules for whatever targetPort you specify), but the packets arrive at a port on the Pod where nothing is listening, so the Pod's kernel resets the connection and clients see an immediate connection refused. This frequently occurs when a container image is updated with a different default port and the Service manifest is not updated to match.
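The failure mode can be reduced to a tiny model: the `port` → `targetPort` translation is unconditional, and nothing ever validates `targetPort` against what the container binds. A Python sketch (port numbers taken from this section's example; the function is illustrative, not a real API):

```python
def forward(service: dict, container_listen_port: int, client_port: int) -> str:
    """Sketch of the kube-proxy forwarding step: traffic to the Service
    `port` is DNAT-translated to `targetPort` on a backend Pod,
    unconditionally — the container's real listening port is never checked."""
    if client_port != service["port"]:
        return "no rule: packet not addressed to the Service port"
    if service["targetPort"] != container_listen_port:
        return "connection refused"  # Pod kernel resets: nothing listens there
    return "connected"

stale = {"port": 8080, "targetPort": 8080}  # manifest never updated...
print(forward(stale, container_listen_port=9090, client_port=8080))  # connection refused

fixed = {"port": 8080, "targetPort": 9090}  # ...while the image now listens on 9090
print(forward(fixed, container_listen_port=9090, client_port=8080))  # connected
```

Note that clients keep using port 8080 in both cases — only the Pod-side translation changes, which is why the fix below touches the Service, not the callers.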
How to Identify It
Compare the Service targetPort against the actual container port:
```
kubectl describe svc api -n production
Port:        http  8080/TCP
TargetPort:  8080/TCP
Endpoints:   10.244.1.15:8080,10.244.2.9:8080

kubectl describe pod api-5f7d9c84b-j4rlx -n production | grep -A3 Ports
Port:       9090/TCP
Host Port:  0/TCP
```

The Service targets port 8080 but the container listens on 9090. All forwarded connections hit a closed port and are immediately refused.
How to Fix It
```
kubectl patch svc api -n production \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/ports/0/targetPort","value":9090}]'

kubectl get endpoints api -n production
NAME   ENDPOINTS          AGE
api    10.244.1.15:9090   4m
```

Root Cause 7: Pod Readiness Probe Failure
Why It Happens
Kubernetes automatically removes a Pod from the Service's Endpoints list when its readiness probe fails. This is a safety feature designed to prevent traffic from reaching Pods that are not yet ready to serve requests. However, an incorrectly configured readiness probe — wrong HTTP path, wrong port, or a timeout too short for the application's startup time — causes healthy Pods to be continuously excluded from Endpoints. The Service exists, the Pods are running, but no traffic is ever forwarded.
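The endpoints controller's behavior here is a straightforward filter: only Pods that are both Running and passing their readiness probe are published. A minimal Python sketch (Pod IPs and field names in the dicts are illustrative):

```python
def build_endpoints(pods: list) -> list:
    """The endpoints controller publishes only Pods whose readiness probe
    currently passes; Running-but-unready Pods are silently excluded."""
    return [p["ip"] for p in pods if p["phase"] == "Running" and p["ready"]]

pods = [
    {"ip": "10.244.1.15", "phase": "Running", "ready": False},  # probe failing
    {"ip": "10.244.2.9",  "phase": "Running", "ready": True},
]
print(build_endpoints(pods))  # ['10.244.2.9'] -- the unready Pod gets no traffic
```

This is why a misconfigured probe produces the confusing state described above: every Pod is Running, yet the Endpoints list can shrink all the way to empty.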
How to Identify It
```
kubectl get pods -n production
NAME                  READY   STATUS    RESTARTS   AGE
api-5f7d9c84b-j4rlx   0/1     Running   0          8m

kubectl describe pod api-5f7d9c84b-j4rlx -n production
Readiness: http-get http://:8080/healthz delay=5s timeout=1s period=10s
Events:
  Warning  Unhealthy  30s  kubelet  Readiness probe failed: Get http://10.244.1.15:8080/healthz: dial tcp 10.244.1.15:8080: connect: connection refused
```

The `0/1` READY status indicates the Pod has been removed from Endpoints and is invisible to the Service.
How to Fix It
Update the readiness probe in the Deployment to use the correct path and port:
```
kubectl patch deployment api -n production \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/ready"}]'

kubectl rollout status deployment api -n production
deployment "api" successfully rolled out

kubectl get endpoints api -n production
NAME   ENDPOINTS          AGE
api    10.244.1.15:8080   92s
```

Prevention
Avoiding Service reachability failures requires discipline across the full lifecycle of Kubernetes deployments. The following practices eliminate the most common failure modes before they reach production:
- Validate manifests before applying. Run `kubectl diff -f manifest.yaml` to preview changes and `kubectl apply --dry-run=server -f manifest.yaml` to catch structural misconfigurations against the live API server before committing them.
- Enforce label conventions with admission control. Use OPA/Gatekeeper or Kyverno to reject Deployments whose Pod template labels do not include the required selector keys. This catches label mismatches at admission time, long before a Service endpoint list goes empty.
- Monitor kube-proxy health continuously. Alert on any discrepancy between the DaemonSet's desired replica count and its ready count. A single node running without kube-proxy creates a hard-to-diagnose partial routing failure where some requests succeed and others time out depending on which node the client Pod is scheduled on.
- Protect CoreDNS from resource pressure. Assign CoreDNS to a high-priority PriorityClass so it is not evicted under node memory pressure. Set resource requests and limits carefully, and monitor its Prometheus metrics (exposed on port 9153) for request latency spikes and error rate increases.
- Test NetworkPolicies in a staging namespace first. Use tools such as `netassert` or Cilium's built-in connectivity test suite to validate that allow and deny rules behave as expected before applying them to production. Always add a matching egress allow rule whenever you add an ingress allow rule between two namespaces.
- Prohibit firewalld and ufw on Kubernetes nodes. These tools conflict with kube-proxy's iptables management. Disable and mask them during node provisioning via your configuration management tooling (Ansible, Chef, or cloud-init), and prevent re-installation through package management policies.
- Design readiness probes carefully. A readiness probe should check application-level readiness — for example, that a database connection pool is established and the application is serving traffic. Use generous `initialDelaySeconds` and `failureThreshold` values to avoid Pods being dropped from Endpoints during normal slow starts.
- Include connectivity smoke tests in CI/CD pipelines. After every deployment, run a `kubectl exec` test from a Pod in the same namespace to verify the Service is reachable via its DNS name before marking the deployment successful. Gate production promotion on this check.
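As an illustration of the admission-control practice above, a minimal Kyverno ClusterPolicy sketch that rejects Deployments whose Pod template omits an `app` label. The policy name and the choice of `app` as the required key are assumptions — adapt them to whatever selector convention your Services use:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  # Hypothetical policy name; pick one matching your conventions.
  name: require-service-selector-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-app-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Pod template must carry the app label that Service selectors rely on."
        pattern:
          spec:
            template:
              metadata:
                labels:
                  app: "?*"   # any non-empty value
```

With this in place, a Deployment whose template drops or renames the `app` label is rejected at `kubectl apply` time instead of surfacing later as an empty Endpoints list.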
