Symptoms
You run helm upgrade --install and it hangs. Or it fails immediately with a cryptic error. Or — the worst variant — it reports success but your pods never come up. Helm deployment failures come in a few distinct flavors, and the one you're staring at right now usually points to one of a handful of well-known root causes.
Here's what the failure surface looks like in practice:
Error: INSTALLATION FAILED: timed out waiting for the condition
Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "Certificate" in version "cert-manager.io/v1"
- Pods stuck in Pending, CrashLoopBackOff, or ImagePullBackOff
Error: INSTALLATION FAILED: Forbidden: User "system:serviceaccount:ci:helm-deployer" cannot create resource...
- The release shows as failed in helm list, leaving a lock that blocks all subsequent runs
These symptoms span several root causes. Let's walk through each one — why it happens, how to identify it, and how to fix it.
Root Cause 1: Values File Is Wrong
This is the most common cause I see in teams that are new to Helm or have recently refactored their chart structure. A values file that doesn't match what the chart expects will either cause a rendering error at install time or silently produce incorrect manifests that deploy broken workloads.
Why does it happen? Charts evolve. When someone bumps a chart version or restructures the values schema, the old values file doesn't always get updated in lockstep. Maybe a key got renamed from image.tag to image.version, or a nested block changed its structure entirely. Helm's templating engine will often just render an empty string or a zero-value rather than throwing an error, so the manifest looks valid but produces pods that fail at runtime. That silent failure mode is what makes this cause so insidious.
How to Identify It
Start with a dry run and inspect the rendered output:
helm upgrade --install myapp ./charts/myapp \
--values ./values/prod.yaml \
--dry-run --debug 2>&1 | head -150
If values are being ignored or misread, you'll often see default placeholders in the rendered YAML — things like an empty image repository, a replica count of zero, or a service port that maps to nothing. You can also check what Helm actually applied to the last release:
helm get values myapp --namespace production
Compare that output against what you intended to pass in. Then lint the chart with your values file explicitly specified:
helm lint ./charts/myapp --values ./values/prod.yaml
A real lint failure looks like this:
==> Linting ./charts/myapp
[ERROR] templates/deployment.yaml: image: Invalid value: "": image repository is required
1 chart(s) linted, 1 chart(s) failed
How to Fix It
Run helm show values ./charts/myapp to dump the chart's default values and compare them line by line with your overrides file. Look for keys in your file that don't appear anywhere in the defaults — those are likely stale or misspelled. If the chart ships a values.schema.json, helm lint will automatically run JSON schema validation and flag type mismatches and missing required fields.
Once you've corrected the values file, always do a dry run before applying. Don't skip the --debug flag — it prints the full rendered manifests and makes it obvious when a template produces unexpected output.
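The defaults-versus-overrides comparison is easy to script once both files are parsed (for example with PyYAML). Here's a minimal sketch; the function names and sample values are illustrative, and a real implementation would also need to handle overrides that nest deeper than the matching default key:

```python
def key_paths(d, prefix=""):
    """Flatten a nested dict of values into a set of dotted key paths."""
    paths = set()
    for key, value in d.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            paths |= key_paths(value, dotted + ".")
        else:
            paths.add(dotted)
    return paths

def stale_override_keys(defaults, overrides):
    """Return override key paths that don't exist anywhere in the chart defaults."""
    return sorted(key_paths(overrides) - key_paths(defaults))

# Example: the chart renamed image.tag to image.version, but the overrides
# file still sets the old key, so it silently does nothing.
defaults = {"image": {"repository": "myapp", "version": "2.1.0"}, "replicas": 3}
overrides = {"image": {"repository": "registry.example.com/myapp", "tag": "v2.1.0"}}
print(stale_override_keys(defaults, overrides))  # ['image.tag']
```

Wiring a check like this into CI turns the silent failure mode into a loud one: the pipeline fails on the stale key instead of deploying a pod with an empty image tag.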
Root Cause 2: Chart Version Conflict
Helm tracks releases by storing versioned secrets in the target namespace. When you try to install or upgrade, it compares what you're requesting against what's currently deployed. Version conflicts surface in a few different ways — an incompatible API version between the chart and your cluster Kubernetes version, a dependency chart pinned to a version that no longer exists in the upstream repo, or a stale lock left behind by a previous failed upgrade.
In my experience, the stale lock is by far the most frustrating variant. A failed upgrade leaves the release in a pending-upgrade state, and Helm refuses to do anything with that release until the state is cleared.
How to Identify It
helm list --all-namespaces --all
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
myapp production 3 2026-04-15 14:22:05 UTC pending-upgrade myapp-1.4.2 2.1.0
The pending-upgrade status is the tell. For dependency version issues, inspect your Chart.lock file:
cat charts/myapp/Chart.lock
dependencies:
- name: postgresql
repository: https://charts.bitnami.com/bitnami
version: 12.5.6
digest: sha256:3a7f1c2d...
generated: "2025-09-10T08:14:22.331Z"
If that pinned version is no longer in the repo index, helm dependency update will fail:
Error: no chart version found for postgresql-12.5.6
How to Fix It
For the stale lock, roll back to the last known good revision:
helm rollback myapp 2 --namespace production
If rollback also fails because the release state is truly corrupted, you can forcibly delete the release secret and reinstall. Helm stores release state in secrets named sh.helm.release.v1.<release-name>.v<revision>:
kubectl get secrets -n production | grep helm.release
kubectl delete secret sh.helm.release.v1.myapp.v3 -n production
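If you script this cleanup, it helps to select the stuck revision programmatically rather than hardcoding the secret name. A small sketch, assuming you've already captured the secret names from kubectl (the helper name is hypothetical):

```python
import re

# Helm v3 stores one secret per revision: sh.helm.release.v1.<name>.v<revision>
SECRET_RE = re.compile(r"^sh\.helm\.release\.v1\.(?P<name>.+)\.v(?P<rev>\d+)$")

def latest_release_secret(secret_names, release):
    """Pick the highest-revision secret for a given release name."""
    revisions = []
    for name in secret_names:
        m = SECRET_RE.match(name)
        if m and m.group("name") == release:
            revisions.append((int(m.group("rev")), name))
    return max(revisions)[1] if revisions else None

secrets = [
    "sh.helm.release.v1.myapp.v1",
    "sh.helm.release.v1.myapp.v2",
    "sh.helm.release.v1.myapp.v3",   # the stuck pending-upgrade revision
    "sh.helm.release.v1.otherapp.v7",
]
print(latest_release_secret(secrets, "myapp"))  # sh.helm.release.v1.myapp.v3
```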
For dependency version conflicts, update your Chart.yaml to reference a version that exists in the current upstream index, then regenerate the lock file:
helm repo update
helm dependency update ./charts/myapp
Root Cause 3: CRD Not Installed
Custom Resource Definitions have to exist in the cluster before Helm can create resources that reference them. If you're deploying something like a Prometheus stack, a cert-manager Issuer, or an Istio VirtualService, and the underlying operator or CRD set hasn't been installed yet, Helm will fail immediately with a "no matches for kind" error.
This catches teams off guard because the chart looks perfectly fine in a dry run against a cluster that already has the CRDs. Then you deploy to a fresh cluster — a new environment, a DR site, a CI ephemeral cluster — and it blows up on the very first resource. The chart didn't change. The cluster is the difference.
How to Identify It
The error is usually unambiguous:
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest:
[unable to recognize "": no matches for kind "Certificate" in version "cert-manager.io/v1",
unable to recognize "": no matches for kind "Issuer" in version "cert-manager.io/v1"]
Verify which CRDs are currently installed:
kubectl get crds | grep cert-manager
If that returns nothing, the CRDs aren't there. You can also audit available API resources:
kubectl api-resources | grep cert-manager.io
How to Fix It
Install the CRDs before running the chart that depends on them. Most operators ship their CRDs either as a standalone manifest or via a Helm chart values flag. For cert-manager:
helm upgrade --install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true
Note the --set installCRDs=true flag. Many operator charts gate CRD installation behind a values flag that defaults to false. Forgetting it is one of the most common sources of this failure, and I've seen it bite even experienced engineers who are moving fast.
For charts that bundle CRDs in their own crds/ directory, Helm installs them before other resources automatically — but only on first install, not on upgrades. If you're upgrading a chart and its CRD schema changed, you need to apply the updated CRDs manually first:
kubectl apply -f ./charts/myapp/crds/
Make this step idempotent by using kubectl apply rather than kubectl create, and include it as an explicit step in your bootstrap and upgrade runbooks.
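The CRD preflight check itself is just a set difference against the output of kubectl get crds. A minimal sketch; the required list, helper names, and sample output here are illustrative:

```python
REQUIRED_CRDS = [
    "certificates.cert-manager.io",
    "issuers.cert-manager.io",
    "clusterissuers.cert-manager.io",
]

def installed_crds(kubectl_output):
    """Parse the NAME column from `kubectl get crds` output (header row skipped)."""
    lines = kubectl_output.strip().splitlines()
    return {line.split()[0] for line in lines[1:]}

def missing_crds(required, kubectl_output):
    """Return required CRDs (plural.group) that are absent from the cluster."""
    return sorted(set(required) - installed_crds(kubectl_output))

# Simulated `kubectl get crds` output from a cluster missing most of cert-manager
sample = """\
NAME                                CREATED AT
certificates.cert-manager.io        2026-04-15T10:01:00Z
ingressclasses.networking.k8s.io    2026-04-15T09:00:00Z
"""
print(missing_crds(REQUIRED_CRDS, sample))
# ['clusterissuers.cert-manager.io', 'issuers.cert-manager.io']
```

Failing the pipeline when this list is non-empty catches the fresh-cluster case before Helm ever runs.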
Root Cause 4: RBAC Preventing the Deploy
Helm runs with the permissions of whatever service account or kubeconfig credentials you're using. In a CI/CD pipeline this is typically a dedicated service account, and if that service account doesn't have the right Role or ClusterRole bindings, the deploy will fail with a Forbidden error partway through.
What makes RBAC failures particularly annoying is that they often don't surface until Helm tries to create a specific resource type. The chart might create 15 resources successfully and then fail on the 16th because the service account can't create, say, a ClusterRoleBinding or a PersistentVolumeClaim. At that point you've got a partial deployment in the cluster — some resources created, some not — and a failed release state.
How to Identify It
The error message is usually explicit about which permission is missing:
Error: INSTALLATION FAILED: failed to create resource: clusterrolebindings.rbac.authorization.k8s.io
is forbidden: User "system:serviceaccount:ci:helm-deployer" cannot create resource
"clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
You can verify a specific permission directly:
kubectl auth can-i create clusterrolebindings \
--as=system:serviceaccount:ci:helm-deployer
no
For a full audit of what the service account can and can't do across a namespace:
kubectl auth can-i --list \
--as=system:serviceaccount:ci:helm-deployer \
-n production
How to Fix It
You have two approaches. The first is granting the deploying service account a broad ClusterRole that covers the resource types your charts create. Here's a working example:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: helm-deployer
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io", "rbac.authorization.k8s.io",
"policy", "autoscaling", "storage.k8s.io"]
resources: ["*"]
verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: helm-deployer
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: helm-deployer
subjects:
- kind: ServiceAccount
name: helm-deployer
namespace: ci
The second approach — and the one I prefer for production — is to scope permissions down to namespace-level Role/RoleBinding pairs for each deployment namespace, with ClusterRole bindings only for the cluster-scoped resources the chart actually needs. It takes more upfront work but avoids running your CI pipeline with effectively unrestricted cluster access. Minimum viable permissions are worth the effort to figure out.
Apply the RBAC manifests first, then retry the Helm deployment.
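You can also derive the preflight checks from the rendered manifest itself, so the list of permissions to verify stays in sync with what the chart actually creates. A hypothetical sketch (function name and defaults are illustrative; a thorough version would cover verbs beyond create and map kinds to their proper API resources):

```python
import re

def preflight_can_i(rendered_manifest, sa="helm-deployer", sa_ns="ci",
                    target_ns="production"):
    """Turn `helm template` output into `kubectl auth can-i` preflight commands."""
    # Collect every distinct resource kind the chart would create.
    kinds = sorted(set(re.findall(r"(?m)^kind:\s*(\S+)\s*$", rendered_manifest)))
    return [
        f"kubectl auth can-i create {kind.lower()} "
        f"--as=system:serviceaccount:{sa_ns}:{sa} -n {target_ns}"
        for kind in kinds
    ]

manifest = """\
apiVersion: apps/v1
kind: Deployment
---
apiVersion: v1
kind: Service
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
"""
for cmd in preflight_can_i(manifest):
    print(cmd)
```

Running the emitted commands before the deploy surfaces the Forbidden error up front, instead of after 15 resources have already been created.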
Root Cause 5: Rollout Timeout
When run with the --wait flag (which --atomic implies), Helm waits for deployments to become ready before marking an install or upgrade as successful. If your pods don't reach a ready state within the timeout window — which defaults to five minutes — Helm reports the install as failed, even though the resources were technically created in the cluster.
This is one of those failures that feels like a Helm problem but is actually a Kubernetes scheduling or application problem. The chart deployed fine. The pods just never came up. You need to diagnose what's happening at the pod level, not at the Helm level.
How to Identify It
Error: INSTALLATION FAILED: timed out waiting for the condition
After this error, check the actual pod state immediately:
kubectl get pods -n production -l app.kubernetes.io/name=myapp
NAME READY STATUS RESTARTS AGE
myapp-7d8f9b4c6-x2kpj 0/1 ImagePullBackOff 0 6m
myapp-7d8f9b4c6-m9qlr 0/1 Pending 0 6m
Then describe the pod to see the events:
kubectl describe pod myapp-7d8f9b4c6-m9qlr -n production
Events:
Warning FailedScheduling 5m default-scheduler 0/3 nodes are available:
3 Insufficient cpu. preemption: 0/3 nodes are eligible
for preemption
For a CrashLoopBackOff, pull the logs from the failing container:
kubectl logs myapp-7d8f9b4c6-x2kpj -n production --previous
A misconfigured liveness or readiness probe is another frequent culprit. The pod starts fine but the probe path returns a non-200, and Kubernetes marks it unready indefinitely. Check the probe configuration in the describe output under the Liveness and Readiness sections.
How to Fix It
Fix the underlying pod issue first. Once the root cause is identified — whether it's a resource constraint, an image pull credential problem, a liveness probe misconfiguration, or a missing ConfigMap — address that directly.
For image pull failures specifically, ensure the pull secret is created in the right namespace and referenced in your values:
kubectl create secret docker-registry regcred \
--docker-server=registry.solvethenetwork.com \
--docker-username=infrarunbook-admin \
--docker-password='<token>' \
--namespace production
Then in your values file:
imagePullSecrets:
- name: regcred
If the application legitimately takes longer than five minutes to start — for example, an init container running a database migration — extend the Helm timeout to match reality:
helm upgrade --install myapp ./charts/myapp \
--namespace production \
--values ./values/prod.yaml \
--timeout 12m
Don't just crank up the timeout without understanding why the pods are slow to start. Bumping the number is sometimes the right call, but it shouldn't be your first move. Know what you're waiting for before deciding how long to wait for it.
Root Cause 6: Image Pull Errors
Image pull failures mean Kubernetes can't fetch the container image specified in the deployment. This might be a wrong tag, a missing pull secret, a registry authentication failure, or a network policy blocking egress from the cluster to the registry. It's closely related to rollout timeouts but the diagnosis path is different enough to call out separately.
How to Identify It
kubectl get events -n production --sort-by='.lastTimestamp' | grep -i pull
LAST SEEN TYPE REASON OBJECT MESSAGE
3m Warning Failed Pod/myapp-7d8f9b4c6-x2kpj Failed to pull image
"registry.solvethenetwork.com/myapp:v2.1.0":
unauthorized: access denied
3m Warning Failed Pod/myapp-7d8f9b4c6-x2kpj Error: ErrImagePull
How to Fix It
Verify the image tag exists in the registry before deploying. Confirm the pull secret is present in the correct namespace and either referenced in the pod spec via imagePullSecrets or attached to the default service account in that namespace. If your cluster is airgapped or behind a firewall, check that egress network policies allow traffic from the pod's namespace to the registry host on port 443.
Root Cause 7: Resource Quota Exceeded
Namespaces in production clusters often have ResourceQuota objects enforcing limits on CPU, memory, and object counts. When a Helm chart tries to create resources that would push the namespace over quota, Kubernetes rejects the request and Helm reports a failure. This one is easy to diagnose once you know to look for it.
How to Identify It
Error: INSTALLATION FAILED: failed to create resource: pods "myapp-7d8f9b4c6" is forbidden:
exceeded quota: production-quota, requested: requests.cpu=500m,
used: requests.cpu=3750m, limited: requests.cpu=4000m
Check current quota usage in the namespace:
kubectl describe resourcequota -n production
Name: production-quota
Namespace: production
Resource Used Hard
-------- ---- ----
pods 19 20
requests.cpu 3750m 4000m
requests.memory 14Gi 16Gi
limits.cpu 7500m 8000m
How to Fix It
Either reduce the resource requests in your values file to fit within the available headroom, scale down or remove other workloads in the namespace to free up capacity, or work with your cluster admin to raise the quota. Don't remove resource requests entirely to bypass the error — that creates noisy neighbor problems on shared clusters and removes the guardrails that quotas are there to enforce.
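The headroom arithmetic is easy to get wrong by hand because of the mixed units. A simplified sketch (the helpers are illustrative and deliberately do not implement the full Kubernetes quantity grammar, e.g. no Ei or exponent notation):

```python
def parse_cpu(q):
    """Parse a Kubernetes CPU quantity into millicores: '500m' -> 500, '2' -> 2000."""
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)

def parse_memory(q):
    """Parse a Kubernetes memory quantity into bytes (Ki/Mi/Gi only)."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(float(q[:-2]) * factor)
    return int(q)

def fits_quota(requested_cpu, used_cpu, hard_cpu):
    """True if a new CPU request fits within the remaining quota headroom."""
    return parse_cpu(used_cpu) + parse_cpu(requested_cpu) <= parse_cpu(hard_cpu)

# Values from the `kubectl describe resourcequota` output above
print(fits_quota("500m", "3750m", "4000m"))  # False: 3750m + 500m exceeds 4000m
print(fits_quota("250m", "3750m", "4000m"))  # True: exactly fills the quota
print(parse_memory("16Gi") - parse_memory("14Gi"))  # 2147483648 bytes (2Gi) of headroom
```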
Prevention
Most Helm deployment failures are preventable if you build a few practices into your pipeline before anything hits the cluster.
Run helm lint and helm template in CI against every chart change. This catches rendering errors, schema violations, and template logic bugs before they ever touch a real cluster. Follow that with a --dry-run against the actual target cluster — not just locally — because a local dry run won't catch API version mismatches or RBAC gaps:
helm upgrade --install myapp ./charts/myapp \
--namespace production \
--values ./values/prod.yaml \
--dry-run --debug
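One way to wire these checks into a single CI gate is a small wrapper that runs each command in order and stops at the first failure. A sketch, with hypothetical chart and values paths; the runner parameter exists so the sequencing logic can be exercised without a cluster:

```python
import subprocess

# Ordered preflight: lint, render, then a dry run against the target cluster.
# Paths here are placeholders for your own chart and values files.
PREFLIGHT = [
    ["helm", "lint", "./charts/myapp", "--values", "./values/prod.yaml"],
    ["helm", "template", "myapp", "./charts/myapp",
     "--values", "./values/prod.yaml"],
    ["helm", "upgrade", "--install", "myapp", "./charts/myapp",
     "--namespace", "production", "--values", "./values/prod.yaml",
     "--dry-run", "--debug"],
]

def run_preflight(commands, runner=subprocess.run):
    """Run each preflight command, returning the first failing command (or None)."""
    for cmd in commands:
        if runner(cmd).returncode != 0:
            return " ".join(cmd)
    return None
```

In the pipeline you would fail the job whenever run_preflight returns a command string instead of None.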
Pin chart dependencies. Floating version ranges in Chart.yaml are fine for development, but commit a Chart.lock file for production and treat it like a lockfile for any other package manager. This prevents upstream chart changes from silently breaking your deploys on a Tuesday morning.
Maintain an explicit CRD bootstrap step in your cluster provisioning runbook and in your CI pipeline for ephemeral environments. Document exactly which CRDs each chart depends on, and always apply them with kubectl apply so the step is idempotent regardless of whether the CRD already exists.
Audit RBAC permissions proactively. Run kubectl auth can-i --list against your CI service account in each deployment namespace as a preflight check in the pipeline. Catching a missing permission before the deploy starts is far less painful than diagnosing a partial deployment after the fact.
Set realistic timeouts for your workloads. If an application takes eight minutes to initialize, document that and set your Helm timeout to twelve minutes with some headroom. Pair this with well-configured readiness probes so Kubernetes accurately reflects whether a pod is actually ready to serve traffic — not just that the process started.
For production pipelines, use --atomic to get automatic rollback on failure:
helm upgrade --install myapp ./charts/myapp \
--namespace production \
--values ./values/prod.yaml \
--atomic \
--timeout 10m \
--cleanup-on-fail
The --atomic flag rolls the release back to the previous good revision automatically if the deployment fails, leaving the cluster in a known state. --cleanup-on-fail removes any resources created during a failed install so you don't end up with orphaned objects. Together they prevent the stale lock and partial deployment problems that make debugging so frustrating. Use them in every automated pipeline deployment and you'll eliminate an entire class of cluster state issues.
