Symptoms
You run helm upgrade --install and it hangs. Or it fails immediately with a cryptic error. Or — the worst variant — it reports success but your pods never come up. Helm deployment failures come in a few distinct flavors, and the one you're staring at right now usually points to one of a handful of well-known root causes.
Here's what the failure surface looks like in practice:
Error: INSTALLATION FAILED: timed out waiting for the condition
Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "Certificate" in version "cert-manager.io/v1"
- Pods stuck in Pending, CrashLoopBackOff, or ImagePullBackOff
Error: INSTALLATION FAILED: Forbidden: User "system:serviceaccount:ci:helm-deployer" cannot create resource...
- The release shows as failed in helm list, leaving a lock that blocks all subsequent runs
These symptoms span several root causes. Let's walk through each one — why it happens, how to identify it, and how to fix it.
Root Cause 1: Values File Is Wrong
This is the most common cause I see in teams that are new to Helm or have recently refactored their chart structure. A values file that doesn't match what the chart expects will either cause a rendering error at install time or silently produce incorrect manifests that deploy broken workloads.
Why does it happen? Charts evolve. When someone bumps a chart version or restructures the values schema, the old values file doesn't always get updated in lockstep. Maybe a key got renamed from image.tag to image.version, or a nested block changed its structure entirely. Helm's templating engine will often just render an empty string or a zero-value rather than throwing an error, so the manifest looks valid but produces pods that fail at runtime. That silent failure mode is what makes this cause so insidious.
How to Identify It
Start with a dry run and inspect the rendered output:
helm upgrade --install myapp ./charts/myapp \
--values ./values/prod.yaml \
--dry-run --debug 2>&1 | head -150
If values are being ignored or misread, you'll often see default placeholders in the rendered YAML — things like an empty image repository, a replica count of zero, or a service port that maps to nothing. You can also check what Helm actually applied to the last release:
helm get values myapp --namespace production
Compare that output against what you intended to pass in. Then lint the chart with your values file explicitly specified:
helm lint ./charts/myapp --values ./values/prod.yaml
A real lint failure looks like this:
==> Linting ./charts/myapp
[ERROR] templates/deployment.yaml: image: Invalid value: "": image repository is required
1 chart(s) linted, 1 chart(s) failed
How to Fix It
Run helm show values ./charts/myapp to dump the chart's default values and compare them line by line with your overrides file. Look for keys in your file that don't appear anywhere in the defaults — those are likely stale or misspelled. If the chart ships a values.schema.json, helm lint will automatically run JSON schema validation and flag type mismatches and missing required fields.
Once you've corrected the values file, always do a dry run before applying. Don't skip the --debug flag — it prints the full rendered manifests and makes it obvious when a template produces unexpected output.
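The defaults-versus-overrides comparison is easy to script once both files are parsed (for example with PyYAML). Here's a minimal sketch; the function names and sample values are illustrative, and a real implementation would also need to handle overrides that nest deeper than the matching default key:

```python
def key_paths(d, prefix=""):
    """Flatten a nested dict of values into a set of dotted key paths."""
    paths = set()
    for key, value in d.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            paths |= key_paths(value, dotted + ".")
        else:
            paths.add(dotted)
    return paths

def stale_override_keys(defaults, overrides):
    """Return override key paths that don't exist anywhere in the chart defaults."""
    return sorted(key_paths(overrides) - key_paths(defaults))

# Example: the chart renamed image.tag to image.version, but the overrides
# file still sets the old key, so it silently does nothing.
defaults = {"image": {"repository": "myapp", "version": "2.1.0"}, "replicas": 3}
overrides = {"image": {"repository": "registry.example.com/myapp", "tag": "v2.1.0"}}
print(stale_override_keys(defaults, overrides))  # ['image.tag']
```

Wiring a check like this into CI turns the silent failure mode into a loud one: the pipeline fails on the stale key instead of deploying a pod with an empty image tag.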
Root Cause 2: Chart Version Conflict
Helm tracks releases by storing versioned secrets in the target namespace. When you try to install or upgrade, it compares what you're requesting against what's currently deployed. Version conflicts surface in a few different ways — an incompatible API version between the chart and your cluster Kubernetes version, a dependency chart pinned to a version that no longer exists in the upstream repo, or a stale lock left behind by a previous failed upgrade.
In my experience, the stale lock is by far the most frustrating variant. A failed upgrade leaves the release in a pending-upgrade state, and Helm refuses to do anything with that release until the state is cleared.
How to Identify It
helm list --all-namespaces --all
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
myapp production 3 2026-04-15 14:22:05 UTC pending-upgrade myapp-1.4.2 2.1.0
The pending-upgrade status is the tell. For dependency version issues, inspect your Chart.lock file:
cat charts/myapp/Chart.lock
dependencies:
- name: postgresql
repository: https://charts.bitnami.com/bitnami
version: 12.5.6
digest: sha256:3a7f1c2d...
generated: "2025-09-10T08:14:22.331Z"
If that pinned version is no longer in the repo index, helm dependency update will fail:
Error: no chart version found for postgresql-12.5.6
How to Fix It
For the stale lock, roll back to the last known good revision:
helm rollback myapp 2 --namespace production
If rollback also fails because the release state is truly corrupted, you can forcibly delete the release secret and reinstall. Helm stores release state in secrets named sh.helm.release.v1.<release-name>.v<revision>:
kubectl get secrets -n production | grep helm.release
kubectl delete secret sh.helm.release.v1.myapp.v3 -n production
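If you script this cleanup, it helps to select the stuck revision programmatically rather than hardcoding the secret name. A small sketch, assuming you've already captured the secret names from kubectl (the helper name is hypothetical):

```python
import re

# Helm v3 stores one secret per revision: sh.helm.release.v1.<name>.v<revision>
SECRET_RE = re.compile(r"^sh\.helm\.release\.v1\.(?P<name>.+)\.v(?P<rev>\d+)$")

def latest_release_secret(secret_names, release):
    """Pick the highest-revision secret for a given release name."""
    revisions = []
    for name in secret_names:
        m = SECRET_RE.match(name)
        if m and m.group("name") == release:
            revisions.append((int(m.group("rev")), name))
    return max(revisions)[1] if revisions else None

secrets = [
    "sh.helm.release.v1.myapp.v1",
    "sh.helm.release.v1.myapp.v2",
    "sh.helm.release.v1.myapp.v3",   # the stuck pending-upgrade revision
    "sh.helm.release.v1.otherapp.v7",
]
print(latest_release_secret(secrets, "myapp"))  # sh.helm.release.v1.myapp.v3
```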
For dependency version conflicts, update your Chart.yaml to reference a version that exists in the current upstream index, then regenerate the lock file:
helm repo update
helm dependency update ./charts/myapp
Root Cause 3: CRD Not Installed
Custom Resource Definitions have to exist in the cluster before Helm can create resources that reference them. If you're deploying something like a Prometheus stack, a cert-manager Issuer, or an Istio VirtualService, and the underlying operator or CRD set hasn't been installed yet, Helm will fail immediately with a "no matches for kind" error.
This catches teams off guard because the chart looks perfectly fine in a dry run against a cluster that already has the CRDs. Then you deploy to a fresh cluster — a new environment, a DR site, a CI ephemeral cluster — and it blows up on the very first resource. The chart didn't change. The cluster is the difference.
How to Identify It
The error is usually unambiguous:
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest:
[unable to recognize "": no matches for kind "Certificate" in version "cert-manager.io/v1",
unable to recognize "": no matches for kind "Issuer" in version "cert-manager.io/v1"]
Verify which CRDs are currently installed:
kubectl get crds | grep cert-manager
If that returns nothing, the CRDs aren't there. You can also audit available API resources:
kubectl api-resources | grep cert-manager.io
How to Fix It
Install the CRDs before running the chart that depends on them. Most operators ship their CRDs either as a standalone manifest or via a Helm chart values flag. For cert-manager:
helm upgrade --install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true
Note the --set installCRDs=true flag. Many operator charts gate CRD installation behind a values flag that defaults to false. Forgetting it is one of the most common sources of this failure, and I've seen it bite even experienced engineers who are moving fast.
For charts that bundle CRDs in their own crds/ directory, Helm installs them before other resources automatically — but only on first install, not on upgrades. If you're upgrading a chart and its CRD schema changed, you need to apply the updated CRDs manually first:
kubectl apply -f ./charts/myapp/crds/
Make this step idempotent by using kubectl apply rather than kubectl create, and include it as an explicit step in your bootstrap and upgrade runbooks.
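The CRD preflight check itself is just a set difference against the output of kubectl get crds. A minimal sketch; the required list, helper names, and sample output here are illustrative:

```python
REQUIRED_CRDS = [
    "certificates.cert-manager.io",
    "issuers.cert-manager.io",
    "clusterissuers.cert-manager.io",
]

def installed_crds(kubectl_output):
    """Parse the NAME column from `kubectl get crds` output (header row skipped)."""
    lines = kubectl_output.strip().splitlines()
    return {line.split()[0] for line in lines[1:]}

def missing_crds(required, kubectl_output):
    """Return required CRDs (plural.group) that are absent from the cluster."""
    return sorted(set(required) - installed_crds(kubectl_output))

# Simulated `kubectl get crds` output from a cluster missing most of cert-manager
sample = """\
NAME                                CREATED AT
certificates.cert-manager.io        2026-04-15T10:01:00Z
ingressclasses.networking.k8s.io    2026-04-15T09:00:00Z
"""
print(missing_crds(REQUIRED_CRDS, sample))
# ['clusterissuers.cert-manager.io', 'issuers.cert-manager.io']
```

Failing the pipeline when this list is non-empty catches the fresh-cluster case before Helm ever runs.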
Root Cause 4: RBAC Preventing the Deploy
Helm runs with the permissions of whatever service account or kubeconfig credentials you're using. In a CI/CD pipeline this is typically a dedicated service account, and if that service account doesn't have the right Role or ClusterRole bindings, the deploy will fail with a Forbidden error partway through.
What makes RBAC failures particularly annoying is that they often don't surface until Helm tries to create a specific resource type. The chart might create 15 resources successfully and then fail on the 16th because the service account can't create, say, a ClusterRoleBinding or a PersistentVolumeClaim. At that point you've got a partial deployment in the cluster — some resources created, some not — and a failed release state.
How to Identify It
The error message is usually explicit about which permission is missing:
Error: INSTALLATION FAILED: failed to create resource: clusterrolebindings.rbac.authorization.k8s.io
is forbidden: User "system:serviceaccount:ci:helm-deployer" cannot create resource
"clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
You can verify a specific permission directly:
kubectl auth can-i create clusterrolebindings \
--as=system:serviceaccount:ci:helm-deployer
no
For a full audit of what the service account can and can't do across a namespace:
kubectl auth can-i --list \
--as=system:serviceaccount:ci:helm-deployer \
-n production
How to Fix It
You have two approaches. The first is granting the deploying service account a broad ClusterRole that covers the resource types your charts create. Here's a working example:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: helm-deployer
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io", "rbac.authorization.k8s.io",
"policy", "autoscaling", "storage.k8s.io"]
resources: ["*"]
verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: helm-deployer
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: helm-deployer
subjects:
- kind: ServiceAccount
name: helm-deployer
namespace: ci
The second approach — and the one I prefer for production — is to scope permissions down to namespace-level Role/RoleBinding pairs for each deployment namespace, with ClusterRole bindings only for the cluster-scoped resources the chart actually needs. It takes more upfront work but avoids running your CI pipeline with effectively unrestricted cluster access. Minimum viable permissions are worth the effort to figure out.
Apply the RBAC manifests first, then retry the Helm deployment.
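You can also derive the preflight checks from the rendered manifest itself, so the list of permissions to verify stays in sync with what the chart actually creates. A hypothetical sketch (function name and defaults are illustrative; a thorough version would cover verbs beyond create and map kinds to their proper API resources):

```python
import re

def preflight_can_i(rendered_manifest, sa="helm-deployer", sa_ns="ci",
                    target_ns="production"):
    """Turn `helm template` output into `kubectl auth can-i` preflight commands."""
    # Collect every distinct resource kind the chart would create.
    kinds = sorted(set(re.findall(r"(?m)^kind:\s*(\S+)\s*$", rendered_manifest)))
    return [
        f"kubectl auth can-i create {kind.lower()} "
        f"--as=system:serviceaccount:{sa_ns}:{sa} -n {target_ns}"
        for kind in kinds
    ]

manifest = """\
apiVersion: apps/v1
kind: Deployment
---
apiVersion: v1
kind: Service
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
"""
for cmd in preflight_can_i(manifest):
    print(cmd)
```

Running the emitted commands before the deploy surfaces the Forbidden error up front, instead of after 15 resources have already been created.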
Root Cause 5: Rollout Timeout
When run with the --wait flag (which --atomic implies), Helm waits for deployments to become ready before marking an install or upgrade as successful. If your pods don't reach a ready state within the timeout window — which defaults to five minutes — Helm reports the install as failed, even though the resources were technically created in the cluster.
This is one of those failures that feels like a Helm problem but is actually a Kubernetes scheduling or application problem. The chart deployed fine. The pods just never came up. You need to diagnose what's happening at the pod level, not at the Helm level.
How to Identify It
Error: INSTALLATION FAILED: timed out waiting for the condition
After this error, check the actual pod state immediately:
kubectl get pods -n production -l app.kubernetes.io/name=myapp
NAME READY STATUS RESTARTS AGE
myapp-7d8f9b4c6-x2kpj 0/1 ImagePullBackOff 0 6m
myapp-7d8f9b4c6-m9qlr 0/1 Pending 0 6m
Then describe the pod to see the events:
kubectl describe pod myapp-7d8f9b4c6-m9qlr -n production
Events:
Warning FailedScheduling 5m default-scheduler 0/3 nodes are available:
3 Insufficient cpu. preemption: 0/3 nodes are eligible
for preemption
For a CrashLoopBackOff, pull the logs from the failing container:
kubectl logs myapp-7d8f9b4c6-x2kpj -n production --previous
A misconfigured liveness or readiness probe is another frequent culprit. The pod starts fine but the probe path returns a non-200, and Kubernetes marks it unready indefinitely. Check the probe configuration in the describe output under the Liveness and Readiness sections.
How to Fix It
Fix the underlying pod issue first. Once the root cause is identified — whether it's a resource constraint, an image pull credential problem, a liveness probe misconfiguration, or a missing ConfigMap — address that directly.
For image pull failures specifically, ensure the pull secret is created in the right namespace and referenced in your values:
kubectl create secret docker-registry regcred \
--docker-server=registry.solvethenetwork.com \
--docker-username=infrarunbook-admin \
--docker-password='<token>' \
--namespace production
Then in your values file:
imagePullSecrets:
- name: regcred
If the application legitimately takes longer than five minutes to start — for example, an init container running a database migration — extend the Helm timeout to match reality:
helm upgrade --install myapp ./charts/myapp \
--namespace production \
--values ./values/prod.yaml \
--timeout 12m
Don't just crank up the timeout without understanding why the pods are slow to start. Bumping the number is sometimes the right call, but it shouldn't be your first move. Know what you're waiting for before deciding how long to wait for it.
Root Cause 6: Image Pull Errors
Image pull failures mean Kubernetes can't fetch the container image specified in the deployment. This might be a wrong tag, a missing pull secret, a registry authentication failure, or a network policy blocking egress from the cluster to the registry. It's closely related to rollout timeouts but the diagnosis path is different enough to call out separately.
How to Identify It
kubectl get events -n production --sort-by='.lastTimestamp' | grep -i pull
LAST SEEN TYPE REASON OBJECT MESSAGE
3m Warning Failed Pod/myapp-7d8f9b4c6-x2kpj Failed to pull image
"registry.solvethenetwork.com/myapp:v2.1.0":
unauthorized: access denied
3m Warning Failed Pod/myapp-7d8f9b4c6-x2kpj Error: ErrImagePull
How to Fix It
Verify the image tag exists in the registry before deploying. Confirm the pull secret is present in the correct namespace and either referenced in the pod spec via imagePullSecrets or attached to the default service account in that namespace. If your cluster is airgapped or behind a firewall, check that egress network policies allow traffic from the pod's namespace to the registry host on port 443.
Root Cause 7: Resource Quota Exceeded
Namespaces in production clusters often have ResourceQuota objects enforcing limits on CPU, memory, and object counts. When a Helm chart tries to create resources that would push the namespace over quota, Kubernetes rejects the request and Helm reports a failure. This one is easy to diagnose once you know to look for it.
How to Identify It
Error: INSTALLATION FAILED: failed to create resource: pods "myapp-7d8f9b4c6" is forbidden:
exceeded quota: production-quota, requested: requests.cpu=500m,
used: requests.cpu=3750m, limited: requests.cpu=4000m
Check current quota usage in the namespace:
kubectl describe resourcequota -n production
Name: production-quota
Namespace: production
Resource Used Hard
-------- ---- ----
pods 19 20
requests.cpu 3750m 4000m
requests.memory 14Gi 16Gi
limits.cpu 7500m 8000m
How to Fix It
Either reduce the resource requests in your values file to fit within the available headroom, scale down or remove other workloads in the namespace to free up capacity, or work with your cluster admin to raise the quota. Don't remove resource requests entirely to bypass the error — that creates noisy neighbor problems on shared clusters and removes the guardrails that quotas are there to enforce.
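The headroom arithmetic is easy to get wrong by hand because of the mixed units. A simplified sketch (the helpers are illustrative and deliberately do not implement the full Kubernetes quantity grammar, e.g. no Ei or exponent notation):

```python
def parse_cpu(q):
    """Parse a Kubernetes CPU quantity into millicores: '500m' -> 500, '2' -> 2000."""
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)

def parse_memory(q):
    """Parse a Kubernetes memory quantity into bytes (Ki/Mi/Gi only)."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(float(q[:-2]) * factor)
    return int(q)

def fits_quota(requested_cpu, used_cpu, hard_cpu):
    """True if a new CPU request fits within the remaining quota headroom."""
    return parse_cpu(used_cpu) + parse_cpu(requested_cpu) <= parse_cpu(hard_cpu)

# Values from the `kubectl describe resourcequota` output above
print(fits_quota("500m", "3750m", "4000m"))  # False: 3750m + 500m exceeds 4000m
print(fits_quota("250m", "3750m", "4000m"))  # True: exactly fills the quota
print(parse_memory("16Gi") - parse_memory("14Gi"))  # 2147483648 bytes (2Gi) of headroom
```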
Prevention
Most Helm deployment failures are preventable if you build a few practices into your pipeline before anything hits the cluster.
Run helm lint and helm template in CI against every chart change. This catches rendering errors, schema violations, and template logic bugs before they ever touch a real cluster. Follow that with a --dry-run against the actual target cluster — not just locally — because a local dry run won't catch API version mismatches or RBAC gaps:
helm upgrade --install myapp ./charts/myapp \
--namespace production \
--values ./values/prod.yaml \
--dry-run --debug
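One way to wire these checks into a single CI gate is a small wrapper that runs each command in order and stops at the first failure. A sketch, with hypothetical chart and values paths; the runner parameter exists so the sequencing logic can be exercised without a cluster:

```python
import subprocess

# Ordered preflight: lint, render, then a dry run against the target cluster.
# Paths here are placeholders for your own chart and values files.
PREFLIGHT = [
    ["helm", "lint", "./charts/myapp", "--values", "./values/prod.yaml"],
    ["helm", "template", "myapp", "./charts/myapp",
     "--values", "./values/prod.yaml"],
    ["helm", "upgrade", "--install", "myapp", "./charts/myapp",
     "--namespace", "production", "--values", "./values/prod.yaml",
     "--dry-run", "--debug"],
]

def run_preflight(commands, runner=subprocess.run):
    """Run each preflight command, returning the first failing command (or None)."""
    for cmd in commands:
        if runner(cmd).returncode != 0:
            return " ".join(cmd)
    return None
```

In the pipeline you would fail the job whenever run_preflight returns a command string instead of None.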
Pin chart dependencies. Floating version ranges in Chart.yaml are fine for development, but commit a Chart.lock file for production and treat it like a lockfile for any other package manager. This prevents upstream chart changes from silently breaking your deploys on a Tuesday morning.
Maintain an explicit CRD bootstrap step in your cluster provisioning runbook and in your CI pipeline for ephemeral environments. Document exactly which CRDs each chart depends on, and always apply them with kubectl apply so the step is idempotent regardless of whether the CRD already exists.
Audit RBAC permissions proactively. Run kubectl auth can-i --list against your CI service account in each deployment namespace as a preflight check in the pipeline. Catching a missing permission before the deploy starts is far less painful than diagnosing a partial deployment after the fact.
Set realistic timeouts for your workloads. If an application takes eight minutes to initialize, document that and set your Helm timeout to twelve minutes with some headroom. Pair this with well-configured readiness probes so Kubernetes accurately reflects whether a pod is actually ready to serve traffic — not just that the process started.
For production pipelines, use --atomic to get automatic rollback on failure:
helm upgrade --install myapp ./charts/myapp \
--namespace production \
--values ./values/prod.yaml \
--atomic \
--timeout 10m \
--cleanup-on-fail
The --atomic flag rolls the release back to the previous good revision automatically if the deployment fails, leaving the cluster in a known state. --cleanup-on-fail removes any resources created during a failed install so you don't end up with orphaned objects. Together they prevent the stale lock and partial deployment problems that make debugging so frustrating. Use them in every automated pipeline deployment and you'll eliminate an entire class of cluster state issues.
