Symptoms
The most obvious sign is the red OutOfSync badge sitting next to your application name in the ArgoCD UI. It's deceptively simple — that one badge can mean a dozen different things. You might also see the app flip to Degraded health at the same time, which usually means the sync didn't just drift, it actively failed. If auto-sync is enabled, you'll watch the retry counter climb. Notifications fire. People start pinging you in Slack.
From the CLI, argocd app list is your first stop:
NAME CLUSTER NAMESPACE PROJECT STATUS HEALTH SYNCPOLICY CONDITIONS
my-api-service https://kubernetes.default.svc production default OutOfSync Healthy Auto <none>
Running argocd app get my-api-service gives you the detail you actually need:
Name: my-api-service
Project: default
Server: https://kubernetes.default.svc
Namespace: production
URL: https://argocd.solvethenetwork.com/applications/my-api-service
Repo: git@github.com:solvethenetwork/k8s-manifests.git
Target: main
Path: apps/my-api-service
SyncWindow: Sync Allowed
Sync Policy: Automated
Sync Status: OutOfSync from main (a3f1c29)
Health Status: Healthy
GROUP KIND NAMESPACE NAME STATUS HEALTH HOOK MESSAGE
apps Deployment production my-api-service OutOfSync Healthy ...
Pay close attention to the CONDITION block. A ComparisonError means ArgoCD couldn't even generate the desired state to compare against — it never got that far. An actual diff listed under the resource table means ArgoCD can see what it wants, it just doesn't match what's live. Those two scenarios have completely different root causes, and this distinction will save you a lot of time.
Root Cause 1: Git Repository Not Accessible
In my experience, this is the one that catches teams off guard months into a smooth-running deployment. A security scan rotates SSH deploy keys across the organization. A personal access token hits its 90-day expiry. Someone renames the GitHub repository and forgets to update the ArgoCD repo registration. Or a network policy gets tightened and now blocks the argocd-repo-server pod from reaching out to your Git host.
The tell-tale sign is a ComparisonError in the conditions block of argocd app get:
CONDITION        MESSAGE                                                                  LAST TRANSITION
ComparisonError  rpc error: code = Unknown desc = error testing repository connectivity:  2026-04-18 09:12:34 +0000 UTC
                 ssh: handshake failed: ssh: unable to authenticate, attempted methods
                 [none publickey], no supported methods remain
You can also check the repo-server logs directly:
kubectl logs -n argocd deploy/argocd-repo-server | grep -i error
And list all registered repositories to see their current status:
argocd repo list
TYPE NAME REPO INSECURE STATUS MESSAGE
git git@github.com:solvethenetwork/k8s-manifests.git false Failed ssh: handshake failed: ...
To fix it: regenerate the deploy key in GitHub under repository Settings → Deploy Keys, then patch the Kubernetes secret and restart the repo-server. If you're using HTTPS with a token, update the password field in the repo secret:
kubectl -n argocd patch secret argocd-repo-creds-solvethenetwork \
--type='json' \
-p='[{"op":"replace","path":"/data/password","value":"'$(echo -n "ghp_newtoken_abc123" | base64)'"}]'
kubectl -n argocd rollout restart deploy/argocd-repo-server
Verify the fix by running argocd repo list again — the STATUS column should flip from Failed to Successful within a few seconds of the repo-server coming back up.
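If you register repositories declaratively instead of through the CLI, the credential lives in a labeled Secret that ArgoCD watches, and rotation becomes a kubectl apply. A minimal sketch — the Secret name and the key material are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: k8s-manifests-repo              # placeholder name
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository   # tells ArgoCD this Secret is a repo registration
stringData:
  type: git
  url: git@github.com:solvethenetwork/k8s-manifests.git
  sshPrivateKey: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    <rotated deploy key goes here>
    -----END OPENSSH PRIVATE KEY-----
```

Keeping this manifest in a sealed-secret or external-secret workflow means the next key rotation is a normal Git change rather than an emergency kubectl patch.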
Root Cause 2: RBAC Preventing Sync
ArgoCD has its own RBAC layer that lives entirely inside the argocd-rbac-cm ConfigMap and operates independently of Kubernetes RBAC. It controls who can do what to which applications, and it's common for these policies to drift out of alignment with how teams are actually structured — especially after an SSO reconfiguration or a project reorganization.
When RBAC blocks a sync, you'll see a PermissionDenied error immediately when you try to run the sync manually:
argocd app sync my-api-service
FATA[0001] rpc error: code = PermissionDenied desc = permission denied: applications, sync, default/my-api-service
Check the current policy to understand what's allowed:
kubectl -n argocd get configmap argocd-rbac-cm -o yaml
apiVersion: v1
data:
  policy.csv: |
    p, role:readonly, applications, get, */*, allow
    p, role:developer, applications, sync, staging/*, allow
    g, solvethenetwork:platform-team, role:admin
    g, solvethenetwork:developers, role:developer
  policy.default: role:readonly
kind: ConfigMap
In that example, the developer role can only sync staging applications. Anyone in the developers group who tries to touch production gets denied. The fix is to edit the ConfigMap and add the appropriate policy line:
kubectl -n argocd edit configmap argocd-rbac-cm
Add the required permission under policy.csv:
p, role:developer, applications, sync, production/*, allow
Or for a specific user account:
p, infrarunbook-admin, applications, sync, production/my-api-service, allow
ArgoCD watches the ConfigMap and picks up changes immediately — no restart required. Test the fix by re-running argocd app sync my-api-service with the affected account.
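Before editing the live ConfigMap, you can dry-run a policy decision offline with the argocd CLI's built-in RBAC checker — a sketch, assuming the argocd binary is installed; the role and app names follow the example above:

```shell
# Export the live policy to a local file, then ask the CLI whether the role
# would be allowed to sync the app. Neither command changes cluster state.
kubectl -n argocd get configmap argocd-rbac-cm \
  -o jsonpath='{.data.policy\.csv}' > policy.csv

argocd admin settings rbac can role:developer sync applications \
  'production/my-api-service' --policy-file policy.csv
```

It answers Yes or No, which makes it easy to confirm a policy edit does what you intended before and after touching the ConfigMap.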
Root Cause 3: Resource Health Check Failing
ArgoCD won't mark a sync complete — and won't report the application as Synced — while managed resources are in a Degraded health state. This design is intentional. It prevents ArgoCD from declaring victory when the cluster is actually broken. But it also means that a CrashLoopBackOff pod or a pending PVC can hold your sync status hostage even when the manifests themselves are perfectly correct.
The symptom here is that the resource table shows Synced (meaning the manifest was applied) but the health column shows Degraded:
argocd app get my-api-service
GROUP KIND NAMESPACE NAME STATUS HEALTH HOOK MESSAGE
apps Deployment production my-api-service Synced Degraded Deployment does not have minimum availability.
ReplicaSet production my-api-service-7f4d9b Synced Degraded
Pod production my-api-service-7f4d9b-x9p Synced Degraded Back-off restarting failed container
Drill in with standard kubectl commands to find the actual problem:
kubectl -n production describe deployment my-api-service
kubectl -n production get pods -l app=my-api-service
kubectl -n production logs my-api-service-7f4d9b-x9p --previous
Fix the underlying issue — wrong image tag, missing environment variable, insufficient memory limits, whatever kubectl logs tells you. Once the pods stabilize and the Deployment reaches minimum availability, ArgoCD updates the health status automatically and the OutOfSync condition resolves.
For Custom Resource Definitions that ArgoCD doesn't have built-in health checks for, you can define Lua health check scripts in argocd-cm. Without these, CRDs default to Progressing indefinitely and block your sync just the same:
kubectl -n argocd edit configmap argocd-cm
data:
  resource.customizations.health.batch_Job: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.succeeded ~= nil and obj.status.succeeded > 0 then
        hs.status = "Healthy"
        return hs
      end
    end
    hs.status = "Progressing"
    return hs
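Many operator-managed CRDs follow the standard Ready-condition convention, so one generic pattern covers a lot of ground. Below is a sketch for cert-manager's Certificate, assuming the resource reports a Ready condition in status.conditions — swap the group_Kind key in the data field for whatever CRD you're covering:

```yaml
data:
  resource.customizations.health.cert-manager.io_Certificate: |
    hs = {}
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "Ready" and condition.status == "True" then
          hs.status = "Healthy"
          hs.message = condition.message
          return hs
        end
        if condition.type == "Ready" and condition.status == "False" then
          hs.status = "Degraded"
          hs.message = condition.message
          return hs
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for Ready condition"
    return hs
```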
Root Cause 4: Hook Job Failing
ArgoCD sync hooks are typically Kubernetes Jobs annotated with argocd.argoproj.io/hook set to PreSync, Sync, PostSync, or SyncFail. When one of these jobs fails, ArgoCD marks the entire sync operation as failed and the application stays OutOfSync. The deployment never reaches the cluster. Everything stops.
I've seen this bite teams hard when they add a database migration job as a PreSync hook. The migration fails — DB not reachable from the new network segment, schema conflict with a prior half-applied migration, whatever — and now no subsequent deployments can happen until someone clears the blockage. People stare at the ArgoCD UI convinced it's a GitOps problem when it's actually an application operations problem.
The resource table makes the failure clear:
argocd app get my-api-service
GROUP KIND NAMESPACE NAME STATUS HEALTH HOOK MESSAGE
batch Job production my-api-service-pre-sync-hook Failed Degraded PreSync Job has reached the specified backoff limit
apps Deployment production my-api-service OutOfSync Healthy
Get the logs from the hook job pod to understand the actual failure:
kubectl -n production get pods -l job-name=my-api-service-pre-sync-hook
NAME READY STATUS RESTARTS AGE
my-api-service-pre-sync-hook-z7m2x 0/1 Error 0 8m
kubectl -n production logs my-api-service-pre-sync-hook-z7m2x
Error: dial tcp 10.0.1.45:5432: connect: connection refused
Database migration failed. Exiting.
Fix the root cause first — in this case, restore connectivity to the Postgres instance at 10.0.1.45. Then delete the failed job to clear the blockage and re-trigger the sync:
kubectl -n production delete job my-api-service-pre-sync-hook
argocd app sync my-api-service
To prevent stale failed jobs from accumulating and causing this problem repeatedly, always set a hook deletion policy in your hook manifest annotations:
annotations:
  argocd.argoproj.io/hook: PreSync
  argocd.argoproj.io/hook-delete-policy: HookSucceeded
HookSucceeded cleans up the job only on success, leaving failed jobs in place so you can debug them. BeforeHookCreation (the other common option) deletes any existing hook resource before creating a new one — useful when you want a clean slate on every sync without manual intervention.
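Putting the pieces together, a PreSync migration hook might look like the sketch below — the image and command are placeholders, not from the original runbook. backoffLimit and activeDeadlineSeconds keep a broken migration from retrying or hanging indefinitely:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-api-service-pre-sync-hook
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 2               # fail fast instead of retrying for minutes
  activeDeadlineSeconds: 300    # hard cap so a hung migration can't stall the sync forever
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/my-api-service-migrations:latest  # placeholder image
          command: ["/app/migrate", "up"]                               # placeholder command
```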
Root Cause 5: IgnoreDifferences Not Configured
This is, without question, the most common cause of a persistent OutOfSync status on an otherwise healthy application. External controllers mutate resources after ArgoCD syncs them. ArgoCD then sees a difference between what Git says and what lives in the cluster, marks the app OutOfSync, and — if auto-sync is enabled — tries to revert those mutations. This puts ArgoCD in a war with the external controller, and nobody wins.
The most frequent offenders: HPA scaling up replica counts beyond what the Deployment manifest specifies, cert-manager injecting annotations or rotating certificate secrets, Istio and Linkerd admission webhooks inserting sidecar containers, and Kubernetes itself normalizing fields like defaultMode on volume mounts from 0644 to 420 (decimal vs octal — a fun one to debug at midnight).
Run argocd app diff to see exactly what ArgoCD thinks is wrong:
argocd app diff my-api-service
===== apps/Deployment production/my-api-service ======
30c30
<     replicas: 3
---
>     replicas: 1
88c88
<           defaultMode: 420
---
>           defaultMode: 0644
Git has replicas: 1. The HPA scaled it to 3. ArgoCD wants to set it back to 1. If auto-sync is on, it will — and then the HPA will immediately scale it back to 3. You'll see this loop in the ArgoCD event history.
Fix it by adding ignoreDifferences to the Application spec:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-api-service
  namespace: argocd
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
        - /spec/template/spec/volumes/0/configMap/defaultMode
  source:
    repoURL: git@github.com:solvethenetwork/k8s-manifests.git
    targetRevision: main
    path: apps/my-api-service
  destination:
    server: https://kubernetes.default.svc
    namespace: production
Apply it and the OutOfSync status resolves within seconds:
kubectl -n argocd apply -f my-api-service-app.yaml
argocd app get my-api-service | grep "Sync Status"
Sync Status: Synced to main (a3f1c29)
For fields you want to ignore globally across all applications — like sidecar injection mutations — configure resource.customizations.ignoreDifferences in argocd-cm rather than duplicating the same ignoreDifferences block in every Application manifest.
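A sketch of that global configuration in argocd-cm — the first key ignores replica counts on every Deployment cluster-wide, and the all key (available in newer ArgoCD releases) ignores any field last written by a listed field manager:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Ignore replica counts on every Deployment, in every application.
  resource.customizations.ignoreDifferences.apps_Deployment: |
    jsonPointers:
      - /spec/replicas
  # Ignore any field owned by these controllers' server-side-apply field managers.
  resource.customizations.ignoreDifferences.all: |
    managedFieldsManagers:
      - kube-controller-manager
```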
Root Cause 6: Webhook Not Configured
This one isn't technically an error — it's a misconfiguration that causes confusion. Without a Git webhook pointing at your ArgoCD instance, ArgoCD falls back to polling the repository every 3 minutes (controlled by the timeout.reconciliation setting in argocd-cm, defaulting to 180 seconds). A commit lands in main. The developer checks ArgoCD. The app still shows the old commit hash. It looks stuck. They ping you.
Check when ArgoCD last reconciled versus when the commit landed:
argocd app get my-api-service | grep -A2 "Sync Status"
Sync Status: OutOfSync from main (a3f1c29)
git log --oneline -3
b7d2e91 (HEAD -> main, origin/main) fix: correct image tag for v2.4.1
a3f1c29 feat: add readiness probe
8c1b043 chore: bump resource limits
ArgoCD is still on a3f1c29 even though b7d2e91 is on main. Check whether a webhook is configured in your Git provider by looking at the ArgoCD server logs for incoming webhook events — if you see none at all, polling is the only mechanism in play.
Set up the webhook in GitHub under repository Settings → Webhooks → Add webhook. Set the payload URL to https://argocd.solvethenetwork.com/api/webhook, content type to application/json, and select the push event. Generate a random secret and configure it in ArgoCD:
kubectl -n argocd edit secret argocd-secret
Add the key webhook.github.secret with the base64-encoded value of your webhook secret. After that, ArgoCD will receive push notifications within milliseconds of a commit landing on the target branch — no more polling lag.
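A quick way to generate the shared secret and produce the base64 value the Secret's data field expects — a sketch; the key name follows the GitHub example above:

```shell
# Generate a 40-character random secret (paste the plain value into the
# GitHub webhook form) and base64-encode it for the argocd-secret data field.
WEBHOOK_SECRET=$(openssl rand -hex 20)
ENCODED=$(printf '%s' "$WEBHOOK_SECRET" | base64 | tr -d '\n')
printf 'plain:   %s\n' "$WEBHOOK_SECRET"
printf 'encoded: %s\n' "$ENCODED"
# Then run: kubectl -n argocd edit secret argocd-secret
# and add under data:
#   webhook.github.secret: <encoded value>
```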
Root Cause 7: Manifest Rendering Error
If you're using Helm or Kustomize, a broken values file or an invalid kustomization.yaml can prevent ArgoCD from generating manifests at all. There's nothing to compare against the live cluster state, so the app shows OutOfSync — but the real condition is a ComparisonError buried underneath.
argocd app get my-api-service
CONDITION MESSAGE
ComparisonError rpc error: code = Unknown desc = helm template . --name-template my-api-service
--namespace production ... exit status 1:
Error: render error in "my-api-service/templates/deployment.yaml":
template: my-api-service/templates/deployment.yaml:23:18:
executing "my-api-service/templates/deployment.yaml"
at <.Values.image.tag>: nil pointer evaluating interface {}.tag
Test rendering locally before pushing to catch these early:
# For Helm
helm template my-api-service ./chart -f values-production.yaml
# For Kustomize
kustomize build overlays/production
In this case, image.tag is missing from the production values file. Add it, push to main, and ArgoCD picks it up on the next poll or webhook trigger. The ComparisonError clears and the diff appears normally.
Prevention
Most of these causes are preventable with a bit of upfront work. Here's what I'd put in place on any new ArgoCD installation before it touches production.
- Configure Git webhooks on day one. Polling is a fallback, not an architecture. Webhooks give you sub-second sync triggers and eliminate the confusion of apparent drift that's really just a polling delay.
- Audit ignoreDifferences before go-live. After your first sync in a staging environment, run argocd app diff on each application and catalog every field that external controllers modify. Add those fields to ignoreDifferences before they become incidents.
- Set hook deletion policies on every hook resource. Never ship a hook Job without argocd.argoproj.io/hook-delete-policy. Stale failed jobs will block your next sync at the worst possible moment.
- Alert on repo credential expiry before it happens. If you're using short-lived tokens or SSH keys with expiry dates, set a calendar reminder or a CronJob that checks expiry and fires a Slack alert with enough lead time to rotate without an incident.
- Add argocd app wait to your CI pipelines. After pushing a change, run argocd app wait --sync --health --timeout 300 my-api-service and fail the pipeline if it doesn't converge. This catches sync failures before you declare the deployment successful.
- Define Lua health checks for all CRDs your operators use. Without them, custom resources default to Progressing indefinitely and will hold up syncs just like a degraded Deployment would.
- Set up Prometheus alerts on sync status. The argocd_app_info metric exposes sync and health labels. Alert on OutOfSync conditions that persist beyond a few minutes:
groups:
  - name: argocd.rules
    rules:
      - alert: ArgoCDAppOutOfSync
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD app {{ $labels.name }} stuck OutOfSync"
          description: "{{ $labels.name }} in project {{ $labels.project }} has been OutOfSync for over 5 minutes."
OutOfSync is ArgoCD telling you something doesn't match. Sometimes that's fine — a legitimate drift you're aware of. Often it's a signal that something quietly broke. The teams that catch it fast are the ones who built observability around it from the start, not after the first production incident.
