Symptoms
You've deployed Traefik, pointed a domain at your server, and instead of a clean padlock you're staring at a certificate warning. Or worse — Traefik is silently failing to renew, and you only find out when a user emails to say the site is broken. The symptoms vary depending on where in the ACME lifecycle things went wrong:
- Browser shows "Your connection is not private" with NET::ERR_CERT_AUTHORITY_INVALID
- Traefik logs contain Unable to obtain ACME certificate for domains or Error obtaining certificate
- The acme.json file is empty or has no certificate entries for your domain
- Traefik dashboard shows the router active, but TLS is falling back to the default self-signed cert
- A freshly deployed service shows Certificate is not yet valid immediately after startup
- Renewals fail silently and the certificate expires without warning
In my experience, nearly every one of these failures traces back to a handful of predictable causes. Let's walk through each one systematically — why it happens, how to confirm it, and how to get past it.
Root Cause 1: ACME Challenge Failing
Why It Happens
Traefik uses the ACME protocol to request certificates from Let's Encrypt. With the default HTTP-01 challenge, Let's Encrypt fires a request to http://solvethenetwork.com/.well-known/acme-challenge/<token> and expects Traefik to serve back the correct token. If that request fails for any reason — a routing misconfiguration, middleware intercepting it, or a redirect chain eating the request before Traefik can respond — the challenge fails and you get no certificate.
I've seen this happen repeatedly on stacks where someone added a global redirect-to-HTTPS middleware at the entrypoint level. The ACME challenge arrives on port 80, gets redirected to 443, but 443 doesn't have a valid cert yet. It's a circular dependency. Traefik handles this with a special internal router called acme-http@internal that intercepts challenge requests before any user-defined middleware — but only if you haven't accidentally overridden it with a catch-all rule.
How to Identify It
Enable debug logging and watch the output during a certificate request:
traefik --log.level=DEBUG 2>&1 | grep -i acme
A failed HTTP-01 challenge looks like this in the logs:
time="2026-04-12T10:14:32Z" level=error msg="Unable to obtain ACME certificate for domains \"solvethenetwork.com\""
reason="acme: Error -> One or more domains had a problem:
[solvethenetwork.com] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized ::
Invalid response from http://solvethenetwork.com/.well-known/acme-challenge/Abc123XYZ..."
You can simulate exactly what Let's Encrypt does from any external machine:
curl -v http://solvethenetwork.com/.well-known/acme-challenge/test
If you get a redirect to HTTPS, a 404, or a connection error, the challenge path isn't reachable the way Let's Encrypt needs it to be.
How to Fix It
Don't apply your HTTPS redirect middleware at the entrypoint level. Apply it only to individual service routers. Traefik's acme-http@internal router handles challenge requests on port 80 automatically — the problem is when you define a global HTTP-to-HTTPS redirect in the static config that catches everything first. This config pattern is the culprit:
# traefik.yml — this blocks ACME challenges when applied globally
entryPoints:
web:
address: ":80"
http:
redirections:
entryPoint:
to: websecure
scheme: https
Remove the redirections block from the entrypoint and move it to the router level instead. Define a dedicated redirect middleware and attach it only to your service routers via Docker labels:
traefik.http.middlewares.redirect-https.redirectscheme.scheme=https
traefik.http.middlewares.redirect-https.redirectscheme.permanent=true
traefik.http.routers.myapp-http.middlewares=redirect-https
This leaves port 80 open for ACME while still redirecting real user traffic to HTTPS.
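Taken together, the labels sit on the service's Compose definition. A minimal sketch of what that might look like — the service name, router names, and image are illustrative, not from the original setup:

```yaml
services:
  myapp:
    image: myapp:latest   # illustrative image
    labels:
      - traefik.enable=true
      # HTTP router: ACME challenges pass through untouched via
      # acme-http@internal; everything else gets redirected.
      - traefik.http.routers.myapp-http.rule=Host(`solvethenetwork.com`)
      - traefik.http.routers.myapp-http.entrypoints=web
      - traefik.http.routers.myapp-http.middlewares=redirect-https
      - traefik.http.middlewares.redirect-https.redirectscheme.scheme=https
      - traefik.http.middlewares.redirect-https.redirectscheme.permanent=true
      # HTTPS router carries the certificate resolver.
      - traefik.http.routers.myapp-secure.rule=Host(`solvethenetwork.com`)
      - traefik.http.routers.myapp-secure.entrypoints=websecure
      - traefik.http.routers.myapp-secure.tls.certresolver=letsencrypt
```

The key design point is the split: redirection lives on the HTTP router, TLS and the resolver on the HTTPS router, and nothing global touches port 80.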
Root Cause 2: DNS Not Propagated
Why It Happens
Let's Encrypt resolves your domain's A record from multiple geographic vantage points before issuing a certificate. If you just pointed solvethenetwork.com at the new IP of sw-infrarunbook-01 and immediately triggered Traefik's ACME flow, there's a real chance Let's Encrypt is still seeing the old IP or getting NXDOMAIN from at least one of its resolvers.
DNS TTLs are the obvious culprit, but there's a less obvious one: some DNS providers have internal propagation delays that exceed their published TTL. I've seen providers advertise a 60-second TTL but take 5–10 minutes to push changes globally. Let's Encrypt will reject the challenge if even one of its resolvers can't reach your server at the resolved address.
How to Identify It
Check what different public resolvers currently see for your domain:
dig @8.8.8.8 solvethenetwork.com A +short
dig @1.1.1.1 solvethenetwork.com A +short
dig @9.9.9.9 solvethenetwork.com A +short
If you get different answers from different resolvers, propagation isn't complete yet. For a more authoritative check, query the nameservers directly:
# First find your authoritative nameservers:
dig solvethenetwork.com NS +short
# Then query one directly:
dig @ns1.provider.net solvethenetwork.com A +short
Also check the remaining TTL on the current record to estimate how long you have to wait:
dig solvethenetwork.com A | grep -i "IN.*A"
In Traefik logs, a DNS-related ACME failure usually appears as a timeout or an authorization error where Let's Encrypt reports it couldn't validate the domain at the expected IP.
How to Fix It
Wait for propagation. That's the real answer. Don't fight the TTL. The best mitigation is proactive: before making DNS changes, lower your TTL to 60 seconds well in advance — ideally 24 hours before the cutover — so that when you do switch the record, propagation completes quickly.
If you're already in this situation, confirm propagation is complete across all three resolvers above before restarting Traefik to trigger a fresh ACME request. Restarting before DNS is ready just wastes rate limit attempts, which brings us to the next problem.
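The "all resolvers agree" gate can be scripted so the restart decision isn't eyeballed. A minimal sketch of the decision logic — the answers dict is illustrative; in practice you'd fill it from the dig queries above:

```python
def propagation_complete(answers, expected_ip):
    """True only when every resolver returns exactly the new A record.

    answers: resolver address -> list of A records from
             `dig @<resolver> solvethenetwork.com A +short`.
    """
    return all(records == [expected_ip] for records in answers.values())

# Illustrative snapshot: one resolver is still serving the old record.
answers = {
    "8.8.8.8": ["203.0.113.10"],
    "1.1.1.1": ["203.0.113.10"],
    "9.9.9.9": ["198.51.100.7"],   # stale cache
}
print(propagation_complete(answers, "203.0.113.10"))  # → False: keep waiting
```

Only restart Traefik once this returns True for every resolver you check.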
Root Cause 3: Rate Limit Hit
Why It Happens
Let's Encrypt enforces rate limits that are surprisingly easy to hit in test environments or during repeated failed deployments. The limits that bite most often are: 5 duplicate certificate orders per week for the same set of domains, and 50 certificates per registered domain per week. If you've been iterating on a Traefik config — restarting the container, watching it fail, adjusting, restarting again — you can burn through all 5 duplicate-certificate attempts within an hour.
The registered domain limit is per eTLD+1, meaning all subdomains of solvethenetwork.com share the same weekly quota of 50 certificates. If you're managing many services under a single domain, you can approach this ceiling without realizing it.
How to Identify It
The rate limit error in Traefik logs is unmistakable:
time="2026-04-12T11:02:17Z" level=error msg="Unable to obtain ACME certificate"
reason="acme: Error -> One or more domains had a problem:
[solvethenetwork.com] acme: error: 429 :: urn:ietf:params:acme:error:rateLimited ::
Error finalizing order :: too many certificates already issued for exact set of domains"
You can also audit how many certificates Traefik has already obtained by inspecting acme.json:
cat /etc/traefik/acme.json | python3 -m json.tool | grep -c '"domain"'
Cross-reference that count with Let's Encrypt's published rate limit thresholds. The 429 HTTP status code in the ACME error is the definitive signal — once you see it, you're done until the weekly window rolls over.
How to Fix It
Switch to the Let's Encrypt staging environment while you're testing. It has much higher limits and uses a separate CA — you'll get an untrusted certificate, but that's exactly what you want during troubleshooting. Update your resolver in traefik.yml:
certificatesResolvers:
letsencrypt:
acme:
email: infrarunbook-admin@solvethenetwork.com
storage: /etc/traefik/acme.json
caServer: https://acme-staging-v02.api.letsencrypt.org/directory
httpChallenge:
entryPoint: web
Once the staging certificate appears in the browser (even as untrusted), your entire ACME flow is working correctly. Then swap back to the production CA URL, delete acme.json to force a fresh issuance, and restart Traefik. If you're already rate-limited in production, there's no shortcut — you have to wait. The window is rolling and tied to the timestamps of the failed requests in your logs, so check those to estimate when you'll be clear.
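Pulling those timestamps out of the logs can be automated. A minimal sketch following the article's heuristic of a rolling one-week window — the log lines below are illustrative, and only Traefik's default time="..." prefix format is assumed:

```python
import re
from datetime import datetime, timedelta

# Illustrative rate-limited log lines (not from a real deployment):
log_lines = [
    'time="2026-04-12T11:02:17Z" level=error msg="..." reason="... rateLimited ..."',
    'time="2026-04-12T11:20:05Z" level=error msg="..." reason="... rateLimited ..."',
]

def window_clears(lines, window=timedelta(days=7)):
    """Rough estimate: one week after the earliest rate-limited request."""
    stamps = [
        datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%SZ")
        for line in lines
        if "rateLimited" in line and (m := re.search(r'time="([^"]+)"', line))
    ]
    return min(stamps) + window if stamps else None

print(window_clears(log_lines))  # → 2026-04-19 11:02:17
```

Treat the result as a lower bound on the wait, not a guarantee — the authoritative accounting lives on Let's Encrypt's side.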
Root Cause 4: Port 80 Not Accessible
Why It Happens
The HTTP-01 challenge requires port 80 to be publicly reachable from the internet. This sounds obvious but fails in environments where port 80 is blocked at the cloud firewall, not published from the Docker container, or simply not bound because the Traefik entrypoint was never defined. Cloud providers don't open inbound ports by default. Security groups and firewall rules have to be explicitly configured, and this step gets skipped more often than you'd think.
The Docker publishing issue is another common one. The container is listening on port 80 internally, but because the ports mapping is missing from the Compose file, traffic from the internet never reaches it. Traefik appears to be running fine from inside the host, which makes this confusing to diagnose without external testing.
How to Identify It
Test port 80 connectivity from a machine that isn't sw-infrarunbook-01:
nc -zv solvethenetwork.com 80
curl -v --max-time 10 http://solvethenetwork.com/
A connection timeout confirms port 80 isn't reachable externally. On the server itself, verify Traefik is actually listening:
ss -tlnp | grep :80
You should see output like this:
LISTEN 0 128 0.0.0.0:80 0.0.0.0:* users:(("traefik",pid=12345,fd=10))
If nothing appears, Traefik isn't binding port 80 at all. Check the Docker port mappings:
docker inspect traefik | python3 -m json.tool | grep -A5 '"Ports"'
How to Fix It
Make sure your Docker Compose has both ports explicitly published:
services:
traefik:
image: traefik:v3.0
ports:
- "80:80"
- "443:443"
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- /etc/traefik/traefik.yml:/etc/traefik/traefik.yml
- /etc/traefik/acme.json:/etc/traefik/acme.json
Then open the firewall on the host. If you're using ufw:
ufw allow 80/tcp
ufw allow 443/tcp
ufw reload
ufw status verbose
On cloud providers, update the security group or VPC firewall rule to allow inbound TCP port 80 from 0.0.0.0/0. After making these changes, re-run the external connectivity test before triggering another ACME request. Don't waste another rate-limit attempt until you've confirmed port 80 responds.
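If nc and curl aren't handy on the external machine, a plain TCP connect test works too. A minimal sketch — the demo runs against a throwaway local listener so it's self-contained; in practice you'd call port_open("solvethenetwork.com", 80) from a host outside your network:

```python
import socket

def port_open(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Self-contained demo: OS-assigned local port instead of a real domain.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
print(port_open("127.0.0.1", port))   # → True: listener is up
srv.close()
print(port_open("127.0.0.1", port))   # → False: nothing listening anymore
```

A True here only proves TCP reachability — the curl test is still worth running to confirm Traefik answers HTTP on that port.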
Root Cause 5: Wrong Email in Config
Why It Happens
The email address in your ACME configuration is used by Let's Encrypt for expiry notifications and account registration. A malformed address causes account registration to fail outright. A valid-looking but wrong address won't block issuance, but it means you'll never receive expiry warnings — which is how certificates quietly expire in production.
This one is subtle because Let's Encrypt doesn't verify email by sending a confirmation link. But it does validate the format during account registration, and I've seen teams copy-paste a template config that still has a placeholder like admin@changeme.local or an empty string, then wonder why the ACME flow fails at the very first step.
How to Identify It
Grep your config for the email field:
grep -i email /etc/traefik/traefik.yml
Expected output:
email: infrarunbook-admin@solvethenetwork.com
Also inspect acme.json to see what email was used when the ACME account was originally registered — this can differ from what's currently in the config if the file predates a config change:
python3 -c "
import json, sys
with open('/etc/traefik/acme.json') as f:
data = json.load(f)
for resolver, content in data.items():
reg = content.get('Account', {}).get('Registration', {})
print(resolver, ':', reg.get('body', {}).get('contact', 'no contact found'))
"
A malformed email causes an explicit error during account registration:
time="2026-04-12T09:45:11Z" level=error msg="Unable to obtain ACME certificate"
reason="acme: error: 400 :: urn:ietf:params:acme:error:invalidEmail ::
Error creating new account :: contact email \"admin@\" is invalid"
How to Fix It
Correct the email in traefik.yml, then wipe acme.json and restart Traefik to force a fresh ACME account registration with the correct address. The file must be truncated rather than deleted if your volume mount expects it to exist:
sudo truncate -s 0 /etc/traefik/acme.json
sudo chmod 600 /etc/traefik/acme.json
Or if you prefer to recreate it cleanly:
sudo rm /etc/traefik/acme.json
sudo touch /etc/traefik/acme.json
sudo chmod 600 /etc/traefik/acme.json
The chmod 600 step isn't optional. Traefik logs a warning and may refuse to use acme.json if the file is world-readable, because it contains private key material. This is correct behavior — treat it as a feature, not an obstacle.
Root Cause 6: acme.json Permission or Ownership Issues
Why It Happens
Even if everything else is configured correctly, Traefik will fail to persist certificates if acme.json has wrong permissions or ownership. Traefik refuses to write to a file that's world-readable because of the private keys it stores. Conversely, if the file is owned by root but Traefik runs as a non-root UID inside the container, write attempts fail silently and the ACME flow appears to succeed in logs but produces nothing on disk.
How to Identify It
ls -la /etc/traefik/acme.json
# Correct output:
-rw------- 1 root root 4096 Apr 12 10:00 /etc/traefik/acme.json
The warning in Traefik logs when permissions are too open:
level=warning msg="The ACME certificate storage file /etc/traefik/acme.json
has been created with permissions 644, please use chmod 600"
How to Fix It
chmod 600 /etc/traefik/acme.json
chown root:root /etc/traefik/acme.json
If Traefik runs as a specific non-root user inside the container, find that UID and set ownership accordingly:
docker exec traefik id
# uid=65532(nonroot) gid=65532(nonroot)
sudo chown 65532:65532 /etc/traefik/acme.json
Restart Traefik after correcting permissions and verify the file size grows as certificates are written.
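A file-size check is coarse; listing the stored domains is more direct. A minimal sketch, assuming the Traefik v2/v3 storage layout (resolver name at the top level, with a Certificates array underneath — real files also carry certificate/key fields with PEM data):

```python
import json
import os
import tempfile

def cert_domains(path):
    """Return the main domain of every certificate persisted in acme.json."""
    with open(path) as f:
        data = json.load(f)
    return [
        cert.get("domain", {}).get("main", "?")
        for resolver in data.values()
        for cert in (resolver.get("Certificates") or [])
    ]

# Illustrative minimal file standing in for /etc/traefik/acme.json:
sample = {"letsencrypt": {"Certificates": [{"domain": {"main": "solvethenetwork.com"}}]}}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample, tmp)

print(cert_domains(tmp.name))  # → ['solvethenetwork.com']
os.unlink(tmp.name)
```

Run it against the real path after the restart: an empty list means Traefik still isn't writing, whatever the logs claim.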
Root Cause 7: Resolver Name Mismatch
Why It Happens
Traefik requires that the certResolver value on your router label or dynamic config exactly matches the resolver name defined under certificatesResolvers in your static config. It's case-sensitive. Define the resolver as letsencrypt in traefik.yml but label your container with certresolver=letsEncrypt, and Traefik will skip certificate issuance for that router entirely without logging a meaningful error. The router shows as active in the dashboard, but TLS falls back to the self-signed default cert.
How to Identify It
Query the Traefik API to inspect the router's TLS config:
curl -s http://sw-infrarunbook-01:8080/api/http/routers | python3 -m json.tool | grep -A10 '"tls"'
A correctly configured router shows:
"tls": {
"certResolver": "letsencrypt"
}
If certResolver is an empty string or the field is absent, the label value didn't match any defined resolver.
How to Fix It
Align the label value exactly with the resolver name in your static config:
# traefik.yml defines:
certificatesResolvers:
letsencrypt:
acme:
email: infrarunbook-admin@solvethenetwork.com
storage: /etc/traefik/acme.json
httpChallenge:
entryPoint: web
# Docker Compose label must match exactly:
traefik.http.routers.myapp-secure.tls.certresolver=letsencrypt
After fixing the label, redeploy the container. Traefik picks up the change and triggers a certificate request on the next router reload.
Prevention
Most of these failures are entirely preventable with a consistent deployment checklist. Here's what I build into every Traefik setup from the start.
Always validate with staging first. Before pointing production traffic anywhere near a new Traefik instance, set the caServer to the Let's Encrypt staging URL and confirm the full ACME flow completes. You'll get an untrusted certificate, but if the browser shows a cert issued by "Fake LE Intermediate X1," every critical path — DNS resolution, port 80 routing, challenge serving, acme.json writes — has been validated without spending production rate limits.
Confirm DNS propagation before deployment. Make it a formal step in your runbook. Query at least three public resolvers and verify they all return the correct IP for solvethenetwork.com before starting Traefik. Lower your TTL 24 hours ahead of a DNS cutover if you have that luxury.
Set acme.json permissions in your provisioning scripts. Don't rely on Traefik to create the file with correct permissions. Create it yourself during host setup:
install -m 600 -o root -g root /dev/null /etc/traefik/acme.json
Monitor certificate expiry proactively. Don't rely on Let's Encrypt expiry emails as your only alert. If Traefik's Prometheus metrics endpoint is enabled, alert on the traefik_tls_certs_not_after gauge:
# Alert when any cert expires in fewer than 14 days:
(traefik_tls_certs_not_after - time()) / 86400 < 14
This gives you visibility across every domain Traefik manages, and you'll catch renewal failures before they become production incidents.
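When the metrics endpoint isn't enabled, the same threshold check can run as a plain cron job. A minimal sketch of the arithmetic — the epoch values are illustrative, and in practice not_after would come from the certificate itself (e.g. openssl x509 -enddate):

```python
import time

def days_until_expiry(not_after_epoch, now=None):
    """Days remaining until a cert's notAfter timestamp (both Unix epochs)."""
    now = time.time() if now is None else now
    return (not_after_epoch - now) / 86400

# Fixed "now" so the example is deterministic:
now = 1_800_000_000
not_after = now + 10 * 86400                     # cert expiring 10 days out
print(days_until_expiry(not_after, now) < 14)    # → True: would fire the alert
```

This mirrors the PromQL expression above: same 86400-second divisor, same 14-day threshold.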
Consider DNS-01 challenges for wildcard certificates. If you're managing many subdomains under solvethenetwork.com, switching to DNS-01 challenges eliminates the port 80 dependency entirely and lets you issue wildcard certs (*.solvethenetwork.com) that cover all subdomains under a single certificate. Most major DNS providers have Traefik-compatible plugins available. The trade-off is that you need API credentials for your DNS provider in the Traefik config, so store those in a secret manager rather than directly in the Compose file.
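On the config side, the switch amounts to swapping httpChallenge for dnsChallenge in the resolver. A minimal sketch, assuming Cloudflare as the provider — substitute your own; each provider reads its API credentials from its own environment variables, which is where the secret-manager injection comes in:

```yaml
# traefik.yml — DNS-01 resolver sketch (provider is illustrative)
certificatesResolvers:
  letsencrypt:
    acme:
      email: infrarunbook-admin@solvethenetwork.com
      storage: /etc/traefik/acme.json
      dnsChallenge:
        provider: cloudflare
```

To actually request the wildcard, the router also needs explicit domains, e.g. traefik.http.routers.myapp-secure.tls.domains[0].main=solvethenetwork.com and traefik.http.routers.myapp-secure.tls.domains[0].sans=*.solvethenetwork.com.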
Review the Traefik changelog after every major upgrade. Certificate resolver configuration has changed between v1, v2, and v3. Fields have moved, defaults have changed, and deprecated keys sometimes stop working silently. After any major Traefik upgrade, validate your static config against the new schema before assuming certificates will continue to renew cleanly in the background.
Certificate failures are frustrating precisely because they're often silent until something breaks for users. Staging validation, pre-deployment DNS checks, locked-down acme.json permissions, and proactive expiry monitoring together eliminate nearly every surprise I've seen in production Traefik deployments.
