Symptoms
You run `systemctl start docker` and it hangs, then fails. Or maybe Docker was running fine until a kernel update, a reboot, or someone edited `/etc/docker/daemon.json`. Either way, the daemon is down and nothing Docker-related works.
Here's what you typically see on the surface:

- `systemctl status docker` shows `Active: failed` or spins in `activating (start)` indefinitely
- `/var/run/docker.sock` either doesn't exist or sits there dead with no process behind it
- `docker ps` returns: `Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?`
- Dependent services — docker-compose stacks, container monitoring agents, CI runners — are all dead alongside it
The journal logs are your first stop. Don't spend time guessing. Pull the logs and read the actual error before you touch anything else:
```
journalctl -u docker.service --no-pager -n 80
```
Almost every case I've worked through was diagnosed within the first ten lines of that output. The categories below map directly to the errors you'll actually see there. Read the error, match the pattern, fix the cause.
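If you triage these often, the read-the-error-first step can be scripted. The sketch below is my own shorthand, not an official tool: `classify_docker_error` is an invented name, and the patterns are illustrative signatures for the causes covered in this runbook; real error strings vary across Docker versions.

```shell
# Sketch of a journal triage helper. Function name and patterns are
# invented shorthand for the causes covered below.
classify_docker_error() {
  local logs="$1"
  case "$logs" in
    *"d_type support"*|*"driver not supported"*)             echo "storage/overlay (Cause 1 or 4)" ;;
    *"No chain/target/match"*|*"Failed to Setup IP tables"*) echo "iptables conflict (Cause 2)" ;;
    *"cgroup"*)                                              echo "cgroup configuration (Cause 3)" ;;
    *"bolt"*|*"invalid database"*)                           echo "corrupted state (Cause 5)" ;;
    *"bind: address already in use"*)                        echo "stale socket (Cause 6)" ;;
    *"no space left on device"*)                             echo "disk exhaustion (Cause 7)" ;;
    *)                                                       echo "unknown: read the full journal" ;;
  esac
}

# Feed it the tail of the journal when available
if command -v journalctl >/dev/null 2>&1; then
  classify_docker_error "$(journalctl -u docker.service --no-pager -n 80 2>/dev/null)"
fi
```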
Cause 1: Overlay Filesystem Error
Why It Happens
Docker's default storage driver is `overlay2`, which depends on the `overlay` kernel module. If that module isn't loaded — or if the underlying filesystem on `/var/lib/docker` doesn't support overlayfs — the daemon won't start. This comes up more often than you'd think. I've seen it happen after in-place OS upgrades where the kernel changed but `/etc/modules` wasn't preserved, and after migrations to a new storage backend where someone formatted `/var/lib/docker` on XFS without enabling `ftype=1`. Both cases look like a storage driver failure on the surface, but the root cause is one layer lower.
How to Identify It
In your journal output you'll see something like this:
```
Apr 20 09:14:22 sw-infrarunbook-01 dockerd[3812]: time="2026-04-20T09:14:22.441Z" level=error msg="failed to start daemon" error="error initializing graphdriver: driver not supported"
Apr 20 09:14:22 sw-infrarunbook-01 dockerd[3812]: time="2026-04-20T09:14:22.441Z" level=error msg="[graphdriver] prior storage driver overlay2 failed: driver not supported"
```
Or, if the XFS d_type support is missing:
```
Apr 20 09:14:22 sw-infrarunbook-01 dockerd[3812]: time="2026-04-20T09:14:22.510Z" level=error msg="failed to start daemon" error="error initializing graphdriver: overlay2: the backing xfs filesystem is formatted without d_type support, which leads to incorrect behavior. Reformat the filesystem with ftype=1 to enable d_type support."
```
Check whether the kernel module is loaded:
```
lsmod | grep overlay
```
No output means the module isn't loaded. If you're on XFS, also verify d_type support:
```
xfs_info /var/lib/docker | grep ftype
```
You want `ftype=1`. If it shows `ftype=0`, that's your problem, and it can't be fixed without reformatting.
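If you bake this check into host validation scripts, it helps to classify the `xfs_info` output rather than eyeball it. A minimal sketch, assuming the `ftype=` flag appears in the output as shown above; `dtype_status` is a hypothetical helper name:

```shell
# Hypothetical helper: classify xfs_info output by its ftype flag.
dtype_status() {
  case "$1" in
    *ftype=1*) echo "ok" ;;                  # d_type supported, overlay2 is safe
    *ftype=0*) echo "reformat-required" ;;   # no in-place fix; needs mkfs.xfs -n ftype=1
    *)         echo "not-applicable" ;;      # not XFS, or flag not reported
  esac
}

# Example run against the Docker data directory, when it's on XFS
if command -v xfs_info >/dev/null 2>&1; then
  dtype_status "$(xfs_info /var/lib/docker 2>/dev/null)"
fi
```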
How to Fix It
If the module is just not loaded, load it immediately and make it persistent across reboots:
```
# Load it now
modprobe overlay
# Persist across reboots
echo "overlay" >> /etc/modules
# Then restart Docker
systemctl start docker
```
If the XFS filesystem was formatted without `ftype=1`, you have to reformat it — there's no in-place fix. Back up anything you need from `/var/lib/docker` (pulled images will need to be re-pulled anyway), unmount the volume, reformat with `mkfs.xfs -n ftype=1 /dev/sdX`, remount, and start Docker fresh. If you're on ext4, this specific issue won't apply — ext4 supports d_type natively.
Cause 2: iptables Conflict
Why It Happens
Docker manages its own iptables rules to handle container networking — NAT for outbound traffic, forwarding between containers, port mapping. When the host is also running firewalld, nftables, or another network management tool that owns the same chains, conflicts arise. The most disruptive scenario is modern Linux distributions — RHEL 9, Debian 11+, Ubuntu 22.04+ — where `iptables` now points to `iptables-nft` by default, but Docker is still writing legacy iptables rules via the old backend. The two sets of rules don't coexist cleanly, and the daemon fails during network controller initialization.
How to Identify It
The journal output looks like this:
```
Apr 20 09:31:05 sw-infrarunbook-01 dockerd[4102]: time="2026-04-20T09:31:05.821Z" level=warning msg="could not change the host's network settings: could not create ip table rule in docker-forward"
Apr 20 09:31:05 sw-infrarunbook-01 dockerd[4102]: time="2026-04-20T09:31:05.901Z" level=error msg="failed to start daemon" error="network controller initialization failed: error creating default \"bridge\" network: Failed to Setup IP tables: Unable to enable SKIP DNAT rule: (iptables failed: iptables --wait -t nat -I DOCKER -i docker0 -j RETURN: iptables: No chain/target/match by that name."
```
Check which iptables backend is active and whether firewalld is in the picture:
```
update-alternatives --display iptables
systemctl is-active firewalld
```
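To make the backend check scriptable, you can parse the `update-alternatives` output instead of reading it by hand. A sketch, assuming the Debian-family output format that includes a "link currently points to" line; `iptables_backend` is a name I'm inventing here:

```shell
# Hypothetical helper: decide which iptables backend is active from
# `update-alternatives --display iptables` output.
iptables_backend() {
  # isolate the "link currently points to" line before matching, because
  # the full display output lists both backends as alternatives
  local current
  current="$(printf '%s\n' "$1" | grep 'link currently points to' || true)"
  case "$current" in
    *iptables-nft*)    echo "nft" ;;      # likely to clash with Docker's legacy rules
    *iptables-legacy*) echo "legacy" ;;   # the backend Docker handles reliably
    *)                 echo "unknown" ;;  # no alternatives entry on this distro
  esac
}

if command -v update-alternatives >/dev/null 2>&1; then
  echo "active backend: $(iptables_backend "$(update-alternatives --display iptables 2>/dev/null)")"
fi
```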
How to Fix It
The cleanest fix on systems where `iptables-nft` is the default is to switch to `iptables-legacy`, which Docker handles reliably:

```
update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
systemctl restart docker
```
If you need to keep `iptables-nft` and firewalld is the conflict, configure Docker's bridge interface in a trusted firewalld zone:

```
firewall-cmd --permanent --zone=trusted --add-interface=docker0
firewall-cmd --reload
systemctl restart docker
```
There's also the option of setting `"iptables": false` in `/etc/docker/daemon.json` to tell Docker to skip managing iptables entirely, but only do that if you're prepared to write and maintain all the NAT and forwarding rules yourself. That's rarely worth it outside of very specialized environments.
Cause 3: Cgroup v2 Issue
Why It Happens
Linux kernels since 5.2 support cgroup v2 (the unified hierarchy), and many distributions have made it the default. Docker and containerd work fine with cgroup v2, but only when both are configured to use the `systemd` cgroup driver. The classic failure scenario: Docker was installed on a cgroup v1 system, the kernel was upgraded, the system booted into cgroup v2 mode, and now Docker's internal configuration refers to a cgroup structure that no longer exists as expected. The daemon tries to initialize cgroup paths that simply aren't there.
I've also hit this when running Docker inside a VM or LXC container where the host's cgroup configuration doesn't match what the guest expects — particularly in nested virtualization setups.
How to Identify It
First, check which cgroup version is active:
```
stat -fc %T /sys/fs/cgroup/
```
If it returns `cgroup2fs`, you're on v2. If it returns `tmpfs`, you're on v1. Then look at the journal:

```
Apr 20 10:02:11 sw-infrarunbook-01 dockerd[4451]: time="2026-04-20T10:02:11.200Z" level=error msg="failed to start daemon" error="Devices cgroup isn't mounted"
Apr 20 10:02:11 sw-infrarunbook-01 containerd[4389]: time="2026-04-20T10:02:11.198Z" level=error msg="failed to handle event" error="failed to get OOM score for pid 4451: failed to read /proc/4451/oom_score_adj: no such process"
```
Check what cgroup driver is currently configured in both Docker and containerd:
```
cat /etc/docker/daemon.json
grep -A5 'runc' /etc/containerd/config.toml | grep -i cgroup
```
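The version probe folds neatly into a script you can run across a fleet. A minimal sketch: `cgroup_mode` is a hypothetical wrapper around the `stat -fc %T` probe shown above.

```shell
# Hypothetical wrapper: map the filesystem type of /sys/fs/cgroup to a
# cgroup version label, following the stat probe above.
cgroup_mode() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy; use the systemd cgroup driver
    tmpfs)     echo "v1" ;;       # legacy hierarchy
    *)         echo "unknown" ;;  # unexpected mount, or not a Linux host
  esac
}

cgroup_mode "$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)"
```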
How to Fix It
On a cgroup v2 system with systemd as init, set the cgroup driver to `systemd` in `/etc/docker/daemon.json`:

```
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
```
In `/etc/containerd/config.toml`, enable the systemd cgroup driver for the runc runtime:

```
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```
Always restart containerd before Docker — Docker's runtime depends on containerd being configured correctly first:
```
systemctl restart containerd
systemctl restart docker
```
Rolling back to cgroup v1 by adding `systemd.unified_cgroup_hierarchy=0` to your GRUB kernel parameters is technically possible, but it's a band-aid. Fix the driver configuration instead. Cgroup v2 is where the ecosystem is going, and fighting it costs more effort over time than embracing it now.
Cause 4: Storage Driver Misconfigured
Why It Happens
Docker supports multiple storage drivers: `overlay2`, `devicemapper`, `btrfs`, `zfs`, and `vfs`. The daemon reads its storage driver from `/etc/docker/daemon.json`, and if that config specifies a driver that isn't available — because the required kernel module is absent, a binary dependency isn't installed, or the filesystem doesn't support it — the daemon fails to initialize the graph driver and exits.
In my experience, this comes up most often when someone copies a `daemon.json` from one server to another without checking that the target has the same capabilities. It also surfaces when the storage driver config is technically valid but doesn't match the existing data under `/var/lib/docker`. If you change the storage driver after Docker has been running, existing image layers become inaccessible and the daemon may refuse to start or come up in a broken state.
How to Identify It
The journal error is usually explicit about what failed:
```
Apr 20 10:45:33 sw-infrarunbook-01 dockerd[5012]: time="2026-04-20T10:45:33.812Z" level=error msg="failed to start daemon" error="error initializing graphdriver: prior storage driver devicemapper failed: driver not supported"
Apr 20 10:45:33 sw-infrarunbook-01 dockerd[5012]: time="2026-04-20T10:45:33.901Z" level=error msg="failed to start daemon" error="error initializing graphdriver: unknown graphdriver: btrfs"
```
Check what's configured versus what's actually on disk:
```
cat /etc/docker/daemon.json
ls /var/lib/docker/
```
The subdirectory inside `/var/lib/docker` named after the driver — `overlay2/`, `devicemapper/`, `btrfs/` — tells you what driver wrote the existing state. If that doesn't match what's in `daemon.json`, you have a mismatch.
How to Fix It
If the configured driver is wrong and you want to use `overlay2`, update `/etc/docker/daemon.json`:

```
{
  "storage-driver": "overlay2"
}
```
Always validate the JSON syntax before restarting — a malformed `daemon.json` is one of the most common self-inflicted Docker failures and gives a completely opaque error:

```
python3 -m json.tool /etc/docker/daemon.json
```
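You can go one step further and refuse to restart the daemon at all unless the config parses. This is a sketch of that guard; `validate_daemon_json` and `safe_docker_restart` are names invented for this runbook, not Docker commands:

```shell
# Sketch: validate daemon.json before touching the service.
validate_daemon_json() {
  local cfg="${1:-/etc/docker/daemon.json}"
  # exits nonzero if the file is missing or not valid JSON
  python3 -m json.tool "$cfg" >/dev/null 2>&1
}

safe_docker_restart() {
  local cfg="${1:-/etc/docker/daemon.json}"
  if validate_daemon_json "$cfg"; then
    systemctl restart docker
  else
    echo "refusing to restart: $cfg is missing or not valid JSON" >&2
    return 1
  fi
}
```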
If you're switching drivers and want to preserve your existing images, export them first with `docker save`, clear `/var/lib/docker`, update the config, start Docker, and re-import with `docker load`. There's no lossless in-place driver migration — the layer formats are incompatible across drivers.
Cause 5: Corrupted Docker State
Why It Happens
Docker maintains its own internal state database under `/var/lib/docker`. This includes layer metadata, container state, network configuration, and volume references — most of it stored in boltDB files. If the daemon is killed mid-write — power loss, hard reboot, an OOM killer that takes out `dockerd` at the wrong moment — you end up with corrupted boltDB files, half-written image layers, or a broken containerd content store. I've also seen this happen when a host ran out of inodes rather than disk space, causing Docker to write partial or garbage data into its metadata files before hitting the inode ceiling.
How to Identify It
The journal will have errors referencing database operations failing or metadata that can't be parsed:
```
Apr 20 11:12:44 sw-infrarunbook-01 dockerd[5890]: time="2026-04-20T11:12:44.111Z" level=error msg="failed to start daemon" error="error loading config file: unexpected end of JSON input"
Apr 20 11:12:44 sw-infrarunbook-01 dockerd[5890]: time="2026-04-20T11:12:44.330Z" level=error msg="containerd: deleting container" error="bolt: invalid argument"
Apr 20 11:12:44 sw-infrarunbook-01 dockerd[5890]: time="2026-04-20T11:12:44.401Z" level=error msg="failed to start daemon" error="failed to create new content store: bolt DB /var/lib/docker/containerd/daemon/io.containerd.metadata.v1.bolt/meta.db: invalid database"
```
The key phrases: `bolt: invalid argument`, `invalid database`, `unexpected end of JSON input` when reading Docker's own files, and `failed to load container` during startup. Also check inode exhaustion — it's easy to miss:

```
df -ih /var/lib/docker
```
If inode usage is at 100%, Docker can't create new metadata entries and will silently corrupt what it can't finish writing.
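Because `df -ih` output is easy to misread in a hurry, a small parser helps in monitoring scripts. A sketch assuming the GNU coreutils `df` column layout; `inode_pct` is a hypothetical helper:

```shell
# Hypothetical helper: pull the IUse% value (5th field of the 2nd line)
# out of `df -i PATH` output. Assumes the GNU coreutils column layout.
inode_pct() {
  printf '%s\n' "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# Example run; fall back to the root filesystem if the path is absent
out="$(df -i /var/lib/docker 2>/dev/null || df -i /)"
pct="$(inode_pct "$out")"
if [ "${pct:-0}" -ge 90 ] 2>/dev/null; then
  echo "WARNING: inode usage at ${pct}%, Docker metadata writes will start failing"
fi
```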
How to Fix It
The reliable fix is to stop both services, back up anything irreplaceable, and clear the state directory:
```
systemctl stop docker
systemctl stop containerd
# Back up daemon config
cp /etc/docker/daemon.json /root/daemon.json.bak
# Remove corrupted state
rm -rf /var/lib/docker
rm -rf /var/lib/containerd
systemctl start containerd
systemctl start docker
```
Be clear with yourself about what this wipes: all pulled images, all stopped containers, and any named volumes stored at the default location. Images can be re-pulled. If you have stateful named volumes that weren't backed up externally, they're gone. In production environments running workloads from an orchestrator, this is usually fine — the orchestrator will reschedule containers and pull images fresh. But audit your volumes before you delete anything.
A more surgical approach involves deleting only the corrupted boltDB files while leaving image layer directories intact, but pinpointing exactly which files are corrupted is time-consuming and error-prone. In most cases, starting clean is faster and more reliable than trying to salvage a partially broken state tree.
Cause 6: Stale or Broken Docker Socket
Why It Happens
Docker listens on a Unix socket at `/var/run/docker.sock`. The daemon creates this socket at startup, and it should be owned by `root:docker` with mode `0660`. Sometimes this breaks: a security hardening script changed the socket permissions, someone manually modified it during debugging, or an incomplete previous startup left behind a stale socket file that the new process can't overwrite because it's owned differently. The daemon sees the address as already in use and refuses to bind.
How to Identify It
```
ls -la /var/run/docker.sock
```
Normal output looks like this:
```
srw-rw---- 1 root docker 0 Apr 20 09:00 /var/run/docker.sock
```
If the ownership or permissions are wrong, or if a socket file exists while the daemon is stopped, you'll see journal errors like:
```
Apr 20 11:44:01 sw-infrarunbook-01 dockerd[6201]: time="2026-04-20T11:44:01.301Z" level=error msg="failed to start daemon" error="can't create unix socket /var/run/docker.sock: listen unix /var/run/docker.sock: bind: address already in use"
```
How to Fix It
If a stale socket file is blocking the bind, remove it and start fresh — Docker will recreate it cleanly:
```
rm -f /var/run/docker.sock
systemctl start docker
```
If permissions are wrong after Docker is running, correct them directly:
```
chown root:docker /var/run/docker.sock
chmod 660 /var/run/docker.sock
```
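The expected-state check can be automated so a hardening script that breaks the socket gets caught early. A sketch; `socket_ok` is a made-up helper that string-matches an `ls -la` line against the healthy output shown above:

```shell
# Hypothetical check: does an `ls -la` line for docker.sock show the
# expected type and mode (srw-rw----) plus owner/group (root docker)?
socket_ok() {
  case "$1" in
    "srw-rw----"*" root docker "*) echo "ok" ;;
    *)                             echo "bad" ;;
  esac
}

# Example run against the live socket, when one exists
if [ -S /var/run/docker.sock ]; then
  socket_ok "$(ls -la /var/run/docker.sock)"
fi
```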
Cause 7: Disk Space Exhaustion
Why It Happens
This one feels obvious, but I've watched experienced engineers overlook it for far too long. Docker writes a lot of data to `/var/lib/docker` — image layers, container writable layers, build cache, and container log files. When the partition hosting that directory fills up, Docker can't write its state files or update metadata, and the daemon either refuses to start or crashes shortly after. The same problem hits when you've exhausted inodes without running out of raw disk space, which is especially common on hosts running many small containers that each create dozens of small files.
How to Identify It
```
# Check disk space
df -h /var/lib/docker
# Check inode usage
df -ih /var/lib/docker
# If Docker is partially functional
docker system df
```
Journal output for a full disk often surfaces as generic write errors rather than a clean "disk full" message:
```
Apr 20 12:01:33 sw-infrarunbook-01 dockerd[6601]: time="2026-04-20T12:01:33.402Z" level=error msg="Handler for POST /v1.44/containers/create returned error: write /var/lib/docker/overlay2/abc123/merged/tmp: no space left on device"
```
How to Fix It
If Docker is partially up, prune unused data first:
```
docker system prune -af --volumes
```
If the daemon won't start at all, identify what's consuming space and free it manually before attempting a restart:
```
du -sh /var/lib/docker/overlay2/* | sort -rh | head -20
```
Once you have enough headroom to start the daemon, run `docker system prune` to clean up properly. Long-term, mount `/var/lib/docker` on a dedicated volume, and cap container log sizes in `daemon.json` so you don't hit this again.
Prevention
Most Docker daemon failures are entirely preventable. Here's what I put in place on every host running Docker in production.
**Pin kernel modules at boot.** Add `overlay` to `/etc/modules` so it always loads regardless of what the initrd does during a kernel upgrade. It takes five seconds and eliminates an entire class of startup failures.
**Validate daemon.json before applying changes.** Run `python3 -m json.tool /etc/docker/daemon.json` after every edit. JSON syntax errors are silent until Docker reads the file at startup, and a missing comma will take the daemon down just as effectively as a kernel bug. Make validation part of your change procedure, not an afterthought.
**Put /var/lib/docker on a dedicated volume.** This protects the root filesystem from being filled by Docker's data and makes capacity expansion trivial. In any cloud environment, attach a separate block device and mount it at `/var/lib/docker` before Docker is ever installed or started on the host.
**Configure log rotation in daemon.json.** Without it, container logs grow unbounded. This is one of the most common causes of disk exhaustion on long-running hosts. Always include this in your baseline configuration:
```
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
```
**Schedule regular pruning.** Unused images, stopped containers, and dangling build cache accumulate faster than most teams realize. A weekly cron job handles it without any manual effort:
```
0 3 * * 0 root docker system prune -f --filter "until=168h" >> /var/log/docker-prune.log 2>&1
```
**Alert on the service state.** At minimum, alert on `systemctl is-active docker` returning anything other than `active`. If you're running Prometheus with node_exporter, the systemd collector exposes `node_systemd_unit_state{name="docker.service",state="active"}` as a metric you can alert on directly — no custom scripting needed.
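As an illustration, an alerting rule against that metric might look like the following. This is a sketch, assuming the node_exporter systemd collector is enabled and scraped; the group and alert names are mine, and you should adjust the `for` window and severity label to your own conventions:

```yaml
groups:
  - name: docker-daemon
    rules:
      - alert: DockerDaemonDown
        # fires when systemd reports docker.service as anything but active
        expr: node_systemd_unit_state{name="docker.service",state="active"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Docker daemon is not active on {{ $labels.instance }}"
```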
**Validate Docker after kernel upgrades in staging.** A kernel upgrade that flips the system from cgroup v1 to v2 will break Docker if the cgroup driver isn't set to `systemd`. Catching this on a staging host before rolling it to production costs minutes. Catching it in production at 2am costs much more. Make "does Docker start cleanly after kernel upgrade" a standard validation step in your patching runbook.
Docker daemon failures aren't mysterious. They're almost always caused by one of the issues above, and the journal logs will tell you which one within the first few lines. The real discipline is reading the error before you start changing things. Build that habit and you'll cut your mean time to resolution dramatically.
