Symptoms
You log in to sw-infrarunbook-01 and df -h shows /var/lib/docker sitting at 94% utilization. Maybe a docker pull just failed mid-layer with write /var/lib/docker/overlay2/...: no space left on device. Containers are refusing to start. A build pipeline that was green yesterday is now dying. Your monitoring alert fired at 3 AM and you need to recover fast.
Docker disk usage is one of those problems that creeps up slowly and then hits you all at once. The daemon doesn't aggressively reclaim space on its own — it's designed to cache for speed and leave cleanup to the operator. If you haven't built cleanup into your workflow, the disk will fill. Every time.
Common things you'll see when this happens:
- docker pull fails with write /var/lib/docker/overlay2/...: no space left on device
- Container fails to start with Error response from daemon: mkdir ...: no space left on device
- Builds die mid-layer during a RUN step with no obvious error in the Dockerfile
- df -h shows / or a dedicated Docker partition at 90% or higher
- du -sh /var/lib/docker/* shows several gigabytes spread across overlay2, containers, and volumes
Before chasing individual causes, start with Docker's own accounting command. It gives you a breakdown by category and shows you exactly where to focus:
infrarunbook-admin@sw-infrarunbook-01:~$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 47 12 18.4GB 14.1GB (76%)
Containers 23 6 1.2GB 980MB (81%)
Local Volumes 31 8 9.7GB 6.3GB (64%)
Build Cache 186 0 3.2GB 3.2GB

That output tells a story. Over 32 GB is sitting on disk right now, and the vast majority of it is reclaimable. Let's go through each root cause systematically.
Root Cause 1: Unused Images Not Cleaned Up
This is the most common offender on any long-running Docker host. Every time you pull a new version of an image, run a build, or have a CI pipeline push updated tags, the old image layers stay on disk. Docker doesn't delete them automatically. The old layers are kept because Docker doesn't know whether another image or container might still reference them — but in practice, on a busy host, most of those layers are completely orphaned.
There are two categories to care about. Dangling images are untagged intermediate images — the kind produced when you rebuild an image and the old one loses its tag. Unreferenced images are fully tagged images that no running or stopped container is actually using. In my experience, a host that's been running a daily build pipeline for a few months can accumulate 30–50 image versions. That's easily 15–25 GB of layer data sitting idle.
How to Identify
infrarunbook-admin@sw-infrarunbook-01:~$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
registry.solvethenetwork.com/app/api v1.42 a3b4c5d6e7f8 2 hours ago 1.1GB
registry.solvethenetwork.com/app/api v1.41 9f8e7d6c5b4a 2 days ago 1.1GB
registry.solvethenetwork.com/app/api v1.40 1a2b3c4d5e6f 5 days ago 1.0GB
<none> <none> deadbeef1234 6 days ago 980MB
nginx 1.25 abcdef123456 1 week ago 192MB
nginx 1.24 fedcba654321 3 weeks ago 190MB
infrarunbook-admin@sw-infrarunbook-01:~$ docker images -f dangling=true
REPOSITORY TAG IMAGE ID CREATED SIZE
<none> <none> deadbeef1234 6 days ago 980MB

How to Fix
To remove only dangling images:
infrarunbook-admin@sw-infrarunbook-01:~$ docker image prune
WARNING! This will remove all dangling images.
Are you sure you want to continue? [y/N] y
Deleted Images:
deleted: sha256:deadbeef1234...
Total reclaimed space: 980MB

The more useful option — removing all images not referenced by any container, running or stopped:
infrarunbook-admin@sw-infrarunbook-01:~$ docker image prune -a
WARNING! This will remove all images without at least one container associated to them.
Are you sure you want to continue? [y/N] y
Deleted Images:
untagged: registry.solvethenetwork.com/app/api:v1.40
deleted: sha256:1a2b3c4d5e6f...
untagged: nginx:1.24
deleted: sha256:fedcba654321...
Total reclaimed space: 14.1GB

If you need to be selective — for example, keeping images from the last 24 hours — you can filter by age: docker image prune -a --filter until=24h.
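Age-based filters can still delete a tag you might want to roll back to. If you'd rather keep the newest N tags of a specific repository, docker images lists newest first, so a small helper can compute the removal list. A sketch; tags_to_remove is a made-up name, and you should eyeball its output before piping anything to docker rmi:

```shell
# Given image refs on stdin, newest first (the order `docker images` prints),
# pass through everything beyond the first N -- i.e. the tags safe to remove.
tags_to_remove() {
    keep="${1:-3}"                 # number of newest tags to keep
    tail -n +"$((keep + 1))"       # skip the first N lines, print the rest
}

# Usage (verify the list before removing anything):
#   docker images registry.solvethenetwork.com/app/api \
#       --format '{{.Repository}}:{{.Tag}}' | tags_to_remove 2 | xargs -r docker rmi
```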
Root Cause 2: Orphaned Volumes
Docker volumes are intentionally persistent. When a container is removed, its volume is not — that's by design, so you don't lose your database data just because a container restarted. But this means every docker rm without the -v flag leaves a volume behind. Over time, especially on hosts running docker-compose stacks that get torn down and rebuilt regularly, orphaned volumes accumulate quietly.
I've seen this happen constantly with teams that run short-lived compose stacks for testing or staging. They do docker-compose up -d, test something, then docker-compose down — not realizing that down does not remove named volumes by default. Do that a hundred times over a few months and you have a hundred orphaned volumes, some of which might be holding gigabytes of database files that will never be read again.
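To see how much disk the orphans are actually holding, you can join the dangling-volume list with du. A sketch; volume_sizes is a made-up helper, and it assumes the local driver's default data root:

```shell
# Print the on-disk size of each volume name read from stdin.
# Run as root: /var/lib/docker is not world-readable.
volume_sizes() {
    root="${1:-/var/lib/docker/volumes}"   # local-driver data root (default)
    while read -r vol; do
        if [ -d "$root/$vol" ]; then
            du -sh "$root/$vol"
        fi
    done
}

# Usage:
#   docker volume ls -qf dangling=true | volume_sizes
```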
How to Identify
infrarunbook-admin@sw-infrarunbook-01:~$ docker volume ls
DRIVER VOLUME NAME
local postgres_data_20240301
local postgres_data_20240315
local postgres_data_20240401
local redis_cache_old
local app_uploads_backup
local 7f3a1b2c4d5e6f7a8b9c0d1e2f3a4b5c
infrarunbook-admin@sw-infrarunbook-01:~$ docker volume ls -f dangling=true
DRIVER VOLUME NAME
local postgres_data_20240301
local postgres_data_20240315
local redis_cache_old
local 7f3a1b2c4d5e6f7a8b9c0d1e2f3a4b5c
infrarunbook-admin@sw-infrarunbook-01:~$ du -sh /var/lib/docker/volumes/*
4.1G /var/lib/docker/volumes/postgres_data_20240301
4.0G /var/lib/docker/volumes/postgres_data_20240315
210M /var/lib/docker/volumes/redis_cache_old
12K /var/lib/docker/volumes/7f3a1b2c4d5e6f7a8b9c0d1e2f3a4b5c

How to Fix
Before pruning volumes, verify the dangling ones are genuinely unused. Check what containers were using them and whether that data has been backed up or migrated. Named volumes that look like postgres_data_20240301 should be treated carefully — old doesn't mean unneeded. Once you're certain:
infrarunbook-admin@sw-infrarunbook-01:~$ docker volume prune
WARNING! This will remove anonymous local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
Deleted Volumes:
7f3a1b2c4d5e6f7a8b9c0d1e2f3a4b5c
Total reclaimed space: 12.29kB

Note that since Docker Engine 23.0, docker volume prune removes only anonymous volumes by default. Dangling named volumes like redis_cache_old require docker volume prune -a, or an explicit docker volume rm once you've confirmed they're safe to drop.
To remove specific named volumes you've verified as safe:
infrarunbook-admin@sw-infrarunbook-01:~$ docker volume rm postgres_data_20240301 postgres_data_20240315
postgres_data_20240301
postgres_data_20240315

Going forward, use docker-compose down -v whenever you want volumes cleaned up along with the stack. Build that habit into your teardown scripts from the start.
Root Cause 3: Build Cache Not Pruned
The Docker build cache is where intermediate image layers live during and after a build. Docker keeps these around so subsequent builds can reuse layers that haven't changed — this is what makes rebuilds fast when only your application code changes. The cache isn't free, though. On an active build host it can quietly grow to 20–40 GB, and it's easy to overlook because it doesn't appear in docker images at all.
BuildKit, the default builder since Docker Engine 23.0, maintains its own cache separate from the classic layer cache. If you're running BuildKit builds — likely on any modern Docker version — you need to account for both when investigating disk usage.
How to Identify
infrarunbook-admin@sw-infrarunbook-01:~$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 47 12 18.4GB 14.1GB (76%)
Containers 23 6 1.2GB 980MB (81%)
Local Volumes 31 8 9.7GB 6.3GB (64%)
Build Cache 186 0 3.2GB 3.2GB
infrarunbook-admin@sw-infrarunbook-01:~$ docker buildx du
ID RECLAIMABLE SIZE LAST ACCESSED
s3b4hk9f5qlxf2a1c7d8e9 true 1.2GB 2 hours ago
l1m2n3o4p5q6r7s8t9u0v1 true 890MB 6 hours ago
w2x3y4z5a6b7c8d9e0f1g2 true 780MB 1 day ago
...
Total: 3.2GB

How to Fix
To prune only dangling build cache — layers with no references to current images:
infrarunbook-admin@sw-infrarunbook-01:~$ docker builder prune
WARNING! This will remove all dangling build cache.
Are you sure you want to continue? [y/N] y
Deleted build cache objects:
s3b4hk9f5qlxf2a1c7d8e9
l1m2n3o4p5q6r7s8t9u0v1
Total reclaimed space: 2.09GB

To wipe the entire build cache — including layers that could theoretically speed up future builds:
infrarunbook-admin@sw-infrarunbook-01:~$ docker builder prune --all
WARNING! This will remove all build cache.
Are you sure you want to continue? [y/N] y
Total reclaimed space: 3.2GB

The tradeoff is that your next build will be slower — every layer builds from scratch. On a CI host that rebuilds from scratch on every run anyway, this is no loss at all. On a developer workstation where you rebuild frequently, be more selective with --filter until=24h to preserve recent cache entries.
Root Cause 4: Container Logs Not Rotated
By default, Docker's json-file log driver writes container stdout and stderr to /var/lib/docker/containers/<container-id>/<container-id>-json.log with no size limit and no rotation. A container that logs aggressively — a web server recording every HTTP request, a service with a runaway debug logger, or an application caught in an error loop — will write indefinitely until the disk is full.
I've seen this take down production hosts. A single Java service that had its log level accidentally set to DEBUG wrote 40 GB in under 18 hours. The container appeared perfectly healthy from a process standpoint. The disk did not.
How to Identify
infrarunbook-admin@sw-infrarunbook-01:~$ du -sh /var/lib/docker/containers/*/*-json.log | sort -rh | head -10
38G /var/lib/docker/containers/a1b2c3d4e5f6.../a1b2c3d4e5f6...-json.log
2.1G /var/lib/docker/containers/7f8e9d0c1b2a.../7f8e9d0c1b2a...-json.log
450M /var/lib/docker/containers/3c4d5e6f7a8b.../3c4d5e6f7a8b...-json.log
infrarunbook-admin@sw-infrarunbook-01:~$ docker ps --format "{{.ID}} {{.Names}}"
a1b2c3d4e5f6 api-service
7f8e9d0c1b2a nginx-proxy
3c4d5e6f7a8b worker

You can also look up the log path for a specific container directly:
infrarunbook-admin@sw-infrarunbook-01:~$ docker inspect --format='{{.LogPath}}' api-service
/var/lib/docker/containers/a1b2c3d4e5f6.../a1b2c3d4e5f6...-json.log
infrarunbook-admin@sw-infrarunbook-01:~$ ls -lh /var/lib/docker/containers/a1b2c3d4e5f6.../
total 38G
-rw-r----- 1 root root 38G Apr 18 03:22 a1b2c3d4e5f6...-json.log

How to Fix
For immediate disk recovery, truncate the log file without restarting the container. Don't delete it — the container process holds the file descriptor open and the space won't be freed until the process releases the handle:
infrarunbook-admin@sw-infrarunbook-01:~$ truncate -s 0 /var/lib/docker/containers/a1b2c3d4e5f6.../a1b2c3d4e5f6...-json.log

Then fix the root cause. Configure log rotation globally in /etc/docker/daemon.json:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}

Restart the Docker daemon to apply. Note this restarts all containers, so plan accordingly:
infrarunbook-admin@sw-infrarunbook-01:~$ systemctl restart docker

You can also set log options per-container in your compose file, which takes precedence over the daemon default and is useful when specific services need tighter or looser limits:
services:
  api-service:
    image: registry.solvethenetwork.com/app/api:v1.42
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"

Existing containers don't inherit daemon.json changes retroactively. You need to recreate them for new log settings to apply — with Compose, docker compose up -d --force-recreate does the job.
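If several containers are logging heavily at once, truncating files one at a time gets tedious. A sweep along these lines handles it in one pass. A sketch; truncate_large_logs is a made-up helper, and the path and threshold in the usage line are assumptions to adjust:

```shell
# Truncate every *-json.log under a directory that exceeds a size threshold.
# Run as root against /var/lib/docker/containers.
truncate_large_logs() {
    dir="$1"
    max_kb="${2:-1048576}"   # threshold in KiB; default 1 GiB
    # find -size +Nk matches files strictly larger than N KiB
    find "$dir" -name '*-json.log' -size +"${max_kb}"k -print |
    while read -r log; do
        echo "truncating $log"
        truncate -s 0 "$log"
    done
}

# Usage:
#   truncate_large_logs /var/lib/docker/containers 1048576
```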
Root Cause 5: Overlay Filesystem Fragmentation and Inode Exhaustion
The overlay2 storage driver maintains a directory per image layer and per container writable layer under /var/lib/docker/overlay2/. On a host that has created and destroyed many containers over time, two distinct and often misdiagnosed problems emerge: filesystem fragmentation, where allocated blocks are scattered inefficiently, and inode exhaustion, where the filesystem runs out of inodes even though block space appears available.
The inode problem is the one that catches people off guard. You run docker pull and it fails. You check df -h and see 40% free space. The host looks fine. Then you check df -i:
How to Identify
infrarunbook-admin@sw-infrarunbook-01:~$ df -h /var/lib/docker
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 200G 120G 80G 60% /
infrarunbook-admin@sw-infrarunbook-01:~$ df -i /var/lib/docker
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 12582912 12580344 2568 100% /
infrarunbook-admin@sw-infrarunbook-01:~$ ls /var/lib/docker/overlay2 | wc -l
18453

There it is. 100% inode usage, over 18,000 overlay2 directories, and plenty of block space. Docker can't create any new directories — every new container or layer creation fails. The host is effectively out of disk from Docker's perspective even though df -h looks fine.
For the deleted-but-open-file problem — where disk usage appears high but you can't account for it with du:
infrarunbook-admin@sw-infrarunbook-01:~$ lsof | grep deleted | grep docker
dockerd 1234 root 45u REG 8,1 4294967296 1234567 /var/lib/docker/containers/a1b2c3.../a1b2c3...-json.log (deleted)
infrarunbook-admin@sw-infrarunbook-01:~$ df -h /var/lib/docker
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 200G 198G 2.0G 99% /

The log file was deleted from the filesystem, but the container process still has it open. The kernel won't free those blocks until the file descriptor is closed — meaning until that container stops or restarts. This is a classic discrepancy between df and du.
How to Fix
For inode exhaustion, the primary fix is removing unused Docker objects to free directory entries. A full system prune is often the fastest path to recovery:
infrarunbook-admin@sw-infrarunbook-01:~$ docker system prune -a --volumes
WARNING! This will remove:
- all stopped containers
- all networks not used by at least one container
- all anonymous volumes not used by at least one container
- all images without at least one container associated to them
- all build cache
Are you sure you want to continue? [y/N] y
Total reclaimed space: 31.4GB
infrarunbook-admin@sw-infrarunbook-01:~$ df -i /var/lib/docker
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 12582912 421083 12161829 4% /

For the deleted-open-file scenario, restart the offending container to release its file descriptors:
infrarunbook-admin@sw-infrarunbook-01:~$ docker restart api-service

If you're persistently hitting inode limits, consider moving /var/lib/docker to a dedicated filesystem formatted with a higher inode density. With mkfs.ext4, the -i flag controls bytes-per-inode — a smaller value like -i 4096 gives you more inodes for the same block count, at the cost of slightly less usable space.
Root Cause 6: Exited Containers Accumulating
Every stopped container retains its writable layer on disk until it's explicitly removed. This writable layer exists even if the container wrote nothing at all during its lifetime — it's allocated at container creation and holds any filesystem changes the container made. On a host running many short-lived jobs, cron containers, one-off migrations, or test runners, these writable layers pile up fast and silently.
How to Identify
infrarunbook-admin@sw-infrarunbook-01:~$ docker ps -a --filter status=exited
CONTAINER ID IMAGE COMMAND CREATED STATUS
b1c2d3e4f5a6 registry.solvethenetwork.com/app/migrate:v12 "./migrate up" 2 days ago Exited (0) 2 days ago
c2d3e4f5a6b7 registry.solvethenetwork.com/app/migrate:v11 "./migrate up" 4 days ago Exited (0) 4 days ago
...
infrarunbook-admin@sw-infrarunbook-01:~$ docker ps -a --filter status=exited | wc -l
89

How to Fix
infrarunbook-admin@sw-infrarunbook-01:~$ docker container prune
WARNING! This will remove all stopped containers.
Are you sure you want to continue? [y/N] y
Deleted Containers:
b1c2d3e4f5a6...
c2d3e4f5a6b7...
...
Total reclaimed space: 1.2GB

For any container you run as a one-off job, pass --rm to docker run so the container is removed automatically when it exits. This single habit eliminates the accumulation entirely:
infrarunbook-admin@sw-infrarunbook-01:~$ docker run --rm registry.solvethenetwork.com/app/migrate:v12 ./migrate up

Prevention
Set daemon-level log rotation before you run a single container in production. Put this in /etc/docker/daemon.json on every Docker host during provisioning. A 100 MB limit with five rotated files gives you 500 MB of log retention per container — more than enough for debugging, not enough to kill a disk:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}

Schedule a nightly prune job. A systemd timer or cron entry that runs docker system prune -f at 3 AM keeps things from accumulating. The -f skips the confirmation prompt for automated execution. If your workload allows removing unused images too, run the more aggressive variant:
# /etc/cron.d/docker-cleanup
0 3 * * * root docker system prune -af --volumes >> /var/log/docker-prune.log 2>&1

Always use --rm for one-off containers. Any container running a job and exiting — migrations, backups, test runners, data imports — should be launched with docker run --rm. Make it a team standard. It costs nothing and eliminates an entire category of disk accumulation with zero operational overhead.
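If your hosts are systemd-managed and you'd rather avoid cron, the same nightly prune can be expressed as a timer pair. A sketch; the unit names and paths are assumptions:

```ini
# /etc/systemd/system/docker-prune.service
[Unit]
Description=Prune unused Docker objects

[Service]
Type=oneshot
ExecStart=/usr/bin/docker system prune -af --volumes

# /etc/systemd/system/docker-prune.timer
[Unit]
Description=Nightly Docker prune

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with systemctl daemon-reload and systemctl enable --now docker-prune.timer. Persistent=true runs a missed prune at next boot if the host was down at 3 AM.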
Build cleanup into your CI/CD pipeline. After pushing a new image and deploying it, have your pipeline explicitly prune old images on the build host. Don't rely on manual cleanup. A post-deploy docker image prune -af step takes a few seconds and keeps the build host lean indefinitely.
Use docker-compose down -v for ephemeral stacks. When tearing down compose stacks that you don't intend to reuse — test environments, staging stacks spun up for a review — always include -v. Build it into your teardown scripts from day one and it's never a problem.
Put Docker on its own partition. If you're provisioning new hosts, move /var/lib/docker to a dedicated LVM volume or block device. When Docker fills its disk, the host OS, SSH daemon, and system logs are all unaffected. Recovery becomes a Docker problem, not a full host recovery problem. You can also resize that volume independently without touching the root filesystem — a much lower-stakes operation at 3 AM.
Monitor with alert thresholds. Set alerts at 70% and 85% disk utilization on the Docker host's filesystem. 70% is your early warning — plenty of time to schedule a prune during business hours. 85% is your page-me-now threshold. Don't wait for 100%, because by then you're already in recovery mode.
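Until those thresholds are wired into your monitoring system, they can be checked with a few lines of shell, covering both the block-usage and inode-usage views from Root Cause 5. A sketch; check_fs is a made-up name, it relies on GNU df's --output flag, and the alert hook in the usage line is hypothetical:

```shell
# Check both block usage (df -h's view) and inode usage (df -i's view)
# of a mount point against a single threshold percentage.
check_fs() {
    mount_point="$1"
    threshold="$2"
    for col in pcent ipcent; do
        # --output prints a header line plus a value like " 42%"; keep digits only
        used=$(df --output="$col" "$mount_point" | tail -n 1 | tr -dc '0-9')
        if [ "${used:-0}" -ge "$threshold" ]; then
            echo "ALERT: $mount_point $col at ${used}% (threshold ${threshold}%)"
            return 1
        fi
    done
    echo "OK: $mount_point under ${threshold}% for blocks and inodes"
}

# Usage:
#   check_fs /var/lib/docker 85 || page-the-oncall   # hypothetical alert hook
```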
Growing Docker disk usage is a solved problem. It just requires intentional habits and a bit of automation. The daemon won't clean up after itself, so the operator has to. Build the prune job, set the log rotation, use --rm, and this stops being an incident.
