Symptoms
You're watching a server slowly die. Available memory shrinks over hours or days. Processes start getting killed. The OOM killer fires at 3 AM. Users complain about slow response times or outright service failures. The server feels sluggish even though CPU isn't particularly busy. You run free -h and see barely any free memory left, yet nothing obvious jumps out in top or htop.
A typical memory leak scenario looks something like this when you first start investigating:
$ free -h
total used free shared buff/cache available
Mem: 15Gi 14Gi 128Mi 256Mi 800Mi 512Mi
Swap: 8Gi 7Gi 1Gi
That's a 16 GB server with 14 GB consumed, 7 GB of swap already in use, and only 512 MB actually available. Something is eating memory and not letting go. The worst cases are the ones where nothing obvious appears in top — the memory is just gone. Let me walk through how I approach this methodically, covering the most common root causes I've encountered in production.
Root Cause 1: Process Not Freeing Memory
This is the canonical memory leak — a long-running process allocates heap memory through malloc(), calloc(), or equivalent higher-level abstractions, and simply never calls free(). In languages without garbage collection, like C and C++, this is a programming bug. In garbage-collected languages like Java, Python, or Go, it's subtler: objects are still reachable through some reference chain so the GC won't collect them, but the application logic never actually uses them again.
I've seen this happen in daemon processes more than anywhere else. A service starts clean, handles requests for a few hours, then starts consuming more and more RSS. Restarts fix it temporarily — that's the first tell-tale sign you're dealing with a process-level leak rather than a kernel issue.
How to identify it: Start with ps sorted by memory to find the heaviest consumers, then watch RSS over time.
$ ps aux --sort=-%mem | head -20
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1842 2.1 18.3 3145728 2971328 ? Ssl Mar10 124:32 /usr/bin/python3 /opt/app/worker.py
root 2103 0.8 9.1 1572864 1478652 ? Ssl Mar10 48:17 /usr/bin/java -jar /opt/app/service.jar
That Python worker at 18.3% memory on a 16 GB system is consuming nearly 3 GB of RSS. Watch it grow in real time:
$ watch -n 5 'ps -p 1842 -o pid,rss,vsz,comm'
PID RSS VSZ COMMAND
1842 3045760 3145728 python3
If RSS climbs every few minutes, you've got a leak. For deeper introspection, /proc/&lt;pid&gt;/smaps breaks down memory by region. The heap entry is particularly telling:
$ grep -A 6 "\[heap\]" /proc/1842/smaps
[heap]
Size: 2621440 kB
Rss: 2097152 kB
Pss: 2097152 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 2097152 kB
A 2 GB private dirty heap that keeps growing is a strong indicator of a heap leak. For C/C++ processes, Valgrind with Massif gives you a detailed allocation profile. For Python, tracemalloc pinpoints which objects are accumulating. For Java, a heap dump analyzed with Eclipse MAT or VisualVM will show you the object retention graph.
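Before reaching for heavyweight profilers, a small script can quantify the growth rate from ps alone. This is a generic sketch — the helper name is mine, and the pid, interval, and sample count are parameters you'd set for your own process:

```shell
# Sample a process's RSS over a window and report the delta in kB.
# Arguments: pid, interval in seconds (default 5), sample count (default 6).
rss_growth() {
    pid=$1; interval=${2:-5}; samples=${3:-6}
    first=$(ps -o rss= -p "$pid" | tr -d ' ')
    [ -n "$first" ] || { echo "no such pid: $pid" >&2; return 1; }
    last=$first
    i=1
    while [ "$i" -lt "$samples" ]; do
        sleep "$interval"
        cur=$(ps -o rss= -p "$pid" | tr -d ' ')
        [ -n "$cur" ] || break   # process exited mid-run
        last=$cur
        i=$((i + 1))
    done
    echo "RSS: ${first} kB -> ${last} kB (delta $((last - first)) kB)"
}

# Example: sample our own shell twice, one second apart
rss_growth $$ 1 2
```

A steadily positive delta across runs is the same signal as the watch loop above, just easier to log and compare over days.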
How to fix it: The permanent fix is in the application code — patch the leak. The short-term mitigation, while developers work on it, is to configure systemd to restart the service periodically before it grows too large:
# /etc/systemd/system/app-worker.service
[Service]
RuntimeMaxSec=86400
Restart=always
RestartSec=5
This restarts the process after 24 hours regardless of state. It's a band-aid, not a cure, but it keeps the server alive while the real fix ships.
Root Cause 2: Kernel Slab Cache Growing
The Linux kernel uses a slab allocator to efficiently manage memory for frequently used internal objects — dentries (directory cache entries), inodes, network socket buffers, and hundreds of other structures. Under normal conditions, the slab cache grows and shrinks dynamically. But in some situations it grows without bound and the kernel doesn't reclaim it aggressively enough.
In my experience, this most commonly appears with workloads that create and destroy huge numbers of filesystem objects — log processors, recursive directory scanners, backup agents running repeated traversals. NFS and overlayfs configurations can also trigger unbounded dentry cache growth. The tricky part: this memory doesn't belong to any userspace process, so top and ps won't help you here at all.
How to identify it: Check /proc/meminfo for the Slab lines:
$ cat /proc/meminfo
MemTotal: 16384000 kB
MemFree: 131072 kB
MemAvailable: 524288 kB
Buffers: 65536 kB
Cached: 819200 kB
SwapCached: 204800 kB
Slab: 8388608 kB
SReclaimable: 204800 kB
SUnreclaim: 8183808 kB
That Slab line at 8 GB is the problem. Notice SReclaimable is only 200 MB — the kernel can reclaim only a fraction of what it's holding. SUnreclaim at 8 GB is memory the kernel has decided it cannot give back without breaking things. Drill into which specific caches are the culprits:
$ slabtop -o | head -20
Active / Total Objects (% used) : 42165234 / 43008512 (98.0%)
Active / Total Slabs (% used) : 1048576 / 1048576 (100.0%)
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
35840000 35612800 99% 0.19K 1706667 21 6826668K dentry
4096000 4063232 99% 0.59K 204800 20 1638400K inode_cache
There it is — 35 million dentry objects consuming nearly 7 GB. This is a dentry cache explosion, common with any process that stats or opens huge numbers of unique file paths repeatedly. The kernel caches the resolved path components for performance, but under these workloads it never gets a chance to evict them before more arrive.
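To put numbers on this quickly, a short awk sketch can express slab usage as a share of total RAM. The helper name is mine; it takes a meminfo-format file as an argument so you can also run it against saved snapshots:

```shell
# Express slab usage as a share of total RAM, from a meminfo-format file.
# Defaults to the live /proc/meminfo; pass a saved snapshot to compare.
slab_share() {
    awk '
        /^MemTotal:/   { total = $2 }
        /^Slab:/       { slab = $2 }
        /^SUnreclaim:/ { unreclaim = $2 }
        END {
            if (total == 0) exit 1
            printf "Slab %d kB = %.1f%% of RAM (unreclaimable: %d kB = %.1f%%)\n",
                   slab, 100 * slab / total, unreclaim, 100 * unreclaim / total
        }
    ' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then slab_share; fi
```

On a healthy box, slab typically sits in the low single-digit percent range; anything tens of percent with SUnreclaim dominating is worth drilling into with slabtop.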
How to fix it: Trigger a manual slab reclaim immediately:
$ echo 2 > /proc/sys/vm/drop_caches
This drops dentries and inode caches without affecting the page cache. Use echo 3 to also drop the page cache if you need more aggressive reclaim — but be aware this will cause a temporary spike in disk I/O as cached data is re-read. For a longer-term fix, tune how aggressively the kernel reclaims slab memory relative to the page cache:
$ sysctl -w vm.vfs_cache_pressure=200
The default is 100. At 200, the kernel reclaims dentry and inode caches twice as aggressively. Make it permanent in /etc/sysctl.d/99-memory.conf and reload with sysctl --system.
Root Cause 3: Memory Fragmentation
Memory fragmentation is one of those issues that looks like a memory leak but technically isn't. The system shows plenty of memory as "used", but processes start failing to allocate large contiguous blocks. You'll see allocation failures, OOM kills for processes that seem relatively small, and kernel log messages about order-N allocation failures.
It happens because the kernel allocates and frees memory in pages (4 KB), but certain operations — huge pages, DMA transfers, some network operations with large packet buffers — need physically contiguous multi-page blocks. After hours or days of mixed-size allocations and frees, physical memory becomes swiss cheese: plenty of free pages scattered around, but none adjacent enough to satisfy a large contiguous request.
How to identify it: Check the buddy allocator's free list via
/proc/buddyinfo:
$ cat /proc/buddyinfo
Node 0, zone DMA 1 0 0 0 0 0 0 0 0 1 3
Node 0, zone DMA32 840 628 312 189 94 42 18 7 2 0 0
Node 0, zone Normal 1024 512 256 128 32 4 1 0 0 0 0
Each column represents a power-of-two block size from order-0 (4 KB) through order-10 (4 MB). The Normal zone shows zero free blocks at orders 7, 8, 9, and 10. That means the kernel can't satisfy any physically contiguous allocation larger than order-6 (256 KB) from the Normal zone. The kernel will log failures that look like this:
Apr 10 03:14:22 sw-infrarunbook-01 kernel: kworker/u8:2: page allocation failure: order:4, mode:0x40c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
Apr 10 03:14:22 sw-infrarunbook-01 kernel: Mem-Info:
Apr 10 03:14:22 sw-infrarunbook-01 kernel: active_anon:3145728 inactive_anon:524288 isolated_anon:0
Apr 10 03:14:22 sw-infrarunbook-01 kernel: Node 0 Normal free:204800kB min:65536kB low:81920kB high:98304kB
An order:4 failure means the kernel couldn't find 16 contiguous 4 KB pages — just 64 KB. On a system with hundreds of MB of free memory, that's fragmentation at work.
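Rather than counting buddyinfo columns by hand, a small awk sketch can report the largest free order per zone. The helper name is mine; it reads buddyinfo-format input and defaults to the live file:

```shell
# Report the largest contiguous free block order per zone from
# buddyinfo-format input (defaults to the live /proc/buddyinfo).
# Free-block counts start at field 5, which is order 0 (4 kB pages).
max_free_order() {
    awk '
        /zone/ {
            zone = $4
            max = -1
            for (i = 5; i <= NF; i++)
                if ($i > 0) max = i - 5
            if (max < 0) {
                printf "%-8s nothing free\n", zone
            } else {
                kb = 4
                for (j = 0; j < max; j++) kb *= 2
                printf "%-8s largest free block: order-%d (%d kB)\n", zone, max, kb
            }
        }
    ' "${1:-/proc/buddyinfo}"
}

if [ -r /proc/buddyinfo ]; then max_free_order; fi
```

If the Normal zone tops out at order-6 or below while dmesg shows order-4+ allocation failures, fragmentation is confirmed and compaction is the next step.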
How to fix it: Trigger a one-shot memory compaction, which physically moves pages to defragment them:
$ echo 1 > /proc/sys/vm/compact_memory
For ongoing prevention, enable proactive background compaction:
$ sysctl -w vm.compaction_proactiveness=20
The value ranges from 0 (disabled) to 100 (aggressive). Starting at 20 avoids wasting CPU while keeping fragmentation manageable for workloads that occasionally need large allocations. If your workload uses Transparent Huge Pages, switch from always to madvise mode to prevent the kernel from over-eagerly promoting pages that won't benefit from it — this reduces the fragmentation pressure significantly:
$ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
Root Cause 4: OOM Killer Not Triggering
Here's a scenario I've encountered more than once: the system is completely out of memory, everything is grinding to a halt, processes hang trying to allocate, new spawns fail with ENOMEM — but the OOM killer never fires. The server becomes unresponsive rather than recovering.
The root cause almost always traces back to overcommit settings or OOM score protection. The Linux default (vm.overcommit_memory=0) uses heuristics to allow some overcommit. Mode 1 (always overcommit) allows unlimited virtual memory allocation regardless of physical availability — useful for Redis and some fork-heavy applications, but it means the kernel grants allocations even when there's no realistic chance of backing them. When physical memory is genuinely exhausted under mode 1, recovery is messier.
The other common cause: every critical process has been given an oom_score_adj of -1000 (the "never kill me" value). If every significant process is protected, the OOM killer has nothing safe to target and the system deadlocks.
How to identify it:
$ cat /proc/sys/vm/overcommit_memory
1
$ cat /proc/sys/vm/overcommit_ratio
50
Mode 1 confirmed. Now check how many processes are protecting themselves from the OOM killer:
$ for pid in /proc/[0-9]*; do
score=$(cat $pid/oom_score_adj 2>/dev/null)
comm=$(cat $pid/comm 2>/dev/null)
if [ "$score" = "-1000" ]; then
echo "PID $(basename $pid): $comm (oom_score_adj=$score)"
fi
done
PID 1: systemd (oom_score_adj=-1000)
PID 842: sshd (oom_score_adj=-1000)
PID 1842: worker.py (oom_score_adj=-1000)
PID 2103: java (oom_score_adj=-1000)
PID 3401: nginx (oom_score_adj=-1000)
Everything critical and the leaking worker are all at -1000. The OOM killer is completely neutered. Check if any OOM events did fire recently despite this:
$ dmesg | grep -iE "oom|killed process|out of memory"
[1234567.123] Out of memory: Kill process 4521 (python3) score 892 or sacrifice child
[1234567.456] Killed process 4521 (python3) total-vm:3145728kB, anon-rss:2097152kB, file-rss:0kB
How to fix it: Switch to a rational overcommit mode. Mode 2 with an 80% ratio is a solid default for most servers:
$ sysctl -w vm.overcommit_memory=2
$ sysctl -w vm.overcommit_ratio=80
This caps total committed memory at 80% of RAM plus all of swap. Allocations that would exceed this fail immediately at malloc() time rather than succeeding and then triggering OOM chaos later. For the score protection problem, audit which processes genuinely need it. The leaking application worker definitely shouldn't be protected:
$ echo 0 > /proc/1842/oom_score_adj
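One caveat with the echo above: /proc/&lt;pid&gt;/oom_score_adj only affects the running process, so the value silently reverts the next time the service restarts. If the -1000 came from a unit file, fix it at the source — a drop-in sketch (unit name reused from the earlier example):

```ini
# /etc/systemd/system/app-worker.service.d/oom.conf
[Service]
# 0 is the kernel default; this overrides an inherited OOMScoreAdjust=-1000
OOMScoreAdjust=0
```

Run systemctl daemon-reload and restart the unit for the drop-in to take effect.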
Also consider enabling this sysctl, which tells the OOM killer to target the task that triggered the allocation failure rather than hunting for the highest-score process:
$ sysctl -w vm.oom_kill_allocating_task=1
This often results in faster recovery because you kill the actual offender directly rather than a bystander that happened to accumulate a high OOM score.
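Once mode 2 is active you can watch the headroom directly: /proc/meminfo exposes CommitLimit (the cap) and Committed_AS (the current commit charge). A quick sketch — the helper name is mine, and a file argument lets you run it against a saved snapshot:

```shell
# Show commit-charge headroom: Committed_AS vs CommitLimit from a
# meminfo-format file (defaults to the live /proc/meminfo).
commit_headroom() {
    awk '
        /^CommitLimit:/  { limit = $2 }
        /^Committed_AS:/ { committed = $2 }
        END {
            if (limit == 0) exit 1
            printf "committed %d kB of %d kB limit (%.1f%%)\n",
                   committed, limit, 100 * committed / limit
        }
    ' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then commit_headroom; fi
```

When this percentage approaches 100, new allocations start failing with ENOMEM — which is exactly the early, recoverable failure mode mode 2 is meant to give you.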
Root Cause 5: Swap Misconfigured
Swap is often treated as an afterthought — either disabled entirely because "we have enough RAM", or left at its default configuration without considering swappiness. Both extremes cause problems during memory leak scenarios.
Without swap, the moment physical memory is exhausted the OOM killer fires immediately. That might be fine, but often you'd prefer a brief buffer period to detect the leak and respond. On the other side, if vm.swappiness is set too high — say 60 or above on a server workload — the kernel starts paging out warm anonymous memory to disk at the first sign of memory pressure, causing severe latency degradation well before you'd even notice a leak. You end up chasing a performance problem that's actually a memory problem masked by premature swapping.
I've also seen cases where swap is configured on an LVM thin-provision that auto-expands, silently masking the underlying memory exhaustion until the storage layer also runs out — at which point both memory and disk fail simultaneously, which is a much worse situation to recover from.
How to identify it:
$ swapon --show
NAME TYPE SIZE USED PRIO
/dev/sda3 partition 8G 7.8G -2
$ cat /proc/sys/vm/swappiness
60
7.8 GB of an 8 GB swap in use means you're nearly out of overflow capacity. Identify which processes have been swapped out significantly:
$ for file in /proc/*/status; do
pid=$(echo $file | cut -d/ -f3)
comm=$(cat /proc/$pid/comm 2>/dev/null)
vmswap=$(grep VmSwap $file 2>/dev/null | awk '{print $2}')
if [ -n "$vmswap" ] && [ "$vmswap" -gt "102400" ] 2>/dev/null; then
echo "$comm (PID $pid): ${vmswap} kB in swap"
fi
done
worker.py (PID 1842): 2097152 kB in swap
java (PID 2103): 1048576 kB in swap
The leaking worker has pushed 2 GB into swap, which is why the system hasn't OOM-killed it yet — but performance is terrible because every page fault for that process requires a disk read. Confirm active swap I/O with vmstat:
$ vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 1 7864320 65536 32768 409600 128 512 640 2048 1024 2048 12 8 72 8 0
3 2 7921664 49152 32768 401408 256 768 512 3072 1248 2312 15 10 63 12 0
The so column (swap out) climbing to 768 means 768 KB/s is actively being written to swap. That's the leak in slow motion — you can watch memory drain in real time.
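vmstat's so column is an interval average; if you want the rate straight from the kernel's own counters, /proc/vmstat exposes pswpout (pages swapped out since boot). A rough sketch — the helper name is mine:

```shell
# Swap-out rate over a short window, from the cumulative pswpout counter
# (pages swapped out since boot) in /proc/vmstat.
swap_out_rate() {
    interval=${1:-2}
    page_kb=$(( $(getconf PAGESIZE) / 1024 ))
    a=$(awk '/^pswpout /{print $2}' /proc/vmstat); a=${a:-0}
    sleep "$interval"
    b=$(awk '/^pswpout /{print $2}' /proc/vmstat); b=${b:-0}
    echo "swap-out: $(( (b - a) * page_kb / interval )) kB/s"
}

swap_out_rate 2
```

A sustained nonzero rate here, while free memory keeps shrinking, is the same slow-motion drain the vmstat output shows.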
How to fix it: Lower swappiness to reduce how eagerly the kernel reaches for swap:
$ sysctl -w vm.swappiness=10
A value of 10 tells the kernel to strongly prefer keeping anonymous pages in RAM and only start swapping under serious pressure. For latency-sensitive servers where you'd rather take an OOM kill than suffer disk I/O, use swappiness=1 rather than 0 — setting it to 0 can cause issues on some kernel versions where it interacts poorly with memory-mapped file handling. If swap is nearly exhausted and you need emergency breathing room without a reboot, add a temporary swap file:
$ fallocate -l 4G /swapfile-emergency
$ chmod 600 /swapfile-emergency
$ mkswap /swapfile-emergency
$ swapon /swapfile-emergency
This buys you time to identify and kill the leaking process. Don't leave emergency swap files lying around permanently — they consume disk space and become a crutch rather than a fix.
Root Cause 6: tmpfs and Shared Memory Consuming RAM
tmpfs mounts use physical RAM as their backing store. If something is writing large amounts of data to a tmpfs mount — application caches, IPC buffers, temp files from a runaway job — it consumes real RAM and won't show up in per-process RSS accounting. This is a surprisingly common source of mysterious memory consumption because top and ps give you no indication of what's happening.
How to identify it:
$ df -h | grep tmpfs
tmpfs 7.8G 7.5G 300M 97% /run/shm
tmpfs 1.6G 1.2G 400M 75% /tmp
tmpfs 2.0G 1.8G 200M 90% /dev/shm
$ du -sh /run/shm/*
3.2G /run/shm/app_cache_segment
4.1G /run/shm/ipc_queue_buffer
Two shared memory segments consuming 7+ GB in /run/shm. Applications using POSIX shared memory (shm_open()) create their segments as files here; System V shared memory (shmget()) consumes RAM the same way but is tracked separately through ipcs rather than as visible files. Either way, if the owner doesn't clean up on exit, the memory persists until explicitly released or the system reboots. Identify orphaned System V segments:
$ ipcs -m
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 32768 infrarunbook-admin 600 4294967296 0 dest
0x1a2b3c4d 32769 infrarunbook-admin 600 1073741824 1
The segment with nattch=0 and dest status is marked for deletion but hasn't been cleaned up. If the owning process is gone, remove it manually:
$ ipcrm -m 32768
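If several segments have leaked, a small helper can pull the orphaned shmids out of ipcs -m output mechanically. This is a sketch (the function name is mine) that only reports — pipe the ids to ipcrm yourself once you've verified the owners really are gone:

```shell
# Print shmids of System V shared memory segments with no attached
# processes, given `ipcs -m` output on stdin. Data lines start with a
# hex key in field 1; field 6 is nattch.
orphan_shmids() {
    awk '$1 ~ /^0x/ && $6 == 0 { print $2 }'
}

# Report orphans on the live system (no-op if none exist)
ipcs -m 2>/dev/null | orphan_shmids
```

For example, `ipcs -m | orphan_shmids | xargs -r -n1 ipcrm -m` would reap them all in one pass — only do that once you're sure nothing intends to re-attach.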
How to fix it: Application-level fix — ensure cleanup on exit using proper signal handlers and atexit routines. At the OS level, set size limits on your tmpfs mounts so a single runaway process can't exhaust all memory. In /etc/fstab:
tmpfs /dev/shm tmpfs defaults,size=2g,noexec,nosuid 0 0
The size=2g parameter caps that mount at 2 GB regardless of what applications write to it. They'll get ENOSPC rather than silently consuming all available RAM.
Root Cause 7: Memory-Mapped Files Not Released
Applications using mmap() map files directly into virtual address space. This is efficient — reads go through the page cache, writes are handled by the kernel's writeback mechanism — but the memory accounting is split. Mapped file pages count against the page cache, not the process's anonymous RSS. When many processes map the same large files, or a process maps a large file and never calls munmap(), the RAM consumption can be enormous while remaining largely invisible in standard tools.
How to identify it: Use smaps_rollup for a fast summary of a process's memory composition:
$ cat /proc/1842/smaps_rollup
Rss: 2621440 kB
Pss: 2097152 kB
Shared_Clean: 1048576 kB
Shared_Dirty: 0 kB
Private_Clean: 524288 kB
Private_Dirty: 1048576 kB
Anonymous: 1572864 kB
Shared_Clean at 1 GB indicates file-backed mmapped pages. To see which specific files are mapped, and how many times:
$ awk '{print $6}' /proc/1842/maps | grep '/' | sort | uniq -c | sort -rn | head -10
24 /opt/app/data/large_dataset.bin
18 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
8 /opt/app/data/index.db
If large_dataset.bin is several gigabytes and mapped 24 times across multiple worker processes, that's real memory pressure. The fix is application-level: call munmap() after use, or redesign to read-and-release rather than keeping entire large files mapped throughout the process lifetime.
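To turn those mapping counts into actual sizes, you can sum the address ranges in /proc/&lt;pid&gt;/maps per backing file. A sketch in plain shell (the helper name is mine; address ranges are hex start-end pairs in field 1, the path is field 6):

```shell
# Total mapped kB per backing file from a /proc/<pid>/maps-format file.
# Each line starts with a hex "start-end" range; field 6 is the path
# (empty for anonymous mappings, which we skip).
mapped_file_sizes() {
    while read -r range perms offset dev inode path rest; do
        case $path in /*) ;; *) continue ;; esac
        start=${range%-*}; end=${range#*-}
        echo "$(( (0x$end - 0x$start) / 1024 )) $path"
    done < "${1:?usage: mapped_file_sizes <maps-file>}" |
    awk '{ kb[$2] += $1 } END { for (f in kb) printf "%10d kB  %s\n", kb[f], f }' |
    sort -rn
}

# Example against our own process
if [ -r /proc/self/maps ]; then mapped_file_sizes /proc/self/maps | head -5; fi
```

Note this sums virtual address space per mapping, not resident pages — for the resident figure you'd still go to smaps — but it immediately shows which files dominate the address space.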
Prevention
Memory leak investigation shouldn't be purely reactive. Several practices I put in place on production Linux systems dramatically reduce both the frequency and severity of incidents.
Alert on trends, not just thresholds. A static alert at "90% memory used" fires after the damage is done. Set up trend-based alerting: if available memory has dropped by more than 20% over the last hour, something is likely leaking. In Prometheus, predict_linear(node_memory_MemAvailable_bytes[2h], 3600) projects where available memory will be one hour from now based on the last two hours' trend — a forward-looking signal rather than a reactive threshold breach.
Baseline your slab cache. Know what normal looks like. After a fresh restart under representative load, capture:
$ slabtop -o > /root/slab_baseline_$(date +%Y%m%d).txt
When anomalies appear, diff against the baseline. Unexplained growth in specific caches points directly at the subsystem or workload driving the issue, cutting investigation time significantly.
Use systemd memory limits for every service. Even if your application doesn't leak today, a cgroup memory ceiling prevents one runaway process from taking down the whole server. In every service unit file:
# /etc/systemd/system/app-worker.service
[Service]
MemoryMax=4G
MemorySwapMax=1G
OOMPolicy=kill
MemoryMax hard-limits the cgroup. OOMPolicy=kill ensures the OOM killer targets this service's processes when the cgroup limit is hit rather than trying to find victims elsewhere on the system. This is far more surgical than letting a leak exhaust all available memory before anything gets killed.
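With cgroup v2 (the default on current systemd distributions), you can check a unit's live usage against its MemoryMax straight from the cgroup filesystem. A sketch — the helper name is mine, the unit name reuses the earlier example, and CGROUP_ROOT is an override hook I've added so the sketch can be exercised against a fake tree:

```shell
# Check a unit's cgroup v2 memory usage against its configured limit.
# On a real host CGROUP_ROOT is /sys/fs/cgroup/system.slice.
unit_mem_usage() {
    cg="${CGROUP_ROOT:-/sys/fs/cgroup/system.slice}/$1"
    cur=$(cat "$cg/memory.current" 2>/dev/null) || { echo "$1: no cgroup found" >&2; return 1; }
    max=$(cat "$cg/memory.max" 2>/dev/null) || return 1
    if [ "$max" = max ]; then
        echo "$1: $((cur / 1048576)) MiB used, no MemoryMax set"
    else
        echo "$1: $((cur / 1048576)) MiB of $((max / 1048576)) MiB limit"
    fi
}

unit_mem_usage app-worker.service 2>/dev/null || true
```

The same numbers are available via systemctl show -p MemoryCurrent; reading the cgroup files directly is just handy inside monitoring scripts.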
Consolidate your memory tuning into a single persistent sysctl file so the configuration survives reboots and is easy to audit:
# /etc/sysctl.d/99-memory.conf
vm.swappiness = 10
vm.vfs_cache_pressure = 200
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.compaction_proactiveness = 20
vm.oom_kill_allocating_task = 1
Catch leaks in development, not production. If you own the application code, integrate memory leak detection into CI. For C/C++ services, Valgrind or AddressSanitizer with leak detection enabled catches most leaks before they ship. For Python services, run a test suite under tracemalloc and assert that memory growth after N requests stays within bounds. The cost of catching a leak in CI is minutes. The cost of catching it during a 3 AM production incident is orders of magnitude higher.
Document your swap strategy explicitly. Make a deliberate decision: do you want swap as an emergency pressure valve (low swappiness, 1–2x RAM, tolerant of some I/O degradation) or do you want no swap (prefer fast OOM kills over slow degradation)? Neither is universally right — it depends on your workload and your SLA. Write that decision down in your runbook so the next engineer doesn't accidentally reconfigure it without understanding why it was set that way.
Memory leak investigation on Linux rewards methodical thinking over intuition. Start with the tools that give you the big picture — free, vmstat, /proc/meminfo — establish whether you're dealing with userspace memory or kernel memory, then drill into per-process smaps or per-slab accounting as appropriate. Most leaks aren't mysterious once you know exactly where to look.
