Symptoms
You're watching a server slowly die. Available memory shrinks over hours or days. Processes start getting killed. The OOM killer fires at 3 AM. Users complain about slow response times or outright service failures. The server feels sluggish even though CPU isn't particularly busy. You run free -h and see barely any free memory left, yet nothing obvious jumps out in top or htop.
A typical memory leak scenario looks something like this when you first start investigating:
$ free -h
total used free shared buff/cache available
Mem: 15Gi 14Gi 128Mi 256Mi 800Mi 512Mi
Swap: 8Gi 7Gi 1Gi
That's a 16 GB server with 14 GB consumed, 7 GB of swap already in use, and only 512 MB actually available. Something is eating memory and not letting go. The worst cases are the ones where nothing obvious appears in top — the memory is just gone. Let me walk through how I approach this methodically, covering the most common root causes I've encountered in production.
Root Cause 1: Process Not Freeing Memory
This is the canonical memory leak — a long-running process allocates heap memory through malloc(), calloc(), or equivalent higher-level abstractions, and simply never calls free(). In languages without garbage collection, like C and C++, this is a programming bug. In garbage-collected languages like Java, Python, or Go, it's subtler: objects are still reachable through some reference chain so the GC won't collect them, but the application logic never actually uses them again.
I've seen this happen in daemon processes more than anywhere else. A service starts clean, handles requests for a few hours, then starts consuming more and more RSS. Restarts fix it temporarily — that's the first tell-tale sign you're dealing with a process-level leak rather than a kernel issue.
How to identify it: Start with ps sorted by memory to find the heaviest consumers, then watch RSS over time.
$ ps aux --sort=-%mem | head -20
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1842 2.1 18.3 3145728 2971328 ? Ssl Mar10 124:32 /usr/bin/python3 /opt/app/worker.py
root 2103 0.8 9.1 1572864 1478652 ? Ssl Mar10 48:17 /usr/bin/java -jar /opt/app/service.jar
That Python worker at 18.3% memory on a 16 GB system is consuming nearly 3 GB of RSS. Watch it grow in real time:
$ watch -n 5 'ps -p 1842 -o pid,rss,vsz,comm'
PID RSS VSZ COMMAND
1842 3045760 3145728 python3
If RSS climbs every few minutes, you've got a leak. For deeper introspection, /proc/&lt;pid&gt;/smaps breaks down memory by region. The heap entry is particularly telling:
$ grep -A 6 "\[heap\]" /proc/1842/smaps
[heap]
Size: 2621440 kB
Rss: 2097152 kB
Pss: 2097152 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 2097152 kB
A 2 GB private dirty heap that keeps growing is a strong indicator of a heap leak. For C/C++ processes, Valgrind with Massif gives you a detailed allocation profile. For Python, tracemalloc pinpoints which objects are accumulating. For Java, a heap dump analyzed with Eclipse MAT or VisualVM will show you the object retention graph.
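Before reaching for heavyweight profilers, a small script can quantify the growth rate from ps alone. This is a generic sketch — the helper name is mine, and the pid, interval, and sample count are parameters you'd set for your own process:

```shell
# Sample a process's RSS over a window and report the delta in kB.
# Arguments: pid, interval in seconds (default 5), sample count (default 6).
rss_growth() {
    pid=$1; interval=${2:-5}; samples=${3:-6}
    first=$(ps -o rss= -p "$pid" | tr -d ' ')
    [ -n "$first" ] || { echo "no such pid: $pid" >&2; return 1; }
    last=$first
    i=1
    while [ "$i" -lt "$samples" ]; do
        sleep "$interval"
        cur=$(ps -o rss= -p "$pid" | tr -d ' ')
        [ -n "$cur" ] || break   # process exited mid-run
        last=$cur
        i=$((i + 1))
    done
    echo "RSS: ${first} kB -> ${last} kB (delta $((last - first)) kB)"
}

# Example: sample our own shell twice, one second apart
rss_growth $$ 1 2
```

A steadily positive delta across runs is the same signal as the watch loop above, just easier to log and compare over days.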
How to fix it: The permanent fix is in the application code — patch the leak. The short-term mitigation, while developers work on it, is to configure systemd to restart the service periodically before it grows too large:
# /etc/systemd/system/app-worker.service
[Service]
RuntimeMaxSec=86400
Restart=always
RestartSec=5
This restarts the process after 24 hours regardless of state. It's a band-aid, not a cure, but it keeps the server alive while the real fix ships.
Root Cause 2: Kernel Slab Cache Growing
The Linux kernel uses a slab allocator to efficiently manage memory for frequently used internal objects — dentries (directory cache entries), inodes, network socket buffers, and hundreds of other structures. Under normal conditions, the slab cache grows and shrinks dynamically. But in some situations it grows without bound and the kernel doesn't reclaim it aggressively enough.
In my experience, this most commonly appears with workloads that create and destroy huge numbers of filesystem objects — log processors, recursive directory scanners, backup agents running repeated traversals. NFS and overlayfs configurations can also trigger unbounded dentry cache growth. The tricky part: this memory doesn't belong to any userspace process, so top and ps won't help you here at all.
How to identify it: Check /proc/meminfo for the Slab lines:
$ cat /proc/meminfo
MemTotal: 16384000 kB
MemFree: 131072 kB
MemAvailable: 524288 kB
Buffers: 65536 kB
Cached: 819200 kB
SwapCached: 204800 kB
Slab: 8388608 kB
SReclaimable: 204800 kB
SUnreclaim: 8183808 kB
That Slab line at 8 GB is the problem. Notice SReclaimable is only 200 MB — the kernel can reclaim only a fraction of what it's holding. SUnreclaim at 8 GB is memory the kernel has decided it cannot give back without breaking things. Drill into which specific caches are the culprits:
$ slabtop -o | head -20
Active / Total Objects (% used) : 42165234 / 43008512 (98.0%)
Active / Total Slabs (% used) : 1048576 / 1048576 (100.0%)
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
35840000 35612800 99% 0.19K 1706667 21 6826668K dentry
4096000 4063232 99% 0.59K 204800 20 1638400K inode_cache
There it is — 35 million dentry objects consuming nearly 7 GB. This is a dentry cache explosion, common with any process that stats or opens huge numbers of unique file paths repeatedly. The kernel caches the resolved path components for performance, but under these workloads it never gets a chance to evict them before more arrive.
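To put numbers on this quickly, a short awk sketch can express slab usage as a share of total RAM. The helper name is mine; it takes a meminfo-format file as an argument so you can also run it against saved snapshots:

```shell
# Express slab usage as a share of total RAM, from a meminfo-format file.
# Defaults to the live /proc/meminfo; pass a saved snapshot to compare.
slab_share() {
    awk '
        /^MemTotal:/   { total = $2 }
        /^Slab:/       { slab = $2 }
        /^SUnreclaim:/ { unreclaim = $2 }
        END {
            if (total == 0) exit 1
            printf "Slab %d kB = %.1f%% of RAM (unreclaimable: %d kB = %.1f%%)\n",
                   slab, 100 * slab / total, unreclaim, 100 * unreclaim / total
        }
    ' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then slab_share; fi
```

On a healthy box, slab typically sits in the low single-digit percent range; anything tens of percent with SUnreclaim dominating is worth drilling into with slabtop.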
How to fix it: Trigger a manual slab reclaim immediately:
$ echo 2 > /proc/sys/vm/drop_caches
This drops dentries and inode caches without affecting the page cache. Use echo 3 to also drop the page cache if you need more aggressive reclaim — but be aware this will cause a temporary spike in disk I/O as cached data is re-read. For a longer-term fix, tune how aggressively the kernel reclaims slab memory relative to the page cache:
$ sysctl -w vm.vfs_cache_pressure=200
The default is 100. At 200, the kernel reclaims dentry and inode caches twice as aggressively. Make it permanent in /etc/sysctl.d/99-memory.conf and reload with sysctl --system.
Root Cause 3: Memory Fragmentation
Memory fragmentation is one of those issues that looks like a memory leak but technically isn't. The system shows plenty of memory as "used", but processes start failing to allocate large contiguous blocks. You'll see allocation failures, OOM kills for processes that seem relatively small, and kernel log messages about order-N allocation failures.
It happens because the kernel allocates and frees memory in pages (4 KB), but certain operations — huge pages, DMA transfers, some network operations with large packet buffers — need physically contiguous multi-page blocks. After hours or days of mixed-size allocations and frees, physical memory becomes swiss cheese: plenty of free pages scattered around, but none adjacent enough to satisfy a large contiguous request.
How to identify it: Check the buddy allocator's free list via
/proc/buddyinfo:
$ cat /proc/buddyinfo
Node 0, zone DMA 1 0 0 0 0 0 0 0 0 1 3
Node 0, zone DMA32 840 628 312 189 94 42 18 7 2 0 0
Node 0, zone Normal 1024 512 256 128 32 4 1 0 0 0 0
Each column represents a power-of-two block size from order-0 (4 KB) through order-10 (4 MB). The Normal zone shows zero free blocks at orders 7, 8, 9, and 10. That means the kernel can't satisfy any physically contiguous allocation larger than order-6 (256 KB) from the Normal zone. The kernel will log failures that look like this:
Apr 10 03:14:22 sw-infrarunbook-01 kernel: kworker/u8:2: page allocation failure: order:4, mode:0x40c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
Apr 10 03:14:22 sw-infrarunbook-01 kernel: Mem-Info:
Apr 10 03:14:22 sw-infrarunbook-01 kernel: active_anon:3145728 inactive_anon:524288 isolated_anon:0
Apr 10 03:14:22 sw-infrarunbook-01 kernel: Node 0 Normal free:204800kB min:65536kB low:81920kB high:98304kB
An order:4 failure means the kernel couldn't find 16 contiguous 4 KB pages — just 64 KB. On a system with hundreds of MB of free memory, that's fragmentation at work.
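Rather than counting buddyinfo columns by hand, a small awk sketch can report the largest free order per zone. The helper name is mine; it reads buddyinfo-format input and defaults to the live file:

```shell
# Report the largest contiguous free block order per zone from
# buddyinfo-format input (defaults to the live /proc/buddyinfo).
# Free-block counts start at field 5, which is order 0 (4 kB pages).
max_free_order() {
    awk '
        /zone/ {
            zone = $4
            max = -1
            for (i = 5; i <= NF; i++)
                if ($i > 0) max = i - 5
            if (max < 0) {
                printf "%-8s nothing free\n", zone
            } else {
                kb = 4
                for (j = 0; j < max; j++) kb *= 2
                printf "%-8s largest free block: order-%d (%d kB)\n", zone, max, kb
            }
        }
    ' "${1:-/proc/buddyinfo}"
}

if [ -r /proc/buddyinfo ]; then max_free_order; fi
```

If the Normal zone tops out at order-6 or below while dmesg shows order-4+ allocation failures, fragmentation is confirmed and compaction is the next step.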
How to fix it: Trigger a one-shot memory compaction, which physically moves pages to defragment them:
$ echo 1 > /proc/sys/vm/compact_memory
For ongoing prevention, enable proactive background compaction:
$ sysctl -w vm.compaction_proactiveness=20
The value ranges from 0 (disabled) to 100 (aggressive). Starting at 20 avoids wasting CPU while keeping fragmentation manageable for workloads that occasionally need large allocations. If your workload uses Transparent Huge Pages, switch from always to madvise mode to prevent the kernel from over-eagerly promoting pages that won't benefit from it — this reduces the fragmentation pressure significantly:
$ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
Root Cause 4: OOM Killer Not Triggering
Here's a scenario I've encountered more than once: the system is completely out of memory, everything is grinding to a halt, processes hang trying to allocate, new spawns fail with ENOMEM — but the OOM killer never fires. The server becomes unresponsive rather than recovering.
The root cause almost always traces back to overcommit settings or OOM score protection. The Linux default (vm.overcommit_memory=0) uses heuristics to allow some overcommit. Mode 1 (always overcommit) allows unlimited virtual memory allocation regardless of physical availability — useful for Redis and some fork-heavy applications, but it means the kernel grants allocations even when there's no realistic chance of backing them. When physical memory is genuinely exhausted under mode 1, recovery is messier.
The other common cause: every critical process has been given an oom_score_adj of -1000 (the "never kill me" value). If every significant process is protected, the OOM killer has nothing safe to target and the system deadlocks.
How to identify it:
$ cat /proc/sys/vm/overcommit_memory
1
$ cat /proc/sys/vm/overcommit_ratio
50
Mode 1 confirmed. Now check how many processes are protecting themselves from the OOM killer:
$ for pid in /proc/[0-9]*; do
score=$(cat $pid/oom_score_adj 2>/dev/null)
comm=$(cat $pid/comm 2>/dev/null)
if [ "$score" = "-1000" ]; then
echo "PID $(basename $pid): $comm (oom_score_adj=$score)"
fi
done
PID 1: systemd (oom_score_adj=-1000)
PID 842: sshd (oom_score_adj=-1000)
PID 1842: worker.py (oom_score_adj=-1000)
PID 2103: java (oom_score_adj=-1000)
PID 3401: nginx (oom_score_adj=-1000)
Everything critical and the leaking worker are all at -1000. The OOM killer is completely neutered. Check if any OOM events did fire recently despite this:
$ dmesg | grep -iE "oom|killed process|out of memory"
[1234567.123] Out of memory: Kill process 4521 (python3) score 892 or sacrifice child
[1234567.456] Killed process 4521 (python3) total-vm:3145728kB, anon-rss:2097152kB, file-rss:0kB
How to fix it: Switch to a rational overcommit mode. Mode 2 with an 80% ratio is a solid default for most servers:
$ sysctl -w vm.overcommit_memory=2
$ sysctl -w vm.overcommit_ratio=80
This caps total committed memory at 80% of RAM plus all of swap. Allocations that would exceed this fail immediately at malloc() time rather than succeeding and then triggering OOM chaos later. For the score protection problem, audit which processes genuinely need it. The leaking application worker definitely shouldn't be protected:
$ echo 0 > /proc/1842/oom_score_adj
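One caveat with the echo above: /proc/&lt;pid&gt;/oom_score_adj only affects the running process, so the value silently reverts the next time the service restarts. If the -1000 came from a unit file, fix it at the source — a drop-in sketch (unit name reused from the earlier example):

```ini
# /etc/systemd/system/app-worker.service.d/oom.conf
[Service]
# 0 is the kernel default; this overrides an inherited OOMScoreAdjust=-1000
OOMScoreAdjust=0
```

Run systemctl daemon-reload and restart the unit for the drop-in to take effect.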
Also consider enabling this sysctl, which tells the OOM killer to target the task that triggered the allocation failure rather than hunting for the highest-score process:
$ sysctl -w vm.oom_kill_allocating_task=1
This often results in faster recovery because you kill the actual offender directly rather than a bystander that happened to accumulate a high OOM score.
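Once mode 2 is active you can watch the headroom directly: /proc/meminfo exposes CommitLimit (the cap) and Committed_AS (the current commit charge). A quick sketch — the helper name is mine, and a file argument lets you run it against a saved snapshot:

```shell
# Show commit-charge headroom: Committed_AS vs CommitLimit from a
# meminfo-format file (defaults to the live /proc/meminfo).
commit_headroom() {
    awk '
        /^CommitLimit:/  { limit = $2 }
        /^Committed_AS:/ { committed = $2 }
        END {
            if (limit == 0) exit 1
            printf "committed %d kB of %d kB limit (%.1f%%)\n",
                   committed, limit, 100 * committed / limit
        }
    ' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then commit_headroom; fi
```

When this percentage approaches 100, new allocations start failing with ENOMEM — which is exactly the early, recoverable failure mode mode 2 is meant to give you.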
Root Cause 5: Swap Misconfigured
Swap is often treated as an afterthought — either disabled entirely because "we have enough RAM", or left at its default configuration without considering swappiness. Both extremes cause problems during memory leak scenarios.
Without swap, the moment physical memory is exhausted the OOM killer fires immediately. That might be fine, but often you'd prefer a brief buffer period to detect the leak and respond. On the other side, if vm.swappiness is set too high — say 60 or above on a server workload — the kernel starts paging out warm anonymous memory to disk at the first sign of memory pressure, causing severe latency degradation well before you'd even notice a leak. You end up chasing a performance problem that's actually a memory problem masked by premature swapping.
I've also seen cases where swap is configured on an LVM thin-provision that auto-expands, silently masking the underlying memory exhaustion until the storage layer also runs out — at which point both memory and disk fail simultaneously, which is a much worse situation to recover from.
How to identify it:
$ swapon --show
NAME TYPE SIZE USED PRIO
/dev/sda3 partition 8G 7.8G -2
$ cat /proc/sys/vm/swappiness
60
7.8 GB of an 8 GB swap in use means you're nearly out of overflow capacity. Identify which processes have been swapped out significantly:
$ for file in /proc/*/status; do
pid=$(echo $file | cut -d/ -f3)
comm=$(cat /proc/$pid/comm 2>/dev/null)
vmswap=$(grep VmSwap $file 2>/dev/null | awk '{print $2}')
if [ -n "$vmswap" ] && [ "$vmswap" -gt "102400" ] 2>/dev/null; then
echo "$comm (PID $pid): ${vmswap} kB in swap"
fi
done
worker.py (PID 1842): 2097152 kB in swap
java (PID 2103): 1048576 kB in swap
The leaking worker has pushed 2 GB into swap, which is why the system hasn't OOM-killed it yet — but performance is terrible because every page fault for that process requires a disk read. Confirm active swap I/O with vmstat:
$ vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 1 7864320 65536 32768 409600 128 512 640 2048 1024 2048 12 8 72 8 0
3 2 7921664 49152 32768 401408 256 768 512 3072 1248 2312 15 10 63 12 0
The so column (swap out) climbing to 768 means 768 KB/s is actively being written to swap. That's the leak in slow motion — you can watch memory drain in real time.
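vmstat's so column is an interval average; if you want the rate straight from the kernel's own counters, /proc/vmstat exposes pswpout (pages swapped out since boot). A rough sketch — the helper name is mine:

```shell
# Swap-out rate over a short window, from the cumulative pswpout counter
# (pages swapped out since boot) in /proc/vmstat.
swap_out_rate() {
    interval=${1:-2}
    page_kb=$(( $(getconf PAGESIZE) / 1024 ))
    a=$(awk '/^pswpout /{print $2}' /proc/vmstat); a=${a:-0}
    sleep "$interval"
    b=$(awk '/^pswpout /{print $2}' /proc/vmstat); b=${b:-0}
    echo "swap-out: $(( (b - a) * page_kb / interval )) kB/s"
}

swap_out_rate 2
```

A sustained nonzero rate here, while free memory keeps shrinking, is the same slow-motion drain the vmstat output shows.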
How to fix it: Lower swappiness to reduce how eagerly the kernel reaches for swap:
$ sysctl -w vm.swappiness=10
A value of 10 tells the kernel to strongly prefer keeping anonymous pages in RAM and only start swapping under serious pressure. For latency-sensitive servers where you'd rather take an OOM kill than suffer disk I/O, use swappiness=1 rather than 0 — setting it to 0 can cause issues on some kernel versions where it interacts poorly with memory-mapped file handling. If swap is nearly exhausted and you need emergency breathing room without a reboot, add a temporary swap file:
$ fallocate -l 4G /swapfile-emergency
$ chmod 600 /swapfile-emergency
$ mkswap /swapfile-emergency
$ swapon /swapfile-emergency
This buys you time to identify and kill the leaking process. Don't leave emergency swap files lying around permanently — they consume disk space and become a crutch rather than a fix.
Root Cause 6: tmpfs and Shared Memory Consuming RAM
tmpfs mounts use physical RAM as their backing store. If something is writing large amounts of data to a tmpfs mount — application caches, IPC buffers, temp files from a runaway job — it consumes real RAM and won't show up in per-process RSS accounting. This is a surprisingly common source of mysterious memory consumption because top and ps give you no indication of what's happening.
How to identify it:
$ df -h | grep tmpfs
tmpfs 7.8G 7.5G 300M 97% /run/shm
tmpfs 1.6G 1.2G 400M 75% /tmp
tmpfs 2.0G 1.8G 200M 90% /dev/shm
$ du -sh /run/shm/*
3.2G /run/shm/app_cache_segment
4.1G /run/shm/ipc_queue_buffer
Two shared memory segments consuming 7+ GB in /run/shm. Applications using POSIX shared memory (shm_open()) create their segments as files here; System V shared memory (shmget()) consumes RAM the same way but is tracked separately through ipcs rather than as visible files. Either way, if the owner doesn't clean up on exit, the memory persists until explicitly released or the system reboots. Identify orphaned System V segments:
$ ipcs -m
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 32768 infrarunbook-admin 600 4294967296 0 dest
0x1a2b3c4d 32769 infrarunbook-admin 600 1073741824 1
The segment with nattch=0 and dest status is marked for deletion but hasn't been cleaned up. If the owning process is gone, remove it manually:
$ ipcrm -m 32768
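If several segments have leaked, a small helper can pull the orphaned shmids out of ipcs -m output mechanically. This is a sketch (the function name is mine) that only reports — pipe the ids to ipcrm yourself once you've verified the owners really are gone:

```shell
# Print shmids of System V shared memory segments with no attached
# processes, given `ipcs -m` output on stdin. Data lines start with a
# hex key in field 1; field 6 is nattch.
orphan_shmids() {
    awk '$1 ~ /^0x/ && $6 == 0 { print $2 }'
}

# Report orphans on the live system (no-op if none exist)
ipcs -m 2>/dev/null | orphan_shmids
```

For example, `ipcs -m | orphan_shmids | xargs -r -n1 ipcrm -m` would reap them all in one pass — only do that once you're sure nothing intends to re-attach.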
How to fix it: Application-level fix — ensure cleanup on exit using proper signal handlers and atexit routines. At the OS level, set size limits on your tmpfs mounts so a single runaway process can't exhaust all memory. In /etc/fstab:
tmpfs /dev/shm tmpfs defaults,size=2g,noexec,nosuid 0 0
The size=2g parameter caps that mount at 2 GB regardless of what applications write to it. They'll get ENOSPC rather than silently consuming all available RAM.
Root Cause 7: Memory-Mapped Files Not Released
Applications using mmap() map files directly into virtual address space. This is efficient — reads go through the page cache, writes are handled by the kernel's writeback mechanism — but the memory accounting is split. Mapped file pages count against the page cache, not the process's anonymous RSS. When many processes map the same large files, or a process maps a large file and never calls munmap(), the RAM consumption can be enormous while remaining largely invisible in standard tools.
How to identify it: Use smaps_rollup for a fast summary of a process's memory composition:
$ cat /proc/1842/smaps_rollup
Rss: 2621440 kB
Pss: 2097152 kB
Shared_Clean: 1048576 kB
Shared_Dirty: 0 kB
Private_Clean: 524288 kB
Private_Dirty: 1048576 kB
Anonymous: 1572864 kB
Shared_Clean at 1 GB indicates file-backed mmapped pages. To see which specific files are mapped, and how many times:
$ awk '{print $6}' /proc/1842/maps | grep '/' | sort | uniq -c | sort -rn | head -10
24 /opt/app/data/large_dataset.bin
18 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
8 /opt/app/data/index.db
If large_dataset.bin is several gigabytes and mapped 24 times across multiple worker processes, that's real memory pressure. The fix is application-level: call munmap() after use, or redesign to read-and-release rather than keeping entire large files mapped throughout the process lifetime.
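To turn those mapping counts into actual sizes, you can sum the address ranges in /proc/&lt;pid&gt;/maps per backing file. A sketch in plain shell (the helper name is mine; address ranges are hex start-end pairs in field 1, the path is field 6):

```shell
# Total mapped kB per backing file from a /proc/<pid>/maps-format file.
# Each line starts with a hex "start-end" range; field 6 is the path
# (empty for anonymous mappings, which we skip).
mapped_file_sizes() {
    while read -r range perms offset dev inode path rest; do
        case $path in /*) ;; *) continue ;; esac
        start=${range%-*}; end=${range#*-}
        echo "$(( (0x$end - 0x$start) / 1024 )) $path"
    done < "${1:?usage: mapped_file_sizes <maps-file>}" |
    awk '{ kb[$2] += $1 } END { for (f in kb) printf "%10d kB  %s\n", kb[f], f }' |
    sort -rn
}

# Example against our own process
if [ -r /proc/self/maps ]; then mapped_file_sizes /proc/self/maps | head -5; fi
```

Note this sums virtual address space per mapping, not resident pages — for the resident figure you'd still go to smaps — but it immediately shows which files dominate the address space.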
Prevention
Memory leak investigation shouldn't be purely reactive. Several practices I put in place on production Linux systems dramatically reduce both the frequency and severity of incidents.
Alert on trends, not just thresholds. A static alert at "90% memory used" fires after the damage is done. Set up trend-based alerting: if available memory has dropped by more than 20% over the last hour, something is likely leaking. In Prometheus, predict_linear(node_memory_MemAvailable_bytes[2h], 3600) projects where available memory will be one hour from now based on the last two hours' trend — a forward-looking signal rather than a reactive threshold breach.
Baseline your slab cache. Know what normal looks like. After a fresh restart under representative load, capture:
$ slabtop -o > /root/slab_baseline_$(date +%Y%m%d).txt
When anomalies appear, diff against the baseline. Unexplained growth in specific caches points directly at the subsystem or workload driving the issue, cutting investigation time significantly.
Use systemd memory limits for every service. Even if your application doesn't leak today, a cgroup memory ceiling prevents one runaway process from taking down the whole server. In every service unit file:
# /etc/systemd/system/app-worker.service
[Service]
MemoryMax=4G
MemorySwapMax=1G
OOMPolicy=kill
MemoryMax hard-limits the cgroup. OOMPolicy=kill ensures the OOM killer targets this service's processes when the cgroup limit is hit rather than trying to find victims elsewhere on the system. This is far more surgical than letting a leak exhaust all available memory before anything gets killed.
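With cgroup v2 (the default on current systemd distributions), you can check a unit's live usage against its MemoryMax straight from the cgroup filesystem. A sketch — the helper name is mine, the unit name reuses the earlier example, and CGROUP_ROOT is an override hook I've added so the sketch can be exercised against a fake tree:

```shell
# Check a unit's cgroup v2 memory usage against its configured limit.
# On a real host CGROUP_ROOT is /sys/fs/cgroup/system.slice.
unit_mem_usage() {
    cg="${CGROUP_ROOT:-/sys/fs/cgroup/system.slice}/$1"
    cur=$(cat "$cg/memory.current" 2>/dev/null) || { echo "$1: no cgroup found" >&2; return 1; }
    max=$(cat "$cg/memory.max" 2>/dev/null) || return 1
    if [ "$max" = max ]; then
        echo "$1: $((cur / 1048576)) MiB used, no MemoryMax set"
    else
        echo "$1: $((cur / 1048576)) MiB of $((max / 1048576)) MiB limit"
    fi
}

unit_mem_usage app-worker.service 2>/dev/null || true
```

The same numbers are available via systemctl show -p MemoryCurrent; reading the cgroup files directly is just handy inside monitoring scripts.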
Consolidate your memory tuning into a single persistent sysctl file so the configuration survives reboots and is easy to audit:
# /etc/sysctl.d/99-memory.conf
vm.swappiness = 10
vm.vfs_cache_pressure = 200
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.compaction_proactiveness = 20
vm.oom_kill_allocating_task = 1
Catch leaks in development, not production. If you own the application code, integrate memory leak detection into CI. For C/C++ services, Valgrind or AddressSanitizer with leak detection enabled catches most leaks before they ship. For Python services, run a test suite under tracemalloc and assert that memory growth after N requests stays within bounds. The cost of catching a leak in CI is minutes. The cost of catching it during a 3 AM production incident is orders of magnitude higher.
Document your swap strategy explicitly. Make a deliberate decision: do you want swap as an emergency pressure valve (low swappiness, 1–2x RAM, tolerant of some I/O degradation) or do you want no swap (prefer fast OOM kills over slow degradation)? Neither is universally right — it depends on your workload and your SLA. Write that decision down in your runbook so the next engineer doesn't accidentally reconfigure it without understanding why it was set that way.
Memory leak investigation on Linux rewards methodical thinking over intuition. Start with the tools that give you the big picture — free, vmstat, /proc/meminfo — establish whether you're dealing with userspace memory or kernel memory, then drill into per-process smaps or per-slab accounting as appropriate. Most leaks aren't mysterious once you know exactly where to look.
