Symptoms
High IO wait (iowait) is one of those performance issues that can quietly kill a system. You'll typically notice it first through sluggish application response times — a database query that normally takes 50ms starts taking 5 seconds, SSH sessions lag noticeably, and cron jobs pile up waiting to run. The system load average climbs, sometimes dramatically, even though CPU utilization looks normal or even low.
Running top or htop, you'll see the wa (IO wait) percentage in the CPU summary line jump above 5–10%, and in bad cases it can sit at 50–80% or higher. Meanwhile, processes accumulate in the D (uninterruptible sleep) state — the kernel's way of signaling that something is blocked on IO and cannot be interrupted, even by a signal. A growing count of D-state processes combined with elevated load is almost always an IO problem.
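A quick one-liner makes this check concrete; it counts processes whose state code begins with D (a sketch, nothing here is system-specific):

```shell
# Count processes currently in uninterruptible sleep (state code starts with D)
dcount=$(ps -eo stat= | grep -c '^D' || true)
echo "D-state processes: ${dcount}"
```

Run it a few times in a row; a count that stays elevated matters far more than a momentary blip.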
top - 14:23:01 up 12 days, 2:14, 3 users, load average: 18.42, 16.21, 14.87
%Cpu(s): 2.1 us, 0.8 sy, 0.0 ni, 12.3 id, 83.7 wa, 0.1 hi, 1.0 si, 0.0 st
That 83.7% iowait is a five-alarm fire. The system isn't CPU-bound at all — it's spending the vast majority of its time waiting for storage to respond. Let's break down the most common causes and how to resolve each one.
Root Cause 1: Slow or Failing Disk
The most straightforward cause of high iowait is a disk that simply can't keep up. This might be a drive failing mechanically, a SATA spinning disk being overwhelmed by workloads that demand NVMe throughput, or a drive experiencing repeated sector read errors that force the kernel to retry operations multiple times before giving up.
In my experience, this sneaks up on you. A drive will quietly accumulate reallocated sectors for weeks before anything obvious breaks. It keeps working — just slowly — because every IO on a bad sector takes 10x longer as the read head skips over it and retries. The workload that ran fine six months ago now generates constant iowait because the disk is fighting itself on every access.
Start diagnosis with iostat:
iostat -xz 1 5
Focus on three columns: await (average time in milliseconds for IO requests to complete, including time spent queued), svctm (average service time per request; note this column is deprecated in recent sysstat releases and no longer reported by the newest ones), and %util (how busy the device is). On a healthy NVMe drive, await should be under 0.5ms. On a spinning disk doing sequential reads, under 10ms is reasonable. If you're seeing 200ms or 500ms, something is wrong.
Device r/s w/s rkB/s wkB/s await svctm %util
sda 12.00 45.00 384.00 1440.00 423.18 22.14 99.80
A 423ms average await with 99.8% utilization means the disk is completely saturated and every new request is queuing behind the backlog. Next, pull SMART data to look for hardware-level failure indicators:
smartctl -a /dev/sda
The attributes to watch are Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable. Any non-zero value for the latter two means there are sectors the drive can't reliably read. Also check dmesg for kernel-level ATA errors:
dmesg | grep -E "error|ata[0-9]|exception|reset"
Output like this confirms hardware failure in progress:
[12345.678901] ata1.00: exception Emask 0x10 SAct 0x800000 SErr 0x400000 action 0x6 frozen
[12345.679012] ata1.00: failed command: READ FPDMA QUEUED
[12345.679123] ata1.00: status: { DRDY ERR }
[12345.679234] ata1.00: error: { UNC }
The fix is direct: replace the disk. If you're on a RAID array, initiate a replacement before the drive fully fails. For a standalone drive, evacuate data immediately and swap the hardware. You can sometimes reduce iowait temporarily by switching the IO scheduler to mq-deadline or bfq, but that's just buying time — the hardware needs to go.
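For reference, the scheduler is inspected and switched per device through sysfs. A sketch, assuming the affected device is sda (substitute your own):

```shell
dev=sda                                      # hypothetical device name
sched_file=/sys/block/${dev}/queue/scheduler
if [ -r "$sched_file" ]; then
    cat "$sched_file"                        # active scheduler appears in [brackets]
    # Switch at runtime (needs root); the change does not survive a reboot
    echo mq-deadline > "$sched_file" 2>/dev/null || echo "run as root to switch"
else
    echo "no scheduler file for ${dev}"
fi
```

If the change helps, persist it with a udev rule rather than re-applying it by hand after every reboot.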
Root Cause 2: RAID Array in Degraded State
A degraded software RAID array — particularly RAID-5 or RAID-6 managed with mdadm — will hammer iowait in ways that aren't immediately obvious. When a drive drops out of a RAID-5 array, every read of data that lived on the missing disk now requires parity reconstruction from the remaining drives. What was a single-disk read becomes a multi-disk parity calculation. If you're also running a background array rebuild at the same time, it compounds: normal application IO competes directly with resync IO for the same physical devices.
Check RAID status immediately:
cat /proc/mdstat
A healthy array looks like this:
md0 : active raid5 sdb[1] sdc[2] sdd[3]
2929398784 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
A degraded array shows something like:
md0 : active raid5 sdb[1] sdc[2] sdd[4](F)
2929398784 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
That [UU_] tells you one drive position is empty, and (F) on sdd confirms it was marked failed. The system logs will corroborate this:
kernel: md/raid:md0: Disk failure on sdd, disabling device.
kernel: md/raid:md0: Operation continuing on 2 devices.
To recover, formally remove the failed device and add a replacement drive:
mdadm /dev/md0 --remove /dev/sdd
mdadm /dev/md0 --add /dev/sde
Watch the rebuild progress in real time:
watch -n2 cat /proc/mdstat
During the rebuild, iowait will remain elevated because resync is IO-intensive by nature. You can limit the resync speed to protect running services from IO starvation:
echo 50000 > /proc/sys/dev/raid/speed_limit_max
The default is often 200000 KB/s, which will saturate available disk bandwidth on a busy host. Throttling to 50000 KB/s during business hours and raising it back overnight is a sensible tradeoff. Once the rebuild completes and /proc/mdstat shows all drives as [UUU], iowait should return to its normal baseline.
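One way to automate that tradeoff is a pair of cron entries. A sketch for /etc/cron.d, where the file name, hours, and limits are assumptions to adjust for your environment:

```
# /etc/cron.d/md-resync-throttle (hypothetical file)
# Throttle resync during business hours, restore full speed overnight
0 8  * * * root echo 50000  > /proc/sys/dev/raid/speed_limit_max
0 19 * * * root echo 200000 > /proc/sys/dev/raid/speed_limit_max
```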
Root Cause 3: NFS Mount Hanging
NFS is deceptive. When it works, it works well. When the NFS server is slow, saturated, or unreachable, every process that touches a file on that mount enters uninterruptible sleep — the D state — and stays there until the server responds or the mount times out. On a system where many processes share NFS mounts for home directories, configuration files, or shared application data, you can have dozens of D-state processes and iowait pegged even though every local disk on the machine is perfectly healthy.
I've watched this take down entire application stacks. A single NFS server at 192.168.10.50 hosting shared configuration directories became saturated during an overnight backup job that wasn't throttled. By morning, every application process that tried to read its config file blocked. Load average climbed into the hundreds, and the hosts became completely unresponsive — all traceable back to one overwhelmed NFS server.
To identify NFS as the culprit, first look for D-state processes:
ps aux | awk '$8 ~ /D/ {print}'
Then check what kernel function those processes are sleeping in:
cat /proc/$(pgrep stuck_process)/wchan
If you see nfs4_wait_bit_killable or rpc_wait_bit_killable, that's your confirmation. Also inspect mount statistics directly:
mountstats /mnt/nfs_share
Elevated RTT values confirm a slow or unresponsive NFS server:
NFS mount on /mnt/nfs_share from 192.168.10.50:/exports/data
Stats since server mount:
Transport protocol: TCP
avg RTT (ms): 4832.5
avg exe (ms): 4901.2
Normal RTT on a local LAN should be under 5ms. A value of 4832ms means the NFS server is essentially not responding. Verify reachability from the client:
ping 192.168.10.50
showmount -e 192.168.10.50
If the server is unreachable or not responding to RPC calls, the only immediate option on the client side is a forced lazy unmount:
umount -f -l /mnt/nfs_share
The -l flag performs a lazy unmount, detaching the filesystem from the namespace immediately while allowing existing open file handles to close naturally; the -f flag forces the NFS client to abort outstanding RPC requests, which is what actually releases most blocked D-state processes. For future NFS mounts, always configure soft with a sensible timeout so processes don't block indefinitely:
192.168.10.50:/exports/data /mnt/nfs_share nfs soft,timeo=30,retrans=3,_netdev 0 0
soft mounts return an error to the calling process after the timeout rather than blocking forever. Some applications don't handle IO errors gracefully, but a recoverable error is a vastly better failure mode than a frozen host.
Root Cause 4: Database Heavy Writes
Databases are IO-hungry, and under certain conditions they can saturate disk bandwidth entirely. The most common scenarios are a bulk data import, a PostgreSQL VACUUM or ANALYZE operation consuming all write throughput, a MySQL index rebuild, or a checkpoint flush event that dumps a massive dirty buffer to disk in a sudden burst. These aren't hardware failures — they're workload-driven IO storms, and they're fixable through configuration.
PostgreSQL's checkpoint behavior is the pattern I've seen catch teams off guard most often. When max_wal_size is too small relative to your write volume, PostgreSQL triggers checkpoints very frequently. Each checkpoint flushes a large amount of dirty shared buffers to disk in a concentrated burst. If you watch iowait over time and see it spike sharply every few minutes in a regular pattern, that rhythm almost always points to checkpointing.
First, confirm which process is doing the writing with iotop:
iotop -o -P
The -o flag shows only processes with active IO. Output like this tells the story immediately:
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
4821 be/4 postgres 0.00 B/s 147.23 M/s 0.00 % 98.72 % postgres: checkpointer
4822 be/4 postgres 0.00 B/s 23.41 M/s 0.00 % 1.10 % postgres: walwriter
The checkpointer writing 147 MB/s is saturating the disk. In postgresql.conf, tune checkpoint behavior to spread writes over a longer interval:
checkpoint_completion_target = 0.9
max_wal_size = 4GB
checkpoint_warning = 30s
Setting checkpoint_completion_target to 0.9 instructs PostgreSQL to spread dirty page writes over 90% of the checkpoint interval, smoothing out the burst considerably. Increasing max_wal_size reduces how frequently checkpoints are triggered in the first place. Reload the config with SELECT pg_reload_conf(); — no restart required.
For MySQL InnoDB, tune the buffer pool flush settings:
innodb_io_capacity = 400
innodb_io_capacity_max = 2000
innodb_flush_method = O_DIRECT
innodb_flush_neighbors = 0
innodb_flush_method = O_DIRECT bypasses the OS page cache for InnoDB data files, eliminating double-buffering and reducing effective IO overhead. On SSDs, set innodb_flush_neighbors = 0 — there's no benefit to flushing adjacent pages on flash storage the way there is on spinning disks where spatial locality matters.
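After restarting MySQL with those settings, it's worth confirming they actually took effect. A sketch using the mysql client, guarded so it degrades gracefully on hosts without one:

```shell
# Print the live values of the flush-related variables tuned above
if command -v mysql >/dev/null 2>&1; then
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush%';" || true
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity%';" || true
else
    echo "mysql client not installed on this host"
fi
checked=1
```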
For a one-time bulk import that's causing iowait, throttle it with ionice so other services stay responsive:
ionice -c 3 -p $(pidof import_process)
IO class 3 (idle) means the import only consumes disk bandwidth when no other process needs it — a clean way to run a heavy job in the background without impacting production workloads. Note that IO priority classes are honored by the bfq (and legacy cfq) schedulers; under none or mq-deadline their effect is limited.
Root Cause 5: Swap Thrashing
Swap thrashing feels different from the other causes because it's fundamentally a memory problem that manifests as a disk problem. When a system exhausts physical RAM, the kernel starts paging memory out to the swap device. Swap reads and writes are orders of magnitude slower than RAM access. If the system is actively thrashing — evicting pages to swap, then immediately needing them back in — you'll see severe iowait coming from what looks like an ordinary disk but is actually your swap partition consuming all available IO bandwidth.
Thrashing happens when the active working set of running processes exceeds available physical RAM. The kernel evicts pages to make room, those evicted pages are needed again almost immediately, they're swapped back in, other pages are evicted in their place, and the cycle repeats endlessly. The system burns all its time doing IO instead of useful work, and CPU stays idle waiting for page-ins to complete.
Diagnose with vmstat:
vmstat 1 10
Watch the si (swap in) and so (swap out) columns carefully:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 14 2097152 12288 1024 98304 4821 3201 5432 4120 892 1241 3 2 1 94 0
si=4821 and so=3201 KB/s means the system is swapping in at roughly 5 MB/s while simultaneously swapping out at 3 MB/s. Combined with 94% iowait, this is a system in full thrash. Confirm with free:
free -h
total used free shared buff/cache available
Mem: 7.8G 7.6G 12M 324M 178M 0B
Swap: 2.0G 2.0G 0B
Zero available memory, swap completely exhausted. This system is in serious trouble. The immediate relief is identifying and addressing the memory hog:
ps aux --sort=-%mem | head -20
For more precise RSS accounting across the process tree:
smem -r -s rss | head -20
Kill or restart the offending process. For longer-term relief, tune vm.swappiness to make the kernel less aggressive about preemptive swapping:
sysctl -w vm.swappiness=10
The default value of 60 permits aggressive preemptive page eviction. Dropping to 10 tells the kernel to strongly prefer keeping pages in RAM. Persist the change so it survives reboots:
echo "vm.swappiness=10" >> /etc/sysctl.d/99-vm-tuning.conf
sysctl -p /etc/sysctl.d/99-vm-tuning.conf
This tuning won't save you if you're genuinely out of RAM — it just reduces unnecessary preemptive swapping on systems that are pressured but not fully exhausted. The real fix is adding RAM or reducing the memory footprint of the services running on the host. Swap is a safety net, not a long-term performance strategy.
Root Cause 6: Runaway Log Writes and Bulk Filesystem Operations
A less obvious iowait source is automated filesystem operations that nobody explicitly planned for. Log rotation with logrotate, unthrottled rsync backup jobs, find operations scanning large directory trees, or updatedb running via cron can all generate IO bursts that catch you off guard at 2 AM.
These are usually identifiable by correlating iowait spikes with your cron schedule. Check journalctl around the time of the spike and cross-reference with iotop -o -P output the next time a spike occurs. For rsync jobs that run on a schedule, add bandwidth limiting to prevent them from saturating the link:
rsync --bwlimit=50000 -av /source/ infrarunbook-admin@192.168.10.20:/destination/
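To line an iowait spike up against scheduled jobs, pull the log window around the spike. A sketch, where the time range is a placeholder for whenever your spike occurred:

```shell
# What ran around the spike window? Fall back to syslog on non-systemd hosts.
if command -v journalctl >/dev/null 2>&1; then
    journalctl --since "02:00" --until "02:30" --no-pager | grep -iE 'cron|CMD' | head || true
else
    grep -i cron /var/log/syslog 2>/dev/null | head || true
fi
searched=1
```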
For logrotate, if you're compressing large log files synchronously, add delaycompress to the relevant stanza. This defers compression until the next rotation cycle, spreading the IO cost over time rather than concentrating it at the moment of rotation when the log files are largest.
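A sketch of such a stanza; the log path is hypothetical:

```
/var/log/myapp/*.log {
    weekly
    rotate 8
    missingok
    notifempty
    compress
    delaycompress   # compress on the next cycle, not at rotation time
}
```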
Prevention
Preventing high iowait is largely about monitoring before things degrade and designing systems so they fail gracefully rather than catastrophically. Start by setting up meaningful alerts. In Prometheus, alerting when rate(node_cpu_seconds_total{mode="iowait"}[2m]) exceeds 0.15 for more than two minutes catches problems early. Catching iowait at 15% is dramatically easier to diagnose and fix than waiting until it's at 80% with users screaming.
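That threshold translates into an alerting rule along these lines (rule and label names are assumptions; the expression averages across CPUs so one busy core doesn't distort the value):

```yaml
groups:
  - name: io-alerts
    rules:
      - alert: HighIOWait
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[2m])) > 0.15
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "iowait above 15% on {{ $labels.instance }}"
```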
Enable SMART monitoring with smartd and configure it to send alerts on predictive failure attributes. Most drive failures announce themselves through SMART data days or weeks before the drive actually dies. Pair that with mdadm --monitor --scan --daemonize for software RAID arrays, which will notify you the moment a drive is marked failed rather than letting the degraded state persist unnoticed.
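In /etc/smartd.conf, a single DEVICESCAN directive covers every detected drive. A sketch, with a placeholder mail address:

```
# Monitor all attributes, run a short self-test daily at 02:00,
# and mail on failures or newly pending sectors
DEVICESCAN -a -o on -S on -s (S/../.././02) -m admin@example.com
```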
For NFS infrastructure, always include _netdev in mount options so that NFS mounts don't block the boot sequence, and consider soft mounts with appropriate timeout values for any share that isn't absolutely critical for write consistency. Have a documented runbook procedure for forcibly unmounting stuck NFS mounts — when it happens at 3 AM, the on-call engineer shouldn't be improvising.
Review database checkpoint and flush configurations before you're under production load. Use pg_stat_bgwriter to monitor PostgreSQL checkpoint frequency and buffer write patterns over time. Set checkpoint_warning low enough that you receive log entries when checkpoints happen more frequently than expected — treat those log lines as early warnings, not noise.
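A quick ratio check can be sketched with psql (these column names exist through PostgreSQL 16; in version 17 the checkpoint counters moved to pg_stat_checkpointer):

```shell
# checkpoints_req climbing much faster than checkpoints_timed suggests
# max_wal_size is too small for the write volume
if command -v psql >/dev/null 2>&1; then
    psql -Atc "SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;" || true
else
    echo "psql not installed on this host"
fi
queried=1
```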
Track memory utilization trends, not just instantaneous values. If available memory is consistently below 10% of total RAM during normal business hours on sw-infrarunbook-01, you're one traffic spike away from swap thrashing. Either add RAM, move services to another host, or reduce the memory footprint of the applications running there.
Finally, establish baseline IO performance numbers for every production host before it goes live. Run fio on freshly provisioned systems:
fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=60 --group_reporting
Store those baseline numbers somewhere permanent. When something feels slow six months later, you'll have real data to compare against instead of relying on gut feel. The combination of proactive monitoring, hardware health checks, properly tuned workloads, and documented response procedures will keep iowait out of the danger zone for the vast majority of systems you manage.
