Symptoms
High IO wait (iowait) is one of those performance issues that can quietly kill a system. You'll typically notice it first through sluggish application response times — a database query that normally takes 50ms starts taking 5 seconds, SSH sessions lag noticeably, and cron jobs pile up waiting to run. The system load average climbs, sometimes dramatically, even though CPU utilization looks normal or even low.
Running top or htop, you'll see the wa (IO wait) percentage in the CPU summary line jump above 5–10%, and in bad cases it can sit at 50–80% or higher. Meanwhile, processes accumulate in the D (uninterruptible sleep) state — the kernel's way of signaling that something is blocked on IO and cannot be interrupted, even by a signal. A growing count of D-state processes combined with elevated load is almost always an IO problem.
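A quick one-liner makes this check concrete; it counts processes whose state code begins with D (a sketch, nothing here is system-specific):

```shell
# Count processes currently in uninterruptible sleep (state code starts with D)
dcount=$(ps -eo stat= | grep -c '^D' || true)
echo "D-state processes: ${dcount}"
```

Run it a few times in a row; a count that stays elevated matters far more than a momentary blip.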
top - 14:23:01 up 12 days, 2:14, 3 users, load average: 18.42, 16.21, 14.87
%Cpu(s): 2.1 us, 0.8 sy, 0.0 ni, 12.3 id, 83.7 wa, 0.1 hi, 1.0 si, 0.0 st
That 83.7% iowait is a five-alarm fire. The system isn't CPU-bound at all — it's spending the vast majority of its time waiting for storage to respond. Let's break down the most common causes and how to resolve each one.
Root Cause 1: Slow or Failing Disk
The most straightforward cause of high iowait is a disk that simply can't keep up. This might be a drive failing mechanically, a SATA spinning disk being overwhelmed by workloads that demand NVMe throughput, or a drive experiencing repeated sector read errors that force the kernel to retry operations multiple times before giving up.
In my experience, this sneaks up on you. A drive will quietly accumulate reallocated sectors for weeks before anything obvious breaks. It keeps working — just slowly — because every IO on a bad sector takes 10x longer as the read head skips over it and retries. The workload that ran fine six months ago now generates constant iowait because the disk is fighting itself on every access.
Start diagnosis with iostat:
iostat -xz 1 5
Focus on three columns: await (average time in milliseconds for IO requests to complete, including time spent queued), svctm (average service time per request; note this column is deprecated in recent sysstat releases and no longer reported by the newest ones), and %util (how busy the device is). On a healthy NVMe drive, await should be under 0.5ms. On a spinning disk doing sequential reads, under 10ms is reasonable. If you're seeing 200ms or 500ms, something is wrong.
Device r/s w/s rkB/s wkB/s await svctm %util
sda 12.00 45.00 384.00 1440.00 423.18 22.14 99.80
A 423ms average await with 99.8% utilization means the disk is completely saturated and every new request is queuing behind the backlog. Next, pull SMART data to look for hardware-level failure indicators:
smartctl -a /dev/sda
The attributes to watch are Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable. Any non-zero value for the latter two means there are sectors the drive can't reliably read. Also check dmesg for kernel-level ATA errors:
dmesg | grep -E "error|ata[0-9]|exception|reset"
Output like this confirms hardware failure in progress:
[12345.678901] ata1.00: exception Emask 0x10 SAct 0x800000 SErr 0x400000 action 0x6 frozen
[12345.679012] ata1.00: failed command: READ FPDMA QUEUED
[12345.679123] ata1.00: status: { DRDY ERR }
[12345.679234] ata1.00: error: { UNC }
The fix is direct: replace the disk. If you're on a RAID array, initiate a replacement before the drive fully fails. For a standalone drive, evacuate data immediately and swap the hardware. You can sometimes reduce iowait temporarily by switching the IO scheduler to mq-deadline or bfq, but that's just buying time — the hardware needs to go.
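For reference, the scheduler is inspected and switched per device through sysfs. A sketch, assuming the affected device is sda (substitute your own):

```shell
dev=sda                                      # hypothetical device name
sched_file=/sys/block/${dev}/queue/scheduler
if [ -r "$sched_file" ]; then
    cat "$sched_file"                        # active scheduler appears in [brackets]
    # Switch at runtime (needs root); the change does not survive a reboot
    echo mq-deadline > "$sched_file" 2>/dev/null || echo "run as root to switch"
else
    echo "no scheduler file for ${dev}"
fi
```

If the change helps, persist it with a udev rule rather than re-applying it by hand after every reboot.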
Root Cause 2: RAID Array in Degraded State
A degraded software RAID array — particularly RAID-5 or RAID-6 managed with mdadm — will hammer iowait in ways that aren't immediately obvious. When a drive drops out of a RAID-5 array, every read of data that lived on the missing disk now requires parity reconstruction from the remaining drives. What was a single-disk read becomes a multi-disk parity calculation. If you're also running a background array rebuild at the same time, it compounds: normal application IO competes directly with resync IO for the same physical devices.
Check RAID status immediately:
cat /proc/mdstat
A healthy array looks like this:
md0 : active raid5 sdb[1] sdc[2] sdd[3]
2929398784 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
A degraded array shows something like:
md0 : active raid5 sdb[1] sdc[2] sdd[4](F)
2929398784 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
That [UU_] tells you one drive position is empty, and (F) on sdd confirms it was marked failed. The system logs will corroborate this:
kernel: md/raid:md0: Disk failure on sdd, disabling device.
kernel: md/raid:md0: Operation continuing on 2 devices.
To recover, formally remove the failed device and add a replacement drive:
mdadm /dev/md0 --remove /dev/sdd
mdadm /dev/md0 --add /dev/sde
Watch the rebuild progress in real time:
watch -n2 cat /proc/mdstat
During the rebuild, iowait will remain elevated because resync is IO-intensive by nature. You can limit the resync speed to protect running services from IO starvation:
echo 50000 > /proc/sys/dev/raid/speed_limit_max
The default is often 200000 KB/s, which will saturate available disk bandwidth on a busy host. Throttling to 50000 KB/s during business hours and raising it back overnight is a sensible tradeoff. Once the rebuild completes and /proc/mdstat shows all drives as [UUU], iowait should return to its normal baseline.
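One way to automate that tradeoff is a pair of cron entries. A sketch for /etc/cron.d, where the file name, hours, and limits are assumptions to adjust for your environment:

```
# /etc/cron.d/md-resync-throttle (hypothetical file)
# Throttle resync during business hours, restore full speed overnight
0 8  * * * root echo 50000  > /proc/sys/dev/raid/speed_limit_max
0 19 * * * root echo 200000 > /proc/sys/dev/raid/speed_limit_max
```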
Root Cause 3: NFS Mount Hanging
NFS is deceptive. When it works, it works well. When the NFS server is slow, saturated, or unreachable, every process that touches a file on that mount enters uninterruptible sleep — the D state — and stays there until the server responds or the mount times out. On a system where many processes share NFS mounts for home directories, configuration files, or shared application data, you can have dozens of D-state processes and iowait pegged even though every local disk on the machine is perfectly healthy.
I've watched this take down entire application stacks. A single NFS server at 192.168.10.50 hosting shared configuration directories became saturated during an overnight backup job that wasn't throttled. By morning, every application process that tried to read its config file blocked. Load average climbed into the hundreds, and the hosts became completely unresponsive — all traceable back to one overwhelmed NFS server.
To identify NFS as the culprit, first look for D-state processes:
ps aux | awk '$8 ~ /D/ {print}'
Then check what kernel function those processes are sleeping in:
cat /proc/$(pgrep stuck_process)/wchan
If you see nfs4_wait_bit_killable or rpc_wait_bit_killable, that's your confirmation. Also inspect mount statistics directly:
mountstats /mnt/nfs_share
Elevated RTT values confirm a slow or unresponsive NFS server:
NFS mount on /mnt/nfs_share from 192.168.10.50:/exports/data
Stats since server mount:
Transport protocol: TCP
avg RTT (ms): 4832.5
avg exe (ms): 4901.2
Normal RTT on a local LAN should be under 5ms. A value of 4832ms means the NFS server is essentially not responding. Verify reachability from the client:
ping 192.168.10.50
showmount -e 192.168.10.50
If the server is unreachable or not responding to RPC calls, the only immediate option on the client side is a forced lazy unmount:
umount -f -l /mnt/nfs_share
The -l flag performs a lazy unmount, detaching the filesystem from the namespace immediately while allowing existing open file handles to close naturally; the -f flag forces the NFS client to abort outstanding RPC requests, which is what actually releases most blocked D-state processes. For future NFS mounts, always configure soft with a sensible timeout so processes don't block indefinitely:
192.168.10.50:/exports/data /mnt/nfs_share nfs soft,timeo=30,retrans=3,_netdev 0 0
soft mounts return an error to the calling process after the timeout rather than blocking forever. Some applications don't handle IO errors gracefully, but a recoverable error is a vastly better failure mode than a frozen host.
Root Cause 4: Database Heavy Writes
Databases are IO-hungry, and under certain conditions they can saturate disk bandwidth entirely. The most common scenarios are a bulk data import, a PostgreSQL VACUUM or ANALYZE operation consuming all write throughput, a MySQL index rebuild, or a checkpoint flush event that dumps a massive dirty buffer to disk in a sudden burst. These aren't hardware failures — they're workload-driven IO storms, and they're fixable through configuration.
PostgreSQL's checkpoint behavior is the pattern I've seen catch teams off guard most often. When max_wal_size is too small relative to your write volume, PostgreSQL triggers checkpoints very frequently. Each checkpoint flushes a large amount of dirty shared buffers to disk in a concentrated burst. If you watch iowait over time and see it spike sharply every few minutes in a regular pattern, that rhythm almost always points to checkpointing.
First, confirm which process is doing the writing with iotop:
iotop -o -P
The -o flag shows only processes with active IO. Output like this tells the story immediately:
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
4821 be/4 postgres 0.00 B/s 147.23 M/s 0.00 % 98.72 % postgres: checkpointer
4822 be/4 postgres 0.00 B/s 23.41 M/s 0.00 % 1.10 % postgres: walwriter
The checkpointer writing 147 MB/s is saturating the disk. In postgresql.conf, tune checkpoint behavior to spread writes over a longer interval:
checkpoint_completion_target = 0.9
max_wal_size = 4GB
checkpoint_warning = 30s
Setting checkpoint_completion_target to 0.9 instructs PostgreSQL to spread dirty page writes over 90% of the checkpoint interval, smoothing out the burst considerably. Increasing max_wal_size reduces how frequently checkpoints are triggered in the first place. Reload the config with SELECT pg_reload_conf(); — no restart required.
For MySQL InnoDB, tune the buffer pool flush settings:
innodb_io_capacity = 400
innodb_io_capacity_max = 2000
innodb_flush_method = O_DIRECT
innodb_flush_neighbors = 0
innodb_flush_method = O_DIRECT bypasses the OS page cache for InnoDB data files, eliminating double-buffering and reducing effective IO overhead. On SSDs, set innodb_flush_neighbors = 0 — there's no benefit to flushing adjacent pages on flash storage the way there is on spinning disks where spatial locality matters.
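After restarting MySQL with those settings, it's worth confirming they actually took effect. A sketch using the mysql client, guarded so it degrades gracefully on hosts without one:

```shell
# Print the live values of the flush-related variables tuned above
if command -v mysql >/dev/null 2>&1; then
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush%';" || true
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity%';" || true
else
    echo "mysql client not installed on this host"
fi
checked=1
```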
For a one-time bulk import that's causing iowait, throttle it with ionice so other services stay responsive:
ionice -c 3 -p $(pidof import_process)
IO class 3 (idle) means the import only consumes disk bandwidth when no other process needs it — a clean way to run a heavy job in the background without impacting production workloads. Note that IO priority classes are honored by the bfq (and legacy cfq) schedulers; under none or mq-deadline their effect is limited.
Root Cause 5: Swap Thrashing
Swap thrashing feels different from the other causes because it's fundamentally a memory problem that manifests as a disk problem. When a system exhausts physical RAM, the kernel starts paging memory out to the swap device. Swap reads and writes are orders of magnitude slower than RAM access. If the system is actively thrashing — evicting pages to swap, then immediately needing them back in — you'll see severe iowait coming from what looks like an ordinary disk but is actually your swap partition consuming all available IO bandwidth.
Thrashing happens when the active working set of running processes exceeds available physical RAM. The kernel evicts pages to make room, those evicted pages are needed again almost immediately, they're swapped back in, other pages are evicted in their place, and the cycle repeats endlessly. The system burns all its time doing IO instead of useful work, and CPU stays idle waiting for page-ins to complete.
Diagnose with vmstat:
vmstat 1 10
Watch the si (swap in) and so (swap out) columns carefully:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 14 2097152 12288 1024 98304 4821 3201 5432 4120 892 1241 3 2 1 94 0
si=4821 and so=3201 KB/s means the system is swapping in at roughly 5 MB/s while simultaneously swapping out at 3 MB/s. Combined with 94% iowait, this is a system in full thrash. Confirm with free:
free -h
total used free shared buff/cache available
Mem: 7.8G 7.6G 12M 324M 178M 0B
Swap: 2.0G 2.0G 0B
Zero available memory, swap completely exhausted. This system is in serious trouble. The immediate relief is identifying and addressing the memory hog:
ps aux --sort=-%mem | head -20
For more precise RSS accounting across the process tree:
smem -r -s rss | head -20
Kill or restart the offending process. For longer-term relief, tune vm.swappiness to make the kernel less aggressive about preemptive swapping:
sysctl -w vm.swappiness=10
The default value of 60 permits aggressive preemptive page eviction. Dropping to 10 tells the kernel to strongly prefer keeping pages in RAM. Persist the change so it survives reboots:
echo "vm.swappiness=10" >> /etc/sysctl.d/99-vm-tuning.conf
sysctl -p /etc/sysctl.d/99-vm-tuning.conf
This tuning won't save you if you're genuinely out of RAM — it just reduces unnecessary preemptive swapping on systems that are pressured but not fully exhausted. The real fix is adding RAM or reducing the memory footprint of the services running on the host. Swap is a safety net, not a long-term performance strategy.
Root Cause 6: Runaway Log Writes and Bulk Filesystem Operations
A less obvious iowait source is automated filesystem operations that nobody explicitly planned for. Log rotation with logrotate, unthrottled rsync backup jobs, find operations scanning large directory trees, or updatedb running via cron can all generate IO bursts that catch you off guard at 2 AM.
These are usually identifiable by correlating iowait spikes with your cron schedule. Check journalctl around the time of the spike and cross-reference with iotop -o -P output the next time a spike occurs. For rsync jobs that run on a schedule, add bandwidth limiting to prevent them from saturating the link:
rsync --bwlimit=50000 -av /source/ infrarunbook-admin@192.168.10.20:/destination/
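To line an iowait spike up against scheduled jobs, pull the log window around the spike. A sketch, where the time range is a placeholder for whenever your spike occurred:

```shell
# What ran around the spike window? Fall back to syslog on non-systemd hosts.
if command -v journalctl >/dev/null 2>&1; then
    journalctl --since "02:00" --until "02:30" --no-pager | grep -iE 'cron|CMD' | head || true
else
    grep -i cron /var/log/syslog 2>/dev/null | head || true
fi
searched=1
```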
For logrotate, if you're compressing large log files synchronously, add delaycompress to the relevant stanza. This defers compression until the next rotation cycle, spreading the IO cost over time rather than concentrating it at the moment of rotation when the log files are largest.
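A sketch of such a stanza; the log path is hypothetical:

```
/var/log/myapp/*.log {
    weekly
    rotate 8
    missingok
    notifempty
    compress
    delaycompress   # compress on the next cycle, not at rotation time
}
```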
Prevention
Preventing high iowait is largely about monitoring before things degrade and designing systems so they fail gracefully rather than catastrophically. Start by setting up meaningful alerts. In Prometheus, alerting when rate(node_cpu_seconds_total{mode="iowait"}[2m]) exceeds 0.15 for more than two minutes catches problems early. Catching iowait at 15% is dramatically easier to diagnose and fix than waiting until it's at 80% with users screaming.
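That threshold translates into an alerting rule along these lines (rule and label names are assumptions; the expression averages across CPUs so one busy core doesn't distort the value):

```yaml
groups:
  - name: io-alerts
    rules:
      - alert: HighIOWait
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[2m])) > 0.15
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "iowait above 15% on {{ $labels.instance }}"
```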
Enable SMART monitoring with smartd and configure it to send alerts on predictive failure attributes. Most drive failures announce themselves through SMART data days or weeks before the drive actually dies. Pair that with mdadm --monitor --scan --daemonize for software RAID arrays, which will notify you the moment a drive is marked failed rather than letting the degraded state persist unnoticed.
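In /etc/smartd.conf, a single DEVICESCAN directive covers every detected drive. A sketch, with a placeholder mail address:

```
# Monitor all attributes, run a short self-test daily at 02:00,
# and mail on failures or newly pending sectors
DEVICESCAN -a -o on -S on -s (S/../.././02) -m admin@example.com
```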
For NFS infrastructure, always include _netdev in mount options so that NFS mounts don't block the boot sequence, and consider soft mounts with appropriate timeout values for any share that isn't absolutely critical for write consistency. Have a documented runbook procedure for forcibly unmounting stuck NFS mounts — when it happens at 3 AM, the on-call engineer shouldn't be improvising.
Review database checkpoint and flush configurations before you're under production load. Use pg_stat_bgwriter to monitor PostgreSQL checkpoint frequency and buffer write patterns over time. Set checkpoint_warning low enough that you receive log entries when checkpoints happen more frequently than expected — treat those log lines as early warnings, not noise.
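A quick ratio check can be sketched with psql (these column names exist through PostgreSQL 16; in version 17 the checkpoint counters moved to pg_stat_checkpointer):

```shell
# checkpoints_req climbing much faster than checkpoints_timed suggests
# max_wal_size is too small for the write volume
if command -v psql >/dev/null 2>&1; then
    psql -Atc "SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;" || true
else
    echo "psql not installed on this host"
fi
queried=1
```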
Track memory utilization trends, not just instantaneous values. If available memory is consistently below 10% of total RAM during normal business hours on sw-infrarunbook-01, you're one traffic spike away from swap thrashing. Either add RAM, move services to another host, or reduce the memory footprint of the applications running there.
Finally, establish baseline IO performance numbers for every production host before it goes live. Run fio on freshly provisioned systems:
fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=60 --group_reporting
Store those baseline numbers somewhere permanent. When something feels slow six months later, you'll have real data to compare against instead of relying on gut feel. The combination of proactive monitoring, hardware health checks, properly tuned workloads, and documented response procedures will keep iowait out of the danger zone for the vast majority of systems you manage.
