InfraRunBook

    Linux High IO Wait Troubleshooting

    Linux
    Published: Apr 20, 2026
    Updated: Apr 20, 2026

    Learn how to diagnose and resolve high IO wait on Linux systems, covering slow disks, degraded RAID arrays, hung NFS mounts, database write storms, and swap thrashing with real CLI commands and output.


    Symptoms

    High IO wait (iowait) is one of those performance issues that can quietly kill a system. You'll typically notice it first through sluggish application response times — a database query that normally takes 50ms starts taking 5 seconds, SSH sessions lag noticeably, and cron jobs pile up waiting to run. The system load average climbs, sometimes dramatically, even though CPU utilization looks normal or even low.

    Running top or htop, you'll see the wa (IO wait) percentage in the CPU summary line jump above 5–10%, and in bad cases it can sit at 50–80% or higher. Meanwhile, processes accumulate in the D (uninterruptible sleep) state — the kernel's way of signaling that something is blocked on IO and cannot be interrupted, even by a signal. A growing count of D-state processes combined with elevated load is almost always an IO problem.

    top - 14:23:01 up 12 days,  2:14,  3 users,  load average: 18.42, 16.21, 14.87
    %Cpu(s):  2.1 us,  0.8 sy,  0.0 ni, 12.3 id, 83.7 wa,  0.1 hi,  1.0 si,  0.0 st

    That 83.7% iowait is a five-alarm fire. The system isn't CPU-bound at all — it's spending the vast majority of its time waiting for storage to respond. Let's break down the most common causes and how to resolve each one.
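    For scripting or quick checks over SSH, the same wa figure can be computed by sampling /proc/stat directly; a minimal sketch (field 6 of the aggregate cpu line is cumulative iowait jiffies on modern kernels):

```shell
# Two samples of /proc/stat one second apart; the delta in the iowait
# field over the delta in total jiffies is the iowait share of the window.
s1=$(grep '^cpu ' /proc/stat)
sleep 1
s2=$(grep '^cpu ' /proc/stat)
iowait_pct=$(printf '%s\n%s\n' "$s1" "$s2" | awk '
  NR == 1 { for (i = 2; i <= NF; i++) t1 += $i; w1 = $6 }
  NR == 2 { for (i = 2; i <= NF; i++) t2 += $i; w2 = $6
            printf "%.1f", 100 * (w2 - w1) / (t2 - t1) }')
echo "iowait over the last second: ${iowait_pct}%"
```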

    Root Cause 1: Slow or Failing Disk

    The most straightforward cause of high iowait is a disk that simply can't keep up. This might be a drive failing mechanically, a SATA spinning disk being overwhelmed by workloads that demand NVMe throughput, or a drive experiencing repeated sector read errors that force the kernel to retry operations multiple times before giving up.

    In my experience, this sneaks up on you. A drive will quietly accumulate reallocated sectors for weeks before anything obvious breaks. It keeps working — just slowly — because every IO on a bad sector takes 10x longer as the read head skips over it and retries. The workload that ran fine six months ago now generates constant iowait because the disk is fighting itself on every access.

    Start diagnosis with iostat:

    iostat -xz 1 5

    Focus on three columns: await (average time in milliseconds for IO requests to complete), svctm (average service time per request; note that svctm is deprecated in modern sysstat and may be absent, so don't lean on it), and %util (how busy the device is). On a healthy NVMe, await should be under 0.5ms. On a spinning disk doing sequential reads, under 10ms is reasonable. If you're seeing 200ms or 500ms, something is wrong.

    Device     r/s    w/s   rkB/s   wkB/s  await  svctm  %util
    sda       12.00  45.00  384.00 1440.00 423.18  22.14  99.80
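    When screening many devices or hosts, the await column can be checked mechanically. A sketch run against a captured sample like the one above; the 100ms threshold is a rule of thumb, not a standard:

```shell
# Print devices whose await (column 6 in this layout) exceeds 100ms.
# Shown against a captured sample; pipe live `iostat -xz` output instead.
sample='Device     r/s    w/s   rkB/s   wkB/s  await  svctm  %util
sda       12.00  45.00  384.00 1440.00 423.18  22.14  99.80
nvme0n1    5.00   8.00  120.00  256.00   0.31   0.10   4.20'
slow=$(printf '%s\n' "$sample" | awk 'NR > 1 && $6 > 100 { print $1 }')
echo "devices with await > 100ms: $slow"
```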

    A 423ms average await with 99.8% utilization means the disk is completely saturated and every new request is queuing behind the backlog. Next, pull SMART data to look for hardware-level failure indicators:

    smartctl -a /dev/sda

    The attributes to watch are Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable. Any non-zero value for the latter two means there are sectors the drive can't reliably read. Also check dmesg for kernel-level ATA errors:

    dmesg | grep -E "error|ata[0-9]|exception|reset"

    Output like this confirms hardware failure in progress:

    [12345.678901] ata1.00: exception Emask 0x10 SAct 0x800000 SErr 0x400000 action 0x6 frozen
    [12345.679012] ata1.00: failed command: READ FPDMA QUEUED
    [12345.679123] ata1.00: status: { DRDY ERR }
    [12345.679234] ata1.00: error: { UNC }

    The fix is direct: replace the disk. If you're on a RAID array, initiate a replacement before the drive fully fails. For a standalone drive, evacuate data immediately and swap the hardware. You can sometimes reduce iowait temporarily by switching the IO scheduler to mq-deadline or bfq, but that's just buying time — the hardware needs to go.
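    The scheduler check and switch are both sysfs operations. A sketch of the parsing step, run against a captured sysfs line since the live path requires root and real block devices:

```shell
# The bracketed name in /sys/block/<dev>/queue/scheduler is the active
# scheduler. Parsed here from a captured line; on a live host:
#   cat /sys/block/sda/queue/scheduler
line='[none] mq-deadline kyber bfq'
active=$(printf '%s\n' "$line" | sed -n 's/.*\[\(.*\)\].*/\1/p')
echo "active scheduler: $active"
# Temporary switch (root required; does not survive reboot):
#   echo mq-deadline > /sys/block/sda/queue/scheduler
```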

    Root Cause 2: RAID Array in Degraded State

    A degraded software RAID array — particularly RAID-5 or RAID-6 managed with mdadm — will hammer iowait in ways that aren't immediately obvious. When a drive drops out of a RAID-5 array, every read operation now requires parity reconstruction from the remaining drives. What was a single-disk sequential read becomes a multi-disk parity calculation. If you're also running a background array rebuild at the same time, it compounds: normal application IO competes directly with resync IO for the same physical devices.

    Check RAID status immediately:

    cat /proc/mdstat

    A healthy array looks like this:

    md0 : active raid5 sdb[1] sdc[2] sdd[3]
          2929398784 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

    A degraded array shows something like:

    md0 : active raid5 sdb[1] sdc[2] sdd[4](F)
          2929398784 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

    That [UU_] tells you one drive position is empty, and (F) on sdd confirms it was marked failed. The system logs will corroborate this:

    kernel: md/raid:md0: Disk failure on sdd, disabling device.
    kernel: md/raid:md0: Operation continuing on 2 devices.

    To recover, formally remove the failed device and add a replacement drive:

    mdadm /dev/md0 --remove /dev/sdd
    mdadm /dev/md0 --add /dev/sde

    Watch the rebuild progress in real time:

    watch -n2 cat /proc/mdstat

    During the rebuild, iowait will remain elevated because resync is IO-intensive by nature. You can limit the resync speed to protect running services from IO starvation:

    echo 50000 > /proc/sys/dev/raid/speed_limit_max

    The default is often 200000 KB/s, which will saturate available disk bandwidth on a busy host. Throttling to 50000 during business hours and raising it back overnight is a sensible tradeoff. Once the rebuild completes and /proc/mdstat shows all drives as [UUU], iowait should return to its normal baseline.
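    Degraded-array detection is easy to automate off /proc/mdstat. A sketch against captured status lines like those above; point it at the real file in a cron job or monitoring check:

```shell
# A degraded md array shows an underscore inside its [UUU...] member
# field. Captured sample shown; read /proc/mdstat directly on a live host.
mdstat='md0 : active raid5 sdb[1] sdc[2] sdd[4](F)
      2929398784 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]'
degraded=$(printf '%s\n' "$mdstat" | grep -c '\[[U_]*_[U_]*\]')
echo "degraded array lines: $degraded"
```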

    Root Cause 3: NFS Mount Hanging

    NFS is deceptive. When it works, it works well. When the NFS server is slow, saturated, or unreachable, every process that touches a file on that mount enters uninterruptible sleep — the D state — and stays there until the server responds or the mount times out. On a system where many processes share NFS mounts for home directories, configuration files, or shared application data, you can have dozens of D-state processes and iowait pegged even though every local disk on the machine is perfectly healthy.

    I've watched this take down entire application stacks. A single NFS server at 192.168.10.50 hosting shared configuration directories became saturated during an overnight backup job that wasn't throttled. By morning, every application process that tried to read its config file blocked. Load average climbed into the hundreds, and the hosts became completely unresponsive — all traceable back to one overwhelmed NFS server.

    To identify NFS as the culprit, first look for D-state processes:

    ps aux | awk '$8 ~ /D/ {print}'

    Then check what kernel function those processes are sleeping in:

    cat /proc/$(pgrep stuck_process)/wchan

    If you see nfs4_wait_bit_killable or rpc_wait_bit_killable, that's your confirmation. Also inspect mount statistics directly:

    mountstats /mnt/nfs_share

    Elevated RTT values confirm a slow or unresponsive NFS server:

    NFS mount on /mnt/nfs_share from 192.168.10.50:/exports/data
      Stats since server mount:
      Transport protocol: TCP
      avg RTT (ms): 4832.5
      avg exe (ms): 4901.2

    Normal RTT on a local LAN should be under 5ms. A value of 4832ms means the NFS server is essentially not responding. Verify reachability from the client:

    ping 192.168.10.50
    showmount -e 192.168.10.50

    If the server is unreachable or not responding to RPC calls, the only immediate option on the client side is a forced lazy unmount:

    umount -f -l /mnt/nfs_share

    The -l flag performs a lazy unmount, detaching the filesystem from the namespace immediately while allowing existing open file handles to close naturally. New accesses stop blocking right away, though processes already stuck in D state may still have to wait for the server to respond or for the RPC layer to time out. For future NFS mounts, always configure soft with a sensible timeout so processes don't block indefinitely:

    192.168.10.50:/exports/data  /mnt/nfs_share  nfs  soft,timeo=30,retrans=3,_netdev  0 0

    soft mounts return an error to the calling process once the retries are exhausted rather than blocking forever (timeo is in tenths of a second, so timeo=30 with retrans=3 gives up after a few 3-second retry windows). Some applications don't handle IO errors gracefully, but a recoverable error is a vastly better failure mode than a frozen host.
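    Auditing a fleet for NFS entries that still default to hard behavior is a one-liner against /etc/fstab. A sketch using a captured fstab line (the export path is illustrative):

```shell
# Warn about NFS fstab entries that lack the soft option. Shown against a
# captured sample line; on a live host, feed /etc/fstab in the same way.
fstab='192.168.10.50:/exports/data  /mnt/nfs_share  nfs  defaults,_netdev  0 0'
warn=$(printf '%s\n' "$fstab" | awk '
  $3 ~ /^nfs/ && $4 !~ /(^|,)soft(,|$)/ { print "WARNING: " $2 " lacks soft" }')
echo "$warn"
```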

    Root Cause 4: Database Heavy Writes

    Databases are IO-hungry, and under certain conditions they can saturate disk bandwidth entirely. The most common scenarios are a bulk data import, a PostgreSQL VACUUM or ANALYZE operation consuming all write throughput, a MySQL index rebuild, or a checkpoint flush event that dumps a massive dirty buffer to disk in a sudden burst. These aren't hardware failures — they're workload-driven IO storms, and they're fixable through configuration.

    PostgreSQL's checkpoint behavior is the pattern I've seen catch teams off guard most often. When max_wal_size is too small relative to your write volume, PostgreSQL triggers checkpoints very frequently. Each checkpoint flushes a large amount of dirty shared buffers to disk in a concentrated burst. If you watch iowait over time and see it spike sharply every few minutes in a regular pattern, that rhythm almost always points to checkpointing.

    First, confirm which process is doing the writing with iotop:

    iotop -o -P

    The -o flag shows only processes with active IO. Output like this tells the story immediately:

      TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
     4821 be/4 postgres    0.00 B/s  147.23 M/s  0.00 %  98.72 %  postgres: checkpointer
     4822 be/4 postgres    0.00 B/s   23.41 M/s  0.00 %   1.10 %  postgres: walwriter

    The checkpointer writing 147 MB/s is saturating the disk. In postgresql.conf, tune checkpoint behavior to spread writes over a longer interval:

    checkpoint_completion_target = 0.9
    max_wal_size = 4GB
    checkpoint_warning = 30s

    Setting checkpoint_completion_target to 0.9 instructs PostgreSQL to spread dirty page writes over 90% of the checkpoint interval, smoothing out the burst considerably. Increasing max_wal_size reduces how frequently checkpoints are triggered in the first place. Reload the config with SELECT pg_reload_conf(); — no restart required.
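    The effect of raising max_wal_size can be estimated with back-of-envelope arithmetic. This sketch assumes a hypothetical WAL generation rate of 50 MB/s; max_wal_size is a soft limit and checkpoint_timeout also fires checkpoints, so treat the result as a rough bound:

```shell
# Rough time between WAL-triggered checkpoints: max_wal_size / WAL rate.
wal_rate_mb=50                          # assumed MB/s of WAL generated
interval_1g=$(( 1024 / wal_rate_mb ))   # max_wal_size = 1GB, in MB
interval_4g=$(( 4096 / wal_rate_mb ))   # max_wal_size = 4GB, in MB
echo "1GB max_wal_size: checkpoint roughly every ${interval_1g}s"
echo "4GB max_wal_size: checkpoint roughly every ${interval_4g}s"
```

Checkpoints every 20 seconds under sustained writes is a storm; stretching the interval fourfold gives checkpoint_completion_target that much more runway to smooth each flush.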

    For MySQL InnoDB, tune the buffer pool flush settings:

    innodb_io_capacity = 400
    innodb_io_capacity_max = 2000
    innodb_flush_method = O_DIRECT
    innodb_flush_neighbors = 0

    innodb_flush_method = O_DIRECT bypasses the OS page cache for InnoDB data files, eliminating double-buffering and reducing effective IO overhead. On SSDs, set innodb_flush_neighbors = 0 — there's no benefit to flushing adjacent pages on flash storage the way there is on spinning disks where spatial locality matters.

    For a one-time bulk import that's causing iowait, throttle it with ionice so other services stay responsive:

    ionice -c 3 -p $(pidof import_process)

    IO class 3 (idle) means the import only consumes disk bandwidth when no other process needs it — a clean way to run a heavy job in the background without impacting production workloads. How strictly idle-class scheduling is enforced depends on the active IO scheduler (BFQ honors it fully; support under other schedulers varies), so verify with iotop that the throttle is actually taking effect.

    Root Cause 5: Swap Thrashing

    Swap thrashing feels different from the other causes because it's fundamentally a memory problem that manifests as a disk problem. When a system exhausts physical RAM, the kernel starts paging memory out to the swap device. Swap reads and writes are orders of magnitude slower than RAM access. If the system is actively thrashing — evicting pages to swap, then immediately needing them back in — you'll see severe iowait coming from what looks like an ordinary disk but is actually your swap partition consuming all available IO bandwidth.

    Thrashing happens when the active working set of running processes exceeds available physical RAM. The kernel evicts pages to make room, those evicted pages are needed again almost immediately, they're swapped back in, other pages are evicted in their place, and the cycle repeats endlessly. The system burns all its time doing IO instead of useful work, and CPU stays idle waiting for page-ins to complete.

    Diagnose with vmstat:

    vmstat 1 10

    Watch the si (swap in) and so (swap out) columns carefully:

    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     2 14 2097152  12288   1024  98304 4821 3201  5432  4120  892 1241  3  2  1 94  0

    si=4821 and so=3201 KB/s means the system is swapping in at roughly 5 MB/s while simultaneously swapping out at about 3 MB/s. Combined with 94% iowait, this is a system in full thrash. Confirm with free:

    free -h
                  total        used        free      shared  buff/cache   available
    Mem:           7.8G        7.6G        12M       324M       178M          0B
    Swap:          2.0G        2.0G          0B

    Zero available memory, swap completely exhausted. This system is in serious trouble. The immediate relief is identifying and addressing the memory hog:

    ps aux --sort=-%mem | head -20

    For more precise RSS accounting across the process tree:

    smem -r -s rss | head -20

    Kill or restart the offending process. For longer-term relief, tune vm.swappiness to make the kernel less aggressive about preemptive swapping:

    sysctl -w vm.swappiness=10

    The default value of 60 permits aggressive preemptive page eviction. Dropping to 10 tells the kernel to strongly prefer keeping pages in RAM. Persist the change so it survives reboots:

    echo "vm.swappiness=10" >> /etc/sysctl.d/99-vm-tuning.conf
    sysctl -p /etc/sysctl.d/99-vm-tuning.conf

    This tuning won't save you if you're genuinely out of RAM — it just reduces unnecessary preemptive swapping on systems that are pressured but not fully exhausted. The real fix is adding RAM or reducing the memory footprint of the services running on the host. Swap is a safety net, not a long-term performance strategy.
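    Watching for thrash risk can be scripted off /proc/meminfo; a minimal sketch where the 10% threshold is a judgment call rather than a kernel constant:

```shell
# Compare MemAvailable (the kernel's estimate of memory usable without
# swapping) against MemTotal and warn when it drops below 10%.
avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
pct=$(( 100 * avail_kb / total_kb ))
echo "available: ${pct}% of RAM"
if [ "$pct" -lt 10 ]; then
  echo "WARNING: swap-thrash risk"
fi
```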

    Root Cause 6: Runaway Log Writes and Bulk Filesystem Operations

    A less obvious iowait source is automated filesystem operations that nobody explicitly planned for. Log rotation with logrotate, unthrottled rsync backup jobs, find operations scanning large directory trees, or updatedb running via cron can all generate IO bursts that catch you off guard at 2 AM.

    These are usually identifiable by correlating iowait spikes with your cron schedule. Check journalctl around the time of the spike and cross-reference with iotop -o -P output the next time a spike occurs. For rsync jobs that run on a schedule, add bandwidth limiting to prevent them from saturating the link:

    rsync --bwlimit=50000 -av /source/ infrarunbook-admin@192.168.10.20:/destination/

    For logrotate, if you're compressing large log files synchronously, add delaycompress to the relevant stanza. This defers compression until the next rotation cycle, spreading the IO cost over time rather than concentrating it at the moment of rotation when the log files are largest.
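    A stanza with delaycompress might look like the following; the log path and retention counts are illustrative, not taken from any host in this article:

```
/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
}
```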


    Prevention

    Preventing high iowait is largely about monitoring before things degrade and designing systems so they fail gracefully rather than catastrophically. Start by setting up meaningful alerts. In Prometheus, alerting when rate(node_cpu_seconds_total{mode="iowait"}[2m]) exceeds 0.15 for more than two minutes catches problems early. Catching iowait at 15% is dramatically easier to diagnose and fix than waiting until it's at 80% with users screaming.
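    Expressed as a Prometheus alerting rule it might look like this; the rule and label names are illustrative, and the per-CPU rate is averaged per instance so multi-core hosts aren't over- or under-counted:

```yaml
groups:
  - name: io-alerts
    rules:
      - alert: HighIOWait
        # Average the per-CPU iowait fraction across each instance.
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[2m])) > 0.15
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "iowait above 15% on {{ $labels.instance }}"
```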

    Enable SMART monitoring with smartd and configure it to send alerts on predictive failure attributes. Most drive failures announce themselves through SMART data days or weeks before the drive actually dies. Pair that with mdadm --monitor --scan --daemonize for software RAID arrays, which will notify you the moment a drive is marked failed rather than letting the degraded state persist unnoticed.

    For NFS infrastructure, always include _netdev in mount options so that NFS mounts don't block the boot sequence, and consider soft mounts with appropriate timeout values for any share that isn't absolutely critical for write consistency. Have a documented runbook procedure for forcibly unmounting stuck NFS mounts — when it happens at 3 AM, the on-call engineer shouldn't be improvising.

    Review database checkpoint and flush configurations before you're under production load. Use pg_stat_bgwriter to monitor PostgreSQL checkpoint frequency and buffer write patterns over time. Set checkpoint_warning low enough that you receive log entries when checkpoints happen more frequently than expected — treat those log lines as early warnings, not noise.

    Track memory utilization trends, not just instantaneous values. If available memory is consistently below 10% of total RAM during normal business hours on sw-infrarunbook-01, you're one traffic spike away from swap thrashing. Either add RAM, move services to another host, or reduce the memory footprint of the applications running there.

    Finally, establish baseline IO performance numbers for every production host before it goes live. Run fio on freshly provisioned systems:

    fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=60 --group_reporting
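    To make the baseline easy to store and diff later, the headline figure can be scraped from the fio summary. A sketch against a captured summary line with illustrative numbers:

```shell
# Extract the IOPS figure from a fio summary line so it can be logged
# next to the host's provisioning record. Sample line shown; pipe real
# fio output through the same sed on a live run.
fio_line='  read: IOPS=91.2k, BW=356MiB/s (374MB/s)(20.9GiB/60001msec)'
iops=$(printf '%s\n' "$fio_line" | sed -n 's/.*IOPS=\([^,]*\),.*/\1/p')
echo "baseline random-read IOPS: $iops"
```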

    Store those baseline numbers somewhere permanent. When something feels slow six months later, you'll have real data to compare against instead of relying on gut feel. The combination of proactive monitoring, hardware health checks, properly tuned workloads, and documented response procedures will keep iowait out of the danger zone for the vast majority of systems you manage.

    Frequently Asked Questions

    What is a normal iowait percentage on Linux?

    Generally, iowait below 5% is considered healthy. Values between 5–15% warrant investigation, especially if they're sustained. Anything above 15% consistently indicates an IO bottleneck that will impact application performance. The threshold varies somewhat depending on your workload — a database server doing heavy sequential writes may tolerate slightly more iowait than a latency-sensitive application server.

    How do I quickly find which process is causing high iowait?

    Run 'iotop -o -P' to see only processes with active IO at that moment, sorted by IO usage. The '-o' flag filters out idle processes so you see the culprits immediately. If iotop isn't available, 'ps aux | awk '$8 ~ /D/ {print}'' shows processes in uninterruptible sleep (D state), which are blocked waiting on IO.

    Can high iowait cause high load average even with low CPU usage?

    Yes — this is one of the most common sources of confusion. Linux load average counts both runnable processes and processes in uninterruptible sleep (D state). When many processes are blocked waiting on IO, load average climbs dramatically even though CPU usage remains low. A load average of 20 with 5% CPU usage and 80% iowait is a textbook IO bottleneck, not a CPU problem.

    How do I reduce iowait during a RAID rebuild without stopping the rebuild?

    Throttle the resync speed by writing a lower value to /proc/sys/dev/raid/speed_limit_max. For example, 'echo 50000 > /proc/sys/dev/raid/speed_limit_max' caps rebuild throughput at roughly 50 MB/s, leaving bandwidth available for normal workloads. You can raise it back to the default (often 200000) during off-hours when you want the rebuild to complete faster.

    Is vm.swappiness=0 a good idea to prevent swap thrashing?

    Not usually. Setting vm.swappiness=0 tells the kernel to avoid swap as much as possible, but on older kernels it can cause the OOM killer to trigger aggressively instead of using available swap space as a buffer. A value of 10 is a better tradeoff for most production systems — it strongly prefers RAM while still allowing swap to absorb brief memory spikes without immediately killing processes.
