Symptoms
You log into a router and something feels wrong immediately. Commands take three seconds to echo back. The console is sluggish. You run
show processes cpu and see 94% for the past five minutes. Routing protocol neighbors start flapping. Your phone rings. This is high CPU on a Cisco router, and it's one of the more stressful situations in network operations because the box itself is struggling to help you investigate what's wrong with it.
Common symptoms you'll observe before and during the incident:
- CLI response time severely degraded — commands taking 2–10 seconds to execute
- SSH sessions dropping or refusing new connections entirely
- Routing protocol neighbors going down: OSPF, EIGRP, or BGP sessions resetting
- Syslog flooding with %SYS-3-CPUHOG or %SCHED-3-STARVATION messages
- SNMP traps firing for CPU threshold violations
- Increased latency and drops on transit traffic despite no interface errors
- NTP clock drift, keepalives missed, HSRP state transitions
Your first move is always to snapshot CPU utilization before you start changing anything. Get the data while the symptom is active.
sw-infrarunbook-01# show processes cpu sorted
CPU utilization for five seconds: 94%/72%; one minute: 91%; five minutes: 88%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
169 148721564 4832910 30775 41.59% 38.20% 37.10% 0 IP Input
62 52384721 9823641 5332 18.22% 15.40% 14.89% 0 OSPF Hello
45 12938471 2341987 5525 8.44% 7.22% 6.98% 0 CEF process
1 8241983 1023456 8054 3.20% 2.80% 2.75% 0 Chunk Manager
The two numbers after "five seconds" are total CPU and interrupt CPU respectively. In the example above, 72% of that 94% is interrupt-driven. That distinction is critical — it tells you whether the problem lives in the software process scheduler or deeper down at the hardware interface layer. Getting that wrong sends you troubleshooting the wrong thing entirely.
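If you collect this output programmatically, the total/interrupt split is easy to automate. A minimal Python sketch; the function and field names here are our own, not anything Cisco defines:

```python
import re

# Parse the first line of `show processes cpu` output and separate
# interrupt-driven CPU from scheduled-process CPU.
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)

def classify_cpu(line):
    m = CPU_LINE.search(line)
    if not m:
        return None
    total, interrupt, one_min, five_min = (int(g) for g in m.groups())
    process = total - interrupt  # time left for scheduled IOS processes
    return {
        "total": total, "interrupt": interrupt, "process": process,
        "one_min": one_min, "five_min": five_min,
        "dominant": "interrupt" if interrupt > process else "process",
    }

sample = "CPU utilization for five seconds: 94%/72%; one minute: 91%; five minutes: 88%"
print(classify_cpu(sample))
```

Feeding this line into a collector on every poll gives you the process-vs-interrupt answer before you even log in.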
Root Cause 1: A Single Process Consuming Excessive CPU
This is the most common scenario you'll encounter. A specific IOS process — IP Input, BGP I/O, OSPF Hello, or sometimes a management daemon like SNMP ENGINE — climbs to 40–60% utilization and stays there. In my experience, IP Input is the most frequent offender, and when it shows up at the top of the list, it usually means the router is process-switching traffic that should be CEF-switched, or it's receiving a flood of packets addressed directly to the router itself.
Why it happens: IOS runs a cooperative multitasking scheduler. Each process gets CPU time in turns. If one process has a massive backlog — BGP is processing thousands of updates, IP Input is handling a scan or amplification attack aimed at the router's own IP, or SNMP ENGINE is fielding a full routing table MIB walk every 30 seconds — it hogs the scheduler and starves everything else. The router's own routing and keepalive processes get delayed. Neighbors drop. Traffic forwarding degrades.
How to identify it: Start with
show processes cpu sorted to find the top offender, then use the history view to understand the timeline and duration.
sw-infrarunbook-01# show processes cpu history
888886666655555444443333322222111110000
321098765432109876543210987654321098765
100
90 **
80 ****
70 ****** *
60 ******** ***
50 ********* *****
40 ********** *******
CPU% per second (last 60 seconds)
For IP Input specifically, check how much traffic is addressed to the router itself rather than transiting through it:
sw-infrarunbook-01# show ip traffic
IP statistics:
Rcvd: 15234982 total, 8231456 local destination,
0 format errors, 0 checksum errors, 0 bad hop count
0 unknown protocol, 12 not a gateway
Frags: 0 reassembled, 0 timeouts, 0 couldn't reassemble
Bcast: 231 received, 0 sent
Mcast: 0 received, 0 sent
Sent: 6234981 generated, 0 forwarded
Drop: 1823 encapsulation failed, 0 unresolved, 0 no adjacency
When "local destination" is a large percentage of total received, the router is spending CPU handling traffic to its own address — management traffic, scans, DDoS, or misconfigured hosts. That all goes through IP Input as process-switched traffic.
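A quick back-of-envelope check on the counters above makes "large percentage" concrete. These numbers are the ones from this article's sample output:

```python
# Fraction of received packets that terminated on the router itself,
# taken from the `show ip traffic` counters above.
rcvd_total = 15_234_982
local_destination = 8_231_456

local_share = local_destination / rcvd_total
print(f"{local_share:.0%} of received packets were punted to the router itself")
```

More than half the received traffic is local-destination here. On a transit router, anything beyond a few percent deserves investigation.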
How to fix it: For traffic flooding the router's own control plane, implement Control Plane Policing. CoPP rate-limits traffic destined to the router itself without affecting transit forwarding.
sw-infrarunbook-01(config)# ip access-list extended COPP-ICMP
sw-infrarunbook-01(config-ext-nacl)# permit icmp any any
sw-infrarunbook-01(config-ext-nacl)# exit
sw-infrarunbook-01(config)# class-map match-all COPP-ICMP-CLASS
sw-infrarunbook-01(config-cmap)# match access-group name COPP-ICMP
sw-infrarunbook-01(config-cmap)# exit
sw-infrarunbook-01(config)# policy-map COPP-POLICY
sw-infrarunbook-01(config-pmap)# class COPP-ICMP-CLASS
sw-infrarunbook-01(config-pmap-c)# police rate 1000 pps
sw-infrarunbook-01(config-pmap-c)# exit
sw-infrarunbook-01(config)# control-plane
sw-infrarunbook-01(config-cp)# service-policy input COPP-POLICY
Extend this pattern to cover SNMP, SSH, BGP, OSPF, and NTP traffic with appropriate rate limits for each. One class per protocol so that an anomaly in one doesn't choke the others.
Root Cause 2: Interrupt-Level CPU High
This one catches engineers off-guard because most troubleshooting instincts focus on software processes. Interrupt CPU is fundamentally different — it represents time the CPU spends handling hardware interrupts, primarily packet receive and transmit operations at the interface driver level. When interrupt CPU is the dominant number, the problem isn't in the IOS scheduler — it's at the hardware boundary.
Why it happens: Every packet arriving on an interface generates a hardware interrupt. The CPU must service that interrupt to move the packet from the interface buffer into the software queue. If traffic volume is high enough, or if small packets arrive at extremely high packet-per-second rates — which generate far more interrupts than large packets at equivalent bandwidth — interrupt CPU climbs. A 1 Gbps stream of 64-byte packets arrives at roughly 1.5 million packets per second, and in the worst case each one raises an interrupt. Software-based routers without ASIC-based forwarding can't keep up.
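The 1.5 million figure is easy to verify. On the wire, every minimum-size Ethernet frame also carries a preamble, a start-of-frame delimiter, and an inter-frame gap:

```python
# Sanity check: minimum-size Ethernet frames at 1 Gbps line rate.
LINE_RATE_BPS = 1_000_000_000
FRAME_BYTES = 64
OVERHEAD_BYTES = 7 + 1 + 12  # preamble + SFD + inter-frame gap

pps = LINE_RATE_BPS / ((FRAME_BYTES + OVERHEAD_BYTES) * 8)
print(f"{pps:,.0f} packets per second")  # ~1.49 million
```

Compare that against your platform's rated packet-per-second forwarding capacity, not its bandwidth rating; small-packet floods exhaust the former long before the latter.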
How to identify it: Look at the second number in the five-seconds CPU field. When it's close to the total, interrupts are the problem, not processes.
sw-infrarunbook-01# show processes cpu sorted
CPU utilization for five seconds: 89%/81%; one minute: 85%; five minutes: 82%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
62 1238471 923456 1341 4.20% 3.80% 3.75% 0 OSPF Hello
45 1093847 812345 1347 2.44% 2.22% 2.18% 0 ARP Input
1 823456 654321 1258 1.10% 0.98% 0.95% 0 Chunk Manager
89% total, 81% interrupt — the process list shows almost nothing, yet the router is grinding. That mismatch is the tell. Confirm with interface statistics:
sw-infrarunbook-01# show interfaces GigabitEthernet0/0/0
GigabitEthernet0/0/0 is up, line protocol is up
Hardware is ISR4400-4x1GE, address is 0050.56ab.1234
Internet address is 10.10.10.1/24
MTU 1500 bytes, BW 1000000 Kbit/sec
reliability 255/255, txload 248/255, rxload 251/255
Full-duplex, 1000Mb/s
input rate 987,654,321 bits/sec, 1,823,456 packets/sec
output rate 823,456,789 bits/sec, 1,456,789 packets/sec
Input queue: 245/75/1823/0 (size/max/drops/flushes)Nearly 1.8 million packets per second on input with input queue drops — that's your confirmation. The drops tell you the CPU can't drain the receive queue fast enough. Also check for interface errors that might be generating spurious interrupts from retransmits or CRC conditions:
sw-infrarunbook-01# show interfaces GigabitEthernet0/0/0 counters errors
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
Gi0/0/0 0 0 0 0 0 0
How to fix it: Short-term, identify the traffic source via show ip cache flow or NetFlow and null-route or upstream-filter the offending source. Long-term, if legitimate traffic is the cause, this router has hit a platform limitation — you need hardware with distributed forwarding via NPUs or ASICs (ASR, Catalyst 8000 series) rather than software interrupt handling. Verify CEF is fully enabled (covered next) as a first step, since process switching amplifies interrupt load.
Root Cause 3: CEF Disabled — Traffic Falling to Process Switching
This one has caused me genuine production pain more than once, and it's insidious because the symptoms look identical to a traffic flood. CEF is IOS's high-performance forwarding engine. When it's healthy, transit packets are switched using a pre-built FIB and adjacency table — no process involvement, minimal CPU. Disable it or break it, and every single transit packet goes through the IP Input process. On a busy router carrying a few hundred thousand packets per second, that means immediate CPU saturation.
Why it happens: CEF can be disabled manually with
no ip cef — sometimes left that way after troubleshooting. It also falls back to process switching for specific traffic types: unsupported GRE configurations, certain encryption modes, IP accounting enabled on an interface, adjacency failures that prevent the FIB from resolving next-hops, or interfaces with features that IOS can't accelerate. I've also seen it happen after a config restore from backup where the original config had CEF disabled for a debugging session that nobody cleaned up.
How to identify it:
sw-infrarunbook-01# show ip cef summary
IPv4 CEF is disabled
sw-infrarunbook-01# show ip interface GigabitEthernet0/0/0 | include switching
IP fast switching is disabled
IP CEF switching is disabled
IP route-cache flags are No CEF
That "No CEF" flag is the confirmation. But CEF can also be globally enabled while still failing per-prefix due to incomplete adjacencies. Check for those separately:
sw-infrarunbook-01# show ip cef summary
IPv4 CEF is enabled for distributed CEF
VRF Default
56234 prefixes (54891/1343 fwd/non-fwd)
Table id 0x0
sw-infrarunbook-01# show adjacency GigabitEthernet0/0/0 detail | include incomplete
10.10.10.254 incomplete
10.10.10.253 incomplete
Incomplete adjacencies force the router to ARP for every single packet to those next-hops. That's per-packet ARP resolution, which is pure process-switching overhead even when CEF is nominally enabled. If your gateway IPs show as incomplete, you're process-switching everything that routes through them.
How to fix it: Re-enable CEF globally and clear the incomplete adjacencies:
sw-infrarunbook-01(config)# ip cef
sw-infrarunbook-01(config)# ip cef distributed
sw-infrarunbook-01(config)# interface GigabitEthernet0/0/0
sw-infrarunbook-01(config-if)# ip route-cache cef
! Clear stuck incomplete adjacencies
sw-infrarunbook-01# clear adjacency GigabitEthernet0/0/0
! Verify CEF is now active and adjacencies resolved
sw-infrarunbook-01# show ip cef summary
IPv4 CEF is enabled for distributed CEF
VRF Default
56234 prefixes (56234/0 fwd/non-fwd)
sw-infrarunbook-01# show adjacency GigabitEthernet0/0/0 summary
Protocol Interface Address
IP GigabitEthernet0/0/0 10.10.10.254(9)
After re-enabling, watch IP Input in show processes cpu sorted. If CEF was the cause, that process should drop from 40–60% to under 5% within 30–60 seconds as forwarding shifts back to the hardware-assisted FIB path.
Root Cause 4: Routing Protocol Instability
Routing protocol flapping is a CPU force multiplier. Every OSPF neighbor drop triggers an SPF recalculation. Every BGP session reset triggers withdrawal and re-advertisement of potentially hundreds of thousands of prefixes. Each convergence event burns CPU on route computation, RIB updates, and FIB rebuilds. In a network with many prefixes or a persistently flapping link, this can sustain elevated CPU for minutes — or indefinitely if you don't fix the underlying cause.
Why it happens: OSPF neighbors drop due to missed hellos from CPU overload (yes, high CPU causes flapping which causes more CPU — a feedback loop), interface errors, MTU mismatches, authentication failures, or dead timer misconfiguration. BGP sessions reset due to hold timer expiry when the router is too busy to send keepalives, TCP resets, or peer misconfigurations. Each protocol reconvergence is CPU-intensive, and in large topologies the cost compounds.
How to identify it: Check OSPF SPF execution frequency — a healthy network should show very few SPF runs:
sw-infrarunbook-01# show ip ospf statistics
OSPF Router with ID (10.10.10.1) (Process ID 1)
Area 0:
SPF algorithm executed 847 times
Last executed 00:00:03 ago
Last full SPF: 00:00:03 ago
SPF Throttling:
Initial SPF schedule delay 50 msecs
Minimum hold time between two consecutive SPFs 200 msecs
Maximum wait time between two consecutive SPFs 5000 msecs
847 SPF runs is a serious problem — that means 847 topology change events. Check current neighbor states and look for anything not in FULL:
sw-infrarunbook-01# show ip ospf neighbor
Neighbor ID Pri State Dead Time Address Interface
10.20.20.1 1 FULL/DR 00:00:35 10.10.10.2 Gi0/0/0
10.30.30.1 1 EXSTART/ - 00:00:34 10.10.10.3 Gi0/0/1
10.40.40.1 1 LOADING/ - 00:00:31 10.10.10.4 Gi0/0/2
EXSTART and LOADING neighbors are stuck in database exchange, which burns CPU continuously. For BGP, check session stability and reset counts:
sw-infrarunbook-01# show bgp summary
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.10.10.254 4 65001 823456 412398 189234 0 982 00:12:43 524288
10.10.10.253 4 65002 8234 4123 0 0 0 00:00:14 Idle (Admin)
sw-infrarunbook-01# show bgp neighbors 10.10.10.254 | include resets
Number of resets: 234
Number of resets due to: Peer closed the session: 198
An OutQ of 982 means BGP is backlogged — it can't send updates fast enough because the CPU is too busy to service the BGP I/O process. 234 session resets means that peer has been flapping. Each reset triggers a full table withdrawal and re-advertisement of 524,288 prefixes.
How to fix it: For OSPF, identify why neighbors are stuck. Check MTU consistency and authentication on both sides:
sw-infrarunbook-01# show ip ospf interface GigabitEthernet0/0/1
Internet Address 10.10.10.1/30, Area 0
Process ID 1, Router ID 10.10.10.1, Network Type POINT_TO_POINT
Timer intervals: Hello 10, Dead 40, Wait 40, Retransmit 5
Neighbor Count is 0, Adjacent neighbor count is 0
sw-infrarunbook-01# show interfaces GigabitEthernet0/0/1 | include MTU
MTU 1500 bytes
Zero neighbors on a point-to-point link with correct timers usually means an MTU mismatch. Tune SPF throttle timers to reduce the CPU cost of rapid reconvergence:
sw-infrarunbook-01(config)# router ospf 1
sw-infrarunbook-01(config-router)# timers throttle spf 50 200 5000
! For BGP: add dampening and BFD for fast, clean failure detection
sw-infrarunbook-01(config)# router bgp 65001
sw-infrarunbook-01(config-router)# bgp dampening 15 750 2000 60
sw-infrarunbook-01(config-router)# neighbor 10.10.10.254 fall-over bfdBFD lets BGP detect link failures in milliseconds rather than waiting for the hold timer to expire. That means fewer queued updates get built up before the session tears down, which means less reconvergence work when it does.
Root Cause 5: Memory Pressure Causing CPU Spikes
Memory exhaustion and high CPU are tightly coupled — the relationship isn't obvious until you've seen it a few times. When a router starts running critically low on free memory, IOS spends increasing CPU cycles on memory management: garbage collection, buffer coalescing, memory pool compaction, and handling allocation failures. The CPU gauge shows high utilization, but the root driver is actually a leak or genuine memory exhaustion. Treating the CPU without finding the memory issue means the problem comes back.
Why it happens: Memory gets consumed by a full BGP routing table, memory leaks in specific IOS features or versions, large ACLs expanded into TCAM, NetFlow caches that grow unbounded, crypto session state accumulating on VPN concentrators, or processes that allocate memory and never free it. Once free memory drops below a threshold, IOS starts aggressively compacting memory pools — a CPU-intensive operation that runs repeatedly and starves everything else.
How to identify it:
sw-infrarunbook-01# show memory summary
Head Total(b) Used(b) Free(b) Lowest(b) Largest(b)
Processor 7F1234560000 536870912 521234567 15636345 8234123 7812345
lsmpi_io 7F2345670000 67108864 66234567 874297 234567 412345
I/O 7F3456780000 134217728 87654321 46563407 12345678 23456789
15 MB of free processor memory on a 512 MB router is critical. The "Lowest" column is important — it shows the historical minimum since the last reload. At 8 MB, this router has been even lower than current. That's a platform on the edge of a crash. Now find what's holding the memory:
sw-infrarunbook-01# show processes memory sorted
Processor memory
PID TTY Allocated Freed Holding Getbufs Retbufs Process
62 0 523456789 478234567 45222222 0 0 BGP Router
186 0 234567890 189234567 45333323 0 0 BGP I/O
45 0 123456789 123450000 6789 0 0 OSPF Hello
0 0 89234567 43456789 45777778 0 0 *Dead*
sw-infrarunbook-01# show memory allocating-process totals | include Dead
45777778 *Dead*
BGP Router and BGP I/O together holding 90 MB and never releasing. The *Dead* entry with 45 MB held means a crashed process left allocated memory behind — a classic leak indicator. Check the IOS-XE version against Cisco's bug database for known BGP memory leaks on your platform.
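If your NMS already polls these pool values, a simple classifier catches this state automatically. A sketch in Python; the thresholds are illustrative choices, not Cisco recommendations:

```python
# Classify processor-memory health from `show memory summary` values (bytes).
# warn_pct / crit_pct are our own illustrative thresholds.
def memory_health(total, free, lowest, warn_pct=10, crit_pct=5):
    free_pct = 100 * free / total
    lowest_pct = 100 * lowest / total
    if free_pct < crit_pct or lowest_pct < crit_pct:
        return "critical"  # current or historical free memory under 5%
    if free_pct < warn_pct:
        return "warning"
    return "ok"

# Values from the sample router above: 512 MB total, ~15 MB free, ~8 MB lowest
print(memory_health(total=536_870_912, free=15_636_345, lowest=8_234_123))
```

Note the check on the lowest-watermark value too: a router that briefly dipped near zero is fragile even if current free memory looks acceptable.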
How to fix it: Short-term, reduce memory consumption by filtering the BGP table and compressing the NetFlow cache:
! Cap BGP prefix acceptance with a warning at 80%
sw-infrarunbook-01(config)# router bgp 65001
sw-infrarunbook-01(config-router)# neighbor 10.10.10.254 maximum-prefix 600000 80
! Check for runaway NetFlow cache
sw-infrarunbook-01# show ip cache flow | include entries
IP Flow Switching Cache, 4456448 bytes
1823456 active, 221568 inactive, 9234567 added
! Reduce cache size
sw-infrarunbook-01(config)# ip flow-cache entries 32768
! Emergency: clear specific caches for immediate relief
sw-infrarunbook-01# clear ip cache
sw-infrarunbook-01# clear ip flow stats
If you've confirmed a software leak and need to recover memory without a reload, clearing the BGP table causes it to be rebuilt from scratch, which sometimes reclaims leaked memory from the old state:
sw-infrarunbook-01# clear ip bgp 10.10.10.254
! This resets the BGP session — only do this in a maintenance window
! or if the alternative is a router reload
Long-term, upgrade to an IOS-XE release with the memory leak patched. Cisco's Software Checker will map your current version to known defects and recommend a fixed release for your platform.
Root Cause 6: SNMP Polling and Management Plane Overload
I've walked into more than one "mystery high CPU" situation where the culprit turned out to be an NMS polling the router every 30 seconds on every OID in the book. SNMP runs as a process in IOS and can consume substantial CPU when polled at high frequency, especially when walking large MIB tables.
Why it happens: Walking ipRouteTable or CISCO-BGP4-MIB on a router carrying a full Internet routing table forces IOS to serialize its internal routing data structures into SNMP response PDUs. It's expensive. Multiply by 10 monitoring systems all polling at 60-second intervals and you have sustained CPU load from SNMP ENGINE alone. Misconfigured trap destinations or SNMP community strings accessible to scanners make this worse.
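The arithmetic is worth doing explicitly. Using this article's example router (roughly 56,000 prefixes) and the polling scenario above; all numbers are illustrative:

```python
# Rough aggregate SNMP load: a get-next walk of a route table costs
# approximately one PDU per table entry, per poller, per interval.
route_entries = 56_234   # prefixes, from the `show ip cef summary` sample
nms_count = 10           # monitoring systems polling independently
poll_interval_s = 60

pdus_per_second = route_entries * nms_count / poll_interval_s
print(f"~{pdus_per_second:,.0f} get-next PDUs/sec hitting SNMP ENGINE")
```

Nearly ten thousand PDUs per second, every second, forever. That is sustained load, not a spike, which is exactly the signature SNMP ENGINE shows in the process list.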
How to identify it:
sw-infrarunbook-01# show processes cpu sorted | include SNMP
145 48234521 4823456 9997 22.34% 19.23% 18.45% 0 SNMP ENGINE
sw-infrarunbook-01# show snmp
Chassis: FOX1234ABCD
...
187234 SNMP packets input
0 Bad SNMP version errors
0 Unknown community name
187234 Number of requested variables
187234 Get-next PDUs
0 Set-request PDUs
187,234 get-next PDUs confirms an active MIB walk. SNMP ENGINE at 22% CPU is the process-level confirmation. Fix it by restricting MIB access and enforcing SNMPv3:
sw-infrarunbook-01(config)# snmp-server view INFRA-VIEW internet included
sw-infrarunbook-01(config)# snmp-server view INFRA-VIEW ipRouteDest excluded
sw-infrarunbook-01(config)# snmp-server view INFRA-VIEW ipCidrRouteDest excluded
sw-infrarunbook-01(config)# snmp-server view INFRA-VIEW bgpPathAttrTable excluded
sw-infrarunbook-01(config)# snmp-server group INFRA-OPS v3 priv read INFRA-VIEW
sw-infrarunbook-01(config)# snmp-server user infrarunbook-admin INFRA-OPS v3 auth sha AuthP@ssXR9 priv aes 128 PrivP@ssXR9
! Restrict SNMP to known NMS hosts only
sw-infrarunbook-01(config)# ip access-list standard SNMP-HOSTS
sw-infrarunbook-01(config-std-nacl)# permit 10.20.30.10
sw-infrarunbook-01(config-std-nacl)# permit 10.20.30.11
sw-infrarunbook-01(config-std-nacl)# deny any log
sw-infrarunbook-01(config-std-nacl)# exit
sw-infrarunbook-01(config)# snmp-server community READONLY ro SNMP-HOSTS
Prevention
Preventing high CPU incidents comes down to three fundamentals: baseline before you need it, protect the control plane before something attacks it, and keep the router doing what it was designed to do.
Start by establishing a CPU baseline on every router during normal operations. Run
show processes cpu history and document what normal looks like. A router idling at 15% process CPU has headroom. One idling at 40% doesn't — the next BGP reconvergence or traffic spike will push it over the edge. Know your floor before the floor moves.
Deploy CoPP on every router. It's not optional in production networks. Without it, a single misconfigured host sending ICMP floods to your router's management IP can saturate the control plane and take down routing protocol sessions for legitimate traffic. Build separate CoPP classes for ICMP, SNMP, SSH, BGP, OSPF, and NTP so that an anomaly in one protocol doesn't kill the others.
Keep CEF healthy and monitored. Build a simple EEM applet to alert on syslog messages containing "CEF disabled" so accidental disablement gets caught in minutes rather than discovered during an incident. Periodically audit
show ip cef summary across your fleet — especially after config changes, software upgrades, or restores from backup.
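EEM applet syntax varies by platform, so as an illustration, here is the same idea implemented on the syslog collector side instead. The sample message text matched below is an assumption; verify the exact string your platform logs when CEF is disabled:

```python
import re

# Collector-side sketch: flag incoming syslog lines that indicate CEF
# has been turned off. Pattern and sample messages are illustrative.
CEF_DISABLED = re.compile(r"CEF.*disabled", re.IGNORECASE)

def cef_alerts(syslog_lines):
    return [line for line in syslog_lines if CEF_DISABLED.search(line)]

lines = [
    "%SYS-5-CONFIG_I: Configured from console by admin",
    "%FIB-4-FIBDISABLE: Fatal error, slot 0: IPv4 CEF disabled",
]
print(cef_alerts(lines))
```

Wire the matched lines into whatever paging path your NMS already uses; the point is that CEF disablement becomes a page in minutes rather than a discovery mid-incident.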
Tune routing protocol timers for your topology scale. Default SPF timers and BGP hold timers were designed for small networks. In large-scale deployments, SPF throttle timers on OSPF and BGP dampening for unstable prefixes significantly reduce CPU cost during reconvergence. BFD for BGP fast failover means the session tears down before a backlog of keepalives and updates builds up.
Monitor memory trends with your NMS, not just current values. A router losing 1 MB of free processor memory per day has roughly two weeks before it becomes a problem. Trending
ciscoMemoryPoolFree over time will surface slow leaks months before they cause an outage. Alert on a threshold, not just a floor.
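The runway arithmetic from two NMS samples can be a sketch like this; the function name, the 5 MB floor, and the sample values are all illustrative:

```python
# Estimate days until free memory hits a floor, given two trend samples.
def days_until_exhaustion(free_mb_now, free_mb_then, days_between, floor_mb=5):
    leak_per_day = (free_mb_then - free_mb_now) / days_between
    if leak_per_day <= 0:
        return None  # free memory stable or growing: not leaking
    return (free_mb_now - floor_mb) / leak_per_day

# 22 MB free a week ago, 15 MB free now: ~1 MB/day of leakage
print(days_until_exhaustion(free_mb_now=15, free_mb_then=22, days_between=7))
```

Running this against trended data turns "the router will crash eventually" into "schedule the maintenance window within ten days," which is a far easier conversation with change management.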
Keep IOS-XE patched against known bugs. A significant number of high-CPU incidents in production environments are caused by documented software defects that already have fixes available. Use Cisco's Software Checker to identify recommended releases for your hardware platform and schedule maintenance upgrades before bugs become outages. The best time to patch is before the router pages you at 2 AM.
