Symptoms
High CPU on an Arista switch rarely announces itself politely. The first sign is usually SSH becoming sluggish — keystrokes lag, tab completion stalls for two or three seconds, and sometimes you'll get a connection timeout before you even reach the prompt. Control-plane traffic starts dropping. BGP neighbors go down, OSPF adjacencies reset, and suddenly you're getting paged about a network event that started with a single overloaded process.
The canonical first command is show processes top. Run it and watch for anything consuming more than 20-30% CPU consistently. On a healthy switch under normal load, no single process should be pinning the CPU. Here's what a troubled switch looks like:
sw-infrarunbook-01# show processes top once
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1842 root 20 0 512424 89212 14320 R 87.3 2.1 14:32.41 Bgp
421 root 20 0 198432 22104 8812 S 12.1 0.5 2:14.22 Syslog
654 root 20 0 88432 11204 4812 S 4.2 0.3 0:44.12 Ospf
1 root 20 0 41320 5012 3412 S 0.1 0.1 0:02.14 init
That BGP process at 87% is a serious problem. In my experience, once a process sustains above 50%, the switch is already struggling to keep up with control-plane work, and it won't recover on its own.
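If you collect this output off-box, the same check is easy to script. A minimal sketch, assuming you've captured the text of show processes top once and that the column layout matches the sample above:

```python
# Sketch: parse captured "show processes top once" text and flag any
# process above a CPU threshold. Column layout is assumed to match the
# sample above (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND).
def hot_processes(top_output: str, threshold: float = 50.0):
    hot = []
    for line in top_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that doesn't parse as a process line
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        cpu = float(fields[8])             # %CPU column
        if cpu >= threshold:
            hot.append((fields[11], cpu))  # (COMMAND, %CPU)
    return sorted(hot, key=lambda p: p[1], reverse=True)

sample = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1842 root 20 0 512424 89212 14320 R 87.3 2.1 14:32.41 Bgp
421 root 20 0 198432 22104 8812 S 12.1 0.5 2:14.22 Syslog
"""
print(hot_processes(sample))  # [('Bgp', 87.3)]
```

The 50% default mirrors the rule of thumb above; lower it if you want earlier warning.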
Other symptoms you'll commonly see alongside high CPU include syslog flooding — thousands of identical messages per second in show logging — incrementing drops on management or CPU-facing ports in show interfaces, BGP or OSPF adjacency flaps showing up in show logging last 100, and occasionally ZeroTouch Provisioning still running on a switch that should have long since been provisioned.
Let's go through the root causes one by one, starting with the ones that cause the most damage.
Root Cause 1: Routing Protocol Flapping
OSPF and BGP both generate significant CPU work when neighbors are unstable. Every time a neighbor drops and comes back up, the switch runs SPF calculations, updates the RIB, and reprograms the FIB. If this is happening multiple times per minute, it becomes a CPU death spiral. OSPF is particularly brutal here because SPF runs are synchronous and expensive — the process can't do anything else while it's running the shortest-path tree.
I've seen this triggered most often by a marginal physical link: intermittent fiber, a bad SFP, or a misconfigured MTU causing OSPF hellos to get dropped sporadically. BGP flapping can also be caused by an overwhelmed peer that isn't responding to keepalives within the negotiated hold time.
How to Identify It
Start with syslog and grep for neighbor state transitions:
sw-infrarunbook-01# show logging last 200 | grep -i "neighbor\|adjac\|state"
Apr 17 03:14:22 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: INIT
Apr 17 03:14:28 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: FULL
Apr 17 03:14:41 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: INIT
Apr 17 03:14:47 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: FULL
That INIT → FULL → INIT → FULL pattern repeating every 20 seconds is the smoking gun. Pull the OSPF neighbor detail to see how many state changes have happened:
sw-infrarunbook-01# show ip ospf neighbor 10.0.0.1 detail
Neighbor 10.0.0.1, interface address 10.0.0.1
In the area 0.0.0.0 via interface Ethernet3
Neighbor priority is 1, State is Full, 6 state changes
DR is 10.0.0.2, BDR is 10.0.0.1
Options is 0x12 -|-|-|-|-|-|E|-
Dead timer due in 00:00:38
Last hello received 00:00:02 ago
Six state changes on what should be a stable link is a clear sign of instability. Check BGP as well:
sw-infrarunbook-01# show ip bgp summary
BGP summary information for VRF default
Router identifier 10.0.1.1, local AS number 65001
Neighbor AS Session State AFI/SAFI Up/Down State Reason
10.0.2.1 65002 Established IPv4 Unicast 00:00:43
10.0.3.1 65003 Established IPv4 Unicast 00:00:12
10.0.4.1 65004 Idle IPv4 Unicast 00:14:22 NoReason
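When several neighbors are involved, it helps to quantify the churn rather than eyeball the log. A sketch that counts state-change messages per neighbor from captured syslog text — the regex assumes the OSPF message format shown in the log excerpt above:

```python
import re
from collections import Counter

# Sketch: count OSPF neighbor state-change messages per (neighbor, interface)
# from captured syslog text. Regex assumes the EOS message format shown above.
PATTERN = re.compile(
    r"NEIGHBOR_STATE_CHANGE: Neighbor (\S+) \((\S+)\) is now: (\S+)")

def flap_counts(log_text: str) -> Counter:
    counts = Counter()
    for match in PATTERN.finditer(log_text):
        neighbor, interface, _state = match.groups()
        counts[(neighbor, interface)] += 1
    return counts

sample = """\
Apr 17 03:14:22 sw1 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: INIT
Apr 17 03:14:28 sw1 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: FULL
"""
print(flap_counts(sample)[("10.0.0.1", "Ethernet3")])  # 2
```

Any neighbor with a double-digit count over a short capture window deserves a hardware look first.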
How to Fix It
Fix the underlying physical problem first. Check the interface error counters:
sw-infrarunbook-01# show interfaces Ethernet3 counters errors
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
Et3 0 4821 0 4821 0 0
FCS errors confirm a physical-layer problem — bad cable, SFP, or fiber. Replace the hardware. While you're waiting, increase the OSPF dead interval on the flapping interface to give yourself breathing room:
sw-infrarunbook-01(config)# interface Ethernet3
sw-infrarunbook-01(config-if-Et3)# ip ospf dead-interval 60
sw-infrarunbook-01(config-if-Et3)# ip ospf hello-interval 15
For BGP, enable route dampening to prevent a flapping peer from continuously triggering full RIB recalculations:
sw-infrarunbook-01(config)# router bgp 65001
sw-infrarunbook-01(config-router-bgp)# bgp dampening
Root Cause 2: Large BGP Table
If you're peering with an upstream provider and accepting a full internet routing table — around 950,000 prefixes as of 2026 — the BGP process on your switch is doing a huge amount of work just maintaining that table, running best-path selection, and keeping peers synchronized. On lower-end Arista platforms that weren't sized for this, it'll peg the CPU during convergence events. Even on capable hardware, accepting an unfiltered full table from multiple peers compounds the problem considerably.
This often creeps up gradually. The table grew slowly over months, and the CPU followed. Operators don't notice until something triggers a full BGP reconvergence — a peer reset, a software upgrade, a link flap — and suddenly the switch is completely swamped.
How to Identify It
sw-infrarunbook-01# show ip bgp summary
BGP summary information for VRF default
Router identifier 10.0.1.1, local AS number 65001
Neighbor AS Session State AFI/SAFI Pfx Rcvd Up/Down
10.0.2.1 65002 Established IPv4 Unicast 952847 5d03h
10.0.3.1 65003 Established IPv4 Unicast 948221 5d03h
Almost a million prefixes from two peers — that's the full internet table, twice over. Check what BGP is doing to memory:
sw-infrarunbook-01# show processes top once | grep -i bgp
1842 root 20 0 2.1g 1.4g 14320 R 72.3 34.1 94:32.41 Bgp
1.4 GB of resident memory for BGP alone. Also check your RIB size to confirm the scale of the problem:
sw-infrarunbook-01# show ip route summary
Operating routing protocol model: multi-agent
Maximum number of ecmp paths allowed: 4
Connected: 12 prefixes (12 paths)
Static: 4 prefixes (4 paths)
BGP: 952847 prefixes (952847 paths)
OSPF: 42 prefixes (84 paths)
Total: 953905 prefixes (953947 paths)
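If you monitor many switches, this check is worth automating. A sketch that parses the Pfx Rcvd column from captured show ip bgp summary text and flags oversized peers — column positions are assumed from the sample output above:

```python
# Sketch: flag BGP peers whose received-prefix count exceeds a limit,
# parsing the "Pfx Rcvd" column from captured "show ip bgp summary" text.
# Column positions are assumed from the sample output above.
def oversized_peers(summary_text: str, limit: int = 100_000):
    flagged = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Data rows start with a dotted-quad neighbor address
        if len(fields) >= 7 and fields[0].count(".") == 3 and fields[5].isdigit():
            neighbor, pfx_rcvd = fields[0], int(fields[5])
            if pfx_rcvd > limit:
                flagged.append((neighbor, pfx_rcvd))
    return flagged

sample = """\
Neighbor AS Session State AFI/SAFI Pfx Rcvd Up/Down
10.0.2.1 65002 Established IPv4 Unicast 952847 5d03h
10.0.3.1 65003 Established IPv4 Unicast 948221 5d03h
"""
print(oversized_peers(sample))  # [('10.0.2.1', 952847), ('10.0.3.1', 948221)]
```

The 100,000 default is an arbitrary illustration; pick a limit that reflects what the platform was actually sized for.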
How to Fix It
The right fix is route filtering. Unless this switch specifically needs to make forwarding decisions based on a full internet table, configure a prefix list and route map to accept only what you need — typically a default route plus any specific prefixes you use for traffic engineering:
sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 5 permit 0.0.0.0/0
sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 10 permit 10.0.0.0/8 le 32
sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 15 permit 172.16.0.0/12 le 32
sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 20 permit 192.168.0.0/16 le 32
sw-infrarunbook-01(config)# route-map ACCEPT-LIMITED permit 10
sw-infrarunbook-01(config-route-map-ACCEPT-LIMITED)# match ip address prefix-list FILTER-FULL-TABLE
sw-infrarunbook-01(config)# router bgp 65001
sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.2.1 route-map ACCEPT-LIMITED in
sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.3.1 route-map ACCEPT-LIMITED in
After applying the route map, do a soft reset so you don't drop and re-establish the sessions:
sw-infrarunbook-01# clear ip bgp 10.0.2.1 soft in
sw-infrarunbook-01# clear ip bgp 10.0.3.1 soft in
Also set a max-routes limit as a safety net so this can't quietly happen again:
sw-infrarunbook-01(config)# router bgp 65001
sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.2.1 maximum-routes 10000 warning-limit 8000
sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.3.1 maximum-routes 10000 warning-limit 8000
Root Cause 3: Interface Errors Causing a Log Storm
This one is sneaky. A single flapping interface — or one generating continuous CRC errors — can flood syslog at a rate that overwhelms the Syslog process itself. The switch spends more CPU writing log messages than doing actual network work. I've walked into situations where the Syslog process was sitting at 40% CPU and every other process was starved for cycles. The culprit was a single bad SFP toggling link state hundreds of times per hour.
The tricky part is that syslog storms mask the real cause. You'll see Syslog high in show processes top, but that's a symptom. The root cause is always the event generating the messages.
How to Identify It
sw-infrarunbook-01# show logging last 50 | grep -i "link\|down\|up"
Apr 17 03:22:01 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to down
Apr 17 03:22:02 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to up
Apr 17 03:22:04 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to down
Apr 17 03:22:05 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to up
A full down/up cycle every couple of seconds, sustained. Now check the interface error counters to confirm the physical problem:
sw-infrarunbook-01# show interfaces Ethernet12
Ethernet12 is up, line protocol is up (connected)
Hardware is Ethernet, address is 001c.7300.1234
Last clearing of "show interface" counters: 1d02h
Input statistics:
0 runts, 0 giants, 0 throttles
18943 input errors, 18943 CRC, 0 alignment
0 symbol, 0 input discard
Output statistics:
0 output errors
18,943 CRC errors since the last counter clear. That's your culprit. You can also confirm by watching the Syslog process directly:
sw-infrarunbook-01# show processes top once | grep -i syslog
421 root 20 0 198432 22104 8812 S 38.4 0.5 2:14.22 Syslog
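To put a number on the storm, you can compute the flap rate from the log timestamps. A sketch for captured log text — the timestamp format ("Apr 17 03:22:01") is assumed from the messages above:

```python
from datetime import datetime

# Sketch: compute link-flap events per minute from captured syslog lines.
# Timestamp format ("Apr 17 03:22:01") is assumed from the messages above.
def flaps_per_minute(log_lines):
    times = []
    for line in log_lines:
        if "UPDOWN" in line:
            # First three whitespace-separated fields are the timestamp
            stamp = " ".join(line.split()[:3])
            times.append(datetime.strptime(stamp, "%b %d %H:%M:%S"))
    if len(times) < 2:
        return 0.0
    span = (max(times) - min(times)).total_seconds()
    return len(times) * 60.0 / span if span else float("inf")

lines = [
    "Apr 17 03:22:01 sw1 Kernel: %LINEPROTO-5-UPDOWN: Interface Ethernet12 down",
    "Apr 17 03:22:02 sw1 Kernel: %LINEPROTO-5-UPDOWN: Interface Ethernet12 up",
    "Apr 17 03:22:04 sw1 Kernel: %LINEPROTO-5-UPDOWN: Interface Ethernet12 down",
    "Apr 17 03:22:05 sw1 Kernel: %LINEPROTO-5-UPDOWN: Interface Ethernet12 up",
]
print(flaps_per_minute(lines))  # 60.0
```

Anything sustained in the tens of flaps per minute is well past what errdisable should be allowed to tolerate.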
How to Fix It
The permanent fix is replacing the bad SFP, cable, or far-end transceiver. But while you're waiting for hardware, shut the offending interface to stop the storm immediately:
sw-infrarunbook-01(config)# interface Ethernet12
sw-infrarunbook-01(config-if-Et12)# shutdown
For future protection, configure logging rate limiting so no single event source can ever monopolize the system:
sw-infrarunbook-01(config)# logging rate-limit 100
Configure errdisable link-flap detection as well. This will automatically disable an interface that's flapping beyond a threshold, preventing it from generating continuous log events:
sw-infrarunbook-01(config)# errdisable detect cause link-flap
sw-infrarunbook-01(config)# errdisable recovery cause link-flap
sw-infrarunbook-01(config)# errdisable recovery interval 300
Root Cause 4: ZTP Still Running
ZeroTouch Provisioning is an EOS feature that lets a switch bootstrap itself from a provisioning server on first boot — it fires up, hits DHCP, grabs a config script, and configures itself without anyone touching the CLI. Extremely useful. The problem is when ZTP is still active on a switch that's already deployed and operational, or on a switch that can't reach its provisioning server. ZTP will keep retrying DHCP, polling for configuration, logging failures, and all of that activity burns CPU and fills up logs. Don't laugh — I've seen production switches running ZTP continuously for weeks because nobody noticed.
This happens most often after a reload where the startup config was lost or corrupted, after someone accidentally ran write erase, or on new switches that were powered up and connected to the network without completing the provisioning flow.
How to Identify It
sw-infrarunbook-01# show zerotouch
ZeroTouch State: Active
ZeroTouch Config: Provisioning server unreachable
Last ZeroTouch Attempt: 00:02:14 ago
ZeroTouch script: Not downloaded
That "Active" state on a switch that's already configured is the problem. You'll also see ZTP in the process list consuming CPU:
sw-infrarunbook-01# show processes top once | grep -i ztp
312 root 20 0 32124 8012 3412 S 14.2 0.2 2:11.44 ZeroTouch
And syslog will be full of DHCP and HTTP retry noise at a regular interval:
sw-infrarunbook-01# show logging | grep -i ztp
Apr 17 03:30:01 sw-infrarunbook-01 ZeroTouch: %ZTP-5-DHCP_ATTEMPT: Attempting DHCP on Management1
Apr 17 03:30:07 sw-infrarunbook-01 ZeroTouch: %ZTP-3-DHCP_FAILED: DHCP failed on Management1
Apr 17 03:30:07 sw-infrarunbook-01 ZeroTouch: %ZTP-3-PROVISION_FAILED: Provisioning attempt failed, retrying in 120s
How to Fix It
This is the easiest fix on this list. Cancel ZTP and the process stops immediately:
sw-infrarunbook-01# zerotouch cancel
ZeroTouch: Cancelling ZeroTouch
ZeroTouch: Disabled
Verify it stopped:
sw-infrarunbook-01# show zerotouch
ZeroTouch State: Disabled
To permanently disable ZTP so it never activates again after a reload, run:
sw-infrarunbook-01# zerotouch disable
Then save the running config so the disable state persists across reboots:
sw-infrarunbook-01# write memory
Root Cause 5: BFD Sessions Flapping
Bidirectional Forwarding Detection is designed to provide fast failure detection — subsecond, when configured aggressively. But BFD's speed is also its weakness. When BFD timers are set too low relative to what the underlying path can reliably support, sessions will flap. And unlike a simple keepalive, every BFD session flap triggers protocol events: BGP peers go down, OSPF adjacencies reset, static routes disappear. All of that reconvergence work hammers the CPU in a compounding loop — BFD flap causes BGP reset which causes RIB churn which causes FIB reprogramming, all while BFD is flapping again.
In my experience, this almost always happens after someone tuned BFD timers aggressively to improve failover speed without fully characterizing the path. A congested uplink, a hypervisor host briefly pausing during vMotion, or even high-frequency garbage collection on a software BGP speaker can be enough to miss BFD hellos at 300ms intervals.
How to Identify It
sw-infrarunbook-01# show bfd peers
VRF name: default
-----------------
DstAddr MyDisc YourDisc Interface/Transport Type LastUp
10.0.2.1 3120498932 2847612301 Ethernet1 normal 04/17 02:44:01
10.0.3.1 1234098123 9871234509 Ethernet2 normal 04/17 03:29:47
10.0.4.1 8712349812 1234987123 Ethernet3 normal 04/17 03:30:01
The timestamps tell part of the story — sessions came up just minutes apart. Get the full detail including up/down transition counts:
sw-infrarunbook-01# show bfd peers detail
VRF name: default
-----------------
DstAddr: 10.0.4.1
State: Up, Timer Multiplier: 3, BFD Type: normal
Tx Interval: 300 ms, Rx Interval: 300 ms
Registered Protocols: BGP
Up/Down: 83/82
Last State Change: 04/17 03:30:01
83 ups and 82 downs on a single BFD peer — roughly 165 state transitions. That's catastrophic: each one of those triggered BGP reconvergence. Cross-reference with syslog to see the frequency:
sw-infrarunbook-01# show logging | grep -i bfd
Apr 17 03:30:01 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Up
Apr 17 03:29:58 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Down
Apr 17 03:29:55 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Up
Apr 17 03:29:52 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Down
Flapping every 3 seconds. The CPU is running BGP reconvergence faster than it can complete a single cycle.
How to Fix It
Back off the BFD timers to something the path can reliably sustain. The default 300ms minimum interval with a 3x multiplier gives you a 900ms detection time. That's already aggressive for anything but a local Ethernet segment. Bump it up on the interface carrying the flapping session — in EOS, BFD intervals are set per interface (or globally), not per BGP neighbor:
sw-infrarunbook-01(config)# interface Ethernet3
sw-infrarunbook-01(config-if-Et3)# bfd interval 750 min-rx 750 multiplier 3
That gives you a 2.25-second detection window — still fast, but far more tolerant of brief path delays. If BFD isn't strictly required for a given neighbor, disable it entirely until you can characterize and fix the path:
sw-infrarunbook-01(config)# router bgp 65001
sw-infrarunbook-01(config-router-bgp)# no neighbor 10.0.4.1 bfd
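The detection-time arithmetic behind these choices is simple enough to sanity-check before applying a change — detection time is roughly the negotiated receive interval times the multiplier:

```python
# BFD detection time ~= negotiated rx interval * multiplier (per RFC 5880,
# the remote system's multiplier applies). A sketch for sanity-checking
# timer choices before pushing them to a switch.
def bfd_detection_ms(min_rx_ms: int, multiplier: int) -> int:
    return min_rx_ms * multiplier

print(bfd_detection_ms(300, 3))  # 900  -- the default cited above
print(bfd_detection_ms(750, 3))  # 2250 -- the relaxed setting above
```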
After adjusting timers, verify the session stabilizes using the counters output:
sw-infrarunbook-01# show bfd peers counters
VRF name: default
-----------------
DstAddr LastDown LastUp FailedTx FailedRx TimeoutTx TimeoutRx
10.0.4.1 04/17 03:30:01 04/17 03:35:12 0 0 0 0
No new failures after 03:35 — the session has been stable since the timer adjustment.
Root Cause 6: SNMP Polling Overload
An SNMP management system polling a switch every 30 seconds while walking the full MIB tree will consume a surprisingly large amount of CPU. Multiply that across OID walks from multiple monitoring systems, and snmpd can become a real contributor to sustained high CPU. This is particularly true for MIBs that require building large response tables — interface statistics across a 96-port switch, the full routing table via ipCidrRouteTable, or BGP4 MIB walks across a large peering table.
How to Identify It
sw-infrarunbook-01# show processes top once | grep -i snmp
892 root 20 0 198432 44212 8812 S 28.4 1.1 8:34.22 snmpd
sw-infrarunbook-01# show snmp
Chassis: FCW2142L05H
Contact: infrarunbook-admin@solvethenetwork.com
Location: DC1-Row4-Rack12
SNMP packets input: 284921
Bad SNMP version errors: 0
Unknown community string: 0
Get-request PDUs: 142304
Get-next PDUs: 18151517
Set-request PDUs: 0
SNMP packets output: 18293821
18 million OID requests is significant. The get-next PDU count running more than 100x higher than get-requests is the classic sign of MIB walks rather than targeted polls. The switch is building full response tables for each walk.
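That ratio is easy to track programmatically. A sketch that flags a walk-dominated SNMP workload from the two PDU counters in show snmp:

```python
# Sketch: flag an SNMP workload as walk-dominated when the get-next PDU
# count dwarfs targeted get-requests. Counter values come from "show snmp";
# the 10x ratio threshold is an illustrative assumption.
def walk_dominated(get_requests: int, get_next: int, ratio: float = 10.0) -> bool:
    if get_requests == 0:
        return get_next > 0
    return get_next / get_requests >= ratio

print(walk_dominated(142304, 18151517))  # True -- ~128x more walks than gets
```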
How to Fix It
Coordinate with your monitoring team to reduce polling frequency and target specific OIDs instead of doing full MIB walks. On the switch side, you can mark SNMP responses as best-effort so queuing and control-plane policing deprioritize them relative to routing-protocol traffic (this reclassifies the traffic — it doesn't change process scheduling):
sw-infrarunbook-01(config)# snmp-server qos dscp 0
Longer term, moving from SNMP polling to streaming telemetry via gNMI is the right answer. EOS has excellent gNMI support, and subscribing to specific paths at a defined interval is far more efficient than repeated MIB walks. The switch doesn't have to build response tables on demand, and you'll get sub-second granularity on the metrics that matter.
Root Cause 7: ACL TCAM Overflow Causing Software Forwarding
Very large ACLs that exceed TCAM capacity force traffic evaluation up to the CPU for software-based policy processing. On Arista, ACL processing is normally done entirely in hardware via TCAM — it's essentially free from a CPU perspective. But when your ACL entry count exceeds what the hardware can hold, EOS starts spilling entries into software, and any traffic that matches those software entries has to be handled by the CPU.
How to Identify It
sw-infrarunbook-01# show hardware capacity
...
TCAM:
Ingress IPv4 ACL entries : 3998/4000 (99%)
Egress IPv4 ACL entries : 2100/2000 (105%) *** EXCEEDED ***
Exceeded TCAM means software fallback. Check which ACL is the culprit:
sw-infrarunbook-01# show ip access-lists summary
IPV4 ACL BLOCK-THREATS
Total ACEs configured: 2847
Sequence numbers: 10-28470
IPV4 ACL MGMT-ACCESS
Total ACEs configured: 12
Sequence numbers: 10-120
An ACL with 2,847 entries is almost certainly the source of your TCAM pressure. Work with your security team to consolidate entries using object-groups, summarize IP ranges into prefix blocks, or migrate the policy to a dedicated firewall that's purpose-built for large ACL tables.
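If you monitor TCAM utilization programmatically, a simple threshold check over the capacity counters catches this before entries spill to software. A sketch, given (used, total) pairs pulled from show hardware capacity output — the region names here are illustrative:

```python
# Sketch: flag TCAM regions near or over capacity, given (used, total)
# pairs parsed from "show hardware capacity" output. Reports integer
# percent, matching the CLI's display.
def tcam_alerts(regions: dict, warn_pct: float = 90.0):
    alerts = {}
    for name, (used, total) in regions.items():
        pct = 100.0 * used / total
        if pct >= warn_pct:
            alerts[name] = int(pct)
    return alerts

print(tcam_alerts({
    "Ingress IPv4 ACL": (3998, 4000),
    "Egress IPv4 ACL": (2100, 2000),
    "Ingress IPv6 ACL": (120, 1000),
}))
# {'Ingress IPv4 ACL': 99, 'Egress IPv4 ACL': 105}
```

Alerting at 90% gives you time to consolidate entries before the hardware starts spilling.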
Prevention
Most high-CPU incidents on Arista EOS are preventable. The work happens before the incident, not during it.
Set up CPU alerting in your monitoring system. Alert at 60% sustained, page at 80%. Don't wait for control-plane drops to tell you something is wrong — by the time BGP sessions are dropping, you're already in full incident response mode.
Configure logging rate limits on every switch during your initial build. A syslog storm should never be able to monopolize system resources:
sw-infrarunbook-01(config)# logging rate-limit 200
Disable ZTP explicitly on every switch once it's provisioned. Make this part of your standard build checklist and your configuration management template. It takes 10 seconds to prevent the problem permanently:
sw-infrarunbook-01# zerotouch disable
sw-infrarunbook-01# write memory
Apply max-routes on all BGP neighbors so a peer can't accidentally blow up your routing table. A reasonable starting point is 10,000 routes for internal peers and 1,000 for anything you don't fully control. The warning-limit triggers a syslog alert before the hard limit kicks in, giving you time to investigate:
sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.2.1 maximum-routes 10000 warning-limit 8000
Set conservative BFD timers by default and only tune them aggressively on paths you've explicitly validated. A 750ms minimum interval with a 3x multiplier is a solid starting point for most deployments. Never copy aggressive BFD configs from data center core links to WAN-facing interfaces — the path characteristics are completely different, and timers that work on a 10GE local link will cause constant flapping over a 100ms-latency WAN circuit.
Enable link-flap errdisable detection as a standard build item. A single bad SFP shouldn't be able to generate enough syslog events to degrade the entire control plane:
sw-infrarunbook-01(config)# errdisable detect cause link-flap
sw-infrarunbook-01(config)# errdisable recovery cause link-flap
sw-infrarunbook-01(config)# errdisable recovery interval 300
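If you manage configs centrally, a tiny audit over rendered configs catches switches missing these guardrails before they bite. A sketch — the required lines mirror this section's examples, and you'd adjust them to your own standards:

```python
# Sketch: audit a rendered EOS config for the prevention items covered
# above. The required lines mirror this section's examples; adjust the
# list to your own build standards.
REQUIRED = [
    "logging rate-limit",
    "errdisable detect cause link-flap",
    "errdisable recovery cause link-flap",
]

def missing_guardrails(config_text: str):
    lines = config_text.splitlines()
    return [req for req in REQUIRED
            if not any(line.strip().startswith(req) for line in lines)]

print(missing_guardrails("errdisable detect cause link-flap\n"))
# ['logging rate-limit', 'errdisable recovery cause link-flap']
```

Run it against every config your automation renders, and fail the pipeline when the list is non-empty.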
Finally, if you're still running SNMP polling for monitoring, evaluate a migration to streaming telemetry. EOS's gNMI implementation is mature and well-supported, and targeted subscriptions are dramatically more efficient than periodic MIB walks. You'll get better data at lower cost to the switch — and you won't be explaining to management why a monitoring system degraded a production switch.
High CPU on a network switch is never just one of those things. There's always a root cause, and EOS gives you the tools to find it. The commands in this article will get you to the answer in under five minutes for most scenarios. Know them before you need them.
