Symptoms
High CPU on an Arista switch rarely announces itself politely. The first sign is usually SSH becoming sluggish — keystrokes lag, tab completion stalls for two or three seconds, and sometimes you'll get a connection timeout before you even reach the prompt. Control-plane traffic starts dropping. BGP neighbors go down, OSPF adjacencies reset, and suddenly you're getting paged about a network event that started with a single overloaded process.
The canonical first command is show processes top. Run it and watch for anything consuming more than 20-30% CPU consistently. On a healthy switch under normal load, no single process should be pinning the CPU. Here's what a troubled switch looks like:
sw-infrarunbook-01# show processes top once
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1842 root 20 0 512424 89212 14320 R 87.3 2.1 14:32.41 Bgp
421 root 20 0 198432 22104 8812 S 12.1 0.5 2:14.22 Syslog
654 root 20 0 88432 11204 4812 S 4.2 0.3 0:44.12 Ospf
1 root 20 0 41320 5012 3412 S 0.1 0.1 0:02.14 init
That BGP process at 87% is a serious problem. In my experience, once a process sustains above 50%, the switch is already struggling to keep up with control-plane work, and it won't recover on its own.
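If you collect this output off-box, the same check is easy to script. A minimal sketch, assuming you've captured the text of show processes top once and that the column layout matches the sample above:

```python
# Sketch: parse captured "show processes top once" text and flag any
# process above a CPU threshold. Column layout is assumed to match the
# sample above (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND).
def hot_processes(top_output: str, threshold: float = 50.0):
    hot = []
    for line in top_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that doesn't parse as a process line
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        cpu = float(fields[8])             # %CPU column
        if cpu >= threshold:
            hot.append((fields[11], cpu))  # (COMMAND, %CPU)
    return sorted(hot, key=lambda p: p[1], reverse=True)

sample = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1842 root 20 0 512424 89212 14320 R 87.3 2.1 14:32.41 Bgp
421 root 20 0 198432 22104 8812 S 12.1 0.5 2:14.22 Syslog
"""
print(hot_processes(sample))  # [('Bgp', 87.3)]
```

The 50% default mirrors the rule of thumb above; lower it if you want earlier warning.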
Other symptoms you'll commonly see alongside high CPU include syslog flooding — thousands of identical messages per second in show logging — incrementing drops on management or CPU-facing ports in show interfaces, BGP or OSPF adjacency flaps showing up in show logging last 100, and occasionally ZeroTouch Provisioning still running on a switch that should have long since been provisioned.
Let's go through the root causes one by one, starting with the ones that cause the most damage.
Root Cause 1: Routing Protocol Flapping
OSPF and BGP both generate significant CPU work when neighbors are unstable. Every time a neighbor drops and comes back up, the switch runs SPF calculations, updates the RIB, and reprograms the FIB. If this is happening multiple times per minute, it becomes a CPU death spiral. OSPF is particularly brutal here because SPF runs are synchronous and expensive — the process can't do anything else while it's running the shortest-path tree.
I've seen this triggered most often by a marginal physical link: intermittent fiber, a bad SFP, or a misconfigured MTU causing OSPF hellos to get dropped sporadically. BGP flapping can also be caused by an overwhelmed peer that isn't responding to keepalives within the negotiated hold time.
How to Identify It
Start with syslog and grep for neighbor state transitions:
sw-infrarunbook-01# show logging last 200 | grep -i "neighbor\|adjac\|state"
Apr 17 03:14:22 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: INIT
Apr 17 03:14:28 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: FULL
Apr 17 03:14:41 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: INIT
Apr 17 03:14:47 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: FULL
That INIT → FULL → INIT → FULL pattern repeating every 20 seconds is the smoking gun. Pull the OSPF neighbor detail to see how many state changes have happened:
sw-infrarunbook-01# show ip ospf neighbor 10.0.0.1 detail
Neighbor 10.0.0.1, interface address 10.0.0.1
In the area 0.0.0.0 via interface Ethernet3
Neighbor priority is 1, State is Full, 6 state changes
DR is 10.0.0.2, BDR is 10.0.0.1
Options is 0x12 -|-|-|-|-|-|E|-
Dead timer due in 00:00:38
Last hello received 00:00:02 ago
Six state changes on what should be a stable link is a clear sign of instability. Check BGP as well:
sw-infrarunbook-01# show ip bgp summary
BGP summary information for VRF default
Router identifier 10.0.1.1, local AS number 65001
Neighbor AS Session State AFI/SAFI Up/Down State Reason
10.0.2.1 65002 Established IPv4 Unicast 00:00:43
10.0.3.1 65003 Established IPv4 Unicast 00:00:12
10.0.4.1 65004 Idle IPv4 Unicast 00:14:22 NoReason
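When several neighbors are involved, it helps to quantify the churn rather than eyeball the log. A sketch that counts state-change messages per neighbor from captured syslog text — the regex assumes the OSPF message format shown in the log excerpt above:

```python
import re
from collections import Counter

# Sketch: count OSPF neighbor state-change messages per (neighbor, interface)
# from captured syslog text. Regex assumes the EOS message format shown above.
PATTERN = re.compile(
    r"NEIGHBOR_STATE_CHANGE: Neighbor (\S+) \((\S+)\) is now: (\S+)")

def flap_counts(log_text: str) -> Counter:
    counts = Counter()
    for match in PATTERN.finditer(log_text):
        neighbor, interface, _state = match.groups()
        counts[(neighbor, interface)] += 1
    return counts

sample = """\
Apr 17 03:14:22 sw1 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: INIT
Apr 17 03:14:28 sw1 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: FULL
"""
print(flap_counts(sample)[("10.0.0.1", "Ethernet3")])  # 2
```

Any neighbor with a double-digit count over a short capture window deserves a hardware look first.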
How to Fix It
Fix the underlying physical problem first. Check the interface error counters:
sw-infrarunbook-01# show interfaces Ethernet3 counters errors
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
Et3 0 4821 0 4821 0 0
FCS errors confirm a physical-layer problem — bad cable, SFP, or fiber. Replace the hardware. While you're waiting, increase the OSPF dead interval on the flapping interface to give yourself breathing room:
sw-infrarunbook-01(config)# interface Ethernet3
sw-infrarunbook-01(config-if-Et3)# ip ospf dead-interval 60
sw-infrarunbook-01(config-if-Et3)# ip ospf hello-interval 15
For BGP, enable route dampening to prevent a flapping peer from continuously triggering full RIB recalculations:
sw-infrarunbook-01(config)# router bgp 65001
sw-infrarunbook-01(config-router-bgp)# bgp dampening
Root Cause 2: Large BGP Table
If you're peering with an upstream provider and accepting a full internet routing table — around 950,000 prefixes as of 2026 — the BGP process on your switch is doing a huge amount of work just maintaining that table, running best-path selection, and keeping peers synchronized. On lower-end Arista platforms that weren't sized for this, it'll peg the CPU during convergence events. Even on capable hardware, accepting an unfiltered full table from multiple peers compounds the problem considerably.
This often creeps up gradually. The table grew slowly over months, and the CPU followed. Operators don't notice until something triggers a full BGP reconvergence — a peer reset, a software upgrade, a link flap — and suddenly the switch is completely swamped.
How to Identify It
sw-infrarunbook-01# show ip bgp summary
BGP summary information for VRF default
Router identifier 10.0.1.1, local AS number 65001
Neighbor AS Session State AFI/SAFI Pfx Rcvd Up/Down
10.0.2.1 65002 Established IPv4 Unicast 952847 5d03h
10.0.3.1 65003 Established IPv4 Unicast 948221 5d03h
Almost a million prefixes from two peers — that's the full internet table, twice over. Check what BGP is doing to memory:
sw-infrarunbook-01# show processes top once | grep -i bgp
1842 root 20 0 2.1g 1.4g 14320 R 72.3 34.1 94:32.41 Bgp
1.4 GB of resident memory for BGP alone. Also check your RIB size to confirm the scale of the problem:
sw-infrarunbook-01# show ip route summary
Operating routing protocol model: multi-agent
Maximum number of ecmp paths allowed: 4
Connected: 12 prefixes (12 paths)
Static: 4 prefixes (4 paths)
BGP: 952847 prefixes (952847 paths)
OSPF: 42 prefixes (84 paths)
Total: 953905 prefixes (953947 paths)
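If you monitor many switches, this check is worth automating. A sketch that parses the Pfx Rcvd column from captured show ip bgp summary text and flags oversized peers — column positions are assumed from the sample output above:

```python
# Sketch: flag BGP peers whose received-prefix count exceeds a limit,
# parsing the "Pfx Rcvd" column from captured "show ip bgp summary" text.
# Column positions are assumed from the sample output above.
def oversized_peers(summary_text: str, limit: int = 100_000):
    flagged = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Data rows start with a dotted-quad neighbor address
        if len(fields) >= 7 and fields[0].count(".") == 3 and fields[5].isdigit():
            neighbor, pfx_rcvd = fields[0], int(fields[5])
            if pfx_rcvd > limit:
                flagged.append((neighbor, pfx_rcvd))
    return flagged

sample = """\
Neighbor AS Session State AFI/SAFI Pfx Rcvd Up/Down
10.0.2.1 65002 Established IPv4 Unicast 952847 5d03h
10.0.3.1 65003 Established IPv4 Unicast 948221 5d03h
"""
print(oversized_peers(sample))  # [('10.0.2.1', 952847), ('10.0.3.1', 948221)]
```

The 100,000 default is an arbitrary illustration; pick a limit that reflects what the platform was actually sized for.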
How to Fix It
The right fix is route filtering. Unless this switch specifically needs to make forwarding decisions based on a full internet table, configure a prefix list and route map to accept only what you need — typically a default route plus any specific prefixes you use for traffic engineering:
sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 5 permit 0.0.0.0/0
sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 10 permit 10.0.0.0/8 le 32
sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 15 permit 172.16.0.0/12 le 32
sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 20 permit 192.168.0.0/16 le 32
sw-infrarunbook-01(config)# route-map ACCEPT-LIMITED permit 10
sw-infrarunbook-01(config-route-map-ACCEPT-LIMITED)# match ip address prefix-list FILTER-FULL-TABLE
sw-infrarunbook-01(config)# router bgp 65001
sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.2.1 route-map ACCEPT-LIMITED in
sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.3.1 route-map ACCEPT-LIMITED in
After applying the route map, do a soft reset so you don't drop and re-establish the sessions:
sw-infrarunbook-01# clear ip bgp 10.0.2.1 soft in
sw-infrarunbook-01# clear ip bgp 10.0.3.1 soft in
Also set a max-routes limit as a safety net so this can't quietly happen again:
sw-infrarunbook-01(config)# router bgp 65001
sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.2.1 maximum-routes 10000 warning-limit 8000
sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.3.1 maximum-routes 10000 warning-limit 8000
Root Cause 3: Interface Errors Causing a Log Storm
This one is sneaky. A single flapping interface — or one generating continuous CRC errors — can flood syslog at a rate that overwhelms the Syslog process itself. The switch spends more CPU writing log messages than doing actual network work. I've walked into situations where the Syslog process was sitting at 40% CPU and every other process was starved for cycles. The culprit was a single bad SFP toggling link state hundreds of times per hour.
The tricky part is that syslog storms mask the real cause. You'll see Syslog high in show processes top, but that's a symptom. The root cause is always the event generating the messages.
How to Identify It
sw-infrarunbook-01# show logging last 50 | grep -i "link\|down\|up"
Apr 17 03:22:01 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to down
Apr 17 03:22:02 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to up
Apr 17 03:22:04 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to down
Apr 17 03:22:05 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to up
A full down/up cycle every couple of seconds, sustained. Now check the interface error counters to confirm the physical problem:
sw-infrarunbook-01# show interfaces Ethernet12
Ethernet12 is up, line protocol is up (connected)
Hardware is Ethernet, address is 001c.7300.1234
Last clearing of "show interface" counters: 1d02h
Input statistics:
0 runts, 0 giants, 0 throttles
18943 input errors, 18943 CRC, 0 alignment
0 symbol, 0 input discard
Output statistics:
0 output errors
18,943 CRC errors since the last counter clear. That's your culprit. You can also confirm by watching the Syslog process directly:
sw-infrarunbook-01# show processes top once | grep -i syslog
421 root 20 0 198432 22104 8812 S 38.4 0.5 2:14.22 Syslog
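To put a number on the storm, you can compute the flap rate from the log timestamps. A sketch for captured log text — the timestamp format ("Apr 17 03:22:01") is assumed from the messages above:

```python
from datetime import datetime

# Sketch: compute link-flap events per minute from captured syslog lines.
# Timestamp format ("Apr 17 03:22:01") is assumed from the messages above.
def flaps_per_minute(log_lines):
    times = []
    for line in log_lines:
        if "UPDOWN" in line:
            # First three whitespace-separated fields are the timestamp
            stamp = " ".join(line.split()[:3])
            times.append(datetime.strptime(stamp, "%b %d %H:%M:%S"))
    if len(times) < 2:
        return 0.0
    span = (max(times) - min(times)).total_seconds()
    return len(times) * 60.0 / span if span else float("inf")

lines = [
    "Apr 17 03:22:01 sw1 Kernel: %LINEPROTO-5-UPDOWN: Interface Ethernet12 down",
    "Apr 17 03:22:02 sw1 Kernel: %LINEPROTO-5-UPDOWN: Interface Ethernet12 up",
    "Apr 17 03:22:04 sw1 Kernel: %LINEPROTO-5-UPDOWN: Interface Ethernet12 down",
    "Apr 17 03:22:05 sw1 Kernel: %LINEPROTO-5-UPDOWN: Interface Ethernet12 up",
]
print(flaps_per_minute(lines))  # 60.0
```

Anything sustained in the tens of flaps per minute is well past what errdisable should be allowed to tolerate.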
How to Fix It
The permanent fix is replacing the bad SFP, cable, or far-end transceiver. But while you're waiting for hardware, shut the offending interface to stop the storm immediately:
sw-infrarunbook-01(config)# interface Ethernet12
sw-infrarunbook-01(config-if-Et12)# shutdown
For future protection, configure logging rate limiting so no single event source can ever monopolize the system:
sw-infrarunbook-01(config)# logging rate-limit 100
Configure errdisable link-flap detection as well. This will automatically disable an interface that's flapping beyond a threshold, preventing it from generating continuous log events:
sw-infrarunbook-01(config)# errdisable detect cause link-flap
sw-infrarunbook-01(config)# errdisable recovery cause link-flap
sw-infrarunbook-01(config)# errdisable recovery interval 300
Root Cause 4: ZTP Still Running
ZeroTouch Provisioning is an EOS feature that lets a switch bootstrap itself from a provisioning server on first boot — it fires up, hits DHCP, grabs a config script, and configures itself without anyone touching the CLI. Extremely useful. The problem is when ZTP is still active on a switch that's already deployed and operational, or on a switch that can't reach its provisioning server. ZTP will keep retrying DHCP, polling for configuration, logging failures, and all of that activity burns CPU and fills up logs. Don't laugh — I've seen production switches running ZTP continuously for weeks because nobody noticed.
This happens most often after a reload where the startup config was lost or corrupted, after someone accidentally ran write erase, or on new switches that were powered up and connected to the network without completing the provisioning flow.
How to Identify It
sw-infrarunbook-01# show zerotouch
ZeroTouch State: Active
ZeroTouch Config: Provisioning server unreachable
Last ZeroTouch Attempt: 00:02:14 ago
ZeroTouch script: Not downloaded
That "Active" state on a switch that's already configured is the problem. You'll also see ZTP in the process list consuming CPU:
sw-infrarunbook-01# show processes top once | grep -i ztp
312 root 20 0 32124 8012 3412 S 14.2 0.2 2:11.44 ZeroTouch
And syslog will be full of DHCP and HTTP retry noise at a regular interval:
sw-infrarunbook-01# show logging | grep -i ztp
Apr 17 03:30:01 sw-infrarunbook-01 ZeroTouch: %ZTP-5-DHCP_ATTEMPT: Attempting DHCP on Management1
Apr 17 03:30:07 sw-infrarunbook-01 ZeroTouch: %ZTP-3-DHCP_FAILED: DHCP failed on Management1
Apr 17 03:30:07 sw-infrarunbook-01 ZeroTouch: %ZTP-3-PROVISION_FAILED: Provisioning attempt failed, retrying in 120s
How to Fix It
This is the easiest fix on this list. Cancel ZTP and the process stops immediately:
sw-infrarunbook-01# zerotouch cancel
ZeroTouch: Cancelling ZeroTouch
ZeroTouch: Disabled
Verify it stopped:
sw-infrarunbook-01# show zerotouch
ZeroTouch State: Disabled
To permanently disable ZTP so it never activates again after a reload, run:
sw-infrarunbook-01# zerotouch disable
Then save the running config so the disable state persists across reboots:
sw-infrarunbook-01# write memory
Root Cause 5: BFD Sessions Flapping
Bidirectional Forwarding Detection is designed to provide fast failure detection — subsecond, when configured aggressively. But BFD's speed is also its weakness. When BFD timers are set too low relative to what the underlying path can reliably support, sessions will flap. And unlike a simple keepalive, every BFD session flap triggers protocol events: BGP peers go down, OSPF adjacencies reset, static routes disappear. All of that reconvergence work hammers the CPU in a compounding loop — BFD flap causes BGP reset which causes RIB churn which causes FIB reprogramming, all while BFD is flapping again.
In my experience, this almost always happens after someone tuned BFD timers aggressively to improve failover speed without fully characterizing the path. A congested uplink, a hypervisor host briefly pausing during vMotion, or even high-frequency garbage collection on a software BGP speaker can be enough to miss BFD hellos at 300ms intervals.
How to Identify It
sw-infrarunbook-01# show bfd peers
VRF name: default
-----------------
DstAddr MyDisc YourDisc Interface/Transport Type LastUp
10.0.2.1 3120498932 2847612301 Ethernet1 normal 04/17 02:44:01
10.0.3.1 1234098123 9871234509 Ethernet2 normal 04/17 03:29:47
10.0.4.1 8712349812 1234987123 Ethernet3 normal 04/17 03:30:01
The timestamps tell part of the story — sessions came up just minutes apart. Get the full detail including up/down transition counts:
sw-infrarunbook-01# show bfd peers detail
VRF name: default
-----------------
DstAddr: 10.0.4.1
State: Up, Timer Multiplier: 3, BFD Type: normal
Tx Interval: 300 ms, Rx Interval: 300 ms
Registered Protocols: BGP
Up/Down: 83/82
Last State Change: 04/17 03:30:01
83 ups and 82 downs on a single BFD peer — roughly 165 state transitions. That's catastrophic: each one of those triggered BGP reconvergence. Cross-reference with syslog to see the frequency:
sw-infrarunbook-01# show logging | grep -i bfd
Apr 17 03:30:01 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Up
Apr 17 03:29:58 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Down
Apr 17 03:29:55 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Up
Apr 17 03:29:52 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Down
Flapping every 3 seconds. The CPU is running BGP reconvergence faster than it can complete a single cycle.
How to Fix It
Back off the BFD timers to something the path can reliably sustain. The default 300ms minimum interval with a 3x multiplier gives you a 900ms detection time. That's already aggressive for anything but a local Ethernet segment. Bump it up on the interface carrying the flapping session — in EOS, BFD intervals are set per interface (or globally), not per BGP neighbor:
sw-infrarunbook-01(config)# interface Ethernet3
sw-infrarunbook-01(config-if-Et3)# bfd interval 750 min-rx 750 multiplier 3
That gives you a 2.25-second detection window — still fast, but far more tolerant of brief path delays. If BFD isn't strictly required for a given neighbor, disable it entirely until you can characterize and fix the path:
sw-infrarunbook-01(config)# router bgp 65001
sw-infrarunbook-01(config-router-bgp)# no neighbor 10.0.4.1 bfd
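The detection-time arithmetic behind these choices is simple enough to sanity-check before applying a change — detection time is roughly the negotiated receive interval times the multiplier:

```python
# BFD detection time ~= negotiated rx interval * multiplier (per RFC 5880,
# the remote system's multiplier applies). A sketch for sanity-checking
# timer choices before pushing them to a switch.
def bfd_detection_ms(min_rx_ms: int, multiplier: int) -> int:
    return min_rx_ms * multiplier

print(bfd_detection_ms(300, 3))  # 900  -- the default cited above
print(bfd_detection_ms(750, 3))  # 2250 -- the relaxed setting above
```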
After adjusting timers, verify the session stabilizes using the counters output:
sw-infrarunbook-01# show bfd peers counters
VRF name: default
-----------------
DstAddr LastDown LastUp FailedTx FailedRx TimeoutTx TimeoutRx
10.0.4.1 04/17 03:30:01 04/17 03:35:12 0 0 0 0
No new failures after 03:35 — the session has been stable since the timer adjustment.
Root Cause 6: SNMP Polling Overload
An SNMP management system polling a switch every 30 seconds while walking the full MIB tree will consume a surprisingly large amount of CPU. Multiply that across OID walks from multiple monitoring systems, and snmpd can become a real contributor to sustained high CPU. This is particularly true for MIBs that require building large response tables — interface statistics across a 96-port switch, the full routing table via ipCidrRouteTable, or BGP4 MIB walks across a large peering table.
How to Identify It
sw-infrarunbook-01# show processes top once | grep -i snmp
892 root 20 0 198432 44212 8812 S 28.4 1.1 8:34.22 snmpd
sw-infrarunbook-01# show snmp
Chassis: FCW2142L05H
Contact: infrarunbook-admin@solvethenetwork.com
Location: DC1-Row4-Rack12
SNMP packets input: 284921
Bad SNMP version errors: 0
Unknown community string: 0
Get-request PDUs: 142304
Get-next PDUs: 18151517
Set-request PDUs: 0
SNMP packets output: 18293821
18 million OID requests is significant. The get-next PDU count running more than 100x higher than get-requests is the classic sign of MIB walks rather than targeted polls. The switch is building full response tables for each walk.
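That ratio is easy to track programmatically. A sketch that flags a walk-dominated SNMP workload from the two PDU counters in show snmp:

```python
# Sketch: flag an SNMP workload as walk-dominated when the get-next PDU
# count dwarfs targeted get-requests. Counter values come from "show snmp";
# the 10x ratio threshold is an illustrative assumption.
def walk_dominated(get_requests: int, get_next: int, ratio: float = 10.0) -> bool:
    if get_requests == 0:
        return get_next > 0
    return get_next / get_requests >= ratio

print(walk_dominated(142304, 18151517))  # True -- ~128x more walks than gets
```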
How to Fix It
Coordinate with your monitoring team to reduce polling frequency and target specific OIDs instead of doing full MIB walks. On the switch side, you can mark SNMP responses as best-effort so queuing and control-plane policing deprioritize them relative to routing-protocol traffic (this reclassifies the traffic — it doesn't change process scheduling):
sw-infrarunbook-01(config)# snmp-server qos dscp 0
Longer term, moving from SNMP polling to streaming telemetry via gNMI is the right answer. EOS has excellent gNMI support, and subscribing to specific paths at a defined interval is far more efficient than repeated MIB walks. The switch doesn't have to build response tables on demand, and you'll get sub-second granularity on the metrics that matter.
Root Cause 7: ACL TCAM Overflow Causing Software Forwarding
Very large ACLs that exceed TCAM capacity force traffic evaluation up to the CPU for software-based policy processing. On Arista, ACL processing is normally done entirely in hardware via TCAM — it's essentially free from a CPU perspective. But when your ACL entry count exceeds what the hardware can hold, EOS starts spilling entries into software, and any traffic that matches those software entries has to be handled by the CPU.
How to Identify It
sw-infrarunbook-01# show hardware capacity
...
TCAM:
Ingress IPv4 ACL entries : 3998/4000 (99%)
Egress IPv4 ACL entries : 2100/2000 (105%) *** EXCEEDED ***
Exceeded TCAM means software fallback. Check which ACL is the culprit:
sw-infrarunbook-01# show ip access-lists summary
IPV4 ACL BLOCK-THREATS
Total ACEs configured: 2847
Sequence numbers: 10-28470
IPV4 ACL MGMT-ACCESS
Total ACEs configured: 12
Sequence numbers: 10-120
An ACL with 2,847 entries is almost certainly the source of your TCAM pressure. Work with your security team to consolidate entries using object-groups, summarize IP ranges into prefix blocks, or migrate the policy to a dedicated firewall that's purpose-built for large ACL tables.
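If you monitor TCAM utilization programmatically, a simple threshold check over the capacity counters catches this before entries spill to software. A sketch, given (used, total) pairs pulled from show hardware capacity output — the region names here are illustrative:

```python
# Sketch: flag TCAM regions near or over capacity, given (used, total)
# pairs parsed from "show hardware capacity" output. Reports integer
# percent, matching the CLI's display.
def tcam_alerts(regions: dict, warn_pct: float = 90.0):
    alerts = {}
    for name, (used, total) in regions.items():
        pct = 100.0 * used / total
        if pct >= warn_pct:
            alerts[name] = int(pct)
    return alerts

print(tcam_alerts({
    "Ingress IPv4 ACL": (3998, 4000),
    "Egress IPv4 ACL": (2100, 2000),
    "Ingress IPv6 ACL": (120, 1000),
}))
# {'Ingress IPv4 ACL': 99, 'Egress IPv4 ACL': 105}
```

Alerting at 90% gives you time to consolidate entries before the hardware starts spilling.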
Prevention
Most high-CPU incidents on Arista EOS are preventable. The work happens before the incident, not during it.
Set up CPU alerting in your monitoring system. Alert at 60% sustained, page at 80%. Don't wait for control-plane drops to tell you something is wrong — by the time BGP sessions are dropping, you're already in full incident response mode.
Configure logging rate limits on every switch during your initial build. A syslog storm should never be able to monopolize system resources:
sw-infrarunbook-01(config)# logging rate-limit 200
Disable ZTP explicitly on every switch once it's provisioned. Make this part of your standard build checklist and your configuration management template. It takes 10 seconds to prevent the problem permanently:
sw-infrarunbook-01# zerotouch disable
sw-infrarunbook-01# write memory
Apply max-routes on all BGP neighbors so a peer can't accidentally blow up your routing table. A reasonable starting point is 10,000 routes for internal peers and 1,000 for anything you don't fully control. The warning-limit triggers a syslog alert before the hard limit kicks in, giving you time to investigate:
sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.2.1 maximum-routes 10000 warning-limit 8000
Set conservative BFD timers by default and only tune them aggressively on paths you've explicitly validated. A 750ms minimum interval with a 3x multiplier is a solid starting point for most deployments. Never copy aggressive BFD configs from data center core links to WAN-facing interfaces — the path characteristics are completely different, and timers that work on a 10GE local link will cause constant flapping over a 100ms-latency WAN circuit.
Enable link-flap errdisable detection as a standard build item. A single bad SFP shouldn't be able to generate enough syslog events to degrade the entire control plane:
sw-infrarunbook-01(config)# errdisable detect cause link-flap
sw-infrarunbook-01(config)# errdisable recovery cause link-flap
sw-infrarunbook-01(config)# errdisable recovery interval 300
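If you manage configs centrally, a tiny audit over rendered configs catches switches missing these guardrails before they bite. A sketch — the required lines mirror this section's examples, and you'd adjust them to your own standards:

```python
# Sketch: audit a rendered EOS config for the prevention items covered
# above. The required lines mirror this section's examples; adjust the
# list to your own build standards.
REQUIRED = [
    "logging rate-limit",
    "errdisable detect cause link-flap",
    "errdisable recovery cause link-flap",
]

def missing_guardrails(config_text: str):
    lines = config_text.splitlines()
    return [req for req in REQUIRED
            if not any(line.strip().startswith(req) for line in lines)]

print(missing_guardrails("errdisable detect cause link-flap\n"))
# ['logging rate-limit', 'errdisable recovery cause link-flap']
```

Run it against every config your automation renders, and fail the pipeline when the list is non-empty.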
Finally, if you're still running SNMP polling for monitoring, evaluate a migration to streaming telemetry. EOS's gNMI implementation is mature and well-supported, and targeted subscriptions are dramatically more efficient than periodic MIB walks. You'll get better data at lower cost to the switch — and you won't be explaining to management why a monitoring system degraded a production switch.
High CPU on a network switch is never just one of those things. There's always a root cause, and EOS gives you the tools to find it. The commands in this article will get you to the answer in under five minutes for most scenarios. Know them before you need them.
