InfraRunBook

    Arista EOS High CPU Troubleshooting

    Arista
    Published: Apr 17, 2026
    Updated: Apr 17, 2026

    High CPU on an Arista EOS switch can drop BGP sessions, stall OSPF convergence, and make the CLI unusable. This guide walks through every major root cause with real commands and fixes.

    Symptoms

    High CPU on an Arista switch rarely announces itself politely. The first sign is usually SSH becoming sluggish — keystrokes lag, tab completion stalls for two or three seconds, and sometimes you'll get a connection timeout before you even reach the prompt. Control-plane traffic starts dropping. BGP neighbors go down, OSPF adjacencies reset, and suddenly you're getting paged about a network event that started with a single overloaded process.

    The canonical first command is show processes top. Run it (add the once keyword for a single snapshot) and watch for anything consuming more than 20-30% CPU consistently. On a healthy switch under normal load, no single process should be pinning the CPU. Here's what a troubled switch looks like:

    sw-infrarunbook-01# show processes top once
    PID    USER    PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1842   root    20   0  512424  89212  14320 R  87.3  2.1  14:32.41 Bgp
     421   root    20   0  198432  22104   8812 S  12.1  0.5   2:14.22 Syslog
     654   root    20   0   88432  11204   4812 S   4.2  0.3   0:44.12 Ospf
       1   root    20   0   41320   5012   3412 S   0.1  0.1   0:02.14 init

    That BGP process at 87% is a serious problem. In my experience, once a process sustains above 50%, the switch is already struggling to keep up with control-plane work, and it won't recover on its own.
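
    If you collect this output via eAPI or scraped CLI sessions, the columns parse cleanly. Here's a minimal Python sketch that flags hot processes — the sample line and the 30% threshold are illustrative choices, not an EOS API:

```python
# One data row from 'show processes top once', copied from the output above.
SAMPLE = "1842   root    20   0  512424  89212  14320 R  87.3  2.1  14:32.41 Bgp"

def hot_processes(lines, threshold=30.0):
    """Yield (name, cpu_pct) for rows whose %CPU exceeds the threshold."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 12:                      # skip headers / short lines
            cpu, name = float(fields[8]), fields[11]
            if cpu > threshold:
                yield name, cpu

print(list(hot_processes([SAMPLE])))  # [('Bgp', 87.3)]
```

    Feed it the full top output and anything it yields is a candidate for the root-cause hunt below.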

    Other symptoms you'll commonly see alongside high CPU include syslog flooding — thousands of identical messages per second when you run show logging — incrementing drops on management or CPU-facing ports in show interfaces, BGP or OSPF adjacency flaps showing up in show logging last 100, and occasionally ZeroTouch Provisioning still running on a switch that should have long since been provisioned.

    Let's go through the root causes one by one, starting with the ones that cause the most damage.


    Root Cause 1: Routing Protocol Flapping

    OSPF and BGP both generate significant CPU work when neighbors are unstable. Every time a neighbor drops and comes back up, the switch runs SPF calculations, updates the RIB, and reprograms the FIB. If this is happening multiple times per minute, it becomes a CPU death spiral. OSPF is particularly brutal here because SPF runs are synchronous and expensive — the process can't do anything else while it's running the shortest-path tree.

    I've seen this triggered most often by a marginal physical link: intermittent fiber, a bad SFP, or a misconfigured MTU causing OSPF hellos to get dropped sporadically. BGP flapping can also be caused by an overwhelmed peer that isn't responding to keepalives within the negotiated hold time.

    How to Identify It

    Start with syslog and grep for neighbor state transitions:

    sw-infrarunbook-01# show logging last 200 | grep -i "neighbor\|adjac\|state"
    Apr 17 03:14:22 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: INIT
    Apr 17 03:14:28 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: FULL
    Apr 17 03:14:41 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: INIT
    Apr 17 03:14:47 sw-infrarunbook-01 Ospf: %OSPF-4-NEIGHBOR_STATE_CHANGE: Neighbor 10.0.0.1 (Ethernet3) is now: FULL

    That INIT → FULL → INIT → FULL pattern repeating every 20 seconds is the smoking gun. Pull the OSPF neighbor detail to see how many state changes have happened:

    sw-infrarunbook-01# show ip ospf neighbor 10.0.0.1 detail
    Neighbor 10.0.0.1, interface address 10.0.0.1
      In the area 0.0.0.0 via interface Ethernet3
      Neighbor priority is 1, State is Full, 6 state changes
      DR is 10.0.0.2, BDR is 10.0.0.1
      Options is 0x12 -|-|-|-|-|-|E|-
      Dead timer due in 00:00:38
      Last hello received 00:00:02 ago

    Six state changes on what should be a stable link is a clear sign of instability. Check BGP as well — Up/Down values under a minute mean those sessions just reset:

    sw-infrarunbook-01# show ip bgp summary
    BGP summary information for VRF default
    Router identifier 10.0.1.1, local AS number 65001
    Neighbor          AS Session State AFI/SAFI    Up/Down   State Reason
    10.0.2.1       65002 Established   IPv4 Unicast  00:00:43
    10.0.3.1       65003 Established   IPv4 Unicast  00:00:12
    10.0.4.1       65004 Idle          IPv4 Unicast  00:14:22  NoReason
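
    If you collect these logs centrally, a few lines of Python turn raw state-change messages into a flap rate worth alerting on. The slicing below assumes classic syslog timestamps in the first 15 characters; message bodies are truncated here for brevity:

```python
from datetime import datetime

# Classic syslog timestamps ("Apr 17 03:14:22") occupy the first 15 chars.
LINES = [
    "Apr 17 03:14:22 sw-infrarunbook-01 Ospf: ... INIT",
    "Apr 17 03:14:28 sw-infrarunbook-01 Ospf: ... FULL",
    "Apr 17 03:14:41 sw-infrarunbook-01 Ospf: ... INIT",
    "Apr 17 03:14:47 sw-infrarunbook-01 Ospf: ... FULL",
]

def transitions_per_minute(lines):
    """Rate of neighbor state-change events across the sampled window."""
    times = [datetime.strptime(line[:15], "%b %d %H:%M:%S") for line in lines]
    span = (times[-1] - times[0]).total_seconds() or 1.0
    return 60.0 * (len(lines) - 1) / span

print(round(transitions_per_minute(LINES), 1))  # 7.2
```

    Seven-plus transitions per minute on a single adjacency means the switch is spending most of its control-plane budget on SPF reruns.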

    How to Fix It

    Fix the underlying physical problem first. Check the interface error counters:

    sw-infrarunbook-01# show interfaces Ethernet3 counters errors
    Port         Align-Err    FCS-Err   Xmit-Err    Rcv-Err  UnderSize  OutDiscards
    Et3                  0       4821           0       4821          0            0

    FCS errors confirm a physical-layer problem — bad cable, SFP, or fiber. Replace the hardware. While you're waiting, increase the OSPF dead interval on the flapping interface to give yourself breathing room (hello and dead intervals must match on both ends of the link, so make the same change on the neighbor):

    sw-infrarunbook-01(config)# interface Ethernet3
    sw-infrarunbook-01(config-if-Et3)# ip ospf dead-interval 60
    sw-infrarunbook-01(config-if-Et3)# ip ospf hello-interval 15

    For BGP, enable route-flap dampening so prefixes that flap repeatedly are suppressed instead of triggering RIB churn on every announcement and withdrawal:

    sw-infrarunbook-01(config)# router bgp 65001
    sw-infrarunbook-01(config-router-bgp)# bgp dampening
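
    Dampening works by assigning each route flap a penalty that decays exponentially and suppressing the route while the penalty sits above a threshold. The constants below (1000 points per flap, 15-minute half-life, suppress around 2000, reuse around 750) are common defaults on other platforms, used here purely to illustrate the decay math — check your EOS release for its actual values:

```python
def penalty_after(flaps: int, quiet_minutes: float,
                  per_flap: float = 1000.0, half_life_min: float = 15.0) -> float:
    """Accumulated dampening penalty after a burst of flaps followed
    by quiet time; it halves every half_life_min minutes."""
    return flaps * per_flap * 0.5 ** (quiet_minutes / half_life_min)

# Three quick flaps (penalty 3000, well above a 2000 suppress
# threshold), then 30 quiet minutes = two half-lives:
print(penalty_after(3, 30.0))  # 750.0 -- back down at a typical reuse limit
```

    The takeaway: a route that flaps a few times gets benched for roughly half an hour, which is exactly the breathing room a busy control plane needs.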

    Root Cause 2: Large BGP Table

    If you're peering with an upstream provider and accepting a full internet routing table — around 950,000 prefixes as of 2026 — the BGP process on your switch is doing a huge amount of work just maintaining that table, running best-path selection, and keeping peers synchronized. On lower-end Arista platforms that weren't sized for this, it'll peg the CPU during convergence events. Even on capable hardware, accepting an unfiltered full table from multiple peers compounds the problem considerably.

    This often creeps up gradually. The table grew slowly over months, and the CPU followed. Operators don't notice until something triggers a full BGP reconvergence — a peer reset, a software upgrade, a link flap — and suddenly the switch is completely swamped.

    How to Identify It

    sw-infrarunbook-01# show ip bgp summary
    BGP summary information for VRF default
    Router identifier 10.0.1.1, local AS number 65001
    Neighbor          AS Session State AFI/SAFI    Pfx Rcvd   Up/Down
    10.0.2.1       65002 Established   IPv4 Unicast    952847   5d03h
    10.0.3.1       65003 Established   IPv4 Unicast    948221   5d03h

    Almost a million prefixes from two peers — that's the full internet table, twice over. Check what BGP is doing to memory:

    sw-infrarunbook-01# show processes top once | grep -i bgp
    1842   root    20   0  2.1g   1.4g  14320 R  72.3 34.1  94:32.41 Bgp

    1.4 GB of resident memory for BGP alone. Also check your RIB size to confirm the scale of the problem:

    sw-infrarunbook-01# show ip route summary
    Operating routing protocol model: multi-agent
    Maximum number of ecmp paths allowed: 4
    
      Connected: 12 prefixes (12 paths)
      Static: 4 prefixes (4 paths)
      BGP: 952847 prefixes (952847 paths)
      OSPF: 42 prefixes (84 paths)
    
    Total: 952905 prefixes (952947 paths)

    How to Fix It

    The right fix is route filtering. Unless this switch specifically needs to make forwarding decisions based on a full internet table, configure a prefix list and route map to accept only what you need — typically a default route plus any specific prefixes you use for traffic engineering:

    sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 5 permit 0.0.0.0/0
    sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 10 permit 10.0.0.0/8 le 32
    sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 15 permit 172.16.0.0/12 le 32
    sw-infrarunbook-01(config)# ip prefix-list FILTER-FULL-TABLE seq 20 permit 192.168.0.0/16 le 32
    
    sw-infrarunbook-01(config)# route-map ACCEPT-LIMITED permit 10
    sw-infrarunbook-01(config-route-map-ACCEPT-LIMITED)# match ip address prefix-list FILTER-FULL-TABLE
    
    sw-infrarunbook-01(config)# router bgp 65001
    sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.2.1 route-map ACCEPT-LIMITED in
    sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.3.1 route-map ACCEPT-LIMITED in
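
    The le 32 qualifier trips people up: it permits any prefix that falls inside the base network with a mask length anywhere from the base's own length up to 32; everything else hits the route map's implicit deny. A small Python sketch of the matching rule (my own helper for illustration, not an EOS API):

```python
import ipaddress

def matches_le(candidate: str, base: str, le: int) -> bool:
    """Emulate 'ip prefix-list ... permit BASE le LE': the candidate
    must fall inside BASE with a mask length in [len(BASE), LE]."""
    cand = ipaddress.ip_network(candidate)
    base_net = ipaddress.ip_network(base)
    return cand.subnet_of(base_net) and cand.prefixlen <= le

print(matches_le("10.42.0.0/16", "10.0.0.0/8", 32))  # True: inside 10/8
print(matches_le("11.0.0.0/8", "10.0.0.0/8", 32))    # False: outside the base
```
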

    After applying the route map, do a soft reset so you don't drop and re-establish the sessions:

    sw-infrarunbook-01# clear ip bgp 10.0.2.1 soft in
    sw-infrarunbook-01# clear ip bgp 10.0.3.1 soft in

    Also set a max-routes limit as a safety net so this can't quietly happen again:

    sw-infrarunbook-01(config)# router bgp 65001
    sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.2.1 maximum-routes 10000 warning-limit 8000

    Root Cause 3: Interface Errors Causing a Log Storm

    This one is sneaky. A single flapping interface — or one generating continuous CRC errors — can flood syslog at a rate that overwhelms the Syslog process itself. The switch spends more CPU writing log messages than doing actual network work. I've walked into situations where the Syslog process was sitting at 40% CPU and every other process was starved for cycles. The culprit was a single bad SFP toggling link state hundreds of times per hour.

    The tricky part is that syslog storms mask the real cause. You'll see Syslog high in show processes top, but that's a symptom. The root cause is always the event generating the messages.

    How to Identify It

    sw-infrarunbook-01# show logging last 50 | grep -i "link\|down\|up"
    Apr 17 03:22:01 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to down
    Apr 17 03:22:02 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to up
    Apr 17 03:22:04 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to down
    Apr 17 03:22:05 sw-infrarunbook-01 Kernel: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet12, changed state to up

    Two link flaps per second. Now check the interface error counters to confirm the physical problem:

    sw-infrarunbook-01# show interfaces Ethernet12
    Ethernet12 is up, line protocol is up (connected)
      Hardware is Ethernet, address is 001c.7300.1234
      Last clearing of "show interface" counters: 1d02h
      Input statistics:
        0 runts, 0 giants, 0 throttles
        18943 input errors, 18943 CRC, 0 alignment
        0 symbol, 0 input discard
      Output statistics:
        0 output errors

    18,943 CRC errors since the last counter clear. That's your culprit. You can also confirm by watching the Syslog process directly:

    sw-infrarunbook-01# show processes top once | grep -i syslog
     421   root    20   0  198432  22104   8812 S  38.4  0.5   2:14.22 Syslog
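
    A single counter snapshot tells you how many errors have accumulated, but the rate is what tells you how bad things are right now. Sampling the counter twice gives you errors per second — the numbers below are illustrative:

```python
def error_rate(count_t0: int, count_t1: int, seconds: float) -> float:
    """Errors per second between two readings of a monotonic counter."""
    return (count_t1 - count_t0) / seconds

# 18943 CRC errors at the first reading, 19543 ten minutes later:
print(error_rate(18943, 19543, 600))  # 1.0 errors/sec -- actively failing
```

    A link adding errors every second is failing now; a large but static count may be history from an incident already fixed.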

    How to Fix It

    The permanent fix is replacing the bad SFP, cable, or far-end transceiver. But while you're waiting for hardware, shut the offending interface to stop the storm immediately:

    sw-infrarunbook-01(config)# interface Ethernet12
    sw-infrarunbook-01(config-if-Et12)# shutdown

    For future protection, configure logging rate limiting so no single event source can ever monopolize the system:

    sw-infrarunbook-01(config)# logging rate-limit 100
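
    Conceptually, a logging rate limit behaves like a fixed-window counter: messages beyond the per-second budget are dropped. A toy Python sketch of the idea (how EOS implements it internally may differ):

```python
class LogRateLimiter:
    """Fixed-window limiter: at most 'limit' messages per one-second
    window; everything over budget in that window is dropped."""
    def __init__(self, limit: int):
        self.limit = limit
        self.window = None
        self.count = 0

    def allow(self, timestamp: float) -> bool:
        window = int(timestamp)          # one-second buckets
        if window != self.window:
            self.window, self.count = window, 0
        self.count += 1
        return self.count <= self.limit

# A storm of 1000 messages inside a single second:
rl = LogRateLimiter(limit=100)
accepted = sum(rl.allow(0.001 * i) for i in range(1000))
print(accepted)  # 100 -- the other 900 never reach the Syslog process
```
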

    Configure errdisable link-flap detection as well. This will automatically disable an interface that's flapping beyond a threshold, preventing it from generating continuous log events:

    sw-infrarunbook-01(config)# errdisable detect cause link-flap
    sw-infrarunbook-01(config)# errdisable recovery cause link-flap
    sw-infrarunbook-01(config)# errdisable recovery interval 300

    Root Cause 4: ZTP Still Running

    ZeroTouch Provisioning is an EOS feature that lets a switch bootstrap itself from a provisioning server on first boot — it fires up, hits DHCP, grabs a config script, and configures itself without anyone touching the CLI. Extremely useful. The problem is when ZTP is still active on a switch that's already deployed and operational, or on a switch that can't reach its provisioning server. ZTP will keep retrying DHCP, polling for configuration, logging failures, and all of that activity burns CPU and fills up logs. Don't laugh — I've seen production switches running ZTP continuously for weeks because nobody noticed.

    This happens most often after a reload where the startup config was lost or corrupted, after someone accidentally ran write erase, or on new switches that were powered up and connected to the network without completing the provisioning flow.

    How to Identify It

    sw-infrarunbook-01# show zerotouch
    ZeroTouch State: Active
    ZeroTouch Config: Provisioning server unreachable
    Last ZeroTouch Attempt: 00:02:14 ago
    ZeroTouch script: Not downloaded

    That "Active" state on a switch that's already configured is the problem. You'll also see ZTP in the process list consuming CPU:

    sw-infrarunbook-01# show processes top once | grep -i ztp
     312   root    20   0   32124   8012   3412 S  14.2  0.2   2:11.44 ZeroTouch

    And syslog will be full of DHCP and HTTP retry noise at a regular interval:

    sw-infrarunbook-01# show logging | grep -i ztp
    Apr 17 03:30:01 sw-infrarunbook-01 ZeroTouch: %ZTP-5-DHCP_ATTEMPT: Attempting DHCP on Management1
    Apr 17 03:30:07 sw-infrarunbook-01 ZeroTouch: %ZTP-3-DHCP_FAILED: DHCP failed on Management1
    Apr 17 03:30:07 sw-infrarunbook-01 ZeroTouch: %ZTP-3-PROVISION_FAILED: Provisioning attempt failed, retrying in 120s

    How to Fix It

    This is the easiest fix on this list. Cancel ZTP and the process stops immediately:

    sw-infrarunbook-01# zerotouch cancel
    ZeroTouch: Cancelling ZeroTouch
    ZeroTouch: Disabled

    Verify it stopped:

    sw-infrarunbook-01# show zerotouch
    ZeroTouch State: Disabled

    To permanently disable ZTP so it never activates again after a reload, run:

    sw-infrarunbook-01# zerotouch disable

    Then save the running config so the disable state persists across reboots:

    sw-infrarunbook-01# write memory

    Root Cause 5: BFD Sessions Flapping

    Bidirectional Forwarding Detection is designed to provide fast failure detection — subsecond, when configured aggressively. But BFD's speed is also its weakness. When BFD timers are set too low relative to what the underlying path can reliably support, sessions will flap. And unlike a simple keepalive, every BFD session flap triggers protocol events: BGP peers go down, OSPF adjacencies reset, static routes disappear. All of that reconvergence work hammers the CPU in a compounding loop — BFD flap causes BGP reset which causes RIB churn which causes FIB reprogramming, all while BFD is flapping again.

    In my experience, this almost always happens after someone tuned BFD timers aggressively to improve failover speed without fully characterizing the path. A congested uplink, a hypervisor host briefly pausing during vMotion, or even high-frequency garbage collection on a software BGP speaker can be enough to miss BFD hellos at 300ms intervals.

    How to Identify It

    sw-infrarunbook-01# show bfd peers
    VRF name: default
    -----------------
    DstAddr         MyDisc   YourDisc Interface/Transport    Type          LastUp
    10.0.2.1    3120498932 2847612301 Ethernet1              normal    04/17 02:44:01
    10.0.3.1    1234098123  987123450 Ethernet2              normal    04/17 03:29:47
    10.0.4.1     871234981 1234987123 Ethernet3              normal    04/17 03:30:01

    The timestamps tell part of the story — sessions came up just minutes apart. Get the full detail including up/down transition counts:

    sw-infrarunbook-01# show bfd peers detail
    VRF name: default
    -----------------
    DstAddr: 10.0.4.1
      State: Up, Timer Multiplier: 3, BFD Type: normal
      Tx Interval: 300 ms, Rx Interval: 300 ms
      Registered Protocols: BGP
      Up/Down: 83/82
      Last State Change: 04/17 03:30:01

    83 up/down transitions for a single BFD peer. That's catastrophic — each one of those triggered BGP reconvergence. Cross-reference with syslog to see the frequency:

    sw-infrarunbook-01# show logging | grep -i bfd
    Apr 17 03:30:01 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Up
    Apr 17 03:29:58 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Down
    Apr 17 03:29:55 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Up
    Apr 17 03:29:52 sw-infrarunbook-01 Bfd: %BFD-6-STATE_CHANGE: Peer 10.0.4.1 changed to Down

    Flapping every 3 seconds. The CPU is running BGP reconvergence faster than it can complete a single cycle.

    How to Fix It

    Back off the BFD timers to something the path can reliably sustain. The default 300ms minimum interval with a 3x multiplier gives you a 900ms detection time. That's already aggressive for anything but a local Ethernet segment. Bump it up:

    sw-infrarunbook-01(config)# router bgp 65001
    sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.4.1 bfd interval 750 min-rx 750 multiplier 3

    That gives you a 2.25-second detection window — still fast, but far more tolerant of brief path delays. If BFD isn't strictly required for a given neighbor, disable it entirely until you can characterize and fix the path:

    sw-infrarunbook-01(config)# router bgp 65001
    sw-infrarunbook-01(config-router-bgp)# no neighbor 10.0.4.1 bfd
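
    The detection-time arithmetic behind these choices is simple: BFD declares a session down after the multiplier's worth of consecutive missed packets at the negotiated interval (per RFC 5880). A quick sketch:

```python
def bfd_detection_ms(min_interval_ms: int, multiplier: int) -> int:
    """BFD detection time: 'multiplier' consecutive missed packets
    at the negotiated interval before the session is declared down."""
    return min_interval_ms * multiplier

print(bfd_detection_ms(300, 3))  # 900  -- the aggressive default window
print(bfd_detection_ms(750, 3))  # 2250 -- the 2.25 s window after backing off
```

    Pick the interval so the detection window comfortably exceeds the worst-case pause you've actually measured on the path.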

    After adjusting timers, verify the session stabilizes using the counters output:

    sw-infrarunbook-01# show bfd peers counters
    VRF name: default
    -----------------
    DstAddr         LastDown       LastUp    FailedTx  FailedRx  TimeoutTx  TimeoutRx
    10.0.4.1  04/17 03:30:01  04/17 03:35:12         0         0          0          0

    No new failures after 03:35 — the session has been stable since the timer adjustment.


    Root Cause 6: SNMP Polling Overload

    An SNMP management system polling a switch every 30 seconds while walking the full MIB tree will consume a surprisingly large amount of CPU. Multiply that across OID walks from multiple monitoring systems, and snmpd can become a real contributor to sustained high CPU. This is particularly true for MIBs that require building large response tables — interface statistics across a 96-port switch, the full routing table via ipCidrRouteTable, or BGP4 MIB walks across a large peering table.

    How to Identify It

    sw-infrarunbook-01# show processes top once | grep -i snmp
     892   root    20   0  198432  44212   8812 S  28.4  1.1   8:34.22 snmpd
    sw-infrarunbook-01# show snmp
    Chassis: FCW2142L05H
    Contact: infrarunbook-admin@solvethenetwork.com
    Location: DC1-Row4-Rack12
    
    SNMP packets input: 18293821
      Bad SNMP version errors: 0
      Unknown community string: 0
      Get-request PDUs: 142304
      Get-next PDUs: 18151517
      Set-request PDUs: 0
    SNMP packets output: 18293821

    Over 18 million get-next PDUs is significant — more than 100x the get-request count, which is the classic sign of MIB walks rather than targeted polls. The switch is building full response tables for each walk.
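
    The get-next-to-get ratio is worth tracking over time: a sustained jump usually means a new poller has started walking the tree. The counter values below come from the show snmp output above:

```python
def walk_ratio(get_next_pdus: int, get_request_pdus: int) -> float:
    """Ratio of get-next to get-request PDUs; a large value suggests
    full MIB walks rather than targeted polls."""
    return get_next_pdus / get_request_pdus

print(round(walk_ratio(18151517, 142304), 1))  # 127.6
```
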

    How to Fix It

    Coordinate with your monitoring team to reduce polling frequency and target specific OIDs instead of doing full MIB walks. On the switch side, you can mark SNMP traffic as best-effort so QoS and control-plane policing deprioritize it relative to routing-protocol traffic:

    sw-infrarunbook-01(config)# snmp-server qos dscp 0

    Longer term, moving from SNMP polling to streaming telemetry via gNMI is the right answer. EOS has excellent gNMI support, and subscribing to specific paths at a defined interval is far more efficient than repeated MIB walks. The switch doesn't have to build response tables on demand, and you'll get sub-second granularity on the metrics that matter.


    Root Cause 7: ACL TCAM Overflow Causing Software Forwarding

    Very large ACLs that exceed TCAM capacity force traffic evaluation up to the CPU for software-based policy processing. On Arista, ACL processing is normally done entirely in hardware via TCAM — it's essentially free from a CPU perspective. But when your ACL entry count exceeds what the hardware can hold, EOS starts spilling entries into software, and any traffic that matches those software entries has to be handled by the CPU.

    How to Identify It

    sw-infrarunbook-01# show hardware capacity
    ...
      TCAM:
        Ingress IPv4 ACL entries : 3998/4000 (99%)
        Egress IPv4 ACL entries  : 2100/2000 (105%) *** EXCEEDED ***

    Exceeded TCAM means software fallback. Check which ACL is the culprit:

    sw-infrarunbook-01# show ip access-lists summary
    IPV4 ACL BLOCK-THREATS
      Total ACEs configured: 2847
      Sequence numbers: 10-28470
    
    IPV4 ACL MGMT-ACCESS
      Total ACEs configured: 12
      Sequence numbers: 10-120

    An ACL with 2,847 entries is almost certainly the source of your TCAM pressure. Work with your security team to consolidate entries using object-groups, summarize IP ranges into prefix blocks, or migrate the policy to a dedicated firewall that's purpose-built for large ACL tables.
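
    If you scrape show hardware capacity across a fleet, flagging exceeded banks is a one-liner. The dict below mirrors the rows shown earlier and is illustrative, not an EOS API:

```python
# Parsed (used, total) pairs per TCAM bank, mirroring the CLI output.
tcam = {
    "Ingress IPv4 ACL entries": (3998, 4000),
    "Egress IPv4 ACL entries": (2100, 2000),
}

def tcam_status(used: int, total: int) -> str:
    """Flag banks that have spilled past their hardware capacity."""
    return "EXCEEDED" if used > total else "ok"

for name, (used, total) in tcam.items():
    print(f"{name}: {100 * used / total:.1f}% {tcam_status(used, total)}")
```

    Anything reporting EXCEEDED is a candidate for software forwarding and deserves the same scrutiny as a pegged process.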


    Prevention

    Most high-CPU incidents on Arista EOS are preventable. The work happens before the incident, not during it.

    Set up CPU alerting in your monitoring system. Alert at 60% sustained, page at 80%. Don't wait for control-plane drops to tell you something is wrong — by the time BGP sessions are dropping, you're already in full incident response mode.
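
    The threshold logic is simple enough to encode directly in whatever alerting pipeline you use; the 60/80 split below matches the suggestion above and should be tuned for your fleet:

```python
def alert_level(cpu_pct: float, warn: float = 60.0, page: float = 80.0) -> str:
    """Map a sustained CPU reading to a severity bucket."""
    if cpu_pct >= page:
        return "page"
    if cpu_pct >= warn:
        return "warn"
    return "ok"

print(alert_level(87.3))  # page -- the Bgp process from the Symptoms section
print(alert_level(45.0))  # ok
```

    The key word is sustained: alert on a rolling average, not a single sample, or a transient SPF run will page you at 3 a.m. for nothing.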

    Configure logging rate limits on every switch during your initial build. A syslog storm should never be able to monopolize system resources:

    sw-infrarunbook-01(config)# logging rate-limit 200

    Disable ZTP explicitly on every switch once it's provisioned. Make this part of your standard build checklist and your configuration management template. It takes 10 seconds to prevent the problem permanently:

    sw-infrarunbook-01# zerotouch disable
    sw-infrarunbook-01# write memory

    Apply max-routes on all BGP neighbors so a peer can't accidentally blow up your routing table. A reasonable starting point is 10,000 routes for internal peers and 1,000 for anything you don't fully control. The warning-limit triggers a syslog alert before the hard limit kicks in, giving you time to investigate:

    sw-infrarunbook-01(config-router-bgp)# neighbor 10.0.2.1 maximum-routes 10000 warning-limit 8000

    Set conservative BFD timers by default and only tune them aggressively on paths you've explicitly validated. A 750ms minimum interval with a 3x multiplier is a solid starting point for most deployments. Never copy aggressive BFD configs from data center core links to WAN-facing interfaces — the path characteristics are completely different, and timers that work on a 10GE local link will cause constant flapping over a 100ms-latency WAN circuit.

    Enable link-flap errdisable detection as a standard build item. A single bad SFP shouldn't be able to generate enough syslog events to degrade the entire control plane:

    sw-infrarunbook-01(config)# errdisable detect cause link-flap
    sw-infrarunbook-01(config)# errdisable recovery cause link-flap
    sw-infrarunbook-01(config)# errdisable recovery interval 300

    Finally, if you're still running SNMP polling for monitoring, evaluate a migration to streaming telemetry. EOS's gNMI implementation is mature and well-supported, and targeted subscriptions are dramatically more efficient than periodic MIB walks. You'll get better data at lower cost to the switch — and you won't be explaining to management why a monitoring system degraded a production switch.

    High CPU on a network switch is never just one of those things. There's always a root cause, and EOS gives you the tools to find it. The commands in this article will get you to the answer in under five minutes for most scenarios. Know them before you need them.

    Frequently Asked Questions

    How do I check which process is causing high CPU on an Arista EOS switch?

    Run 'show processes top once' from the EOS CLI. This gives you a snapshot of all running processes sorted by CPU usage. Look for any process above 20-30% sustained CPU — common offenders include Bgp, Syslog, Ospf, ZeroTouch, and snmpd. Running it without 'once' gives you a live refreshing view.

    How do I stop ZTP from running on an Arista switch?

    Run 'zerotouch cancel' from the privileged exec prompt to stop an active ZTP process immediately. To permanently prevent ZTP from activating after future reloads, run 'zerotouch disable' followed by 'write memory'. Verify with 'show zerotouch' — the state should show Disabled.

    What causes BFD sessions to keep flapping on Arista EOS?

    BFD flapping is almost always caused by BFD timers set too aggressively for the underlying path. If the path has variable latency — due to congestion, a WAN circuit, or a hypervisor host pausing briefly — BFD hellos get missed and the session drops. Fix it by increasing the BFD interval: 'neighbor <ip> bfd interval 750 min-rx 750 multiplier 3' in router BGP config gives a 2.25-second detection window that's far more tolerant of brief delays.

    How do I reduce BGP CPU usage on Arista EOS?

    The most effective approach is route filtering — use a prefix list and route map to limit what prefixes you accept from each peer. If you're accepting a full internet table but don't need one, filter it down to a default route plus your internal prefixes. Also configure 'maximum-routes' on each neighbor as a safety limit. After applying route maps, use 'clear ip bgp <peer> soft in' to activate them without dropping sessions.

    Can an interface with CRC errors cause high CPU on Arista?

    Yes — a single interface generating continuous CRC errors and link-state flaps can flood syslog at a rate that overwhelms the Syslog process itself, consuming 30-40% CPU. Shut down the offending interface immediately to stop the log storm, then replace the hardware causing the errors. Configure 'logging rate-limit' and 'errdisable detect cause link-flap' to prevent this class of problem in the future.
