InfraRunBook

    ECMP (Equal Cost Multipath) Routing Explained

    Networking
    Published: Apr 8, 2026
    Updated: Apr 8, 2026

    ECMP (Equal Cost Multipath) routing lets routers forward traffic across multiple paths simultaneously, delivering bandwidth aggregation and built-in redundancy. This guide covers how it works, where it matters, and the operational gotchas that catch engineers off guard.


    What Is ECMP?

    Equal Cost Multipath Routing — ECMP — is the mechanism that lets a router or switch forward traffic toward a single destination using multiple paths simultaneously, provided those paths carry the same cost metric. It sounds straightforward on paper, but the nuances in how traffic gets distributed, how failures get handled, and how different protocols implement it are where things get genuinely interesting.

    Without ECMP, when your routing protocol discovers two equal-cost paths to a prefix, it picks one and discards the other. You lose half your available bandwidth to that destination, and if the active path fails, you wait for convergence. ECMP flips that model entirely — both paths stay active in the Forwarding Information Base (FIB), and the router spreads traffic across them. You get bandwidth aggregation and built-in redundancy at the same time, which is a rare combination in networking where you usually have to choose.

    I've seen network engineers conflate ECMP with link bonding or LAG (Link Aggregation Groups). They're related in spirit but different in implementation. LAG operates at Layer 2, bundling physical interfaces into a single logical one. ECMP operates at Layer 3, distributing routed traffic across independent next-hops. Both hash traffic to pick a path, and neither guarantees perfect balance — but they solve different problems at different layers of the stack.

    How ECMP Works Under the Hood

    The core mechanism works like this: the routing protocol — whether it's OSPF, IS-IS, BGP, or even static routes — identifies multiple next-hops for a given prefix that share the same cost. The router installs all of those next-hops into the FIB as equal candidates. When a packet arrives, the forwarding plane has to choose one of those next-hops for this particular flow.

    That choice is made via a hash function. The router takes some combination of packet header fields, runs them through a hash algorithm, and maps the result to one of the available paths. The fields used in the hash vary by implementation, but the most common combination is the 5-tuple: source IP address, destination IP address, IP protocol number, source port, and destination port. Some implementations also fold in VLAN tags, MPLS labels, or IPv6 flow labels to improve distribution. The more entropy in the hash inputs, the more evenly flows spread across paths.

    The hash output is typically fed into a modulo operation against the number of available paths. If you have four equal-cost next-hops and the hash result mod 4 equals 2, traffic goes out the third path. Every packet belonging to the same TCP connection will carry the same 5-tuple, so they all hash identically and follow the same path. This is per-flow load balancing, and it's the dominant behavior in modern switching and routing ASICs.
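    The per-flow selection described above can be sketched in a few lines. This is a minimal illustration assuming a SHA-256-based hash and plain modulo selection, not any vendor's actual ASIC algorithm:

```python
# Illustrative sketch of per-flow ECMP path selection: hash the
# 5-tuple, then take the result modulo the number of next-hops.
# (Not any vendor's ASIC algorithm; SHA-256 is an assumption here.)
import hashlib

def select_path(src_ip, dst_ip, proto, src_port, dst_port, num_paths):
    """Map a flow's 5-tuple to one of num_paths next-hop indexes."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    return hash_value % num_paths

# Every packet of the same TCP connection carries the same 5-tuple,
# so it always maps to the same path:
path = select_path("10.1.1.10", "10.2.2.20", 6, 49152, 443, 4)
assert path == select_path("10.1.1.10", "10.2.2.20", 6, 49152, 443, 4)
```

    Because the mapping is purely a function of the header fields, no per-flow state needs to be kept in the forwarding plane.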

    Some older implementations and certain niche use cases use per-packet load balancing, where each individual packet gets hashed independently — or even round-robined — across the available paths. Per-packet sounds more efficient on the surface, but it's catastrophic for TCP. Packets arrive out of order, the receiver's TCP stack interprets reordering as congestion, it backs off its window, and your throughput craters. For anything running TCP at scale, per-flow hashing is the only sensible choice.
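    A toy model makes the reordering problem concrete. It assumes two paths with different fixed latencies and strict round-robin spraying; real path delays vary, but the effect is the same:

```python
# Toy model of why per-packet spraying reorders TCP segments: two
# paths with different latencies, packets alternated round-robin.
# (Illustrative only; latencies and send spacing are assumptions.)
def arrival_order(num_packets, path_latency=(1.0, 2.5)):
    """Return packet sequence numbers in the order they arrive."""
    arrivals = []
    for seq in range(num_packets):
        path = seq % len(path_latency)   # round-robin spraying
        send_time = seq * 0.1            # packets sent 0.1 ms apart
        arrivals.append((send_time + path_latency[path], seq))
    return [seq for _, seq in sorted(arrivals)]

print(arrival_order(8))  # → [0, 2, 4, 6, 1, 3, 5, 7]
```

    The receiver sees every segment that took the slow path arrive after a burst of later-sequenced segments from the fast path, which TCP interprets as loss-like reordering.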

    The FIB and ECMP Groups

    At the hardware level, the FIB doesn't store a single next-hop per prefix when ECMP is active. Instead, a prefix entry points to an ECMP group — a data structure containing the full set of valid next-hops with their associated output interfaces and next-hop MAC addresses. The ASIC hashes each incoming packet and indexes into the group to select the forwarding entry. The whole operation happens in the fast path at line rate; there's no software involvement once the group is programmed.
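    Conceptually, the structure looks like the sketch below. The names (EcmpGroup, NextHop) and layout are illustrative, not a real ASIC SDK:

```python
# Conceptual model of a FIB with ECMP groups: a prefix points to a
# group, and the packet hash indexes into the group's next-hop list.
# (Illustrative data model, not real forwarding-plane software.)
from dataclasses import dataclass

@dataclass(frozen=True)
class NextHop:
    ip: str          # next-hop IP address
    interface: str   # output interface
    mac: str         # resolved next-hop MAC address

@dataclass
class EcmpGroup:
    members: list    # NextHop entries, all equal cost

    def select(self, flow_hash: int) -> NextHop:
        # In hardware this lookup happens at line rate with no
        # software involvement once the group is programmed.
        return self.members[flow_hash % len(self.members)]

fib = {
    "10.10.0.0/16": EcmpGroup([
        NextHop("192.168.1.2", "Gi0/1", "00:1a:2b:3c:4d:01"),
        NextHop("192.168.2.2", "Gi0/2", "00:1a:2b:3c:4d:02"),
        NextHop("192.168.3.2", "Gi0/3", "00:1a:2b:3c:4d:03"),
        NextHop("192.168.4.2", "Gi0/4", "00:1a:2b:3c:4d:04"),
    ])
}

# A flow hashing to 6 in a 4-member group lands on index 2 (third path)
print(fib["10.10.0.0/16"].select(6).interface)  # → Gi0/3
```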

    Most platforms enforce a hard limit on paths per ECMP group. Common limits are 4, 8, 16, 32, or 64 depending on hardware generation and TCAM capacity. Cisco IOS defaults to 4 ECMP paths for OSPF but can be raised. Arista EOS and Juniper Junos both support 64 or more paths on modern platforms. Knowing your platform's limit matters — in large spine-leaf fabrics you can easily design a topology that exceeds the default and get silently truncated ECMP groups as a result.

    # Cisco IOS-XE — raise OSPF maximum paths on sw-infrarunbook-01
    router ospf 1
     maximum-paths 8
    
    # Verify the installed FIB entry
    show ip route 10.10.0.0 255.255.0.0
    Routing entry for 10.10.0.0/16
      Known via "ospf 1", distance 110, metric 20, type intra area
      Last update from 192.168.1.2 on GigabitEthernet0/1, 00:04:12 ago
      Routing Descriptor Blocks:
      * 192.168.1.2, from 192.168.10.1, via GigabitEthernet0/1
        Route metric is 20
        192.168.2.2, from 192.168.10.2, via GigabitEthernet0/2
        Route metric is 20
        192.168.3.2, from 192.168.10.3, via GigabitEthernet0/3
        Route metric is 20
        192.168.4.2, from 192.168.10.4, via GigabitEthernet0/4
        Route metric is 20

    ECMP Across Different Protocols

    OSPF and IS-IS

    In link-state protocols like OSPF and IS-IS, ECMP emerges naturally when the SPF algorithm produces multiple shortest paths of equal total cost to a destination. You don't configure ECMP directly — you configure your link costs so that multiple paths end up with the same total metric. If you want four paths to be equal, every path's cumulative cost from source to destination must be identical.

    In practice this is easy to achieve with disciplined interface cost assignment. Where it gets complicated is when your topology evolves organically and someone installs a link with a slightly different speed or cost, inadvertently breaking cost symmetry. In my experience, this is one of the most common silent killers of ECMP in production — someone upgrades a 1G link to 10G, the auto-calculated OSPF reference bandwidth changes the cost, and suddenly one of your four equal paths is no longer equal. Traffic shifts to three paths, the fourth goes idle, and nobody notices until a link starts saturating.

    # Verify OSPF interface costs on sw-infrarunbook-01
    # All uplinks must show identical cost values for ECMP
    show ip ospf interface brief
    Interface    PID   Area      IP Address/Mask     Cost  State  Nbrs F/C
    Gi0/0        1     0.0.0.0   192.168.1.1/30      4     P2P    1/1
    Gi0/1        1     0.0.0.0   192.168.2.1/30      4     P2P    1/1
    Gi0/2        1     0.0.0.0   192.168.3.1/30      4     P2P    1/1
    Gi0/3        1     0.0.0.0   192.168.4.1/30      4     P2P    1/1

    BGP Multipath

    BGP ECMP is more nuanced because BGP's path selection process is explicitly designed to choose exactly one best path. Multipath is an opt-in feature, and the conditions under which paths are considered equal are considerably stricter than in IGPs.

    For iBGP multipath, candidate paths must share the same IGP metric to the next-hop, the same local preference, the same AS path length, and the same MED, and they must arrive from different next-hop addresses. For eBGP multipath, paths additionally need to originate from different neighboring ASes — unless you configure maximum-paths eibgp, which relaxes that requirement and allows mixing iBGP and eBGP paths in the same ECMP group.

    In modern data center spine-leaf designs running BGP everywhere — the model described in RFC 7938 — ECMP is absolutely fundamental. Every leaf switch has eBGP sessions to every spine, and every prefix in the fabric gets announced via all spines simultaneously. The leaf receives identical prefixes from multiple spines with the same AS path length, and ECMP across all of them is what delivers full bisection bandwidth utilization across the fabric.

    # BGP multipath configuration on sw-infrarunbook-01 (Cisco IOS-XE)
    router bgp 65001
     address-family ipv4 unicast
      maximum-paths 8
      maximum-paths ibgp 8
    
    # Verify multipath entries
    show bgp ipv4 unicast 10.20.0.0/24
    BGP routing table entry for 10.20.0.0/24
    Paths: (4 available, best #1, table default)
    Multipath: eBGP
      65002
        192.168.10.1 from 192.168.10.1 (192.168.10.1)
          Origin IGP, metric 0, localpref 100, valid, external, multipath, best
      65002
        192.168.20.1 from 192.168.20.1 (192.168.20.1)
          Origin IGP, metric 0, localpref 100, valid, external, multipath
      65002
        192.168.30.1 from 192.168.30.1 (192.168.30.1)
          Origin IGP, metric 0, localpref 100, valid, external, multipath
      65002
        192.168.40.1 from 192.168.40.1 (192.168.40.1)
          Origin IGP, metric 0, localpref 100, valid, external, multipath

    Why ECMP Matters in Modern Infrastructure

    The shift to spine-leaf data center architectures made ECMP not just useful but essential. In traditional three-tier designs — access, distribution, core — spanning tree blocked redundant Layer 2 paths to prevent loops. You had physical bandwidth capacity, but STP ensured you couldn't use most of it. When the industry moved to fully routed fabrics where every link does Layer 3 forwarding, ECMP became the mechanism that actually leverages the physical topology you paid for.

    A typical two-tier spine-leaf pod with four spine switches gives every leaf four equal-cost paths to every other leaf. With ECMP active across all four uplinks, a leaf can saturate all four spine-facing interfaces simultaneously for traffic flowing toward different destination leaves. Without ECMP, 75% of that uplink capacity sits idle. The math is brutal and the business case is obvious.

    Redundancy is the other side of the equation. When one of those spine paths fails — a spine reboots, a cable gets pulled, a BGP session drops — the ECMP group shrinks from four to three paths. Traffic hashes across three paths instead of four. No manual intervention, no failover timer waiting to expire. In my experience, ECMP implementations paired with BFD (Bidirectional Forwarding Detection) on each peer can detect failures and reroute in well under a second, often in the 100–300ms range depending on BFD timers and platform forwarding pipeline latency.

    ECMP and Bandwidth Aggregation

    Here's a point that causes real confusion at the operator level. ECMP does not create a single fat pipe. You don't get four times 10G as a single 40G connection. What you get is four times 10G across which different flows are distributed. A single TCP connection between two endpoints will always follow one path — that's the nature of per-flow hashing. The aggregation benefit only materializes when you have many concurrent flows whose hashes spread them across the available paths.

    For most production workloads — web traffic, storage replication, microservice communication — you have enough flow diversity that hash distribution approximates balance reasonably well. But for workloads involving a small number of very large flows (a backup job running a single rsync stream, a database doing a large sequential bulk export), ECMP won't help that specific workload at all. The flow takes one path, period. This doesn't mean ECMP is wrong for the network — it means ECMP is a population-level tool, not a per-flow optimization.
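    A quick simulation illustrates the population-level point. It uses random path picks as a stand-in for hashing random 5-tuples, so the exact numbers are illustrative:

```python
# Sketch: how evenly do flows land across 4 equal-cost paths?
# Random picks stand in for hashing random 5-tuples. With a handful
# of flows the split can be badly skewed; with many, it approaches
# balance. (Illustrative model, not a vendor hash algorithm.)
import random
from collections import Counter

def distribute(num_flows, num_paths, seed=1):
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_flows):
        counts[rng.randrange(num_paths)] += 1
    return [counts[p] for p in range(num_paths)]

print(distribute(8, 4))       # few flows: often lopsided
print(distribute(100000, 4))  # many flows: close to 25000 each
```

    The same mechanism that balances thousands of small flows does nothing for a single elephant flow, which is exactly the distinction the paragraph above draws.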

    Real-World Examples

    Data Center Fabric with BGP Unnumbered

    One of the cleanest ECMP implementations I've worked with is a data center fabric running BGP unnumbered (RFC 5549) on SONiC or Cumulus Linux. Each leaf forms eBGP sessions over uplinks using only link-local IPv6 addresses for peering, while advertising IPv4 prefixes into the fabric. Every spine receives the same loopback prefix from every leaf and re-advertises it to all other leaves. The result is that every leaf has four or more ECMP paths to every other leaf's loopback. Since applications connect using those loopbacks, ECMP works end-to-end transparently — no operator action required per-flow.

    # Cumulus Linux / FRR — verify ECMP paths on sw-infrarunbook-01
    net show route 10.0.0.16/32
    RIB entry for 10.0.0.16/32
    =========================
    Routing entry for 10.0.0.16/32
      Known via "bgp", distance 20, metric 0, best
      Last update 00:12:44 ago
      * fe80::4638:39ff:fe00:5c, via swp51, weight 1
      * fe80::4638:39ff:fe00:5d, via swp52, weight 1
      * fe80::4638:39ff:fe00:5e, via swp53, weight 1
      * fe80::4638:39ff:fe00:5f, via swp54, weight 1

    WAN Dual-Uplink with OSPF ECMP

    On the WAN edge, a practical ECMP pattern is dual uplinks from a regional site into two separate provider edge routers, both advertising the same default route or summary prefix via OSPF at equal cost. The CE router installs both next-hops in the FIB and hashes internet-bound traffic across both uplinks. If one PE or uplink fails, OSPF with BFD detects it and the ECMP group drops to a single path instantly. This is a simple, effective design that gives you both load distribution during steady state and fast failover during failure — all without BGP or any complex policy.
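    A minimal sketch of that CE-side configuration, assuming Cisco IOS-XE; the interface names, OSPF process number, and BFD timers are illustrative assumptions:

    # CE router — OSPF ECMP over both uplinks with BFD-triggered failover
    # (illustrative sketch; interface names and timers are assumptions)
    router ospf 1
     maximum-paths 2
     bfd all-interfaces
    !
    interface GigabitEthernet0/0
     description Uplink to PE-1
     ip ospf 1 area 0
     bfd interval 300 min_rx 300 multiplier 3
    !
    interface GigabitEthernet0/1
     description Uplink to PE-2
     ip ospf 1 area 0
     bfd interval 300 min_rx 300 multiplier 3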

    Kubernetes and the Fabric Edge

    ECMP shows up in Kubernetes cluster networking more often than most people realize. Projects like Cilium and MetalLB use BGP ECMP to advertise LoadBalancer service IPs directly into the upstream network fabric, letting Top-of-Rack switches distribute traffic natively across multiple node endpoints. The TOR receives multiple BGP advertisements for the same VIP from different cluster nodes, installs them as an ECMP group, and hashes client connections across nodes without any software load balancer in the data path. It's elegant when it works, and it pushes the complexity where it belongs — into the network fabric that already handles it well.

    Common Misconceptions

    ECMP Guarantees Equal Load

    It doesn't. ECMP means equal-cost paths with traffic distributed by hash. The actual utilization on each path depends entirely on how flows hash, and hash collisions are real. With a small number of active flows, you can easily end up with 70% of traffic on one path and 30% on another despite perfect 4-way ECMP configuration. Distribution improves statistically as flow count increases, but mathematical equality is never guaranteed. I've debugged more than one "my link is saturated but I have ECMP" incident where the root cause was a handful of high-bandwidth flows all hashing to the same output interface. The fix wasn't to remove ECMP — it was to understand the hash algorithm and sometimes adjust the hash seed or add additional header fields to shift the distribution.

    Removing a Path Is Safe During Maintenance

    This one trips up experienced engineers. When you remove a next-hop from an ECMP group — say you're taking a spine switch down for a software upgrade — existing flows that were using other paths also get disturbed. With simple modulo hashing, removing one of N paths forces a rehash of all flows, not just the ones assigned to the removed path. A significant fraction of all active sessions will move to different paths, causing momentary congestion bursts and disrupting long-lived TCP connections.

    Resilient ECMP solves this. It uses a consistent hash ring where each path owns a portion of the hash space. When a path is removed, only the flows that hashed into that path's segment get redistributed — all other flows stay on their current path. Arista, Juniper, and Cisco all ship resilient ECMP implementations. If your fabric carries stateful services or long-lived connections that are sensitive to disruption, enabling resilient ECMP before doing maintenance operations is worth the configuration overhead.
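    The difference is easy to demonstrate with a small model. The sketch below compares plain modulo hashing against a consistent-hash ring with virtual nodes; it is an illustrative model, not vendor firmware:

```python
# Sketch: how many flows change path when one of 4 next-hops is
# removed, under plain modulo hashing vs a consistent-hash ring.
# (Illustrative model with virtual nodes, not vendor firmware.)
import hashlib

def h(value: str) -> int:
    """64-bit hash of a string (stand-in for a hardware hash)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def modulo_assign(flow, paths):
    # Plain modulo: changing len(paths) reshuffles nearly everything.
    return paths[h(flow) % len(paths)]

def build_ring(paths, vnodes=128):
    # Each path owns many points on a consistent-hash ring.
    return sorted((h(f"{p}#{i}"), p) for p in paths for i in range(vnodes))

def ring_assign(flow, ring):
    fh = h(flow)
    for point, path in ring:
        if point >= fh:
            return path
    return ring[0][1]  # wrap around the ring

flows = [f"flow-{i}" for i in range(1000)]
paths4 = ["p1", "p2", "p3", "p4"]
paths3 = ["p1", "p2", "p3"]  # p4 removed for maintenance

moved_mod = sum(modulo_assign(f, paths4) != modulo_assign(f, paths3)
                for f in flows)

ring4, ring3 = build_ring(paths4), build_ring(paths3)
survivors = [f for f in flows if ring_assign(f, ring4) != "p4"]
moved_ring = sum(ring_assign(f, ring4) != ring_assign(f, ring3)
                 for f in survivors)

# Under modulo, roughly three quarters of all flows move; on the
# ring, flows that were not on p4 keep their path.
print(moved_mod, moved_ring)
```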

    Equal Cost Requires Identical Links

    Paths need to have the same routing metric — the cost as computed by the routing protocol — but the underlying physical characteristics don't have to be identical. A 10G link and a 1G link can technically both sit in an ECMP group if their OSPF interface costs are manually set equal. That said, doing this is a mistake in practice. ECMP distributes flows roughly evenly by count, not by bandwidth. The 1G link will receive approximately the same number of flows as the 10G link and promptly get crushed. Equal cost in the routing table should reflect equal capacity in the physical world. If it doesn't, you're setting yourself up for asymmetric congestion that's genuinely difficult to diagnose.

    More Paths Always Means Better Performance

    Adding more ECMP paths improves aggregate capacity and redundancy up to the point where the hash distribution becomes the bottleneck. Beyond a certain path count, adding more paths produces diminishing returns because individual flows still only use one path, and the statistical improvement in distribution flattens out. There's also operational overhead: larger ECMP groups mean more complex failure domains, more BFD sessions to maintain, and more potential for partial failures that are hard to diagnose. Design for the number of paths that the physical topology naturally provides — don't artificially inflate it.


    ECMP is one of those foundational concepts that touches nearly everything in modern network design — data center fabrics, WAN edges, cloud on-ramps, container networking. Getting it right means understanding not just that it exists, but how your specific hardware hashes traffic, what the path limits are for your ECMP groups, how your routing protocol determines cost equality, and critically, what happens to existing sessions when the topology changes. The protocol configuration is the easy part. The operational behavior under failure, maintenance, and skewed workloads is where the real engineering lives.

    Frequently Asked Questions

    What is the difference between ECMP and LAG (Link Aggregation)?

    LAG (Link Aggregation Group) operates at Layer 2, bundling multiple physical interfaces into a single logical interface using LACP. ECMP operates at Layer 3, distributing routed IP traffic across multiple independent next-hops in the FIB. Both use hashing to assign flows to paths, but ECMP works across any routed topology including geographically separated paths, while LAG requires direct parallel links between two devices.

    Does ECMP guarantee that all paths carry the same amount of traffic?

    No. ECMP distributes flows using a hash function, so actual path utilization depends on how many flows hash to each path. With few flows, distribution can be quite uneven. With thousands of concurrent flows, the distribution approximates balance statistically. Equal cost in the routing table does not mean equal load in practice.

    How does ECMP behave when one path fails?

    When a path fails and is removed from the ECMP group, traffic is redistributed across the remaining paths. With standard modulo hashing, this causes all flows to be rehashed — not just the ones that were using the failed path — which can briefly disrupt existing sessions. Resilient ECMP (consistent hashing) addresses this by only moving flows that were assigned to the failed path, leaving all other flows undisturbed.

    How do I enable ECMP in BGP?

    BGP does not enable multipath by default. You must explicitly configure it using the 'maximum-paths' command under the BGP address family. For eBGP, paths must have equal AS path length, MED, and local preference to qualify. For iBGP multipath, paths must also have the same IGP metric to the next-hop. The exact syntax varies by vendor but the concept is consistent across IOS-XE, Junos, and EOS.

    What is resilient ECMP and when should I use it?

    Resilient ECMP uses a consistent hash ring to assign flows to paths. When a path is added or removed, only the flows mapped to that specific path are redistributed — all other flows retain their current path assignment. This is important in environments with stateful services, long-lived TCP connections, or frequent maintenance events where minimizing flow disruption matters. It's supported on Arista EOS, Juniper Junos, and Cisco IOS-XR, and is worth enabling on any fabric where graceful path changes are a requirement.

    Can ECMP work with a mix of different link speeds?

    Technically yes — if you manually assign equal routing metrics to links of different speeds, they will be placed in the same ECMP group. However, this is generally a bad idea. ECMP distributes flows by count, not by bandwidth capacity, so a slower link will receive the same number of flows as a faster one and will quickly become congested. Interface costs should always reflect actual link capacity to avoid creating asymmetric congestion that is difficult to diagnose.
