InfraRunBook

    ECMP (Equal Cost Multipath) Routing Explained

    Networking
    Published: Apr 8, 2026
    Updated: Apr 8, 2026

    ECMP (Equal Cost Multipath) routing lets routers forward traffic across multiple paths simultaneously, delivering bandwidth aggregation and built-in redundancy. This guide covers how it works, where it matters, and the operational gotchas that catch engineers off guard.


    What Is ECMP?

    Equal Cost Multipath Routing — ECMP — is the mechanism that lets a router or switch forward traffic toward a single destination using multiple paths simultaneously, provided those paths carry the same cost metric. It sounds straightforward on paper, but the nuances in how traffic gets distributed, how failures get handled, and how different protocols implement it are where things get genuinely interesting.

    Without ECMP, when your routing protocol discovers two equal-cost paths to a prefix, it picks one and discards the other. You lose half your available bandwidth to that destination, and if the active path fails, you wait for convergence. ECMP flips that model entirely — both paths stay active in the Forwarding Information Base (FIB), and the router spreads traffic across them. You get bandwidth aggregation and built-in redundancy at the same time, which is a rare combination in networking where you usually have to choose.

    I've seen network engineers conflate ECMP with link bonding or LAG (Link Aggregation Groups). They're related in spirit but different in implementation. LAG operates at Layer 2, bundling physical interfaces into a single logical one. ECMP operates at Layer 3, distributing routed traffic across independent next-hops. Both hash traffic to pick a path, and neither guarantees perfect balance — but they solve different problems at different layers of the stack.

    How ECMP Works Under the Hood

    The core mechanism works like this: the routing protocol — whether it's OSPF, IS-IS, BGP, or even static routes — identifies multiple next-hops for a given prefix that share the same cost. The router installs all of those next-hops into the FIB as equal candidates. When a packet arrives, the forwarding plane has to choose one of those next-hops for this particular flow.

    That choice is made via a hash function. The router takes some combination of packet header fields, runs them through a hash algorithm, and maps the result to one of the available paths. The fields used in the hash vary by implementation, but the most common combination is the 5-tuple: source IP address, destination IP address, IP protocol number, source port, and destination port. Some implementations also fold in VLAN tags, MPLS labels, or IPv6 flow labels to improve distribution. The more entropy in the hash inputs, the more evenly flows spread across paths.

    The hash output is typically fed into a modulo operation against the number of available paths. If you have four equal-cost next-hops and the hash result mod 4 equals 2, traffic goes out the third path. Every packet belonging to the same TCP connection will carry the same 5-tuple, so they all hash identically and follow the same path. This is per-flow load balancing, and it's the dominant behavior in modern switching and routing ASICs.
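    The per-flow selection described above can be sketched in a few lines. This is a minimal illustration assuming a SHA-256-based hash and plain modulo selection, not any vendor's actual ASIC algorithm:

```python
# Illustrative sketch of per-flow ECMP path selection: hash the
# 5-tuple, then take the result modulo the number of next-hops.
# (Not any vendor's ASIC algorithm; SHA-256 is an assumption here.)
import hashlib

def select_path(src_ip, dst_ip, proto, src_port, dst_port, num_paths):
    """Map a flow's 5-tuple to one of num_paths next-hop indexes."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    return hash_value % num_paths

# Every packet of the same TCP connection carries the same 5-tuple,
# so it always maps to the same path:
path = select_path("10.1.1.10", "10.2.2.20", 6, 49152, 443, 4)
assert path == select_path("10.1.1.10", "10.2.2.20", 6, 49152, 443, 4)
```

    Because the mapping is purely a function of the header fields, no per-flow state needs to be kept in the forwarding plane.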

    Some older implementations and certain niche use cases use per-packet load balancing, where each individual packet gets hashed independently — or even round-robined — across the available paths. Per-packet sounds more efficient on the surface, but it's catastrophic for TCP. Packets arrive out of order, the receiver's TCP stack interprets reordering as congestion, it backs off its window, and your throughput craters. For anything running TCP at scale, per-flow hashing is the only sensible choice.
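    A toy model makes the reordering problem concrete. It assumes two paths with different fixed latencies and strict round-robin spraying; real path delays vary, but the effect is the same:

```python
# Toy model of why per-packet spraying reorders TCP segments: two
# paths with different latencies, packets alternated round-robin.
# (Illustrative only; latencies and send spacing are assumptions.)
def arrival_order(num_packets, path_latency=(1.0, 2.5)):
    """Return packet sequence numbers in the order they arrive."""
    arrivals = []
    for seq in range(num_packets):
        path = seq % len(path_latency)   # round-robin spraying
        send_time = seq * 0.1            # packets sent 0.1 ms apart
        arrivals.append((send_time + path_latency[path], seq))
    return [seq for _, seq in sorted(arrivals)]

print(arrival_order(8))  # → [0, 2, 4, 6, 1, 3, 5, 7]
```

    The receiver sees every segment that took the slow path arrive after a burst of later-sequenced segments from the fast path, which TCP interprets as loss-like reordering.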

    The FIB and ECMP Groups

    At the hardware level, the FIB doesn't store a single next-hop per prefix when ECMP is active. Instead, a prefix entry points to an ECMP group — a data structure containing the full set of valid next-hops with their associated output interfaces and next-hop MAC addresses. The ASIC hashes each incoming packet and indexes into the group to select the forwarding entry. The whole operation happens in the fast path at line rate; there's no software involvement once the group is programmed.
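    Conceptually, the structure looks like the sketch below. The names (EcmpGroup, NextHop) and layout are illustrative, not a real ASIC SDK:

```python
# Conceptual model of a FIB with ECMP groups: a prefix points to a
# group, and the packet hash indexes into the group's next-hop list.
# (Illustrative data model, not real forwarding-plane software.)
from dataclasses import dataclass

@dataclass(frozen=True)
class NextHop:
    ip: str          # next-hop IP address
    interface: str   # output interface
    mac: str         # resolved next-hop MAC address

@dataclass
class EcmpGroup:
    members: list    # NextHop entries, all equal cost

    def select(self, flow_hash: int) -> NextHop:
        # In hardware this lookup happens at line rate with no
        # software involvement once the group is programmed.
        return self.members[flow_hash % len(self.members)]

fib = {
    "10.10.0.0/16": EcmpGroup([
        NextHop("192.168.1.2", "Gi0/1", "00:1a:2b:3c:4d:01"),
        NextHop("192.168.2.2", "Gi0/2", "00:1a:2b:3c:4d:02"),
        NextHop("192.168.3.2", "Gi0/3", "00:1a:2b:3c:4d:03"),
        NextHop("192.168.4.2", "Gi0/4", "00:1a:2b:3c:4d:04"),
    ])
}

# A flow hashing to 6 in a 4-member group lands on index 2 (third path)
print(fib["10.10.0.0/16"].select(6).interface)  # → Gi0/3
```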

    Most platforms enforce a hard limit on paths per ECMP group. Common limits are 4, 8, 16, 32, or 64 depending on hardware generation and TCAM capacity. Cisco IOS defaults to 4 ECMP paths for OSPF but can be raised. Arista EOS and Juniper Junos both support 64 or more paths on modern platforms. Knowing your platform's limit matters — in large spine-leaf fabrics you can easily design a topology that exceeds the default and get silently truncated ECMP groups as a result.

    # Cisco IOS-XE — raise OSPF maximum paths on sw-infrarunbook-01
    router ospf 1
     maximum-paths 8
    
    # Verify the installed FIB entry
    show ip route 10.10.0.0 255.255.0.0
    Routing entry for 10.10.0.0/16
      Known via "ospf 1", distance 110, metric 20, type intra area
      Last update from 192.168.1.2 on GigabitEthernet0/1, 00:04:12 ago
      Routing Descriptor Blocks:
      * 192.168.1.2, from 192.168.10.1, via GigabitEthernet0/1
        Route metric is 20
        192.168.2.2, from 192.168.10.2, via GigabitEthernet0/2
        Route metric is 20
        192.168.3.2, from 192.168.10.3, via GigabitEthernet0/3
        Route metric is 20
        192.168.4.2, from 192.168.10.4, via GigabitEthernet0/4
        Route metric is 20

    ECMP Across Different Protocols

    OSPF and IS-IS

    In link-state protocols like OSPF and IS-IS, ECMP emerges naturally when the SPF algorithm produces multiple shortest paths of equal total cost to a destination. You don't configure ECMP directly — you configure your link costs so that multiple paths end up with the same total metric. If you want four paths to be equal, every path's cumulative cost from source to destination must be identical.

    In practice this is easy to achieve with disciplined interface cost assignment. Where it gets complicated is when your topology evolves organically and someone installs a link with a slightly different speed or cost, inadvertently breaking cost symmetry. In my experience, this is one of the most common silent killers of ECMP in production — someone upgrades a 1G link to 10G, the auto-calculated OSPF reference bandwidth changes the cost, and suddenly one of your four equal paths is no longer equal. Traffic shifts to three paths, the fourth goes idle, and nobody notices until a link starts saturating.

    # Verify OSPF interface costs on sw-infrarunbook-01
    # All uplinks must show identical cost values for ECMP
    show ip ospf interface brief
    Interface    PID   Area      IP Address/Mask     Cost  State  Nbrs F/C
    Gi0/0        1     0.0.0.0   192.168.1.1/30      4     P2P    1/1
    Gi0/1        1     0.0.0.0   192.168.2.1/30      4     P2P    1/1
    Gi0/2        1     0.0.0.0   192.168.3.1/30      4     P2P    1/1
    Gi0/3        1     0.0.0.0   192.168.4.1/30      4     P2P    1/1

    BGP Multipath

    BGP ECMP is more nuanced because BGP's path selection process is explicitly designed to choose exactly one best path. Multipath is an opt-in feature, and the conditions under which paths are considered equal are considerably stricter than in IGPs.

    For iBGP multipath, candidate paths must share the same IGP metric to the next-hop, the same local preference, the same AS path length, and the same MED, and they must arrive from different next-hop addresses. For eBGP multipath, paths additionally need to originate from different neighboring ASes — unless you configure maximum-paths eibgp, which relaxes that requirement and allows mixing iBGP and eBGP paths in the same ECMP group.

    In modern data center spine-leaf designs running BGP everywhere — the model described in RFC 7938 — ECMP is absolutely fundamental. Every leaf switch has eBGP sessions to every spine, and every prefix in the fabric gets announced via all spines simultaneously. The leaf receives identical prefixes from multiple spines with the same AS path length, and ECMP across all of them is what delivers full bisection bandwidth utilization across the fabric.

    # BGP multipath configuration on sw-infrarunbook-01 (Cisco IOS-XE)
    router bgp 65001
     address-family ipv4 unicast
      maximum-paths 8
      maximum-paths ibgp 8
    
    # Verify multipath entries
    show bgp ipv4 unicast 10.20.0.0/24
    BGP routing table entry for 10.20.0.0/24
    Paths: (4 available, best #1, table default)
    Multipath: eBGP
      65002
        192.168.10.1 from 192.168.10.1 (192.168.10.1)
          Origin IGP, metric 0, localpref 100, valid, external, multipath, best
      65002
        192.168.20.1 from 192.168.20.1 (192.168.20.1)
          Origin IGP, metric 0, localpref 100, valid, external, multipath
      65002
        192.168.30.1 from 192.168.30.1 (192.168.30.1)
          Origin IGP, metric 0, localpref 100, valid, external, multipath
      65002
        192.168.40.1 from 192.168.40.1 (192.168.40.1)
          Origin IGP, metric 0, localpref 100, valid, external, multipath

    Why ECMP Matters in Modern Infrastructure

    The shift to spine-leaf data center architectures made ECMP not just useful but essential. In traditional three-tier designs — access, distribution, core — spanning tree blocked redundant Layer 2 paths to prevent loops. You had physical bandwidth capacity, but STP ensured you couldn't use most of it. When the industry moved to fully routed fabrics where every link does Layer 3 forwarding, ECMP became the mechanism that actually leverages the physical topology you paid for.

    A typical two-tier spine-leaf pod with four spine switches gives every leaf four equal-cost paths to every other leaf. With ECMP active across all four uplinks, a leaf can saturate all four spine-facing interfaces simultaneously for traffic flowing toward different destination leaves. Without ECMP, 75% of that uplink capacity sits idle. The math is brutal and the business case is obvious.

    Redundancy is the other side of the equation. When one of those spine paths fails — a spine reboots, a cable gets pulled, a BGP session drops — the ECMP group shrinks from four to three paths. Traffic hashes across three paths instead of four. No manual intervention, no failover timer waiting to expire. In my experience, ECMP implementations paired with BFD (Bidirectional Forwarding Detection) on each peer can detect failures and reroute in well under a second, often in the 100–300ms range depending on BFD timers and platform forwarding pipeline latency.

    ECMP and Bandwidth Aggregation

    Here's a point that causes real confusion at the operator level. ECMP does not create a single fat pipe. You don't get four times 10G as a single 40G connection. What you get is four times 10G across which different flows are distributed. A single TCP connection between two endpoints will always follow one path — that's the nature of per-flow hashing. The aggregation benefit only materializes when you have many concurrent flows whose hashes spread them across the available paths.

    For most production workloads — web traffic, storage replication, microservice communication — you have enough flow diversity that hash distribution approximates balance reasonably well. But for workloads involving a small number of very large flows (a backup job running a single rsync stream, a database doing a large sequential bulk export), ECMP won't help that specific workload at all. The flow takes one path, period. This doesn't mean ECMP is wrong for the network — it means ECMP is a population-level tool, not a per-flow optimization.
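    A quick simulation illustrates the population-level point. It uses random path picks as a stand-in for hashing random 5-tuples, so the exact numbers are illustrative:

```python
# Sketch: how evenly do flows land across 4 equal-cost paths?
# Random picks stand in for hashing random 5-tuples. With a handful
# of flows the split can be badly skewed; with many, it approaches
# balance. (Illustrative model, not a vendor hash algorithm.)
import random
from collections import Counter

def distribute(num_flows, num_paths, seed=1):
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_flows):
        counts[rng.randrange(num_paths)] += 1
    return [counts[p] for p in range(num_paths)]

print(distribute(8, 4))       # few flows: often lopsided
print(distribute(100000, 4))  # many flows: close to 25000 each
```

    The same mechanism that balances thousands of small flows does nothing for a single elephant flow, which is exactly the distinction the paragraph above draws.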

    Real-World Examples

    Data Center Fabric with BGP Unnumbered

    One of the cleanest ECMP implementations I've worked with is a data center fabric running BGP unnumbered (RFC 5549) on SONiC or Cumulus Linux. Each leaf forms eBGP sessions over uplinks using only link-local IPv6 addresses for peering, while advertising IPv4 prefixes into the fabric. Every spine receives the same loopback prefix from every leaf and re-advertises it to all other leaves. The result is that every leaf has four or more ECMP paths to every other leaf's loopback. Since applications connect using those loopbacks, ECMP works end-to-end transparently — no operator action required per-flow.

    # Cumulus Linux / FRR — verify ECMP paths on sw-infrarunbook-01
    net show route 10.0.0.16/32
    RIB entry for 10.0.0.16/32
    =========================
    Routing entry for 10.0.0.16/32
      Known via "bgp", distance 20, metric 0, best
      Last update 00:12:44 ago
      * fe80::4638:39ff:fe00:5c, via swp51, weight 1
      * fe80::4638:39ff:fe00:5d, via swp52, weight 1
      * fe80::4638:39ff:fe00:5e, via swp53, weight 1
      * fe80::4638:39ff:fe00:5f, via swp54, weight 1

    WAN Dual-Uplink with OSPF ECMP

    On the WAN edge, a practical ECMP pattern is dual uplinks from a regional site into two separate provider edge routers, both advertising the same default route or summary prefix via OSPF at equal cost. The CE router installs both next-hops in the FIB and hashes internet-bound traffic across both uplinks. If one PE or uplink fails, OSPF with BFD detects it and the ECMP group drops to a single path instantly. This is a simple, effective design that gives you both load distribution during steady state and fast failover during failure — all without BGP or any complex policy.
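    A minimal sketch of that CE-side configuration, assuming Cisco IOS-XE; the interface names, OSPF process number, and BFD timers are illustrative assumptions:

    # CE router — OSPF ECMP over both uplinks with BFD-triggered failover
    # (illustrative sketch; interface names and timers are assumptions)
    router ospf 1
     maximum-paths 2
     bfd all-interfaces
    !
    interface GigabitEthernet0/0
     description Uplink to PE-1
     ip ospf 1 area 0
     bfd interval 300 min_rx 300 multiplier 3
    !
    interface GigabitEthernet0/1
     description Uplink to PE-2
     ip ospf 1 area 0
     bfd interval 300 min_rx 300 multiplier 3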

    Kubernetes and the Fabric Edge

    ECMP shows up in Kubernetes cluster networking more often than most people realize. Projects like Cilium and MetalLB use BGP ECMP to advertise LoadBalancer service IPs directly into the upstream network fabric, letting Top-of-Rack switches distribute traffic natively across multiple node endpoints. The TOR receives multiple BGP advertisements for the same VIP from different cluster nodes, installs them as an ECMP group, and hashes client connections across nodes without any software load balancer in the data path. It's elegant when it works, and it pushes the complexity where it belongs — into the network fabric that already handles it well.

    Common Misconceptions

    ECMP Guarantees Equal Load

    It doesn't. ECMP means equal-cost paths with traffic distributed by hash. The actual utilization on each path depends entirely on how flows hash, and hash collisions are real. With a small number of active flows, you can easily end up with 70% of traffic on one path and 30% on another despite perfect 4-way ECMP configuration. Distribution improves statistically as flow count increases, but mathematical equality is never guaranteed. I've debugged more than one "my link is saturated but I have ECMP" incident where the root cause was a handful of high-bandwidth flows all hashing to the same output interface. The fix wasn't to remove ECMP — it was to understand the hash algorithm and sometimes adjust the hash seed or add additional header fields to shift the distribution.

    Removing a Path Is Safe During Maintenance

    This one trips up experienced engineers. When you remove a next-hop from an ECMP group — say you're taking a spine switch down for a software upgrade — existing flows that were using other paths also get disturbed. With simple modulo hashing, removing one of N paths forces a rehash of all flows, not just the ones assigned to the removed path. A significant fraction of all active sessions will move to different paths, causing momentary congestion bursts and disrupting long-lived TCP connections.

    Resilient ECMP solves this. It uses a consistent hash ring where each path owns a portion of the hash space. When a path is removed, only the flows that hashed into that path's segment get redistributed — all other flows stay on their current path. Arista, Juniper, and Cisco all ship resilient ECMP implementations. If your fabric carries stateful services or long-lived connections that are sensitive to disruption, enabling resilient ECMP before doing maintenance operations is worth the configuration overhead.
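    The difference is easy to demonstrate with a small model. The sketch below compares plain modulo hashing against a consistent-hash ring with virtual nodes; it is an illustrative model, not vendor firmware:

```python
# Sketch: how many flows change path when one of 4 next-hops is
# removed, under plain modulo hashing vs a consistent-hash ring.
# (Illustrative model with virtual nodes, not vendor firmware.)
import hashlib

def h(value: str) -> int:
    """64-bit hash of a string (stand-in for a hardware hash)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def modulo_assign(flow, paths):
    # Plain modulo: changing len(paths) reshuffles nearly everything.
    return paths[h(flow) % len(paths)]

def build_ring(paths, vnodes=128):
    # Each path owns many points on a consistent-hash ring.
    return sorted((h(f"{p}#{i}"), p) for p in paths for i in range(vnodes))

def ring_assign(flow, ring):
    fh = h(flow)
    for point, path in ring:
        if point >= fh:
            return path
    return ring[0][1]  # wrap around the ring

flows = [f"flow-{i}" for i in range(1000)]
paths4 = ["p1", "p2", "p3", "p4"]
paths3 = ["p1", "p2", "p3"]  # p4 removed for maintenance

moved_mod = sum(modulo_assign(f, paths4) != modulo_assign(f, paths3)
                for f in flows)

ring4, ring3 = build_ring(paths4), build_ring(paths3)
survivors = [f for f in flows if ring_assign(f, ring4) != "p4"]
moved_ring = sum(ring_assign(f, ring4) != ring_assign(f, ring3)
                 for f in survivors)

# Under modulo, roughly three quarters of all flows move; on the
# ring, flows that were not on p4 keep their path.
print(moved_mod, moved_ring)
```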

    Equal Cost Requires Identical Links

    Paths need to have the same routing metric — the cost as computed by the routing protocol — but the underlying physical characteristics don't have to be identical. A 10G link and a 1G link can technically both sit in an ECMP group if their OSPF interface costs are manually set equal. That said, doing this is a mistake in practice. ECMP distributes flows roughly evenly by count, not by bandwidth. The 1G link will receive approximately the same number of flows as the 10G link and promptly get crushed. Equal cost in the routing table should reflect equal capacity in the physical world. If it doesn't, you're setting yourself up for asymmetric congestion that's genuinely difficult to diagnose.

    More Paths Always Means Better Performance

    Adding more ECMP paths improves aggregate capacity and redundancy up to the point where the hash distribution becomes the bottleneck. Beyond a certain path count, adding more paths produces diminishing returns because individual flows still only use one path, and the statistical improvement in distribution flattens out. There's also operational overhead: larger ECMP groups mean more complex failure domains, more BFD sessions to maintain, and more potential for partial failures that are hard to diagnose. Design for the number of paths that the physical topology naturally provides — don't artificially inflate it.


    ECMP is one of those foundational concepts that touches nearly everything in modern network design — data center fabrics, WAN edges, cloud on-ramps, container networking. Getting it right means understanding not just that it exists, but how your specific hardware hashes traffic, what the path limits are for your ECMP groups, how your routing protocol determines cost equality, and critically, what happens to existing sessions when the topology changes. The protocol configuration is the easy part. The operational behavior under failure, maintenance, and skewed workloads is where the real engineering lives.

    Frequently Asked Questions

    What is the difference between ECMP and LAG (Link Aggregation)?

    LAG (Link Aggregation Group) operates at Layer 2, bundling multiple physical interfaces into a single logical interface using LACP. ECMP operates at Layer 3, distributing routed IP traffic across multiple independent next-hops in the FIB. Both use hashing to assign flows to paths, but ECMP works across any routed topology including geographically separated paths, while LAG requires direct parallel links between two devices.

    Does ECMP guarantee that all paths carry the same amount of traffic?

    No. ECMP distributes flows using a hash function, so actual path utilization depends on how many flows hash to each path. With few flows, distribution can be quite uneven. With thousands of concurrent flows, the distribution approximates balance statistically. Equal cost in the routing table does not mean equal load in practice.

    How does ECMP behave when one path fails?

    When a path fails and is removed from the ECMP group, traffic is redistributed across the remaining paths. With standard modulo hashing, this causes all flows to be rehashed — not just the ones that were using the failed path — which can briefly disrupt existing sessions. Resilient ECMP (consistent hashing) addresses this by only moving flows that were assigned to the failed path, leaving all other flows undisturbed.

    How do I enable ECMP in BGP?

    BGP does not enable multipath by default. You must explicitly configure it using the 'maximum-paths' command under the BGP address family. For eBGP, paths must have equal AS path length, MED, and local preference to qualify. For iBGP multipath, paths must also have the same IGP metric to the next-hop. The exact syntax varies by vendor but the concept is consistent across IOS-XE, Junos, and EOS.

    What is resilient ECMP and when should I use it?

    Resilient ECMP uses a consistent hash ring to assign flows to paths. When a path is added or removed, only the flows mapped to that specific path are redistributed — all other flows retain their current path assignment. This is important in environments with stateful services, long-lived TCP connections, or frequent maintenance events where minimizing flow disruption matters. It's supported on Arista EOS, Juniper Junos, and Cisco IOS-XR, and is worth enabling on any fabric where graceful path changes are a requirement.

    Can ECMP work with a mix of different link speeds?

    Technically yes — if you manually assign equal routing metrics to links of different speeds, they will be placed in the same ECMP group. However, this is generally a bad idea. ECMP distributes flows by count, not by bandwidth capacity, so a slower link will receive the same number of flows as a faster one and will quickly become congested. Interface costs should always reflect actual link capacity to avoid creating asymmetric congestion that is difficult to diagnose.
