Symptoms
You're staring at a chassis cluster that won't fail over, or worse, one that's actively failing over when it shouldn't be. Maybe a node went offline during a maintenance window and traffic didn't shift. Maybe both nodes are claiming to be primary for the same redundancy group. Maybe show chassis cluster status is showing redundancy group 0 and redundancy group 1 on different nodes and you can't figure out why. Whatever brought you here, chassis cluster failover problems on Juniper SRX platforms are almost always traceable to a handful of well-known root causes — and in my experience, the same ones show up again and again.
The most common symptoms you'll run into include both nodes showing as primary for the same redundancy group (split brain), failover not occurring after a node or link failure, failover completing but sessions not recovering, unexpected redundancy group placement after a reboot, fabric interfaces in a down or error state, and node isolation events showing up in syslog. Let's work through each root cause systematically — what's happening under the hood, how to confirm it, and how to fix it.
Root Cause 1: Fabric Link Down
The fabric link is the data plane interconnect between the two SRX nodes. On most SRX platforms these are dedicated interfaces — fab0 on node 0, fab1 on node 1 — and they carry session synchronization traffic. If the fabric goes down, you lose stateful session sync entirely. In some scenarios this also triggers unintended failovers because the cluster loses confidence in the peer's data plane state.
This happens most often from a physically unplugged or damaged cross-connect cable, or from someone plugging the fabric interfaces into a switch instead of directly into each other. I've seen the latter scenario produce a link that appears up at Layer 1 but fails to pass fabric traffic correctly — making it particularly annoying to diagnose.
Start your investigation here:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster interfaces
Fabric link status:
fab0: Up
fab1: Down
Child links:
fab0.0: Up
If you see fab1: Down or both fabric links down, that's your culprit. Drill into the interface directly to confirm the physical state:
infrarunbook-admin@sw-infrarunbook-01> show interfaces fab1
Physical interface: fab1, Enabled, Physical link is Down
Interface index: 131, SNMP ifIndex: 508
Link-level type: Ethernet, MTU: 9192, Speed: Unspecified, Duplex: Unspecified
Input rate : 0 bps (0 pps)
Output rate : 0 bps (0 pps)
Syslog will usually confirm the timeline:
infrarunbook-admin@sw-infrarunbook-01> show log messages | match fabric
Apr 15 03:14:22 sw-infrarunbook-01 chassisd[1234]: CHASSISD_CLUS_FABRIC_LINK_DOWN: Chassis cluster fabric link down on node 1
The fix is straightforward for physical issues — reseat or replace the cable. If the interfaces are misconfigured, verify that the ports listed under your fabric member-interfaces config actually match the ports you've cross-connected. After fixing, confirm data plane sync has resumed:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster data-plane statistics
Ingress traffic statistics:
Total bytes: 9483920184
Total packets: 7294821
Fabric sync packets: 3847201
Fabric sync errors: 0
Zero fabric sync errors and incrementing sync packets tell you the fabric is healthy again.
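If you poll this command on an interval, you can turn "incrementing sync packets, zero errors" into an automated check. Here's a minimal Python sketch that compares two snapshots of the counter output; the parsing patterns assume the field names shown above, so adjust them to your platform's exact output:

```python
import re

def parse_dataplane_stats(output: str) -> dict:
    """Extract counters from 'show chassis cluster data-plane statistics' text."""
    stats = {}
    for line in output.splitlines():
        m = re.match(r"\s*(Fabric sync packets|Fabric sync errors):\s*(\d+)", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

def fabric_is_healthy(before: dict, after: dict) -> bool:
    """Healthy fabric: zero sync errors and a sync packet counter that is incrementing."""
    return (after.get("Fabric sync errors", 1) == 0
            and after.get("Fabric sync packets", 0) > before.get("Fabric sync packets", 0))

# Two polls of the same command, taken a few seconds apart
before = parse_dataplane_stats("Fabric sync packets: 3847201\nFabric sync errors: 0")
after = parse_dataplane_stats("Fabric sync packets: 3847950\nFabric sync errors: 0")
print(fabric_is_healthy(before, after))  # True
```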
Root Cause 2: Control Plane Redundancy Group Split Brain
Split brain is the most dangerous chassis cluster state you can encounter. It occurs when both nodes simultaneously believe they are the primary for redundancy group 0 — the control plane redundancy group. Both nodes start processing control plane traffic and making independent forwarding decisions, which leads to traffic blackholing, duplicate session tables, and routing instability that can take down an entire site.
Split brain happens when the control link between the two nodes fails. The control link carries heartbeat traffic — it's how each node confirms the other is alive. When it fails, each node assumes its peer has died and promotes itself to primary. In my experience this most commonly happens due to a failed control link cable, or in setups where the control link traverses a management switch and a VLAN or port security policy silently drops the heartbeat frames.
Detecting split brain is unambiguous once you know what to look for:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 3
node0 200 primary N N None
node1 100 primary N N None
Redundancy group: 1 , Failover count: 1
node0 100 primary N N None
node1 1 secondary N N None
Both nodes showing primary for redundancy group 0 is your smoking gun. Syslog will typically have logged the moment it happened:
infrarunbook-admin@sw-infrarunbook-01> show log messages | match "split brain|control link"
Apr 15 07:22:01 sw-infrarunbook-01 chassisd[1102]: CHASSISD_CLUS_CONTROL_LINK_DOWN: Chassis cluster control link is down
Apr 15 07:22:03 sw-infrarunbook-01 chassisd[1102]: CHASSISD_CLUS_SPLIT_BRAIN_DETECTED: Split brain detected for redundancy group 0
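Because split brain is unambiguous in the status output, it's also easy to detect programmatically. Here's a hedged Python sketch that scans "show chassis cluster status" text for any redundancy group reporting more than one primary; the field positions are assumed from the sample output above:

```python
import re

def find_split_brain(status_output: str) -> list:
    """Return redundancy-group IDs where more than one node reports 'primary'."""
    split = []
    rg, primaries = None, 0
    for line in status_output.splitlines():
        m = re.match(r"\s*Redundancy group:\s*(\d+)", line)
        if m:
            if rg is not None and primaries > 1:
                split.append(rg)
            rg, primaries = int(m.group(1)), 0
        elif re.match(r"\s*node\d+\s+\d+\s+primary\b", line):
            primaries += 1
    if rg is not None and primaries > 1:
        split.append(rg)
    return split

output = """Redundancy group: 0 , Failover count: 3
node0   200   primary   N   N   None
node1   100   primary   N   N   None
Redundancy group: 1 , Failover count: 1
node0   100   primary   N   N   None
node1   1     secondary N   N   None"""
print(find_split_brain(output))  # [0]
```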
The fix requires resolving the underlying control link failure first — physically inspect and restore the control port connection. Once the control link is back up, manually resolve the split brain by designating which node should be primary:
infrarunbook-admin@sw-infrarunbook-01> request chassis cluster failover redundancy-group 0 node 0
This makes node 0 the sole primary again and forces node 1 back to secondary. Once the cluster is stable, clear the manual failover flag with request chassis cluster failover reset redundancy-group 0. Always enable control-link-recovery in your cluster configuration if it isn't already — it allows the cluster to self-heal after a control link restoration rather than requiring manual intervention every time.
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster control-link-recovery
infrarunbook-admin@sw-infrarunbook-01# commit
Root Cause 3: Priority Not Set
This one catches a lot of people after fresh cluster deployments. If you don't explicitly configure node priorities, both nodes default to priority 0 for all redundancy groups. With equal priorities, the cluster falls back to an election mechanism that's essentially a coin flip — whichever node boots first or wins the internal tie-breaking logic becomes primary. You can't predict or control it.
Why does this matter? Your network topology almost certainly assumes a specific node is primary. Maybe node 0 sits in your primary data center and node 1 is in a DR facility. If node 1 wins the election because it completed its boot sequence three seconds faster, your traffic is now hairpinning across the WAN unnecessarily. And after every maintenance reboot, the outcome changes.
Check current priorities:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 0
node0 0 primary N N None
node1 0 secondary N N None
Redundancy group: 1 , Failover count: 0
node0 0 primary N N None
node1 0 secondary N N None
Both nodes showing priority 0 is the tell. Fix it by setting explicit priorities — higher number wins the election:
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 0 node 0 priority 200
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 0 node 1 priority 100
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 1 node 0 priority 100
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 1 node 1 priority 1
infrarunbook-admin@sw-infrarunbook-01# commit
After committing, confirm the output looks right:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 0
node0 200 primary N N None
node1 100 secondary N N None
Redundancy group: 1 , Failover count: 0
node0 100 primary N N None
node1 1 secondary N N None
One thing to keep in mind: setting priorities doesn't immediately trigger a failover if the wrong node is currently primary and preemption isn't enabled. That's exactly the next issue.
Root Cause 4: Preemption Not Enabled
Preemption controls whether the higher-priority node automatically reclaims the primary role after recovering from a failure. Without it, the cluster is sticky — once a failover has occurred, the lower-priority node stays primary even after the higher-priority node comes back fully online. For most production environments, this is not the behavior you want.
Here's the scenario I see play out repeatedly: node 0 (priority 100, your intended primary) goes down for a planned maintenance window. Node 1 correctly takes over as primary. Node 0 comes back up and rejoins the cluster as secondary — then just sits there while node 1 continues running the show, even though node 0 has a higher priority and is supposed to own the primary role. Your cluster is now in a state that doesn't match your design, and any subsequent problem with node 1 will be harder to reason about because you're no longer in your baseline configuration.
Check whether preemption is active:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 1 , Failover count: 2
node0 100 secondary N N None
node1 1 primary N N None
The Preempt: N column confirms it's not configured. Node 1 is primary despite holding priority 1, while node 0 has priority 100. Also verify in the config directly:
infrarunbook-admin@sw-infrarunbook-01# show chassis cluster redundancy-group 1
node 0 priority 100;
node 1 priority 1;
## preempt statement is absent
Enable preemption:
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 1 preempt
infrarunbook-admin@sw-infrarunbook-01# commit
After committing, the higher-priority node will reclaim primary status. Verify the outcome:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 1 , Failover count: 3
node0 100 primary Y N None
node1 1 secondary Y N None
Both nodes now show Preempt: Y and node 0 has reclaimed the primary role. One important operational note: enabling preemption on a live cluster will trigger an immediate failover if the current primary has a lower priority than the secondary. Plan this change for a maintenance window if there's active traffic you can't afford to interrupt.
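The preemption decision reduces to a few lines of logic. Here's a toy Python model of the behavior described above, an illustration of the rule rather than actual Junos code:

```python
def primary_after_recovery(p0: int, p1: int, current_primary: int, preempt: bool) -> int:
    """Which node holds primary after the failed node rejoins the cluster.
    p0/p1 are the configured priorities; current_primary is 0 or 1."""
    if not preempt:
        return current_primary       # sticky: no automatic fail-back
    return 0 if p0 > p1 else 1       # higher priority reclaims the primary role

# Node 0 (priority 100) has recovered while node 1 (priority 1) is primary:
print(primary_after_recovery(100, 1, current_primary=1, preempt=False))  # 1
print(primary_after_recovery(100, 1, current_primary=1, preempt=True))   # 0
```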
Root Cause 5: Interface Monitoring Not Configured
Interface monitoring is the mechanism that tells the cluster to fail over when a monitored interface goes down. Without it, failover only happens when the entire node fails — not when an upstream or downstream link fails. This is a significant operational gap that gets overlooked more often than it should.
The scenario that exposes this gap: the WAN uplink on your primary node goes down. Traffic stops flowing. But the node itself is still alive, the control link is up, the cluster is healthy from a node perspective — so no failover happens. Users are down. Your cluster did exactly what it was configured to do, which was nothing, because nobody told it that the uplink failure should trigger a failover.
Check whether interface monitoring is configured:
infrarunbook-admin@sw-infrarunbook-01# show chassis cluster redundancy-group 1
node 0 priority 100;
node 1 priority 1;
preempt;
## interface-monitor block is absent
And from the operational view:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 1 , Failover count: 0
node0 100 primary Y N None
node1 1 secondary Y N None
Monitor-failures: None
No interfaces listed in the monitor-failures column and nothing being watched means you have no interface monitoring at all. To configure it, identify the critical interfaces — typically the physical uplink ports that are members of your reth interfaces — and assign weights. The weight value represents how much the node's effective priority drops when that interface fails. When the cumulative weight of failed monitored interfaces reaches or exceeds 255, the cluster triggers a failover:
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 1 interface-monitor ge-0/0/0 weight 255
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 1 interface-monitor ge-0/0/1 weight 128
infrarunbook-admin@sw-infrarunbook-01# commit
With this config, losing ge-0/0/0 alone (weight 255) immediately triggers failover. Losing ge-0/0/1 alone (weight 128) doesn't — but losing both simultaneously pushes the combined weight past the 255 threshold and triggers failover. Set your weights based on the criticality of each link in your topology.
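The threshold arithmetic is simple enough to model directly. A quick Python sketch, where the weights mirror the example config above and 255 is the fixed Junos trigger point:

```python
# Weights mirroring the example interface-monitor config
WEIGHTS = {"ge-0/0/0": 255, "ge-0/0/1": 128}
THRESHOLD = 255  # Junos triggers failover when cumulative failed weight reaches 255

def failover_triggered(failed_interfaces, weights=WEIGHTS) -> bool:
    """True when the summed weight of failed monitored interfaces
    reaches or exceeds the 255 threshold."""
    return sum(weights.get(ifc, 0) for ifc in failed_interfaces) >= THRESHOLD

print(failover_triggered(["ge-0/0/0"]))                # True  (255 >= 255)
print(failover_triggered(["ge-0/0/1"]))                # False (128 <  255)
print(failover_triggered(["ge-0/0/0", "ge-0/0/1"]))    # True  (383 >= 255)
```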
After configuring, verify by simulating a failure during a maintenance window. Bring down a monitored interface and watch the cluster respond:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 1 , Failover count: 1
node0 0 secondary Y N ge-0/0/0
node1 1 primary Y N None
Monitor-failures: ge-0/0/0
The Monitor-failures: ge-0/0/0 field confirms the cluster detected the link failure and triggered a failover correctly. Node 1 is now primary. Restore the interface and confirm node 0 preempts back to primary as expected.
Root Cause 6: Heartbeat Packet Loss on Control Link
Even with the control link physically intact, I've seen clusters behave erratically because heartbeat packets are being silently dropped somewhere in the path. This is especially common when the control link traverses an intermediate management switch rather than being a direct crossover connection. A misconfigured VLAN, a port-security policy, or storm control on the switch can drop the multicast or broadcast heartbeat frames without generating any obvious error.
Check the control plane statistics to see whether heartbeats are being received:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster control-plane statistics
Control link statistics:
Control link 0:
Heartbeat packets sent: 847321
Heartbeat packets received: 847318
Heartbeat packet errors: 0
Control link 1:
Heartbeat packets sent: 12490
Heartbeat packets received: 0
Heartbeat packet errors: 3
Control link 1 is sending heartbeats but receiving zero back, with errors accumulating. This points squarely at a Layer 2 path problem rather than a physical cable issue. Check the switch port configuration between the nodes — verify the VLAN tagging is correct, no ACLs are filtering the traffic, and that storm control isn't rate-limiting or dropping the frames. If possible, convert the control link path to a direct crossover connection to eliminate the switch as a variable entirely.
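A sent/received asymmetry like this is easy to catch with a small parser. Here's a sketch, assuming the field layout in the sample output above:

```python
import re

def heartbeat_loss(output: str, tolerance: int = 5) -> dict:
    """Flag control links where received heartbeats lag sent by more than
    `tolerance`, from 'show chassis cluster control-plane statistics' text."""
    results, link = {}, None
    for line in output.splitlines():
        m = re.match(r"\s*Control link (\d+):", line)
        if m:
            link = int(m.group(1))
            results[link] = {"sent": 0, "received": 0}
        elif link is not None:
            m = re.match(r"\s*Heartbeat packets (sent|received):\s*(\d+)", line)
            if m:
                results[link][m.group(1)] = int(m.group(2))
    return {l: s for l, s in results.items() if s["sent"] - s["received"] > tolerance}

sample = """Control link 0:
  Heartbeat packets sent: 847321
  Heartbeat packets received: 847318
Control link 1:
  Heartbeat packets sent: 12490
  Heartbeat packets received: 0"""
print(heartbeat_loss(sample))  # {1: {'sent': 12490, 'received': 0}}
```

The small tolerance matters: a healthy cluster still shows a tiny sent/received gap from packets in flight, so alert only on a sustained asymmetry.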
Prevention
Most chassis cluster failover problems are preventable. They tend to surface either because a cluster was deployed without a complete configuration review, or because a change was made to one part of the environment without considering the cluster implications. A solid pre-production checklist eliminates the vast majority of these issues.
Before any cluster goes live, run all five of these commands and read the output critically:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster interfaces
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster control-plane statistics
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster data-plane statistics
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster statistics
You're looking for: fabric links up on both nodes, heartbeats incrementing symmetrically on both control links, non-zero and non-equal priorities on each redundancy group, preemption showing Y, and interface-monitor entries present for your critical uplinks. If any of those conditions aren't met, fix them before traffic goes live.
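Those checks lend themselves to automation. Here's a hedged Python harness: `run_cli` is a placeholder callable you supply (backed by SSH, NETCONF, or PyEZ), and the pass/fail predicates are deliberately simple string checks keyed to the output formats shown in this article, so tighten them for your platform and Junos version:

```python
import re

def run_checks(run_cli) -> list:
    """Run the go-live commands and return the names of checks that failed.
    `run_cli` is any callable that executes a Junos CLI command string on the
    device and returns its text output."""
    checks = [
        ("fabric links up",
         "show chassis cluster interfaces",
         lambda out: "Down" not in out),
        ("no heartbeat errors",
         "show chassis cluster control-plane statistics",
         lambda out: "errors: 0" in out),
        ("no fabric sync errors",
         "show chassis cluster data-plane statistics",
         lambda out: "Fabric sync errors: 0" in out),
        ("explicit priorities set",
         "show chassis cluster status",
         lambda out: not re.search(r"node\d+\s+0\s", out)),
        ("preemption enabled",
         "show chassis cluster status",
         lambda out: re.search(r"primary\s+Y", out) is not None),
    ]
    return [name for name, cmd, ok in checks if not ok(run_cli(cmd))]

# Dry run against canned output instead of a live device
fake_outputs = {
    "show chassis cluster interfaces": "fab0: Up\nfab1: Up",
    "show chassis cluster control-plane statistics": "Heartbeat packet errors: 0",
    "show chassis cluster data-plane statistics": "Fabric sync errors: 0",
    "show chassis cluster status":
        "node0 200 primary Y N None\nnode1 100 secondary Y N None",
}
print(run_checks(fake_outputs.get))  # []
```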
Always test failover before production traffic is on the cluster. Use request chassis cluster failover redundancy-group 1 node 1 to manually trigger a failover, verify that traffic shifts as expected, then fail back and confirm the preemption logic works. This end-to-end test validates the full path: interface monitoring, priority logic, session synchronization over the fabric link, and reth interface migration. Skipping this step is how you discover a misconfiguration at 2am during an actual outage.
Set up monitoring for CHASSISD_CLUS_* syslog messages. Chassis cluster events — fabric link down, failovers, split brain detection, node isolation — should alert your on-call team immediately, not sit in a log file. By the time someone manually checks the logs, the damage may already be compounding. A properly monitored cluster turns a potential outage into a brief, expected failover event. An unmonitored cluster turns a brief failover into an extended incident when the secondary also has a problem nobody knew about.
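As a starting point for that alerting, here's a hypothetical Python classifier built around the message tags shown in this article's examples; your syslog pipeline and the exact tag names on your Junos version may differ:

```python
import re

# Tags that should page immediately; any other CHASSISD_CLUS_* event opens a ticket.
CRITICAL = ("SPLIT_BRAIN", "CONTROL_LINK_DOWN", "FABRIC_LINK_DOWN", "NODE_ISOLATION")

def classify(syslog_line: str):
    """Return 'page', 'ticket', or None for a raw syslog line."""
    m = re.search(r"(CHASSISD_CLUS_\w+)", syslog_line)
    if not m:
        return None
    return "page" if any(k in m.group(1) for k in CRITICAL) else "ticket"

line = ("Apr 15 07:22:03 sw-infrarunbook-01 chassisd[1102]: "
        "CHASSISD_CLUS_SPLIT_BRAIN_DETECTED: Split brain detected")
print(classify(line))  # page
```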
Finally, document your intended cluster state explicitly: which node should be primary for each redundancy group, which interfaces are monitored and why, and what your expected failover behavior is after each type of failure. This documentation becomes invaluable when someone unfamiliar with the cluster is troubleshooting at 3am — and it's also a forcing function for making sure you've actually thought through all of these configuration points before something goes wrong.
