Symptoms
You're staring at a chassis cluster that won't fail over, or worse, one that's actively failing over when it shouldn't be. Maybe a node went offline during a maintenance window and traffic didn't shift. Maybe both nodes are claiming to be primary for the same redundancy group. Maybe show chassis cluster status is showing redundancy group 0 and redundancy group 1 on different nodes and you can't figure out why. Whatever brought you here, chassis cluster failover problems on Juniper SRX platforms are almost always traceable to a handful of well-known root causes — and in my experience, the same ones show up again and again.
The most common symptoms you'll run into include both nodes showing as primary for the same redundancy group (split brain), failover not occurring after a node or link failure, failover completing but sessions not recovering, unexpected redundancy group placement after a reboot, fabric interfaces in a down or error state, and node isolation events showing up in syslog. Let's work through each root cause systematically — what's happening under the hood, how to confirm it, and how to fix it.
Root Cause 1: Fabric Link Down
The fabric link is the data plane interconnect between the two SRX nodes. On most SRX platforms these are dedicated interfaces — fab0 on node 0, fab1 on node 1 — and they carry session synchronization traffic. If the fabric goes down, you lose stateful session sync entirely. In some scenarios this also triggers unintended failovers because the cluster loses confidence in the peer's data plane state.
This happens most often from a physically unplugged or damaged cross-connect cable, or from someone plugging the fabric interfaces into a switch instead of directly into each other. I've seen the latter scenario produce a link that appears up at Layer 1 but fails to pass fabric traffic correctly — making it particularly annoying to diagnose.
Start your investigation here:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster interfaces
Fabric link status:
fab0: Up
fab1: Down
Child links:
fab0.0: Up
If you see fab1: Down or both fabric links down, that's your culprit. Drill into the interface directly to confirm the physical state:
infrarunbook-admin@sw-infrarunbook-01> show interfaces fab1
Physical interface: fab1, Enabled, Physical link is Down
Interface index: 131, SNMP ifIndex: 508
Link-level type: Ethernet, MTU: 9192, Speed: Unspecified, Duplex: Unspecified
Input rate : 0 bps (0 pps)
Output rate : 0 bps (0 pps)
Syslog will usually confirm the timeline:
infrarunbook-admin@sw-infrarunbook-01> show log messages | match fabric
Apr 15 03:14:22 sw-infrarunbook-01 chassisd[1234]: CHASSISD_CLUS_FABRIC_LINK_DOWN: Chassis cluster fabric link down on node 1
The fix is straightforward for physical issues — reseat or replace the cable. If the interfaces are misconfigured, verify that the ports listed under your fabric member-interfaces config actually match the ports you've cross-connected. After fixing, confirm data plane sync has resumed:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster data-plane statistics
Ingress traffic statistics:
Total bytes: 9483920184
Total packets: 7294821
Fabric sync packets: 3847201
Fabric sync errors: 0
Zero fabric sync errors and incrementing sync packets tell you the fabric is healthy again.
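If you poll this command on an interval, you can turn "incrementing sync packets, zero errors" into an automated check. Here's a minimal Python sketch that compares two snapshots of the counter output; the parsing patterns assume the field names shown above, so adjust them to your platform's exact output:

```python
import re

def parse_dataplane_stats(output: str) -> dict:
    """Extract counters from 'show chassis cluster data-plane statistics' text."""
    stats = {}
    for line in output.splitlines():
        m = re.match(r"\s*(Fabric sync packets|Fabric sync errors):\s*(\d+)", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

def fabric_is_healthy(before: dict, after: dict) -> bool:
    """Healthy fabric: zero sync errors and a sync packet counter that is incrementing."""
    return (after.get("Fabric sync errors", 1) == 0
            and after.get("Fabric sync packets", 0) > before.get("Fabric sync packets", 0))

# Two polls of the same command, taken a few seconds apart
before = parse_dataplane_stats("Fabric sync packets: 3847201\nFabric sync errors: 0")
after = parse_dataplane_stats("Fabric sync packets: 3847950\nFabric sync errors: 0")
print(fabric_is_healthy(before, after))  # True
```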
Root Cause 2: Control Plane Redundancy Group Split Brain
Split brain is the most dangerous chassis cluster state you can encounter. It occurs when both nodes simultaneously believe they are the primary for redundancy group 0 — the control plane redundancy group. Both nodes start processing control plane traffic and making independent forwarding decisions, which leads to traffic blackholing, duplicate session tables, and routing instability that can take down an entire site.
Split brain happens when the control link between the two nodes fails. The control link carries heartbeat traffic — it's how each node confirms the other is alive. When it fails, each node assumes its peer has died and promotes itself to primary. In my experience this most commonly happens due to a failed control link cable, or in setups where the control link traverses a management switch and a VLAN or port security policy silently drops the heartbeat frames.
Detecting split brain is unambiguous once you know what to look for:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 3
node0 200 primary N N None
node1 100 primary N N None
Redundancy group: 1 , Failover count: 1
node0 100 primary N N None
node1 1 secondary N N None
Both nodes showing primary for redundancy group 0 is your smoking gun. Syslog will typically have logged the moment it happened:
infrarunbook-admin@sw-infrarunbook-01> show log messages | match "split brain|control link"
Apr 15 07:22:01 sw-infrarunbook-01 chassisd[1102]: CHASSISD_CLUS_CONTROL_LINK_DOWN: Chassis cluster control link is down
Apr 15 07:22:03 sw-infrarunbook-01 chassisd[1102]: CHASSISD_CLUS_SPLIT_BRAIN_DETECTED: Split brain detected for redundancy group 0
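Because split brain is unambiguous in the status output, it's also easy to detect programmatically. Here's a hedged Python sketch that scans "show chassis cluster status" text for any redundancy group reporting more than one primary; the field positions are assumed from the sample output above:

```python
import re

def find_split_brain(status_output: str) -> list:
    """Return redundancy-group IDs where more than one node reports 'primary'."""
    split = []
    rg, primaries = None, 0
    for line in status_output.splitlines():
        m = re.match(r"\s*Redundancy group:\s*(\d+)", line)
        if m:
            if rg is not None and primaries > 1:
                split.append(rg)
            rg, primaries = int(m.group(1)), 0
        elif re.match(r"\s*node\d+\s+\d+\s+primary\b", line):
            primaries += 1
    if rg is not None and primaries > 1:
        split.append(rg)
    return split

output = """Redundancy group: 0 , Failover count: 3
node0   200   primary   N   N   None
node1   100   primary   N   N   None
Redundancy group: 1 , Failover count: 1
node0   100   primary   N   N   None
node1   1     secondary N   N   None"""
print(find_split_brain(output))  # [0]
```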
The fix requires resolving the underlying control link failure first — physically inspect and restore the control port connection. Once the control link is back up, manually resolve the split brain by designating which node should be primary:
infrarunbook-admin@sw-infrarunbook-01> request chassis cluster failover redundancy-group 0 node 0
This makes node 0 the sole primary again and forces node 1 back to secondary. Once the cluster is stable, clear the manual failover flag with request chassis cluster failover reset redundancy-group 0. Always enable control-link-recovery in your cluster configuration if it isn't already — it allows the cluster to self-heal after a control link restoration rather than requiring manual intervention every time.
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster control-link-recovery
infrarunbook-admin@sw-infrarunbook-01# commit
Root Cause 3: Priority Not Set
This one catches a lot of people after fresh cluster deployments. If you don't explicitly configure node priorities, both nodes default to priority 0 for all redundancy groups. With equal priorities, the cluster falls back to an election mechanism that's essentially a coin flip — whichever node boots first or wins the internal tie-breaking logic becomes primary. You can't predict or control it.
Why does this matter? Your network topology almost certainly assumes a specific node is primary. Maybe node 0 sits in your primary data center and node 1 is in a DR facility. If node 1 wins the election because it completed its boot sequence three seconds faster, your traffic is now hairpinning across the WAN unnecessarily. And after every maintenance reboot, the outcome changes.
Check current priorities:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 0
node0 0 primary N N None
node1 0 secondary N N None
Redundancy group: 1 , Failover count: 0
node0 0 primary N N None
node1 0 secondary N N None
Both nodes showing priority 0 is the tell. Fix it by setting explicit priorities — higher number wins the election:
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 0 node 0 priority 200
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 0 node 1 priority 100
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 1 node 0 priority 100
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 1 node 1 priority 1
infrarunbook-admin@sw-infrarunbook-01# commit
After committing, confirm the output looks right:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 0
node0 200 primary N N None
node1 100 secondary N N None
Redundancy group: 1 , Failover count: 0
node0 100 primary N N None
node1 1 secondary N N None
One thing to keep in mind: setting priorities doesn't immediately trigger a failover if the wrong node is currently primary and preemption isn't enabled. That's exactly the next issue.
Root Cause 4: Preemption Not Enabled
Preemption controls whether the higher-priority node automatically reclaims the primary role after recovering from a failure. Without it, the cluster is sticky — once a failover has occurred, the lower-priority node stays primary even after the higher-priority node comes back fully online. For most production environments, this is not the behavior you want.
Here's the scenario I see play out repeatedly: node 0 (priority 100, your intended primary) goes down for a planned maintenance window. Node 1 correctly takes over as primary. Node 0 comes back up and rejoins the cluster as secondary — then just sits there while node 1 continues running the show, even though node 0 has a higher priority and is supposed to own the primary role. Your cluster is now in a state that doesn't match your design, and any subsequent problem with node 1 will be harder to reason about because you're no longer in your baseline configuration.
Check whether preemption is active:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 1 , Failover count: 2
node0 100 secondary N N None
node1 1 primary N N None
The Preempt: N column confirms it's not configured. Node 1 is primary despite holding priority 1, while node 0 has priority 100. Also verify in the config directly:
infrarunbook-admin@sw-infrarunbook-01# show chassis cluster redundancy-group 1
node 0 priority 100;
node 1 priority 1;
## preempt statement is absent
Enable preemption:
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 1 preempt
infrarunbook-admin@sw-infrarunbook-01# commit
After committing, the higher-priority node will reclaim primary status. Verify the outcome:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 1 , Failover count: 3
node0 100 primary Y N None
node1 1 secondary Y N None
Both nodes now show Preempt: Y and node 0 has reclaimed the primary role. One important operational note: enabling preemption on a live cluster will trigger an immediate failover if the current primary has a lower priority than the secondary. Plan this change for a maintenance window if there's active traffic you can't afford to interrupt.
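The preemption decision reduces to a few lines of logic. Here's a toy Python model of the behavior described above, an illustration of the rule rather than actual Junos code:

```python
def primary_after_recovery(p0: int, p1: int, current_primary: int, preempt: bool) -> int:
    """Which node holds primary after the failed node rejoins the cluster.
    p0/p1 are the configured priorities; current_primary is 0 or 1."""
    if not preempt:
        return current_primary       # sticky: no automatic fail-back
    return 0 if p0 > p1 else 1       # higher priority reclaims the primary role

# Node 0 (priority 100) has recovered while node 1 (priority 1) is primary:
print(primary_after_recovery(100, 1, current_primary=1, preempt=False))  # 1
print(primary_after_recovery(100, 1, current_primary=1, preempt=True))   # 0
```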
Root Cause 5: Interface Monitoring Not Configured
Interface monitoring is the mechanism that tells the cluster to fail over when a monitored interface goes down. Without it, failover only happens when the entire node fails — not when an upstream or downstream link fails. This is a significant operational gap that gets overlooked more often than it should.
The scenario that exposes this gap: the WAN uplink on your primary node goes down. Traffic stops flowing. But the node itself is still alive, the control link is up, the cluster is healthy from a node perspective — so no failover happens. Users are down. Your cluster did exactly what it was configured to do, which was nothing, because nobody told it that the uplink failure should trigger a failover.
Check whether interface monitoring is configured:
infrarunbook-admin@sw-infrarunbook-01# show chassis cluster redundancy-group 1
node 0 priority 100;
node 1 priority 1;
preempt;
## interface-monitor block is absent
And from the operational view:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 1 , Failover count: 0
node0 100 primary Y N None
node1 1 secondary Y N None
Monitor-failures: None
No interfaces listed in the monitor-failures column and nothing being watched means you have no interface monitoring at all. To configure it, identify the critical interfaces — typically the physical uplink ports that are members of your reth interfaces — and assign weights. The weight value represents how much the node's effective priority drops when that interface fails. When the cumulative weight of failed monitored interfaces reaches or exceeds 255, the cluster triggers a failover:
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 1 interface-monitor ge-0/0/0 weight 255
infrarunbook-admin@sw-infrarunbook-01# set chassis cluster redundancy-group 1 interface-monitor ge-0/0/1 weight 128
infrarunbook-admin@sw-infrarunbook-01# commit
With this config, losing ge-0/0/0 alone (weight 255) immediately triggers failover. Losing ge-0/0/1 alone (weight 128) doesn't — but losing both simultaneously pushes the combined weight past the 255 threshold and triggers failover. Set your weights based on the criticality of each link in your topology.
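The threshold arithmetic is simple enough to model directly. A quick Python sketch, where the weights mirror the example config above and 255 is the fixed Junos trigger point:

```python
# Weights mirroring the example interface-monitor config
WEIGHTS = {"ge-0/0/0": 255, "ge-0/0/1": 128}
THRESHOLD = 255  # Junos triggers failover when cumulative failed weight reaches 255

def failover_triggered(failed_interfaces, weights=WEIGHTS) -> bool:
    """True when the summed weight of failed monitored interfaces
    reaches or exceeds the 255 threshold."""
    return sum(weights.get(ifc, 0) for ifc in failed_interfaces) >= THRESHOLD

print(failover_triggered(["ge-0/0/0"]))                # True  (255 >= 255)
print(failover_triggered(["ge-0/0/1"]))                # False (128 <  255)
print(failover_triggered(["ge-0/0/0", "ge-0/0/1"]))    # True  (383 >= 255)
```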
After configuring, verify by simulating a failure during a maintenance window. Bring down a monitored interface and watch the cluster respond:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 1 , Failover count: 1
node0 0 secondary Y N ge-0/0/0
node1 1 primary Y N None
Monitor-failures: ge-0/0/0
The Monitor-failures: ge-0/0/0 field confirms the cluster detected the link failure and triggered a failover correctly. Node 1 is now primary. Restore the interface and confirm node 0 preempts back to primary as expected.
Root Cause 6: Heartbeat Packet Loss on Control Link
Even with the control link physically intact, I've seen clusters behave erratically because heartbeat packets are being silently dropped somewhere in the path. This is especially common when the control link traverses an intermediate management switch rather than being a direct crossover connection. A misconfigured VLAN, a port-security policy, or storm control on the switch can drop the multicast or broadcast heartbeat frames without generating any obvious error.
Check the control plane statistics to see whether heartbeats are being received:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster control-plane statistics
Control link statistics:
Control link 0:
Heartbeat packets sent: 847321
Heartbeat packets received: 847318
Heartbeat packet errors: 0
Control link 1:
Heartbeat packets sent: 12490
Heartbeat packets received: 0
Heartbeat packet errors: 3
Control link 1 is sending heartbeats but receiving zero back, with errors accumulating. This points squarely at a Layer 2 path problem rather than a physical cable issue. Check the switch port configuration between the nodes — verify the VLAN tagging is correct, no ACLs are filtering the traffic, and that storm control isn't rate-limiting or dropping the frames. If possible, convert the control link path to a direct crossover connection to eliminate the switch as a variable entirely.
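A sent/received asymmetry like this is easy to catch with a small parser. Here's a sketch, assuming the field layout in the sample output above:

```python
import re

def heartbeat_loss(output: str, tolerance: int = 5) -> dict:
    """Flag control links where received heartbeats lag sent by more than
    `tolerance`, from 'show chassis cluster control-plane statistics' text."""
    results, link = {}, None
    for line in output.splitlines():
        m = re.match(r"\s*Control link (\d+):", line)
        if m:
            link = int(m.group(1))
            results[link] = {"sent": 0, "received": 0}
        elif link is not None:
            m = re.match(r"\s*Heartbeat packets (sent|received):\s*(\d+)", line)
            if m:
                results[link][m.group(1)] = int(m.group(2))
    return {l: s for l, s in results.items() if s["sent"] - s["received"] > tolerance}

sample = """Control link 0:
  Heartbeat packets sent: 847321
  Heartbeat packets received: 847318
Control link 1:
  Heartbeat packets sent: 12490
  Heartbeat packets received: 0"""
print(heartbeat_loss(sample))  # {1: {'sent': 12490, 'received': 0}}
```

The small tolerance matters: a healthy cluster still shows a tiny sent/received gap from packets in flight, so alert only on a sustained asymmetry.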
Prevention
Most chassis cluster failover problems are preventable. They tend to surface either because a cluster was deployed without a complete configuration review, or because a change was made to one part of the environment without considering the cluster implications. A solid pre-production checklist eliminates the vast majority of these issues.
Before any cluster goes live, run all five of these commands and read the output critically:
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster status
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster interfaces
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster control-plane statistics
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster data-plane statistics
infrarunbook-admin@sw-infrarunbook-01> show chassis cluster statistics
You're looking for: fabric links up on both nodes, heartbeats incrementing symmetrically on both control links, non-zero and non-equal priorities on each redundancy group, preemption showing Y, and interface-monitor entries present for your critical uplinks. If any of those conditions aren't met, fix them before traffic goes live.
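Those checks lend themselves to automation. Here's a hedged Python harness: `run_cli` is a placeholder callable you supply (backed by SSH, NETCONF, or PyEZ), and the pass/fail predicates are deliberately simple string checks keyed to the output formats shown in this article, so tighten them for your platform and Junos version:

```python
import re

def run_checks(run_cli) -> list:
    """Run the go-live commands and return the names of checks that failed.
    `run_cli` is any callable that executes a Junos CLI command string on the
    device and returns its text output."""
    checks = [
        ("fabric links up",
         "show chassis cluster interfaces",
         lambda out: "Down" not in out),
        ("no heartbeat errors",
         "show chassis cluster control-plane statistics",
         lambda out: "errors: 0" in out),
        ("no fabric sync errors",
         "show chassis cluster data-plane statistics",
         lambda out: "Fabric sync errors: 0" in out),
        ("explicit priorities set",
         "show chassis cluster status",
         lambda out: not re.search(r"node\d+\s+0\s", out)),
        ("preemption enabled",
         "show chassis cluster status",
         lambda out: re.search(r"primary\s+Y", out) is not None),
    ]
    return [name for name, cmd, ok in checks if not ok(run_cli(cmd))]

# Dry run against canned output instead of a live device
fake_outputs = {
    "show chassis cluster interfaces": "fab0: Up\nfab1: Up",
    "show chassis cluster control-plane statistics": "Heartbeat packet errors: 0",
    "show chassis cluster data-plane statistics": "Fabric sync errors: 0",
    "show chassis cluster status":
        "node0 200 primary Y N None\nnode1 100 secondary Y N None",
}
print(run_checks(fake_outputs.get))  # []
```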
Always test failover before production traffic is on the cluster. Use request chassis cluster failover redundancy-group 1 node 1 to manually trigger a failover, verify that traffic shifts as expected, then fail back and confirm the preemption logic works. This end-to-end test validates the full path: interface monitoring, priority logic, session synchronization over the fabric link, and reth interface migration. Skipping this step is how you discover a misconfiguration at 2am during an actual outage.
Set up monitoring for CHASSISD_CLUS_* syslog messages. Chassis cluster events — fabric link down, failovers, split brain detection, node isolation — should alert your on-call team immediately, not sit in a log file. By the time someone manually checks the logs, the damage may already be compounding. A properly monitored cluster turns a potential outage into a brief, expected failover event. An unmonitored cluster turns a brief failover into an extended incident when the secondary also has a problem nobody knew about.
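As a starting point for that alerting, here's a hypothetical Python classifier built around the message tags shown in this article's examples; your syslog pipeline and the exact tag names on your Junos version may differ:

```python
import re

# Tags that should page immediately; any other CHASSISD_CLUS_* event opens a ticket.
CRITICAL = ("SPLIT_BRAIN", "CONTROL_LINK_DOWN", "FABRIC_LINK_DOWN", "NODE_ISOLATION")

def classify(syslog_line: str):
    """Return 'page', 'ticket', or None for a raw syslog line."""
    m = re.search(r"(CHASSISD_CLUS_\w+)", syslog_line)
    if not m:
        return None
    return "page" if any(k in m.group(1) for k in CRITICAL) else "ticket"

line = ("Apr 15 07:22:03 sw-infrarunbook-01 chassisd[1102]: "
        "CHASSISD_CLUS_SPLIT_BRAIN_DETECTED: Split brain detected")
print(classify(line))  # page
```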
Finally, document your intended cluster state explicitly: which node should be primary for each redundancy group, which interfaces are monitored and why, and what your expected failover behavior is after each type of failure. This documentation becomes invaluable when someone unfamiliar with the cluster is troubleshooting at 3am — and it's also a forcing function for making sure you've actually thought through all of these configuration points before something goes wrong.
