What FortiGate HA Actually Is
FortiGate High Availability uses Fortinet's proprietary FortiGate Clustering Protocol — FGCP — to bind two or more physical or virtual firewall appliances into a single logical unit. From the network's perspective, the cluster looks like one device: one management plane, one policy set, one routing table. Behind the scenes, FGCP handles everything from configuration synchronization to session state replication to failover detection and election.
There are two operating modes: Active-Passive (AP) and Active-Active (AA). The naming is intuitive enough that people think they understand the difference immediately, and that's exactly where the trouble starts. I've seen engineers deploy AA clusters expecting horizontal throughput scaling and then wonder why their firewall pair isn't pushing twice the advertised line rate. Understanding what each mode actually does — not what the name implies — is what separates a working cluster from an expensive pair of underperforming appliances.
How Active-Passive Works
In Active-Passive mode, one unit is the primary and all remaining units are secondary. The primary handles all traffic. The secondary units sit idle, continuously receiving synchronized session tables, routing information, and configuration updates from the primary — but they forward exactly zero packets under normal operating conditions.
FGCP uses dedicated heartbeat links between cluster members. Fortinet recommends two separate physical interfaces for heartbeat traffic, and in production you absolutely should use two. A single heartbeat link is a single point of failure for the clustering mechanism itself, which defeats the purpose. Heartbeats are sent every 200 milliseconds by default. If a secondary misses enough consecutive heartbeats from the primary (six, by default), it promotes itself and takes over.
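Both the heartbeat interval and the loss threshold are tunable under config system ha. The values below are the FortiOS defaults, shown as a sketch; check your version's CLI reference for the exact ranges:

config system ha
    set hb-interval 2
    set hb-lost-threshold 6
end

hb-interval is in units of 100 milliseconds, so 2 means 200 ms, and detection time is roughly interval times threshold, about 1.2 seconds at defaults. Lowering either value speeds detection but risks false failovers if the heartbeat links ever see congestion.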
When failover happens, it's not instantaneous. The secondary detects the missed heartbeats, wins the election based on configured priority and unit serial number, assumes the virtual MAC addresses that the primary was using, and begins processing traffic. Existing TCP sessions that were synchronized before the failure resume on the new primary. Failover typically completes in under a second in well-tuned environments, though I've seen poorly configured deployments take 3–5 seconds — enough to drop SIP calls and reset BGP sessions if your hold timers are tight.
The virtual MAC address behavior is worth understanding in detail. FGCP assigns virtual MACs to cluster interfaces so that when failover occurs, the new primary can answer ARP requests with the same MAC address that downstream switches and routers already have cached. Hosts' ARP caches therefore stay valid across the failover; the new primary still sends gratuitous ARPs, but only so the switches relearn which port the virtual MAC now lives behind, not to repopulate every host's ARP table. The virtual MACs are derived from the HA group ID, which means if you have two separate clusters on the same Layer 2 segment, you must configure different group IDs or you'll get MAC collisions.
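You can verify from the CLI whether an interface is presenting its virtual or its burned-in MAC. A quick check (exact output field names vary slightly by model and FortiOS version):

diagnose hardware deviceinfo nic port1

Compare Current_HWaddr against Permanent_HWaddr in the output. On a cluster member, Current_HWaddr should show the FGCP virtual MAC, which starts with Fortinet's 00:09:0f prefix and embeds the HA group ID, while Permanent_HWaddr is the factory address.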
How Active-Active Works
Active-Active mode changes the forwarding model in a meaningful way. All cluster members are active simultaneously and can process traffic. The primary unit is still elected and it still handles management, routing protocol adjacencies, and cluster coordination — but when it receives a new session, it can forward that session to a secondary unit for actual processing using FGCP's internal load balancing mechanism.
Here's the part that consistently trips people up: load balancing in AA mode is per-session, not per-packet. Once a session is assigned to a cluster member, all packets for that session go through that same member. The primary intercepts new sessions, makes a scheduling decision, and then encapsulates and forwards the session to the assigned member. The assigned member processes the traffic — including any UTM inspection, IPS, application control, SSL decryption — and sends it out its own interface directly.
The load balancing algorithm is configurable. Options include round-robin, weighted round-robin, random, IP, and IP-port. In my experience, IP-based or IP-port-based scheduling produces more predictable behavior for troubleshooting, because you can predict which cluster member is handling a given flow. Round-robin works fine in lab environments but makes live troubleshooting painful when you're trying to correlate logs across two or more units and the session could be on either one.
There's a critical architectural point here that catches people off guard: in AA mode, traffic can enter one cluster member and exit a different one. This means all cluster members need full Layer 2 connectivity to each other on all data-plane interfaces, not just the dedicated heartbeat links. FGCP handles the inter-unit encapsulation internally, but your physical topology has to support it. If you've connected the cluster to a pair of switches with port security policies or strict VLAN filtering, you need to account for this cross-unit forwarding behavior or you'll see asymmetric drops.
Configuration Synchronization in Both Modes
In both AP and AA, configuration is synchronized from the primary to all secondaries automatically. You make changes on the primary and FGCP pushes them cluster-wide. This is largely seamless, but there are edge cases worth knowing. Interface-specific settings like IP addresses are not synchronized — each unit retains its own management IP so you can still reach each box individually. SNMP traps, NTP, and a handful of other system-level parameters do synchronize, but the per-unit management interface stays unique by design.
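If you want a truly independent management path per unit, FortiOS also supports a reserved HA management interface that is excluded from synchronization entirely. A minimal sketch, assuming the dedicated MGMT port and an example gateway address (adjust interface name and addressing for your environment):

config system ha
    set ha-mgmt-status enable
    config ha-mgmt-interfaces
        edit 1
            set interface "mgmt"
            set gateway 192.0.2.1
        next
    end
end

Each unit then keeps its own IP and its own default route on that interface, so you can reach the secondary even when the cluster is misbehaving.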
Session synchronization runs continuously in both modes. The primary replicates the session table to secondaries so that in-flight connections survive a failover. Not every session type synchronizes with equal fidelity. UDP sessions, ICMP, and short-lived TCP connections often don't survive failover cleanly — they're either too short-lived to be worth synchronizing or they complete before the failover window closes. Long-lived TCP sessions like database connections and persistent HTTP/2 streams tend to survive failover when synchronization is working correctly.
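Which sessions get synchronized is itself configurable. The common knobs, sketched here (option availability varies by FortiOS version):

config system ha
    set session-pickup enable
    set session-pickup-connectionless enable
    set session-pickup-delay disable
end

session-pickup must be enabled for any TCP session survival at all; session-pickup-connectionless extends synchronization to UDP and ICMP; and enabling session-pickup-delay skips sessions younger than roughly 30 seconds, which cuts sync overhead but guarantees that exactly those short-lived sessions won't survive a failover.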
Why the Mode Choice Matters
The decision between AP and AA comes down to what problem you're actually trying to solve. If your primary concern is resilience — surviving a hardware failure without dropping production traffic — Active-Passive is the simpler, more predictable choice. There are fewer moving parts, troubleshooting is straightforward because only one unit is ever forwarding traffic at a time, and failover behavior is well-understood and consistent.
If you're pushing against the throughput ceiling of a single FortiGate unit and you need more processing capacity for UTM-heavy workloads, Active-Active can help. Environments running IPS, application control, SSL inspection, and antivirus simultaneously can benefit from distributing that compute load. Each FortiGate has dedicated NP (Network Processor) and CP (Content Processor) ASICs, and AA mode lets you leverage those ASICs across multiple physical boxes.
But AA introduces complexity that AP doesn't. Asymmetric routing becomes a real operational concern. Log correlation gets harder because a single user's traffic may be spread across multiple units. Certain FortiGate features behave differently or carry limitations in AA mode — ZTNA proxy mode has specific AA constraints, and some NAT scenarios create issues when the unit that received the session isn't the one holding the NAT translation table entry. Don't let the throughput promise push you into AA without honestly evaluating that complexity cost.
Real-World Deployment Scenarios
In my experience, the vast majority of enterprise FortiGate deployments run Active-Passive. A typical setup might involve two FortiGate 200F units connected to a pair of core switches, with heartbeat links running over a dedicated VLAN or direct crossover cable, and a separate out-of-band management network on the MGMT interface. The primary handles all traffic, the secondary syncs sessions and config, and failover completes in under a second.
For a site managed under solvethenetwork.com, with the primary unit's CLI reached as infrarunbook-admin over the console connected to sw-infrarunbook-01, the HA configuration would typically look like this:
config system ha
    set mode a-p
    set group-name "solvethenetwork-ha"
    set group-id 1
    set password ENC <hash>
    set hbdev "port9" 50 "port10" 50
    set session-sync-dev "port9" "port10"
    set priority 200
    set override enable
    set monitor "port1" "port2" "port3"
end
The monitor setting is something engineers consistently forget to configure properly. It tells FGCP which interfaces to watch. If a monitored interface goes down on the primary — say, the uplink to the core switch fails — it triggers a failover even if the unit itself is fully alive. Without proper monitoring, you can have a scenario where your uplink dies and the firewall sits there processing nothing while the secondary waits for a heartbeat it never loses. That's a silent outage that's infuriating to diagnose at 2 AM.
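Interface monitoring only watches link state. For upstream failures that leave the local link up — a dead device two hops away, for instance — FGCP also supports remote link monitoring via ping servers. A sketch of the HA side (threshold and timeout values are illustrative):

config system ha
    set pingserver-monitor-interface "port1"
    set pingserver-failover-threshold 5
    set pingserver-flip-timeout 60
end

These settings work together with a config system link-monitor entry on port1 that has ha-priority set; the link monitor does the actual probing, and its HA priority points accumulate toward the failover threshold when probes fail.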
For an Active-Active deployment, you'd set the mode and add the load balancing configuration:
config system ha
    set mode a-a
    set group-name "solvethenetwork-ha"
    set group-id 2
    set password ENC <hash>
    set hbdev "port9" 50 "port10" 50
    set session-sync-dev "port9" "port10"
    set load-balance-all enable
    set schedule ip-port
    set priority 200
    set override enable
    set monitor "port1" "port2" "port3"
end
The load-balance-all enable option extends load balancing to firewall-policy-only sessions, not just sessions hitting UTM profiles. Without it, only UTM-inspected sessions get distributed. Whether you want this depends on your traffic profile. In some environments, enabling it for all sessions actually increases inter-unit encapsulation overhead enough to offset the distribution benefit on low-inspection traffic. Profile before you enable.
A use case where AA genuinely makes sense is a colocation environment running as a managed security provider, where a single FortiGate pair is servicing multiple customer VDOMs with heavy SSL inspection across all of them. The per-session distribution provides real throughput gains in that scenario because the bottleneck is the CP processor doing decryption and re-encryption, not the NP doing hardware forwarding. Spreading that CP workload across two units gives you meaningful headroom.
Checking HA Status and Diagnosing Issues
Day-to-day HA health monitoring comes down to a handful of commands. The one I run first when checking cluster health is:
get system ha status
This gives you the cluster mode, member count, which unit is primary, uptime, and synchronization status. If you see a member showing as out of sync, that's your first indicator that something is wrong with configuration or session replication. Follow it up with:
diagnose sys ha checksum cluster
This shows the configuration checksum for each cluster member. Mismatched checksums mean the cluster will attempt to re-sync automatically, but persistent mismatches may require a manual trigger:
execute ha synchronize start
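When a secondary stays stubbornly out of sync, it helps to inspect it directly. You can hop from the primary's CLI to a subordinate unit over the heartbeat link; the index comes from get system ha status, and the admin name is whatever account exists on that unit:

execute ha manage 1 admin

From the secondary's shell, running diagnose sys ha checksum show on both units lets you pinpoint which configuration section the checksums disagree on, rather than guessing.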
For verifying which unit is handling a specific session in AA mode, run a packet capture on both units simultaneously. FortiGate's built-in sniffer is reliable for this — connect to each unit individually and run:
diagnose sniffer packet any "host 10.10.10.50" 4 0 l
Whichever unit shows the traffic is the one processing that session. It's a quick way to confirm your load balancing is actually distributing sessions rather than pinning everything to the primary.
Common Misconceptions
The biggest misconception I run into is that Active-Active doubles your throughput. It doesn't — at least not in any straightforward, guaranteed way. The primary unit is still the ingress point for all new sessions. It's still doing the initial session lookup, the policy match, and the routing decision before deciding whether to offload the session to a secondary. The primary can become a bottleneck in AA mode, particularly at high session setup rates. You're not halving the workload on the primary — you're offloading the sustained inspection work, not the session establishment overhead.
A related misconception is that all UTM features benefit equally from AA distribution. SSL inspection is the heavy hitter. If you're doing full SSL inspection on a large percentage of your traffic, AA can genuinely help because the CP processor work for decryption and re-encryption gets distributed. For standard IPS without SSL inspection, the NP ASICs handle much of that work in hardware anyway, and the distribution gains are often smaller than expected.
There's also persistent confusion about what HA means for uptime. HA means reduced downtime. It doesn't mean zero downtime. There's still a failover window — typically 1–3 seconds in a well-configured AP cluster. BGP sessions with short hold timers will reset. OSPF adjacencies may drop momentarily. Stateful application sessions should survive if they were synchronized, but applications that can't tolerate even a brief TCP pause will have problems. Design your application tier with this reality in mind rather than assuming the firewall failover is completely transparent to everything above it.
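For BGP specifically, graceful restart can bridge the failover window so the peer keeps forwarding while the session re-establishes. A minimal sketch on the FortiGate side — the peer must support graceful restart as well, and per-neighbor settings may also apply in your FortiOS version:

config router bgp
    set graceful-restart enable
end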
IPsec IKE security associations do synchronize in FortiOS 7.x and later — that's a significant improvement over earlier versions where VPN tunnels required full re-negotiation after failover. But SSL VPN sessions are not synchronized the same way, and users may need to re-authenticate depending on your FortiOS version and the specific SSL VPN mode in use.
Finally, some engineers believe you can mix FortiGate models in an FGCP cluster. You can't. Both units must be identical hardware models running the same FortiOS major version, and ideally the same minor version. Running mismatched firmware between primary and secondary is supported temporarily during a rolling upgrade, but it's a transient state with feature limitations — not a configuration you want to leave running indefinitely.
Making the Decision
If you're sizing a new deployment and choosing between AP and AA, start with Active-Passive. It's simpler to deploy, simpler to troubleshoot, and covers the resilience requirement that most organizations actually need from a firewall cluster. If throughput benchmarking shows you're consistently hitting 70–80% of single-unit capacity under full UTM load — and the bottleneck is specifically CP-bound inspection work — then AA is worth evaluating. But profile your traffic first before committing to the added complexity.
Whatever mode you choose, invest time in heartbeat link design, interface monitoring configuration, and failover testing. A cluster that's never been failed over in a controlled maintenance window is a cluster that will surprise you during a production incident. Schedule the test, manually trigger a failover, verify traffic resumes cleanly, and write down your actual observed failover time. That number — measured in your environment with your specific traffic profile — is the one that matters for RTO planning, not the sub-second figure in a datasheet.
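Recent FortiOS releases provide a clean way to force that controlled failover without pulling cables or rebooting — the commands below exist in 6.2 and later, while older builds relied on diagnose sys ha reset-uptime instead:

execute ha failover set 1
execute ha failover unset 1

The set command forces the current primary to step down; unset returns the cluster to normal election behavior. Run the pair inside your maintenance window and time the traffic gap yourself.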
