High Availability — Active-Passive Pair

AiFw supports a two-node active-passive cluster: one master forwarding production traffic, one backup with replicated pf state, ready to take over within seconds of any failure on the master.

What survives a master reboot

  • TCP sessions through the firewall: yes. pfsync replicates state to the backup, and state-policy floating lets the replicated states match traffic on the new master's interface.
  • WireGuard tunnels: yes, with a caveat. wireguard-go binds to the wildcard address, so CARP VIPs are accepted automatically; existing peers reconnect within ~5 s provided the remote peers set PersistentKeepalive ≤ 5 (it is off by default, and peers without keepalive only reconnect on their next outbound traffic). A sample peer config is sketched after this list.
  • DHCP leases (rDHCP): yes. rDHCP HA handles its own state replication; AiFw's dhcp_link flag keeps the peer list in sync.
  • ACME certificates: yes. Renewal happens on the master only; on success the certificate and key are pushed to peers via POST /api/v1/cluster/cert-push.
  • In-flight DNS lookups: no. Expect a small visible glitch during the failover window; most resolvers retry transparently.
  • IDS in-memory ring buffer: no. Rule overrides and suppressions are replicated; the runtime alert ring buffer is not.
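
To make the keepalive requirement concrete, a remote peer's WireGuard config would carry a line like the sketch below. The key, endpoint, port, and AllowedIPs values are placeholders, not values AiFw manages; the only line that matters for failover behaviour is PersistentKeepalive.

# Remote peer's WireGuard config (excerpt); all values are placeholders
[Peer]
PublicKey = <aifw-node-public-key>
Endpoint = <carp-wan-vip>:51820        # point at the CARP VIP, not a node's physical address
AllowedIPs = 10.8.0.0/24
PersistentKeepalive = 5                # <= 5 s so peers re-establish promptly after a failover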

Prerequisites

Security considerations

The replication channel between cluster nodes is treated as TRUSTED. Specifically:

If your network does not satisfy “trusted pfsync segment,” do not enable HA replication.

Setup

  1. Install AiFw on both nodes via the ISO build.
  2. On node A, in the first-boot wizard:
    • Answer Yes to “Configure HA pair?”
    • Choose Primary.
    • Select the pfsync interface, peer IP, password, and per-LAN/WAN VIPs.
  3. On node B: run the same wizard, choose Secondary, enter the same VHIDs and password.
  4. After both nodes are up: visit https://<node-A-mgmt-ip>/cluster and confirm both nodes appear in the table with a green health status.
  5. Verify with aifw cluster verify on each node, then run scripts/ha-verify.sh node-a node-b over SSH for a pair-wide check.
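
For a quick manual cross-check of step 4, the CARP role of each VIP is also visible from a shell; the output line below is the usual FreeBSD-style ifconfig format and is shown for orientation only.

# On node A expect MASTER, on node B expect BACKUP for the same vhid
ifconfig | grep "carp:"
#   carp: MASTER vhid 1 advbase 1 advskew 0   (example output)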

Latency profiles

The pfsync.latency_profile setting controls CARP advertisement timing and therefore the detection window for unplanned failures.

Profile                  advbase   secondary advskew   Detection time   Use when
Conservative (default)   1         100                 ~3 s             Default; tolerates flaky networks.
Tight                    1         20                  ~1.5 s           Reliable network with a dedicated pfsync link.
Aggressive               1         10                  ~1 s             Requires a future heartbeat daemon; schema-only in this release.

The primary node always uses advskew=0 regardless of profile. Set the profile via the CLI:

aifw cluster pfsync set --latency-profile tight

Or via the API (PUT /api/v1/cluster/pfsync).
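
For example, a hedged version of that API call might look like the following; the JSON field name and the bearer-token auth are assumptions based on the pfsync.latency_profile setting name, not a confirmed request schema.

# Field name and auth header are illustrative; adjust to the actual schema
curl -k -X PUT "https://<node-a-mgmt-ip>/api/v1/cluster/pfsync" \
  -H "Authorization: Bearer $AIFW_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"latency_profile": "tight"}'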

Minimizing the unplanned-failure gap

For planned reboots and service restarts, AiFw demotes CARP (sysctl net.inet.carp.demotion=240) before tearing down the local data plane; this is handled by the aifw_demote_on_shutdown rc.d script and the per-service stop preludes. The peer takes over within ~1 s, so a reboot of the master typically misses zero to two packets.

For unplanned failures (power loss, kernel panic, NIC death), the gap depends on CARP timer detection:

Without a UPS, a hard power loss results in:

Split-brain handling

If the pfsync link fails but both nodes stay up, both may temporarily think they are MASTER (a “split brain”). When the link reconnects:

There is no application-layer node-id tiebreaker; the design relies on CARP's deterministic timer comparison plus preemption. If both nodes happen to be configured with identical advskew (a misconfiguration), CARP cannot resolve the conflict deterministically, and operators must manually demote one node via aifw cluster demote until the misconfiguration is corrected.
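
A minimal way to spot a dual-MASTER condition and break it by hand, using only commands already described in this document (the grep pattern assumes FreeBSD-style ifconfig output):

# Run on both nodes; if the same vhid shows MASTER on both, you have a split brain
ifconfig | grep "carp:"

# On the node that should yield (typically the configured Secondary), step down
aifw cluster demote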

Operations

Planned maintenance / rolling upgrade

Always upgrade the standby first.

# On the standby
aifw update install --restart

aifw update install --restart runs service <name> restart for each managed service. The rc.d stop function for aifw_daemon, aifw_api, and aifw_ids includes a prelude (added in #220) that sets net.inet.carp.demotion=240 and sleeps 1 second before killing the service, so the peer takes over as CARP master before the local data plane drops.
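
As a rough illustration of that prelude (a simplified sketch of the behaviour described above, not the shipped rc.d source):

# Simplified sketch of the per-service stop prelude
aifw_stop_prelude()
{
    # Demote CARP so the peer wins the next advertisement comparison
    sysctl net.inet.carp.demotion=240
    # Give the peer one advertisement interval to take over MASTER
    sleep 1
}
# ...after which the normal rc.d stop routine kills the service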

After the standby is healthy on the new version, fail over manually if needed and repeat on the (now) standby:

aifw cluster demote          # on the current master, hands master to peer
aifw update install --restart  # on the (now) standby, upgrades the second node

Confirm version drift is gone via aifw cluster nodes list (or the dashboard’s per-node panel — the software_version field shows the running version of each node).
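
For a scripted spot-check, something like the line below can work, assuming the JSON status output includes a software_version field per node (the field name comes from the dashboard description above; the exact JSON layout is not guaranteed):

# Field name is an assumption; adjust to the real output of aifw cluster status --json
aifw cluster status --json | grep -o '"software_version": *"[^"]*"'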

Manual promote / demote

aifw cluster demote   # this node becomes BACKUP  (sysctl carp.demotion=240)
aifw cluster promote  # this node becomes MASTER  (sysctl carp.demotion=0)

Demote the current master before promoting the standby to avoid a brief split-brain window.
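
Put as a concrete sequence (host names are placeholders):

# 1. On the current master (node-a): step down first
aifw cluster demote

# 2. Confirm the peer has picked up MASTER
ssh node-b 'ifconfig | grep "carp:"'

# 3. Only if it has not preempted on its own, promote the peer explicitly
ssh node-b 'aifw cluster promote'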

Decommission a node

aifw cluster nodes remove <node-id>

Obtain the <node-id> from aifw cluster nodes list. The remaining node continues to run on its own; its role changes to Standalone.

Force a config sync

aifw cluster sync     # this node pulls the current snapshot from the primary

The dashboard’s Force sync from peer button does the same thing. Use this when the standby’s cluster_snapshot_state.last_applied_hash doesn’t match the master’s live config hash and the next replicator tick is too far away.
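
In practice a forced sync followed immediately by a local verify is enough to confirm the standby has caught up (check 5 in the Verifying section below covers the snapshot hash):

# On the standby
aifw cluster sync
aifw cluster verify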

Show cluster status

aifw cluster status
aifw cluster status --json

Verifying

# On either node — exits 0 healthy, exits 1 with reason on failure
aifw cluster verify

# Machine-readable output (used by the harness)
aifw cluster verify --json | python3 -m json.tool

The verify command checks:

  1. pf state-policy floating is set.
  2. pfsync0 interface is UP.
  3. At least one CARP VIP is configured (carp: line in ifconfig).
  4. /api/v1/cluster/status reports peer_reachable: true.
  5. A config snapshot hash is present (replication is not stalled).
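
When the CLI is unavailable, rough manual equivalents of checks 2 to 4 look like the lines below; the curl invocation and auth header are illustrative, and checks 1 and 5 have no simple one-liner.

# Check 2: pfsync0 is UP
ifconfig pfsync0 | grep UP

# Check 3: at least one CARP VIP is configured
ifconfig | grep "carp:"

# Check 4: peer reachability as reported by the local API (auth is illustrative)
curl -sk -H "Authorization: Bearer $AIFW_API_TOKEN" \
  https://localhost/api/v1/cluster/status | grep -o '"peer_reachable": *[a-z]*'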

Pair-wide check via SSH

sh scripts/ha-verify.sh node-a.example.com node-b.example.com

The harness asserts:

Exit codes:

Code   Meaning
0      Pair healthy
1      A node was unreachable, or aifw cluster verify returned a non-zero exit
2      At least one node failed its local checks (ok=false)
3      Expected exactly 1 MASTER, got a different count (0 or 2)
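
A small wrapper that reacts to these exit codes, for example from cron or a CI job (the echo lines are placeholders for whatever alerting you use):

sh scripts/ha-verify.sh node-a.example.com node-b.example.com
case $? in
  0) echo "HA pair healthy" ;;
  1) echo "ALERT: node unreachable or verify errored" ;;
  2) echo "ALERT: a node failed its local checks" ;;
  3) echo "ALERT: unexpected MASTER count" ;;
esac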

Out of scope (this release)

See also

Last updated: