High Availability — Active-Passive Pair
AiFw supports a two-node active-passive cluster: one master forwarding production traffic, one backup with replicated pf state, ready to take over within seconds of any failure on the master.
What survives a master reboot
| Component | Survives | Notes |
|---|---|---|
| TCP sessions through the firewall | yes | pfsync replicates state to the backup; state-policy floating lets replicated states match traffic on the new master’s interface |
| WireGuard tunnels | yes* | wireguard-go binds wildcard so CARP VIPs are accepted automatically; existing peers reconnect within ~5 s provided remote peers have PersistentKeepalive ≤ 5 (the default is off — peers without keepalive only reconnect on next outbound traffic) |
| DHCP leases (rDHCP) | yes | rDHCP HA handles its own state replication; AiFw’s dhcp_link flag keeps the peer list in sync |
| ACME certificates | yes | renewal happens master-only; on success the cert+key are pushed to peers via POST /api/v1/cluster/cert-push |
| In-flight DNS lookups | no | small visible glitch during the failover window — most resolvers retry transparently |
| IDS in-memory ring buffer | no | rule overrides and suppressions are replicated; runtime alert ring buffer is not |
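For the WireGuard row above, fast tunnel recovery depends on the remote peers rather than the firewall. A minimal sketch of enabling keepalive on a remote peer with the stock `wg` tool (the interface name and key are placeholders):

```sh
# On the REMOTE peer: keepalive <= 5 s lets the tunnel re-establish within
# ~5 s of a failover instead of waiting for the next outbound packet.
wg set wg0 peer '<firewall-public-key>' persistent-keepalive 5
# Or persist it in the peer's config file: PersistentKeepalive = 5
```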
Prerequisites
- Two AiFw nodes on the same broadcast domain. CARP advertisements use multicast (224.0.0.18); the nodes must see each other’s L2 traffic.
- Dedicated NIC for pfsync. Not strictly required, but state-sync traffic on a shared LAN link will degrade under load. A point-to-point cable between two NICs is ideal.
- Synchronized time. Both nodes should run NTP / `rtime` (the AiFw companion service). CARP advertisement timing is sensitive to clock drift.
- Same software version. Always upgrade the standby first; the cluster dashboard surfaces version drift between nodes.
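Before enabling HA, you can confirm the shared-broadcast-domain requirement by watching for the peer’s CARP advertisements; a quick sketch (the interface name is a placeholder):

```sh
# Run on node B while node A is up: node A's CARP advertisements should
# appear as multicast traffic to 224.0.0.18 on the shared segment.
tcpdump -ni em1 host 224.0.0.18
```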
Security considerations
The replication channel between cluster nodes is treated as TRUSTED. Specifically:
- Snapshot pushes carry secrets in plaintext. The full firewall config is replicated across nodes, including VPN (WireGuard) private keys, preshared keys, DDNS TSIG keys, rDHCP HA TLS material, and the CARP shared password.
- TLS verification is disabled on inter-node calls. Each node accepts self-signed certificates from peers. Anyone with network-layer access to the pfsync segment can MITM these calls and recover the secrets above.
- Mitigation: use a dedicated, physically-secured pfsync NIC. A back-to-back cable between the two nodes, or a VLAN that no other host can reach, is the recommended configuration.
- Per-peer API keys. Each node holds an API key for the peer it pushes to. These are stored in plaintext in the local SQLite DB; treat the DB file as a credential store.
- `/usr/local/etc/aifw/daemon.key` holds the loopback API key used by daemon background tasks. File mode `640`, owned `root:aifw`. The aifw user must have read access; no other local user should.
- Future cert pinning would close the MITM gap. If a follow-up issue has not yet been filed, open one to track this work.
If your network does not satisfy “trusted pfsync segment,” do not enable HA replication.
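A quick audit of the documented key-file expectations (path, mode, and ownership are as stated above):

```sh
# Expect mode 640 and root:aifw ownership
ls -l /usr/local/etc/aifw/daemon.key
# Restore the documented settings if they have drifted
chown root:aifw /usr/local/etc/aifw/daemon.key
chmod 640 /usr/local/etc/aifw/daemon.key
```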
Setup
- Install AiFw on both nodes via the ISO build.
- On node A, in the first-boot wizard:
  - Answer Yes to “Configure HA pair?”
  - Choose Primary.
  - Select the pfsync interface, peer IP, password, and per-LAN/WAN VIPs.
- On node B: run the same wizard, choose Secondary, enter the same VHIDs and password.
- After both nodes are up: visit `https://<node-A-mgmt-ip>/cluster` and confirm both nodes appear in the table with a green health status.
- Verify with `aifw cluster verify` on each node, then run `scripts/ha-verify.sh node-a node-b` over SSH for a pair-wide check.
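The same health information is available from the status endpoint used by the verify checks later in this page; a minimal sketch (add whatever API credentials your deployment requires; `-k` is needed because inter-node certificates are self-signed):

```sh
# From either node: peer_reachable should be true once the pair is healthy.
curl -k https://<node-A-mgmt-ip>/api/v1/cluster/status
```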
Latency profiles
The pfsync.latency_profile setting controls CARP advertisement timing and therefore the detection window for unplanned failures.
| Profile | advbase | secondary advskew | Detection time | Use when |
|---|---|---|---|---|
| Conservative (default) | 1 | 100 | ~3 s | Default. Tolerates flaky networks. |
| Tight | 1 | 20 | ~1.5 s | Reliable network with dedicated pfsync link. |
| Aggressive | 1 | 10 | ~1 s | Requires future heartbeat daemon — schema-only this release. |
The primary node always uses advskew=0 regardless of profile. Set the profile via the CLI:
```sh
aifw cluster pfsync set --latency-profile tight
```
Or via the API (PUT /api/v1/cluster/pfsync).
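A sketch of that API call; the JSON field name is an assumption mirroring the `pfsync.latency_profile` setting:

```sh
# Assumed payload shape; -k because the API serves a self-signed certificate.
curl -k -X PUT -H 'Content-Type: application/json' \
  -d '{"latency_profile": "tight"}' \
  https://<mgmt-ip>/api/v1/cluster/pfsync
```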
Minimizing the unplanned-failure gap
For planned reboots and service restarts, AiFw demotes CARP via `sysctl net.inet.carp.demotion=240` before tearing down the local data plane (the `aifw_demote_on_shutdown` rc.d script plus per-service stop preludes). The peer takes over within ~1 s, so a reboot of the master typically misses zero to two packets.
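You can watch the demotion state directly via the same sysctl the shutdown prelude sets:

```sh
# 0 = advertising at full priority; 240 = demoted, the peer preempts.
sysctl net.inet.carp.demotion
# What the shutdown prelude effectively does before stopping services:
sysctl net.inet.carp.demotion=240
```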
For unplanned failures (power loss, kernel panic, NIC death), the gap depends on CARP timer detection:
- UPS on each node is the single biggest reliability win. It converts power loss into a graceful shutdown with a full CARP demote (near-zero-packet failover) instead of a 1–3 s detection gap.
- Dedicated pfsync NIC keeps replication off the data plane.
- Tight latency profile when the network is reliable.
Without a UPS, a hard power loss results in:
- TCP sessions: still survive (pfsync replicated state in real time).
- UDP packets in flight: lost (no retransmission semantics).
- Total user-visible outage: 1.5–3 s depending on latency profile.
Split-brain handling
If the pfsync link fails but both nodes stay up, both may temporarily think they are MASTER (a “split brain”). When the link reconnects:
- The `ClusterReplicator` on each side detects the conflict on the next snapshot push: the peer responds with `409 Conflict` because it also believes it is master. The conflict is logged and a `cluster_failover_events` row is recorded with cause `split_brain_detected`.
- The kernel CARP election resolves the role, not the application layer. With `net.inet.carp.preempt=1` (set by `apply_ha_rules`), the node with the lower effective advskew wins, and the other node observes the new advertisements and demotes itself automatically.
- Whichever node ends up as BACKUP after the kernel election accepts the surviving master’s next snapshot push, replacing any local edits made during the partition. Those edits remain visible in the audit log for forensics.
There is no application-layer node-id tiebreaker; the design relies on CARP’s
deterministic timer comparison plus preempt. If both nodes happen to be
configured with identical advskew (a misconfiguration), CARP itself does not
deterministically resolve and operators must manually demote one node via
aifw cluster demote until the misconfiguration is corrected.
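After the partition heals, the recorded events can be inspected straight from the local SQLite DB; a sketch (the DB path is hypothetical; the table and cause names are from above):

```sh
# DB path is an assumption; adjust to your install.
sqlite3 /var/db/aifw/aifw.db \
  "SELECT * FROM cluster_failover_events
   WHERE cause = 'split_brain_detected'
   ORDER BY rowid DESC LIMIT 5;"
```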
Operations
Planned maintenance / rolling upgrade
Always upgrade the standby first.
```sh
# On the standby
aifw update install --restart
```
`aifw update install --restart` runs `service X restart` for each managed service. The rc.d stop function for `aifw_daemon`, `aifw_api`, and `aifw_ids` includes a prelude (added in #220) that sets `net.inet.carp.demotion=240` and sleeps one second before killing the service, so the peer takes over CARP master before the local data plane drops.
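Conceptually, each stop prelude amounts to the following (a simplified sketch, not the actual rc.d source):

```sh
# Simplified sketch of the per-service stop prelude from #220.
sysctl net.inet.carp.demotion=240   # peer sees demoted advertisements and preempts
sleep 1                             # give the peer time to take over CARP master
# ...the normal rc.d stop logic then kills the service
```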
After the standby is healthy on the new version, fail over manually if needed and repeat on the (now) standby:
```sh
aifw cluster demote             # on the current master: hands MASTER to the peer
aifw update install --restart   # on the (now) standby: upgrades the second node
```
Confirm version drift is gone via aifw cluster nodes list (or the dashboard’s
per-node panel — the software_version field shows the running version of each
node).
Manual promote / demote
```sh
aifw cluster demote    # this node becomes BACKUP (sysctl carp.demotion=240)
aifw cluster promote   # this node becomes MASTER (sysctl carp.demotion=0)
```
Demote the current master before promoting the standby to avoid a brief split-brain window.
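For example, a controlled failover driven from a management host (hostnames are placeholders):

```sh
# Order matters: demote the current master first to avoid a split-brain window.
ssh node-a.example.com aifw cluster demote
ssh node-b.example.com aifw cluster promote
```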
Decommission a node
```sh
aifw cluster nodes remove <node-id>
```
The remaining node continues as a standalone (Standalone role). Obtain the
<node-id> from aifw cluster nodes list.
Force a config sync
```sh
aifw cluster sync    # this node pulls the current snapshot from the primary
```
The dashboard’s Force sync from peer button does the same thing. Use this
when the standby’s cluster_snapshot_state.last_applied_hash doesn’t match the
master’s live config hash and the next replicator tick is too far away.
Show cluster status
```sh
aifw cluster status
aifw cluster status --json
```
Verifying
```sh
# On either node: exits 0 when healthy, exits 1 with a reason on failure
aifw cluster verify

# Machine-readable output (used by the harness)
aifw cluster verify --json | python3 -m json.tool
```
The verify command checks:
- pf `state-policy floating` is set.
- The `pfsync0` interface is UP.
- At least one CARP VIP is configured (a `carp:` line in `ifconfig`).
- `/api/v1/cluster/status` reports `peer_reachable: true`.
- A config snapshot hash is present (replication is not stalled).
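Two of those checks are easy to reproduce by hand when debugging a verify failure:

```sh
# pfsync0 should report UP in its flags
ifconfig pfsync0
# At least one CARP VIP: ifconfig prints a "carp:" line per CARP address
ifconfig | grep 'carp:'
```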
Pair-wide check via SSH
```sh
sh scripts/ha-verify.sh node-a.example.com node-b.example.com
```
The harness asserts:
- Both nodes report `aifw cluster verify --json` with `ok: true`.
- Exactly one node reports the MASTER role in the status block.
Exit codes:
| Code | Meaning |
|---|---|
| 0 | Pair healthy |
| 1 | A node was unreachable or aifw cluster verify returned a non-zero exit |
| 2 | At least one node failed its local checks (ok=false) |
| 3 | Expected exactly 1 MASTER, got a different count (0 or 2) |
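Because the exit codes are distinct, the harness slots into cron-style monitoring cleanly; for example:

```sh
# Log any unhealthy pair state to syslog; $? carries the specific exit code.
sh scripts/ha-verify.sh node-a.example.com node-b.example.com \
  || logger -t aifw-ha "HA pair check failed with exit code $?"
```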
Out of scope (this release)
- Active-active stateful pf — different architecture entirely; not planned.
- N > 2 node clusters — two-node pairs only.
- WAN-side / multi-site / geographic HA — not supported.
- Out-of-band heartbeat daemon — the
latency_profileschema fields exist onpfsync_configand theAggressiveprofile is documented, but no daemon process consumes the heartbeat yet. Aggressive is reserved for a future release. - NUT (Network UPS Tools) integration — strongly recommended in this doc, but
not built into AiFw. Configure NUT separately and point its shutdown hook at
aifw cluster demote && shutdown -p now.
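If you wire up NUT yourself, the hook fits in upsmon's SHUTDOWNCMD; a sketch (the upsmon.conf and aifw binary paths are assumptions for a typical install):

```
# /usr/local/etc/nut/upsmon.conf (excerpt)
SHUTDOWNCMD "/usr/local/bin/aifw cluster demote && /sbin/shutdown -p now"
```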
See also
- Features overview →
- Comparison with pfSense / OPNsense →
- Auth & RBAC → — RBAC perms required for cluster operations