---
title: "Cluster and High Availability"
description: "Multi-node cluster configuration, state sync, failover modes, heartbeat monitoring, and zone-aware deployment for Aurora Enterprise."
icon: "server"
---
Overview
Enterprise cluster support provides a control-plane layer for managing multi-node Aurora deployments. Nodes communicate via a gossip protocol (powered by memberlist) to replicate state — auth keys, model metadata, guardrails, workflows, and pricing — so every node can serve any request identically.
Each node registers with a unique identity, region, zone, and heartbeat status, giving operators full visibility into cluster health.
Architecture
In production, a load balancer sits in front of all Aurora nodes. Clients point to a single URL; the LB distributes traffic across healthy nodes. Because state is replicated via gossip, no sticky sessions or shared storage are required.
          ┌─────────────┐
          │    Load     │  ← Clients point here (single URL)
          │  Balancer   │
          └──────┬──────┘
     ┌───────────┼───────────┐
     │           │           │
┌────▼────┐  ┌───▼────┐  ┌───▼────┐
│ Node A  │  │ Node B │  │ Node C │
│         │  │        │  │        │
│ SQLite  │  │ SQLite │  │ SQLite │
└────┬────┘  └───┬────┘  └───┬────┘
     │           │           │
     └───gossip──┼───gossip──┘
        sync     │    sync
                 │
           ┌─────▼──────┐
           │  Provider  │
           │    API     │  ← OpenAI, Anthropic, etc.
           └────────────┘
State Sync
When a change is made on any node, it is broadcast to all peers via the gossip protocol. The receiving nodes apply the change to their local storage immediately.
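As a rough illustration of the broadcast-and-apply flow described above, the sketch below models three nodes in memory. It is not Aurora's implementation: the `Node` class, its `broadcast` method, and the direct delivery to every peer are hypothetical simplifications. A real gossip layer such as memberlist spreads messages over the network rather than calling each peer directly.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory stand-in for an Aurora node and its local store."""
    name: str
    store: dict = field(default_factory=dict)
    peers: list = field(default_factory=list)

    def apply(self, key: str, value: str) -> None:
        # Write the change to this node's local storage.
        self.store[key] = value

    def broadcast(self, key: str, value: str) -> None:
        # Apply locally first, then deliver the same change to every peer.
        self.apply(key, value)
        for peer in self.peers:
            peer.apply(key, value)


a, b, c = Node("node-a"), Node("node-b"), Node("node-c")
for node in (a, b, c):
    node.peers = [p for p in (a, b, c) if p is not node]

# A change made on node B ends up in the local store of A and C as well.
b.broadcast("guardrail:pii", "enabled")
```

After the broadcast, all three stores hold the same entry, which is the property that lets any node serve any request.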
Configuration
cluster:
  enabled: true
  node_id: "enterprise-node-1"
  node_name: "enterprise-node-1"
  region: "us-east"
  zone: "us-east-a"
  bind_addr: "0.0.0.0"
  bind_port: 7946
  sync_bind_addr: "0.0.0.0"
  sync_bind_port: 7947
  seed_nodes: "enterprise-node-1:7946"
  advertise_url: "https://aurora-enterprise.example.com"
  heartbeat_interval_seconds: 30
  failover_mode: "active_passive"
Environment Variables
CLUSTER_ENABLED=true
CLUSTER_NODE_ID=enterprise-node-1
CLUSTER_NODE_NAME=enterprise-node-1
CLUSTER_REGION=us-east
CLUSTER_ZONE=us-east-a
CLUSTER_BIND_ADDR=0.0.0.0
CLUSTER_BIND_PORT=7946
CLUSTER_SYNC_BIND_ADDR=0.0.0.0
CLUSTER_SYNC_BIND_PORT=7947
CLUSTER_SEED_NODES=enterprise-node-1:7946
CLUSTER_ADVERTISE_URL=https://aurora-enterprise.example.com
CLUSTER_HEARTBEAT_INTERVAL_SECONDS=30
CLUSTER_FAILOVER_MODE=active_passive
Settings Reference
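The YAML keys and the environment variables listed above appear to follow a one-to-one naming pattern: `CLUSTER_` plus the upper-cased key. The sketch below encodes that pattern. The rule is inferred from the two listings rather than taken from a documented specification, and the helper names `cluster_env_name` and `cluster_env` are hypothetical.

```python
def cluster_env_name(key: str) -> str:
    """Map a key under `cluster:` to its environment variable name."""
    return "CLUSTER_" + key.upper()


def cluster_env(config: dict) -> dict:
    """Render a cluster config mapping as environment variable assignments."""
    env = {}
    for key, value in config.items():
        # Booleans are written lowercase (true/false), as in the listing above.
        if isinstance(value, bool):
            value = str(value).lower()
        env[cluster_env_name(key)] = str(value)
    return env


example = cluster_env(
    {"enabled": True, "bind_port": 7946, "failover_mode": "active_passive"}
)
```

A helper like this can be handy when generating container environment blocks from a single YAML source, provided the inferred pattern holds for every setting you use.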
Single Node Mode
Aurora works fully without cluster mode. Set cluster.enabled: false (or omit it) and all core features — auth keys, providers, models, routing, guardrails, workflows, pricing, admin UI — function identically. The cluster status endpoint simply returns {"enabled": false} and the dashboard hides cluster controls.
Single-node is suitable for development, testing, and single-server production deployments. No shared storage or gossip coordination is required.
Failover Modes
Single Node
No failover. Each node operates independently with its own local SQLite database. Cluster sync remains available when cluster mode is enabled with a single node.
cluster:
  failover_mode: "single_node"
Active-Passive
One active node handles all traffic. Standby nodes monitor health and take over on failure. State is replicated via gossip — no shared database required.
cluster:
  failover_mode: "active_passive"
- Primary node runs at full replica count
- Secondary nodes run with reduced or zero capacity
- DNS or load balancer failover redirects traffic on primary failure
- Each node has its own local SQLite, kept in sync via gossip
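The takeover behaviour in the list above can be sketched as a selection over node health. This is a minimal illustration under stated assumptions, not Aurora's failover logic: the `NodeStatus` type, the `priority` field, and the `pick_active` function are hypothetical, and the source does not say how a standby is chosen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeStatus:
    node_id: str
    online: bool
    priority: int  # hypothetical: lower value = preferred as the active node


def pick_active(nodes: list) -> Optional[str]:
    """Return the ID of the node that should receive traffic, or None."""
    healthy = [n for n in nodes if n.online]
    if not healthy:
        return None
    # The healthy node with the lowest priority value becomes active.
    return min(healthy, key=lambda n: n.priority).node_id


primary = NodeStatus("enterprise-node-1", online=True, priority=0)
standby = NodeStatus("enterprise-node-2", online=True, priority=1)

before_failure = pick_active([primary, standby])
primary.online = False  # simulate the primary going down
after_failure = pick_active([primary, standby])
```

While the primary is healthy it is selected; once it is marked offline, the standby is returned instead. In a real deployment this decision would sit in the DNS or load balancer failover layer.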
Active-Active
All nodes handle traffic simultaneously. Suitable for multi-region deployments that require zero downtime during region failures.
cluster:
  failover_mode: "active_active"
- Each region runs independent Aurora instances
- Regional load balancers route traffic to the nearest Aurora
- State sync replicates across regions via gossip
- Each node has its own local SQLite, no shared storage needed
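The regional routing described above can be approximated as "prefer a healthy node in the client's region, otherwise fall back to another region". The sketch below is a simplification under that assumption: `route_request` and the `nodes_by_region` structure are hypothetical, and a real load balancer would typically pick the nearest region by latency or geography rather than by sorted name.

```python
from typing import Optional


def route_request(client_region: str, nodes_by_region: dict) -> Optional[str]:
    """Pick a healthy node, preferring the client's own region."""
    # Try the local region first, then every other region in sorted order.
    others = sorted(r for r in nodes_by_region if r != client_region)
    for region in [client_region] + others:
        for node_id, online in nodes_by_region.get(region, []):
            if online:
                return node_id
    return None


# Each region maps to a list of (node_id, online) pairs.
nodes = {
    "us-east": [("use-1", False)],  # the only us-east node is down
    "eu-west": [("euw-1", True)],
}

local_hit = route_request("eu-west", nodes)
cross_region = route_request("us-east", nodes)
```

Because every node holds a replicated copy of the state, the cross-region fallback can serve the request without any shared storage.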
Node Heartbeat
Each node sends a heartbeat signal at the configured interval. The cluster service tracks:
- Node online/offline status
- Last heartbeat timestamp
- Region and zone metadata
- Advertised URL and sync address
Heartbeat data is visible from the admin dashboard and available for health monitoring and alerting.
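For alerting, online/offline status can be derived from the last heartbeat timestamp and the configured interval. The sketch below assumes a node counts as offline after three missed heartbeats. That threshold and the `is_online` helper are hypothetical, since the source does not state how Aurora decides a node is offline.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL_SECONDS = 30  # matches heartbeat_interval_seconds in the example config
MISSED_BEATS_BEFORE_OFFLINE = 3  # hypothetical threshold, not an Aurora setting


def is_online(last_heartbeat: datetime, now: datetime) -> bool:
    """Treat a node as online if its last heartbeat is recent enough."""
    allowed = timedelta(
        seconds=HEARTBEAT_INTERVAL_SECONDS * MISSED_BEATS_BEFORE_OFFLINE
    )
    return now - last_heartbeat <= allowed


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = is_online(now - timedelta(seconds=45), now)   # inside the 90 s window
stale = is_online(now - timedelta(seconds=120), now)  # outside the 90 s window
```

Allowing a few missed beats before flagging a node avoids alerts caused by a single delayed heartbeat.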
Dashboard
The cluster page at /admin/dashboard/cluster provides:
- Real-time node list with leader badge and health indicators
- Cluster status, node count, and failover mode
- Per-node region, zone, sync address, and last heartbeat timestamp
- A settings section under Settings → Routing for read-only cluster metadata
The useClusterStatus() and useClusterNodes() hooks power the UI with 30-second auto-refresh.
Stateless Design
Aurora stores no session state in-process. State is persisted to local storage (SQLite by default) and replicated between nodes via the gossip protocol. This means:
- Nodes can be added or removed without coordination
- Rolling updates are zero-downtime
- Horizontal scaling is unrestricted
- Any node can serve any request
- No shared database is required — each node runs its own local SQLite
Related Topics
- Sizing and Redundancy — Capacity planning and database HA
- Cross-Region Deployment — Multi-region patterns with cluster integration
- Security Hardening — Production security for cluster nodes