---
title: "Cluster and High Availability"
description: "Multi-node cluster configuration, state sync, failover modes, heartbeat monitoring, and zone-aware deployment for Aurora Enterprise."
icon: "server"
---

Overview

Enterprise cluster support provides a control-plane layer for managing multi-node Aurora deployments. Nodes communicate via a gossip protocol (powered by memberlist) to replicate state (auth keys, model metadata, guardrails, workflows, and pricing) so every node can serve any request identically.

Each node registers with a unique identity, region, zone, and heartbeat status, giving operators full visibility into cluster health.

Architecture

In production, a load balancer sits in front of all Aurora nodes. Clients point to a single URL; the LB distributes traffic across healthy nodes. Because state is replicated via gossip, no sticky sessions or shared storage are required.

code

                    ┌─────────────┐
                    │  Load       │  ← Clients point here (single URL)
                    │  Balancer   │
                    └──────┬──────┘
               ┌───────────┼───────────┐
               │           │           │
          ┌────▼────┐ ┌───▼────┐ ┌───▼────┐
          │ Node A  │ │ Node B │ │ Node C │
          │         │ │        │ │        │
          │ SQLite  │ │ SQLite │ │ SQLite │
          └────┬────┘ └───┬────┘ └───┬────┘
               │           │           │
               └───gossip──┼───gossip──┘
                    sync   │   sync
                           │
                     ┌─────▼──────┐
                     │  Provider  │
                     │  API       │  ← OpenAI, Anthropic, etc.
                     └────────────┘

State Sync

When a change is made on any node, it is broadcast to all peers via the gossip protocol. The receiving nodes apply the change to their local storage immediately.

State	Replicated	How
Auth keys	✅ Create, deactivate, update	Broadcast on every change
Model registry	✅ Full state on join	Fetched via HTTP sync endpoint
Guardrails	✅ Full state on join + deltas	Broadcast on update
Workflows	✅ Full state on join + deltas	Broadcast on update
Pricing	✅ Full state on join + deltas	Broadcast on update
Node membership	✅ Continuous	Gossip health probes
Leader election	✅ Continuous	Deterministic (lowest node ID)
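
The "lowest node ID" rule in the last row can be illustrated with a short sketch. This is not Aurora's implementation: the `Node` type and `elect_leader` function are hypothetical names, and comparing IDs as plain strings is an assumption, since the source does not say how IDs are ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical view of a cluster member; only the fields needed here.
    node_id: str
    online: bool = True


def elect_leader(nodes):
    """Return the online node with the lowest node_id, or None if none are online."""
    candidates = [n for n in nodes if n.online]
    if not candidates:
        return None
    # Assumption: IDs are compared as plain strings (lexicographic order).
    return min(candidates, key=lambda n: n.node_id)


members = [
    Node("enterprise-node-2"),
    Node("enterprise-node-1"),
    Node("enterprise-node-3"),
]
leader = elect_leader(members)
print(leader.node_id)  # enterprise-node-1
```

Because every node applies the same rule to the same membership list, each one arrives at the same leader without exchanging votes. Under string comparison, `enterprise-node-10` would sort before `enterprise-node-2`, so zero-padded IDs may be worth considering if the real ordering behaves this way.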

Configuration

yaml

cluster:
  enabled: true
  node_id: "enterprise-node-1"
  node_name: "enterprise-node-1"
  region: "us-east"
  zone: "us-east-a"
  bind_addr: "0.0.0.0"
  bind_port: 7946
  sync_bind_addr: "0.0.0.0"
  sync_bind_port: 7947
  seed_nodes: "enterprise-node-1:7946"
  advertise_url: "https://aurora-enterprise.example.com"
  heartbeat_interval_seconds: 30
  failover_mode: "active_passive"

Environment Variables

env

CLUSTER_ENABLED=true
CLUSTER_NODE_ID=enterprise-node-1
CLUSTER_NODE_NAME=enterprise-node-1
CLUSTER_REGION=us-east
CLUSTER_ZONE=us-east-a
CLUSTER_BIND_ADDR=0.0.0.0
CLUSTER_BIND_PORT=7946
CLUSTER_SYNC_BIND_ADDR=0.0.0.0
CLUSTER_SYNC_BIND_PORT=7947
CLUSTER_SEED_NODES=enterprise-node-1:7946
CLUSTER_ADVERTISE_URL=https://aurora-enterprise.example.com
CLUSTER_HEARTBEAT_INTERVAL_SECONDS=30
CLUSTER_FAILOVER_MODE=active_passive

Settings Reference

Setting	Type	Default	Description
`enabled`	bool	`false`	Enable cluster control-plane
`node_id`	string	`"local"`	Unique identifier for this node
`node_name`	string	`"local"`	Human-readable node name
`region`	string	(none)	Cloud region (e.g., `us-east`, `eu-west`)
`zone`	string	(none)	Availability zone within region
`bind_addr`	string	(none)	Gossip protocol bind address
`bind_port`	int	`7946`	Gossip protocol bind port
`sync_bind_addr`	string	(none)	State sync HTTP server bind address
`sync_bind_port`	int	`7947`	State sync HTTP server bind port
`seed_nodes`	string	(none)	Comma-separated `host:port` of seed nodes for cluster join
`advertise_url`	string	(none)	Publicly reachable URL for this node
`heartbeat_interval_seconds`	int	`30`	Seconds between heartbeat signals
`failover_mode`	string	`"single_node"`	`single_node`, `active_passive`, or `active_active`
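
To make the defaults and the `seed_nodes` format concrete, here is a minimal sketch of reading the `CLUSTER_*` variables into a typed structure. `ClusterConfig`, `parse_seed_nodes`, and `load_cluster_config` are hypothetical names and not part of Aurora. Treating only the string `true` as enabled and rejecting unknown failover modes are assumptions of this sketch.

```python
import os
from dataclasses import dataclass, field

FAILOVER_MODES = {"single_node", "active_passive", "active_active"}


@dataclass
class ClusterConfig:
    # Defaults mirror the Settings Reference table; "" stands in for "no default".
    enabled: bool = False
    node_id: str = "local"
    node_name: str = "local"
    region: str = ""
    zone: str = ""
    bind_addr: str = ""
    bind_port: int = 7946
    sync_bind_addr: str = ""
    sync_bind_port: int = 7947
    seed_nodes: list = field(default_factory=list)
    advertise_url: str = ""
    heartbeat_interval_seconds: int = 30
    failover_mode: str = "single_node"


def parse_seed_nodes(raw):
    """Split a comma-separated host:port string into (host, port) tuples."""
    seeds = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"seed node must be host:port, got {entry!r}")
        seeds.append((host, int(port)))
    return seeds


def load_cluster_config(env=None):
    """Build a ClusterConfig from a mapping of environment variables."""
    env = os.environ if env is None else env
    d = ClusterConfig()  # documented defaults
    mode = env.get("CLUSTER_FAILOVER_MODE", d.failover_mode)
    if mode not in FAILOVER_MODES:  # assumption: unknown modes are an error
        raise ValueError(f"unknown failover mode: {mode!r}")
    return ClusterConfig(
        # Assumption: only the literal string "true" enables clustering.
        enabled=env.get("CLUSTER_ENABLED", "false").strip().lower() == "true",
        node_id=env.get("CLUSTER_NODE_ID", d.node_id),
        node_name=env.get("CLUSTER_NODE_NAME", d.node_name),
        region=env.get("CLUSTER_REGION", d.region),
        zone=env.get("CLUSTER_ZONE", d.zone),
        bind_addr=env.get("CLUSTER_BIND_ADDR", d.bind_addr),
        bind_port=int(env.get("CLUSTER_BIND_PORT", d.bind_port)),
        sync_bind_addr=env.get("CLUSTER_SYNC_BIND_ADDR", d.sync_bind_addr),
        sync_bind_port=int(env.get("CLUSTER_SYNC_BIND_PORT", d.sync_bind_port)),
        seed_nodes=parse_seed_nodes(env.get("CLUSTER_SEED_NODES", "")),
        advertise_url=env.get("CLUSTER_ADVERTISE_URL", d.advertise_url),
        heartbeat_interval_seconds=int(
            env.get("CLUSTER_HEARTBEAT_INTERVAL_SECONDS", d.heartbeat_interval_seconds)
        ),
        failover_mode=mode,
    )


cfg = load_cluster_config({
    "CLUSTER_ENABLED": "true",
    "CLUSTER_SEED_NODES": "enterprise-node-1:7946, enterprise-node-2:7946",
    "CLUSTER_FAILOVER_MODE": "active_passive",
})
print(cfg.seed_nodes)
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the loader easy to exercise in tests.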

Single Node Mode

Aurora works fully without cluster mode. Set `cluster.enabled: false` (or omit it) and all core features (auth keys, providers, models, routing, guardrails, workflows, pricing, admin UI) function identically. The cluster status endpoint returns `{"enabled": false}` and the dashboard hides cluster controls.

Single-node is suitable for development, testing, and single-server production deployments. No shared storage or gossip coordination is required.

Failover Modes

Single Node

No failover. The node operates on its own with a local SQLite database. If clustering is enabled with only one node, cluster sync remains available.

yaml

cluster:
  failover_mode: "single_node"

Active-Passive

One active node handles all traffic. Standby nodes monitor health and take over on failure. State is replicated via gossip, so no shared database is required.

yaml

cluster:
  failover_mode: "active_passive"

Primary node runs at full replica count
Secondary nodes run with reduced or zero capacity
DNS or load balancer failover redirects traffic on primary failure
Each node has its own local SQLite, kept in sync via gossip
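
The traffic redirection described above happens in DNS or the load balancer, outside Aurora. As a rough illustration of the decision that layer makes, the sketch below picks the first healthy node from a priority-ordered list. The function name `pick_active`, the idea of a priority list, and the `healthy` set are assumptions for illustration only.

```python
def pick_active(priority, healthy):
    """Return the first node in priority order that passes health checks.

    priority: node IDs, primary first, then standbys (hypothetical ordering).
    healthy:  set of node IDs currently passing health checks.
    """
    for node_id in priority:
        if node_id in healthy:
            return node_id
    return None  # no healthy node: nothing to route to


priority = ["enterprise-node-1", "enterprise-node-2", "enterprise-node-3"]

# Normal operation: the primary takes all traffic.
print(pick_active(priority, {"enterprise-node-1", "enterprise-node-2"}))

# Primary failure: traffic moves to the first healthy standby.
print(pick_active(priority, {"enterprise-node-2", "enterprise-node-3"}))
```

Since state is already replicated via gossip, the standby that takes over can serve requests from its own local SQLite without a data handoff.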

Active-Active

All nodes handle traffic simultaneously. Suitable for multi-region deployments that require zero downtime across region failures.

yaml

cluster:
  failover_mode: "active_active"

Each region runs independent Aurora instances
Regional load balancers route traffic to the nearest Aurora
State sync replicates across regions via gossip
Each node has its own local SQLite; no shared storage is needed

Node Heartbeat

Each node sends a heartbeat signal at the configured interval. The cluster service tracks:

Node online/offline status
Last heartbeat timestamp
Region and zone metadata
Advertised URL and sync address

Heartbeat data is visible from the admin dashboard and available for health monitoring and alerting.
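
For alerting, a monitor needs some rule that turns the last heartbeat timestamp into an online/offline status. The source does not state Aurora's timeout, so the sketch below assumes a node counts as offline after missing three consecutive intervals. `is_online` and `missed_allowed` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def is_online(last_heartbeat, now, interval_seconds=30, missed_allowed=3):
    """Treat a node as online if its last heartbeat is recent enough.

    Assumption: a node is offline once more than `missed_allowed`
    intervals have passed without a heartbeat.
    """
    limit = timedelta(seconds=interval_seconds * missed_allowed)
    return now - last_heartbeat <= limit


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = now - timedelta(seconds=45)   # one interval late
stale = now - timedelta(seconds=120)  # past the 90-second limit

print(is_online(fresh, now))  # True
print(is_online(stale, now))  # False
```

Allowing a few missed intervals avoids flagging a node as offline because of a single delayed heartbeat.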

Dashboard

The cluster page at /admin/dashboard/cluster provides:

Real-time node list with leader badge and health indicators
Cluster status, node count, and failover mode
Per-node region, zone, sync address, and last heartbeat timestamp
A settings section under Settings → Routing for read-only cluster metadata

The useClusterStatus() and useClusterNodes() hooks power the UI with 30-second auto-refresh.

Stateless Design

Aurora stores no session state in-process. State is persisted to local storage (SQLite by default) and replicated between nodes via the gossip protocol. This means:

Nodes can be added or removed without coordination
Rolling updates are zero-downtime
Horizontal scaling is unrestricted
Any node can serve any request
No shared database is required: each node runs its own local SQLite
