# High Availability Patterns
| Field | Value |
|-------|-------|
| Version | 1.0 |
| Last Updated | 2026-02 |
| Owner | Platform Team |
## Overview
This document describes high availability (HA) patterns for the AI Control Plane Platform, covering single-region HA, multi-region deployment, and disaster recovery strategies.
## 1. Availability Targets
| Tier | Availability | Monthly Downtime | RPO | RTO |
|------|--------------|------------------|-----|-----|
| Standard | 99.9% | ~43 min | 1 hour | 4 hours |
| High | 99.95% | ~22 min | 15 min | 1 hour |
| Critical | 99.99% | ~4 min | 5 min | 15 min |
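The downtime budgets follow directly from the availability figure: a 30-day month has 43,200 minutes, so 99.9% leaves 43,200 × 0.001 ≈ 43 minutes of allowable downtime, 99.95% leaves ≈ 22 minutes, and 99.99% leaves ≈ 4.3 minutes.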
## 2. Single-Region HA Architecture
```
                  ┌─────────────────────────────────────────┐
                  │           Load Balancer (NLB)           │
                  │        (Multi-AZ, Health Checks)        │
                  └─────────────────────────────────────────┘
                                       │
          ┌────────────────────────────┼────────────────────────────┐
          │                            │                            │
┌─────────▼─────────┐       ┌──────────▼──────────┐       ┌─────────▼─────────┐
│       AZ-1        │       │        AZ-2         │       │       AZ-3        │
│  ┌─────────────┐  │       │   ┌─────────────┐   │       │  ┌─────────────┐  │
│  │   LiteLLM   │  │       │   │   LiteLLM   │   │       │  │   LiteLLM   │  │
│  │  (replica)  │  │       │   │  (replica)  │   │       │  │  (replica)  │  │
│  └─────────────┘  │       │   └─────────────┘   │       │  └─────────────┘  │
│  ┌─────────────┐  │       │   ┌─────────────┐   │       │  ┌─────────────┐  │
│  │Agent Gateway│  │       │   │Agent Gateway│   │       │  │Agent Gateway│  │
│  │  (replica)  │  │       │   │  (replica)  │   │       │  │  (replica)  │  │
│  └─────────────┘  │       │   └─────────────┘   │       │  └─────────────┘  │
│  ┌─────────────┐  │       │   ┌─────────────┐   │       │  ┌─────────────┐  │
│  │    vLLM     │  │       │   │    vLLM     │   │       │  │    vLLM     │  │
│  │  (engine)   │  │       │   │  (engine)   │   │       │  │  (engine)   │  │
│  └─────────────┘  │       │   └─────────────┘   │       │  └─────────────┘  │
│  ┌─────────────┐  │       │   ┌─────────────┐   │       │  ┌─────────────┐  │
│  │ PostgreSQL  │  │       │   │ PostgreSQL  │   │       │  │ PostgreSQL  │  │
│  │  (primary)  │◄─┼───────┼───┤  (replica)  │◄──┼───────┼──┤  (replica)  │  │
│  └─────────────┘  │ sync  │   └─────────────┘   │ sync  │  └─────────────┘  │
│  ┌─────────────┐  │       │   ┌─────────────┐   │       │  ┌─────────────┐  │
│  │    Redis    │  │       │   │    Redis    │   │       │  │    Redis    │  │
│  │  (primary)  │◄─┼───────┼───┤  (replica)  │◄──┼───────┼──┤  (replica)  │  │
│  └─────────────┘  │       │   └─────────────┘   │       │  └─────────────┘  │
└───────────────────┘       └─────────────────────┘       └───────────────────┘
```
### 2.1 Component Distribution
```yaml
# Pod anti-affinity to spread across AZs
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: litellm
              topologyKey: topology.kubernetes.io/zone
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: litellm
```
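A quick way to confirm the spread after rollout is to compare each pod's node against the node's zone label. A minimal sketch, assuming the `ai-gateway` namespace used elsewhere in this document:

```bash
# Show which node each litellm pod landed on
kubectl get pods -n ai-gateway -l app=litellm \
  -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName'

# Show the zone label for every node, to confirm one pod per zone
kubectl get nodes -L topology.kubernetes.io/zone
```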
### 2.2 Pod Disruption Budgets
```yaml
# Ensure minimum availability during updates/node maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: litellm-pdb
spec:
  minAvailable: 2  # At least 2 pods always running
  selector:
    matchLabels:
      app: litellm
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
spec:
  maxUnavailable: 1  # Allow 1 pod to be disrupted
  selector:
    matchLabels:
      app: postgresql
```
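During node maintenance it is worth confirming the budgets are actually being honored; `kubectl get pdb` reports how many voluntary disruptions are currently allowed:

```bash
# ALLOWED DISRUPTIONS of 0 means a voluntary eviction would currently block
kubectl get pdb -n ai-gateway
```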
### 2.3 Health Checks
```yaml
# Comprehensive health checks for LiteLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
spec:
  template:
    spec:
      containers:
        - name: litellm
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
          startupProbe:
            httpGet:
              path: /health
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 30  # 2.5 min max startup (30 × 5s)
```
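The probe endpoints can be exercised by hand before tuning thresholds. A minimal sketch, assuming port-forward access to a running pod:

```bash
# Forward the proxy port locally, then hit each probe endpoint
kubectl port-forward -n ai-gateway deploy/litellm 4000:4000 &
PF_PID=$!
sleep 2
curl -sf http://localhost:4000/health/liveliness && echo "liveness OK"
curl -sf http://localhost:4000/health/readiness && echo "readiness OK"
kill $PF_PID
```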
## 3. Multi-Region Architecture
```
                        ┌─────────────────────────────┐
                        │    Global DNS (Route53)     │
                        │    Latency-based routing    │
                        └─────────────────────────────┘
                                       │
          ┌────────────────────────────┼────────────────────────────┐
          │                            │                            │
┌─────────▼─────────┐       ┌──────────▼──────────┐       ┌─────────▼─────────┐
│     us-east-1     │       │      us-west-2      │       │     eu-west-1     │
│     (PRIMARY)     │       │     (SECONDARY)     │       │    (SECONDARY)    │
│                   │       │                     │       │                   │
│  ┌─────────────┐  │       │   ┌─────────────┐   │       │  ┌─────────────┐  │
│  │ AI Control  │  │       │   │ AI Control  │   │       │  │ AI Control  │  │
│  │ Plane Stack │  │       │   │ Plane Stack │   │       │  │ Plane Stack │  │
│  └─────────────┘  │       │   └─────────────┘   │       │  └─────────────┘  │
│                   │       │                     │       │                   │
│  ┌─────────────┐  │       │   ┌─────────────┐   │       │  ┌─────────────┐  │
│  │ PostgreSQL  │  │       │   │ PostgreSQL  │   │       │  │ PostgreSQL  │  │
│  │  (writer)   ├──┼─async─┼──►│ (read-only) │   │       │  │ (read-only) │  │
│  └──────┬──────┘  │       │   └─────────────┘   │       │  └──────▲──────┘  │
│         │         │       │                     │       │         │         │
└─────────┼─────────┘       └─────────────────────┘       └─────────┼─────────┘
          │                                                         │
          └──────────── async cross-region replication ────────────┘

                        ┌─────────────────────────────┐
                        │      Vault Enterprise       │
                        │ (Performance Replication)   │
                        └─────────────────────────────┘
```
### 3.1 DNS Configuration (Route53)
```hcl
# Terraform - Global DNS with failover
resource "aws_route53_health_check" "primary" {
  fqdn              = "gateway.us-east-1.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = {
    Name = "ai-gateway-primary-health"
  }
}

resource "aws_route53_record" "gateway" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "gateway.example.com"
  type    = "A"

  latency_routing_policy {
    region = "us-east-1"
  }

  set_identifier = "primary"

  alias {
    name                   = aws_lb.us_east_1.dns_name
    zone_id                = aws_lb.us_east_1.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "gateway_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "gateway.example.com"
  type    = "A"

  latency_routing_policy {
    region = "us-west-2"
  }

  set_identifier = "secondary"

  alias {
    name                   = aws_lb.us_west_2.dns_name
    zone_id                = aws_lb.us_west_2.zone_id
    evaluate_target_health = true
  }
}
```
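After applying, the records and health-check status can be sanity-checked from the CLI. A sketch, where `$ZONE_ID` and `$HEALTH_CHECK_ID` are placeholders for the hosted zone and health check IDs Terraform created:

```bash
# Inspect the routing records behind gateway.example.com
aws route53 list-resource-record-sets \
  --hosted-zone-id "$ZONE_ID" \
  --query "ResourceRecordSets[?Name=='gateway.example.com.']"

# Confirm the primary health check is reporting success
aws route53 get-health-check-status --health-check-id "$HEALTH_CHECK_ID"
```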
### 3.2 Cross-Region Database Replication
```yaml
# PostgreSQL cross-region replication using CloudNativePG.
# This manifest is the replica Cluster deployed in us-west-2, streaming from
# the primary in us-east-1. Note: `replica.source` must name an entry in
# `externalClusters`, and a replica cluster bootstraps via `recovery`, not
# `initdb`. Hostnames and bucket names are illustrative.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-secondary
  namespace: ai-gateway
spec:
  instances: 3
  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "2GB"
      wal_level: "logical"
      max_wal_senders: "10"
      max_replication_slots: "10"
  bootstrap:
    recovery:
      source: postgres-primary-us-east-1
  storage:
    size: 500Gi
    storageClass: gp3
  replica:
    enabled: true
    source: postgres-primary-us-east-1
  externalClusters:
    - name: postgres-primary-us-east-1
      connectionParameters:
        host: postgres.us-east-1.internal
        user: replication
        dbname: litellm
      barmanObjectStore:
        destinationPath: s3://postgres-backup-us-east-1/
        s3Credentials:
          accessKeyId:
            name: aws-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: aws-creds
            key: SECRET_ACCESS_KEY
```
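Replication health can then be watched from either side. A sketch using the CloudNativePG kubectl plugin plus a direct lag query (pod names follow the CNPG `<cluster>-<n>` convention and are assumptions here):

```bash
# Cluster-level view (requires the cnpg kubectl plugin)
kubectl cnpg status postgres-secondary -n ai-gateway

# Replication lag in seconds, measured on a replica instance
kubectl exec -n ai-gateway postgres-secondary-1 -- psql -t -c \
  "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());"
```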
## 4. Failover Procedures

### 4.1 Automatic Failover (Component Level)
| Component | Failover Mechanism | Detection Time | Recovery Time |
|-----------|--------------------|----------------|---------------|
| LiteLLM | Kubernetes + LoadBalancer | 10-30s | < 1 min |
| Agent Gateway | Kubernetes + LoadBalancer | 10-30s | < 1 min |
| vLLM | KEDA + Router health checks | 30-60s | 1-5 min |
| PostgreSQL | Patroni/CloudNativePG | 30s | < 1 min |
| Redis | Sentinel/Cluster | 5-30s | < 1 min |
| Vault | Raft consensus | 10-30s | < 1 min |
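Before relying on any of these mechanisms, it helps to inspect the underlying cluster state by hand. A sketch for PostgreSQL and Redis (the Sentinel pod name and the `mymaster` master name are common defaults, not values confirmed by this document):

```bash
# Patroni: show the current leader, replicas, and replication lag
kubectl exec -n ai-gateway postgres-0 -- patronictl list

# Redis Sentinel: confirm the address of the current master
kubectl exec -n ai-gateway redis-sentinel-0 -- \
  redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
```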
### 4.2 Manual Regional Failover
```bash
#!/bin/bash
# failover-to-secondary.sh
# Execute regional failover when primary region is degraded

set -euo pipefail

PRIMARY_REGION="us-east-1"
SECONDARY_REGION="us-west-2"
# Required environment (the script aborts if unset, since set -u is active)
ZONE_ID="${ZONE_ID:?Set ZONE_ID to the Route53 hosted zone ID}"
PRIMARY_ALB_ZONE="${PRIMARY_ALB_ZONE:?Set to the primary ALB hosted zone ID}"
PRIMARY_ALB_DNS="${PRIMARY_ALB_DNS:?Set to the primary ALB DNS name}"

echo "=== Regional Failover: $PRIMARY_REGION -> $SECONDARY_REGION ==="

# Step 1: Verify secondary region health
echo "Checking secondary region health..."
HEALTH=$(curl -sf https://gateway.$SECONDARY_REGION.internal/health || echo "FAILED")
if [[ "$HEALTH" != *"healthy"* ]]; then
  echo "ERROR: Secondary region unhealthy. Aborting."
  exit 1
fi

# Step 2: Promote secondary database
echo "Promoting secondary PostgreSQL..."
kubectl --context $SECONDARY_REGION exec -n ai-gateway postgres-0 -- \
  patronictl switchover --master postgres-0 --candidate postgres-1 --force

# Step 3: Update DNS routing
# NOTE: this UPSERT replaces the latency-based record for the "primary"
# set identifier with a weighted record at weight 0, draining the primary.
echo "Updating DNS routing..."
aws route53 change-resource-record-sets \
  --hosted-zone-id $ZONE_ID \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "gateway.example.com",
        "Type": "A",
        "SetIdentifier": "primary",
        "Weight": 0,
        "AliasTarget": {
          "HostedZoneId": "'$PRIMARY_ALB_ZONE'",
          "DNSName": "'$PRIMARY_ALB_DNS'",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

# Step 4: Scale up secondary
echo "Scaling up secondary region..."
kubectl --context $SECONDARY_REGION scale deployment litellm --replicas=5 -n ai-gateway
kubectl --context $SECONDARY_REGION scale deployment agentgateway --replicas=3 -n ai-gateway

# Step 5: Verify
echo "Verifying failover..."
sleep 30
curl -sf https://gateway.example.com/health

echo "=== Failover Complete ==="
echo "Primary traffic now routed to $SECONDARY_REGION"
echo ""
echo "POST-FAILOVER ACTIONS:"
echo "1. Investigate primary region issue"
echo "2. Monitor error rates and latency"
echo "3. Plan failback when primary recovers"
```
### 4.3 Failback Procedure
```bash
#!/bin/bash
# failback-to-primary.sh

set -euo pipefail

PRIMARY_REGION="us-east-1"
SECONDARY_REGION="us-west-2"
ZONE_ID="${ZONE_ID:?Set ZONE_ID to the Route53 hosted zone ID}"
PRIMARY_ALB_ZONE="${PRIMARY_ALB_ZONE:?Set to the primary ALB hosted zone ID}"
PRIMARY_ALB_DNS="${PRIMARY_ALB_DNS:?Set to the primary ALB DNS name}"

# Set the routing weight on the primary record (weighted after failover)
set_primary_weight() {
  local WEIGHT=$1
  aws route53 change-resource-record-sets \
    --hosted-zone-id $ZONE_ID \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "gateway.example.com",
          "Type": "A",
          "SetIdentifier": "primary",
          "Weight": '$WEIGHT',
          "AliasTarget": {
            "HostedZoneId": "'$PRIMARY_ALB_ZONE'",
            "DNSName": "'$PRIMARY_ALB_DNS'",
            "EvaluateTargetHealth": true
          }
        }
      }]
    }'
}

echo "=== Regional Failback: $SECONDARY_REGION -> $PRIMARY_REGION ==="

# Step 1: Verify primary region recovered
echo "Verifying primary region health..."
HEALTH=$(curl -sf https://gateway.$PRIMARY_REGION.internal/health || echo "FAILED")
if [[ "$HEALTH" != *"healthy"* ]]; then
  echo "ERROR: Primary region not ready. Aborting."
  exit 1
fi

# Step 2: Sync data back to primary (if needed)
echo "Checking data sync status..."
kubectl --context $PRIMARY_REGION exec -n ai-gateway postgres-0 -- \
  psql -c "SELECT pg_last_wal_replay_lsn();"

# Step 3: Gradual traffic shift (canary)
echo "Starting canary traffic shift (10%)..."
set_primary_weight 10

echo "Monitoring for 5 minutes..."
sleep 300

# Check error rates (jq falls back to 0 when the query returns no series;
# --data-urlencode avoids curl mangling the braces in the PromQL)
ERROR_RATE=$(curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m])' \
  | jq -r '.data.result[0].value[1] // "0"')
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
  echo "ERROR: High error rate detected. Aborting failback."
  exit 1
fi

# Step 4: Increase traffic gradually
for WEIGHT in 25 50 75 100; do
  echo "Shifting to $WEIGHT% traffic..."
  set_primary_weight $WEIGHT
  sleep 300  # 5 min between shifts
done

echo "=== Failback Complete ==="
```
## 5. Disaster Recovery

### 5.1 Backup Strategy
| Data Type | Backup Frequency | Retention | Storage |
|-----------|------------------|-----------|---------|
| PostgreSQL (full) | Daily | 30 days | S3 Cross-Region |
| PostgreSQL (WAL) | Continuous | 7 days | S3 Cross-Region |
| Vault (snapshots) | Hourly | 7 days | S3 Cross-Region |
| Redis (RDB) | Hourly | 24 hours | S3 Same-Region |
| Configuration | On change | Unlimited | Git |
| Secrets metadata | Daily | 30 days | Vault backup |
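The Vault entry, for instance, corresponds to a Raft snapshot shipped to S3. A minimal hourly job body, assuming an authenticated `vault` CLI and the DR bucket used by the restore script below:

```bash
# Take a Raft snapshot and ship it to the cross-region backup bucket
SNAP="vault-$(date +%Y%m%d-%H%M%S).snap"
vault operator raft snapshot save "/tmp/$SNAP"
aws s3 cp "/tmp/$SNAP" "s3://ai-gateway-backups-dr/vault/$SNAP"
rm -f "/tmp/$SNAP"
```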
### 5.2 Recovery Procedures
```bash
#!/bin/bash
# disaster-recovery.sh
# Full DR restore from backup

set -euo pipefail

DR_REGION="eu-west-1"
BACKUP_BUCKET="s3://ai-gateway-backups-dr"
RESTORE_TIMESTAMP=${1:-"latest"}

echo "=== Disaster Recovery to $DR_REGION ==="

# Step 1: Bootstrap Kubernetes cluster
echo "Ensuring EKS cluster is ready..."
aws eks update-kubeconfig --region $DR_REGION --name ai-gateway-dr

# Step 2: Deploy base infrastructure
echo "Deploying infrastructure..."
kubectl apply -k kubernetes/overlays/dr/

# Step 3: Restore Vault
echo "Restoring Vault..."
VAULT_SNAPSHOT=$(aws s3 ls $BACKUP_BUCKET/vault/ | tail -1 | awk '{print $4}')
aws s3 cp $BACKUP_BUCKET/vault/$VAULT_SNAPSHOT /tmp/vault-snapshot.snap
# Copy the snapshot into the pod before restoring; -force is required
# because the snapshot was taken on a different cluster
kubectl cp /tmp/vault-snapshot.snap ai-gateway/vault-0:/tmp/vault-snapshot.snap
kubectl exec -n ai-gateway vault-0 -- \
  vault operator raft snapshot restore -force /tmp/vault-snapshot.snap

# Step 4: Restore PostgreSQL
echo "Restoring PostgreSQL..."
kubectl apply -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-dr
  namespace: ai-gateway
spec:
  instances: 3
  bootstrap:
    recovery:
      source: postgres-backup
  externalClusters:
    - name: postgres-backup
      barmanObjectStore:
        destinationPath: $BACKUP_BUCKET/postgres/
        s3Credentials:
          accessKeyId:
            name: aws-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: aws-creds
            key: SECRET_ACCESS_KEY
        wal:
          maxParallel: 8
EOF

# Step 5: Wait for services
echo "Waiting for services to be ready..."
kubectl wait --for=condition=ready pod -l app=litellm -n ai-gateway --timeout=600s
kubectl wait --for=condition=ready pod -l app=postgres -n ai-gateway --timeout=600s

# Step 6: Verify
echo "Running verification tests..."
./scripts/test-all.sh

echo "=== DR Recovery Complete ==="
echo "Update DNS to point to DR region when ready"
```
### 5.3 RTO/RPO Verification
```bash
#!/bin/bash
# dr-drill.sh
# Run quarterly DR drill

set -euo pipefail

echo "=== DR Drill Started: $(date) ==="
START_TIME=$(date +%s)

# Simulate primary failure
echo "Simulating primary region failure..."

# Execute failover
./failover-to-secondary.sh
FAILOVER_TIME=$(date +%s)
RTO=$((FAILOVER_TIME - START_TIME))

# Verify data: compare the newest row visible in each region to estimate RPO
echo "Verifying data integrity..."
LAST_TX_PRIMARY=$(kubectl --context us-east-1 exec -n ai-gateway postgres-0 -- \
  psql -t -c "SELECT max(created_at) FROM cost_tracking_daily;")
LAST_TX_SECONDARY=$(kubectl --context us-west-2 exec -n ai-gateway postgres-0 -- \
  psql -t -c "SELECT max(created_at) FROM cost_tracking_daily;")
RPO_SECONDS=$(($(date -d "$LAST_TX_PRIMARY" +%s) - $(date -d "$LAST_TX_SECONDARY" +%s)))

echo ""
echo "=== DR Drill Results ==="
echo "RTO: ${RTO}s (target: 900s)"
echo "RPO: ${RPO_SECONDS}s (target: 300s)"
echo ""

if [ $RTO -le 900 ] && [ $RPO_SECONDS -le 300 ]; then
  echo "✓ DR drill PASSED"
else
  echo "✗ DR drill FAILED - review and remediate"
fi

# Failback
read -p "Execute failback? (yes/no) " CONFIRM
if [ "$CONFIRM" = "yes" ]; then
  ./failback-to-primary.sh
fi
```
## 6. Monitoring for HA

### 6.1 Key Availability Metrics
```yaml
# Prometheus alerting rules for HA
groups:
  - name: high-availability
    rules:
      - alert: InsufficientReplicas
        expr: |
          kube_deployment_status_replicas_available{deployment=~"litellm|agentgateway"}
            < kube_deployment_spec_replicas{deployment=~"litellm|agentgateway"} * 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.deployment }} has < 50% replicas available"

      # Counts the number of *distinct zones* a workload's pods run in.
      # The zone label name depends on how kube-state-metrics exposes node
      # labels (often label_topology_kubernetes_io_zone); adjust to match.
      - alert: SingleAZDeployment
        expr: |
          count by (created_by_name) (
            group by (created_by_name, label_topology_kubernetes_io_zone) (
              kube_pod_info{pod=~"litellm.*|agentgateway.*"}
              * on (node) group_left(label_topology_kubernetes_io_zone)
              kube_node_labels
            )
          ) < 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.created_by_name }} running in single AZ"

      - alert: DatabaseReplicationLag
        expr: pg_replication_lag_seconds > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL replication lag > 30s"

      - alert: RegionHealthDegraded
        expr: |
          sum by (region) (up{job="ai-gateway"})
            / count by (region) (up{job="ai-gateway"}) < 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Region {{ $labels.region }} health < 80%"
```
## Appendix: HA Checklist

### Pre-Production Checklist
- [ ] Multi-AZ deployment verified
- [ ] Pod anti-affinity configured
- [ ] PodDisruptionBudgets in place
- [ ] Health checks tuned and tested
- [ ] Database replication verified
- [ ] Redis HA configured
- [ ] Vault HA configured
- [ ] Load balancer health checks configured
- [ ] DNS failover tested
- [ ] Backup/restore tested
- [ ] Runbooks documented
- [ ] On-call rotation established
- [ ] DR drill scheduled