Skip to content

Production Sizing Guide#

Document Information#

Field Value
Version 1.0
Last Updated 2026-02
Owner Platform Team

Overview#

This guide provides sizing recommendations for the AI Control Plane Platform based on expected workload characteristics. Sizes are categorized into Small, Medium, Large, and Enterprise tiers.


1. Workload Profiles#

1.1 Profile Definitions#

Profile Requests/sec Concurrent Users Daily Tokens Use Case
Small 1-10 50-100 < 10M Dev/staging, small teams
Medium 10-100 100-500 10M-100M Mid-size org, multiple teams
Large 100-500 500-2000 100M-1B Enterprise, high throughput
Enterprise 500+ 2000+ 1B+ Multi-region, mission-critical

1.2 Request Characteristics#

Metric Typical Range Planning Factor
Avg input tokens 500-2000 Use 1500
Avg output tokens 200-1000 Use 500
Streaming ratio 60-80% Use 70%
Cache hit rate 10-30% Use 15%
Peak/avg ratio 2-5x Use 3x

2. Component Sizing#

2.1 LiteLLM Proxy#

LiteLLM is CPU and memory bound. Scale horizontally for throughput.

Size Replicas CPU (request/limit) Memory (request/limit) Max RPS
Small 2 500m / 1000m 512Mi / 1Gi 50
Medium 3 1000m / 2000m 1Gi / 2Gi 200
Large 5 2000m / 4000m 2Gi / 4Gi 500
Enterprise 10+ 2000m / 4000m 4Gi / 8Gi 1000+

HPA Configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm
  minReplicas: 2   # Small: 2, Medium: 3, Large: 5
  maxReplicas: 20  # 4x min for burst
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15

2.2 Agent Gateway#

Agent Gateway (Rust) is highly efficient. Fewer replicas needed.

Size Replicas CPU (request/limit) Memory (request/limit) Max RPS
Small 2 250m / 500m 256Mi / 512Mi 500
Medium 2 500m / 1000m 512Mi / 1Gi 2000
Large 3 1000m / 2000m 1Gi / 2Gi 5000
Enterprise 5+ 2000m / 4000m 2Gi / 4Gi 10000+

2.3 vLLM (Self-Hosted Inference)#

GPU-bound. Size based on model and throughput needs.

Model GPU Type GPU Count Memory Throughput (tok/s)
Llama-3.1-8B A10G 1 24GB 2000-3000
Llama-3.1-8B A100-40GB 1 40GB 4000-5000
Llama-3.1-70B A100-40GB 2 (TP) 80GB 500-800
Llama-3.1-70B A100-80GB 2 (TP) 160GB 1000-1500
Llama-3.1-70B H100 2 (TP) 160GB 2000-3000

Production Stack Configuration:

# vLLM Production Stack Helm values
replicaCount:
  router: 2
  engine:
    min: 1
    max: 4

engine:
  resources:
    requests:
      nvidia.com/gpu: 1
      memory: "32Gi"
      cpu: "8"
    limits:
      nvidia.com/gpu: 1
      memory: "48Gi"
      cpu: "16"

  modelSpec:
    - name: "llama-3.1-8b"
      repository: "meta-llama/Llama-3.1-8B-Instruct"
      tensorParallelSize: 1
      maxModelLen: 32768

  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 4
    metrics:
      - type: prometheus
        prometheus:
          query: vllm:num_requests_running
          threshold: 10

2.4 PostgreSQL#

Size Type vCPU Memory Storage IOPS
Small Single 2 4GB 50GB SSD 3000
Medium Primary + Replica 4 8GB 100GB SSD 6000
Large Primary + 2 Replicas 8 16GB 500GB SSD 12000
Enterprise HA Cluster (3 nodes) 16 32GB 1TB NVMe 20000+

Connection Pool Settings:

# PgBouncer configuration
pool_mode: transaction
default_pool_size: 20        # Small
# default_pool_size: 50      # Medium
# default_pool_size: 100     # Large
max_client_conn: 1000
reserve_pool_size: 5

2.5 Redis#

Size Type Memory Persistence
Small Single 1GB RDB every 5m
Medium Primary + Replica 4GB RDB + AOF
Large Cluster (3 primary) 8GB per node RDB + AOF
Enterprise Cluster (6 nodes) 16GB per node AOF always

2.6 Vault#

Size Type CPU Memory Storage
Small Single (dev) 500m 512Mi 1GB
Medium HA (3 nodes) 1000m 1Gi 10GB
Large HA (5 nodes) 2000m 2Gi 50GB
Enterprise HA + DR 4000m 4Gi 100GB

2.7 Observability Stack#

Component Small Medium Large Enterprise
OTel Collector 1 x 500m/512Mi 2 x 1000m/1Gi 3 x 2000m/2Gi 5 x 4000m/4Gi
Prometheus 1 x 1Gi/4Gi 1 x 2Gi/8Gi 2 x 4Gi/16Gi HA + Thanos
Grafana 1 x 250m/512Mi 2 x 500m/1Gi 2 x 1000m/2Gi 3 x 2000m/4Gi
Jaeger Single All-in-one Distributed Elastic backend

3. Infrastructure Requirements#

3.1 Kubernetes Cluster Sizing#

Size Control Plane Worker Nodes (CPU) Worker Nodes (GPU)
Small Managed (3 node) 3 x m5.xlarge 1 x g4dn.xlarge
Medium Managed (3 node) 5 x m5.2xlarge 2 x g4dn.2xlarge
Large Managed (5 node) 10 x m5.4xlarge 4 x g5.4xlarge
Enterprise Dedicated (5 node) 20 x m5.8xlarge 8 x p4d.24xlarge

3.2 Network Requirements#

Size Ingress Bandwidth Internal Bandwidth VPC Endpoints
Small 100 Mbps 1 Gbps Optional
Medium 500 Mbps 10 Gbps Recommended
Large 1 Gbps 25 Gbps Required
Enterprise 10 Gbps 100 Gbps Required

3.3 Storage Requirements#

Component Small Medium Large Enterprise
PostgreSQL 50 GB 200 GB 1 TB 5 TB
Redis 5 GB 20 GB 50 GB 200 GB
Prometheus 50 GB 200 GB 1 TB 5 TB
Model Cache 100 GB 500 GB 2 TB 10 TB

4. Cost Estimation (AWS, us-east-1)#

4.1 Monthly Infrastructure Cost#

Component Small Medium Large Enterprise
EKS Control Plane $73 $73 $146 $146
EC2 Workers (CPU) $300 $800 $3,200 $12,800
EC2 Workers (GPU) $380 $1,520 $6,000 $48,000
RDS PostgreSQL $50 $200 $800 $3,200
ElastiCache Redis $50 $200 $600 $2,400
EBS Storage $50 $200 $800 $4,000
Data Transfer $50 $200 $1,000 $5,000
Total ~$950 ~$3,200 ~$12,500 ~$75,000

4.2 LLM API Cost (assuming 30% external)#

Size Daily Tokens External (30%) Monthly Cost
Small 10M 3M ~$150
Medium 100M 30M ~$1,500
Large 1B 300M ~$15,000
Enterprise 10B 3B ~$150,000

5. Capacity Planning#

5.1 Growth Model#

Required_Capacity = Current_Load × Growth_Factor × Peak_Factor × Safety_Margin

Where:
- Growth_Factor = (1 + monthly_growth_rate)^months_ahead
- Peak_Factor = typically 2-3x average
- Safety_Margin = 1.2-1.5x

5.2 Scaling Triggers#

Metric Warning Critical Action
CPU utilization >70% >85% Scale horizontally
Memory utilization >75% >90% Scale vertically first
Request latency (P95) >2s >5s Investigate bottleneck
Error rate >1% >5% Incident response
Queue depth >100 >500 Add capacity
GPU utilization >80% >95% Add GPU nodes

5.3 Quarterly Review Checklist#

  • [ ] Review actual vs projected usage
  • [ ] Analyze cost per request trends
  • [ ] Evaluate cache effectiveness
  • [ ] Check for hotspots/bottlenecks
  • [ ] Update capacity projections
  • [ ] Plan infrastructure changes

Appendix A: Instance Type Reference#

AWS Instance Types#

Use Case Instance Type vCPU Memory Network Notes
Gateway m5.xlarge 4 16GB Up to 10 Gbps Balanced
Gateway c5.2xlarge 8 16GB Up to 10 Gbps CPU optimized
Database r5.2xlarge 8 64GB Up to 10 Gbps Memory optimized
GPU (Small) g4dn.xlarge 4 16GB + T4 Up to 25 Gbps Cost effective
GPU (Medium) g5.4xlarge 16 64GB + A10G Up to 25 Gbps Good balance
GPU (Large) p4d.24xlarge 96 1152GB + 8xA100 400 Gbps High performance

GCP Instance Types#

Use Case Instance Type vCPU Memory GPU Notes
Gateway n2-standard-4 4 16GB - Standard
Gateway c2-standard-8 8 32GB - Compute optimized
Database n2-highmem-8 8 64GB - Memory optimized
GPU a2-highgpu-1g 12 85GB 1xA100 Single GPU
GPU a2-highgpu-4g 48 340GB 4xA100 Multi-GPU