Production Sizing Guide#

Document Information#

Field	Value
Version	1.0
Last Updated	2026-02
Owner	Platform Team

Overview#

This guide provides sizing recommendations for the AI Control Plane Platform based on expected workload characteristics. Sizes are categorized into Small, Medium, Large, and Enterprise tiers.

1. Workload Profiles#

1.1 Profile Definitions#

Profile	Requests/sec	Concurrent Users	Daily Tokens	Use Case
Small	1-10	50-100	< 10M	Dev/staging, small teams
Medium	10-100	100-500	10M-100M	Mid-size org, multiple teams
Large	100-500	500-2000	100M-1B	Enterprise, high throughput
Enterprise	500+	2000+	1B+	Multi-region, mission-critical

1.2 Request Characteristics#

Metric	Typical Range	Planning Factor
Avg input tokens	500-2000	Use 1500
Avg output tokens	200-1000	Use 500
Streaming ratio	60-80%	Use 70%
Cache hit rate	10-30%	Use 15%
Peak/avg ratio	2-5x	Use 3x

2. Component Sizing#

2.1 LiteLLM Proxy#

LiteLLM is CPU and memory bound. Scale horizontally for throughput.

Size	Replicas	CPU (request/limit)	Memory (request/limit)	Max RPS
Small	2	500m / 1000m	512Mi / 1Gi	50
Medium	3	1000m / 2000m	1Gi / 2Gi	200
Large	5	2000m / 4000m	2Gi / 4Gi	500
Enterprise	10+	2000m / 4000m	4Gi / 8Gi	1000+

HPA Configuration:

kindmet  spec

href="#__codelineno-0-1">apiVersion: autoscaling/v2 class="p">: HorizontalPodAutoscaler adata: name: litellm class="p">: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: litellm minReplicas: 2 # Small: 2, Medium: 3, Large: 5 maxReplicas: 20 # 4x min for burst metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 70 - type: Resource resource: name: memory target: type: Utilization averageUtilization: 80 behavior: scaleDown: stabilizationWindowSeconds: 300 policies: - type: Percent value: 25 periodSeconds: 60 scaleUp: stabilizationWindowSeconds: 0 policies: - type: Percent value: 100 periodSeconds: 15

2.2 Agent Gateway#

Agent Gateway (Rust) is highly efficient. Fewer replicas needed.

Size	Replicas	CPU (request/limit)	Memory (request/limit)	Max RPS
Small	2	250m / 500m	256Mi / 512Mi	500
Medium	2	500m / 1000m	512Mi / 1Gi	2000
Large	3	1000m / 2000m	1Gi / 2Gi	5000
Enterprise	5+	2000m / 4000m	2Gi / 4Gi	10000+

2.3 vLLM (Self-Hosted Inference)#

GPU-bound. Size based on model and throughput needs.

Model	GPU Type	GPU Count	Memory	Throughput (tok/s)
Llama-3.1-8B	A10G	1	24GB	2000-3000
Llama-3.1-8B	A100-40GB	1	40GB	4000-5000
Llama-3.1-70B	A100-40GB	2 (TP)	80GB	500-800
Llama-3.1-70B	A100-80GB	2 (TP)	160GB	1000-1500
Llama-3.1-70B	H100	2 (TP)	160GB	2000-3000

Production Stack Configuration:

# vLLM Production Stack Helm values
replicaCount:
  router: 2
  engine:
    min: 1
    max: 4

engine:
  resources:
    requests:
      nvidia.com/gpu: 1
      memory: "32Gi"
      cpu: "8"
    limits:
      nvidia.com/gpu: 1
      memory: "48Gi"
      cpu: "16"

  modelSpec:
    - name: "llama-3.1-8b"
      repository: "meta-llama/Llama-3.1-8B-Instruct"
      tensorParallelSize: 1
      maxModelLen: 32768

  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 4
    metrics:
      - type: prometheus
        prometheus:
          query: vllm:num_requests_running
          threshold: 10

2.4 PostgreSQL#

Size	Type	vCPU	Memory	Storage	IOPS
Small	Single	2	4GB	50GB SSD	3000
Medium	Primary + Replica	4	8GB	100GB SSD	6000
Large	Primary + 2 Replicas	8	16GB	500GB SSD	12000
Enterprise	HA Cluster (3 nodes)	16	32GB	1TB NVMe	20000+

Connection Pool Settings:

# PgBouncer configuration
pool_mode: transaction
default_pool_size: 20        # Small
# default_pool_size: 50      # Medium
# default_pool_size: 100     # Large
max_client_conn: 1000
reserve_pool_size: 5

2.5 Redis#

Size	Type	Memory	Persistence
Small	Single	1GB	RDB every 5m
Medium	Primary + Replica	4GB	RDB + AOF
Large	Cluster (3 primary)	8GB per node	RDB + AOF
Enterprise	Cluster (6 nodes)	16GB per node	AOF always

2.6 Vault#

Size	Type	CPU	Memory	Storage
Small	Single (dev)	500m	512Mi	1GB
Medium	HA (3 nodes)	1000m	1Gi	10GB
Large	HA (5 nodes)	2000m	2Gi	50GB
Enterprise	HA + DR	4000m	4Gi	100GB

2.7 Observability Stack#

Component	Small	Medium	Large	Enterprise
OTel Collector	1 x 500m/512Mi	2 x 1000m/1Gi	3 x 2000m/2Gi	5 x 4000m/4Gi
Prometheus	1 x 1Gi/4Gi	1 x 2Gi/8Gi	2 x 4Gi/16Gi	HA + Thanos
Grafana	1 x 250m/512Mi	2 x 500m/1Gi	2 x 1000m/2Gi	3 x 2000m/4Gi
Jaeger	Single	All-in-one	Distributed	Elastic backend

3. Infrastructure Requirements#

3.1 Kubernetes Cluster Sizing#

Size	Control Plane	Worker Nodes (CPU)	Worker Nodes (GPU)
Small	Managed (3 node)	3 x m5.xlarge	1 x g4dn.xlarge
Medium	Managed (3 node)	5 x m5.2xlarge	2 x g4dn.2xlarge
Large	Managed (5 node)	10 x m5.4xlarge	4 x g5.4xlarge
Enterprise	Dedicated (5 node)	20 x m5.8xlarge	8 x p4d.24xlarge

3.2 Network Requirements#

Size	Ingress Bandwidth	Internal Bandwidth	VPC Endpoints
Small	100 Mbps	1 Gbps	Optional
Medium	500 Mbps	10 Gbps	Recommended
Large	1 Gbps	25 Gbps	Required
Enterprise	10 Gbps	100 Gbps	Required

3.3 Storage Requirements#

Component	Small	Medium	Large	Enterprise
PostgreSQL	50 GB	200 GB	1 TB	5 TB
Redis	5 GB	20 GB	50 GB	200 GB
Prometheus	50 GB	200 GB	1 TB	5 TB
Model Cache	100 GB	500 GB	2 TB	10 TB

4. Cost Estimation (AWS, us-east-1)#

4.1 Monthly Infrastructure Cost#

Component	Small	Medium	Large	Enterprise
EKS Control Plane	$73	$73	$146	$146
EC2 Workers (CPU)	$300	$800	$3,200	$12,800
EC2 Workers (GPU)	$380	$1,520	$6,000	$48,000
RDS PostgreSQL	$50	$200	$800	$3,200
ElastiCache Redis	$50	$200	$600	$2,400
EBS Storage	$50	$200	$800	$4,000
Data Transfer	$50	$200	$1,000	$5,000
Total	~$950	~$3,200	~$12,500	~$75,000

4.2 LLM API Cost (assuming 30% external)#

Size	Daily Tokens	External (30%)	Monthly Cost
Small	10M	3M	~$150
Medium	100M	30M	~$1,500
Large	1B	300M	~$15,000
Enterprise	10B	3B	~$150,000

5. Capacity Planning#

5.1 Growth Model#

Required_Capacity = Current_Load × Growth_Factor × Peak_Factor × Safety_Margin

Where:
- Growth_Factor = (1 + monthly_growth_rate)^months_ahead
- Peak_Factor = typically 2-3x average
- Safety_Margin = 1.2-1.5x

5.2 Scaling Triggers#

Metric	Warning	Critical	Action
CPU utilization	>70%	>85%	Scale horizontally
Memory utilization	>75%	>90%	Scale vertically first
Request latency (P95)	>2s	>5s	Investigate bottleneck
Error rate	>1%	>5%	Incident response
Queue depth	>100	>500	Add capacity
GPU utilization	>80%	>95%	Add GPU nodes

5.3 Quarterly Review Checklist#

[ ] Review actual vs projected usage
[ ] Analyze cost per request trends
[ ] Evaluate cache effectiveness
[ ] Check for hotspots/bottlenecks
[ ] Update capacity projections
[ ] Plan infrastructure changes

Appendix A: Instance Type Reference#

AWS Instance Types#

Use Case	Instance Type	vCPU	Memory	Network	Notes
Gateway	m5.xlarge	4	16GB	Up to 10 Gbps	Balanced
Gateway	c5.2xlarge	8	16GB	Up to 10 Gbps	CPU optimized
Database	r5.2xlarge	8	64GB	Up to 10 Gbps	Memory optimized
GPU (Small)	g4dn.xlarge	4	16GB + T4	Up to 25 Gbps	Cost effective
GPU (Medium)	g5.4xlarge	16	64GB + A10G	Up to 25 Gbps	Good balance
GPU (Large)	p4d.24xlarge	96	1152GB + 8xA100	400 Gbps	High performance

GCP Instance Types#

Use Case	Instance Type	vCPU	Memory	GPU	Notes
Gateway	n2-standard-4	4	16GB	-	Standard
Gateway	c2-standard-8	8	32GB	-	Compute optimized
Database	n2-highmem-8	8	64GB	-	Memory optimized
GPU	a2-highgpu-1g	12	85GB	1xA100	Single GPU
GPU	a2-highgpu-4g	48	340GB	4xA100	Multi-GPU