API Integration Guide#

Scutum exposes an OpenAI-compatible LLM endpoint on port 4000. Any application that works with the OpenAI API can connect by changing the base URL and API key — no other code changes required.

Authentication#

All requests to the Scutum proxy require a Bearer token in the Authorization header. Use SCUTUM_API_KEY from your config/.env file, or a per-team or per-user scoped key created from the Admin Console:

Authorization: Bearer $SCUTUM_API_KEY

For production, generate per-user or per-team API keys through the Admin UI (http://<your-host>:5173) or the Admin API (http://localhost:8086).

Base URL#

The Scutum proxy listens on port 4000 of whichever host you deployed to. Examples:

| Environment | Base URL |
| --- | --- |
| Local | http://localhost:4000 |
| Your VM | http://<your-host>:4000 (or behind your TLS terminator) |
| Admin UI | http://<your-host>:5173 |

All OpenAI-compatible endpoints live under /v1/:

- POST /v1/chat/completions -- chat completions (streaming and non-streaming)
- GET /v1/models -- list available models
- POST /v1/embeddings -- text embeddings
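A quick way to verify connectivity is to list models through the same client configuration used in the examples below (a minimal sketch with the OpenAI Python SDK; it assumes SCUTUM_API_KEY is set in your environment):

import os

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key=os.environ["SCUTUM_API_KEY"],
)

# Calls GET /v1/models under the hood
for model in client.models.list():
    print(model.id)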

Code Examples#

curl#

curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $SCUTUM_API_KEY" \
  -d '{
    "model": "claude-sonnet-4.5",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

Python (OpenAI SDK)#

import os

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key=os.environ["SCUTUM_API_KEY"],
)

response = client.chat.completions.create(
    model="claude-sonnet-4.5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)

TypeScript (OpenAI SDK)#

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000/v1",
  apiKey: process.env.SCUTUM_API_KEY,
});

const response = await client.chat.completions.create({
  model: "claude-sonnet-4.5",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" },
  ],
  max_tokens: 256,
  temperature: 0.7,
});

console.log(response.choices[0].message.content);

Go (OpenAI SDK)#

package main

import (
    "context"
    "fmt"
    "os"

    openai "github.com/sashabaranov/go-openai"
)

func main() {
    config := openai.DefaultConfig(os.Getenv("SCUTUM_API_KEY"))
    config.BaseURL = "http://localhost:4000/v1"

    client := openai.NewClientWithConfig(config)

    resp, err := client.CreateChatCompletion(
        context.Background(),
        openai.ChatCompletionRequest{
            Model: "claude-sonnet-4.5",
            Messages: []openai.ChatCompletionMessage{
                {Role: "system", Content: "You are a helpful assistant."},
                {Role: "user", Content: "What is the capital of France?"},
            },
            MaxTokens:   256,
            Temperature: 0.7,
        },
    )
    if err != nil {
        panic(err)
    }

    fmt.Println(resp.Choices[0].Message.Content)
}

Model Aliases#

Instead of specifying a provider-specific model, you can use group aliases that route across multiple providers automatically:

| Alias | Routes to | Best for |
| --- | --- | --- |
| fast | gpt-5-mini, claude-haiku-4.5, gemini-3-flash, grok-3-mini | Low-latency responses |
| smart | gpt-5, claude-sonnet-4.5, gemini-3-pro, grok-4 | Balanced quality & speed |
| powerful | gpt-5.2, claude-opus-4.5, o3-pro, grok-4-heavy | Maximum capability |
| reasoning | o3, o3-pro, deepseek-r1 | Complex reasoning tasks |
| coding | claude-sonnet-4.5, deepseek-coder, codellama | Code generation & review |
| cost-effective | gpt-5-mini, claude-haiku-4.5, gemini-2.5-flash-lite, deepseek-v3 | Budget-friendly |

# Use a group alias -- the gateway picks the best available model
response = client.chat.completions.create(
    model="fast",
    messages=[{"role": "user", "content": "Summarize this text..."}],
)

Provider-specific aliases are also available: openai, anthropic, google, xai, deepseek, bedrock, vertex, azure.
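Assuming provider aliases are passed as the model name in the same way as group aliases above, a request can be pinned to a provider without naming a specific model:

# Route to whichever Anthropic model the gateway selects
response = client.chat.completions.create(
    model="anthropic",
    messages=[{"role": "user", "content": "Hello"}],
)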

Streaming#

Enable streaming by setting stream: true. The gateway returns Server-Sent Events (SSE) in the standard OpenAI format:
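On the wire, each event is a data: line carrying a chat.completion.chunk JSON object, and the stream ends with a data: [DONE] sentinel (abridged illustration of the standard OpenAI chunk format):

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"}}]}

data: [DONE]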

Python Streaming#

stream = client.chat.completions.create(
    model="claude-sonnet-4.5",
    messages=[{"role": "user", "content": "Write a short poem about APIs."}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

TypeScript Streaming#

const stream = await client.chat.completions.create({
  model: "claude-sonnet-4.5",
  messages: [{ role: "user", content: "Write a short poem." }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}

Embeddings#

Generate text embeddings through the same gateway:

response = client.embeddings.create(
    model="azure-text-embedding-3-large",
    input="The quick brown fox jumps over the lazy dog",
)

vector = response.data[0].embedding
print(f"Embedding dimension: {len(vector)}")

Available embedding models include azure-text-embedding-3-large and azure-text-embedding-3-small. These require AZURE_API_KEY and AZURE_API_BASE to be set in config/.env.
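Embedding vectors are typically compared with cosine similarity. A minimal sketch using only the standard library (reusing the client and the Azure model alias from above):

import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

docs = ["The quick brown fox", "A fast auburn fox"]
data = client.embeddings.create(
    model="azure-text-embedding-3-large",
    input=docs,
).data
print(cosine_similarity(data[0].embedding, data[1].embedding))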

Error Handling#

The gateway returns standard HTTP error codes with JSON bodies:

| Status | Meaning | Common cause |
| --- | --- | --- |
| 400 | Bad Request | Invalid JSON or missing required fields |
| 401 | Unauthorized | Missing or invalid API key |
| 404 | Not Found | Unknown model name |
| 429 | Rate Limited | Too many requests (RPM/TPM exceeded) |
| 500 | Internal Server Error | Upstream provider failure |
| 503 | Service Unavailable | All providers in fallback chain failed |

Python Error Handling#

import os

from openai import OpenAI, APIConnectionError, APIStatusError, RateLimitError

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key=os.environ["SCUTUM_API_KEY"],
)

try:
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": "Hello"}],
    )
except RateLimitError:
    # Subclass of APIStatusError, so it must be caught first
    print("Rate limited -- back off and retry")
except APIConnectionError:
    print("Could not connect to the gateway")
except APIStatusError as e:
    print(f"API error {e.status_code}: {e.message}")

Rate Limits#

The gateway enforces rate limits at multiple levels:

- Global: 1000 requests/minute (configurable in Admin UI > Settings)
- Per key: 100 RPM and 100,000 TPM by default (configurable per API key)
- Per model: Provider-specific limits are respected automatically

When rate limited, the response includes a Retry-After header. The OpenAI SDKs handle this automatically with exponential backoff.
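If you want more or fewer retries than the SDK default, both official SDKs accept a retry setting when constructing the client (Python sketch; the retry count here is an arbitrary example):

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key=os.environ["SCUTUM_API_KEY"],
    max_retries=5,  # SDK retries 429/5xx responses with exponential backoff
)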

Fallback Chains#

When a primary model is unavailable or returns an error, the gateway automatically tries fallback models. For example, a request to gpt-5 will fall back to gpt-5.2, then claude-opus-4.5, then grok-4.

Fallback happens transparently -- your application receives a response without needing to implement retry logic. The response metadata includes the actual model used:

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Hello"}],
)

# Check which model actually served the request
print(response.model)  # Could be "gpt-5" or a fallback like "claude-opus-4.5"

Key fallback chains configured by default:

| Primary Model | Fallback Sequence |
| --- | --- |
| gpt-5 | gpt-5.2, claude-opus-4.5, grok-4 |
| claude-opus-4.5 | claude-sonnet-4.5, gpt-5, grok-4 |
| claude-sonnet-4.5 | claude-opus-4.5, gpt-5, gemini-3-pro |
| gemini-3-pro | gemini-2.5-pro, claude-sonnet-4.5, gpt-5 |
| deepseek-r1 | o3, deepseek-v3 |

Admin API#

The Admin API at http://localhost:8086 (production: https://api.aicontrolplane.dev/admin/api/v1) provides management endpoints for programmatic configuration. It uses JWT authentication:

# Get a JWT token
TOKEN=$(curl -s http://localhost:8086/auth/login \
  -H "Content-Type: application/json" \
  -d "{\"api_key\": \"$SCUTUM_API_KEY\"}" | jq -r '.access_token')

Available Endpoints#

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/models | List all model configurations |
| PUT | /api/v1/models/{model_id} | Update a model configuration |
| GET | /api/v1/budgets | List all budgets |
| POST | /api/v1/budgets | Create a budget |
| PUT | /api/v1/budgets/{id} | Update a budget |
| GET | /api/v1/teams | List all teams |
| POST | /api/v1/teams | Create a team |
| PUT | /api/v1/teams/{team_id} | Update a team |
| DELETE | /api/v1/teams/{team_id} | Delete a team |
| GET | /api/v1/guardrail-assignments | List guardrail-to-team assignments |
| GET | /api/v1/keys | List all API keys |
| POST | /api/v1/keys/generate | Generate a new API key |
| POST | /api/v1/keys/update | Update an API key |
| POST | /api/v1/keys/delete | Delete API keys |
| GET | /api/v1/mcp-servers | List MCP server configs |
| POST | /api/v1/mcp-servers | Create an MCP server config |
| PUT | /api/v1/mcp-servers/{id} | Update an MCP server config |
| POST | /api/v1/mcp-servers/{id}/test | Test MCP server connectivity |
| GET | /api/v1/mcp-servers/sync/preview | Preview Agent Gateway config |
| POST | /api/v1/mcp-servers/sync | Deploy MCP configs to Agent Gateway |
| GET | /api/v1/workflows | List workflow definitions |
| POST | /api/v1/workflows | Create a workflow |
| POST | /api/v1/workflow-executions | Execute a workflow |
| GET | /api/v1/workflow-executions | List workflow executions |
| GET | /api/v1/workflow-executions/{id} | Get execution details |
| GET | /api/v1/routing-policies | List routing policies |
| POST | /api/v1/routing-policies | Create a routing policy |
| GET | /api/v1/metrics/realtime | Get real-time platform metrics |
| GET | /api/v1/settings | Get platform settings |
| PUT | /api/v1/settings | Update platform settings |

# Example: List models
curl http://localhost:8086/api/v1/models \
  -H "Authorization: Bearer $TOKEN"

# Example: Generate an API key
curl -X POST http://localhost:8086/api/v1/keys/generate \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"key_alias": "my-service", "max_budget": 100, "duration": "90d"}'

# Example: Preview Agent Gateway config
curl http://localhost:8086/api/v1/mcp-servers/sync/preview \
  -H "Authorization: Bearer $TOKEN"

See the Admin UI Guide for the full web console walkthrough.