SRE Agent#

LLM-driven incident remediation with human-in-the-loop approval. The SRE agent watches platform events, proposes remediations, and scores each proposal for risk on a 0–100 scale. Low-risk actions can be auto-approved; anything above the threshold executes only after a human approves.

Tier: Available on Business and Enterprise licenses. Not included in Trial or Team.

How it works#

incident event ──► SRE agent ──► proposed action ──► risk score
                                       │
                          ┌────────────┴───────────┐
                  score ≤ threshold        score > threshold
                          │                         │
                  auto-execute (if              push to operator
                  SRE_AUTO_APPROVE)             via Slack/email/UI
                                                    │
                                              operator approves ──► execute
                                                    │
                                              operator rejects ──► log + close

The agent reasons about the incident with a configured LLM (default claude-sonnet-4-6), proposes a concrete action from the action library (e.g., "circuit-break provider X for 5 minutes", "raise rate limit for team Y", "page on-call"), assigns a risk score, and routes the proposal either to auto-execution or to an operator for approval.
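The routing step in the diagram can be sketched in a few lines. This is a minimal illustration of the decision, not the agent's actual code: the `Proposal` class, the `route` function, and the two return labels are hypothetical names, and only the threshold comparison and the `SRE_AUTO_APPROVE` condition come from the diagram above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical container for one proposed remediation."""
    action: str      # human-readable action from the action library
    risk_score: int  # 0-100, assigned by the agent


def route(proposal: Proposal, threshold: int = 40, auto_approve: bool = False) -> str:
    """Return 'auto-execute' or 'needs-approval' for a proposal."""
    if not 0 <= proposal.risk_score <= 100:
        raise ValueError("risk score must be in 0-100")
    # Scores above the threshold are always pushed to an operator.
    if proposal.risk_score > threshold:
        return "needs-approval"
    # At or below the threshold, execution is automatic only when
    # auto-approval is switched on; otherwise a human still decides.
    return "auto-execute" if auto_approve else "needs-approval"
```

With the defaults (`threshold=40`, `auto_approve=False`), every proposal returns `needs-approval`; turning `auto_approve` on only changes the outcome for scores of 40 or less.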

Trigger events#

The agent subscribes to four event types from admin-api's event bus:

Event	Fires when
`provider.unhealthy`	Error rate exceeds 25% across 5+ requests in a 5-minute window
`budget.exceeded`	A team or org breaches a soft or hard budget limit
`sla.violation`	p95/p99 latency or success-rate target is breached for a configured SLA
`latency.spike`	Provider response time crosses 3× the rolling 1-hour median
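To make two of these conditions concrete, here is a rough sketch of how they could be evaluated. The function names and input shapes are assumptions for illustration; admin-api's real evaluation logic is not shown in this page, and "crosses" is read here as "strictly exceeds".

```python
from statistics import median


def provider_unhealthy(outcomes: list[bool],
                       min_requests: int = 5,
                       max_error_rate: float = 0.25) -> bool:
    """outcomes: one entry per request in the last 5 minutes, True = error."""
    # Too few requests in the window: not enough signal to fire.
    if len(outcomes) < min_requests:
        return False
    return sum(outcomes) / len(outcomes) > max_error_rate


def latency_spike(current_ms: float,
                  last_hour_ms: list[float],
                  factor: float = 3.0) -> bool:
    """True when the current response time exceeds factor x the 1-hour median."""
    if not last_hour_ms:
        return False
    return current_ms > factor * median(last_hour_ms)
```

For example, 2 errors out of 5 requests is a 40% error rate and would fire `provider.unhealthy`, while 1 error out of 4 requests would not, because the window holds fewer than 5 requests.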

Action library#

A curated set of safe, reversible operations the agent can propose. Each action has a baseline risk score; the agent adjusts up or down based on context (time of day, blast radius, team affected, etc.).

Action	Baseline risk	Reversible
Circuit-break provider for ≤ 5 min	15	Yes (auto-restores)
Re-route model group to different provider	25	Yes
Raise rate limit by ≤ 50% (one team)	30	Yes
Page on-call via PagerDuty	35	One-shot
Disable a guardrail rule	60	Yes
Bump a hard budget cap	75	Yes (audit trail)
Disable an SLA contract	85	Yes

Any action scoring above SRE_RISK_THRESHOLD (default 40) requires human approval.
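The table and the threshold rule combine as follows. This sketch copies the baseline scores from the table; the `adjusted_score` helper and its `context_delta` argument are hypothetical stand-ins for the agent's contextual adjustment, whose real heuristics and magnitudes are not documented here.

```python
# Baseline risk scores, copied from the action library table.
BASELINE_RISK = {
    "Circuit-break provider for <= 5 min": 15,
    "Re-route model group to different provider": 25,
    "Raise rate limit by <= 50% (one team)": 30,
    "Page on-call via PagerDuty": 35,
    "Disable a guardrail rule": 60,
    "Bump a hard budget cap": 75,
    "Disable an SLA contract": 85,
}


def adjusted_score(baseline: int, context_delta: int = 0) -> int:
    """Apply a (hypothetical) contextual adjustment, clamped to 0-100."""
    return max(0, min(100, baseline + context_delta))


def gated_actions(threshold: int = 40) -> list[str]:
    """Actions whose baseline score alone already requires approval."""
    return [name for name, score in BASELINE_RISK.items() if score > threshold]
```

At the default threshold, `gated_actions()` returns the last three rows of the table. A contextual bump can still gate a lower-risk action: an illustrative adjustment of +10 would lift "Page on-call via PagerDuty" from 35 to 45, above the threshold.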

Enable#

The agent ships behind a Compose profile so it doesn't run on deployments that don't need it.

./scutum up --profile sre

Or to bring up the full stack:

./scutum up --profile full

Verify it's healthy:

./scutum ps | grep sre-agent
./scutum logs sre-agent

Configuration#

Set in config/.env:

# Model the agent uses to reason about incidents (must be in your routing config).
SRE_AGENT_MODEL=claude-sonnet-4-6

# Risk threshold (0-100) above which an action requires human approval.
# Lower = more conservative. At the default of 40, the four actions with a baseline
# risk of 35 or less are eligible for auto-approval; everything else is gated.
SRE_RISK_THRESHOLD=40

# When true, low-risk actions execute without waiting for approval.
# Default false: every action goes through human-in-the-loop approval until you've
# observed the agent for a few weeks and tuned the threshold to your taste.
SRE_AUTO_APPROVE=false
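If you want to sanity-check these values before starting the service, a small pre-flight script along these lines may help. It is a sketch only: the `SreConfig` and `load_config` names are hypothetical, and how the agent itself parses booleans is an assumption (only the literal string `true`, case-insensitive, is treated as enabled here).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SreConfig:
    model: str
    risk_threshold: int
    auto_approve: bool


def load_config(env=None) -> SreConfig:
    """Read the three SRE_* settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    threshold = int(env.get("SRE_RISK_THRESHOLD", "40"))
    if not 0 <= threshold <= 100:
        raise ValueError("SRE_RISK_THRESHOLD must be between 0 and 100")
    # Assumption: any value other than "true" leaves auto-approval off.
    auto = env.get("SRE_AUTO_APPROVE", "false").strip().lower() == "true"
    return SreConfig(
        model=env.get("SRE_AGENT_MODEL", "claude-sonnet-4-6"),
        risk_threshold=threshold,
        auto_approve=auto,
    )
```

Passing a plain dict, for example `load_config({"SRE_RISK_THRESHOLD": "25"})`, lets you test candidate values without touching the real environment.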

Approval flow#

When the agent proposes a gated action, three things happen in parallel:

The proposal lands in the SRE Agent page in the Admin Console with the full reasoning, risk score, and a one-click approve / reject.
A notification fires through your configured event channels — Slack, PagerDuty, email, or a webhook — with a deep-link to the proposal.
The proposal is logged to the audit trail; the entry later records the operator who approves or rejects it.

Approvals are scoped to single proposals; there's no blanket "approve all" mode. The audit trail is immutable and retained according to your license tier (30 days on Trial, 365 days on Business, 7 years on Enterprise).
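The per-proposal scoping can be pictured as a tiny state machine: each proposal moves from pending to approved or rejected exactly once, and the deciding operator is recorded. The `PendingProposal` class and `decide` function below are hypothetical names for illustration and do not reflect the Admin Console's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingProposal:
    proposal_id: str
    action: str
    risk_score: int
    status: str = "pending"           # pending -> approved | rejected
    decided_by: Optional[str] = None  # operator recorded for the audit trail


def decide(proposal: PendingProposal, operator: str, approve: bool) -> PendingProposal:
    """Record one operator decision on one proposal; decisions are final."""
    if proposal.status != "pending":
        raise ValueError(f"proposal {proposal.proposal_id} is already {proposal.status}")
    proposal.status = "approved" if approve else "rejected"
    proposal.decided_by = operator
    return proposal
```

Because `decide` takes a single proposal, approving several of them means several explicit decisions, each attributed to an operator.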

Disabling#

To stop the agent from running:

docker compose --env-file config/.env --profile sre stop sre-agent

Pending proposals stay in the queue and resume processing when the service starts again — they aren't lost.

Audit + compliance#

Every proposal, score, approval, rejection, and execution is logged to the sre_actions audit trail with:

The triggering event payload
The full LLM reasoning chain
The risk score and which heuristics moved it
The operator who approved/rejected
The actual remediation result

You can export a CSV of the trail from the Admin Console → Audit Log filtered by source: sre-agent. SOC 2 reviewers usually ask for this within 30 minutes of starting an audit; have it ready.
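For a sense of what one audit entry carries, the fields listed above can be modeled as a flat record and serialized to CSV. The field names and the `to_csv` helper are assumptions made for this sketch; the real sre_actions schema and the column layout of the exported CSV may differ.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical shape of one sre_actions entry."""
    event_payload: str     # the triggering event payload
    reasoning: str         # the full LLM reasoning chain
    risk_score: int        # final 0-100 score
    score_heuristics: str  # which heuristics moved the score
    operator: str          # who approved or rejected
    result: str            # the actual remediation result


def to_csv(records: list[AuditRecord]) -> str:
    """Serialize records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(AuditRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()
```

Keeping the record flat, one column per field, is one way to make the export easy to filter and hand to a reviewer as-is.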