Skip to content
This repository was archived by the owner on Mar 6, 2026. It is now read-only.

imran-siddique/agent-sre

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Agent SRE

Reliability Engineering for AI Agent Systems

SLOs · Error Budgets · Chaos Testing · Progressive Delivery · Cost Guardrails

GitHub Stars Sponsor License Python CI PyPI Downloads Downloads OWASP Agent-OS Compatible AgentMesh Compatible Part of Agent Governance Ecosystem Discussions awesome-opentelemetry

If this project helps you, please star it! It helps others discover Agent SRE.

🔗 Part of the Agent Governance Ecosystem — Works with Agent OS (governance), AgentMesh (identity & trust), and Agent Hypervisor (runtime sessions)

📦 Install the full stack: pip install ai-agent-governance[full]PyPI | GitHub

Quick StartArchitectureExamplesBenchmarksDocsAgent OSAgentMeshAgent Hypervisor

Trusted By — Part of the AgentMesh Governance Ecosystem

Dify LlamaIndex Agent-Lightning LangGraph OpenAI Agents OpenClaw

Reliability layer across 170K+ combined GitHub stars of integrated projects — Dify (65K ⭐), LlamaIndex (47K ⭐), Agent-Lightning (15K ⭐), LangGraph, OpenAI Agents, and OpenClaw.


📊 By The Numbers

1,089+

Tests Passing

12+

Framework Adapters
LangChain · CrewAI · AutoGen
LangGraph · Dify · more

11

Observability Platforms
Langfuse · LangSmith · Arize
Datadog · Prometheus · more

OpenTelemetry

Native OTLP Export

7

SRE Engines

9

Chaos Fault Templates

7

SLI Types

100%

Test Coverage
on Core Engines

💡 Why Agent SRE?

The problem: AI agents fail silently, have no error budgets, and cascading failures propagate unchecked. Your APM says "HTTP 200, all green" while your agent just approved a fraudulent transaction.

Our solution: Apply proven SRE principles to AI agents — SLOs, error budgets, chaos testing, and circuit breakers. The same discipline that keeps Google, Netflix, and Spotify reliable, adapted for non-deterministic agent workloads.

Built for the $47B AI agent market — the reliability layer that makes autonomous agents production-ready.

🛡️ OWASP Agentic Security Coverage

Agent SRE directly addresses OWASP Agentic Security Initiative risk ASI08 — Cascading Failures:

OWASP Risk Agent SRE Coverage
ASI08: Cascading Failures Circuit breakers, error budgets, fault isolation, chaos testing to prove resilience
ASI07: Uncontrolled Costs Per-task cost limits, org budgets, anomaly detection, auto-throttle, kill switch
ASI09: Lack of Observability 7 SLI types, OpenTelemetry export, 11 observability platform integrations
ASI10: Inadequate Testing Chaos engineering with 9 fault templates, progressive delivery with shadow & canary

See full OWASP Agentic Top 10 mapping →


🏗️ Architecture Diagram

flowchart LR
    subgraph Agent["🤖 Your AI Agents"]
        A1[Agent A]
        A2[Agent B]
    end

    subgraph SRE["⚙️ Agent SRE"]
        SLO["📊 SLO Engine\n7 SLI Types"]
        EB["📉 Error Budget\nBurn Rate Alerts"]
        CHAOS["💥 Chaos Engine\n9 Fault Templates"]
        CB["🔌 Circuit Breaker\nOpen / Half-Open / Closed"]
        CANARY["🐤 Canary Deploy\nShadow → 5% → 25% → 100%"]
        COST["💰 Cost Guard\nPer-task + Org Budgets"]
        INC["🚨 Incident Manager\nCorrelation + Postmortem"]
    end

    subgraph Observe["📡 Observability"]
        OTEL["OpenTelemetry"]
        GRAF["Grafana"]
        PROM["Prometheus"]
        LF["Langfuse"]
        LS["LangSmith"]
    end

    subgraph Ecosystem["🌐 Agent Governance Ecosystem"]
        OS["Agent OS\nPolicy & Audit"]
        MESH["AgentMesh\nIdentity & Trust"]
        HV["Agent Hypervisor\nRuntime Sessions"]
    end

    A1 & A2 --> SLO
    SLO --> EB
    EB -->|Budget Exhausted| CB
    EB -->|Budget Healthy| CANARY
    CHAOS -->|Inject Faults| A1 & A2
    CHAOS -->|Measure Impact| SLO
    COST -->|Limit Exceeded| CB
    CB -->|Trip| INC
    INC -->|Alert| Observe
    SLO --> OTEL
    OTEL --> GRAF & PROM & LF & LS
    OS -->|Policy Violations| SLO
    MESH -->|Trust Scores| SLO
    HV -->|Session Events| SLO
Loading

⚡ Quick Start in 30 Seconds

pip install agent-sre
from agent_sre import SLO, ErrorBudget
from agent_sre.slo.indicators import TaskSuccessRate, CostPerTask, HallucinationRate

# Define what "reliable" means for your agent
slo = SLO(
    name="my-agent",
    indicators=[
        TaskSuccessRate(target=0.95, window="24h"),
        CostPerTask(target_usd=0.50, window="24h"),
        HallucinationRate(target=0.05, window="24h"),
    ],
    error_budget=ErrorBudget(total=0.05),
)

# After each agent task
slo.indicators[0].record_task(success=True)
slo.indicators[1].record_cost(cost_usd=0.35)
slo.indicators[2].record_evaluation(hallucinated=False)
slo.record_event(good=True)

# Check health
status = slo.evaluate()  # HEALTHY, WARNING, CRITICAL, or EXHAUSTED
print(f"Budget remaining: {slo.error_budget.remaining_percent:.1f}%")

That's it. Your agent now has SLOs, error budgets, and burn rate alerts. See all examples →


The Problem

AI agents in production fail differently than traditional services:

Failure Mode Traditional Service AI Agent
Crash Stack trace, restart Same — but rare
Wrong answer N/A Returns "success" but the answer is wrong
Silent degradation Latency spike Reasoning quality drops, no metric moves
Cost explosion Predictable Runaway tool loops burn $10K in minutes
Cascade failure Service A → B Agent A trusts Agent B who hallucinates
Tool drift API versioning MCP server schema changes silently break workflows

Your APM dashboard says "HTTP 200, latency 150ms, all green" while your agent just approved a fraudulent transaction.

Traditional monitoring catches crashes. Agent SRE catches everything else.

The Solution

Agent SRE brings Site Reliability Engineering to AI agents — the same discipline that keeps Google, Netflix, and Spotify reliable, adapted for non-deterministic agent workloads.

┌─────────────────────────────────────────────────────────────────┐
│                      Your AI Agents                             │
├─────────────────────────────────────────────────────────────────┤
│  Agent SRE — The Reliability Lifecycle                          │
│                                                                 │
│  1. DEFINE    SLOs — what does "reliable" mean?                  │
│  2. MEASURE   SLIs — are we meeting those targets?              │
│  3. PROTECT   Cost Guard + Circuit Breaker — prevent disasters  │
│  4. SHIP      Shadow + Canary — deploy changes safely           │
│  5. BREAK     Chaos Engine — prove resilience before prod does  │
│  6. RESPOND   Incidents + Postmortem — recover fast             │
│  7. LEARN     Replay + Diff — understand exactly what happened  │
├─────────────────────────────────────────────────────────────────┤
│  AgentMesh — Identity, Trust, Routing                           │
├─────────────────────────────────────────────────────────────────┤
│  Agent OS — Policy Enforcement, Audit, Compliance               │
└─────────────────────────────────────────────────────────────────┘

Core Capabilities

1. SLO Engine — Define What "Reliable" Means

Traditional SRE defines SLOs for services (99.9% uptime). Agent SRE defines SLOs for agent behavior:

SLI (Indicator) Example SLO What It Catches
Task Success Rate 99.5% of tasks correct Silent reasoning failures
Tool Call Accuracy 99.9% correct tool selection Wrong tool, wrong arguments
Response Latency (P95) < 5s single-step Stuck in reasoning loops
Cost Per Task < $0.50 mean Runaway tool loops
Policy Compliance 100% adherence Safety violations
Scope Chain Depth ≤ 3 hops Unbounded delegation
Hallucination Rate < 1% factual errors Confident wrong answers
from agent_sre import SLO, ErrorBudget
from agent_sre.slo.indicators import TaskSuccessRate, CostPerTask, HallucinationRate

slo = SLO(
    name="customer-support-agent",
    indicators=[
        TaskSuccessRate(target=0.995, window="30d"),
        CostPerTask(target_usd=0.50, window="24h"),
        HallucinationRate(target=0.05, window="24h"),
    ],
    error_budget=ErrorBudget(
        total=0.005,
        burn_rate_alert=2.0,      # Alert at 2x normal burn
        burn_rate_critical=10.0,  # Page at 10x burn
    )
)

slo.record_event(good=True)
status = slo.evaluate()  # HEALTHY | WARNING | CRITICAL | EXHAUSTED

2. Replay Engine — Time-Travel Debugging for Agents

Capture every decision point and replay it exactly:

from agent_sre.replay.capture import TraceCapture, SpanKind, TraceStore

# Capture mode: records all decisions, tool calls, costs
with TraceCapture(agent_id="support-bot-v3", task_input="Refund order #12345") as capture:
    span = capture.start_span("tool_call", SpanKind.TOOL_CALL,
                              input_data={"tool": "lookup_order", "order_id": "12345"})
    span.finish(output={"status": "found", "amount": 49.99}, cost_usd=0.02)

    span = capture.start_span("llm_inference", SpanKind.LLM_INFERENCE,
                              input_data={"prompt": "Process refund for $49.99"})
    span.finish(output={"decision": "approve_refund"}, cost_usd=0.15)

# Save trace, replay later, diff with production
store = TraceStore()
store.save(capture.trace)

Features: deterministic replay, trace diffing, trace comparison, multi-agent distributed traces, automatic PII redaction.

3. Progressive Delivery — Ship Agent Changes Safely

# agent-sre.yaml — GitOps deployment spec
apiVersion: agent-sre/v1
kind: AgentRollout
metadata:
  name: support-bot-v4
spec:
  strategy:
    type: canary
    steps:
      - shadow: 100%     # Route all traffic to v4 in preview mode
        duration: 1h
        analysis:
          - metric: task_success_rate
            threshold: 0.99
      - canary: 5%        # 5% real traffic to v4
        duration: 2h
        analysis:
          - metric: response_quality_score
            threshold: 0.95
          - metric: cost_per_task
            max_increase: 20%
      - canary: 25%
        duration: 4h
      - canary: 100%      # Full rollout
    rollback:
      automatic: true
      on:
        - error_budget_burn_rate > 5.0
        - policy_violations > 0
        - cost_anomaly_detected

4. Chaos Engineering — Break Agents on Purpose

from agent_sre.chaos.engine import ChaosExperiment, Fault, AbortCondition

experiment = ChaosExperiment(
    name="tool-failure-resilience",
    target_agent="research-agent",
    faults=[
        Fault.tool_timeout("web_search", delay_ms=30_000),
        Fault.tool_error("database_query", error="connection_refused", rate=0.5),
        Fault.llm_latency("openai", p99_ms=15_000),
        Fault.delegation_reject("analyzer", rate=0.1),
    ],
    duration_seconds=1800,
    abort_conditions=[
        AbortCondition(metric="task_success_rate", threshold=0.80, comparator="lte"),
        AbortCondition(metric="cost_per_task", threshold=5.00, comparator="gte"),
    ],
)

experiment.start()
for fault in experiment.faults:
    experiment.inject_fault(fault, applied=True)

resilience = experiment.calculate_resilience(
    baseline_success_rate=0.98,
    experiment_success_rate=0.88,
    recovery_time_ms=2500,
)
print(f"Fault Impact Score: {resilience.overall:.0f}/100")

9 pre-built experiment templates: tool timeout, error storms, LLM degradation, cascading failures, cost explosions, and more.

5. Cost Guard — Prevent $10K Surprises

from agent_sre.cost.guard import CostGuard

guard = CostGuard(
    per_task_limit=2.00,          # Hard cap per task
    per_agent_daily_limit=100.00, # Per agent per day
    org_monthly_budget=5000.00,   # Organization total
    anomaly_detection=True,       # Alert on unusual patterns
    auto_throttle=True,           # Slow down agents approaching limits
    kill_switch_threshold=0.95,   # Kill at 95% budget
)

# Before each task
allowed, reason = guard.check_task("my-agent", estimated_cost=0.50)
if not allowed:
    print(f"Blocked: {reason}")

# After each task
alerts = guard.record_cost("my-agent", "task-42", cost_usd=0.35)
for alert in alerts:
    print(f"⚠️ {alert.severity.value}: {alert.message}")

Anomaly detection uses Z-score, IQR, and EWMA methods with severity scoring.

6. Incident Manager — When Agents Fail in Production

from agent_sre.incidents.detector import IncidentDetector, Signal, SignalType

detector = IncidentDetector(correlation_window_seconds=300)

# Register automated responses
detector.register_response("slo_breach", ["manual_rollback", "notify_oncall"])
detector.register_response("cost_anomaly", ["throttle_agent", "create_postmortem_template"])

# Ingest signals from your monitoring
signal = Signal(
    signal_type=SignalType.ERROR_BUDGET_EXHAUSTED,
    source="support-agent",
    message="Error budget consumed — freeze deployments",
)

incident = detector.ingest_signal(signal)
if incident:
    print(f"🚨 {incident.severity.value}: {incident.title}")

Features: signal correlation, deduplication, circuit breaker per agent, postmortem template generation with timeline and action items.


Ecosystem Integration

Agent SRE completes the governance-to-reliability stack:

Layer Project What It Does
Reliability Agent SRE (this) SLOs, chaos testing, canary deploys, cost guard, replay
Runtime Agent Hypervisor Session isolation, execution rings, saga orchestration
Networking AgentMesh Identity, trust, routing, delegation
Kernel Agent OS Policy enforcement, audit, compliance

With Agent OS

  • Policy violations → SLO breaches (every violation counts against error budget)
  • Audit trail → Replay engine (raw data for deterministic replay)
  • Preview mode → Progressive delivery pipeline

With AgentMesh

  • Trust scores → SLI indicators (mesh trust becomes an SLI)
  • Scope chains → Distributed traces (every hop is a span)
  • Identity rotation → Deployment events (tracked as reliability events)

With OpenTelemetry

  • Native OTLP export for all SLIs and traces
  • Custom semantic conventions for agent-specific telemetry
  • Compatible with Grafana, Prometheus, Jaeger, and other OTLP-compatible backends

Architecture

agent-sre/
├── src/agent_sre/
│   ├── slo/               # SLO definitions, SLI collectors, error budgets
│   │   ├── indicators.py  # 7 built-in SLIs (task success, cost, hallucination, etc.)
│   │   ├── objectives.py  # SLO engine with burn rate alerts
│   │   └── dashboard.py   # SLO dashboard with compliance history
│   ├── replay/            # Deterministic capture and replay engine
│   │   ├── capture.py     # Trace capture with PII redaction
│   │   ├── engine.py      # Replay, diff, trace comparison
│   │   ├── visualization.py  # Execution graphs, critical path
│   │   └── distributed.py # Multi-agent trace reconstruction
│   ├── delivery/          # Progressive delivery (shadow, canary, rollback)
│   │   ├── rollout.py     # Preview mode, staged rollouts, traffic splitting
│   │   └── gitops.py      # Declarative rollout specs (YAML)
│   ├── chaos/             # Chaos engineering and fault injection
│   │   ├── engine.py      # Experiment state machine, fault impact scoring
│   │   └── library.py     # 9 pre-built experiment templates
│   ├── cost/              # Cost tracking, budgets, anomaly detection
│   │   ├── guard.py       # Hierarchical budgets, auto-throttle, kill switch
│   │   └── anomaly.py     # Z-score, IQR, EWMA anomaly detection
│   ├── incidents/         # Detection, response, postmortem generation
│   │   ├── detector.py    # Signal correlation, deduplication, routing
│   │   ├── circuit_breaker.py  # Per-agent circuit breaker (CLOSED/OPEN/HALF_OPEN)
│   │   └── postmortem.py  # Postmortem template with timeline + action items
│   ├── integrations/      # Ecosystem bridges
│   │   ├── agent_os/      # Agent OS policy + audit → SLI bridge
│   │   ├── agent_mesh/    # AgentMesh trust score → SLI bridge
│   │   ├── otel/          # OpenTelemetry export
│   │   ├── langchain/     # LangChain callback handler
│   │   ├── llamaindex/    # LlamaIndex callback handler
│   │   ├── langfuse/      # Langfuse SLO scoring + cost export
│   │   ├── langsmith/     # LangSmith trace + feedback export
│   │   ├── arize/         # Arize/Phoenix span export
│   │   ├── braintrust/    # Braintrust eval + experiment export
│   │   ├── helicone/      # Helicone header injection + logging
│   │   ├── datadog/       # Datadog metrics + events export
│   │   ├── agentops/      # AgentOps session + event recording
│   │   ├── prometheus/    # Prometheus /metrics text format
│   │   └── mcp/           # MCP drift detection
│   ├── mcp/               # MCP server (agent self-monitoring tools)
│   ├── cli/               # CLI tool (agent-sre command)
│   └── alerts/            # Webhook alerting (Slack, PagerDuty, OpsGenie, Teams)
├── dashboards/            # Pre-built Grafana dashboards
├── operator/              # Kubernetes CRDs (AgentSLO, CostBudget)
├── .github/actions/       # GitHub Actions (canary deployment)
├── examples/              # 4 runnable demos
├── tests/                 # 1,089 tests
├── docs/                  # Getting started, concepts, integration guide
└── specs/                 # SLO templates (coming soon)

How It Differs

Observability tools (LangSmith, Langfuse, Arize) tell you what happened. Agent SRE tells you if it was within budget and what to do about it.

Observability Tools Agent SRE
Tracing ✅ Core strength ✅ Trace capture + deterministic replay
Evaluation ✅ LLM-as-judge ✅ SLI recording
SLOs & Error Budgets ✅ Define reliability targets
Canary Deployments ✅ Compare agent versions safely
Chaos Testing ✅ Inject faults, measure resilience
Cost Guardrails ❌ (cost tracking only) ✅ Per-task limits, auto-throttle, kill switch
Incident Detection ✅ SLO breach → auto-incident → postmortem
Progressive Rollout ✅ Preview mode, traffic splitting, rollback

Use both together: observability for deep trace debugging, Agent SRE for production reliability operations.

AI-powered SRE tools (Cleric, Resolve, SRE.ai) use AI to help humans debug infrastructure. Agent SRE applies SRE principles to AI agent systems. Completely different target.

Traditional APM (Prometheus, Grafana, Jaeger) monitors infrastructure. Your dashboard says "HTTP 200, latency 150ms, all green" while your agent just approved a fraudulent transaction. Agent SRE catches reasoning failures, not infrastructure failures.


Status & Maturity

✅ Fully Implemented (20,000+ lines, 1,089 tests)

Component Status Description
SLO Engine ✅ Stable 7 SLI types, error budgets, burn rate alerts, auto-fire to AlertManager
Replay Engine ✅ Stable Capture, replay, diff, trace comparison, distributed traces
Progressive Delivery ✅ Stable Preview mode, staged rollouts, analysis gates, manual rollback
Chaos Engine ✅ Stable 9 fault templates, fault impact scoring, abort conditions
Cost Guard ✅ Stable Hierarchical budgets, anomaly detection, auto-throttle
Incident Manager ✅ Stable Signal correlation, circuit breaker, postmortem template
Agent OS Bridge ✅ Stable Policy violations → SLI, audit entries → signals
AgentMesh Bridge ✅ Stable Trust scores → SLI, mesh events → signals
OpenTelemetry ✅ Stable Full span/metric export with OTEL SDK
LangChain Callbacks ✅ Stable Duck-typed callback handler for SLI collection
LlamaIndex Callbacks ✅ Stable Query/retriever/LLM tracking for RAG pipelines
Langfuse ✅ Stable SLO scoring and cost observation export
LangSmith ✅ Stable Run tracing and evaluation feedback export
Arize/Phoenix ✅ Stable Phoenix span export + evaluation import
Braintrust ✅ Stable Eval-driven monitoring and experiment export
Helicone ✅ Stable Header injection for proxy-based cost/latency tracking
Datadog ✅ Stable Metrics and events export for LLM monitoring
AgentOps ✅ Stable Session recording and event tracking
W&B ✅ Stable Experiment tracking with SRE metrics
MLflow ✅ Stable Experiment logging with SLO data
Prometheus ✅ Stable Native /metrics endpoint + Grafana dashboards
MCP Drift Detection ✅ Stable Tool schema fingerprinting, change severity classification
MCP Server ✅ Stable Agent self-monitoring tools (SLO check, cost budget, rollout status)
Webhook Alerting ✅ Stable Slack, PagerDuty, OpsGenie, Microsoft Teams + deduplication
Alert Persistence ✅ Stable SQLite-backed alert history for audit trail
Framework Adapters ✅ Stable LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Semantic Kernel, Dify
CLI Tool ✅ Stable agent-sre CLI for SLO status, cost summary, system info
GitHub Actions ✅ Stable Canary deployment action for CI/CD pipelines
K8s CRDs ✅ Stable AgentSLO and CostBudget custom resource definitions
LLM-as-Judge Evals ✅ Stable RulesJudge + JudgeProtocol, 5 criteria, 3 suite presets
SLO Templates ✅ Stable 4 domain-specific templates (support, coding, research, pipeline)
REST API ✅ Stable Zero-dependency HTTP API for SLO status, incidents, cost, traces
Fleet Management ✅ Stable Multi-agent registry, heartbeats, aggregate health, filtering
Helm Chart ✅ Stable Deployment, Service, CRD templates with configurable values
Benchmark Suite ✅ Stable 10 scenarios across 6 categories with scoring and reporting
Certification ✅ Stable Bronze/Silver/Gold reliability tiers with evidence-based evaluation
A/B Testing ✅ Stable Experiment engine with Welch's t-test and traffic splitting
Protocol Tracing ✅ Stable A2A/MCP-aware distributed tracing with W3C context propagation

Examples

Example Description Command
Quickstart SLO + cost + incident in one script python examples/quickstart.py
LangChain Monitor LangChain RAG agent with SLOs + evals python examples/langchain_monitor.py
Cost Guard Budget enforcement with throttling python examples/cost_guard.py
Canary Rollout Preview + staged rollout with manual rollback python examples/canary_rollout.py
Chaos Test Fault injection and fault impact scoring python examples/chaos_test.py

Docker:

docker compose up quickstart          # Quick demo
docker compose up langchain-monitor   # LangChain + SLOs + LLM-as-Judge
docker compose up api                 # REST API on port 8080

Kubernetes:

helm install agent-sre ./deployments/helm/agent-sre

REST API

Full FastAPI REST API with 27 endpoints and interactive Swagger docs:

pip install agent-sre[api]
uvicorn agent_sre.api.server:app
# Open http://localhost:8000/docs for Swagger UI

Endpoints: SLOs, Cost, Chaos, Incidents, Delivery, Health, Metrics.

Visualization Dashboard

Interactive Streamlit dashboard with 5 tabs:

cd examples/dashboard
pip install -r requirements.txt
streamlit run app.py

Tabs: SLO Health | Cost Management | Chaos Engineering | Incidents | Progressive Delivery


Documentation

  • Getting Started — Install and define your first SLO in 5 minutes
  • Deployment Guide — Docker, integration patterns, production checklist
  • Security Model — Threat model, attack vectors, best practices
  • Concepts — Why agent reliability is different from infrastructure reliability
  • Integration Guide — Use with Agent OS, AgentMesh, and OpenTelemetry
  • Comparison — Detailed comparison with other tools

Frequently Asked Questions

Why do AI agents need SRE? AI agents in production are services that can fail, degrade, or cost too much -- just like any other service. Agent SRE applies proven Site Reliability Engineering practices (SLOs, error budgets, chaos testing, staged rollouts) specifically to AI agent systems, catching reliability issues before they impact users.

How does chaos engineering work for AI agents? Agent SRE injects failures like increased latency, dropped responses, corrupted outputs, and resource exhaustion at specific points in agent workflows. It measures impact on SLOs, triggers automated rollbacks when error budgets are exceeded, and provides replay debugging to analyze failure cascades.

What SLOs can I define for AI agents? Agent SRE supports SLOs for response time, accuracy, cost per inference, safety compliance, and custom metrics. Each SLO has an error budget that burns down when violated. Burn rate alerts notify you before the budget is exhausted, enabling proactive intervention.

How does Agent SRE integrate with existing monitoring? Agent SRE exports metrics via OpenTelemetry and Prometheus. It integrates with 11 observability platforms (Langfuse, LangSmith, Arize, Datadog, AgentOps, W&B, MLflow, and more). It's part of the Agent Governance Ecosystem with 4,310+ tests across 4 repos.


Contributing

git clone https://github.com/imran-siddique/agent-sre.git
cd agent-sre
pip install -e ".[dev]"
pytest

See CONTRIBUTING.md for guidelines.

🗺️ Roadmap

Quarter Milestone
Q1 2026 ✅ Core 7 engines, OTel integration, Prometheus dashboards
Q2 2026 Kubernetes operator, PagerDuty/OpsGenie integration
Q3 2026 ML-powered anomaly detection, auto-remediation
Q4 2026 Managed cloud service, SOC2 compliance automation

License

MIT — See LICENSE for details.


Observability tells you what happened. Agent SRE tells you if it was within budget.


🌐 Agent Governance Ecosystem

Repository Purpose Stars
Agent OS Governance kernel — policy enforcement, audit, compliance Stars
Agent SRE Reliability — SLOs, chaos testing, cost guard (this repo) Stars
AgentMesh Networking — identity, trust, routing, delegation Stars
Agent Hypervisor Runtime — session isolation, execution rings, sagas Stars
Agent Governance Unified installer — pip install ai-agent-governance[full] Stars

GitHub · Docs · PyPI · Discussions · Sponsor