k8s-ai-sre Documentation
Start the product, trigger an incident, and review the approval loop without digging through the whole repo.
`k8s-ai-sre` is an AI-assisted Kubernetes incident investigator with guarded remediation. Use this page to choose the fastest path for your role.
Quick Start
Run a local demo, trigger an investigation, and see the response shape in minutes.
Deploy to Kubernetes
Install with Helm or manifests, verify readiness, and keep rollback steps nearby.
Validate the Full Loop
Follow the test runbook for investigate, notify, approve, and execute flows.
Input channels
HTTP API, Alertmanager webhooks, and Telegram approvals.
Guardrails
Explicit approval, namespace allow-lists, and `kubectl auth can-i` checks.
Operator view
Readable incident summary, action proposals, and audit-friendly status flows.
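The three guardrails named above can be pictured as a single gate in front of every remediation. The sketch below is illustrative and is not the project's implementation: the `ALLOWED_NAMESPACES` set and the function names are hypothetical. Only the `kubectl auth can-i <verb> <resource> --namespace <ns>` command shape comes from the real CLI.

```python
import subprocess

# Hypothetical allow-list; the real configuration mechanism is not shown on this page.
ALLOWED_NAMESPACES = {"demo", "staging"}


def can_i_command(verb, resource, namespace):
    """Build the kubectl RBAC self-check command for one action."""
    return ["kubectl", "auth", "can-i", verb, resource, "--namespace", namespace]


def kubectl_says_yes(command):
    """Run the check; kubectl prints "yes" and exits 0 when the action is allowed."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() == "yes"


def may_execute(verb, resource, namespace, approved, run_check=kubectl_says_yes):
    """Allow an action only if all three guardrails pass."""
    if not approved:                          # explicit operator approval
        return False
    if namespace not in ALLOWED_NAMESPACES:   # namespace allow-list
        return False
    return run_check(can_i_command(verb, resource, namespace))  # RBAC check
```

Passing `run_check` as a parameter keeps the gate testable without a cluster. For example, `may_execute("patch", "deployments", "demo", approved=True, run_check=lambda cmd: True)` exercises the approval and allow-list logic only.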
Choose the right path
I want to try it locally
Install dependencies, configure model access, create the demo scenario, and hit `/investigate`.
I need the cluster runbook
Use the canonical deploy, preflight, validation, and rollback instructions.
I’m changing the product
Start with the contributor path, then drop into setup and validation docs only where needed.
I need the system model
Review the architecture, evidence flow, and core moving parts before deeper changes.
Current product scope
- Investigates pods and deployments with real Kubernetes reads.
- Collects evidence from object state, events, logs, and optional Prometheus queries.
- Accepts manual investigations at `/investigate` and Alertmanager webhooks at `/webhooks/alertmanager`.
- Stores incidents and pending actions in SQLite by default at `/tmp/k8s-ai-sre-store.sqlite3`.
- Sends Telegram notifications and supports `/incident`, `/status`, `/approve`, and `/reject`.
- Requires explicit approval before any remediation action executes.
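To exercise the webhook path without a running Alertmanager, a request can be built by hand. The sketch below follows the general shape of Alertmanager's webhook payload. The base URL, the receiver name, and the assumption that the service reads the `namespace` and `pod` labels are all hypothetical; check the Quick Start for the actual values.

```python
import json
import urllib.request

# Assumption: a locally running instance; the real host and port may differ.
BASE_URL = "http://localhost:8080"


def alertmanager_payload(alertname, namespace, pod):
    """Build a minimal payload in the shape Alertmanager sends to webhooks."""
    labels = {"alertname": alertname, "namespace": namespace, "pod": pod}
    return {
        "version": "4",
        "status": "firing",
        "receiver": "k8s-ai-sre",  # hypothetical receiver name
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": {},
        "alerts": [
            {
                "status": "firing",
                "labels": labels,
                "annotations": {},
                "startsAt": "2024-01-01T00:00:00Z",
            }
        ],
    }


def build_webhook_request(payload, base_url=BASE_URL):
    """Prepare (but do not send) a POST to the Alertmanager webhook endpoint."""
    return urllib.request.Request(
        base_url + "/webhooks/alertmanager",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Building the request separately makes it easy to inspect the body first.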
Operator loop
- An alert or manual request targets a Kubernetes object.
- The agent gathers evidence and explains the likely cause.
- The system proposes one or more remediation actions.
- An operator approves or rejects the proposal.
- Approved actions execute through the configured guardrails.
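The loop above can be modeled as a small state machine in which an action can only execute after it has been explicitly approved. This is a conceptual sketch, not the project's data model: the class names, status values, and the `runner` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActionStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


@dataclass
class ProposedAction:
    description: str
    status: ActionStatus = ActionStatus.PENDING


@dataclass
class Incident:
    target: str                 # the Kubernetes object under investigation
    summary: str = ""           # the agent's explanation of the likely cause
    actions: list = field(default_factory=list)


def decide(action, approve):
    """Record the operator's decision; only pending actions can be decided."""
    if action.status is not ActionStatus.PENDING:
        raise ValueError("action has already been decided")
    action.status = ActionStatus.APPROVED if approve else ActionStatus.REJECTED


def execute(action, runner):
    """Run an action only if it was approved, then mark it executed."""
    if action.status is not ActionStatus.APPROVED:
        raise PermissionError("action was not approved")
    runner(action.description)
    action.status = ActionStatus.EXECUTED
```

Keeping the approval check inside `execute` means no caller can skip the operator step by accident.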
If you only need one next step, start with Quick Start. It is now the first path on this site and the fastest way to validate the product.
Source of truth
This docs site must stay aligned with repository sources:
- product behavior: `README.md`
- validation runbook: `TESTING.md`
- deploy/rollback and startup contract: `docs/deployment.md`
- active backlog and priorities: Paperclip issues for this project
When these sources change, update matching docs pages in the same pull request.