k8s-ai-sre Documentation

Investigate, propose, approve, execute

Start the product, trigger an incident, and review the approval loop without digging through the whole repo.

`k8s-ai-sre` is an AI-assisted Kubernetes incident investigator with guarded remediation. Use this page to choose the fastest path for your role.

Input channels: HTTP API, Alertmanager webhooks, and Telegram approvals.
Guardrails: Explicit approval, namespace allow-lists, and `kubectl auth can-i` checks.
Operator view: Readable incident summary, action proposals, and audit-friendly status flows.
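
As a rough illustration of how the three guardrails could combine, here is a minimal Python sketch. It is not the project's implementation: the allow-list contents and the function names `namespace_allowed`, `can_i_command`, and `guardrails_pass` are hypothetical. The only real interface it relies on is the `kubectl auth can-i` command, which prints `yes` or `no`.

```python
from typing import List

# Hypothetical allow-list; the real project reads its own configuration.
ALLOWED_NAMESPACES = frozenset({"staging", "payments"})


def namespace_allowed(namespace: str, allowed=ALLOWED_NAMESPACES) -> bool:
    """Return True only when the namespace is explicitly allow-listed."""
    return namespace in allowed


def can_i_command(verb: str, resource: str, namespace: str) -> List[str]:
    """Build the argv for a `kubectl auth can-i` permission check."""
    return ["kubectl", "auth", "can-i", verb, resource, "--namespace", namespace]


def guardrails_pass(namespace: str, approved: bool, can_i_answer: str) -> bool:
    """Combine the guardrails; `can_i_answer` is kubectl's stdout ("yes" or "no")."""
    return (
        approved
        and namespace_allowed(namespace)
        and can_i_answer.strip() == "yes"
    )
```

The sketch only builds the command line; running it (for example with `subprocess.run`) and feeding its output to `guardrails_pass` is left to the caller, so the logic can be checked without a cluster.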

Choose the right path

Current product scope

  • Investigates pods and deployments with real Kubernetes reads.
  • Collects evidence from object state, events, logs, and optional Prometheus queries.
  • Accepts manual investigations at `/investigate` and Alertmanager webhooks at `/webhooks/alertmanager`.
  • Stores incidents and pending actions in SQLite, by default at `/tmp/k8s-ai-sre-store.sqlite3`.
  • Sends Telegram notifications and supports the `/incident`, `/status`, `/approve`, and `/reject` commands.
  • Requires explicit approval before any remediation action executes.
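
To make the manual path concrete, the sketch below builds a request for the `/investigate` endpoint using only the Python standard library. Everything beyond the endpoint path is an assumption made for illustration: the base URL `http://localhost:8080`, the use of POST with a JSON body, and the payload fields `kind`, `namespace`, and `name`. Check README.md for the actual request contract.

```python
import json
import urllib.request

# Assumption: the service listens locally on this address.
BASE_URL = "http://localhost:8080"


def build_investigate_request(kind: str, namespace: str, name: str,
                              base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to /investigate with a hypothetical payload."""
    payload = {"kind": kind, "namespace": namespace, "name": name}
    return urllib.request.Request(
        url=f"{base_url}/investigate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_investigate_request("deployment", "staging", "checkout")
# Send it with urllib.request.urlopen(req) once the service is running.
```

Building the request separately from sending it keeps the example safe to run without a live service.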

Operator loop

  1. An alert or manual request targets a Kubernetes object.
  2. The agent gathers evidence and explains the likely cause.
  3. The system proposes one or more remediation actions.
  4. An operator approves or rejects the proposal.
  5. Approved actions execute through the configured guardrails.
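
One way to picture steps 3 to 5 is as a small state machine in which an action can only execute after it has been approved. The Python sketch below is illustrative only; the state names and the `advance` helper are hypothetical and are not taken from the project's code.

```python
from enum import Enum


class ActionState(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


# Allowed transitions: nothing reaches EXECUTED without passing APPROVED.
TRANSITIONS = {
    ActionState.PROPOSED: {ActionState.APPROVED, ActionState.REJECTED},
    ActionState.APPROVED: {ActionState.EXECUTED},
    ActionState.REJECTED: set(),
    ActionState.EXECUTED: set(),
}


def advance(current: ActionState, target: ActionState) -> ActionState:
    """Return the target state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

With this shape, calling `advance(ActionState.PROPOSED, ActionState.EXECUTED)` raises `ValueError`, which mirrors the rule that no remediation runs without explicit approval.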

If you only need one next step, start with Quick Start. It is the first path on this site and the fastest way to validate the product.

Source of truth

This docs site must stay aligned with the following repository sources:

  • product behavior: `README.md`
  • validation runbook: `TESTING.md`
  • deploy/rollback and startup contract: `docs/deployment.md`
  • active backlog and priorities: Paperclip issues for this project

When these sources change, update the matching docs pages in the same pull request.