Quick Start

Local Development

1. Install dependencies

uv sync

2. Configure model access

Required:

export MODEL_NAME=openai/gpt-oss-20b
export MODEL_API_KEY=your-api-key
export MODEL_PROVIDER=groq                    # or openai, anthropic, etc.
export MODEL_BASE_URL=https://api.groq.com/openai/v1
export WRITE_ALLOWED_NAMESPACES=ai-sre-demo
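
The service reads these variables at startup, so it can be useful to fail fast if one is missing before launching. A minimal shell sketch (the exported values are placeholders; substitute your real credentials):

```shell
# Sketch: abort early if a required variable is unset or empty.
# Placeholder values; replace with your own before running the service.
export MODEL_NAME=openai/gpt-oss-20b
export MODEL_API_KEY=your-api-key
export WRITE_ALLOWED_NAMESPACES=ai-sre-demo

for var in MODEL_NAME MODEL_API_KEY WRITE_ALLOWED_NAMESPACES; do
  eval "val=\${$var}"
  if [ -z "$val" ]; then
    echo "missing required variable: $var" >&2
    exit 1
  fi
done
echo "required variables set"
```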

Using a different provider

Configure MODEL_PROVIDER and MODEL_BASE_URL to use any OpenAI-compatible API:

# Groq (default)
export MODEL_PROVIDER=groq
export MODEL_BASE_URL=https://api.groq.com/openai/v1

# OpenAI
export MODEL_PROVIDER=openai
export MODEL_BASE_URL=https://api.openai.com/v1

# Anthropic
export MODEL_PROVIDER=anthropic
export MODEL_BASE_URL=https://api.anthropic.com/v1

# Custom/gateway (e.g., Portkey, local LLM, etc.)
export MODEL_PROVIDER=custom
export MODEL_BASE_URL=https://your-gateway.com/v1

See docs/portkey.md for Portkey-specific configuration.

3. Create demo scenario

kubectl create namespace ai-sre-demo --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f examples/kind-bad-deploy.yaml

4. Start the service

uv run main.py

5. Trigger investigation

Manual endpoint:

curl -X POST http://127.0.0.1:8080/investigate \
  -H 'Content-Type: application/json' \
  -d '{"kind":"deployment","namespace":"ai-sre-demo","name":"bad-deploy"}'

Alertmanager-style webhook:

curl -X POST http://127.0.0.1:8080/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  --data @examples/alertmanager-bad-deploy.json
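
The webhook accepts Alertmanager's standard payload format. The sketch below writes an illustrative payload; which labels the service actually maps to kind/namespace/name is an assumption here, so treat examples/alertmanager-bad-deploy.json as the authoritative shape:

```shell
# Illustrative Alertmanager v4 webhook payload. The alertname and the
# label-to-target mapping are assumptions; see examples/alertmanager-bad-deploy.json.
cat > /tmp/alert-payload.json <<'EOF'
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "KubeDeploymentReplicasMismatch",
        "namespace": "ai-sre-demo",
        "deployment": "bad-deploy"
      },
      "annotations": {
        "summary": "Deployment has pods that are not ready"
      }
    }
  ]
}
EOF
python3 -m json.tool /tmp/alert-payload.json > /dev/null && echo "payload is valid JSON"
```

Post it with `--data @/tmp/alert-payload.json` against the same /webhooks/alertmanager endpoint.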

Expected API response shape:

{
  "incident_id": "a1b2c3d4e5",
  "kind": "deployment",
  "namespace": "ai-sre-demo",
  "name": "bad-deploy",
  "answer": "Summary: image pull failure",
  "evidence": "",
  "source": "manual",
  "action_ids": ["abc12345"],
  "proposed_actions": [
    {
      "action_id": "abc12345",
      "action_type": "rollout-restart",
      "namespace": "ai-sre-demo",
      "name": "bad-deploy",
      "params": {},
      "expires_at": null,
      "approve_command": "/approve abc12345",
      "reject_command": "/reject abc12345"
    }
  ],
  "notification_status": "Telegram notification sent."
}
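
When scripting against the endpoint, the proposed action IDs and approve commands can be pulled straight out of this response. A minimal sketch using python3, operating on a trimmed copy of the example response above:

```shell
# Sketch: extract action IDs and approve commands from a saved response.
# The JSON is a trimmed copy of the example response in this section.
cat > /tmp/response.json <<'EOF'
{"incident_id":"a1b2c3d4e5","action_ids":["abc12345"],
 "proposed_actions":[{"action_id":"abc12345","approve_command":"/approve abc12345"}]}
EOF
python3 - <<'PY'
import json
with open("/tmp/response.json") as f:
    resp = json.load(f)
for action in resp["proposed_actions"]:
    print(action["action_id"], "->", action["approve_command"])
PY
# prints: abc12345 -> /approve abc12345
```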

Kubernetes Deployment

Deploy k8s-ai-sre on Kubernetes using the Helm chart published from this repository.

Prerequisites: Kubernetes cluster, kubectl, Helm.

1. Add the Helm repository

helm repo add k8s-ai-sre https://raw.githubusercontent.com/kmjayadeep/k8s-ai-sre/gh-pages/
helm repo update

2. Create the credentials secret

Create a Secret named k8s-ai-sre-env in the ai-sre-system namespace. Replace the values below with your own.

kubectl -n ai-sre-system create secret generic k8s-ai-sre-env \
  --from-literal=MODEL_API_KEY="your-api-key" \
  --from-literal=MODEL_NAME="openai/gpt-oss-20b" \
  --from-literal=WRITE_ALLOWED_NAMESPACES="ai-sre-demo" \
  --dry-run=client -o yaml | kubectl apply -f -

Required secret keys:

Key                        Required  Default
MODEL_API_KEY              Yes       -
MODEL_NAME                 Yes       -
WRITE_ALLOWED_NAMESPACES   Yes       -
MODEL_PROVIDER             No        groq
MODEL_BASE_URL             No        https://api.groq.com/openai/v1
TELEGRAM_BOT_TOKEN         No        -
TELEGRAM_CHAT_ID           No        -
OPERATOR_API_TOKEN         No        -
Add any optional keys with extra --from-literal= flags in the same command.

3. Install the chart

helm install k8s-ai-sre k8s-ai-sre/k8s-ai-sre \
  --namespace ai-sre-system \
  --create-namespace \
  --set secretMode=existing \
  --set existingSecret.name=k8s-ai-sre-env \
  --set writeAllowedNamespaces[0]=ai-sre-demo \
  --timeout 2m \
  --wait

To add more write-allowed namespaces, append --set writeAllowedNamespaces[1]=production.

To enable Telegram notifications, ensure your secret contains TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID before installing, then install with secretMode=existing as shown above.

4. Verify

kubectl -n ai-sre-system get pods,svc
kubectl -n ai-sre-system rollout status deploy/k8s-ai-sre
kubectl -n ai-sre-system get secret k8s-ai-sre-env   # confirm the secret is present
# The ClusterIP is only reachable from inside the cluster network; from a workstation, use kubectl port-forward instead
curl -s $(kubectl -n ai-sre-system get svc k8s-ai-sre -o jsonpath='{.spec.clusterIP}')/healthz

Expected: {"status":"ok","config":"ok"} — confirms the service started and credentials are valid.

Upgrading

helm repo update
helm upgrade --install k8s-ai-sre k8s-ai-sre/k8s-ai-sre \
  --namespace ai-sre-system \
  --set secretMode=existing \
  --set existingSecret.name=k8s-ai-sre-env \
  --set writeAllowedNamespaces[0]=ai-sre-demo \
  --timeout 2m \
  --wait

Uninstalling

helm uninstall k8s-ai-sre --namespace ai-sre-system
# Does not delete the credentials secret or the write-allowed namespaces.

Alternatives

Inline secrets (credentials in values file)

If you prefer to keep credentials in a local file instead of a pre-created Secret, use secretMode=inline with a values file:

helm install k8s-ai-sre k8s-ai-sre/k8s-ai-sre \
  --namespace ai-sre-system \
  --create-namespace \
  --values my-values.yaml

Example my-values.yaml:

secretMode: inline
secretData:
  MODEL_API_KEY: "your-api-key"
  MODEL_NAME: "openai/gpt-oss-20b"
  WRITE_ALLOWED_NAMESPACES: "ai-sre-demo"
  MODEL_PROVIDER: "groq"
  MODEL_BASE_URL: "https://api.groq.com/openai/v1"

See chart/examples/with-inline-secret.yaml for the full example including Telegram and operator token options.

External Secret (credentials managed outside the chart)

If your cluster uses an external secrets manager (Vault, ESO, Sealed Secrets, etc.), create the Secret yourself and reference it by name:

helm install k8s-ai-sre k8s-ai-sre/k8s-ai-sre \
  --namespace ai-sre-system \
  --create-namespace \
  --set secretMode=existing \
  --set existingSecret.name=your-secret-name \
  --set writeAllowedNamespaces[0]=ai-sre-demo \
  --timeout 2m \
  --wait

See chart/examples/with-existing-secret.yaml for reference.

For production, see docs/deployment.md for full deployment runbook including rollback procedures.

For local development, see docs/developer.md.

Telegram Approval Experience

The /incident <incident-id> command returns a concise, action-first operator summary:

Incident a1b2c3d4e5
Target: deployment ai-sre-demo/bad-deploy
Cluster: prod-cluster  (shown when K8S_CLUSTER_NAME is set)
Quick summary: image pull failure
Root cause: image pull failure
Action items:
1. Automated option: rollout-restart ai-sre-demo/bad-deploy
   approve: /approve abc12345
   reject: /reject abc12345

Use /status <incident-id> to confirm the notification state and action IDs, then /approve <action-id> or /reject <action-id> to accept or reject the proposal.

Note: Set K8S_CLUSTER_NAME (or CLUSTER_NAME, KUBE_CLUSTER_NAME, KUBERNETES_CLUSTER_NAME) to include the cluster name in Telegram output. Model <think> reasoning blocks are automatically stripped before sending.