Prometheus Alerting Rules for k8s-ai-sre
These rules target the k8s-ai-sre metrics endpoint (GET /metrics) and assume
a Prometheus scrape job configured to pull from the service.
Prometheus scrape config
scrape_configs:
  - job_name: k8s-ai-sre
    static_configs:
      - targets: ["k8s-ai-sre.ai-sre-system.svc.cluster.local:8080"]
    metrics_path: /metrics
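
Prometheus evaluates alert rules only from files listed under `rule_files`. The fragment below is a minimal sketch of that wiring. The file path and both interval values are assumptions chosen for illustration and are not taken from the k8s-ai-sre deployment.

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 30s      # assumed value; the rate() windows in the rules need several scrapes per window
  evaluation_interval: 30s  # assumed value; how often alert expressions are evaluated

rule_files:
  # Hypothetical path: wherever the alert rule group from this page is saved.
  - /etc/prometheus/rules/k8s-ai-sre-alerts.yml
```

Running `promtool check rules` against the rules file before reloading Prometheus catches YAML and PromQL syntax errors early.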
Alert rules
groups:
  - name: k8s-ai-sre.approval-loop
    rules:
      # ─── Investigation health ───────────────────────────────────────────
      - alert: K8sAISREInvestigationsFailing
        expr: |
          sum(rate(k8s_ai_sre_investigation_latency_seconds_count[5m])) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No k8s-ai-sre investigations completed in 5 minutes"
          description: |
            No investigation latency observations were recorded in the
            last 5 minutes. Either no traffic is arriving or the
            investigation loop is failing silently before completion.
      - alert: K8sAISREInvestigationLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(k8s_ai_sre_investigation_latency_seconds_bucket[5m])) by (le)
          ) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "k8s-ai-sre investigation P95 latency exceeds 60s"
          description: |
            The 95th-percentile investigation duration is above 60 seconds.
            Check model API latency and cluster connectivity.
      # ─── Action proposal health ──────────────────────────────────────────
      - alert: K8sAISREActionProposalRateZero
        expr: |
          sum(rate(k8s_ai_sre_action_proposals_total[10m])) == 0
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "No action proposals in 10 minutes"
          description: |
            No remediation actions have been proposed in the last 10 minutes.
            This is expected when the cluster is healthy.
      # ─── Approval loop health ────────────────────────────────────────────
      - alert: K8sAISREApprovalLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(k8s_ai_sre_approval_latency_seconds_bucket[5m])) by (le)
          ) > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "k8s-ai-sre approval P95 latency exceeds 5 minutes"
          description: |
            The 95th-percentile time between action proposal and operator
            decision (approve/reject) is above 5 minutes. Verify Telegram
            connectivity and that the HTTP operator token path is not blocked.
      - alert: K8sAISREApprovalDecisionRateZero
        expr: |
          sum(rate(k8s_ai_sre_action_execution_outcomes_total[30m])) == 0
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "No approval decisions in 30 minutes"
          description: |
            No terminal action decisions (approved/rejected/failed) have been
            recorded in 30 minutes. Informational; the cluster may be healthy.
      # ─── Execution outcomes ──────────────────────────────────────────────
      - alert: K8sAISREActionFailureRateHigh
        expr: |
          (
            sum(rate(k8s_ai_sre_action_execution_outcomes_total{status="failed"}[5m]))
            /
            sum(rate(k8s_ai_sre_action_execution_outcomes_total[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Over 50% of k8s-ai-sre action executions are failing"
          description: |
            More than half of action executions are returning failed status.
            Inspect k8s-ai-sre logs for RBAC failures, network errors,
            or target-object-not-found errors.
      - alert: K8sAISREActionRejectionRateHigh
        expr: |
          (
            sum(rate(k8s_ai_sre_action_execution_outcomes_total{status="rejected"}[5m]))
            /
            sum(rate(k8s_ai_sre_action_execution_outcomes_total[5m]))
          ) > 0.3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Over 30% of k8s-ai-sre action outcomes are operator rejections"
          description: |
            More than 30% of recorded action outcomes are operator rejections.
            Review whether proposals are too aggressive or target the wrong resources.
      # ─── Service availability ───────────────────────────────────────────
      - alert: K8sAISREServiceDown
        expr: |
          up{job="k8s-ai-sre"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "k8s-ai-sre is unreachable"
          description: |
            Prometheus cannot scrape k8s-ai-sre. The service may be down
            or the /metrics endpoint may be returning errors.
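
One way to check that a rule fires as intended is a `promtool` unit test. The sketch below exercises only K8sAISREServiceDown. It assumes the rule group above is saved as `k8s-ai-sre-alerts.yml` next to the test file (both filenames are hypothetical) and that the `instance` label equals the static target from the scrape config.

```yaml
# k8s-ai-sre-alerts.test.yml (hypothetical filename)
rule_files:
  - k8s-ai-sre-alerts.yml   # assumed location of the alert rule group

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulate a target that fails every scrape for four minutes.
    input_series:
      - series: 'up{job="k8s-ai-sre", instance="k8s-ai-sre.ai-sre-system.svc.cluster.local:8080"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      # At 3m the condition has held longer than the rule's `for: 1m`.
      - eval_time: 3m
        alertname: K8sAISREServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: k8s-ai-sre
              instance: k8s-ai-sre.ai-sre-system.svc.cluster.local:8080
            exp_annotations:
              summary: "k8s-ai-sre is unreachable"
              description: |
                Prometheus cannot scrape k8s-ai-sre. The service may be down
                or the /metrics endpoint may be returning errors.
```

Run it with `promtool test rules k8s-ai-sre-alerts.test.yml`. The expected annotations must match the rule text exactly, so any edit to the alert's summary or description needs the same edit here.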
Grafana dashboard
A minimal dashboard JSON is available at docs/grafana-dashboard.json.
Import it into your Grafana instance and select your Prometheus data source when prompted.
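
If you would rather provision the dashboard from files than import it through the UI, Grafana's provisioning mechanism is one option. The sketch below shows two separate provisioning files in one listing. The Prometheus URL, the directory paths, and the provider name are all assumptions; adjust them to your installation.

```yaml
# File 1 (assumed path): /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc.cluster.local:9090  # hypothetical address
    isDefault: true
---
# File 2 (assumed path): /etc/grafana/provisioning/dashboards/k8s-ai-sre.yaml
apiVersion: 1
providers:
  - name: k8s-ai-sre          # hypothetical provider name
    type: file
    disableDeletion: false
    options:
      # Assumed directory into which docs/grafana-dashboard.json has been copied.
      path: /var/lib/grafana/dashboards
```

Whether the dashboard JSON resolves its data source by name or by a template variable is not stated here, so check the JSON before relying on the `isDefault` setting.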
Dashboard panels
| Panel | Query | Description |
|---|---|---|
| Investigation throughput | sum(rate(k8s_ai_sre_investigation_latency_seconds_count[5m])) | Investigations per second (multiply by 60 for per-minute) |
| Investigation P95 latency | histogram_quantile(0.95, rate(...investigation_latency_seconds_bucket[5m])) | Seconds |
| Action proposal rate | sum(rate(k8s_ai_sre_action_proposals_total[5m])) | Proposals per second (multiply by 60 for per-minute) |
| Approval P95 latency | histogram_quantile(0.95, rate(...approval_latency_seconds_bucket[5m])) | Seconds |
| Execution outcome split | sum(rate(k8s_ai_sre_action_execution_outcomes_total[5m])) by (status) | Outcomes per second, split by status (approved, failed, rejected) |
| Failure rate | failed / total from above | Ratio from 0 to 1; display as % |
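
The failure-rate panel and the K8sAISREActionFailureRateHigh alert compute the same ratio. One possible way to keep them in step is to precompute it with recording rules, sketched below. The group name and the recorded series names are hypothetical and follow the common `level:metric:operations` naming convention. Only the source metric and its `status` label come from this page.

```yaml
groups:
  - name: k8s-ai-sre.recording   # hypothetical group name
    rules:
      # Per-status outcome rate, matching the "Execution outcome split" panel.
      - record: k8s_ai_sre:action_execution_outcomes:rate5m
        expr: |
          sum by (status) (rate(k8s_ai_sre_action_execution_outcomes_total[5m]))
      # Failed outcomes as a fraction of all outcomes (0 to 1).
      # Produces no usable value while no outcomes are being recorded.
      - record: k8s_ai_sre:action_failure_ratio:rate5m
        expr: |
          sum(rate(k8s_ai_sre_action_execution_outcomes_total{status="failed"}[5m]))
          /
          sum(rate(k8s_ai_sre_action_execution_outcomes_total[5m]))
```

A dashboard panel could then query `k8s_ai_sre:action_failure_ratio:rate5m` directly and apply a percent unit in Grafana.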