Operational Runbooks

Purpose: Production runbooks for operators, covering common incident scenarios and organized by category and severity.

Runbook Format

Each runbook follows a standard structure:

| Section | Content |
| --- | --- |
| Symptoms | Observable indicators (alerts, error messages, user reports) |
| Impact | What is affected and severity |
| Diagnosis | Commands to confirm root cause |
| Resolution | Step-by-step fix procedure |
| Verification | How to confirm the issue is resolved |
| Prevention | Actions to prevent recurrence |
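Because every runbook is expected to carry the same six sections, a draft can be checked mechanically before it is published. The sketch below is illustrative only: `missing_sections` is a hypothetical helper, and it assumes each section name appears on a line of its own, optionally prefixed with `#` heading markers. Adjust the pattern if your runbooks use a different heading convention.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the standard sections that a runbook file lacks.
# Assumption: each section name sits alone on a line, optionally after '#' marks.
missing_sections() {
  local file="$1" section
  if [ ! -r "$file" ]; then
    echo "cannot read runbook file: $file" >&2
    return 2
  fi
  for section in Symptoms Impact Diagnosis Resolution Verification Prevention; do
    # Case-insensitive match on a heading-only line; report the section if absent.
    grep -qiE "^#*[[:space:]]*${section}[[:space:]]*$" "$file" || echo "$section"
  done
}
```

Calling `missing_sections path/to/runbook.md` prints one missing section per line and prints nothing when the runbook is complete.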

Severity Levels

| Level | Response Time | Criteria | Examples |
| --- | --- | --- | --- |
| SEV-1 Critical | Immediate (< 15 min) | Cluster unreachable, data loss risk, full service outage | etcd quorum loss, API server down, all nodes NotReady |
| SEV-2 High | < 30 min | Degraded service, partial outage, single component failure | Control plane node down, FluxCD stuck, ingress certificate expired |
| SEV-3 Medium | < 2 hours | Performance degradation, non-critical component failure | High memory pressure, backup failure, drift detected |
| SEV-4 Low | Next business day | Cosmetic issues, warnings, planned maintenance | Log volume growing, non-critical alert noise |
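The response targets above can be encoded so that a script or chat-ops hook can flag a late acknowledgement. This is a minimal sketch, not an existing tool: `check_response` is a hypothetical name, and it assumes elapsed time is supplied in whole minutes since the alert fired. SEV-4 is treated as never breaching because its target is not expressed in minutes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare minutes elapsed since an alert fired against
# the response target for its severity. Prints "ok" or "breached".
check_response() {
  local sev="$1" elapsed_min="$2" target
  case "$sev" in
    SEV-1) target=15 ;;
    SEV-2) target=30 ;;
    SEV-3) target=120 ;;
    SEV-4) echo "ok"; return 0 ;;   # next business day: not tracked in minutes
    *) echo "unknown severity: $sev" >&2; return 2 ;;
  esac
  # Targets are strict upper bounds ("< 15 min"), so reaching the target counts as a breach.
  if [ "$elapsed_min" -lt "$target" ]; then
    echo "ok"
  else
    echo "breached"
  fi
}
```

For example, `check_response SEV-2 12` reports on an alert acknowledged 12 minutes after it fired.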

Runbook Categories

Cluster Issues

| Runbook | Severity | Trigger |
| --- | --- | --- |
| etcd Quorum Loss | SEV-1 | etcdMembersDown alert, API server 5xx |
| Control Plane Node Failure | SEV-2 | KubeNodeNotReady on CP node |
| Worker Node Failure | SEV-3 | KubeNodeNotReady on worker |
| API Server Unresponsive | SEV-1 | KubeAPIDown alert |
| Kubelet Crash Loop | SEV-2 | Node flapping Ready/NotReady |

Networking

| Runbook | Severity | Trigger |
| --- | --- | --- |
| Pod-to-Pod Connectivity Loss | SEV-1 | Calico/CNI failure |
| DNS Resolution Failure | SEV-2 | CoreDNS pods down |
| Load Balancer Unhealthy | SEV-2 | External traffic blocked |
| Ingress Certificate Expired | SEV-2 | CertManagerCertNotReady alert |

Storage

| Runbook | Severity | Trigger |
| --- | --- | --- |
| PersistentVolume Stuck | SEV-3 | PVC in Pending state |
| etcd Disk Full | SEV-1 | etcdBackendQuotaLowSpace alert |
| Backup Failure | SEV-3 | VeleroBackupFailure alert |
| Volume Snapshot Failure | SEV-3 | CSI snapshot errors |

GitOps

| Runbook | Severity | Trigger |
| --- | --- | --- |
| FluxCD Reconciliation Failure | SEV-2 | FluxReconciliationFailure alert |
| SOPS Decryption Failure | SEV-2 | Kustomize controller errors |
| Git Source Unreachable | SEV-2 | GitRepository not ready |
| Helm Release Failed | SEV-3 | HelmRelease stuck |

Security

| Runbook | Severity | Trigger |
| --- | --- | --- |
| Certificate Expired | SEV-1 | TLS errors, API server cert invalid |
| Compromised Credentials | SEV-1 | Suspicious activity, key leak |
| SOPS Key Compromise | SEV-1 | Key material exposed |
| Unauthorized Access Detected | SEV-2 | Audit log anomalies |
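Several triggers in the category tables are named alerts, so an on-call operator can go from alert name to runbook with a simple lookup. The sketch below covers only the alert names that appear in the tables above; `runbook_for_alert` is a hypothetical helper, not part of any tooling described here. Triggers that are symptoms rather than named alerts (for example "HelmRelease stuck") are deliberately left out.

```bash
#!/usr/bin/env bash
# Hypothetical lookup: map an alert name from the category tables to its runbook.
runbook_for_alert() {
  case "$1" in
    etcdMembersDown)           echo "etcd Quorum Loss (SEV-1)" ;;
    KubeAPIDown)               echo "API Server Unresponsive (SEV-1)" ;;
    # The same alert maps to two runbooks; the node's role decides which applies.
    KubeNodeNotReady)          echo "Control Plane Node Failure (SEV-2) or Worker Node Failure (SEV-3)" ;;
    CertManagerCertNotReady)   echo "Ingress Certificate Expired (SEV-2)" ;;
    etcdBackendQuotaLowSpace)  echo "etcd Disk Full (SEV-1)" ;;
    VeleroBackupFailure)       echo "Backup Failure (SEV-3)" ;;
    FluxReconciliationFailure) echo "FluxCD Reconciliation Failure (SEV-2)" ;;
    *) echo "no runbook mapped for alert: $1" >&2; return 1 ;;
  esac
}
```

A `case` statement is used instead of an associative array so the function also works on older bash versions.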

On-Call Escalation

```text
┌──────────────────────────────────────────────┐
│  Alert fires (Prometheus → Alertmanager)     │
└─────────────────────┬────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────┐
│  L1: On-call operator (< 15 min response)    │
│  • Acknowledge alert                         │
│  • Execute runbook                           │
│  • Escalate if unresolved in 30 min          │
└─────────────────────┬────────────────────────┘
                      │ (unresolved)
                      ▼
┌──────────────────────────────────────────────┐
│  L2: Platform engineer (< 30 min response)   │
│  • Deep diagnosis                            │
│  • Infrastructure-level fixes                │
│  • Escalate if unresolved in 1 hour          │
└─────────────────────┬────────────────────────┘
                      │ (unresolved)
                      ▼
┌──────────────────────────────────────────────┐
│  L3: Architecture / vendor support           │
│  • Root cause analysis                       │
│  • Vendor engagement if needed               │
└──────────────────────────────────────────────┘
```

Escalation Criteria

| Escalation | When | Typical reason |
| --- | --- | --- |
| L1 → L2 | 30 min unresolved, or SEV-1 | Runbook steps exhausted, infrastructure issue suspected |
| L2 → L3 | 1 hour unresolved, or data loss | Requires architectural change or vendor support |
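The escalation rules above reduce to two checks per tier. The following sketch encodes them as a shell predicate. `should_escalate` is a hypothetical name, and one detail is an assumption because the text does not specify it: the minutes are counted from the moment the current tier took over the incident.

```bash
#!/usr/bin/env bash
# Hypothetical predicate: should_escalate TIER SEVERITY MINUTES_UNRESOLVED [DATA_LOSS]
# Exit status 0 means "escalate to the next tier", 1 means "stay".
# Assumption: MINUTES_UNRESOLVED is measured since the current tier took over.
should_escalate() {
  local tier="$1" sev="$2" minutes="$3" data_loss="${4:-no}"
  case "$tier" in
    # L1 -> L2: any SEV-1, or 30 minutes without resolution.
    L1) [ "$sev" = "SEV-1" ] || [ "$minutes" -ge 30 ] ;;
    # L2 -> L3: data loss, or 1 hour without resolution.
    L2) [ "$data_loss" = "yes" ] || [ "$minutes" -ge 60 ] ;;
    # L3 is the last tier; unknown tiers never escalate.
    *) return 1 ;;
  esac
}
```

For instance, `should_escalate L1 SEV-2 35 && echo "page L2"` prints the message because the 30-minute limit has passed.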

Quick Reference: Common Commands

```bash
# Cluster status
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
flux get kustomizations -A

# Logs
kubectl logs -n <namespace> deploy/<name> --since=10m
flux logs --level=error --since=10m
journalctl -u kubelet --since="10 minutes ago"   # run on the node itself

# Restart
kubectl rollout restart deployment/<name> -n <namespace>
flux reconcile kustomization <name> --with-source

# Drain/cordon
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# Force reconciliation
flux reconcile source git flux-system
flux reconcile kustomization flux-system --with-source
```
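During triage it is often quicker to list only the nodes that need attention than to scan the full `kubectl get nodes` output. The filter below is a small sketch: `not_ready_nodes` is a hypothetical helper that reads `kubectl get nodes --no-headers` output on stdin, so it makes no cluster calls itself. It treats any status other than plain `Ready` as noteworthy, which means cordoned nodes (`Ready,SchedulingDisabled`) are listed too.

```bash
#!/usr/bin/env bash
# Hypothetical filter: print the names of nodes whose STATUS column is not
# exactly "Ready". Expects `kubectl get nodes --no-headers` output on stdin,
# where column 1 is the node name and column 2 is the status.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage on a live cluster (not executed here):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Keeping the filter separate from the `kubectl` call makes it easy to try against saved output from an earlier incident.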

Incident Response Template

Use this template when opening an incident ticket:

```markdown
## Incident: [Title]

**Severity:** SEV-[1-4]
**Detected:** [timestamp]
**Resolved:** [timestamp]
**Duration:** [minutes]

### Summary
[One sentence describing the incident]

### Impact
[What was affected, how many users/services impacted]

### Timeline
- HH:MM - Alert fired
- HH:MM - Operator acknowledged
- HH:MM - Root cause identified
- HH:MM - Fix applied
- HH:MM - Service restored

### Root Cause
[Technical explanation]

### Resolution
[Steps taken to resolve]

### Action Items
- [ ] Prevention measure 1
- [ ] Prevention measure 2
- [ ] Runbook update
```
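The Duration field in the template can be derived from the Detected and Resolved timestamps instead of being worked out by hand. The sketch below is one way to do that. `incident_duration_minutes` is a hypothetical helper, and it assumes timestamps are written as `YYYY-MM-DD HH:MM:SS` in UTC and that the system `date` command supports `-d` (GNU coreutils does; the BSD/macOS `date` does not).

```bash
#!/usr/bin/env bash
# Hypothetical helper: whole minutes between two UTC timestamps.
# Assumptions: input format "YYYY-MM-DD HH:MM:SS", and a `date` that supports -d.
incident_duration_minutes() {
  local detected resolved
  detected="$(date -u -d "$1" +%s)" || return 1
  resolved="$(date -u -d "$2" +%s)" || return 1
  if [ "$resolved" -lt "$detected" ]; then
    echo "resolved time is before detected time" >&2
    return 1
  fi
  # Integer division: partial minutes are truncated.
  echo $(( (resolved - detected) / 60 ))
}
```

For example, `incident_duration_minutes "2024-05-01 10:00:00" "2024-05-01 10:47:00"` yields the value for the Duration line.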

Runbook Details

etcd Quorum Loss

Runbook content planned — SEV-1 response procedure for etcd quorum loss.

Control Plane Node Failure

Runbook content planned — SEV-2 response procedure for control plane node failure.

Worker Node Failure

Runbook content planned — SEV-3 response procedure for worker node failure.

API Server Unresponsive

Runbook content planned — SEV-1 response procedure for unresponsive API server.

Kubelet Crash Loop

Runbook content planned — SEV-2 response procedure for kubelet crash looping.

Pod-to-Pod Connectivity Loss

Runbook content planned — SEV-1 response procedure for pod networking failure.

DNS Resolution Failure

Runbook content planned — SEV-2 response procedure for DNS resolution failure.

Load Balancer Unhealthy

Runbook content planned — SEV-2 response procedure for unhealthy load balancer.

Ingress Certificate Expired

Runbook content planned — SEV-2 response procedure for expired ingress certificate.

PersistentVolume Stuck

Runbook content planned — SEV-3 response procedure for stuck PersistentVolumes.

etcd Disk Full

Runbook content planned — SEV-1 response procedure for etcd disk full.

Backup Failure

Runbook content planned — SEV-3 response procedure for backup failure.

Volume Snapshot Failure

Runbook content planned — SEV-3 response procedure for volume snapshot failure.

FluxCD Reconciliation Failure

Runbook content planned — SEV-2 response procedure for FluxCD reconciliation failure.

SOPS Decryption Failure

Runbook content planned — SEV-2 response procedure for SOPS decryption failure.

Git Source Unreachable

Runbook content planned — SEV-2 response procedure for unreachable git source.

Helm Release Failed

Runbook content planned — SEV-3 response procedure for failed Helm release.

Certificate Expired

Runbook content planned — SEV-1 response procedure for expired certificates.

Compromised Credentials

Runbook content planned — SEV-1 response procedure for compromised credentials.

SOPS Key Compromise

Runbook content planned — SEV-1 response procedure for SOPS key compromise.

Unauthorized Access Detected

Runbook content planned — SEV-2 response procedure for unauthorized access.