
SRE / Operator Learning Path

Purpose: A guided reading order for SREs and operators, focused on day-2 operations, monitoring, upgrades, and incident response.

Reading Order

| # | Phase | Topic | Link | Time |
|---|-------|-------|------|------|
| 1 | Foundations | Architecture overview | Architecture | 10 min |
| 2 | Foundations | Platform services (20+ services, versions) | Service Catalog | 10 min |
| 3 | Foundations | CLI commands reference | CLI Commands | 10 min |
| 4 | Observability | Monitoring stack (Prometheus + Grafana + Alertmanager) | Stack Overview | 10 min |
| 5 | Observability | Loki for logs | Loki | 10 min |
| 6 | Observability | Tempo for traces | Tempo | 10 min |
| 7 | Observability | Dashboards & alerts | Dashboards | 15 min |
| 8 | Operations | Day 2 overview | Day 2 | 10 min |
| 9 | Operations | Health checks (opencenter cluster doctor) | Health | 10 min |
| 10 | Operations | Drift detection (opencenter cluster drift detect) | Drift | 10 min |
| 11 | Upgrades | Kubernetes upgrades | K8s Upgrades | 15 min |
| 12 | Upgrades | Service upgrades (gitops-base tag pinning) | Service Upgrades | 10 min |
| 13 | Reliability | Backup & restore with Velero | Backup | 15 min |
| 14 | Reliability | Disaster recovery | DR | 10 min |
| 15 | Scaling | Add worker pools (--server-pool flag) | Workers | 10 min |
| 16 | Scaling | Node replacement | Replace | 10 min |
| 17 | Secrets | Key lifecycle (check, rotate, sync, validate) | Key Rotation | 10 min |
| 18 | Troubleshooting | FluxCD reconciliation issues | FluxCD | 10 min |
| 19 | Troubleshooting | Networking issues | Network | 10 min |
| 20 | Troubleshooting | CLI errors | CLI Errors | 10 min |

Daily Operations CLI Commands

# Health & status
opencenter cluster status my-cluster
opencenter cluster doctor my-cluster
opencenter cluster drift detect my-cluster

# Secrets lifecycle
opencenter secrets keys check # Shows days until expiration
opencenter secrets validate my-cluster # Detect drift
opencenter secrets sync my-cluster # Re-encrypt after changes

# Service management
opencenter cluster service status # All service states
opencenter cluster service enable <svc> # Enable a service
opencenter cluster service disable <svc> # Disable a service

# Backup
opencenter cluster backup create my-cluster
opencenter cluster backup restore <id>

# FluxCD
flux get kustomizations # Check reconciliation
flux reconcile source git flux-system # Force source refresh
flux reconcile kustomization <name> # Force kustomization apply
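
The three health commands listed under "Health & status" can be chained into one routine check. The following is a minimal sketch rather than an official script: the function name run_daily_checks is hypothetical, and it assumes that each opencenter command exits with a non-zero status when it finds a problem, which this page does not confirm.

```shell
# Sketch: run the health commands from this page in sequence and report
# which ones failed.
# ASSUMPTION: each opencenter command exits non-zero when it detects a problem.
# The body is wrapped in ( ... ) so its variables stay local to the function.
run_daily_checks() (
  cluster="$1"
  failed=0

  # Each item is a subcommand copied from the "Health & status" list.
  for check in "cluster status" "cluster doctor" "cluster drift detect"; do
    # $check is deliberately unquoted so it splits into separate arguments.
    if opencenter $check "$cluster"; then
      echo "OK:   opencenter $check $cluster"
    else
      echo "FAIL: opencenter $check $cluster"
      failed=$((failed + 1))
    fi
  done

  echo "$failed check(s) failed"
  # Return success only when every check passed.
  [ "$failed" -eq 0 ]
)
```

Calling `run_daily_checks my-cluster` runs all three checks even if an early one fails, so a single run shows the full picture before you move on to the secrets or FluxCD commands.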

Observability Stack (from openCenter-gitops-base)

| Service | Version | Purpose |
|---------|---------|---------|
| kube-prometheus-stack | 77.6.0 | Prometheus, Grafana, Alertmanager |
| Loki | 6.45.2 | Log aggregation |
| Mimir | 6.0.3 | Long-term metrics storage |
| Tempo | 1.55.0 | Distributed tracing |
| OpenTelemetry | 0.11.1 | Telemetry collection pipeline |
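
To compare a running cluster against the versions listed above, one option is to read the versions requested by the Flux HelmRelease objects. This sketch rests on several assumptions that this page does not confirm: that the services are deployed as Flux HelmReleases, that the listed versions are Helm chart versions, and that each release sets spec.chart.spec.version (releases that use chartRef instead will show no version). The function names are hypothetical.

```shell
# Print namespace, release name, chart and requested chart version for every
# Flux HelmRelease in the cluster.
# ASSUMPTION: services are deployed as HelmReleases with spec.chart.spec set.
list_chart_versions() {
  kubectl get helmreleases.helm.toolkit.fluxcd.io --all-namespaces \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CHART:.spec.chart.spec.chart,VERSION:.spec.chart.spec.version'
}

# Succeed only if some row lists the given chart at the expected version.
# Column 3 is the chart name and column 4 the version in the output above.
chart_is_pinned() {
  list_chart_versions | awk -v chart="$1" -v want="$2" \
    '$3 == chart && $4 == want { found = 1 } END { exit !found }'
}
```

For example, `chart_is_pinned kube-prometheus-stack 77.6.0` returns success only when a HelmRelease requests exactly that chart version, so it can serve as a quick check after a service upgrade.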

Runbook Index

After completing this path, familiarize yourself with the Runbooks for standardized incident response procedures.