Disaster Recovery
Purpose: Documents disaster recovery procedures for operators of openCenter clusters, including RTO/RPO targets, recovery tiers, and step-by-step restoration.
RTO/RPO Targets
| Tier | Scenario | RPO | RTO | Strategy |
|---|---|---|---|---|
| 1 | Single worker node failure | 0 | 5 min | Auto-healing via node replacement |
| 2 | Single control plane node failure | 0 | 15 min | etcd quorum maintained, node replacement |
| 3 | Control plane quorum loss | Last etcd snapshot | 30–60 min | etcd restore from backup |
| 4 | Full cluster loss | Last Velero backup | 2–4 hours | Full re-provision + restore |
| 5 | Site disaster (region loss) | Last offsite backup | 4–8 hours | New region provision + restore |
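The upper RTO bound of each tier can be encoded so that DR test tooling can compare a measured recovery time against it. The following is a minimal sketch; the function names `rto_limit_minutes` and `rto_met` are hypothetical helpers, not part of the openCenter CLI, and the values are copied from the table above.

```shell
# Upper RTO bound per tier, in minutes, taken from the RTO/RPO table.
rto_limit_minutes() {
  case "$1" in
    1) echo 5 ;;
    2) echo 15 ;;
    3) echo 60 ;;
    4) echo 240 ;;
    5) echo 480 ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Succeeds when a measured recovery time (minutes) is within the tier's RTO.
# Usage: rto_met <tier> <measured-minutes>
rto_met() {
  limit=$(rto_limit_minutes "$1") || return 1
  [ "$2" -le "$limit" ]
}
```

For example, `rto_met 4 200` succeeds because 200 minutes is within the 4-hour bound for Tier 4.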
Prerequisites
- etcd snapshots available in S3 (configured via etcd-backup service)
- Velero backups in S3-compatible storage
- GitOps repository accessible (contains cluster configuration and manifests)
- openCenter CLI with cluster configuration
- Access to cloud provider APIs
Recovery Tier 1: Single Worker Node
Worker nodes are stateless. Kubernetes reschedules workloads automatically.
Automated Recovery
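If a node stays NotReady for 5 minutes (the default toleration for the not-ready and unreachable node taints), Kubernetes marks its pods for deletion and their controllers create replacements on healthy nodes. No operator action is required for stateless workloads.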
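To see which nodes are currently affected, the node list can be filtered by status. This is a small sketch; `not_ready_nodes` is a hypothetical helper that assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second).

```shell
# Reads `kubectl get nodes --no-headers` output on stdin and prints the names
# of nodes whose STATUS column does not start with "Ready".
# A cordoned but healthy node ("Ready,SchedulingDisabled") is not reported.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```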
Manual Node Replacement
# 1. Cordon and drain the failed node
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s
# 2. Remove the node from the cluster
kubectl delete node <node-name>
# 3. Provision a replacement (via openCenter)
opencenter cluster deploy <cluster-name> --from-step opentofu-apply
# 4. Verify new node joins
kubectl get nodes -w
Recovery Tier 2: Single Control Plane Node
With three or more control plane nodes, the loss of a single node still leaves etcd with quorum.
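The quorum arithmetic behind this statement is simple: an etcd cluster of n members needs floor(n/2) + 1 healthy members. A short sketch, with hypothetical function names:

```shell
# Members that must be healthy for etcd to accept writes: floor(n/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while quorum is still held.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

With 3 members the quorum is 2, so one failure is tolerated. A fourth member raises the quorum to 3 without tolerating more failures, which is why odd member counts are the usual choice.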
Recovery Steps
# 1. Verify etcd quorum is maintained
ETCDCTL_API=3 etcdctl member list \
--endpoints=https://<healthy-cp-node>:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem
# 2. Remove failed etcd member
ETCDCTL_API=3 etcdctl member remove <member-id> \
--endpoints=https://<healthy-cp-node>:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem
# 3. Remove failed node from Kubernetes
kubectl delete node <failed-cp-node>
# 4. Provision replacement control plane node
# Update the inventory, then run the Kubespray cluster playbook
# (scale.yml only adds worker nodes; control plane nodes are added with cluster.yml)
ansible-playbook -i inventory/hosts.yaml cluster.yml --become
# 5. Verify cluster health
kubectl get nodes
kubectl get pods -n kube-system
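Step 2 needs the ID of the failed member, which has to be read from the `member list` output of step 1. A minimal sketch of that lookup, assuming the default comma-separated output format of `etcdctl member list` (ID, status, name, peer URLs, client URLs, learner flag); `etcd_member_id` is a hypothetical helper name.

```shell
# Reads `etcdctl member list` output on stdin and prints the ID of the member
# whose name (third field) equals $1.
etcd_member_id() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage (run on a healthy control plane node, with the same TLS flags as step 1):
#   etcdctl member list ... | etcd_member_id <failed-cp-node>
```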
Recovery Tier 3: etcd Restore
When etcd quorum is lost (a majority of control plane nodes are down), restore etcd from a snapshot.
Restore Procedure
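# 1. Stop kube-apiserver on all remaining control plane nodes
# (on kubeadm-based installs, where kube-apiserver runs as a static pod rather than a systemd unit,
# move /etc/kubernetes/manifests/kube-apiserver.yaml out of the manifests directory instead)
ssh <cp-node> "sudo systemctl stop kube-apiserver"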
# 2. Download latest etcd snapshot from S3
LATEST=$(aws s3 ls s3://<cluster>-etcd-backups/ | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://<cluster>-etcd-backups/${LATEST}" /tmp/etcd-snapshot.db
# 3. Restore etcd on the first control plane node
sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restore \
--name=<node-name> \
--initial-cluster=<node-name>=https://<node-ip>:2380 \
--initial-advertise-peer-urls=https://<node-ip>:2380
# 4. Replace etcd data directory
sudo systemctl stop etcd
sudo mv /var/lib/etcd /var/lib/etcd.bak
sudo mv /var/lib/etcd-restore /var/lib/etcd
sudo chown -R etcd:etcd /var/lib/etcd
sudo systemctl start etcd
# 5. Start kube-apiserver (on every node where it was stopped in step 1)
sudo systemctl start kube-apiserver
# 6. Verify cluster state
kubectl get nodes
kubectl get pods -A
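For a multi-node etcd restore, copy the same snapshot file to every control plane node and repeat steps 3 and 4 on each one, using that node's own --name and --initial-advertise-peer-urls and an --initial-cluster value that lists all members.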
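The --initial-cluster value for a multi-node restore is a comma-separated list of name=peer-URL pairs and must be identical on every node. A small sketch that builds it from name=ip pairs, assuming peer port 2380 and HTTPS as in step 3; `initial_cluster` is a hypothetical helper and the node names and addresses in the usage comment are placeholders.

```shell
# Builds the --initial-cluster value from arguments of the form <name>=<ip>.
initial_cluster() {
  result=""
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first "="
    ip=${pair#*=}      # text after the first "="
    result="${result:+$result,}${name}=https://${ip}:2380"
  done
  echo "$result"
}

# Usage:
#   initial_cluster cp-1=10.0.0.1 cp-2=10.0.0.2 cp-3=10.0.0.3
```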
Verify etcd Health Post-Restore
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem
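For scripted checks, the health output can be reduced to a single exit status. This sketch assumes the default plain-text output of `etcdctl endpoint health`, where each endpoint line contains either "is healthy" or "is unhealthy"; `etcd_all_healthy` is a hypothetical helper name.

```shell
# Reads `etcdctl endpoint health` output on stdin.
# Succeeds only if there is at least one line and every non-empty line
# reports "is healthy".
etcd_all_healthy() {
  awk 'NF { total++ } / is healthy/ { ok++ } END { exit !(total > 0 && ok == total) }'
}

# Usage (some etcdctl versions print health lines to stderr, hence 2>&1):
#   etcdctl endpoint health ... 2>&1 | etcd_all_healthy && echo "etcd OK"
```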
Recovery Tier 4: Full Cluster Restore
Complete cluster loss requires re-provisioning infrastructure and restoring application state.
Step 1: Re-provision Infrastructure
# Re-deploy cluster from existing configuration
opencenter cluster deploy <cluster-name> --restart
This runs through the full bootstrap: OpenTofu infrastructure provisioning, Kubespray K8s installation, and FluxCD bootstrap.
Step 2: GitOps Re-bootstrap
FluxCD reconciles the cluster state from Git automatically after bootstrap. Verify:
# Check FluxCD is reconciling
flux get kustomizations -A
# Wait for all services to deploy
watch kubectl get pods -A
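In an unattended recovery, the interactive `watch` can be replaced by a polling loop with a timeout. A minimal sketch; `wait_for` is a hypothetical helper, and the Flux example in the usage comment assumes the default `flux-system` Kustomization created by the bootstrap.

```shell
# Re-runs a command until it succeeds or roughly <timeout> seconds have passed.
# Usage: wait_for <timeout-seconds> <interval-seconds> <command> [args...]
wait_for() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    [ "$elapsed" -ge "$timeout_s" ] && return 1
    sleep "$interval_s"
    elapsed=$(( elapsed + interval_s ))
  done
  return 0
}

# Usage (requires cluster access):
#   wait_for 1800 30 kubectl wait --for=condition=Ready nodes --all --timeout=10s
#   wait_for 1800 30 kubectl -n flux-system wait kustomization/flux-system \
#     --for=condition=ready --timeout=10s
```

The elapsed time counts only the sleep intervals, so the real wall-clock limit is somewhat longer when the polled command itself is slow.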
Step 3: Restore Application Data (Velero)
# Verify Velero can access backups
velero backup-location get
velero backup get
# Restore from the latest backup
velero restore create full-restore \
--from-backup <latest-backup-name> \
--include-namespaces="*" \
--exclude-namespaces=velero,flux-system,kube-system \
--wait
# Check restore status
velero restore describe full-restore --details
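A restore that finishes is not necessarily a restore that succeeded, so the reported phase is worth checking explicitly. This sketch extracts it from the text output of `velero restore describe`, assuming the usual `Phase:` line near the top of that output; `restore_phase` is a hypothetical helper name.

```shell
# Reads `velero restore describe` output on stdin and prints the first
# Phase value (for example Completed, PartiallyFailed, or Failed).
restore_phase() {
  awk '$1 == "Phase:" { print $2; exit }'
}

# Usage (requires Velero access):
#   [ "$(velero restore describe full-restore | restore_phase)" = "Completed" ] \
#     || echo "restore needs attention" >&2
```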
Step 4: Restore Persistent Volumes
# Restore PVs separately if needed
velero restore create pv-restore \
--from-backup <latest-backup-name> \
--include-resources=persistentvolumes,persistentvolumeclaims \
--wait
Step 5: Verify
# All nodes ready
kubectl get nodes
# List pods that are not Running or Succeeded (expect no results)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# List Flux kustomizations that are not ready (expect no results)
flux get kustomizations -A --status-selector ready=false
# Application health
kubectl get deployments -A
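The deployment listing can be narrowed to the entries that still need attention. A sketch that assumes the default columns of `kubectl get deployments -A --no-headers` (namespace, name, ready count as `ready/desired`, and so on); `unready_deployments` is a hypothetical helper name.

```shell
# Reads `kubectl get deployments -A --no-headers` output on stdin and prints
# "<namespace>/<name> <ready>/<desired>" for every deployment whose ready
# replica count differs from the desired count.
unready_deployments() {
  awk '{ split($3, r, "/"); if (r[1] != r[2]) print $1 "/" $2 " " $3 }'
}

# Usage (requires cluster access):
#   kubectl get deployments -A --no-headers | unready_deployments
```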
Recovery Tier 5: Site Disaster (New Region)
Step 1: Prepare New Region Configuration
# Clone the cluster config for a new region
opencenter cluster init <new-cluster-name> \
--provider openstack \
--organization <org>
# Update configuration with new region parameters
opencenter cluster edit <new-cluster-name>
Update: region, network IDs, image IDs, floating IP pools, and DNS settings.
Step 2: Deploy to New Region
opencenter cluster generate <new-cluster-name>
opencenter cluster deploy <new-cluster-name>
Step 3: Restore Data
Follow the Velero restore procedure from Tier 4 (Steps 3–5). Velero backups stored in a different region or bucket are accessible from the new cluster once its Velero backup storage location points at that bucket.
Step 4: Update DNS
Point DNS records to the new cluster's ingress/load balancer IPs.
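After the change, it helps to confirm that the records actually resolve to the new address. A minimal sketch; `dns_points_to` is a hypothetical helper, and the hostname and IP in the usage comment are placeholders.

```shell
# Reads resolved addresses on stdin (one per line, e.g. from `dig +short`)
# and succeeds if the expected address $1 is among them.
dns_points_to() {
  grep -qxF -- "$1"
}

# Usage:
#   dig +short app.example.com | dns_points_to 203.0.113.10 && echo "DNS updated"
```

Resolvers may keep returning the old address until the record's TTL expires, so a failed check shortly after the update is not conclusive.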
DR Testing
Test disaster recovery procedures quarterly, using the following procedure:
- Create a test cluster in a non-production environment
- Simulate failure (delete nodes, corrupt etcd, destroy infrastructure)
- Execute recovery following this runbook
- Measure RTO (time from failure detection to service restoration)
- Verify RPO (compare restored data with expected state)
- Document findings and update procedures
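Measuring RTO comes down to recording two timestamps and reporting the difference. A small sketch that works on epoch seconds (as produced by `date +%s`) and rounds up to whole minutes so that a test never under-reports; `elapsed_minutes` is a hypothetical helper name.

```shell
# Prints the time between two epoch timestamps in whole minutes, rounded up.
# Usage: elapsed_minutes <start-epoch> <end-epoch>
elapsed_minutes() {
  echo $(( ($2 - $1 + 59) / 60 ))
}

# Usage:
#   start=$(date +%s)     # at failure detection
#   ...run the recovery...
#   end=$(date +%s)       # at service restoration
#   echo "Measured RTO: $(elapsed_minutes "$start" "$end") min"
```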
Automated DR Test Script
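#!/bin/bash
# Run in a test environment only
set -euo pipefail
CLUSTER="dr-test-cluster"
# Select the newest completed backup explicitly; the order of .items is not guaranteed
BACKUP=$(velero backup get -o json | jq -r '[.items[] | select(.status.phase == "Completed")] | sort_by(.metadata.creationTimestamp) | last | .metadata.name')
echo "Testing restore on $CLUSTER from backup: $BACKUP"
echo "Start time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
velero restore create "dr-test-$(date +%s)" \
--from-backup "$BACKUP" \
--include-namespaces=production \
--wait
echo "Restore complete: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Verify critical services: list pods that are not Running (expect none)
kubectl get pods -n production --field-selector=status.phase!=Running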
openCenter Backup Commands
# Create an on-demand cluster backup (config, keys, state)
opencenter cluster backup create <cluster-name>
# List available backups
opencenter cluster backup list <cluster-name>
# Restore from a backup
opencenter cluster backup restore <backup-id> --passphrase <passphrase>
# Schedule periodic backups
opencenter cluster backup schedule <cluster-name> --interval 6h --retention 720h
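The schedule values translate directly into storage and recovery expectations: a 6h interval means up to about 6 hours of configuration changes can be lost, and a 720h (30-day) retention keeps roughly 120 backups at steady state. A sketch of that arithmetic, assuming both values are given in whole hours with an `h` suffix, every scheduled backup succeeds, and retention is purely time-based; `retained_backups` is a hypothetical helper, not an openCenter command.

```shell
# Approximate number of backups kept at steady state.
# Usage: retained_backups <interval, e.g. 6h> <retention, e.g. 720h>
retained_backups() {
  interval_h=${1%h}     # strip the trailing "h"
  retention_h=${2%h}
  echo $(( retention_h / interval_h ))
}
```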
Related Docs
- Backup & Restore — Configure etcd and Velero backup schedules
- Certificate Rotation — Rotate expired certificates during recovery
- Runbooks — Incident response procedures