Observability at Scale
Purpose: Guidance for platform engineers on scaling the observability stack for large clusters without drowning in cardinality or storage costs.
Prometheus at Scale
Cardinality Management
| Cluster Size | Expected Active Series | Recommended Action |
|---|---|---|
| Small (≤10 nodes) | < 200K | Default config sufficient |
| Medium (11–50 nodes) | 200K–1M | Add recording rules, drop unused metrics |
| Large (51–200 nodes) | 1M–5M | Federation, metric relabeling, shorter retention |
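One way to notice when a cluster crosses from one tier to the next is to alert on Prometheus's own `prometheus_tsdb_head_series` metric. The rule file below is a minimal sketch: it assumes Prometheus scrapes itself, and the file name, alert names, `for` durations, and severity labels are illustrative choices rather than part of this guide.

```yaml
# cardinality-tiers.rules.yml (hypothetical file name)
groups:
  - name: cardinality-tiers
    rules:
      # More than 200K active series: the "Medium" row of the table.
      - alert: ActiveSeriesAbove200K
        expr: prometheus_tsdb_head_series > 200000
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Active series above 200K: add recording rules, drop unused metrics"
      # More than 1M active series: the "Large" row of the table.
      - alert: ActiveSeriesAbove1M
        expr: prometheus_tsdb_head_series > 1000000
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Active series above 1M: consider federation, relabeling, shorter retention"
```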
Key Tuning
- Recording rules: Pre-aggregate high-cardinality metrics (container_* → namespace-level summaries)
- Metric relabeling: Drop `container_network_*` per-pod metrics; keep only namespace aggregates
- Retention: 15d local, remote-write to long-term storage for historical queries
- Sharding: Deploy Prometheus per node-pool above 100 nodes; federate into global view
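The tuning items above might translate into configuration roughly as follows. This is a sketch under assumptions: the file names, the `shard` label, the remote-write URL, the target hostnames, and the choice of CPU and memory as the aggregated metrics are all hypothetical, and service discovery is omitted. The 15d local retention is set with the `--storage.tsdb.retention.time=15d` command-line flag rather than in these files.

```yaml
# File 1: rules/namespace-aggregates.yml (hypothetical name), loaded by each shard
groups:
  - name: namespace-aggregates
    rules:
      # Collapse per-container series into one series per namespace.
      - record: namespace:container_cpu_usage_seconds:rate5m
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
      - record: namespace:container_memory_working_set_bytes:sum
        expr: sum by (namespace) (container_memory_working_set_bytes{container!=""})
---
# File 2: prometheus.yml fragment for one per-node-pool shard
global:
  external_labels:
    shard: pool-a            # hypothetical shard label
rule_files:
  - rules/namespace-aggregates.yml
scrape_configs:
  - job_name: kubelet-cadvisor
    # (service discovery omitted)
    metric_relabel_configs:
      # Dropped series are never ingested, so recording rules
      # cannot aggregate them afterwards.
      - source_labels: [__name__]
        regex: container_network_.*
        action: drop
remote_write:
  - url: https://metrics-store.example.internal/api/v1/write   # hypothetical endpoint
---
# File 3: prometheus.yml fragment for the global (federating) instance
scrape_configs:
  - job_name: federate-shards
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Pull only the pre-aggregated series produced by the recording rules.
        - '{__name__=~"namespace:.*"}'
    static_configs:
      - targets:
          - prometheus-pool-a.monitoring.svc:9090   # hypothetical
          - prometheus-pool-b.monitoring.svc:9090   # hypothetical
```

Federating only the `namespace:*` series keeps the global instance small; the raw per-container series stay on the shards.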
Loki at Scale
| Parameter | Small | Large | Effect |
|---|---|---|---|
| `ingester.chunk-target-size` | 1.5 MB | 2.5 MB | Fewer, larger chunks = less index pressure |
| `ingester.chunk-idle-period` | 30m | 1h | Reduces flush frequency |
| `limits_config.max_entries_limit_per_query` | 5,000 | 10,000 | Higher cap on log lines returned per query |
| `compactor.retention-enabled` | true | true | Always enable |
| `compactor.retention-period` | 30d | 14d | Reduce for large clusters |
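As an illustration, the Large column above could be written as a Loki configuration fragment along these lines. It is a sketch, not a verified config: YAML keys use underscores where the table uses flag-style names, and placing the retention period under `limits_config` is an assumption based on how recent Loki versions apply compactor retention, so check the configuration reference for your version.

```yaml
# Loki config fragment, "Large" column (sketch)
ingester:
  chunk_target_size: 2621440      # 2.5 MB expressed in bytes
  chunk_idle_period: 1h
limits_config:
  max_entries_limit_per_query: 10000
  retention_period: 336h          # 14d; assumed location of the retention setting
compactor:
  retention_enabled: true
```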
Log Volume Reduction
- Set namespace-level log shipping (exclude `kube-system` verbose logs)
- Use structured JSON logging, which reduces index size by 40% vs. unstructured logs
- Filter health-check logs at the collector level
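If logs pass through an OpenTelemetry Collector, the namespace and health-check filtering above could be sketched with the contrib `filter` processor. The assumptions here are significant: the collector must be a contrib build, the `k8s.namespace.name` resource attribute must already be set (for example by the `k8sattributes` processor), and the health-check paths and severity cutoff are illustrative guesses about what counts as noise.

```yaml
processors:
  filter/drop-noise:              # "drop-noise" is a hypothetical instance name
    error_mode: ignore
    logs:
      log_record:
        # A record is dropped if ANY condition matches.
        # kube-system records below INFO; records with no severity set
        # also compare as below INFO and are dropped.
        - 'resource.attributes["k8s.namespace.name"] == "kube-system" and severity_number < SEVERITY_NUMBER_INFO'
        # Health-check access logs (paths are assumptions).
        - 'IsMatch(body, "GET /(healthz|readyz|livez)")'
```

The processor only takes effect once it is listed in the `processors` of a logs pipeline.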
Tempo at Scale
Sampling Strategies
| Strategy | Capture Rate | Use Case |
|---|---|---|
| Head-based (probabilistic) | 1–10% | General workloads |
| Tail-based (error/latency) | 100% of errors, 5% baseline | Production debugging |
| Always-on (critical paths) | 100% | Payment, auth flows |
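The tail-based and always-on rows can be approximated with the OpenTelemetry Collector contrib `tail_sampling` processor, sketched below. The policy names, the 2000 ms latency threshold, the `decision_wait` value, and the service names are hypothetical; matching `service.name` through a `string_attribute` policy is also an assumption worth verifying against the processor's documentation. A trace is kept when any policy decides to sample it.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # how long to buffer spans before deciding
    policies:
      # 100% of traces that contain an error status.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Slow traces; the threshold is an illustrative value.
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      # Always-on for critical paths (service names are hypothetical).
      - name: keep-critical-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payments, auth]
      # 5% baseline of everything else.
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```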
Storage Optimization
- Use S3-compatible backend (MinIO) for trace storage
- Set `compactor.compaction-window: 4h` to batch compactions
- Limit trace duration to 5 minutes max to prevent unbounded spans
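A Tempo configuration fragment for the first two items might look like the sketch below. The bucket name, MinIO endpoint, and `insecure: true` are hypothetical placeholders, and credentials are omitted. The trace-duration limit is left out because this guide does not name the setting that enforces it.

```yaml
# Tempo config fragment (sketch)
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces                        # hypothetical bucket
      endpoint: minio.observability.svc:9000      # hypothetical MinIO service
      insecure: true                              # assumes plain HTTP inside the cluster
compactor:
  compaction:
    compaction_window: 4h
```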
OpenTelemetry Collector Scaling
For clusters above 50 nodes, deploy the collector as a DaemonSet (node-level) with a gateway (cluster-level):
Pods → Node Collector (DaemonSet) → Gateway Collector (Deployment, 3 replicas) → Backends
This architecture:
- Caps per-node collector memory at ~256 MB instead of leaving it unbounded
- Centralizes sampling and export retry logic
- Enables batch processing with `batch/timeout: 5s`
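A node-level collector configuration for this layout could look roughly like the following. It is a minimal sketch that covers traces only: the gateway address and the plaintext connection are hypothetical, and using the `memory_limiter` processor is an assumption about how the ~256 MB bound is enforced.

```yaml
# Node collector (DaemonSet) config, sketch
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  # Soft memory limit for the collector process; listed first in the pipeline.
  memory_limiter:
    check_interval: 1s
    limit_mib: 256
  # Flush a batch at least every 5 seconds.
  batch:
    timeout: 5s
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc:4317   # hypothetical gateway Service
    tls:
      insecure: true                                # assumes in-cluster plaintext
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Sampling and backend export retries would then live in the gateway's own configuration, keeping the node collectors small and uniform.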