Observability at Scale

Purpose: Guidance for platform engineers on scaling the observability stack for large clusters without drowning in cardinality or storage costs.

Prometheus at Scale

Cardinality Management

Cluster Size	Expected Active Series	Recommended Action
Small (≤10 nodes)	< 200K	Default config sufficient
Medium (11–50 nodes)	200K–1M	Add recording rules, drop unused metrics
Large (51–200 nodes)	1M–5M	Federation, metric relabeling, shorter retention
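The tiers above can be expressed as a small lookup, which is handy for capacity scripts or admission checks. This is a minimal sketch: the thresholds and actions are copied from the table, while the function name and return shape are illustrative assumptions rather than part of any tool.

```python
def cardinality_tier(node_count: int) -> tuple[str, str]:
    """Return (tier, recommended action) for a cluster with node_count nodes.

    Thresholds mirror the table above. The name and return shape are
    hypothetical, chosen only for this example.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if node_count <= 10:
        return ("small", "Default config sufficient")
    if node_count <= 50:
        return ("medium", "Add recording rules, drop unused metrics")
    if node_count <= 200:
        return ("large", "Federation, metric relabeling, shorter retention")
    # The table stops at 200 nodes, so nothing is recommended beyond that.
    return ("beyond-table", "No guidance in the table above")
```

For example, `cardinality_tier(120)` falls in the large tier, where federation and relabeling are the suggested actions.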

Key Tuning

Recording rules: Pre-aggregate high-cardinality metrics (container_* → namespace-level summaries)
Metric relabeling: Drop container_network_* per-pod metrics; keep only namespace aggregates
Retention: 15d local, remote-write to long-term storage for historical queries
Sharding: Above 100 nodes, deploy one Prometheus per node pool and federate them into a global view
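To make the cardinality effect of pre-aggregation concrete, the sketch below models in plain Python what a namespace-level summary does to per-container series. It is not Prometheus rule syntax; it only mimics roughly what a `sum by (namespace)` aggregation would produce. The sample labels and values are made up for illustration.

```python
from collections import defaultdict

def aggregate_by_namespace(samples):
    """Collapse per-container samples into one summed value per namespace.

    samples: iterable of (labels, value) pairs, where labels is a dict
    that contains at least a "namespace" key.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels["namespace"]] += value
    return dict(totals)

# Hypothetical per-container samples: three series across two namespaces.
samples = [
    ({"namespace": "shop", "pod": "web-1", "container": "app"}, 120.0),
    ({"namespace": "shop", "pod": "web-2", "container": "app"}, 80.0),
    ({"namespace": "billing", "pod": "api-1", "container": "app"}, 50.0),
]

summary = aggregate_by_namespace(samples)
print(summary)
```

Three input series become two namespace-level series. On a real cluster the reduction scales with the number of pods and containers per namespace.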

Loki at Scale

Parameter	Small	Large	Effect
`ingester.chunk-target-size`	1.5 MB	2.5 MB	Fewer, larger chunks = less index pressure
`ingester.chunk-idle-period`	30m	1h	Reduces flush frequency
`limits_config.max_entries_limit_per_query`	5,000	10,000	More log lines returned per query
`compactor.retention-enabled`	true	true	Always enable
`compactor.retention-period`	30d	14d	Reduce for large clusters

Log Volume Reduction

Configure log shipping per namespace (exclude verbose kube-system logs)
Use structured JSON logging, which reduces index size by 40% compared with unstructured logs
Filter health-check logs at the collector level
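Health-check filtering at the collector can be pictured as a predicate applied to each log line before it is shipped. The following is a minimal sketch, assuming JSON-structured logs with a `path` field. The field name and the endpoint list are hypothetical and should be adapted to what your services actually emit.

```python
import json

# Hypothetical health-check endpoints; adjust to your services.
HEALTH_PATHS = {"/healthz", "/readyz", "/livez"}

def keep_log_line(line: str) -> bool:
    """Return False for JSON log lines whose "path" is a health-check endpoint.

    Lines that are not JSON objects are kept, so a parsing problem never
    silently drops real logs.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return True
    if not isinstance(record, dict):
        return True
    path = record.get("path")
    return not (isinstance(path, str) and path in HEALTH_PATHS)

lines = [
    '{"path": "/healthz", "status": 200}',
    '{"path": "/checkout", "status": 500}',
    "unstructured startup message",
]
shipped = [line for line in lines if keep_log_line(line)]
```

Keeping unparseable lines is a deliberate choice here: dropping only what is positively identified as health-check noise avoids losing error output that happens to be unstructured.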

Tempo at Scale

Sampling Strategies

Strategy	Capture Rate	Use Case
Head-based (probabilistic)	1–10%	General workloads
Tail-based (error/latency)	100% of errors, 5% baseline	Production debugging
Always-on (critical paths)	100%	Payment, auth flows
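The tail-based row can be read as a decision made once a trace is complete: keep every error, keep slow traces, and keep a 5% baseline of everything else. The sketch below illustrates that decision in Python. The latency threshold is an assumption (the table does not give one), and hashing the trace ID is just one way to make the baseline choice deterministic.

```python
import hashlib

BASELINE_RATE = 0.05          # 5% baseline, from the table above
LATENCY_THRESHOLD_MS = 1000   # hypothetical cutoff for a "slow" trace

def keep_trace(trace_id: str, has_error: bool, duration_ms: float) -> bool:
    """Decide whether to retain a completed trace."""
    # Errors and slow traces are always kept.
    if has_error or duration_ms >= LATENCY_THRESHOLD_MS:
        return True
    # Map the trace ID to a stable value in [0, 1) so that every collector
    # replica reaches the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < BASELINE_RATE
```

Because the baseline decision depends only on the trace ID, replaying the same trace through a different replica does not change the outcome.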

Storage Optimization

Use an S3-compatible backend (e.g., MinIO) for trace storage
Set compactor.compaction-window: 4h to batch compactions
Limit trace duration to 5 minutes max to prevent unbounded spans

OpenTelemetry Collector Scaling

For clusters above 50 nodes, deploy the collector as a DaemonSet (node-level) with a gateway (cluster-level):

Pods → Node Collector (DaemonSet) → Gateway Collector (Deployment, 3 replicas) → Backends

This architecture:

Bounds per-node collector memory at ~256 MB instead of letting it grow without limit
Centralizes sampling and export retry logic
Enables batch processing with batch/timeout: 5s
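The batching behaviour mentioned above, flushing when a batch fills or when a timeout expires, can be sketched independently of the collector. This is an illustrative Python model, not the collector's implementation. The class name, the `max_size` value, and the injectable clock are assumptions made for the example; only the 5-second timeout comes from the text.

```python
import time

class Batcher:
    """Buffer items and export them when the batch is full or too old."""

    def __init__(self, export, max_size=512, timeout_s=5.0, clock=time.monotonic):
        self._export = export        # callable that receives a list of items
        self._max_size = max_size    # hypothetical size limit
        self._timeout_s = timeout_s  # 5s, matching the timeout above
        self._clock = clock          # injectable so tests can control time
        self._items = []
        self._first_at = None

    def add(self, item):
        if not self._items:
            # Start the timeout when the first item of a batch arrives.
            self._first_at = self._clock()
        self._items.append(item)
        self.maybe_flush()

    def maybe_flush(self):
        """Export the buffer if it is full or older than the timeout."""
        if not self._items:
            return
        full = len(self._items) >= self._max_size
        expired = self._clock() - self._first_at >= self._timeout_s
        if full or expired:
            batch, self._items = self._items, []
            self._export(batch)
```

In this sketch `maybe_flush` must be called periodically (for instance from a timer loop) for the timeout to take effect when no new items arrive. A production batcher would normally run that timer itself.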
