Skip to main content

Observability at Scale

Purpose: For platform engineers, provides guidance on scaling the observability stack for large clusters without drowning in cardinality or storage costs.

Prometheus at Scale

Cardinality Management

Cluster SizeExpected Active SeriesRecommended Action
Small (≤10 nodes)< 200KDefault config sufficient
Medium (11–50 nodes)200K–1MAdd recording rules, drop unused metrics
Large (51–200 nodes)1M–5MFederation, metric relabeling, shorter retention

Key Tuning

  • Recording rules: Pre-aggregate high-cardinality metrics (container_* → namespace-level summaries)
  • Metric relabeling: Drop container_network_* per-pod metrics; keep only namespace aggregates
  • Retention: 15d local, remote-write to long-term storage for historical queries
  • Sharding: Deploy Prometheus per node-pool above 100 nodes; federate into global view

Loki at Scale

ParameterSmallLargeEffect
ingester.chunk-target-size1.5 MB2.5 MBFewer, larger chunks = less index pressure
ingester.chunk-idle-period30m1hReduces flush frequency
limits_config.max_entries_limit_per_query5,00010,000Larger query windows
compactor.retention-enabledtruetrueAlways enable
compactor.retention-period30d14dReduce for large clusters

Log Volume Reduction

  • Set namespace-level log shipping (exclude kube-system verbose logs)
  • Use structured JSON logging — drops 40% index size vs. unstructured
  • Filter health-check logs at the collector level

Tempo at Scale

Sampling Strategies

StrategyCapture RateUse Case
Head-based (probabilistic)1–10%General workloads
Tail-based (error/latency)100% of errors, 5% baselineProduction debugging
Always-on (critical paths)100%Payment, auth flows

Storage Optimization

  • Use S3-compatible backend (MinIO) for trace storage
  • Set compactor.compaction-window: 4h to batch compactions
  • Limit trace duration to 5 minutes max to prevent unbounded spans

OpenTelemetry Collector Scaling

For clusters above 50 nodes, deploy the collector as a DaemonSet (node-level) with a gateway (cluster-level):

Pods → Node Collector (DaemonSet) → Gateway Collector (Deployment, 3 replicas) → Backends

This architecture:

  • Reduces per-node memory from unbounded to ~256 MB
  • Centralizes sampling and export retry logic
  • Enables batch processing with batch/timeout: 5s