Reddit’s arguing again: observability in 2026 looks like OTel plus correlation
Quietly, OpenTelemetry won. The fights now happen over sampling, cardinality, and whether your Collector falls over at 2 a.m.
Community take: what people actually complain about
I keep seeing the same threads in Discord and GitHub issues. Folks love vendor-neutral OTLP, then they hit reality. Traces drop under load, metrics explode from one “helpful” label, and somebody asks why the Collector needs so much memory for tail sampling.
On the Kubernetes side, the consensus seems to be: most teams are upgrading their OTel setup, but they are doing it in slices. As one SRE put it in a Slack thread, “I’ll take boring metrics with clean ownership over 100% traces I cannot afford.” I agree, even if it feels less cool.
- Tail sampling scares people: buffering whole traces feels risky, especially when traffic spikes and the incident commander asks you to “turn up tracing.” I have watched teams do that and then OOM the gateway Collector.
- Cardinality still bites: someone adds user_id as a metric label, Prometheus cries, and the postmortem reads like a crime scene. The thing nobody mentions is how often this happens during “just instrument it” rollouts (a quick stop-the-bleeding sketch follows this list).
- Correlation beats collection: teams do not lack data. They lack clickable paths from “p99 went red” to “this trace” to “these logs” to “that Kubernetes event.”
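If that user_id label is already burning down your Prometheus, the fastest tourniquet I know is dropping it at ingestion while someone fixes the instrumentation. A minimal sketch, assuming a scrape config you control; the job name, target, and label are illustrative, not from this article:

# prometheus.yml fragment: emergency labeldrop while the bad label gets fixed at the source
scrape_configs:
  - job_name: payments-api            # hypothetical job
    static_configs:
      - targets: ["payments-api:9090"]
    metric_relabel_configs:
      - regex: user_id                # drop this label from every ingested series
        action: labeldrop

This hides the damage rather than fixing it; the real cure is removing the attribute from the metric instrumentation itself.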
Hot take: ignore the GitHub commit count on observability projects. It’s a vanity metric. Read the breaking changes and the open issues instead.
Official facts: what the docs and specs actually say
OpenTelemetry (OTel) gives you a vendor-neutral standard for collecting telemetry. It does not store or visualize anything. It ships the data model, SDKs, auto-instrumentation, OTLP, and the Collector pipeline.
Most engineers I see treat this as the current “default stack shape.” Instrument apps once, then choose storage later. That part feels real.
- Signals: traces, metrics, logs. Profiling and events also matter in practice, even when teams do not call them “pillars.”
- OTLP transport: gRPC and HTTP/protobuf show up everywhere. That makes backend swaps less painful, at least compared to the old agent zoo.
- Collector pipeline model: receivers ingest, processors mutate and sample, exporters ship data out. Connectors can translate between pipelines, for example span-to-metrics.
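To make the connector idea concrete, here is a minimal sketch of span-to-metrics using the spanmetrics connector from the contrib distribution. The backend endpoints are placeholders, and you should confirm your Collector build actually ships this connector before copying anything:

# collector fragment: traces in, RED-style metrics out via the spanmetrics connector
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  spanmetrics: {}                     # defaults: call counts and duration histograms per service and span name

exporters:
  otlp/traces:
    endpoint: "tempo:4317"            # placeholder trace backend
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: "http://mimir:9009/api/v1/push"   # placeholder metrics backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/traces, spanmetrics]     # the connector sits here as an exporter
    metrics:
      receivers: [spanmetrics]                  # and here as a receiver
      exporters: [prometheusremotewrite]

This is the same rate-errors-duration-from-spans trick the rollout plan later in this post leans on.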
Distributed tracing still works the same way. A trace groups spans, spans carry attributes and timestamps, and context propagation pushes trace IDs through headers like traceparent. B3 still shows up because nothing ever really dies.
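Propagation is SDK configuration, not Collector configuration, and most SDKs honor the standard OTEL_* environment variables. A minimal sketch of wiring it into a workload; the service name, image, and node-local agent endpoint are assumptions:

# deployment fragment: W3C trace context plus B3 for older neighbors
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                      # hypothetical service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: app
          image: registry.example.com/checkout:1.4.2   # placeholder image
          env:
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://$(NODE_IP):4317"           # node-local DaemonSet agent
            - name: OTEL_PROPAGATORS
              value: "tracecontext,baggage,b3"          # traceparent first, B3 for the stragglers
            - name: OTEL_SERVICE_NAME
              value: "checkout"
# What actually crosses the wire is a header like:
# traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01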
So what changed by 2026? The “three pillars” talk stopped helping
People still say “logs, metrics, traces.” Fine. But I have watched teams with all three signals still take four hours to find a bad deploy because each signal lives in a different UI with a different naming scheme.
Correlation is the whole job. If you cannot jump from a p99 spike to a trace exemplar and then to a log line, you built a telemetry museum, not an on-call tool.
- Add events to the chain: Kubernetes events like OOMKilled and FailedMount often explain the “why” faster than application logs do (a Collector sketch for pulling them in follows this list).
- Profiles answer “why slow”: traces tell you where time went; profiles usually tell you which line ate the CPU. That difference matters when a regex goes wild.
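Pulling cluster events into the same pipeline is one receiver away if your Collector build is the contrib distribution (the k8s_events receiver does not ship in every image, so check yours). A sketch; run it in a single deployment-mode Collector rather than the DaemonSet, or every node will ingest every event:

# singleton collector fragment: turn Kubernetes events into log records next to your traces
receivers:
  k8s_events:
    namespaces: []                    # empty list means watch all namespaces

exporters:
  otlp:
    endpoint: "otel-gateway:4317"     # forward to the same gateway as everything else
    tls:
      insecure: false

service:
  pipelines:
    logs:
      receivers: [k8s_events]
      exporters: [otlp]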
My synthesis: what most teams deploy first (and what I would do)
Start small. Really.
The consensus seems to be: most teams get value fastest by instrumenting one user-facing service, wiring a Collector, and proving they can follow one request across two downstream calls. After that, they add cost controls. Then they expand.
Here’s a pragmatic rollout that matches how people talk about it in issues and war rooms.
- Week 1, traces only: enable auto-instrumentation on one service, send to a simple trace backend. Verify propagation across service boundaries before you celebrate. An Operator-based sketch follows this list.
- Weeks 2-3, add “owned” spans and RED metrics: auto-instrumentation misses business logic. Add two or three spans you care about, then generate rate, errors, and duration from spans (the spanmetrics connector sketch above does exactly this).
- Weeks 4-8, scale the pipeline: run a DaemonSet agent for most workloads, forward to a gateway Collector for tail sampling and egress. Keep the gateway boring and overprovisioned.
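If you run the OpenTelemetry Operator (linked at the end of this post), week 1 can be mostly declarative. A minimal sketch; the namespace, endpoint, and sampler setting are assumptions to tune, not prescriptions:

# instrumentation.yaml: auto-instrumentation via the OpenTelemetry Operator
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: payments                 # hypothetical namespace
spec:
  exporter:
    endpoint: http://otel-agent.monitoring:4317   # assumed node-agent or gateway Service
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "1"                     # keep everything while it is just one service

# Opt a workload in with a pod template annotation, for example:
#   instrumentation.opentelemetry.io/inject-java: "true"

The annotation comes in per-language flavors (java, python, nodejs, dotnet), which is exactly what makes the one-service-at-a-time rollout practical.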
Some teams skip canaries for patch-level Collector upgrades. I do not, but I get it. If your on-call pain comes from dropped telemetry, a canary Collector pays for itself in one night.
If you cannot test your CNI and your Collector pipeline in staging, you should not run Kubernetes in production. Harsh, but I stand by it.
Collector deployment: the choice that shows up in every postmortem
DaemonSet usually wins. One Collector per node keeps configs consistent and makes it easy to grab host-level signals. Sidecars still make sense for noisy services or multi-tenant isolation, but they multiply operational work fast.
Tail sampling changes the game. It also changes your memory profile. Buffering traces costs RAM. Plan for it, then measure it under load. I cannot give you one number that fits every cluster, and anyone who does probably sells something.
- Do configure memory limits: run memory_limiter, set sane limits, and test a traffic spike. Collectors crash at the worst time.
- Do redact early: drop auth headers, hash db.statement if needed, and treat PII as a production incident, not a backlog item.
- Do not label metrics with UUIDs: keep high-cardinality stuff on spans and logs, not on metric dimensions.
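Here is roughly what those do’s and don’ts look like on the gateway side, assuming the contrib distribution. A sketch, not a drop-in config: the policies, limits, and scrubbed attribute are starting points to tune against your own traffic:

# otel-gateway-config.yaml: tail sampling plus scrubbing, sized deliberately
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3072                   # gateways need real headroom; measure, then adjust
    spike_limit_mib: 512
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete                # never let auth headers reach the backend
  tail_sampling:
    decision_wait: 10s                # how long whole traces sit in memory
    num_traces: 50000                 # cap on buffered traces; this is your RAM knob
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  batch: {}

exporters:
  otlp:
    endpoint: "your-backend:4317"     # placeholder
    tls:
      insecure: false

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes/scrub, tail_sampling, batch]
      exporters: [otlp]

The memory math stops being abstract here: decision_wait multiplied by your span throughput is roughly what sits in RAM waiting for a sampling decision.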
Tooling reality: pick your backend based on who carries the pager
I see three common paths. Teams with strong platform engineering run Grafana plus Prometheus or Mimir, Tempo, Loki, and maybe Pyroscope. Teams that hate running stateful infra pay for SaaS. Teams that only need tracing keep Jaeger around longer than they expected.
None of these choices fix bad instrumentation. They only change how quickly you notice it.
OTel Collector config that actually works
Here’s a production-ready Collector config for Kubernetes, starting with the most common pattern:
# otel-collector-config.yaml - DaemonSet agent config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512                    # Hard limit - prevents OOM
    spike_limit_mib: 128
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name

exporters:
  otlp:
    endpoint: "otel-gateway:4317"
    tls:
      insecure: false

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
# Deploy OTel Collector as DaemonSet on Kubernetes
# Using the official Helm chart:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install otel-agent open-telemetry/opentelemetry-collector \
  --set mode=daemonset \
  --set config.receivers.otlp.protocols.grpc.endpoint="0.0.0.0:4317" \
  --set resources.limits.memory=512Mi \
  --namespace monitoring \
  --create-namespace                  # creates the monitoring namespace if it does not exist yet

# Verify it's running:
kubectl -n monitoring get pods -l app.kubernetes.io/name=opentelemetry-collector
kubectl -n monitoring logs -l app.kubernetes.io/name=opentelemetry-collector --tail=20
# PromQL queries for monitoring your OTel pipeline health
# Collector throughput (spans per second):
rate(otelcol_receiver_accepted_spans[5m])
# Collector drops (THIS is the one to alert on):
rate(otelcol_processor_dropped_spans[5m]) > 0
# Memory usage in MB (compare against your memory_limiter limit_mib):
otelcol_process_memory_rss / 1024 / 1024
# Exporter queue depth (backs up before drops):
otelcol_exporter_queue_size
# Cardinality check — top 10 metric names by series count:
topk(10, count by (__name__)({__name__=~".+"}))
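Since the dropped-spans query is the one worth waking up for, wire it into an actual alert. A sketch in Prometheus rule-file format; the rule name, threshold, and labels are assumptions, and some Collector versions emit these metrics with a _total suffix:

# otel-collector-alerts.yaml: page when the pipeline starts shedding spans
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorDroppingSpans
        expr: rate(otelcol_processor_dropped_spans[5m]) > 0
        for: 5m                       # ignore one-off blips
        labels:
          severity: page
        annotations:
          summary: "Collector on {{ $labels.instance }} is dropping spans"
          description: "Telemetry loss in progress; check exporter queue depth and memory_limiter first."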
Full Collector config reference: OpenTelemetry Collector docs. For the Kubernetes operator approach, see the OTel Operator on GitHub. The OTLP spec is at opentelemetry.io/docs/specs/otlp. Check your monitoring stack’s version health with the Stack Health Scorecard.
Related Reading
- What is eBPF? A Practical Guide for Kubernetes — The kernel-level tracing that complements OTel instrumentation
- Container Image Scanning in 2026 — Security observability before your containers even run
- Kubernetes Statistics and Adoption Trends in 2026 — The scale driving observability investment
- Blameless Postmortems That Actually Change Your System — What happens after observability catches the problem
- PostgreSQL Performance in 2026 — Database observability patterns that pair with OTel tracing
Wrap-up
By 2026, the argument stopped being “should we adopt OpenTelemetry?” Most teams already did. The argument moved to “can we correlate signals fast enough to debug prod without burning cash?” If you wire correlation first, the rest gets easier. Probably.