By Aisha Johnson, Security Engineer
What Distributed Tracing Actually Solves
A microservices architecture running at scale is a distributed system. That means a single user request can touch a dozen services, three databases, two message queues, and an external API before returning a response. When that request takes 4.2 seconds instead of 200ms, or silently fails, you need to reconstruct exactly what happened across all of those hops. Logs alone cannot do this. Metrics tell you that something is broken, not where or why.
Distributed tracing solves this by attaching a unique trace ID to a request at its entry point and propagating it through every service the request touches. Each service records a “span” covering its portion of the work, timing it and attaching context. At the end, you have a complete picture: a waterfall view showing which service added latency, which database query exploded, and which downstream call timed out.
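The mechanics can be sketched in plain Python. This is a conceptual illustration, not a real tracing library: mint a trace ID at the entry point, hand it (plus the current span ID) to every downstream call, and record a timed span per service. All names here are hypothetical.

```python
# Conceptual sketch of trace propagation (not a real tracing library):
# a trace ID minted at the entry point ties per-service spans into one waterfall.
import time
import uuid

spans = []  # in a real system, each service exports its spans to a backend


def record_span(trace_id, parent_id, service, work):
    """Time one service's portion of the request and record it as a span."""
    span_id = uuid.uuid4().hex[:16]
    start = time.perf_counter()
    work(trace_id, span_id)  # downstream calls carry (trace_id, span_id)
    duration_ms = (time.perf_counter() - start) * 1000
    spans.append({"trace_id": trace_id, "span_id": span_id,
                  "parent_id": parent_id, "service": service,
                  "duration_ms": duration_ms})
    return span_id


# The entry point mints the trace ID; every downstream hop reuses it.
trace_id = uuid.uuid4().hex
record_span(trace_id, None, "api-gateway",
            lambda tid, sid: record_span(tid, sid, "payment-service",
                                         lambda t, s: time.sleep(0.01)))

for s in spans:
    print(s["service"], round(s["duration_ms"], 1), "ms")
```

Because every span carries the same trace ID and a parent span ID, the backend can reassemble the waterfall no matter which service exported which span, or in what order.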
The audience for these tools spans SREs debugging production incidents, developers profiling new deployments, and security engineers (my own territory) who use trace data to detect anomalous request patterns and trace the blast radius of a breach.
This comparison covers the platforms that matter in 2026: both open-source options you self-host and commercial SaaS products that handle the operational burden for you.
OpenTelemetry: The Standard Underneath Everything
Before comparing backends, it is worth naming the substrate all modern tracing platforms share: OpenTelemetry (OTel). OTel is the CNCF-graduated instrumentation standard that defines how you instrument your application to emit traces, metrics, and logs in a vendor-neutral format. According to CNCF survey data, approximately 48% of cloud-native end-user companies have adopted OpenTelemetry, with the project ranking as the second most active in the entire CNCF ecosystem.
The practical consequence: if you instrument your application with the OpenTelemetry SDK today, you can switch backends without re-instrumenting. Every platform in this comparison either speaks OTLP (OpenTelemetry Protocol) natively or accepts it as a primary input.
A minimal OpenTelemetry setup for a Python service looks like this:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://your-backend:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process-payment") as span:
    span.set_attribute("user.id", user_id)
    span.set_attribute("payment.amount", amount)
    # your logic here
```
The endpoint is the only value that changes between platforms. Everything else stays constant.
Platform Comparisons
Jaeger (v2)
Jaeger is a CNCF-graduated open-source project originally created by Uber. In November 2024, the team shipped Jaeger v2, a complete architectural overhaul that replaces the Jaeger-specific agent and collector with the OpenTelemetry Collector framework at its core. Jaeger v1 reached end-of-support in January 2026.
The v2 release ships as a single binary (down from multiple separate binaries), reducing the container image from 40MB to 30MB. Storage backends include Elasticsearch, Apache Cassandra, ClickHouse (now on the official roadmap), and Kafka as an intermediate queue. The new storage abstraction layer means batched exports to ClickHouse are now possible, which materially improves performance at high trace volumes.
Pros: Mature, CNCF-graduated, strong community, now fully OpenTelemetry-native, rich tag-based search and query UI, free to self-host.
Cons: Requires managing Elasticsearch or Cassandra at scale (significant operational overhead), UI is capable but not as polished as commercial alternatives, high-cardinality filtering still lags behind Honeycomb-style wide-event models.
Best for: Teams that need a battle-tested, self-hosted tracing backend with strong search capabilities and are already managing an Elasticsearch cluster.
Docker Compose quickstart for Jaeger v2:
```yaml
services:
  jaeger:
    image: jaegertracing/jaeger:2
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
    # The default configuration stores spans in memory. For production,
    # mount a Jaeger v2 config file pointing at Elasticsearch or Cassandra
    # (the v1-era SPAN_STORAGE_TYPE env var no longer applies in v2).
```
Grafana Tempo
Grafana Tempo is purpose-built for high-volume, cost-conscious tracing. Its key architectural decision: it stores no index. Instead of building a search index over span attributes (which requires expensive storage like Elasticsearch), Tempo stores raw trace data in object storage (S3, GCS, or Azure Blob) and relies on trace ID lookups. This makes it dramatically cheaper to operate at scale.
The tradeoff is real: if you do not know a trace ID, your search options are limited. Tempo integrates with Grafana’s TraceQL query language and supports linking traces to Loki logs and Prometheus metrics within Grafana dashboards, which is its primary value proposition. The Grafana-native experience, with logs, metrics, and traces correlated in a single view, is excellent.
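TraceQL is what partially closes that search gap. Two representative queries (the service name, route, and attribute names are illustrative and depend on your instrumentation):

```
# Spans from the checkout service slower than 2 seconds
{ resource.service.name = "checkout" && duration > 2s }

# Server errors on a specific route
{ span.http.status_code >= 500 && span.http.route = "/api/v1/payments" }
```

These run against the trace data in object storage, so they are slower than an indexed search in Jaeger or Honeycomb, but they cost nothing extra to enable.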
Tempo is open-source and free to self-host. Grafana Cloud includes a Tempo-backed tracing tier.
Pros: Extremely low storage costs (object storage only), no operational database to manage, first-class Grafana integration, TraceQL for attribute-based search (partially offsets the no-index limitation).
Cons: Trace search without a trace ID is limited compared to Jaeger, least compelling as a standalone tool outside the Grafana ecosystem.
Best for: Teams already using Grafana, Loki, and Prometheus who want unified observability at minimal storage cost.
Honeycomb
Honeycomb is the commercial platform that most aggressively challenged the trace-plus-metrics model. Its data model centers on “wide events”: structured events that can carry an arbitrarily large number of fields. You store everything about a request in a single event, and then use BubbleUp (anomaly detection) and the query builder to find patterns across high-cardinality dimensions. Honeycomb calls this approach “observability-driven development.”
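A wide event is easiest to grasp as one flat, field-rich record per request. A hypothetical example follows; the field names are illustrative, not a required Honeycomb schema:

```python
import json

# One "wide event" per request: a single flat record carrying as many
# high-cardinality fields as you care to attach (all values illustrative).
wide_event = {
    "trace.trace_id": "8a3f1c2b9d4e5f60",
    "service.name": "checkout",
    "http.route": "/api/v1/payments",
    "http.status_code": 500,
    "duration_ms": 4200,
    "user.id": "u_492817",            # high-cardinality: unique per user
    "payment.provider": "stripe",
    "db.rows_examined": 118234,
    "feature_flag.new_retry_logic": True,
}

print(json.dumps(wide_event, indent=2))
```

The payoff is in querying: because every field lives on the same event, you can ask questions like "group the 500s by user.id where duration_ms exceeds 1000" without pre-declaring which dimensions matter.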
Pricing is event-volume-based, which is a meaningful differentiator: no per-host charges and no per-user charges. The free tier covers up to 20 million events per month, and the Pro plan starts at $130/month for up to 100 million events per month. Enterprise pricing requires contacting sales.
Pros: Best-in-class high-cardinality query engine, no per-seat pricing surprises, BubbleUp anomaly detection is genuinely useful for finding unknown unknowns, excellent developer experience.
Cons: Fully SaaS with no self-hosted option (data leaves your environment), can get expensive at very high volumes, less useful as a traditional APM.
Best for: Developer-centric teams who debug production issues with trace data daily and want the most powerful query capabilities without managing infrastructure.
SigNoz
SigNoz is an open-source observability platform built natively on OpenTelemetry, combining traces, metrics, and logs into a single application backed by ClickHouse (the same OLAP database used by Uber and Cloudflare). It is marketed as a self-hostable alternative to Datadog and New Relic.
Pricing for the hosted cloud tier recently dropped to $49/month for the Teams plan, with usage-based costs of $0.30 per GB of ingested logs and traces and $0.10 per million metric samples. A startup program offers 50% off. The community edition is fully self-hostable at no cost.
The ClickHouse backend performs well on high-cardinality queries that would be expensive on Elasticsearch. SigNoz’s UI covers the core APM workflow: service maps, trace search, alerts, and dashboards.
Pros: Fully open-source with self-hosting option, OpenTelemetry-native from day one, ClickHouse backend handles high-cardinality efficiently, competitive cloud pricing.
Cons: Smaller community than Jaeger or Grafana, fewer integrations than commercial platforms, still maturing compared to Datadog.
Best for: Teams that want the full observability stack (traces, logs, metrics) in a single tool they can self-host, without Datadog’s pricing risk.
Datadog APM
Datadog is the enterprise observability standard in many organizations, and its APM product is deeply integrated with the rest of the Datadog platform (logs, metrics, security, synthetic monitoring, error tracking). Trace correlation across all of these surfaces is a genuine operational advantage at scale.
Pricing is host-based: $36/host/month for APM, $41 for APM Pro, and $47 for APM Enterprise (billed annually). On top of that, indexed spans cost $1.70 per million per month, and ingested spans are billed by the gigabyte. At scale, Datadog APM bills can grow significantly and unpredictably, which is a frequent source of friction for engineering teams.
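To see how the span charges compound on top of host pricing, here is a back-of-envelope estimate using the list prices above. The fleet size and span volume are hypothetical, and real bills depend on contract terms, sampling rates, and retention filters:

```python
# Back-of-envelope Datadog APM estimate (hypothetical volumes; list prices
# from this article, excluding ingested-span per-GB charges).
hosts = 50
apm_per_host = 36.00           # $/host/month, base APM tier
indexed_spans_millions = 800   # spans retained for search, per month
indexed_span_rate = 1.70       # $ per million indexed spans per month

monthly = hosts * apm_per_host + indexed_spans_millions * indexed_span_rate
print(f"${monthly:,.2f}/month")  # $1,800 for hosts alone; indexing adds $1,360
```

Note that the indexing line item rivals the host line item even at modest volumes, which is why retention filters and sampling are the main cost levers.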
Pros: Deepest integration across the full observability stack, mature UI, strong alerting and SLO tracking, enterprise SLAs.
Cons: Expensive and complex pricing model, data lives in Datadog’s cloud (compliance implications for regulated industries), vendor lock-in risk.
Best for: Enterprise teams already standardized on Datadog who want unified observability without stitching together multiple tools.
Zipkin
Zipkin is the original open-source distributed tracing system, originally developed at Twitter. It remains actively maintained by the community through 2025, but with no dedicated paid development. Recent updates have added security improvements (Trivy scanner integration for container images), but the feature set has not kept pace with Jaeger, Tempo, or SigNoz.
Zipkin accepts data via its own wire format or via Zipkin-compatible exporters, and storage backends include Elasticsearch, Cassandra, and MySQL. The UI is functional but basic. Unless you have existing Zipkin instrumentation you cannot migrate, there is little reason to choose it for a new deployment in 2026.
Pros: Lightweight, simple to operate, historically widely supported.
Cons: Feature-frozen relative to modern alternatives, UI lacks advanced filtering and correlation, purely volunteer-maintained.
Comparison Table
| Tool | Best For | Pricing | Open Source? | Key Strength |
|---|---|---|---|---|
| Jaeger v2 | Self-hosted, OpenTelemetry-native tracing | Free (self-host infra costs apply) | Yes (CNCF graduated) | Rich tag search; OTel Collector-native |
| Grafana Tempo | Cost-conscious teams in the Grafana stack | Free OSS; Grafana Cloud pricing varies | Yes | Object storage backend; near-zero index cost |
| Honeycomb | High-cardinality debugging at SaaS teams | Free (20M events/mo); Pro from $130/mo | No | Wide-events model; BubbleUp anomaly detection |
| SigNoz | Full-stack OSS APM with self-host option | Free OSS; Cloud from $49/mo | Yes | ClickHouse backend; OpenTelemetry-first |
| Datadog APM | Enterprise teams on unified Datadog stack | $36-$47/host/month + span costs | No | Deepest cross-signal correlation |
| Zipkin | Legacy systems with existing Zipkin instrumentation | Free (self-host) | Yes | Simple to operate; wide legacy support |
Security Considerations You Cannot Ignore
This is the section most tracing comparisons skip, and it is the one I care most about.
Traces contain request-level data. That means they can contain authentication headers, session tokens, query parameters with user PII, internal service credentials passed in HTTP headers, and request bodies. If your spans are ingesting this data unfiltered, you have a compliance problem and a potential breach surface.
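As a first line of defense, you can scrub attributes in application code before they are ever attached to a span. A minimal sketch follows; the helper and its key pattern are illustrative, not exhaustive, and complement rather than replace collector-level redaction:

```python
import re

# Illustrative application-side scrub: redact values of sensitive-looking
# keys before they are attached to span attributes.
SENSITIVE_KEY = re.compile(r"(?i)(authorization|x-api-key|cookie|token)")


def scrub_attributes(attrs: dict) -> dict:
    """Replace values of sensitive-looking keys with a redaction marker."""
    return {k: ("[REDACTED]" if SENSITIVE_KEY.search(k) else v)
            for k, v in attrs.items()}


clean = scrub_attributes({
    "http.method": "POST",
    "http.request.header.authorization": "Bearer abc123",
    "user.id": "u_42",
})
print(clean)  # the authorization header value is replaced; others pass through
```

Scrubbing at the source means a misconfigured collector cannot leak what was never emitted.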
Several risks deserve explicit attention:
Span attribute scrubbing: Filter sensitive fields before spans leave your control boundary. The OpenTelemetry Collector supports a redaction processor for exactly this purpose:
```yaml
processors:
  redaction:
    allow_all_keys: false
    allowed_keys:
      - http.method
      - http.status_code
      - http.route
      - db.system
      - db.operation
    blocked_values:
      - "(?i)(authorization|x-api-key|cookie|token)"
```
Backend access control: A Jaeger or Tempo deployment with no authentication on port 16686 is a direct window into your internal architecture. Apply mTLS or a reverse proxy with authentication in front of any trace UI. Both Jaeger and Tempo support OAuth2 proxy setups.
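One minimal pattern is an nginx reverse proxy with authentication in front of the UI. The hostnames, certificate paths, and upstream below are placeholders; oauth2-proxy or mTLS are stronger options, and basic auth is the floor, not the goal:

```
# Hypothetical nginx reverse proxy fronting the Jaeger UI with basic auth
server {
    listen 443 ssl;
    server_name traces.internal.example.com;
    ssl_certificate     /etc/nginx/tls/traces.crt;
    ssl_certificate_key /etc/nginx/tls/traces.key;

    location / {
        auth_basic "Tracing";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://jaeger:16686;
    }
}
```

With this in place, port 16686 should be bound only to the internal network so the proxy is the sole path to the UI.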
Trace data as an attack surface: The serialization layer in observability pipelines deserves the same scrutiny as any data ingestion path. CVE-2025-68664 (CVSS 9.3) in LangChain Core demonstrated how serialization in orchestration and tracing-adjacent pipelines can enable secret extraction when untrusted data passes through. The principle generalizes: validate and sanitize data entering any observability pipeline, treat the collector as a security boundary, and audit what lands in your span attributes.
Regulated environments: If you operate under HIPAA, SOC 2, or PCI-DSS, the data residency implications of SaaS tracing platforms (Honeycomb, Datadog) require review. Self-hosted Jaeger or SigNoz keeps trace data inside your control boundary.
Recommendations by Use Case
Startups and small teams: Start with SigNoz Cloud at $49/month. You get traces, logs, and metrics in one tool without managing infrastructure. If budget is the constraint and you are already on Grafana, add Tempo to your existing stack.
Developer-focused SaaS products: Honeycomb is the strongest choice if your engineers actively use traces for debugging and you can accept SaaS data residency. The query ergonomics and high-cardinality capabilities justify the cost.
Self-hosting required (compliance, air-gapped, regulated industries): Jaeger v2 for a tracing-only deployment, or SigNoz for a full observability stack. Both are production-grade and actively maintained. Run them behind an authenticated reverse proxy.
Enterprise Datadog shops: Datadog APM is the pragmatic choice if your team is already standardized there. The cross-signal correlation is a real productivity multiplier. Control costs by setting ingestion sampling rates and indexing only high-value spans.
Grafana-native infrastructure teams: Tempo is the natural fit. Add TraceQL, link to Loki for log correlation, and you have a complete observability stack for the cost of object storage.
Kubernetes operators at scale: Jaeger v2 with ClickHouse (once it graduates to official support) is worth evaluating. ClickHouse handles high write throughput and wide-column queries significantly better than Elasticsearch at the volumes common in large Kubernetes environments.
The migration path in every case is the same: instrument with OpenTelemetry, ship to an OTel Collector, and route to your chosen backend. The backend becomes replaceable. The only defensible lock-in is the one you choose deliberately.
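That routing step is a small Collector config. A minimal sketch, with placeholder endpoints (enable TLS rather than `insecure` outside local testing):

```yaml
# Minimal OpenTelemetry Collector pipeline: receive OTLP from your services,
# batch, and forward to whichever backend you chose (endpoint is a placeholder).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
exporters:
  otlp:
    endpoint: your-backend:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Swapping backends later means changing the exporter block, not touching application code.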