Jaeger Reference: Distributed Tracing with OpenTelemetry, Sampling & Production Setup
Jaeger is a CNCF-graduated distributed tracing system. It shows how requests flow through microservices — latency per service, which service is slow, where errors originate. Works natively with OpenTelemetry SDKs; OTel is now the standard way to instrument apps for Jaeger.
1. Jaeger Architecture & OTel Integration
How Jaeger works with OpenTelemetry
| Component | Role |
|---|---|
| OTel SDK | Instruments your app code, creates spans |
| OTel Collector | Receives spans from SDK, processes/routes them |
| Jaeger Collector | Receives spans from OTel Collector, stores them |
| Jaeger Storage | Backend: Elasticsearch, OpenSearch, Cassandra; Badger (embedded, single-node); in-memory (dev only) |
| Jaeger Query | Serves the UI + API for trace search |
# Modern stack:
#   App → OTel SDK → OTel Collector → Jaeger Collector → Storage → Jaeger UI
# Install Jaeger on Kubernetes (Helm):
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
# Dev mode — all-in-one with in-memory storage, no external backend:
helm install jaeger jaegertracing/jaeger --namespace tracing --create-namespace \
  --set allInOne.enabled=true \
  --set storage.type=none
# Production (Elasticsearch backend):
helm install jaeger jaegertracing/jaeger --namespace tracing \
  --set storage.type=elasticsearch \
  --set storage.elasticsearch.host=elasticsearch.logging:9200
# Access Jaeger UI:
kubectl port-forward -n tracing svc/jaeger-query 16686
# Open: http://localhost:16686
2. Instrument Apps with OpenTelemetry SDK
Python and Go instrumentation sending to Jaeger
# Python (auto-instrumentation — no code changes for common frameworks):
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install # installs instrumentation for detected frameworks
OTEL_SERVICE_NAME=my-service OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 OTEL_TRACES_EXPORTER=otlp opentelemetry-instrument python app.py
# Auto-instruments: Flask, Django, FastAPI, SQLAlchemy, requests, etc.
# Python manual spans:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
endpoint="http://otel-collector:4317"
)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-service")
with tracer.start_as_current_span("my-operation") as span:
    span.set_attribute("user.id", 123)
    span.set_attribute("db.query", "SELECT * FROM orders")
    # your code here
# Go:
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

exporter, err := otlptracegrpc.New(ctx,
    otlptracegrpc.WithEndpoint("otel-collector:4317"),
    otlptracegrpc.WithInsecure())
if err != nil { /* handle error */ }
tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
otel.SetTracerProvider(tp)
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "http-request")
defer span.End()
span.SetAttributes(attribute.String("http.url", url))
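Jaeger can stitch spans from different services into one waterfall because OTel SDKs propagate the W3C Trace Context `traceparent` header on outgoing calls (format: `00-<trace-id>-<span-id>-<flags>`, flags `01` = sampled). A stdlib sketch of building and parsing that header — helper names here are illustrative, not part of any OTel API:

```python
import re
import secrets

# "traceparent: 00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>"
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a traceparent header value (illustrative helper)."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(value: str):
    """Return (trace_id, parent_span_id, sampled) or None if malformed."""
    m = TRACEPARENT_RE.match(value)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, int(flags, 16) & 0x01 == 0x01

# New trace: random IDs of the sizes the spec requires
trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex chars
span_id = secrets.token_hex(8)     # 8 bytes  -> 16 hex chars
header = build_traceparent(trace_id, span_id)
print(parse_traceparent(header))
```

The auto-instrumentation above injects and extracts this header for you; you only deal with it directly when crossing a boundary OTel doesn't instrument (e.g. a custom message queue).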
3. OTel Collector Configuration
Route spans from apps to Jaeger via the OTel Collector
# OTel Collector config (otel-collector-config.yaml):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # gRPC — apps send here
      http:
        endpoint: 0.0.0.0:4318   # HTTP — browser or simple clients
processors:
  batch:                         # batch spans before sending to Jaeger
    timeout: 5s
    send_batch_size: 1000
  resource:
    attributes:
      - action: upsert
        key: deployment.environment
        value: production
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # forward to Jaeger collector
    tls:
      insecure: true
  # Also export metrics to Prometheus alongside traces:
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]   # batch last, as recommended
      exporters: [otlp/jaeger]
    metrics:                          # prometheus exporter needs its own pipeline
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
# Deploy OTel Collector:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-collector open-telemetry/opentelemetry-collector \
  --set mode=deployment \
  --set image.repository=otel/opentelemetry-collector-contrib \
  -f values.yaml   # put the config above under the chart's `config:` key in values.yaml
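The `batch` processor above flushes on whichever comes first: the batch reaching `send_batch_size`, or `timeout` elapsing since the last flush. A minimal stdlib sketch of that logic (class name illustrative; the timer is shown as an explicit `tick()` check rather than a background thread):

```python
import time

class BatchBuffer:
    """Toy model of the collector's batch processor: flush on size or
    timeout, whichever comes first."""

    def __init__(self, send_batch_size=1000, timeout_s=5.0, export=print):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.export = export          # downstream exporter callback
        self.buf = []
        self.last_flush = time.monotonic()

    def add(self, span):
        self.buf.append(span)
        if len(self.buf) >= self.send_batch_size:   # size trigger
            self.flush()

    def tick(self):
        # called periodically; flushes a partial batch once timeout elapses
        if self.buf and time.monotonic() - self.last_flush >= self.timeout_s:
            self.flush()

    def flush(self):
        batch, self.buf = self.buf, []
        self.last_flush = time.monotonic()
        self.export(batch)
```

The real processor also bounds its in-flight queue and retries failed exports; this sketch only shows the two flush triggers the config controls.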
4. Jaeger UI — Finding Traces
Search for slow requests, errors, and trace analysis
# Jaeger UI at http://localhost:16686
# Basic search:
#   Service: select your service (e.g. "checkout-service")
#   Operation: select a specific endpoint (e.g. "POST /checkout")
#   Tags: filter by attributes (error=true, http.status_code=500, user.id=123)
#   Lookback: 1h, 3h, 1d
#   Min/Max Duration: find slow requests (e.g. minDuration: 500ms)
# Trace detail view:
#   - Waterfall chart: each span as a horizontal bar (duration + timing)
#   - Parent-child relationships show which service called which
#   - Click a span: see attributes (HTTP URL, DB query, user ID, error message)
#   - Red spans = errors (span.status.code = ERROR)
# Trace comparison (2 traces side by side):
#   Search for 2 traces, click "Compare"
#   Shows timing differences per service — useful for "why was THIS request slow?"
# Jaeger query API (for automation):
curl "http://localhost:16686/api/traces?service=checkout-service&limit=20"
curl "http://localhost:16686/api/traces?service=checkout-service&tags=%7B%22error%22%3A%22true%22%7D"   # tags={"error":"true"} URL-encoded
curl "http://localhost:16686/api/services"   # list all services
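For automation on top of the query API, you mostly work with the JSON it returns. A stdlib sketch that computes end-to-end duration per trace from a payload shaped like the `/api/traces` response (`startTime` and `duration` are in microseconds); the inline payload here is fabricated for illustration:

```python
def trace_durations_us(payload):
    """Map traceID -> end-to-end duration (µs): latest span end minus
    earliest span start across the whole trace."""
    out = {}
    for t in payload.get("data", []):
        starts = [s["startTime"] for s in t["spans"]]
        ends = [s["startTime"] + s["duration"] for s in t["spans"]]
        out[t["traceID"]] = max(ends) - min(starts)
    return out

# Fabricated response in the /api/traces shape (times in microseconds):
payload = {
    "data": [
        {
            "traceID": "abc123",
            "spans": [
                {"spanID": "s1", "operationName": "POST /checkout",
                 "startTime": 1_700_000_000_000_000, "duration": 250_000},
                {"spanID": "s2", "operationName": "SELECT orders",
                 "startTime": 1_700_000_000_050_000, "duration": 120_000},
            ],
        }
    ]
}
print(trace_durations_us(payload))  # → {'abc123': 250000}
```

In practice you'd fetch the payload with the `curl` URLs above (or `urllib.request`) and alert on traces exceeding your latency budget.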
5. Sampling & Production Tuning
Control trace volume — probabilistic, rate-limiting, and adaptive
# Sampling strategies — set in Jaeger Collector or OTel Collector:
# 1. Probabilistic (sample X% of traces):
# In OTel SDK:
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(0.1) # 10% of traces
provider = TracerProvider(sampler=sampler)
# 2. Parent-based (respect the parent's sampling decision for complete traces):
from opentelemetry.sdk.trace.sampling import ParentBased
sampler = ParentBased(root=TraceIdRatioBased(0.1))  # follow parent; sample 10% of new root traces
# Note: head sampling decides at span start and can't see errors —
# to keep all error traces, use tail-based sampling (below)
# 3. Tail-based sampling in OTel Collector (sample based on full trace outcome):
# otel-collector-config.yaml:
# (tail_sampling ships in the collector-contrib distribution)
processors:
  tail_sampling:
    decision_wait: 10s       # wait 10s for all spans before deciding
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}      # always keep error traces
      - name: slow-policy
        type: latency
        latency: {threshold_ms: 1000}             # always keep traces > 1s
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}   # sample 5% of everything else
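The policies above are OR-combined: a trace is kept if any policy votes to keep it. A stdlib sketch of that decision, made once all spans of a trace have arrived — function name and span-dict fields (`start_ms`, `duration_ms`, `error`) are illustrative, not the collector's internal types:

```python
import random

def keep_trace(spans, threshold_ms=1000, sampling_percentage=5.0,
               rng=random.random):
    """Tail-sampling decision mirroring the three policies above:
    errors always kept, slow traces always kept, the rest sampled."""
    if any(s.get("error") for s in spans):
        return True                                  # errors-policy
    start = min(s["start_ms"] for s in spans)
    end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    if end - start > threshold_ms:
        return True                                  # slow-policy
    return rng() * 100 < sampling_percentage         # probabilistic-policy
```

This is why `decision_wait` matters: the latency and error checks only work on a complete trace, so the collector must buffer spans before deciding — trading memory for smarter sampling than any head-based strategy can offer.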
# Production tips:
# - Start with 10% probabilistic, then tune based on storage costs
# - Always keep 100% of error traces
# - Always keep 100% of slow traces (> P95 latency)
# - Use parent-based sampling for consistent trace completeness
# - Jaeger Elasticsearch storage: ~1KB per span, plan storage accordingly