Jaeger Reference: Distributed Tracing with OpenTelemetry, Sampling & Production Setup
Jaeger is a CNCF-graduated distributed tracing system. It shows how requests flow through microservices — latency per service, which service is slow, where errors originate. Works natively with OpenTelemetry SDKs; OTel is now the standard way to instrument apps for Jaeger.
1. Jaeger Architecture & OTel Integration
How Jaeger works with OpenTelemetry
| Component | Role |
|---|---|
| OTel SDK | Instruments your app code, creates spans |
| OTel Collector | Receives spans from SDK, processes/routes them |
| Jaeger Collector | Receives spans from OTel Collector, stores them |
| Jaeger Storage | Backend: Elasticsearch, OpenSearch, Cassandra; Badger (embedded, single-node); in-memory (dev only) |
| Jaeger Query | Serves the UI + API for trace search |
# Modern stack:
#   App → OTel SDK → OTel Collector → Jaeger Collector → Storage → Jaeger UI
# Install Jaeger on Kubernetes (Helm):
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
# Dev mode — all-in-one with in-memory storage, no external backend:
helm install jaeger jaegertracing/jaeger --namespace tracing --create-namespace \
  --set allInOne.enabled=true \
  --set storage.type=none
# Production (Elasticsearch backend):
helm install jaeger jaegertracing/jaeger --namespace tracing \
  --set storage.type=elasticsearch \
  --set storage.elasticsearch.host=elasticsearch.logging:9200
# Access Jaeger UI:
kubectl port-forward -n tracing svc/jaeger-query 16686
# Open: http://localhost:16686
2. Instrument Apps with OpenTelemetry SDK
Python and Go instrumentation sending to Jaeger
# Python (auto-instrumentation — no code changes for common frameworks):
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install # installs instrumentation for detected frameworks
OTEL_SERVICE_NAME=my-service OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 OTEL_TRACES_EXPORTER=otlp opentelemetry-instrument python app.py
# Auto-instruments: Flask, Django, FastAPI, SQLAlchemy, requests, etc.
# Python manual spans:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
endpoint="http://otel-collector:4317"
)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-service")
with tracer.start_as_current_span("my-operation") as span:
    span.set_attribute("user.id", 123)
    span.set_attribute("db.query", "SELECT * FROM orders")
    # your code here
# Go:
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

exporter, err := otlptracegrpc.New(ctx,
    otlptracegrpc.WithEndpoint("otel-collector:4317"),
    otlptracegrpc.WithInsecure())
if err != nil { /* handle error */ }
tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
otel.SetTracerProvider(tp)
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "http-request")
defer span.End()
span.SetAttributes(attribute.String("http.url", url))
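Jaeger can stitch spans from different services into one waterfall because OTel SDKs propagate the W3C Trace Context `traceparent` header on outgoing calls (format: `00-<trace-id>-<span-id>-<flags>`, flags `01` = sampled). A stdlib sketch of building and parsing that header — helper names here are illustrative, not part of any OTel API:

```python
import re
import secrets

# "traceparent: 00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>"
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a traceparent header value (illustrative helper)."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(value: str):
    """Return (trace_id, parent_span_id, sampled) or None if malformed."""
    m = TRACEPARENT_RE.match(value)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, int(flags, 16) & 0x01 == 0x01

# New trace: random IDs of the sizes the spec requires
trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex chars
span_id = secrets.token_hex(8)     # 8 bytes  -> 16 hex chars
header = build_traceparent(trace_id, span_id)
print(parse_traceparent(header))
```

The auto-instrumentation above injects and extracts this header for you; you only deal with it directly when crossing a boundary OTel doesn't instrument (e.g. a custom message queue).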
3. OTel Collector Configuration
Route spans from apps to Jaeger via the OTel Collector
# OTel Collector config (otel-collector-config.yaml):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # gRPC — apps send here
      http:
        endpoint: 0.0.0.0:4318   # HTTP — browser or simple clients
processors:
  batch:                         # batch spans before sending to Jaeger
    timeout: 5s
    send_batch_size: 1000
  resource:
    attributes:
      - action: upsert
        key: deployment.environment
        value: production
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # forward to Jaeger collector
    tls:
      insecure: true
  # Also export metrics to Prometheus alongside traces:
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]   # batch last, as recommended
      exporters: [otlp/jaeger]
    metrics:                          # prometheus exporter needs its own pipeline
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
# Deploy OTel Collector:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-collector open-telemetry/opentelemetry-collector \
  --set mode=deployment \
  --set image.repository=otel/opentelemetry-collector-contrib \
  -f values.yaml   # put the config above under the chart's `config:` key in values.yaml
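The `batch` processor above flushes on whichever comes first: the batch reaching `send_batch_size`, or `timeout` elapsing since the last flush. A minimal stdlib sketch of that logic (class name illustrative; the timer is shown as an explicit `tick()` check rather than a background thread):

```python
import time

class BatchBuffer:
    """Toy model of the collector's batch processor: flush on size or
    timeout, whichever comes first."""

    def __init__(self, send_batch_size=1000, timeout_s=5.0, export=print):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.export = export          # downstream exporter callback
        self.buf = []
        self.last_flush = time.monotonic()

    def add(self, span):
        self.buf.append(span)
        if len(self.buf) >= self.send_batch_size:   # size trigger
            self.flush()

    def tick(self):
        # called periodically; flushes a partial batch once timeout elapses
        if self.buf and time.monotonic() - self.last_flush >= self.timeout_s:
            self.flush()

    def flush(self):
        batch, self.buf = self.buf, []
        self.last_flush = time.monotonic()
        self.export(batch)
```

The real processor also bounds its in-flight queue and retries failed exports; this sketch only shows the two flush triggers the config controls.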
4. Jaeger UI — Finding Traces
Search for slow requests, errors, and trace analysis
# Jaeger UI at http://localhost:16686
# Basic search:
#   Service: select your service (e.g. "checkout-service")
#   Operation: select a specific endpoint (e.g. "POST /checkout")
#   Tags: filter by attributes (error=true, http.status_code=500, user.id=123)
#   Lookback: 1h, 3h, 1d
#   Min/Max Duration: find slow requests (e.g. minDuration: 500ms)
# Trace detail view:
#   - Waterfall chart: each span as a horizontal bar (duration + timing)
#   - Parent-child relationships show which service called which
#   - Click a span: see attributes (HTTP URL, DB query, user ID, error message)
#   - Red spans = errors (span.status.code = ERROR)
# Trace comparison (2 traces side by side):
#   Search for 2 traces, click "Compare"
#   Shows timing differences per service — useful for "why was THIS request slow?"
# Jaeger query API (for automation):
curl "http://localhost:16686/api/traces?service=checkout-service&limit=20"
curl "http://localhost:16686/api/traces?service=checkout-service&tags=%7B%22error%22%3A%22true%22%7D"   # tags={"error":"true"} URL-encoded
curl "http://localhost:16686/api/services"   # list all services
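For automation on top of the query API, you mostly work with the JSON it returns. A stdlib sketch that computes end-to-end duration per trace from a payload shaped like the `/api/traces` response (`startTime` and `duration` are in microseconds); the inline payload here is fabricated for illustration:

```python
def trace_durations_us(payload):
    """Map traceID -> end-to-end duration (µs): latest span end minus
    earliest span start across the whole trace."""
    out = {}
    for t in payload.get("data", []):
        starts = [s["startTime"] for s in t["spans"]]
        ends = [s["startTime"] + s["duration"] for s in t["spans"]]
        out[t["traceID"]] = max(ends) - min(starts)
    return out

# Fabricated response in the /api/traces shape (times in microseconds):
payload = {
    "data": [
        {
            "traceID": "abc123",
            "spans": [
                {"spanID": "s1", "operationName": "POST /checkout",
                 "startTime": 1_700_000_000_000_000, "duration": 250_000},
                {"spanID": "s2", "operationName": "SELECT orders",
                 "startTime": 1_700_000_000_050_000, "duration": 120_000},
            ],
        }
    ]
}
print(trace_durations_us(payload))  # → {'abc123': 250000}
```

In practice you'd fetch the payload with the `curl` URLs above (or `urllib.request`) and alert on traces exceeding your latency budget.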
5. Sampling & Production Tuning
Control trace volume — probabilistic, rate-limiting, and adaptive
# Sampling strategies — set in Jaeger Collector or OTel Collector:
# 1. Probabilistic (sample X% of traces):
# In OTel SDK:
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(0.1) # 10% of traces
provider = TracerProvider(sampler=sampler)
# 2. Parent-based (respect the parent's sampling decision for complete traces):
from opentelemetry.sdk.trace.sampling import ParentBased
sampler = ParentBased(root=TraceIdRatioBased(0.1))  # follow parent; sample 10% of new root traces
# Note: head sampling decides at span start and can't see errors —
# to keep all error traces, use tail-based sampling (below)
# 3. Tail-based sampling in OTel Collector (sample based on full trace outcome):
# otel-collector-config.yaml:
# (tail_sampling ships in the collector-contrib distribution)
processors:
  tail_sampling:
    decision_wait: 10s       # wait 10s for all spans before deciding
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}      # always keep error traces
      - name: slow-policy
        type: latency
        latency: {threshold_ms: 1000}             # always keep traces > 1s
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}   # sample 5% of everything else
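The policies above are OR-combined: a trace is kept if any policy votes to keep it. A stdlib sketch of that decision, made once all spans of a trace have arrived — function name and span-dict fields (`start_ms`, `duration_ms`, `error`) are illustrative, not the collector's internal types:

```python
import random

def keep_trace(spans, threshold_ms=1000, sampling_percentage=5.0,
               rng=random.random):
    """Tail-sampling decision mirroring the three policies above:
    errors always kept, slow traces always kept, the rest sampled."""
    if any(s.get("error") for s in spans):
        return True                                  # errors-policy
    start = min(s["start_ms"] for s in spans)
    end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    if end - start > threshold_ms:
        return True                                  # slow-policy
    return rng() * 100 < sampling_percentage         # probabilistic-policy
```

This is why `decision_wait` matters: the latency and error checks only work on a complete trace, so the collector must buffer spans before deciding — trading memory for smarter sampling than any head-based strategy can offer.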
# Production tips:
# - Start with 10% probabilistic, then tune based on storage costs
# - Always keep 100% of error traces
# - Always keep 100% of slow traces (> P95 latency)
# - Use parent-based sampling for consistent trace completeness
# - Jaeger Elasticsearch storage: ~1KB per span, plan storage accordingly