Skip to content

Span-Level Tracing

Span-level tracing decomposes every AI pipeline execution into individual spans — retrieval, prompt construction, inference, tool calls, postprocessing — each tagged with timing, token counts, cost, and metadata. It extends OpenTelemetry-style distributed tracing to the specific operations that matter in AI systems.

When an AI pipeline returns a slow or bad response, the first question is “which step broke?” Without span-level tracing:

  • You know total latency but not whether the bottleneck is retrieval, prompt assembly, or inference.
  • You know total token count but not how many tokens went to context vs. instructions vs. output.
  • You know the final output but not what retrieved context or intermediate reasoning led to it.
  • Debugging requires adding print statements, reproducing the issue, and manually stepping through the pipeline.

Traditional APM tools instrument HTTP calls and database queries. They do not instrument prompt construction, retrieval scoring, or token consumption — the operations that dominate AI pipeline performance.

Trace: user_query_abc123
├── Span: input_validation (2ms)
├── Span: retrieval (145ms)
│ ├── Span: embedding (35ms)
│ │ └── tokens: 24
│ ├── Span: vector_search (89ms)
│ │ └── results: 12 candidates
│ └── Span: reranking (21ms)
│ └── results: 5 selected
├── Span: prompt_construction (3ms)
│ └── token_count: 3,847
├── Span: inference (1,240ms)
│ ├── model: claude-3-5-sonnet
│ ├── input_tokens: 3,847
│ ├── output_tokens: 412
│ ├── time_to_first_token: 380ms
│ └── cost: $0.019
└── Span: output_validation (8ms)
└── pii_detected: false
Total: 1,398ms | Cost: $0.019 | Tokens: 4,283
  1. Each pipeline execution creates a trace with a unique ID.
  2. Every significant operation within the pipeline creates a child span.
  3. Spans capture operation-specific attributes: token counts for embedding calls, result counts for retrieval, model name and cost for inference.
  4. Spans are nested to show parent-child relationships.
  5. Completed traces are exported to a trace backend for visualization, querying, and alerting.
  6. Aggregated span metrics enable identifying systemic bottlenecks across many requests.
  • You are running multi-step AI pipelines (RAG, agents, chains) and need to understand per-step performance.
  • Latency or cost debugging requires more granularity than request-level metrics provide.
  • You need to answer questions like “why was this response slow?” or “which step consumed the most tokens?”
  • You are optimizing pipeline performance and need data on where time and tokens are actually spent.
  • Your AI integration is a single, direct LLM API call with no pipeline. A single API call does not benefit from span-level decomposition. Request-level metrics are sufficient.
  • You are in early prototyping where the pipeline changes daily. Instrumentation code becomes maintenance overhead that slows iteration. Add tracing when the pipeline stabilizes.
  • Your trace backend cannot handle the volume. Each request generates multiple spans. At high throughput, consider sampling rather than skipping tracing entirely.
  1. Instrumentation overhead — Adding span creation and attribute recording to every pipeline step adds code and a small latency overhead (typically sub-millisecond per span).
  2. Storage costs — Traces are verbose. A single RAG pipeline execution can generate 5-15 spans. At 10,000 requests/day, that is 50,000-150,000 spans/day. Sampling strategies are essential at scale.
  3. Schema maintenance — As the pipeline evolves, span names and attributes must be kept consistent. Stale or renamed spans create gaps in dashboards and alerts.
  4. Privacy considerations — Traces may capture prompt content and model outputs. Ensure sensitive data is redacted before export to the trace backend.
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from uuid import uuid4
@dataclass
class Span:
name: str
trace_id: str
span_id: str = field(default_factory=lambda: uuid4().hex[:16])
parent_id: str | None = None
start_time: float = 0.0
end_time: float = 0.0
attributes: dict = field(default_factory=dict)
children: list["Span"] = field(default_factory=list)
@property
def duration_ms(self) -> float:
return (self.end_time - self.start_time) * 1000
class Tracer:
def __init__(self):
self._traces: dict[str, Span] = {}
@contextmanager
def start_trace(self, name: str):
trace_id = uuid4().hex[:32]
root = Span(name=name, trace_id=trace_id)
root.start_time = time.monotonic()
self._traces[trace_id] = root
ctx = _TraceContext(root_span=root, current_span=root)
try:
yield ctx
finally:
root.end_time = time.monotonic()
def get_trace(self, trace_id: str) -> Span | None:
return self._traces.get(trace_id)
class _TraceContext:
def __init__(self, root_span: Span, current_span: Span):
self.root_span = root_span
self._current = current_span
@contextmanager
def span(self, name: str, **attributes):
child = Span(
name=name,
trace_id=self.root_span.trace_id,
parent_id=self._current.span_id,
attributes=attributes,
)
child.start_time = time.monotonic()
self._current.children.append(child)
previous = self._current
self._current = child
try:
yield child
finally:
child.end_time = time.monotonic()
self._current = previous
def set_attribute(self, key: str, value) -> None:
self._current.attributes[key] = value
def format_trace(span: Span, indent: int = 0) -> str:
prefix = " " * indent
line = f"{prefix}{span.name}: {span.duration_ms:.1f}ms"
if span.attributes:
attrs = ", ".join(f"{k}={v}" for k, v in span.attributes.items())
line += f" ({attrs})"
lines = [line]
for child in span.children:
lines.append(format_trace(child, indent + 1))
return "\n".join(lines)
async def traced_rag_pipeline(tracer: Tracer, query: str) -> dict:
with tracer.start_trace("rag_pipeline") as ctx:
ctx.set_attribute("query", query)
with ctx.span("retrieval") as retrieval_span:
with ctx.span("embedding", model="text-embedding-3-small"):
query_embedding = await embed(query)
ctx.set_attribute("tokens", len(query.split()))
with ctx.span("vector_search"):
results = await search(query_embedding, top_k=20)
ctx.set_attribute("candidates", len(results))
with ctx.span("reranking"):
reranked = await rerank(query, results, top_k=5)
ctx.set_attribute("selected", len(reranked))
with ctx.span("prompt_construction"):
prompt = build_prompt(query, reranked)
ctx.set_attribute("token_count", count_tokens(prompt))
with ctx.span("inference", model="claude-3-5-sonnet") as llm_span:
response = await generate(prompt)
ctx.set_attribute("input_tokens", response["usage"]["input_tokens"])
ctx.set_attribute("output_tokens", response["usage"]["output_tokens"])
with ctx.span("output_validation"):
validated = validate_output(response)
ctx.set_attribute("passed", validated)
return response

For production use, integrate with OpenTelemetry SDKs instead of a custom implementation. The example demonstrates the concept; real systems should use established tracing infrastructure.

ToolTypeNotes
LangfuseOpen-source LLM observabilityPurpose-built for LLM tracing with cost tracking and evaluation
Arize PhoenixOpen-sourceLLM traces with embedding drift detection
LangsmithManaged (LangChain)Deep integration with LangChain, traces chain executions
HeliconeManaged proxyRequest-level tracing at the gateway, less granular than span-level
OpenTelemetry + JaegerOpen-source infrastructureGeneric tracing adapted for AI with custom span attributes
Datadog LLM ObservabilityManaged APMExtends existing Datadog with LLM-specific spans