Agent Observability: Logging, Tracing & Debugging

TL;DR — Agent observability is not “log the prompt and response.” You need spans for every model call and tool call, the full message history at each step, token and cost per span, and a stable trace ID linking it all. Without this, a failed agent run is a black box. With it, you find the broken step in seconds. OpenTelemetry plus one tracing tool (Langfuse, LangSmith, Arize Phoenix) covers most of it.

The Black Box Problem

A user reports your agent gave a wrong answer. You open your logs and find… the final response. Nothing about which of the eight steps went sideways, what the model saw at step 4, or why it called the wrong tool. You’re debugging blind.

Agent observability exists to kill that black box. Traditional app observability assumes deterministic code: same input, same path, same output. Agents break that assumption. The same input can take a different number of steps, call different tools, and cost a different amount every run. You can’t debug what you can’t see, and an agent’s decision process is invisible unless you deliberately instrument it.

I’ve spent enough late nights staring at agent traces to have strong opinions about what to capture. Here’s the setup that actually lets you debug production agents, and the things people instrument that turn out to be noise.

What a Single Agent Run Actually Contains

Before you instrument, get the mental model right. One agent run is a tree, not a line:

flowchart TD
    T[Trace: one user request] --> S1[Span: model call #1]
    S1 --> S2[Span: tool call - search]
    S2 --> S3[Span: model call #2]
    S3 --> S4[Span: tool call - fetch]
    S4 --> S5[Span: model call #3 - final]

Each box is a span. The whole tree is a trace, tied together by a trace ID. This is the same model OpenTelemetry uses for microservices, and that’s not a coincidence: an agent is a distributed system where the “services” are model calls and tools. Once you see it this way, the tooling decisions get obvious.

The Three Layers You Must Capture

1. Structured Logs (the what)

Forget print(response). Every event should be a structured record you can query later.

import json, time, uuid

def log_event(trace_id, span_id, event_type, payload):
    record = {
        "ts": time.time(),
        "trace_id": trace_id,
        "span_id": span_id,
        "type": event_type,        # model_call | tool_call | error
        "payload": payload,
    }
    print(json.dumps(record))      # ship to your log pipeline

trace_id = str(uuid.uuid4())
log_event(trace_id, "span-1", "model_call", {
    "model": "claude-sonnet-4",
    "messages": messages,          # the FULL history the model saw
    "tools_offered": [t.name for t in tools],
    "tokens_in": 1820, "tokens_out": 240,
    "latency_ms": 1430,
})

The non-negotiable field is the full message history the model saw at that step. When an agent misbehaves, 90% of the time the cause is in the context: a stale memory got injected, a tool returned garbage that poisoned the next turn, or the system prompt got truncated. If you only log the user query and final answer, you’ll never find it.

2. Distributed Traces (the when and how long)

Logs tell you what happened; traces tell you the shape and timing. A trace shows you that step 3 took 8 seconds (a slow tool), or that the agent looped between steps 4 and 5 three times. Use OpenTelemetry spans so the data is portable across tools.

from opentelemetry import trace

tracer = trace.get_tracer("agent")

def run_agent(query):
    with tracer.start_as_current_span("agent_run") as root:
        root.set_attribute("user.query", query)
        for step in range(MAX_STEPS):
            with tracer.start_as_current_span(f"step_{step}") as span:
                resp = call_model(history)
                span.set_attribute("tokens.in", resp.usage.input)
                span.set_attribute("tokens.out", resp.usage.output)
                if not resp.tool_calls:
                    span.set_attribute("terminal", True)
                    return resp.content

3. Cost and Token Tracking (the how much)

Attach token counts to every span and your trace doubles as a cost ledger. This is where observability pays for itself directly: you find the agent that quietly takes 15 steps when it should take 3, or the tool that returns 40KB of JSON the model has to re-read on every subsequent turn. Both are invisible in aggregate dashboards and obvious in a single trace. You can’t optimize a token bill you can’t attribute, and the design patterns that control cost depend on seeing where it goes first.

Pick Your Tools

You don’t have to build this from scratch. The ecosystem in 2026 is solid.

Tool	Strength	Best for
Langfuse	Open-source, self-hostable, OTel-native	Teams wanting data ownership
LangSmith	Deep LangChain/LangGraph integration	LangChain-heavy stacks
Arize Phoenix	Strong eval + tracing combo	Teams doing offline evals too
OpenTelemetry (raw)	Vendor-neutral standard	Routing to existing infra (Grafana, Datadog)

My default: instrument with OpenTelemetry, export to one of the above. That way you’re not locked in. If you outgrow your tracing vendor, you swap the exporter, not your entire instrumentation.

What People Over-Instrument

Three things teams capture that mostly create noise:

Embedding vectors. Logging the raw 1536-dim vector for every retrieval is gigabytes of data you will never read. Log the retrieved text and scores, not the vectors.
Every token of streaming output. You want the final assembled output and the token count, not 200 individual delta events. Aggregate at the span boundary.
Verbose framework internals. LangChain and friends emit a flood of internal callbacks. Capturing all of them buries the three events that matter. Filter to model calls, tool calls, and errors.

The skill in observability isn’t capturing more, it’s capturing the right things at the right granularity. A trace you can read in 30 seconds beats one with everything that takes 30 minutes to parse.

Debugging Workflow That Actually Works

When a run fails, here’s the order that finds the bug fastest:

Open the trace, find the last good step. Where did the agent’s state stop making sense?
Read the full input to the next step. This is where it usually breaks: bad context, poisoned tool output, missing memory.
Check the tool result, not just the tool call. The agent called the right tool with the right args, but the tool returned an error string the model treated as data. Extremely common.
Look for loops. Same tool, same args, multiple times = the agent is stuck and not noticing. Your design patterns should cap this, but observability is how you catch when the cap is too high.
Compare token counts across steps. A sudden jump means context bloat, often from a tool dumping too much data.

A Note on Production Safety

Agent logs are a privacy surface. The full message history often contains user PII, internal data, and secrets that leaked into context. Before you ship structured logging, decide what gets redacted, set retention limits, and make sure your tracing vendor’s data residency matches your compliance needs. The OWASP LLM Top 10 flags sensitive-information disclosure as a primary risk: observability that quietly stores secrets in plain text is its own incident waiting to happen, especially when paired with code-execution agents that need secure sandboxing.

FAQ

Do I need a dedicated tool, or can I use my existing APM? You can route OpenTelemetry spans to Datadog or Grafana, and that’s fine for latency and errors. But agent-specific tools (Langfuse, LangSmith, Phoenix) render the message history and tool I/O in a way generic APMs don’t, which is exactly the data you need for debugging quality issues.

What’s the single most important thing to log? The full message array the model saw at each step. If you log only one thing, log that. Almost every agent bug is a context bug.

How much does observability overhead cost? Tracing adds negligible latency (microseconds per span) if you export asynchronously. The real cost is storage, which is why you filter to model calls, tool calls, and errors rather than logging everything.

How do I trace a multi-agent system? Propagate the same trace ID across agents and let each agent’s work be a child span. The trace tree then shows the full collaboration, including which sub-agent caused a failure.

Can I use this to evaluate quality, not just debug? Yes. Tools like Phoenix and Langfuse let you attach eval scores to traces, so you can run a judge model over past traces and track quality regressions over time, not just catch one-off failures.