Figure 1: Week 3 architecture — Langfuse tracing, LiteLLM proxy, and promptfoo eval

From Building to Observing: Why Tracing Your AI Agent Changes Everything

Week 3 of the AI Triage Agent journey — adding Langfuse observability, a model proxy, and an eval suite


In the first post, I walked through replacing a fragile keyword classifier with LLM reasoning. The agent could classify intents, but I had no idea what it was actually doing inside. How long did each call take? How many tokens did it burn? Was the model getting better or worse over time?

Week 3 is about answering those questions — and building the observability foundation that every production AI system needs.

The Architecture: Agent State vs Conversation Memory

Before adding observability, I stepped back to clarify the architecture. An AI chatbot involves two distinct memory systems that must never be conflated:

Agent State (per-prompt, ephemeral)

Each prompt creates a private execution context. The agent stores its execution plan, tool inputs, tool outputs, and LLM I/O in this state. Once a response is generated, the state is discarded. It is never shared between agents — even in a multi-agent system, each agent gets its own private state.

Conversation Memory (per-session, persistent)

The chatbot orchestrator maintains a separate memory — the user's prompt history and the agent's final responses. This persists across turns and survives session restarts. It is provided as input to agents when needed, never as shared memory.

Key insight: Agent state is private and ephemeral. Conversation memory is persistent and owned by the orchestrator, which passes it to agents as input rather than letting them share it. Confusing the two leads to memory leaks, context pollution, and hard-to-debug behavior.
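
To make the separation concrete, here is a minimal sketch in plain Python. The names AgentState, ConversationMemory, run_agent, and handle_turn are hypothetical and not taken from the project, and the agent body is a stub standing in for the real classifier.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Private, per-prompt scratchpad: created for one request, then dropped."""
    prompt: str
    history: tuple[str, ...]  # snapshot of prior turns, passed in as input
    tool_calls: list[dict] = field(default_factory=list)
    llm_io: list[dict] = field(default_factory=list)


@dataclass
class ConversationMemory:
    """Per-session record owned by the orchestrator: prompts and final responses."""
    turns: list[tuple[str, str]] = field(default_factory=list)

    def as_input(self) -> tuple[str, ...]:
        # Hand the agent an immutable snapshot rather than the live list.
        return tuple(f"user: {p} | agent: {r}" for p, r in self.turns)


def run_agent(state: AgentState) -> str:
    # Stub agent: records its (fake) LLM I/O in private state and returns a response.
    response = f"handled: {state.prompt}"
    state.llm_io.append({"input": state.prompt, "output": response})
    return response


def handle_turn(memory: ConversationMemory, prompt: str) -> str:
    state = AgentState(prompt=prompt, history=memory.as_input())  # fresh state per prompt
    response = run_agent(state)
    memory.turns.append((prompt, response))  # only the final response persists
    return response  # `state` goes out of scope here and is discarded


memory = ConversationMemory()
handle_turn(memory, "I was double charged")
handle_turn(memory, "Can I get a refund?")
```

After two turns the memory holds two (prompt, response) pairs, while neither AgentState instance is reachable any more. The agent never receives the live memory object, only a snapshot of it.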

Langfuse: Making the Invisible Visible

I integrated Langfuse — an open-source observability platform — to trace every layer of this architecture:

import litellm
from langfuse import observe, propagate_attributes

@observe()
def classify(message: str) -> ClassifierOutput:
    llm_response = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
        temperature=0,
    )
    ...

# @app.post must be the outermost decorator so FastAPI registers the traced function.
@app.post("/triage", response_model=TriageResponse)
@observe()
def triage(request: TriageRequest):
    if request.session_id:
        with propagate_attributes(session_id=request.session_id):
            result = classify(request.message)
    else:
        result = classify(request.message)
    return TriageResponse(...)

Agent state is tracked via @observe() spans on the classify() function. Each LLM call creates a span capturing input, output, and execution metrics. The span opens and closes within a single request, mirroring the lifetime of ephemeral agent state, while the recorded trace stays in Langfuse for later inspection.
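
To illustrate what a span-per-call decorator does conceptually, here is a small standard-library sketch. It is not how Langfuse implements @observe(); the traced decorator and the in-memory SPANS list are hypothetical stand-ins, and the classify() body is a stub rather than an LLM call.

```python
import functools
import time

SPANS: list[dict] = []  # hypothetical in-memory sink; a real tracer exports to a backend


def traced(func):
    """Record one span per call: name, input, output or error, and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = {"name": func.__name__, "input": {"args": args, "kwargs": kwargs}}
        start = time.perf_counter()
        try:
            span["output"] = func(*args, **kwargs)
            return span["output"]
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call leaves a span behind.
            span["duration_s"] = time.perf_counter() - start
            SPANS.append(span)
    return wrapper


@traced
def classify(message: str) -> str:
    # Stand-in for the LLM call; the keyword check only exists to return something.
    return "billing" if "charge" in message.lower() else "other"


classify("Why was I charged twice?")
```

Each call appends exactly one record, whether the function returns or raises, which is the same per-request lifetime the spans above have.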

Conversation memory is tracked via Langfuse sessions. When a session_id is passed in the request body, all traces sharing that ID are grouped into one Langfuse session — representing a full conversation across multiple turns.

Three requests sharing the same session_id now appear as one conversation in the Langfuse dashboard, with per-request latency, token usage, and cost visible at a glance.
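
As a rough sketch of the client side, the snippet below builds three POST /triage requests that share one session_id. The message and session_id field names come from the endpoint above; the base URL, the helper name build_triage_request, and the example messages are assumptions made for illustration.

```python
import json
import uuid
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: where the FastAPI app is served


def build_triage_request(message: str, session_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST /triage request carrying the session_id."""
    body = json.dumps({"message": message, "session_id": session_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/triage",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


session_id = str(uuid.uuid4())  # one ID for the whole conversation
messages = ["I was charged twice", "It was on my May invoice", "Can I get a refund?"]
requests_to_send = [build_triage_request(m, session_id) for m in messages]

# To send them against a running server:
# for req in requests_to_send:
#     with urllib.request.urlopen(req) as resp:
#         print(json.load(resp))
```

Generating the ID once on the client and reusing it for every turn is what lets the traces be grouped into a single session.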

The LiteLLM Proxy: Cost Tracking and Model Routing

The next layer was a LiteLLM proxy — a lightweight gateway that sits between the agent and the LLM provider:

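model_list:
  - model_name: cheap-classifier
    litellm_params:
      model: deepseek/deepseek-chat
      rpm: 60
      # LiteLLM custom pricing is set separately for input and output tokens
      input_cost_per_token: 0.00000014    # ~$0.14/1M tokens
      output_cost_per_token: 0.00000014

  - model_name: expensive-fallback
    litellm_params:
      model: claude-haiku-4-5-20251001
      input_cost_per_token: 0.000003      # ~$3/1M tokens
      output_cost_per_token: 0.000003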

The proxy gives me:

  • Cost per request — every token logged with a dollar amount
  • Model routing — DeepSeek as primary (cheap), Claude Haiku as fallback (expensive)
  • Single endpoint — the agent talks to localhost:4000 regardless of provider
  • Langfuse integration — proxy logs flow directly into the same traces
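
The cost-per-request arithmetic is simple enough to sketch by hand. The example below applies the flat per-token prices quoted in the config to hypothetical token counts; the function name request_cost and the counts are illustrative. Providers usually bill input and output tokens at different rates, so a flat rate is a simplification.

```python
from decimal import Decimal

# Flat per-token prices as quoted in the proxy config (USD per token).
PRICE_PER_TOKEN = {
    "cheap-classifier": Decimal("0.00000014"),
    "expensive-fallback": Decimal("0.000003"),
}


def request_cost(model_name: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Cost of one request under a single flat per-token price."""
    total_tokens = prompt_tokens + completion_tokens
    return PRICE_PER_TOKEN[model_name] * total_tokens


# Hypothetical token counts for one classification call.
cheap = request_cost("cheap-classifier", prompt_tokens=400, completion_tokens=50)
fallback = request_cost("expensive-fallback", prompt_tokens=400, completion_tokens=50)
print(cheap, fallback)
```

Decimal avoids the float rounding noise that creeps in when many very small per-token amounts are summed.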

The agent code changed by only a few lines to support this:

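import os
import litellm

model = os.getenv("LLM_MODEL", "cheap-classifier")
api_base = os.getenv("LITELLM_PROXY_URL")   # None when no proxy is configured
api_key = os.getenv("LITELLM_MASTER_KEY")   # None when no proxy is configured
litellm.completion(
    model=model,
    api_base=api_base,
    api_key=api_key,
    # ... messages and temperature as in classify()
)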

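If LITELLM_PROXY_URL is unset, api_base is None and LiteLLM calls the provider directly, so the agent keeps working without the proxy. In that mode LLM_MODEL has to name a real provider model (for example deepseek/deepseek-chat), because aliases like cheap-classifier exist only in the proxy config. A proxy that is configured but unreachable will still fail the request.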
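
One way to make the proxy/direct switch explicit is to build the routing arguments in a single helper. This is a sketch, not the project's code: completion_kwargs and the DIRECT_LLM_MODEL variable are hypothetical, while LLM_MODEL, LITELLM_PROXY_URL, and LITELLM_MASTER_KEY are the variables used above.

```python
import os


def completion_kwargs(env=None) -> dict:
    """Choose model, api_base, and api_key depending on whether a proxy is configured."""
    env = os.environ if env is None else env
    proxy_url = env.get("LITELLM_PROXY_URL")
    if proxy_url:
        # Proxy mode: the alias is sent to the proxy endpoint.
        return {
            "model": env.get("LLM_MODEL", "cheap-classifier"),
            "api_base": proxy_url,
            "api_key": env.get("LITELLM_MASTER_KEY"),
        }
    # Direct mode: no api_base, so the model must be a real provider model name.
    # DIRECT_LLM_MODEL is a hypothetical variable introduced for this sketch.
    return {"model": env.get("DIRECT_LLM_MODEL", "deepseek/deepseek-chat")}


print(completion_kwargs({}))
print(completion_kwargs({"LITELLM_PROXY_URL": "http://localhost:4000"}))
```

The returned dict can be unpacked into the completion call alongside the messages. Keeping the alias and the direct model name in separate variables avoids sending a proxy-only alias straight to a provider.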

The Eval Suite: 83% Baseline and Growing

Observability tells you what happened. Eval tells you how well it worked.

I added promptfoo — an eval framework that runs all 12 test cases against the classifier and generates an HTML report:

promptfoo eval -c tests/promptfooconfig.yaml

Baseline results: 10/12 passing (83%).

Both failures were genuine LLM edge cases: messages where the model read the intent differently from the labeled expectation. This is the value of evals: not just a pass/fail count, but a view of where your model struggles.

The 83% baseline should improve as I expand the eval suite and tune the system prompt. More importantly, every future change (new model, new prompt, new routing logic) can be measured against this baseline, so a regression shows up the next time the suite runs.
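
Measuring against a baseline comes down to a few lines of arithmetic. The sketch below is independent of promptfoo; the helper names and the intent labels in the sample run are placeholders, chosen only to reproduce a 10-out-of-12 result.

```python
def pass_rate(results: list[tuple[str, str]]) -> float:
    """Fraction of cases where the predicted intent equals the expected one."""
    passed = sum(1 for expected, actual in results if expected == actual)
    return passed / len(results)


def failures(results: list[tuple[str, str]]) -> list[tuple[int, str, str]]:
    """Index, expected, and actual intent for every mismatch."""
    return [(i, e, a) for i, (e, a) in enumerate(results) if e != a]


def is_regression(new_rate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True when the new pass rate falls below the baseline by more than the tolerance."""
    return new_rate < baseline - tolerance


# Hypothetical run: 12 (expected, actual) pairs, 10 of which match.
results = [("billing", "billing")] * 10 + [("billing", "technical"), ("account", "billing")]
BASELINE = 10 / 12

rate = pass_rate(results)
print(f"{rate:.0%}", failures(results), is_regression(rate, BASELINE))
```

Listing the mismatches alongside the rate keeps the focus on which cases fail, not only on how many.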

Engineering Decisions

Why Langfuse v4 OTel instead of the litellm callback?
The litellm callback for Langfuse had a version conflict with the Langfuse SDK v4. Rather than pinning an older SDK, I switched to Langfuse's native @observe() decorator, which uses OpenTelemetry under the hood. Building on OpenTelemetry also means the traces follow a vendor-neutral standard.

Why a proxy for a single-agent system?
The proxy adds zero overhead when it's not needed (direct mode), and cost tracking + model routing become available when it is. Going multi-agent in Week 5 will make the proxy essential — every agent talks to one endpoint, and the proxy handles rate limiting, cost allocation, and fallback centrally.

Why JavaScript assertions in promptfoo?
The HTTP provider parses JSON responses automatically. Using javascript assertions (output.intent === 'billing') is cleaner and more debuggable than JSON schema matching.

Why This Matters for Saudi Arabia's AI Future

Saudi Vision 2030 identifies AI as a cornerstone of economic diversification. The Kingdom needs engineers who can build production AI systems — not just prototypes, but systems that can be monitored, measured, and improved over time.

Observability is not a nice-to-have. It is the difference between a demo and a deployed system. This week's work — tracing, cost tracking, eval baselines — is what separates a portfolio project from a production architecture.

Every request is now traced. Every token is accounted for. Every regression is caught before it reaches users.

Week 3 is done. Week 4 starts now.


Built with FastAPI, LiteLLM, Langfuse, and promptfoo. Deployed across DeepSeek and Claude Haiku.
github.com/Yashamsan/ai-triage-agent