Figure 1: Week 3 architecture — Langfuse tracing, LiteLLM proxy, and promptfoo eval
From Building to Observing: Why Tracing Your AI Agent Changes Everything
Week 3 of the AI Triage Agent journey — adding Langfuse observability, a model proxy, and an eval suite
In the first post, I walked through replacing a fragile keyword classifier with LLM reasoning. The agent could classify intents, but I had no idea what it was actually doing inside. How long did each call take? How many tokens did it burn? Was the model getting better or worse over time?
Week 3 is about answering those questions — and building the observability foundation that every production AI system needs.
The Architecture: Agent State vs Conversation Memory
Before adding observability, I stepped back to clarify the architecture. A chatbot AI involves two distinct memory systems that must never be conflated:
Agent State (per-prompt, ephemeral)
Each prompt creates a private execution context. The agent stores its execution plan, tool inputs, tool outputs, and LLM I/O in this state. Once a response is generated, the state is discarded. It is never shared between agents — even in a multi-agent system, each agent gets its own private state.
Conversation Memory (per-session, persistent)
The chatbot orchestrator maintains a separate memory — the user's prompt history and the agent's final responses. This persists across turns and survives session restarts. It is provided as input to agents when needed, never as shared memory.
Key insight: Agent state is private and ephemeral. Conversation memory is persistent and owned by the orchestrator, which passes it to agents as input rather than sharing it. Confusing the two leads to memory leaks, context pollution, and hard-to-debug behavior.
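To make the separation concrete, here is a minimal sketch. The names `AgentState`, `ConversationMemory`, `Orchestrator`, and `run_agent` are hypothetical and do not come from the project code. The agent is a placeholder, and the memory lives in a plain dict, so real persistence across restarts is left out:

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Private, per-prompt scratchpad. Created for one request, then dropped."""
    prompt: str
    history: tuple[str, ...]  # read-only snapshot passed in as input
    tool_calls: list[dict] = field(default_factory=list)


@dataclass
class ConversationMemory:
    """Per-session record of user prompts and final responses only."""
    turns: list[tuple[str, str]] = field(default_factory=list)


def run_agent(state: AgentState) -> str:
    """Placeholder for the real agent; records one fake tool call."""
    state.tool_calls.append({"tool": "classify", "input": state.prompt})
    return f"handled: {state.prompt} (turn {len(state.history) + 1})"


class Orchestrator:
    def __init__(self) -> None:
        self.sessions: dict[str, ConversationMemory] = {}

    def handle(self, session_id: str, prompt: str) -> str:
        memory = self.sessions.setdefault(session_id, ConversationMemory())
        # The agent receives a copy of the history, never the memory object.
        snapshot = tuple(f"{user} -> {reply}" for user, reply in memory.turns)
        state = AgentState(prompt=prompt, history=snapshot)
        reply = run_agent(state)
        # Only the final response is persisted; `state` goes out of scope here.
        memory.turns.append((prompt, reply))
        return reply
```

Calling `handle("s1", ...)` twice leaves two turns in the session's memory, while the tool calls recorded in each `AgentState` are gone once the call returns.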
Langfuse: Making the Invisible Visible
I integrated Langfuse — an open-source observability platform — to trace every layer of this architecture:
@observe()
def classify(message: str) -> ClassifierOutput:
    llm_response = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
        temperature=0,
    )
    ...
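
# The route decorator goes outermost so FastAPI registers the traced function;
# with @observe() on top, the route would point at the unwrapped handler.
@app.post("/triage", response_model=TriageResponse)
@observe()
def triage(request: TriageRequest):
    if request.session_id:
        # Attach the session ID to every span created inside this block.
        with propagate_attributes(session_id=request.session_id):
            result = classify(request.message)
    else:
        result = classify(request.message)
    return TriageResponse(...)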
Agent state is tracked via @observe() spans on the classify() function. Each LLM call creates a span capturing input, output, and execution metrics. The span is open for one request and closes when that request ends, mirroring the ephemeral agent state pattern (the recorded trace itself stays in Langfuse).
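As a rough mental model of what a span-recording decorator does, here is a toy version. It is not Langfuse's implementation: `traced`, `SPANS`, and `classify_stub` are hypothetical names, and the in-memory list stands in for a tracing backend.

```python
import functools
import time

SPANS: list[dict] = []  # stand-in for a tracing backend


def traced(func):
    """Toy decorator: record input, output, and latency of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        span = {"name": func.__name__, "input": {"args": args, "kwargs": kwargs}}
        try:
            span["output"] = func(*args, **kwargs)
            return span["output"]
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call leaves a span.
            span["latency_s"] = time.perf_counter() - start
            SPANS.append(span)
    return wrapper


@traced
def classify_stub(message: str) -> str:
    # Hypothetical keyword rule standing in for the LLM call.
    return "billing" if "invoice" in message.lower() else "other"
```

Each call to `classify_stub` appends exactly one record to `SPANS`, which is the per-request granularity described above.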
Conversation memory is tracked via Langfuse sessions. When a session_id is passed in the request body, all traces sharing that ID are grouped into one Langfuse session, representing a full conversation across multiple turns.
Three requests sharing the same session_id now appear as one conversation in the Langfuse dashboard, with per-request latency, token usage, and cost visible at a glance.
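Conceptually, the session view is a group-by on session_id followed by a roll-up. The sketch below illustrates that idea only: the records and their field names (`latency_ms`, `total_tokens`) are made up for the example and are not real measurements or Langfuse's data model.

```python
from collections import defaultdict

# Made-up trace records, for illustration only.
traces = [
    {"session_id": "abc", "latency_ms": 420, "total_tokens": 180},
    {"session_id": "abc", "latency_ms": 390, "total_tokens": 165},
    {"session_id": None, "latency_ms": 510, "total_tokens": 200},
    {"session_id": "abc", "latency_ms": 450, "total_tokens": 190},
]


def group_by_session(records):
    """Group traces that share a session_id; traces without one are skipped."""
    sessions = defaultdict(list)
    for record in records:
        if record["session_id"] is not None:
            sessions[record["session_id"]].append(record)
    return dict(sessions)


def summarize(session_traces):
    """Roll one session's traces up into conversation-level totals."""
    return {
        "turns": len(session_traces),
        "total_latency_ms": sum(t["latency_ms"] for t in session_traces),
        "total_tokens": sum(t["total_tokens"] for t in session_traces),
    }
```

Here the three traces tagged `"abc"` collapse into a single three-turn conversation, while the untagged request stays outside any session.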
The LiteLLM Proxy: Cost Tracking and Model Routing
The next layer was a LiteLLM proxy — a lightweight gateway that sits between the agent and the LLM provider:
model_list:
  - model_name: cheap-classifier
    litellm_params:
      model: deepseek/deepseek-chat
      rpm: 60
      cost_per_token: 0.00000014  # ~$0.14/1M tokens
  - model_name: expensive-fallback
    litellm_params:
      model: claude-haiku-4-5-20251001
      cost_per_token: 0.000003  # ~$3/1M tokens
The proxy gives me:
- Cost per request: every token logged with a dollar amount
- Model routing: DeepSeek as primary (cheap), Claude Haiku as fallback (expensive)
- Single endpoint: the agent talks to localhost:4000 regardless of provider
- Langfuse integration: proxy logs flow directly into the same traces
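The first two points can be sketched in a few lines. This is an illustration under assumptions, not the proxy's actual logic: it uses the flat per-token rates from the config above (providers usually price input and output tokens separately), it assumes "fallback" means trying the next model when a call raises, and `request_cost`, `complete_with_fallback`, and the `call` argument are hypothetical names.

```python
# Per-token rates copied from the config above (USD per token).
RATES = {
    "cheap-classifier": 0.00000014,
    "expensive-fallback": 0.000003,
}


def request_cost(model_name: str, total_tokens: int) -> float:
    """Flat-rate estimate: tokens multiplied by the model's per-token price."""
    return total_tokens * RATES[model_name]


def complete_with_fallback(call, message,
                           models=("cheap-classifier", "expensive-fallback")):
    """Try each model in order; return (model_name, result) from the first success.

    `call` is any function taking (model_name, message), injected so the
    routing logic can be exercised without a network connection.
    """
    last_error = None
    for model_name in models:
        try:
            return model_name, call(model_name, message)
        except Exception as exc:  # a real router would catch narrower errors
            last_error = exc
    raise RuntimeError("all models failed") from last_error
```

At these rates, a million tokens costs about $0.14 on the primary and about $3 on the fallback, which is why the cheap model is tried first.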
The agent code changed by only a few lines to route through the proxy:
model = os.getenv("LLM_MODEL", "cheap-classifier")
api_base = os.getenv("LITELLM_PROXY_URL", None)
api_key = os.getenv("LITELLM_MASTER_KEY", None)
litellm.completion(
    model=model,
    api_base=api_base,
    api_key=api_key,
    ...
)
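If LITELLM_PROXY_URL is unset, api_base stays None and the agent calls the LLM provider directly, so the proxy is optional rather than a hard dependency. This switch is configuration-driven: the snippet does not detect a proxy that is configured but down, and in direct mode LLM_MODEL has to name a real provider model instead of the cheap-classifier alias.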
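The two modes can be made explicit with a small helper. This is a sketch under assumptions: `resolve_llm_kwargs` is a hypothetical function, it only builds keyword arguments and never calls the LLM, and it performs no health check on the proxy.

```python
import os


def resolve_llm_kwargs(env=os.environ) -> dict:
    """Build keyword arguments for the completion call from environment variables."""
    kwargs = {"model": env.get("LLM_MODEL", "cheap-classifier")}
    proxy_url = env.get("LITELLM_PROXY_URL")
    if proxy_url:
        # Proxy mode: send the request to the gateway, authenticated with its key.
        kwargs["api_base"] = proxy_url
        kwargs["api_key"] = env.get("LITELLM_MASTER_KEY")
    # Direct mode: no api_base is set, so the client library picks the endpoint.
    return kwargs
```

Passing a plain dict instead of `os.environ` makes both modes easy to check in a unit test without touching the real environment.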
The Eval Suite: 83% Baseline and Growing
Observability tells you what happened. Eval tells you how well it worked.
I added promptfoo — an eval framework that runs all 12 test cases against the classifier and generates an HTML report:
promptfoo eval -c tests/promptfooconfig.yaml
Baseline results: 10/12 passing (83%).
Two failures were genuine LLM edge cases — messages where the model interpreted intent differently than the labeled expectation. This is the value of evals: not just passing or failing, but understanding where your model struggles.
The 83% baseline will improve as I expand the eval suite and tune the system prompt. More importantly, every future change — new model, new prompt, new routing logic — can be measured against this baseline. Regressions are caught immediately.
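Measuring against a baseline boils down to a pass-rate comparison. The sketch below shows that arithmetic in plain Python rather than promptfoo itself; the function names are hypothetical, and apart from "billing" the intent labels in the usage are invented for illustration.

```python
BASELINE_PASS_RATE = 10 / 12  # the 83% baseline reported above


def pass_rate(results):
    """results: list of (expected_intent, predicted_intent) pairs."""
    passed = sum(1 for expected, predicted in results if expected == predicted)
    return passed / len(results)


def failures(results):
    """Return (index, expected, predicted) for every mismatched case."""
    return [(i, e, p) for i, (e, p) in enumerate(results) if e != p]


def is_regression(current_rate, baseline=BASELINE_PASS_RATE, tolerance=1e-9):
    """True when the current run scores below the stored baseline."""
    return current_rate < baseline - tolerance
```

A run with 10 of 12 matches reproduces the baseline, while 9 of 12 would be flagged. The `failures` list is the part worth reading, since it shows which messages the model interprets differently from the labels.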
Engineering Decisions
Why Langfuse v4 OTel instead of the litellm callback?
The litellm callback for Langfuse had a version conflict with the Langfuse SDK v4. Rather than pinning an older SDK, I switched to Langfuse's native @observe() decorator, which uses OpenTelemetry under the hood. This keeps the agent on the current SDK and on OpenTelemetry, a vendor-neutral tracing standard.
Why a proxy for a single-agent system?
In direct mode the proxy is bypassed entirely, so it adds no overhead; when it is enabled, cost tracking and model routing become available. Going multi-agent in Week 5 will make the proxy essential: every agent talks to one endpoint, and the proxy handles rate limiting, cost allocation, and fallback centrally.
Why JavaScript assertions in promptfoo?
The HTTP provider parses JSON responses automatically. Using javascript assertions (output.intent === 'billing') is cleaner and more debuggable than JSON schema matching.
Why This Matters for Saudi Arabia's AI Future
Saudi Vision 2030 identifies AI as a cornerstone of economic diversification. The Kingdom needs engineers who can build production AI systems — not just prototypes, but systems that can be monitored, measured, and improved over time.
Observability is not a nice-to-have. It is the difference between a demo and a deployed system. This week's work — tracing, cost tracking, eval baselines — is what separates a portfolio project from a production architecture.
Every request is now traced. Every token is accounted for. Every regression is caught before it reaches users.
Week 3 is done. Week 4 starts now.
Built with FastAPI, LiteLLM, Langfuse, and promptfoo. Deployed across DeepSeek and Claude Haiku.
github.com/Yashamsan/ai-triage-agent
