From Keywords to Language Models: Building an AI Triage Agent from the Ground Up

An Engineer's Journey in Intent Classification Architecture: Weeks 1 & 2

The Problem

Every contact center faces the same challenge: thousands of incoming customer messages, each of which must be routed to the right team (billing, technical support, password resets, or escalation). Scale that to 10,000 daily inquiries across five intent categories (those four plus a catch-all unknown), and the architecture decisions start to matter.

I faced this problem as my capstone project. Not as a hypothetical — as production code I'd have to defend in a code review. Here's the journey I took, the mistakes I made, the upgrade I shipped, and what I learned about modern AI system design along the way.

Week 1: The Keyword Classifier — Fast, Deterministic, Fragile

The naive approach: a rule-based classifier using keyword matching.

INTENTS = {
    "password_reset": {
        "keywords": ["forgot", "password", "reset", "locked out"],
        "response": "To reset your password, visit the login page..."
    },
    "billing": {
        "keywords": ["invoice", "charge", "bill", "refund"],
        "response": "For billing questions, please log in..."
    },
}

def classify(message: str) -> str:
    lower = message.lower()
    for intent, data in INTENTS.items():
        if any(kw in lower for kw in data["keywords"]):
            return intent
    return "unknown"

It's elegant in its simplicity. No dependencies beyond FastAPI and Pydantic. Start the server, hit the endpoint, get your intent back. It is about as simple as a GET /health check, and a good starting point for getting something working before optimizing.

But look closer at the problems:

Semantic blindness — "I can't access my account, I think I've been hacked" contains none of the keywords. It's a password_reset/technical_support edge case, but the classifier sees unknown.
No confidence score — You get an intent or you don't. Zero signal for uncertainty.
No escalation detection — "This is completely unacceptable" matches escalation's keyword only if you remembered to add it.
Maintenance burden — Every edge case means another keyword. Another deploy. Another git commit -m "add more keywords".
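
The first of these problems is easy to reproduce. The sketch below reuses the two intents from the excerpt above (responses omitted) and runs the "hacked" message through it. The second message, "I forgot to pay my bill", is a made-up example added here to show a related weakness: the first intent with any matching keyword wins, even when a later intent fits better.

```python
# Two-intent excerpt of the Week 1 classifier (responses omitted for brevity).
INTENTS = {
    "password_reset": {"keywords": ["forgot", "password", "reset", "locked out"]},
    "billing": {"keywords": ["invoice", "charge", "bill", "refund"]},
}


def classify(message: str) -> str:
    lower = message.lower()
    # Dicts keep insertion order, so password_reset is always checked first.
    for intent, data in INTENTS.items():
        if any(kw in lower for kw in data["keywords"]):
            return intent
    return "unknown"


# Semantic blindness: no keyword appears anywhere in the message.
hacked = classify("I can't access my account, I think I've been hacked")

# Hypothetical message: "forgot" matches before "bill" is ever considered.
late_payment = classify("I forgot to pay my bill")

print(hacked)        # unknown
print(late_payment)  # password_reset
```

Neither result comes with any signal that something went wrong, which is exactly the silent failure described next.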

This is the state of most rule-based systems I've seen in production. They work — until they don't. And they break silently, because there's no confidence threshold telling you the system is guessing.

The first commit (6a4aa7e) shipped this. Four intents. Keyword matching. No confidence score. It worked for the happy path.

The Pivot: Why I Chose LLM Reasoning Over Better Keywords

The obvious next step would be to write a better keyword dictionary with synonyms, stemming, and regex patterns. That's the engineering instinct: optimize what you have.

But I made a different call. Instead of optimizing the rules, I replaced the entire classification engine with an LLM-based reasoning system. Here's why:

Pattern matching is brittle — every deployment environment has edge cases. An LLM handles unseen phrasing without new keywords.
Confidence is a signal, not decoration — if the model says 0.45 on an intent, that's a routing decision. The keyword classifier had nothing equivalent.
Escalation is a secondary dimension — it shouldn't be a separate intent. It's a flag on top of the primary classification. A good architecture separates them.
Model swapping is a config change — LiteLLM abstracts 100+ providers. Swap from DeepSeek to Claude Haiku with one env var. Try doing that with a keyword dict.
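
To make the second and third points concrete, here is a minimal routing sketch. It assumes the classifier hands back an intent, a confidence score, and an escalation flag. The names TriageResult, route, and REVIEW_THRESHOLD, the queue names, and the 0.7 cut-off are illustrative assumptions, not code from the project.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.7  # assumed cut-off; tune it against real traffic


@dataclass(frozen=True)
class TriageResult:
    intent: str
    confidence: float
    needs_escalation: bool


def route(result: TriageResult) -> str:
    """Return a (hypothetical) queue name for a classified message."""
    # Escalation is a flag on top of the intent, so it is checked on its own.
    if result.needs_escalation:
        return "priority_human_queue"
    # A low score or an unknown intent means the model is guessing.
    if result.intent == "unknown" or result.confidence < REVIEW_THRESHOLD:
        return "human_review"
    return f"{result.intent}_queue"


print(route(TriageResult("billing", 0.45, False)))  # human_review
print(route(TriageResult("billing", 0.93, False)))  # billing_queue
print(route(TriageResult("billing", 0.93, True)))   # priority_human_queue
```

The keyword classifier could only ever produce the equivalent of the last return statement; the first two branches need signals it never had.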

The trade-off: latency goes from sub-millisecond to 200–800ms depending on provider. The production insight is that 800ms of model time saves 5 minutes of human routing. Worth every millisecond.
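
A rough back-of-the-envelope version of that trade-off, using only the figures already quoted in this post. It assumes the worst-case 800ms on every message and that every message would otherwise need the full 5 minutes of manual routing, which is an upper bound rather than a measurement. The two totals are also different kinds of cost (customer wait time versus staff time), so treat this as an order-of-magnitude comparison.

```python
DAILY_MESSAGES = 10_000      # volume from the problem statement
MODEL_SECONDS = 0.8          # upper end of the 200-800 ms range
HUMAN_ROUTING_MINUTES = 5    # per-message figure quoted above (assumed for all messages)

model_hours = DAILY_MESSAGES * MODEL_SECONDS / 3600
human_hours = DAILY_MESSAGES * HUMAN_ROUTING_MINUTES / 60

print(f"model time per day:     {model_hours:.1f} h")   # 2.2 h
print(f"manual routing per day: {human_hours:.0f} h")   # 833 h
```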

Week 2: The LLM Classifier — Architecture Decisions

The upgrade shipped in commit b0eab80. Let me walk through the architecture.

The System Prompt

SYSTEM_PROMPT = """You are a customer support triage agent.
Classify the customer message into exactly one intent.

Intents:
- password_reset: login issues, forgotten password, account locked
- billing: payments, charges, invoices, refunds, subscriptions
- technical_support: bugs, errors, crashes, features not working
- escalation: wants manager or supervisor, filing a formal complaint
- unknown: does not fit any category above

Return ONLY valid JSON with these exact fields:
{
  "intent": "<one of the five intents above>",
  "confidence": <float between 0.0 and 1.0>,
  "needs_escalation": <true if urgent or emotionally charged>
}"""

Key architectural decisions in this prompt:

Structured output via JSON — No regex parsing. No string matching. The model emits structured data, and Pydantic validates it at runtime. If the model hallucinates an intent name, the VALID_INTENTS gate catches it and falls back to unknown.
Confidence as a first-class field — Not a hack. It's part of the response schema. In production, you'd set a threshold: confidence < 0.7 → route to human review. The architecture supports this from day one.
Escalation as a boolean flag — Separate from the intent. A billing question can be escalated if the customer is angry, while a routine password reset usually isn't. These are independent axes.
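
The validation gate described in the first point can be sketched as follows. To keep the snippet dependency-free it uses a plain dataclass and manual checks instead of Pydantic, so it is a stand-in for the idea rather than the project's code. The field names mirror the JSON schema in the prompt; the contents of VALID_INTENTS, the FALLBACK value, and the parse_output helper are assumptions.

```python
import json
from dataclasses import dataclass

# Assumed to match the five intents listed in the system prompt.
VALID_INTENTS = {"password_reset", "billing", "technical_support", "escalation", "unknown"}


@dataclass(frozen=True)
class ClassifierOutput:
    intent: str
    confidence: float
    needs_escalation: bool


FALLBACK = ClassifierOutput(intent="unknown", confidence=0.0, needs_escalation=False)


def parse_output(raw: str) -> ClassifierOutput:
    """Turn the model's raw text into a validated ClassifierOutput."""
    try:
        data = json.loads(raw)
        intent = data["intent"]
        confidence = float(data["confidence"])
        needs_escalation = data["needs_escalation"]
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return FALLBACK  # malformed or incomplete JSON
    if not isinstance(needs_escalation, bool) or not 0.0 <= confidence <= 1.0:
        return FALLBACK  # wrong type or out-of-range score
    if not isinstance(intent, str) or intent not in VALID_INTENTS:
        intent = "unknown"  # hallucinated intent name
    return ClassifierOutput(intent, confidence, needs_escalation)


good = parse_output('{"intent": "billing", "confidence": 0.91, "needs_escalation": false}')
made_up = parse_output('{"intent": "shipping", "confidence": 0.88, "needs_escalation": false}')
broken = parse_output("Sure! The intent is billing.")
print(good.intent, made_up.intent, broken.intent)  # billing unknown unknown
```

Whatever the model returns, the caller always receives a well-formed object, which keeps the failure handling in one place.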

The LiteLLM Layer

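import os

import litellm

def classify(message: str) -> ClassifierOutput:
    # The model name comes from the environment, so swapping providers is a config change.
    model = os.getenv("LLM_MODEL", "deepseek/deepseek-chat")
    llm_response = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
        temperature=0,  # keep classification output as repeatable as possible
    )
    # Excerpt: parsing llm_response into a ClassifierOutput is omitted here.
    ...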

The switch to LiteLLM (litellm==1.40.28) was deliberate. It decouples the application from any single provider. One env var changes the model. In a Gulf Cooperation Council (GCC) context, where Arabic NLP models from different vendors need evaluation, this abstraction is critical: you can A/B test Mistral Arabic, DeepSeek, and GPT on the same codebase.

One gotcha: LiteLLM 1.40 dropped response_format support for some providers, which surfaced during integration. The fix (6ca95d3) was to remove the API-level parameter and let the system prompt handle JSON formatting. This is the kind of integration detail that only shows up when you actually ship, and because the fix relies only on the prompt, it does not depend on provider-specific parameter support.
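
Relying on the prompt alone has one practical consequence: nothing at the API level guarantees that the reply is bare JSON. Some models wrap the object in a code fence or add a sentence around it. That behaviour is an assumption here, not something observed in this project. A tolerant extraction step is a cheap safeguard, and the sketch below shows one way to write it. The extract_json_object helper is hypothetical.

```python
import json


def extract_json_object(text: str) -> dict:
    """Best-effort extraction of a single JSON object from model text.

    Raises ValueError (json.JSONDecodeError is a subclass) when nothing usable is found.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # Everything outside the outermost braces (fences, chatter) is ignored.
    return json.loads(text[start : end + 1])


payload = '{"intent": "billing", "confidence": 0.9, "needs_escalation": false}'
fence = "`" * 3  # built this way to avoid a literal code fence inside this example

from_fence = extract_json_object(fence + "json\n" + payload + "\n" + fence)
from_chatter = extract_json_object("Here is the result: " + payload + " Hope that helps!")

print(from_fence["intent"], from_chatter["confidence"])  # billing 0.9
```

The extracted dict still needs schema validation afterwards; this step only finds the JSON, it does not vouch for its contents.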

The Eval Harness

@pytest.mark.parametrize("case", TEST_DATA, ids=lambda c: c["message"][:50])
def test_classify(case):
    result = classify(case["message"])
    correct = result.intent == case["expected_intent"]
    ...

The test harness runs 12 parametrized cases and prints an accuracy table per intent. This is the foundation of observability-first engineering — before optimizing, you measure:

── Accuracy by Intent ─────────────────────────
  billing              ████  4/4
  escalation           ████  4/4
  password_reset       ████  3/3
  technical_support    ████  3/3
  TOTAL                12/12  (100%)

100% on 12 curated test cases. That's the baseline. The next step is expanding to edge cases, Arabic code-switching inputs, and adversarial examples. But the harness is there. The measurement culture is established.
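
For readers who want to build a similar report, here is a minimal sketch of how a per-intent accuracy table can be computed. It is not the project's harness: accuracy_by_intent, format_report, the stub classifier, and the four sample cases are all made up for illustration, and the stub returns a plain intent string rather than a ClassifierOutput. The stub is deliberately weak so the table shows a miss.

```python
from collections import defaultdict
from typing import Callable


def accuracy_by_intent(
    cases: list[dict], classify_fn: Callable[[str], str]
) -> dict[str, tuple[int, int]]:
    """Return {expected_intent: (correct, total)} for the labelled cases."""
    tally: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for case in cases:
        expected = case["expected_intent"]
        tally[expected][1] += 1
        if classify_fn(case["message"]) == expected:
            tally[expected][0] += 1
    return {intent: (c, t) for intent, (c, t) in sorted(tally.items())}


def format_report(scores: dict[str, tuple[int, int]]) -> str:
    lines = []
    for intent, (correct, total) in scores.items():
        bar = "█" * correct + "░" * (total - correct)
        lines.append(f"  {intent:<20} {bar}  {correct}/{total}")
    all_correct = sum(c for c, _ in scores.values())
    all_total = sum(t for _, t in scores.values())
    lines.append(f"  {'TOTAL':<20} {all_correct}/{all_total}  ({all_correct / all_total:.0%})")
    return "\n".join(lines)


def stub_classify(message: str) -> str:
    """Hypothetical stand-in for the real classifier; knows only two intents."""
    lower = message.lower()
    if "charged" in lower or "refund" in lower:
        return "billing"
    if "password" in lower:
        return "password_reset"
    return "unknown"


SAMPLE_CASES = [  # made-up cases, not the project's TEST_DATA
    {"message": "I forgot my password", "expected_intent": "password_reset"},
    {"message": "I was charged twice", "expected_intent": "billing"},
    {"message": "Refund my last invoice", "expected_intent": "billing"},
    {"message": "The app crashes on launch", "expected_intent": "technical_support"},
]

scores = accuracy_by_intent(SAMPLE_CASES, stub_classify)
report = format_report(scores)
print(report)
```

Passing the classifier in as a function makes it straightforward to run the same cases against a different model or a stub, which is what makes the report useful as a regression gate.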

Engineering Decisions That Matter

A few choices that shaped the architecture beyond the obvious:

temperature=0 — Not optional. For classification, you want output that is as repeatable as possible, and any creativity from the model is noise. This is a production maxim: use temperature for generation, shut it off for classification.

Pydantic double-validation — The response model TriageResponse validates the classifier output before it reaches the API layer. If the model returns malformed JSON, the error is caught in the classifier, not at the endpoint. Defensive layering matters.

Validation on the request side — raise HTTPException(status_code=422, detail="message cannot be empty") — a 3-line guard that prevents downstream costs. An empty string costs the same API call as a valid message. Gate early.

Model-agnostic .env — The .env.example documents both DeepSeek and Claude Haiku paths. A junior engineer cloning the repo has the mental model: "I can swap models without touching code." That's the design principle in action.

What This Means for Junior Engineers

If you're a software tester or QA engineer looking to move into AI engineering, here's the actionable path this project demonstrates:

Ship something simple first — The keyword classifier was 60 lines. It ran. It taught me the API shape, the deployment flow, the testing pattern. Don't start with LLMs — start with something that works and level up.
Measurement before optimization — The eval harness came after I proved the classifier worked. Not before. Measure the baseline, then improve. Perfectionism is the enemy of shipped code.
Architecture decisions compound — LiteLLM, Pydantic, structured JSON output — each choice makes the next upgrade cheaper. The Langfuse observability integration coming in Week 3? It's a decorator away because the architecture already separates concerns cleanly.
Production thinking from day one — Input validation, confidence thresholds, model provider abstraction — these aren't "nice to haves" for a blog project. They're the difference between a demo and a deployable system.

The Road Ahead: Week 3 and Beyond

The foundation is laid. The architecture is clean, tested, and model-agnostic. Week 3 targets observability — Langfuse tracing on every /triage request, latency breakdowns, cost tracking per model provider, and an eval gate that runs before deployment.

The longer arc — building an Arabic NLP evaluation benchmark, integrating real-time voice pipelines, and deploying MCP servers for tool-based contact center workflows — is mapped in the roadmap. Each week adds a production-grade capability to a system that already routes real traffic.

Why This Matters for Saudi Arabia's AI Future

Every component in this architecture — eval harness, structured output, observability-ready design, provider abstraction — maps directly to the skills the Kingdom needs for its contact center digital transformation. Saudi Vision 2030's digital government pillar requires systems that are measurable, auditable, and improvable. Not black-box classifiers, but transparent ones with confidence scores, test coverage, and traceability.

This project is my contribution to that ecosystem. Built in Riyadh, open-source, documented in English and Arabic-accessible terms. A reference architecture that a junior engineer at STC, Aramco Digital, or Mozn can clone, study, and deploy.

The AI Engineer Architect isn't defined by a title; it's defined by the decisions made when no one is watching. Setting temperature=0 for classification. Defensive validation on empty inputs. A test harness that tells you where your model fails before your customers do.

That's the standard I'm building to. Every shipped component is another brick in the Kingdom's AI infrastructure. The work continues.