Eval Harness Demo

See what an LLM eval actually is: five support-ticket classification cases and two prompt versions, scored side by side with hybrid scoring (exact match first, LLM judge as a fallback). Edit any case or either prompt in the browser and re-run.

Test cases (5)

Prompt versions

v1 — baseline

Vague, no definitions

v2 — improved

Definitions + JSON output

Use {ticket} as the placeholder where each test case's text will be injected.
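
As a rough illustration of the injection step, here is a minimal Python sketch. It assumes plain string replacement of the {ticket} token; the function name render_prompt and the sample template are hypothetical, not the demo's actual code. Plain replacement is used instead of str.format because a prompt that asks for JSON output contains literal braces, which str.format would try to interpret as fields.

```python
PLACEHOLDER = "{ticket}"


def render_prompt(template: str, ticket_text: str) -> str:
    """Inject one test case's text into a prompt template (hypothetical helper)."""
    if PLACEHOLDER not in template:
        # Fail loudly: a template without the placeholder would send the same
        # prompt for every case and make the scores meaningless.
        raise ValueError(f"template is missing the {PLACEHOLDER} placeholder")
    # str.replace leaves every other brace in the template untouched.
    return template.replace(PLACEHOLDER, ticket_text)


# Illustrative template only; the wording is an assumption.
template = 'Classify the ticket. Reply as JSON: {"label": "<category>"}\n\nTicket: {ticket}'
rendered = render_prompt(template, "I was charged twice this month.")
print(rendered)
```

The guard on a missing placeholder is a design choice for this sketch: it catches an edited prompt that accidentally dropped {ticket} before any model call is made.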

What this demonstrates

A real eval harness has the same shape: golden cases, multiple prompt variants, hybrid scoring (exact match first, LLM-as-judge for ambiguous outputs), and a clear pass-rate delta between variants. This same scaffolding took CRED's workflow accuracy from 70% to 96%; see the MCP-server story for the long version.
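
The shape described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the label names, the hard-coded model outputs, and the stub_judge function, which stands in for a real LLM-as-judge call. The numbers it prints are not the demo's results.

```python
import json
from typing import Callable, List

Judge = Callable[[str, str], bool]


def extract_label(output: str) -> str:
    """Normalize raw model output: use the JSON "label" field if present, else the raw text."""
    text = output.strip()
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict) and "label" in parsed:
            text = str(parsed["label"])
    except json.JSONDecodeError:
        pass  # Not JSON; fall back to the raw text.
    return text.strip().lower()


def score_case(output: str, expected: str, judge: Judge) -> bool:
    """Hybrid scoring: cheap exact match first, judge only when that fails."""
    if extract_label(output) == expected.strip().lower():
        return True
    return judge(output, expected)


def pass_rate(outputs: List[str], expected: List[str], judge: Judge) -> float:
    """Fraction of golden cases that pass under hybrid scoring."""
    if not expected or len(outputs) != len(expected):
        raise ValueError("need one output per golden case, and at least one case")
    passed = sum(score_case(o, e, judge) for o, e in zip(outputs, expected))
    return passed / len(expected)


def stub_judge(output: str, expected: str) -> bool:
    """Placeholder for an LLM judge (assumption): passes if the label appears in the output."""
    return expected.lower() in output.lower()


# Hypothetical golden labels and hard-coded outputs for two prompt versions.
golden = ["billing", "bug", "account", "billing", "bug"]
v1_outputs = ["Billing", "This looks like a bug report.", "I'm not sure", "refund stuff", "bug"]
v2_outputs = ['{"label": "billing"}', '{"label": "bug"}', '{"label": "account"}',
              '{"label": "billing"}', '{"label": "account"}']

v1_rate = pass_rate(v1_outputs, golden, stub_judge)
v2_rate = pass_rate(v2_outputs, golden, stub_judge)
print(f"v1: {v1_rate:.0%}  v2: {v2_rate:.0%}  delta: {v2_rate - v1_rate:+.0%}")
```

Running exact match before the judge keeps the judge call, which is the expensive and noisier step, confined to outputs that are actually ambiguous.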

Demo-grade by design — sketches of how I work. Production fidelity scales with the infra and compute behind it.