Eval Harness Demo
See what an LLM eval actually is: five support-ticket classification cases and two prompt versions, scored side by side with exact match first and an LLM judge as the fallback. Edit any case or either prompt in the browser and re-run.
Test cases (5)
Prompt versions
v1 — baseline
Vague, no definitions
v2 — improved
Definitions + JSON output
Use {ticket} as the placeholder where each test case's text will be injected.
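A minimal sketch of how the {ticket} substitution could work. This is not the demo's actual code, and the function name `render_prompt` is hypothetical. It uses a plain string replace instead of `str.format`, on the assumption that a v2-style prompt asking for JSON output contains literal braces that `str.format` would misread as fields.

```python
def render_prompt(template: str, ticket: str) -> str:
    """Inject one test case's text into a prompt template.

    Hypothetical helper: only the literal token {ticket} is treated as a
    placeholder, so other braces in the prompt (e.g. a JSON example) are
    left untouched.
    """
    if "{ticket}" not in template:
        raise ValueError("prompt template must contain the {ticket} placeholder")
    return template.replace("{ticket}", ticket)


# Illustrative prompt text (made up for this sketch, not the demo's v2 prompt).
example_template = 'Classify this ticket: {ticket}\nReply as JSON: {"label": "<category>"}'
print(render_prompt(example_template, "I was charged twice this month"))
```

Failing loudly when the placeholder is missing keeps an edited prompt from silently running every case against the same ticket-free text.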
What this demonstrates
A real eval harness has the same shape: golden cases, multiple prompt variants, hybrid scoring (exact match + LLM-as-judge for ambiguous outputs), and a clear pass-rate delta. This same scaffolding took CRED's workflow accuracy from 70% to 96% — see the MCP-server story for the long version.
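The shape described above can be sketched in a few lines. Everything here is an assumption made for illustration: the names `Case`, `extract_label`, `score`, and `pass_rate`, the `{"label": ...}` JSON shape, and the judge, which is a plain callable standing in for a real LLM-as-judge call.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# A judge takes (model_output, expected_label) and returns pass/fail.
# In a real harness this would wrap an LLM call; here it is any callable.
Judge = Callable[[str, str], bool]


@dataclass
class Case:
    ticket: str    # input text for the golden case
    expected: str  # golden label


def extract_label(output: str) -> str:
    """Return the label from a JSON object if the output parses, else the raw text."""
    try:
        parsed = json.loads(output)
        if isinstance(parsed, dict) and "label" in parsed:
            return str(parsed["label"]).strip().lower()
    except json.JSONDecodeError:
        pass
    return output.strip().lower()


def score(output: str, expected: str, judge: Optional[Judge] = None) -> bool:
    """Hybrid scoring: exact match first, judge only when exact match fails."""
    if extract_label(output) == expected.strip().lower():
        return True
    if judge is None:
        return False
    return judge(output, expected)


def pass_rate(outputs: Sequence[str], cases: Sequence[Case],
              judge: Optional[Judge] = None) -> float:
    """Fraction of cases whose output passes the hybrid scorer."""
    if len(outputs) != len(cases):
        raise ValueError("need exactly one output per case")
    if not cases:
        return 0.0
    passed = sum(score(o, c.expected, judge) for o, c in zip(outputs, cases))
    return passed / len(cases)


def keyword_judge(output: str, expected: str) -> bool:
    """Stand-in judge for this sketch: passes if the label appears in the output."""
    return expected.lower() in output.lower()
```

Running `pass_rate` once per prompt version over the same cases and subtracting the two results gives the pass-rate delta. Keeping exact match as the first check means the judge, which is slower and less deterministic, is only consulted for outputs that do not match outright.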
Demo-grade by design — sketches of how I work. Production fidelity scales with the infra and compute behind it.