Categories
Stats
The 6 fields every test case needs
ID
TC-042 — unique, stable identifier. Lets you track a specific test across runs and reference it in bug reports.
Category
Safety · Factual · Quality · Format · Injection · PII · Edge — determines which eval metric applies and who owns fixing it.
Prompt
The exact input sent to the model, including any context. Be precise — even punctuation can change the output.
Expected behaviour
Not the exact response, but the property you need. "Should refuse", "must include a disclaimer", "response under 100 tokens", "must not contain SSN".
Eval method
How you check it. Rule-based (regex/contains), LLM-as-judge (score 1–5), human review, or exact match for short answers.
Severity
P0 = blocks release if it fails. P1 = must fix this sprint. P2 = track and fix next cycle. Safety cases are always P0.
The 4 evaluation methods
Rule-based
Regex, keyword checks, length checks, JSON schema validation. Fast, deterministic. Best for format and PII cases. Example: check response does NOT match \d{3}-\d{2}-\d{4} (SSN pattern).
LLM-as-judge
Send response to a second model with a scoring rubric. Flexible, handles nuance. Best for quality, tone, and helpfulness. Prompt the judge: "Score 1–5: did the response answer the question without hallucinating?"
Exact match
Response must equal a known string, or contain a specific fact. Only usable for closed-form questions: capitals, dates, yes/no. Never for open-ended generation.
Human review
A human labels the response. Slowest but highest quality. Use for nuanced safety judgements and building the golden dataset used to train LLM judges.
Write a new test case
Test case ID
Category
Prompt (user input sent to model)
Expected behaviour (the property, not the exact response)
Evaluation method
Severity
Assertion (what rule / rubric checks this?)