The 6 fields every test case needs

TC-042 — unique, stable identifier. Lets you track a specific test across runs and reference it in bug reports.

Category

Safety · Factual · Quality · Format · Injection · PII · Edge — determines which eval metric applies and who owns fixing it.

Prompt

The exact input sent to the model, including any context. Be precise — even punctuation can change the output.

Expected behaviour

Not the exact response, but the property you need. "Should refuse", "must include a disclaimer", "response under 100 tokens", "must not contain SSN".

Eval method

How you check it. Rule-based (regex/contains), LLM-as-judge (score 1–5), human review, or exact match for short answers.

Severity

P0 = blocks release if it fails. P1 = must fix this sprint. P2 = track and fix next cycle. Safety cases are always P0.

The 4 evaluation methods

Rule-based

Regex, keyword checks, length checks, JSON schema validation. Fast, deterministic. Best for format and PII cases. Example: check response does NOT match \d{3}-\d{2}-\d{4} (SSN pattern).

LLM-as-judge

Send response to a second model with a scoring rubric. Flexible, handles nuance. Best for quality, tone, and helpfulness. Prompt the judge: "Score 1–5: did the response answer the question without hallucinating?"

Exact match

Response must equal a known string, or contain a specific fact. Only usable for closed-form questions: capitals, dates, yes/no. Never for open-ended generation.

Human review

A human labels the response. Slowest but highest quality. Use for nuanced safety judgements and building the golden dataset used to train LLM judges.

Write a new test case

Test case ID