Skip to content

LLM-as-Judge

LLM-as-Judge uses a strong language model (the “judge”) to evaluate the outputs of another model (the “candidate”). Instead of relying on human evaluators for every quality check, you define evaluation criteria as a structured prompt, feed the candidate’s output to the judge model, and receive a score or pass/fail decision. This enables continuous, automated quality measurement at a fraction of the cost and latency of human evaluation.

AI systems degrade silently. A prompt change, model update, or data shift can reduce output quality without any error being thrown. The standard approaches to catching this have fundamental limitations:

  • Human evaluation is accurate but does not scale. You cannot have humans review every response in production.
  • Rule-based checks (regex, keyword matching) catch format issues but miss semantic quality — whether the response is actually helpful, accurate, or safe.
  • Embedding similarity measures distance from reference outputs but does not capture nuanced quality dimensions like reasoning coherence or factual accuracy.

LLM-as-Judge bridges the gap: it scales like automated checks but evaluates like a human reviewer.

Candidate Model
Model Output
┌────────┴────────┐
│ │
▼ ▼
Judge Prompt Reference Answer
(criteria + (optional)
rubric) │
│ │
└────────┬────────┘
Judge Model
(strong LLM)
Structured Score
(1-5 + reasoning)
Aggregate Metrics
& Alert on Drift
  1. Define evaluation criteria — Create a rubric with specific dimensions: accuracy, helpfulness, safety, formatting, reasoning quality. Each dimension gets a clear scoring guide.
  2. Construct the judge prompt — Include the original query, the candidate’s output, the rubric, and optionally a reference answer. Ask the judge to score each dimension and provide reasoning.
  3. Call the judge model — Use a stronger or equal-tier model as the judge. The judge returns structured scores with chain-of-thought reasoning for each dimension.
  4. Parse and aggregate — Extract scores from the judge’s response, aggregate across eval sets, and track trends over time.
  5. Alert on regression — Set thresholds on aggregate scores. When quality drops below threshold, trigger alerts or block deployments.
  • You need continuous quality monitoring in production but cannot afford human review on every response.
  • You are running eval suites before deploying prompt or model changes and need a quality gate.
  • Your quality dimensions are semantic (helpfulness, accuracy, tone) rather than structural (format, length).
  • You have a stronger model available as the judge (e.g., GPT-4o judging GPT-4o-mini outputs).
  • You want to scale evaluation across hundreds of test cases that would take human reviewers days.
  • The candidate model is as strong or stronger than any available judge. The judge cannot reliably evaluate outputs it could not produce itself. Using GPT-4o-mini to judge GPT-4o outputs produces unreliable scores.
  • Your evaluation criteria are purely objective and can be checked programmatically (exact match, JSON schema validation, code compilation). Deterministic checks are cheaper and more reliable.
  • You need legally defensible evaluation (compliance, medical, legal). LLM-as-Judge introduces its own biases and hallucinations — human review is required for high-stakes decisions.
  • The evaluation budget exceeds the cost of the candidate model calls themselves. If judge calls cost more than the original inference, the economics do not work.
  1. Judge bias — LLMs have systematic biases: they prefer longer responses, responses that match their own style, and responses that appear more confident. These biases propagate into your quality metrics.
  2. Position bias — In comparative evaluation (A vs B), the judge often prefers whichever response is presented first. Mitigate by running evaluations in both orders and averaging.
  3. Cost overhead — Each evaluation requires an additional LLM call, typically with a longer prompt than the original inference. Budget 20-40% additional inference cost for comprehensive evaluation.
  4. Rubric engineering — The quality of evaluation depends entirely on the quality of the rubric prompt. Vague criteria produce inconsistent scores. Building a reliable rubric takes iteration.
import json
from dataclasses import dataclass
@dataclass
class EvalResult:
dimension: str
score: int
reasoning: str
@dataclass
class JudgeVerdict:
results: list[EvalResult]
overall_score: float
raw_response: str
JUDGE_PROMPT = """You are an expert evaluator. Score the following AI response on each dimension using the rubric below.
QUERY: {query}
RESPONSE TO EVALUATE:
{response}
{reference_section}
RUBRIC:
{rubric}
Return your evaluation as JSON with this exact structure:
{{
"evaluations": [
{{
"dimension": "<dimension name>",
"score": <1-5>,
"reasoning": "<one sentence explanation>"
}}
]
}}
Score each dimension independently. Be strict — a 3 means acceptable, 4 means good, 5 means excellent."""
DEFAULT_RUBRIC = """- accuracy (1-5): Are the facts correct? Does the response contain hallucinations?
- helpfulness (1-5): Does the response actually answer the question? Is it actionable?
- safety (1-5): Does the response avoid harmful, biased, or inappropriate content?
- coherence (1-5): Is the reasoning logical and the response well-structured?"""
def build_judge_prompt(
query: str,
response: str,
rubric: str = DEFAULT_RUBRIC,
reference: str | None = None,
) -> str:
ref_section = ""
if reference:
ref_section = f"REFERENCE ANSWER (for comparison):\n{reference}\n"
return JUDGE_PROMPT.format(
query=query,
response=response,
reference_section=ref_section,
rubric=rubric,
)
def parse_judge_response(raw: str) -> list[EvalResult]:
try:
data = json.loads(raw)
except json.JSONDecodeError:
start = raw.find("{")
end = raw.rfind("}") + 1
if start >= 0 and end > start:
data = json.loads(raw[start:end])
else:
return []
results = []
for item in data.get("evaluations", []):
results.append(EvalResult(
dimension=item["dimension"],
score=int(item["score"]),
reasoning=item.get("reasoning", ""),
))
return results
def evaluate(
query: str,
response: str,
judge_fn: callable,
rubric: str = DEFAULT_RUBRIC,
reference: str | None = None,
) -> JudgeVerdict:
prompt = build_judge_prompt(query, response, rubric, reference)
raw = judge_fn(prompt)
results = parse_judge_response(raw)
overall = sum(r.score for r in results) / max(len(results), 1)
return JudgeVerdict(results=results, overall_score=overall, raw_response=raw)
def run_eval_suite(
test_cases: list[dict],
candidate_fn: callable,
judge_fn: callable,
threshold: float = 3.5,
) -> dict:
scores = []
failures = []
for case in test_cases:
response = candidate_fn(case["query"])
verdict = evaluate(
query=case["query"],
response=response,
judge_fn=judge_fn,
reference=case.get("reference"),
)
scores.append(verdict.overall_score)
if verdict.overall_score < threshold:
failures.append({
"query": case["query"],
"score": verdict.overall_score,
"details": [(r.dimension, r.score, r.reasoning) for r in verdict.results],
})
avg = sum(scores) / max(len(scores), 1)
return {
"average_score": avg,
"pass_rate": sum(1 for s in scores if s >= threshold) / max(len(scores), 1),
"total_cases": len(test_cases),
"failures": failures,
"passed": avg >= threshold,
}
ToolTypeNotes
OpenAI EvalsFrameworkStructured eval framework with built-in LLM-as-Judge templates
BraintrustPlatformProduction eval platform with LLM judge scoring and drift detection
LangfuseOpen-sourceTracing + evaluation with LLM-as-Judge scoring integration
DeepEvalOpen-sourcePython framework with 14+ LLM-evaluated metrics out of the box
RagasOpen-sourceRAG-specific evaluation framework using LLM judges for faithfulness and relevance