LLM-as-a-Judge
Definition
An evaluation methodology where a capable LLM scores another model's or agent's outputs against a predefined rubric
#LLM-as-a-Judge #LLM evaluator #AI evaluation #evals #rubric #automated evaluation
What is LLM-as-a-Judge?
LLM-as-a-Judge is an evaluation methodology in which a capable language model, such as GPT-4o or Claude, scores the outputs of another model or agent against a predefined rubric. The judge model takes the place of human reviewers as the evaluator.
Why is it used?
AI agent outputs often have no fixed correct answer, making simple keyword-matching approaches insufficient. LLM-as-a-Judge evaluates open-ended outputs against flexible criteria, making it well-suited for large-scale automated evaluation pipelines.
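As a rough illustration of scoring against a rubric, the sketch below builds a judge prompt, sends it to a model, and parses per-criterion scores. Everything named here is an assumption for illustration: the `RUBRIC` criteria, the 1 to 5 scale, the JSON reply format, and the `call_model` callable, which stands in for whatever LLM client you use. The `fake_model` stub returns a canned reply so the example runs without any API.

```python
import json
import re
from typing import Callable

# Hypothetical rubric: criterion name -> what the judge should look for.
RUBRIC = {
    "accuracy": "Claims are factually correct.",
    "completeness": "The answer addresses every part of the question.",
}


def build_judge_prompt(rubric: dict[str, str], question: str, answer: str) -> str:
    """Render the rubric, question, and answer into a single grading prompt."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are grading an answer against a rubric.\n"
        f"Score each criterion from 1 to 5:\n{criteria}\n\n"
        f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
        "Reply with a JSON object mapping each criterion name to an integer score."
    )


def parse_scores(reply: str, rubric: dict[str, str]) -> dict[str, int]:
    """Extract the first JSON object from the reply and validate every score."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("judge reply contained no JSON object")
    raw = json.loads(match.group(0))
    scores = {}
    for name in rubric:
        value = int(raw[name])  # KeyError here means the judge skipped a criterion
        if not 1 <= value <= 5:
            raise ValueError(f"score for {name!r} is out of range: {value}")
        scores[name] = value
    return scores


def judge(
    call_model: Callable[[str], str],
    question: str,
    answer: str,
    rubric: dict[str, str] = RUBRIC,
) -> dict[str, int]:
    """Ask the judge model for scores; call_model is any prompt -> text function."""
    reply = call_model(build_judge_prompt(rubric, question, answer))
    return parse_scores(reply, rubric)


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call, returning a canned reply."""
    return 'Here is my grade: {"accuracy": 4, "completeness": 3}'


if __name__ == "__main__":
    print(judge(fake_model, "What is the capital of France?", "Paris."))
```

Passing the model as a plain callable keeps the rubric and parsing logic independent of any particular provider, so the same checks apply whichever judge model is swapped in.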
Limitations to watch for
- Bias: The evaluating model may score outputs that resemble its own style more favorably.
- Rubric quality: Vague or poorly defined criteria produce unreliable scores.
- Cost: Every evaluation requires an LLM call, which compounds at scale.
Related Terms
- Evals (AI Evaluation) [operations]: A structured framework for measuring AI agent and model outputs against quantified criteria and detecting regressions
- Minimum Viable Agent (MVA) [operations]: A smallest-possible agent design that validates one core task first with single-input, single-output execution
- Verification Loop [operations]: An operational pattern that converges quality by repeatedly testing, reviewing, and retrying AI-generated outputs