Evals (AI Evaluation)
A structured framework for measuring AI agent and model outputs against quantified criteria and detecting regressions
What are Evals?
Evals (short for evaluations) are structured frameworks for measuring AI model or agent outputs against defined criteria. Beyond simple pass/fail testing, evals can score each step of a multi-step task and flag quality regressions by comparing results against those of prior versions.
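As a rough illustration of regression detection, the sketch below compares the mean eval score of a candidate version against a baseline. The function name `detect_regression`, the `tolerance` parameter, and the sample scores are all hypothetical and chosen only for illustration. Real eval suites often use per-case comparisons or statistical tests instead of a simple mean.

```python
from statistics import mean

def detect_regression(baseline_scores, candidate_scores, tolerance=0.02):
    """Hypothetical helper: True if the candidate's mean score drops
    below the baseline's mean by more than `tolerance`."""
    return mean(candidate_scores) < mean(baseline_scores) - tolerance

# Illustrative scores (0.0 to 1.0) for the same eval cases on two versions.
baseline = [0.9, 0.8, 1.0, 0.7]
candidate = [0.9, 0.6, 0.8, 0.7]

regressed = detect_regression(baseline, candidate)
print("regression detected:", regressed)
```

The tolerance is there so that small run-to-run noise is not reported as a regression. The right value depends on how noisy the scores are.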
How do evals differ from regular tests?
Traditional software tests check whether an output matches a fixed expected value. AI outputs are not deterministic: the same prompt can produce different, equally valid responses, so exact-match checks break down and evals rely on other approaches.
- Criteria-based scoring: Does the output satisfy specific conditions (format, required information, etc.)?
- LLM-as-a-Judge: A separate model, often a more capable one, scores the output against a rubric
- Trajectory analysis: Evaluates not just the final answer but the reasoning path taken to reach it
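The first approach in the list, criteria-based scoring, can be sketched as a set of independent checks whose pass rate becomes the score. This is a minimal sketch, assuming the model is expected to return a JSON object. The function `score_output` and the required keys are hypothetical and not taken from any particular eval library.

```python
import json

def score_output(output: str, required_keys: list[str]) -> dict:
    """Hypothetical scorer: checks the format (valid JSON object) and the
    presence of required fields, then returns the fraction of checks passed."""
    checks = {}
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        data = None
    checks["valid_json"] = isinstance(data, dict)
    fields = data if isinstance(data, dict) else {}
    for key in required_keys:
        checks[f"has_{key}"] = key in fields
    score = sum(checks.values()) / len(checks)
    return {"checks": checks, "score": score}

# Illustrative model output that is missing one required field.
result = score_output('{"title": "Q3 report"}', ["title", "summary"])
print(result["checks"], round(result["score"], 2))
```

Because each check is recorded separately, a low score can be traced to the specific condition that failed instead of a single pass/fail verdict.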
Why do evals matter?
If an agent reaches the right answer via flawed logic, that is a reliability problem waiting to surface. Evals assess both output quality and reasoning soundness.
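To make the distinction concrete, the sketch below scores the final answer and the path separately. It assumes a trajectory is recorded as a list of steps with a `tool` field and that a sound path must call certain tools in order. Both the record format and the function `evaluate_trajectory` are hypothetical simplifications. Checking tool order is only a proxy for reasoning soundness.

```python
def evaluate_trajectory(steps, final_answer, expected_answer, required_tools):
    """Hypothetical check: the answer must match AND the required tools
    must appear in the trajectory in the given order (other steps may
    occur in between)."""
    used = iter(step["tool"] for step in steps)
    # `tool in used` advances the iterator, so this tests for an ordered subsequence.
    path_ok = all(tool in used for tool in required_tools)
    answer_ok = final_answer == expected_answer
    return {"answer_ok": answer_ok, "path_ok": path_ok,
            "passed": answer_ok and path_ok}

# Illustrative run: the agent gives the right answer but never looked anything up.
steps = [{"tool": "calculator"}]
report = evaluate_trajectory(steps, "42", "42", ["search", "calculator"])
print(report)
```

Here the answer matches but the run still fails, which is the kind of case an output-only check would miss.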