AI End-to-End Testing in 2026: Evaluation Harnesses for LLM Applications
Traditional E2E tests break with non-deterministic AI. A practical guide to building evaluation harnesses that catch regressions before users do.
TL;DR
- Traditional assertion-based testing breaks with AI—outputs are non-deterministic and correct answers vary.
- Build evaluation harnesses with: tasks (test cases), graders (scoring logic), transcripts (full traces), and outcomes (verifiable results).
- Use multi-dimensional evaluation: task success, safety compliance, format adherence, and reasoning quality.
- Implement golden test suites for regression detection—baseline examples that must consistently pass.
- Extract checkable rules from prompts and system instructions for automated specification testing.
- Grade at multiple levels: per-turn, per-task, and aggregate across evaluation sets.
- Run evaluations in CI/CD, but understand they catch trends, not individual failures.
Why Traditional Testing Fails
Traditional E2E tests assume deterministic outputs:
// Traditional test: exact match expected
expect(await processOrder(input)).toEqual({
status: 'confirmed',
orderId: 'ORD-123',
});
This breaks with AI systems because:
- Same input may produce different (valid) outputs
- Correct answers can be phrased many ways
- Tool call sequences vary by run
- Reasoning paths differ even with same conclusions
The solution isn’t to force determinism—it’s to evaluate behaviors, not strings.
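Concretely, the order test above can be restated as behavior checks. This is a sketch: `OrderResult` and `checkOrderBehavior` are illustrative names based on the snippet, not an existing API.

```typescript
// Behavior-level check: assert on verifiable invariants, not exact strings.
// OrderResult and checkOrderBehavior are illustrative stand-ins for the
// processOrder example above.
interface OrderResult {
  status: string;
  orderId: string;
  confirmationMessage: string;
}

function checkOrderBehavior(result: OrderResult): boolean {
  return (
    result.status === 'confirmed' &&      // final state is verifiable
    /^ORD-\d+$/.test(result.orderId) &&   // ID matches the expected pattern
    result.confirmationMessage.length > 0 // exact wording is free to vary
  );
}
```

The assertions survive rephrasing: any confirmation message and any well-formed order ID pass, while a wrong status or malformed ID still fails.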
The Evaluation Framework
Core Components
Every AI evaluation needs these elements:
| Component | Purpose | Example |
|---|---|---|
| Task | Well-defined test scenario with success criteria | "Book a flight from NYC to LA for next Tuesday" |
| Trial | One complete agent run on a task | Single execution with transcript |
| Grader | Scoring logic that evaluates outcomes | Check if flight was booked correctly |
| Transcript | Complete record of messages and tool calls | Full conversation + tool inputs/outputs |
| Outcome | Verifiable final state | Booking confirmation, database changes |
The Evaluation Harness
An evaluation harness orchestrates testing at scale:
interface EvaluationHarness {
  // Load test tasks from definitions
  loadTasks(taskSet: string): Task[];
  // Run agent on a single task (agent runs are asynchronous)
  runTrial(agent: Agent, task: Task): Promise<Trial>;
  // Grade a completed trial
  gradeTrial(trial: Trial, task: Task): Promise<GradeResult>;
  // Run full evaluation suite
  evaluate(agent: Agent, taskSet: string): Promise<EvaluationResult>;
}
interface Task {
  id: string;
  description: string;
  input: string | Message[];
  expectedBehaviors: string[];
  successCriteria: SuccessCriterion[];
  // Schema with a throwing parse(), e.g. a Zod schema (used by gradeFormat)
  expectedFormat?: { parse(value: unknown): unknown };
  environment?: Record<string, unknown>;
}
interface Trial {
  taskId: string;
  taskDescription: string;
  transcript: Message[];
  toolCalls: ToolCall[];
  finalOutput: unknown;
  finalState?: Record<string, unknown>; // environment state after the run
  duration: number;
  tokenUsage: { input: number; output: number };
}
interface GradeResult {
score: number; // 0-1
passed: boolean;
dimensions: Record<string, number>;
feedback: string;
evidence: string[];
}
Multi-Dimensional Grading
Grade AI outputs on multiple dimensions:
Dimension 1: Task Success
Did the agent accomplish the goal?
function gradeTaskSuccess(trial: Trial, task: Task): number {
  // Check verifiable outcomes
  const outcomes = task.successCriteria.map(criterion => {
    switch (criterion.type) {
      case 'state_change':
        return verifyStateChange(trial.finalState, criterion.expected);
      case 'output_contains':
        return String(trial.finalOutput).includes(criterion.value) ? 1 : 0;
      case 'tool_called':
        return trial.toolCalls.some(c => c.name === criterion.toolName) ? 1 : 0;
      default:
        return 0;
    }
  });
  if (outcomes.length === 0) return 0;
  return outcomes.reduce((a, b) => a + b, 0) / outcomes.length;
}
Dimension 2: Safety Compliance
Did the agent avoid harmful behaviors?
const safetyChecks = [
'no_pii_in_logs',
'no_unauthorized_actions',
'no_harmful_content',
'respected_permissions',
];
function gradeSafety(trial: Trial): number {
let passed = 0;
for (const check of safetyChecks) {
if (passesSafetyCheck(trial, check)) {
passed++;
}
}
return passed / safetyChecks.length;
}
Dimension 3: Format Adherence
Did outputs match expected structure?
function gradeFormat(trial: Trial, task: Task): number {
  if (!task.expectedFormat) return 1;
  try {
    task.expectedFormat.parse(trial.finalOutput);
    return 1;
  } catch {
    return 0;
  }
}
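`gradeFormat` assumes `task.expectedFormat` exposes a `parse` method that throws on invalid input, the convention used by Zod-style schema libraries. A minimal hand-rolled schema and a standalone version of the try/parse scoring illustrate the idea:

```typescript
// Minimal stand-in for a schema with a throwing parse(), e.g. a Zod schema.
// The field names are illustrative, not a prescribed output format.
const flightSummarySchema = {
  parse(value: unknown) {
    const v = value as { flightId?: unknown; price?: unknown };
    if (typeof v?.flightId !== 'string' || typeof v?.price !== 'number') {
      throw new Error('invalid flight summary');
    }
    return v;
  },
};

// Standalone version of the try/parse scoring used by gradeFormat.
function scoreAgainstSchema(
  output: unknown,
  schema: { parse(value: unknown): unknown }
): number {
  try {
    schema.parse(output);
    return 1; // output matched the expected structure
  } catch {
    return 0; // structurally invalid output
  }
}
```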
Dimension 4: Reasoning Quality
Was the thinking sound?
async function gradeReasoning(trial: Trial): Promise<number> {
  // Use LLM-as-judge for reasoning quality
  const evaluation = await evaluatorLLM.evaluate({
    prompt: `Evaluate the reasoning quality in this transcript:

${formatTranscript(trial.transcript)}

Rate each of the following on a 1-5 scale:
1. Was the reasoning logical?
2. Were tool calls appropriate?
3. Was information used correctly?
4. Were edge cases handled?

Then provide an overall score from 0 to 1.`,
  });
  return evaluation.score;
}
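Once the four dimensions are scored, something has to fold them into a single grade. One sketch, with illustrative weights, an illustrative 0.8 pass threshold, and a hard gate on safety:

```typescript
// Weighted combination of the four dimension scores. The weights and the
// 0.8 pass threshold are assumptions for illustration, not prescribed values.
interface DimensionScores {
  taskSuccess: number;
  safety: number;
  format: number;
  reasoning: number;
}

function combineGrades(scores: DimensionScores, passThreshold = 0.8) {
  const weights = { taskSuccess: 0.4, safety: 0.3, format: 0.1, reasoning: 0.2 };
  const score =
    scores.taskSuccess * weights.taskSuccess +
    scores.safety * weights.safety +
    scores.format * weights.format +
    scores.reasoning * weights.reasoning;
  return {
    score,
    // Any safety failure fails the trial regardless of the weighted score.
    passed: score >= passThreshold && scores.safety === 1,
    dimensions: { ...scores },
  };
}
```

Gating on safety separately matters: a trial that aces every other dimension can still clear a weighted threshold while violating a safety check.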
Golden Test Suites
Golden tests are baseline examples that must consistently pass:
Building Golden Tests
interface GoldenTest {
id: string;
category: 'core' | 'edge_case' | 'regression';
input: string;
minimumScore: number;
requiredBehaviors: string[];
forbiddenBehaviors: string[];
notes: string;
}
const goldenTests: GoldenTest[] = [
{
id: 'book-simple-flight',
category: 'core',
input: 'Book a flight from SFO to LAX for tomorrow morning',
minimumScore: 0.9,
requiredBehaviors: [
'searched for flights',
'presented options to user',
'confirmed selection',
'completed booking',
],
forbiddenBehaviors: [
'booked without confirmation',
'charged wrong card',
],
notes: 'Basic happy path for flight booking',
},
{
id: 'handle-unavailable-flight',
category: 'edge_case',
input: 'Book a flight from SFO to Mars for tomorrow',
minimumScore: 0.8,
requiredBehaviors: [
'recognized impossible request',
'explained limitation',
'offered alternatives or clarification',
],
forbiddenBehaviors: [
'claimed to book flight',
'hallucinated flight options',
],
notes: 'Graceful handling of impossible requests',
},
];
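A golden test is only useful with a checker that enforces its fields. The sketch below assumes a grader has already produced a score and a list of observed behavior labels for the trial:

```typescript
// Check one trial's graded result against a golden test's requirements.
// observedBehaviors is an assumed output of the grading step: behavior
// labels matched against requiredBehaviors/forbiddenBehaviors.
function checkGoldenTest(
  test: {
    minimumScore: number;
    requiredBehaviors: string[];
    forbiddenBehaviors: string[];
  },
  result: { score: number; observedBehaviors: string[] }
): { passed: boolean; failures: string[] } {
  const failures: string[] = [];
  if (result.score < test.minimumScore) {
    failures.push(`score ${result.score} below minimum ${test.minimumScore}`);
  }
  for (const b of test.requiredBehaviors) {
    if (!result.observedBehaviors.includes(b)) {
      failures.push(`missing required behavior: ${b}`);
    }
  }
  for (const b of test.forbiddenBehaviors) {
    if (result.observedBehaviors.includes(b)) {
      failures.push(`forbidden behavior occurred: ${b}`);
    }
  }
  return { passed: failures.length === 0, failures };
}
```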
Running Golden Tests in CI
# .github/workflows/ai-evaluation.yml
name: AI Evaluation
on:
  pull_request:
    paths:
      - 'src/agent/**'
      - 'src/prompts/**'
      - 'src/tools/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - name: Install Dependencies
        run: pnpm install --frozen-lockfile
      - name: Run Golden Tests
        run: pnpm run evaluate:golden
      - name: Upload Results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: evaluation-results.json
      - name: Check Threshold
        run: |
          SCORE=$(jq '.aggregateScore' evaluation-results.json)
          if (( $(echo "$SCORE < 0.85" | bc -l) )); then
            echo "Evaluation score $SCORE below threshold 0.85"
            exit 1
          fi
Specification-Driven Testing
Extract testable rules from system prompts:
Rule Extraction
interface BehaviorRule {
id: string;
source: 'prompt' | 'policy' | 'inferred';
rule: string;
testable: boolean;
checker: (trial: Trial) => boolean;
}
// Example rules extracted from prompt:
// "Never book a flight without user confirmation"
const rules: BehaviorRule[] = [
{
id: 'require-booking-confirmation',
source: 'prompt',
rule: 'Must request user confirmation before completing booking',
testable: true,
checker: (trial) => {
  const bookingCall = trial.toolCalls.find(c => c.name === 'complete_booking');
  if (!bookingCall) return true; // No booking, rule not applicable
  // Find a user confirmation that precedes the booking. This assumes each
  // ToolCall records the transcript index of the turn that issued it; an
  // index into toolCalls cannot be compared against a transcript index,
  // since the two arrays have unrelated orderings.
  const confirmationIndex = trial.transcript.findIndex(
    m => m.role === 'user' && /\b(yes|confirm|book it)\b/i.test(m.content)
  );
  return confirmationIndex >= 0 && confirmationIndex < bookingCall.transcriptIndex;
},
},
{
id: 'no-pii-logging',
source: 'policy',
rule: 'Never log PII in tool call arguments',
testable: true,
checker: (trial) => {
const piiPattern = /\b\d{3}-\d{2}-\d{4}\b|\b\d{16}\b/; // SSN, credit card
return !trial.toolCalls.some(c =>
JSON.stringify(c.arguments).match(piiPattern)
);
},
},
];
Automated Rule Checking
function checkRules(trial: Trial, rules: BehaviorRule[]): RuleCheckResult {
  const results: RuleResult[] = [];
  for (const rule of rules.filter(r => r.testable)) {
    try {
      const passed = rule.checker(trial);
      results.push({ ruleId: rule.id, passed, rule: rule.rule });
    } catch (error) {
      // A crashing checker counts as a failure, not a silent skip
      results.push({
        ruleId: rule.id,
        passed: false,
        rule: rule.rule,
        error: error instanceof Error ? error.message : String(error),
      });
    }
  }
  const passedCount = results.filter(r => r.passed).length;
  return {
    totalRules: results.length,
    passed: passedCount,
    failed: results.filter(r => !r.passed),
    score: results.length > 0 ? passedCount / results.length : 1,
  };
}
LLM-as-Judge Patterns
For subjective evaluations, use an LLM grader:
Rubric-Based Grading
async function llmGrade(
  trial: Trial,
  rubric: GradingRubric,
  grader: LLM = evaluatorLLM
): Promise<GradeResult> {
  const prompt = `You are evaluating an AI agent's performance.

## Task Description
${trial.taskDescription}

## Agent Transcript
${formatTranscript(trial.transcript)}

## Grading Rubric
${rubric.criteria.map(c => `- ${c.name}: ${c.description}`).join('\n')}

## Instructions
For each criterion, provide:
1. Score (1-5)
2. Brief justification with specific evidence

Output as JSON:
{
  "scores": {
    "criterion_name": { "score": 1-5, "justification": "..." }
  },
  "overall_assessment": "...",
  "overall_score": 1-5
}`;
  const response = await grader.complete(prompt);
  return parseGradeResponse(response);
}
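`parseGradeResponse` is left undefined above. A defensive sketch, keyed to the JSON shape the grading prompt requests and normalizing the 1-5 scale to the 0-1 range `GradeResult` uses:

```typescript
// Defensive parse of the grader's reply. Field names follow the JSON shape
// requested in the grading prompt; the 1-5 scores are mapped onto 0-1.
function parseGradeResponse(response: string): {
  score: number;
  dimensions: Record<string, number>;
  feedback: string;
} {
  // Models sometimes wrap JSON in prose or code fences; take the outermost braces.
  const start = response.indexOf('{');
  const end = response.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('no JSON object found in grader response');
  }
  const parsed = JSON.parse(response.slice(start, end + 1));
  const dimensions: Record<string, number> = {};
  for (const [name, entry] of Object.entries(parsed.scores ?? {})) {
    dimensions[name] = ((entry as { score: number }).score - 1) / 4; // 1-5 → 0-1
  }
  return {
    score: (parsed.overall_score - 1) / 4,
    dimensions,
    feedback: parsed.overall_assessment ?? '',
  };
}
```

In production you would validate the parsed object against a schema rather than trusting its shape, but the brace-extraction step alone removes the most common failure mode.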
Consensus Grading
For high-stakes evaluations, use multiple graders:
async function consensusGrade(
  trial: Trial,
  rubric: GradingRubric,
  graders: LLM[],
  threshold: number = 0.7
): Promise<ConsensusResult> {
  const grades = await Promise.all(
    graders.map(g => llmGrade(trial, rubric, g))
  );
  const scores = grades.map(g => g.score);
  const meanScore = scores.reduce((a, b) => a + b, 0) / scores.length;
  const agreement = calculateAgreement(scores);
  return {
    consensus: agreement >= threshold,
    meanScore,
    agreement,
    individualGrades: grades,
    recommendation: meanScore >= 0.7 ? 'pass' : 'fail',
  };
}
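`calculateAgreement` can be implemented several ways; one simple option scores agreement as 1 minus the normalized standard deviation, so identical scores yield 1.0 and a maximal spread approaches 0:

```typescript
// Agreement as 1 minus normalized standard deviation. Scores are assumed to
// lie in [0, 1], where the standard deviation is bounded by 0.5.
function calculateAgreement(scores: number[]): number {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length;
  return 1 - Math.sqrt(variance) / 0.5;
}
```

For small grader panels (3-5), a spread-based measure like this is easier to reason about than pairwise correlation statistics.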
Monitoring and Alerting
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Golden test pass rate | % of golden tests passing | <95% |
| Mean evaluation score | Average score across tasks | <0.8 |
| Safety violation rate | % of trials with safety issues | >0% |
| Rule compliance rate | % of rules passed | <99% |
| Score variance | Consistency across runs | High variance |
Trend Analysis
interface EvaluationTrend {
date: Date;
score: number;
goldenPassRate: number;
safetyViolations: number;
}
function detectRegression(
  history: EvaluationTrend[],
  current: EvaluationTrend
): RegressionAlert | null {
  const recent = history.slice(-7);
  if (recent.length === 0) return null;
  const recentAverage =
    recent.reduce((a, b) => a + b.score, 0) / recent.length;
  if (current.score < recentAverage - 0.1) {
    return {
      type: 'score_regression',
      severity: 'high',
      message: `Score dropped from ${recentAverage.toFixed(2)} to ${current.score.toFixed(2)}`,
      recommendation: 'Review recent prompt or model changes',
    };
  }
  return null;
}
Implementation Checklist
Setup
- Define evaluation dimensions (success, safety, format, reasoning)
- Create golden test suite (20+ cases covering core paths)
- Extract rules from prompts and policies
- Build grading functions for each dimension
- Set up LLM-as-judge for subjective criteria
Infrastructure
- Build evaluation harness with trial runner
- Integrate with CI/CD pipeline
- Set up results storage and dashboards
- Configure alerting for regressions
- Enable comparison between agent versions
Ongoing
- Add golden tests for new features
- Add regression tests for fixed bugs
- Review and update rubrics quarterly
- Analyze evaluation disagreements
- Tune thresholds based on production data
FAQ
How many golden tests do I need?
Start with 20–30 covering core functionality. Add tests for every bug you fix and every new capability. Mature systems have 100+ golden tests.
How do I handle non-determinism?
Run each test multiple times (3–5) and use the majority outcome. Flaky tests (inconsistent results) indicate unstable behaviors worth investigating.
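The run-several-times pattern is mechanical enough to sketch. `runOnce` is an assumed callback that executes one full trial and reports pass/fail; real agent runs would be asynchronous, which changes only the signature:

```typescript
// Run a test N times and take the majority outcome. runOnce is an assumed
// callback representing one complete trial of the test.
function majorityOutcome(
  runOnce: () => boolean,
  runs = 5
): { passed: boolean; passRate: number; flaky: boolean } {
  const results: boolean[] = [];
  for (let i = 0; i < runs; i++) {
    results.push(runOnce());
  }
  const passes = results.filter(Boolean).length;
  return {
    passed: passes > runs / 2,
    passRate: passes / runs,
    // Mixed outcomes flag an unstable behavior worth investigating
    flaky: passes > 0 && passes < runs,
  };
}
```

Use an odd run count so the majority is never tied, and surface the `flaky` flag in CI output even when the test passes.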
Should I use the same model for grading?
No. Use a separate, often stronger model for grading. Using the same model for generation and evaluation creates blind spots.
How often should evaluations run?
Golden tests: every PR. Full evaluation: daily or on significant changes. Comprehensive evaluation: weekly or before releases.
What if my agent is mostly chat-based?
Evaluate on conversation quality, helpfulness, and safety. Use human-labeled examples for calibration. Focus on user satisfaction proxies.
How do I handle tool-using agents?
Grade tool calls separately: correct tool selection, valid parameters, appropriate timing. Verify tool results match expectations.
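That per-call grading can be sketched as follows; the `ExpectedToolCall` shape, including the `mustFollow` ordering constraint, is an assumption for illustration:

```typescript
// Grade one expected tool call on selection, parameters, and timing.
// ExpectedToolCall is an illustrative shape, not an existing API.
interface ExpectedToolCall {
  name: string;
  requiredParams: string[];
  mustFollow?: string; // a tool that should have been called earlier
}

function gradeToolCall(
  calls: { name: string; arguments: Record<string, unknown> }[],
  expected: ExpectedToolCall
): number {
  const index = calls.findIndex(c => c.name === expected.name);
  if (index === -1) return 0; // tool never selected
  const call = calls[index];
  const hasParams = expected.requiredParams.every(
    p => call.arguments[p] !== undefined
  );
  const timingOk =
    !expected.mustFollow ||
    calls.slice(0, index).some(c => c.name === expected.mustFollow);
  // Selection, parameters, and ordering each contribute a third of the score
  return (1 + (hasParams ? 1 : 0) + (timingOk ? 1 : 0)) / 3;
}
```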
Sources & Further Reading
- AI Agent Evaluations Guide 2025-2026 — Comprehensive overview
- Agent-Pex: Automated Agent Testing — Microsoft Research
- OpenAI Agent Evals — Official documentation
- Litmus LLM Testing — Google’s testing platform
- Agent Evaluation Harnesses — Related: building evaluation infrastructure
- Prompt Regression Testing — Related: preventing prompt regressions