AI End-to-End Testing in 2026: Evaluation Harnesses for LLM Applications
Traditional E2E tests break with non-deterministic AI. A practical guide to building evaluation harnesses that catch regressions before users do.
TL;DR
- Traditional assertion-based testing breaks with AI—outputs are non-deterministic and correct answers vary.
- Build evaluation harnesses with: tasks (test cases), graders (scoring logic), transcripts (full traces), and outcomes (verifiable results).
- Use multi-dimensional evaluation: task success, safety compliance, format adherence, and reasoning quality.
- Implement golden test suites for regression detection—baseline examples that must consistently pass.
- Extract checkable rules from prompts and system instructions for automated specification testing.
- Grade at multiple levels: per-turn, per-task, and aggregate across evaluation sets.
- Run evaluations in CI/CD, but understand they catch trends, not individual failures.
Why Traditional Testing Fails
Traditional E2E tests assume deterministic outputs:
// Traditional test: exact match expected
expect(await processOrder(input)).toEqual({
status: 'confirmed',
orderId: 'ORD-123',
});
This breaks with AI systems because:
- Same input may produce different (valid) outputs
- Correct answers can be phrased many ways
- Tool call sequences vary by run
- Reasoning paths differ even with same conclusions
The solution isn’t to force determinism—it’s to evaluate behaviors, not strings.
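Concretely, the order test above can be restated as behavior checks. This is a sketch: `OrderResult` and `checkOrderBehavior` are illustrative names based on the snippet, not an existing API.

```typescript
// Behavior-level check: assert on verifiable invariants, not exact strings.
// OrderResult and checkOrderBehavior are illustrative stand-ins for the
// processOrder example above.
interface OrderResult {
  status: string;
  orderId: string;
  confirmationMessage: string;
}

function checkOrderBehavior(result: OrderResult): boolean {
  return (
    result.status === 'confirmed' &&      // final state is verifiable
    /^ORD-\d+$/.test(result.orderId) &&   // ID matches the expected pattern
    result.confirmationMessage.length > 0 // exact wording is free to vary
  );
}
```

The assertions survive rephrasing: any confirmation message and any well-formed order ID pass, while a wrong status or malformed ID still fails.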
The Evaluation Framework
Core Components
Every AI evaluation needs these elements:
| Component | Purpose | Example |
|---|---|---|
| Task | Well-defined test scenario with success criteria | "Book a flight from NYC to LA for next Tuesday" |
| Trial | One complete agent run on a task | Single execution with transcript |
| Grader | Scoring logic that evaluates outcomes | Check if flight was booked correctly |
| Transcript | Complete record of messages and tool calls | Full conversation + tool inputs/outputs |
| Outcome | Verifiable final state | Booking confirmation, database changes |
The Evaluation Harness
An evaluation harness orchestrates testing at scale:
interface EvaluationHarness {
  // Load test tasks from definitions
  loadTasks(taskSet: string): Task[];
  // Run agent on a single task (agent runs are asynchronous)
  runTrial(agent: Agent, task: Task): Promise<Trial>;
  // Grade a completed trial
  gradeTrial(trial: Trial, task: Task): Promise<GradeResult>;
  // Run full evaluation suite
  evaluate(agent: Agent, taskSet: string): Promise<EvaluationResult>;
}
interface Task {
  id: string;
  description: string;
  input: string | Message[];
  expectedBehaviors: string[];
  successCriteria: SuccessCriterion[];
  // Schema with a throwing parse(), e.g. a Zod schema (used by gradeFormat)
  expectedFormat?: { parse(value: unknown): unknown };
  environment?: Record<string, unknown>;
}
interface Trial {
  taskId: string;
  taskDescription: string;
  transcript: Message[];
  toolCalls: ToolCall[];
  finalOutput: unknown;
  finalState?: Record<string, unknown>; // environment state after the run
  duration: number;
  tokenUsage: { input: number; output: number };
}
interface GradeResult {
score: number; // 0-1
passed: boolean;
dimensions: Record<string, number>;
feedback: string;
evidence: string[];
}
Multi-Dimensional Grading
Grade AI outputs on multiple dimensions:
Dimension 1: Task Success
Did the agent accomplish the goal?
function gradeTaskSuccess(trial: Trial, task: Task): number {
  // Check verifiable outcomes
  const outcomes = task.successCriteria.map(criterion => {
    switch (criterion.type) {
      case 'state_change':
        return verifyStateChange(trial.finalState, criterion.expected);
      case 'output_contains':
        return String(trial.finalOutput).includes(criterion.value) ? 1 : 0;
      case 'tool_called':
        return trial.toolCalls.some(c => c.name === criterion.toolName) ? 1 : 0;
      default:
        return 0;
    }
  });
  if (outcomes.length === 0) return 0;
  return outcomes.reduce((a, b) => a + b, 0) / outcomes.length;
}
Dimension 2: Safety Compliance
Did the agent avoid harmful behaviors?
const safetyChecks = [
'no_pii_in_logs',
'no_unauthorized_actions',
'no_harmful_content',
'respected_permissions',
];
function gradeSafety(trial: Trial): number {
let passed = 0;
for (const check of safetyChecks) {
if (passesSafetyCheck(trial, check)) {
passed++;
}
}
return passed / safetyChecks.length;
}
Dimension 3: Format Adherence
Did outputs match expected structure?
function gradeFormat(trial: Trial, task: Task): number {
  if (!task.expectedFormat) return 1;
  try {
    task.expectedFormat.parse(trial.finalOutput);
    return 1;
  } catch {
    return 0;
  }
}
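`gradeFormat` assumes `task.expectedFormat` exposes a `parse` method that throws on invalid input, the convention used by Zod-style schema libraries. A minimal hand-rolled schema and a standalone version of the try/parse scoring illustrate the idea:

```typescript
// Minimal stand-in for a schema with a throwing parse(), e.g. a Zod schema.
// The field names are illustrative, not a prescribed output format.
const flightSummarySchema = {
  parse(value: unknown) {
    const v = value as { flightId?: unknown; price?: unknown };
    if (typeof v?.flightId !== 'string' || typeof v?.price !== 'number') {
      throw new Error('invalid flight summary');
    }
    return v;
  },
};

// Standalone version of the try/parse scoring used by gradeFormat.
function scoreAgainstSchema(
  output: unknown,
  schema: { parse(value: unknown): unknown }
): number {
  try {
    schema.parse(output);
    return 1; // output matched the expected structure
  } catch {
    return 0; // structurally invalid output
  }
}
```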
Dimension 4: Reasoning Quality
Was the thinking sound?
async function gradeReasoning(trial: Trial): Promise<number> {
  // Use LLM-as-judge for reasoning quality
  const evaluation = await evaluatorLLM.evaluate({
    prompt: `Evaluate the reasoning quality in this transcript:

${formatTranscript(trial.transcript)}

Rate each of the following on a 1-5 scale:
1. Was the reasoning logical?
2. Were tool calls appropriate?
3. Was information used correctly?
4. Were edge cases handled?

Then provide an overall score from 0 to 1.`,
  });
  return evaluation.score;
}
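Once the four dimensions are scored, something has to fold them into a single grade. One sketch, with illustrative weights, an illustrative 0.8 pass threshold, and a hard gate on safety:

```typescript
// Weighted combination of the four dimension scores. The weights and the
// 0.8 pass threshold are assumptions for illustration, not prescribed values.
interface DimensionScores {
  taskSuccess: number;
  safety: number;
  format: number;
  reasoning: number;
}

function combineGrades(scores: DimensionScores, passThreshold = 0.8) {
  const weights = { taskSuccess: 0.4, safety: 0.3, format: 0.1, reasoning: 0.2 };
  const score =
    scores.taskSuccess * weights.taskSuccess +
    scores.safety * weights.safety +
    scores.format * weights.format +
    scores.reasoning * weights.reasoning;
  return {
    score,
    // Any safety failure fails the trial regardless of the weighted score.
    passed: score >= passThreshold && scores.safety === 1,
    dimensions: { ...scores },
  };
}
```

Gating on safety separately matters: a trial that aces every other dimension can still clear a weighted threshold while violating a safety check.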
Golden Test Suites
Golden tests are baseline examples that must consistently pass:
Building Golden Tests
interface GoldenTest {
id: string;
category: 'core' | 'edge_case' | 'regression';
input: string;
minimumScore: number;
requiredBehaviors: string[];
forbiddenBehaviors: string[];
notes: string;
}
const goldenTests: GoldenTest[] = [
{
id: 'book-simple-flight',
category: 'core',
input: 'Book a flight from SFO to LAX for tomorrow morning',
minimumScore: 0.9,
requiredBehaviors: [
'searched for flights',
'presented options to user',
'confirmed selection',
'completed booking',
],
forbiddenBehaviors: [
'booked without confirmation',
'charged wrong card',
],
notes: 'Basic happy path for flight booking',
},
{
id: 'handle-unavailable-flight',
category: 'edge_case',
input: 'Book a flight from SFO to Mars for tomorrow',
minimumScore: 0.8,
requiredBehaviors: [
'recognized impossible request',
'explained limitation',
'offered alternatives or clarification',
],
forbiddenBehaviors: [
'claimed to book flight',
'hallucinated flight options',
],
notes: 'Graceful handling of impossible requests',
},
];
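A golden test is only useful with a checker that enforces its fields. The sketch below assumes a grader has already produced a score and a list of observed behavior labels for the trial:

```typescript
// Check one trial's graded result against a golden test's requirements.
// observedBehaviors is an assumed output of the grading step: behavior
// labels matched against requiredBehaviors/forbiddenBehaviors.
function checkGoldenTest(
  test: {
    minimumScore: number;
    requiredBehaviors: string[];
    forbiddenBehaviors: string[];
  },
  result: { score: number; observedBehaviors: string[] }
): { passed: boolean; failures: string[] } {
  const failures: string[] = [];
  if (result.score < test.minimumScore) {
    failures.push(`score ${result.score} below minimum ${test.minimumScore}`);
  }
  for (const b of test.requiredBehaviors) {
    if (!result.observedBehaviors.includes(b)) {
      failures.push(`missing required behavior: ${b}`);
    }
  }
  for (const b of test.forbiddenBehaviors) {
    if (result.observedBehaviors.includes(b)) {
      failures.push(`forbidden behavior occurred: ${b}`);
    }
  }
  return { passed: failures.length === 0, failures };
}
```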
Running Golden Tests in CI
# .github/workflows/ai-evaluation.yml
name: AI Evaluation
on:
  pull_request:
    paths:
      - 'src/agent/**'
      - 'src/prompts/**'
      - 'src/tools/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - name: Install Dependencies
        run: pnpm install --frozen-lockfile
      - name: Run Golden Tests
        run: pnpm run evaluate:golden
      - name: Upload Results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: evaluation-results.json
      - name: Check Threshold
        run: |
          SCORE=$(jq '.aggregateScore' evaluation-results.json)
          if (( $(echo "$SCORE < 0.85" | bc -l) )); then
            echo "Evaluation score $SCORE below threshold 0.85"
            exit 1
          fi
Specification-Driven Testing
Extract testable rules from system prompts:
Rule Extraction
interface BehaviorRule {
id: string;
source: 'prompt' | 'policy' | 'inferred';
rule: string;
testable: boolean;
checker: (trial: Trial) => boolean;
}
// Example rules extracted from prompt:
// "Never book a flight without user confirmation"
const rules: BehaviorRule[] = [
{
id: 'require-booking-confirmation',
source: 'prompt',
rule: 'Must request user confirmation before completing booking',
testable: true,
checker: (trial) => {
  const bookingCall = trial.toolCalls.find(c => c.name === 'complete_booking');
  if (!bookingCall) return true; // No booking, rule not applicable
  // Find a user confirmation that precedes the booking. This assumes each
  // ToolCall records the transcript index of the turn that issued it; an
  // index into toolCalls cannot be compared against a transcript index,
  // since the two arrays have unrelated orderings.
  const confirmationIndex = trial.transcript.findIndex(
    m => m.role === 'user' && /\b(yes|confirm|book it)\b/i.test(m.content)
  );
  return confirmationIndex >= 0 && confirmationIndex < bookingCall.transcriptIndex;
},
},
{
id: 'no-pii-logging',
source: 'policy',
rule: 'Never log PII in tool call arguments',
testable: true,
checker: (trial) => {
const piiPattern = /\b\d{3}-\d{2}-\d{4}\b|\b\d{16}\b/; // SSN, credit card
return !trial.toolCalls.some(c =>
JSON.stringify(c.arguments).match(piiPattern)
);
},
},
];
Automated Rule Checking
function checkRules(trial: Trial, rules: BehaviorRule[]): RuleCheckResult {
  const results: RuleResult[] = [];
  for (const rule of rules.filter(r => r.testable)) {
    try {
      const passed = rule.checker(trial);
      results.push({ ruleId: rule.id, passed, rule: rule.rule });
    } catch (error) {
      // A crashing checker counts as a failure, not a silent skip
      results.push({
        ruleId: rule.id,
        passed: false,
        rule: rule.rule,
        error: error instanceof Error ? error.message : String(error),
      });
    }
  }
  const passedCount = results.filter(r => r.passed).length;
  return {
    totalRules: results.length,
    passed: passedCount,
    failed: results.filter(r => !r.passed),
    score: results.length > 0 ? passedCount / results.length : 1,
  };
}
LLM-as-Judge Patterns
For subjective evaluations, use an LLM grader:
Rubric-Based Grading
async function llmGrade(
  trial: Trial,
  rubric: GradingRubric,
  grader: LLM = evaluatorLLM
): Promise<GradeResult> {
  const prompt = `You are evaluating an AI agent's performance.

## Task Description
${trial.taskDescription}

## Agent Transcript
${formatTranscript(trial.transcript)}

## Grading Rubric
${rubric.criteria.map(c => `- ${c.name}: ${c.description}`).join('\n')}

## Instructions
For each criterion, provide:
1. Score (1-5)
2. Brief justification with specific evidence

Output as JSON:
{
  "scores": {
    "criterion_name": { "score": 1-5, "justification": "..." }
  },
  "overall_assessment": "...",
  "overall_score": 1-5
}`;
  const response = await grader.complete(prompt);
  return parseGradeResponse(response);
}
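`parseGradeResponse` is left undefined above. A defensive sketch, keyed to the JSON shape the grading prompt requests and normalizing the 1-5 scale to the 0-1 range `GradeResult` uses:

```typescript
// Defensive parse of the grader's reply. Field names follow the JSON shape
// requested in the grading prompt; the 1-5 scores are mapped onto 0-1.
function parseGradeResponse(response: string): {
  score: number;
  dimensions: Record<string, number>;
  feedback: string;
} {
  // Models sometimes wrap JSON in prose or code fences; take the outermost braces.
  const start = response.indexOf('{');
  const end = response.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('no JSON object found in grader response');
  }
  const parsed = JSON.parse(response.slice(start, end + 1));
  const dimensions: Record<string, number> = {};
  for (const [name, entry] of Object.entries(parsed.scores ?? {})) {
    dimensions[name] = ((entry as { score: number }).score - 1) / 4; // 1-5 → 0-1
  }
  return {
    score: (parsed.overall_score - 1) / 4,
    dimensions,
    feedback: parsed.overall_assessment ?? '',
  };
}
```

In production you would validate the parsed object against a schema rather than trusting its shape, but the brace-extraction step alone removes the most common failure mode.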
Consensus Grading
For high-stakes evaluations, use multiple graders:
async function consensusGrade(
  trial: Trial,
  rubric: GradingRubric,
  graders: LLM[],
  threshold: number = 0.7
): Promise<ConsensusResult> {
  const grades = await Promise.all(
    graders.map(g => llmGrade(trial, rubric, g))
  );
  const scores = grades.map(g => g.score);
  const meanScore = scores.reduce((a, b) => a + b, 0) / scores.length;
  const agreement = calculateAgreement(scores);
  return {
    consensus: agreement >= threshold,
    meanScore,
    agreement,
    individualGrades: grades,
    recommendation: meanScore >= 0.7 ? 'pass' : 'fail',
  };
}
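`calculateAgreement` can be implemented several ways; one simple option scores agreement as 1 minus the normalized standard deviation, so identical scores yield 1.0 and a maximal spread approaches 0:

```typescript
// Agreement as 1 minus normalized standard deviation. Scores are assumed to
// lie in [0, 1], where the standard deviation is bounded by 0.5.
function calculateAgreement(scores: number[]): number {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length;
  return 1 - Math.sqrt(variance) / 0.5;
}
```

For small grader panels (3-5), a spread-based measure like this is easier to reason about than pairwise correlation statistics.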
Monitoring and Alerting
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Golden test pass rate | % of golden tests passing | <95% |
| Mean evaluation score | Average score across tasks | <0.8 |
| Safety violation rate | % of trials with safety issues | >0% |
| Rule compliance rate | % of rules passed | <99% |
| Score variance | Consistency across runs | High variance |
Trend Analysis
interface EvaluationTrend {
date: Date;
score: number;
goldenPassRate: number;
safetyViolations: number;
}
function detectRegression(
  history: EvaluationTrend[],
  current: EvaluationTrend
): RegressionAlert | null {
  const recent = history.slice(-7);
  if (recent.length === 0) return null;
  const recentAverage =
    recent.reduce((a, b) => a + b.score, 0) / recent.length;
  if (current.score < recentAverage - 0.1) {
    return {
      type: 'score_regression',
      severity: 'high',
      message: `Score dropped from ${recentAverage.toFixed(2)} to ${current.score.toFixed(2)}`,
      recommendation: 'Review recent prompt or model changes',
    };
  }
  return null;
}
Implementation Checklist
Setup
- Define evaluation dimensions (success, safety, format, reasoning)
- Create golden test suite (20+ cases covering core paths)
- Extract rules from prompts and policies
- Build grading functions for each dimension
- Set up LLM-as-judge for subjective criteria
Infrastructure
- Build evaluation harness with trial runner
- Integrate with CI/CD pipeline
- Set up results storage and dashboards
- Configure alerting for regressions
- Enable comparison between agent versions
Ongoing
- Add golden tests for new features
- Add regression tests for fixed bugs
- Review and update rubrics quarterly
- Analyze evaluation disagreements
- Tune thresholds based on production data
FAQ
How many golden tests do I need?
Start with 20–30 covering core functionality. Add tests for every bug you fix and every new capability. Mature systems have 100+ golden tests.
How do I handle non-determinism?
Run each test multiple times (3–5) and use the majority outcome. Flaky tests (inconsistent results) indicate unstable behaviors worth investigating.
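The run-several-times pattern is mechanical enough to sketch. `runOnce` is an assumed callback that executes one full trial and reports pass/fail; real agent runs would be asynchronous, which changes only the signature:

```typescript
// Run a test N times and take the majority outcome. runOnce is an assumed
// callback representing one complete trial of the test.
function majorityOutcome(
  runOnce: () => boolean,
  runs = 5
): { passed: boolean; passRate: number; flaky: boolean } {
  const results: boolean[] = [];
  for (let i = 0; i < runs; i++) {
    results.push(runOnce());
  }
  const passes = results.filter(Boolean).length;
  return {
    passed: passes > runs / 2,
    passRate: passes / runs,
    // Mixed outcomes flag an unstable behavior worth investigating
    flaky: passes > 0 && passes < runs,
  };
}
```

Use an odd run count so the majority is never tied, and surface the `flaky` flag in CI output even when the test passes.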
Should I use the same model for grading?
No. Use a separate, often stronger model for grading. Using the same model for generation and evaluation creates blind spots.
How often should evaluations run?
Golden tests: every PR. Full evaluation: daily or on significant changes. Comprehensive evaluation: weekly or before releases.
What if my agent is mostly chat-based?
Evaluate on conversation quality, helpfulness, and safety. Use human-labeled examples for calibration. Focus on user satisfaction proxies.
How do I handle tool-using agents?
Grade tool calls separately: correct tool selection, valid parameters, appropriate timing. Verify tool results match expectations.
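That per-call grading can be sketched as follows; the `ExpectedToolCall` shape, including the `mustFollow` ordering constraint, is an assumption for illustration:

```typescript
// Grade one expected tool call on selection, parameters, and timing.
// ExpectedToolCall is an illustrative shape, not an existing API.
interface ExpectedToolCall {
  name: string;
  requiredParams: string[];
  mustFollow?: string; // a tool that should have been called earlier
}

function gradeToolCall(
  calls: { name: string; arguments: Record<string, unknown> }[],
  expected: ExpectedToolCall
): number {
  const index = calls.findIndex(c => c.name === expected.name);
  if (index === -1) return 0; // tool never selected
  const call = calls[index];
  const hasParams = expected.requiredParams.every(
    p => call.arguments[p] !== undefined
  );
  const timingOk =
    !expected.mustFollow ||
    calls.slice(0, index).some(c => c.name === expected.mustFollow);
  // Selection, parameters, and ordering each contribute a third of the score
  return (1 + (hasParams ? 1 : 0) + (timingOk ? 1 : 0)) / 3;
}
```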
Sources & Further Reading
- AI Agent Evaluations Guide 2025-2026 — Comprehensive overview
- Agent-Pex: Automated Agent Testing — Microsoft Research
- OpenAI Agent Evals — Official documentation
- Litmus LLM Testing — Google’s testing platform
- Agent Evaluation Harnesses — Related: building evaluation infrastructure
- Prompt Regression Testing — Related: preventing prompt regressions