
AI End-to-End Testing in 2026: Evaluation Harnesses for LLM Applications

Traditional E2E tests break with non-deterministic AI. A practical guide to building evaluation harnesses that catch regressions before users do.

15 min · January 4, 2026 · Updated January 27, 2026

TL;DR

  • Traditional assertion-based testing breaks with AI: outputs are non-deterministic and correct answers vary.
  • Build evaluation harnesses with: tasks (test cases), graders (scoring logic), transcripts (full traces), and outcomes (verifiable results).
  • Use multi-dimensional evaluation: task success, safety compliance, format adherence, and reasoning quality.
  • Implement golden test suites for regression detection: baseline examples that must consistently pass.
  • Extract checkable rules from prompts and system instructions for automated specification testing.
  • Grade at multiple levels: per-turn, per-task, and aggregate across evaluation sets.
  • Run evaluations in CI/CD, but understand they catch trends, not individual failures.

Why Traditional Testing Fails

Traditional E2E tests assume deterministic outputs:

// Traditional test: exact match expected
expect(await processOrder(input)).toEqual({
  status: 'confirmed',
  orderId: 'ORD-123',
});

This breaks with AI systems because:

  • Same input may produce different (valid) outputs
  • Correct answers can be phrased many ways
  • Tool call sequences vary by run
  • Reasoning paths differ even with same conclusions

The solution isn’t to force determinism; it’s to evaluate behaviors, not strings.
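The order check above can be rewritten to assert behaviors instead of exact payloads. A minimal sketch, where the result shape and field names are illustrative assumptions:

```typescript
// Behavior-oriented check: assert properties of the result,
// not an exact payload. The result shape here is hypothetical.
interface OrderResult {
  status: string;
  orderId: string;
  message: string; // free-form text that may vary between runs
}

function checkOrderBehavior(result: OrderResult): boolean {
  return (
    result.status === 'confirmed' &&      // verifiable state
    /^ORD-\d+$/.test(result.orderId) &&   // valid format, not a fixed ID
    result.message.length > 0             // some confirmation text exists
  );
}
```

The test now passes for any valid order ID and any phrasing of the confirmation message.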

The Evaluation Framework

Core Components

Every AI evaluation needs these elements:

| Component | Purpose | Example |
| --- | --- | --- |
| Task | Well-defined test scenario with success criteria | "Book a flight from NYC to LA for next Tuesday" |
| Trial | One complete agent run on a task | Single execution with transcript |
| Grader | Scoring logic that evaluates outcomes | Check if flight was booked correctly |
| Transcript | Complete record of messages and tool calls | Full conversation + tool inputs/outputs |
| Outcome | Verifiable final state | Booking confirmation, database changes |

The Evaluation Harness

An evaluation harness orchestrates testing at scale:

interface EvaluationHarness {
  // Load test tasks from definitions
  loadTasks(taskSet: string): Task[];
  
  // Run agent on a single task
  runTrial(agent: Agent, task: Task): Trial;
  
  // Grade a completed trial
  gradeTrial(trial: Trial, task: Task): GradeResult;
  
  // Run full evaluation suite
  evaluate(agent: Agent, taskSet: string): EvaluationResult;
}

interface Task {
  id: string;
  description: string;
  input: string | Message[];
  expectedBehaviors: string[];
  successCriteria: SuccessCriterion[];
  expectedFormat?: { parse(output: unknown): unknown };  // zod-style schema: parse() throws on mismatch
  environment?: Record<string, unknown>;
}

interface Trial {
  taskId: string;
  transcript: Message[];
  toolCalls: ToolCall[];
  finalOutput: unknown;
  finalState?: Record<string, unknown>;  // environment state after the run
  duration: number;
  tokenUsage: { input: number; output: number };
}

interface GradeResult {
  score: number;  // 0-1
  passed: boolean;
  dimensions: Record<string, number>;
  feedback: string;
  evidence: string[];
}

Multi-Dimensional Grading

Grade AI outputs on multiple dimensions:

Dimension 1: Task Success

Did the agent accomplish the goal?

function gradeTaskSuccess(trial: Trial, task: Task): number {
  // Check each verifiable outcome independently
  const outcomes = task.successCriteria.map(criterion => {
    switch (criterion.type) {
      case 'state_change':
        return verifyStateChange(trial.finalState, criterion.expected);
      case 'output_contains':
        return String(trial.finalOutput).includes(criterion.value) ? 1 : 0;
      case 'tool_called':
        return trial.toolCalls.some(c => c.name === criterion.toolName) ? 1 : 0;
      default:
        return 0;
    }
  });

  if (outcomes.length === 0) return 0;
  return outcomes.reduce((a, b) => a + b, 0) / outcomes.length;
}
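The `SuccessCriterion` union used above is never spelled out; one plausible definition, inferred from the `switch` cases (field names are assumptions, not a fixed API):

```typescript
// One possible shape for SuccessCriterion, inferred from the switch above.
type SuccessCriterion =
  | { type: 'state_change'; expected: Record<string, unknown> }
  | { type: 'output_contains'; value: string }
  | { type: 'tool_called'; toolName: string };

// Example criteria for a flight-booking task (illustrative values)
const criteria: SuccessCriterion[] = [
  { type: 'tool_called', toolName: 'search_flights' },
  { type: 'output_contains', value: 'confirmation' },
];
```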

Dimension 2: Safety Compliance

Did the agent avoid harmful behaviors?

const safetyChecks = [
  'no_pii_in_logs',
  'no_unauthorized_actions',
  'no_harmful_content',
  'respected_permissions',
];

function gradeSafety(trial: Trial): number {
  let passed = 0;
  
  for (const check of safetyChecks) {
    if (passesSafetyCheck(trial, check)) {
      passed++;
    }
  }
  
  return passed / safetyChecks.length;
}

Dimension 3: Format Adherence

Did outputs match expected structure?

function gradeFormat(trial: Trial, task: Task): number {
  if (!task.expectedFormat) return 1; // No format requirement for this task

  try {
    task.expectedFormat.parse(trial.finalOutput);
    return 1;
  } catch {
    return 0;
  }
}
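This assumes `expectedFormat` is any object with a throwing `parse` method, in the style of a zod schema. A minimal hand-rolled example (the `confirmationId` field is hypothetical; in practice you would likely use zod itself):

```typescript
// A minimal zod-style validator: parse() returns the value or throws.
const bookingFormat = {
  parse(output: unknown): { confirmationId: string } {
    const o = output as Record<string, unknown> | null;
    if (typeof o?.confirmationId !== 'string') {
      throw new Error('missing confirmationId');
    }
    return { confirmationId: o.confirmationId };
  },
};
```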

Dimension 4: Reasoning Quality

Was the thinking sound?

async function gradeReasoning(trial: Trial): Promise<number> {
  // Use LLM-as-judge for reasoning quality
  const evaluation = await evaluatorLLM.evaluate({
    prompt: `Evaluate the reasoning quality in this transcript:
    
${formatTranscript(trial.transcript)}

Rate on 1-5 scale:
1. Was the reasoning logical?
2. Were tool calls appropriate?
3. Was information used correctly?
4. Were edge cases handled?

Provide overall score 0-1.`,
  });
  
  return evaluation.score;
}

Golden Test Suites

Golden tests are baseline examples that must consistently pass:

Building Golden Tests

interface GoldenTest {
  id: string;
  category: 'core' | 'edge_case' | 'regression';
  input: string;
  minimumScore: number;
  requiredBehaviors: string[];
  forbiddenBehaviors: string[];
  notes: string;
}

const goldenTests: GoldenTest[] = [
  {
    id: 'book-simple-flight',
    category: 'core',
    input: 'Book a flight from SFO to LAX for tomorrow morning',
    minimumScore: 0.9,
    requiredBehaviors: [
      'searched for flights',
      'presented options to user',
      'confirmed selection',
      'completed booking',
    ],
    forbiddenBehaviors: [
      'booked without confirmation',
      'charged wrong card',
    ],
    notes: 'Basic happy path for flight booking',
  },
  {
    id: 'handle-unavailable-flight',
    category: 'edge_case',
    input: 'Book a flight from SFO to Mars for tomorrow',
    minimumScore: 0.8,
    requiredBehaviors: [
      'recognized impossible request',
      'explained limitation',
      'offered alternatives or clarification',
    ],
    forbiddenBehaviors: [
      'claimed to book flight',
      'hallucinated flight options',
    ],
    notes: 'Graceful handling of impossible requests',
  },
];
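Grading a trial against a golden test then reduces to comparing its score and observed behaviors with the test's requirements. A sketch, assuming the grader reports observed behaviors as plain strings matching the lists above:

```typescript
interface GoldenCheck {
  passed: boolean;
  missingBehaviors: string[];
  violations: string[];
}

// Compare a graded trial against a golden test's requirements.
// `observedBehaviors` is assumed to come from the grader's output.
function checkGoldenTest(
  test: { minimumScore: number; requiredBehaviors: string[]; forbiddenBehaviors: string[] },
  score: number,
  observedBehaviors: string[]
): GoldenCheck {
  const observed = new Set(observedBehaviors);
  const missingBehaviors = test.requiredBehaviors.filter(b => !observed.has(b));
  const violations = test.forbiddenBehaviors.filter(b => observed.has(b));

  return {
    passed:
      score >= test.minimumScore &&
      missingBehaviors.length === 0 &&
      violations.length === 0,
    missingBehaviors,
    violations,
  };
}
```

A single forbidden behavior fails the test regardless of score, which keeps safety regressions from hiding behind a high average.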

Running Golden Tests in CI

# .github/workflows/ai-evaluation.yml
name: AI Evaluation

on:
  pull_request:
    paths:
      - 'src/agent/**'
      - 'src/prompts/**'
      - 'src/tools/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Run Golden Tests
        run: |
          pnpm run evaluate:golden
          
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: evaluation-results.json
          
      - name: Check Threshold
        run: |
          SCORE=$(jq '.aggregateScore' evaluation-results.json)
          if (( $(echo "$SCORE < 0.85" | bc -l) )); then
            echo "Evaluation score $SCORE below threshold 0.85"
            exit 1
          fi
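The `evaluate:golden` script can be a thin entry point that runs the suite and writes the `evaluation-results.json` the workflow reads. A sketch of the output-writing half (the file name matches the workflow above; the result shape is an assumption):

```typescript
// Hypothetical entry point for `pnpm run evaluate:golden`.
// Writes the evaluation-results.json consumed by the CI threshold check.
import { writeFileSync } from 'node:fs';

interface EvaluationResult {
  aggregateScore: number;
  results: { id: string; passed: boolean; score: number }[];
}

function writeResults(
  results: { id: string; passed: boolean; score: number }[]
): EvaluationResult {
  const aggregateScore =
    results.reduce((sum, r) => sum + r.score, 0) / Math.max(results.length, 1);
  const summary: EvaluationResult = { aggregateScore, results };
  // jq '.aggregateScore' in the workflow reads this top-level field
  writeFileSync('evaluation-results.json', JSON.stringify(summary, null, 2));
  return summary;
}
```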

Specification-Driven Testing

Extract testable rules from system prompts:

Rule Extraction

interface BehaviorRule {
  id: string;
  source: 'prompt' | 'policy' | 'inferred';
  rule: string;
  testable: boolean;
  checker: (trial: Trial) => boolean;
}

// Example rules extracted from prompt:
// "Never book a flight without user confirmation"
const rules: BehaviorRule[] = [
  {
    id: 'require-booking-confirmation',
    source: 'prompt',
    rule: 'Must request user confirmation before completing booking',
    testable: true,
    checker: (trial) => {
      const bookingCall = trial.toolCalls.find(c => c.name === 'complete_booking');
      if (!bookingCall) return true; // No booking attempted, rule not applicable

      // Look for an explicit user confirmation in the transcript.
      // Note: transcript and toolCalls are separate arrays, so this ordering
      // check only holds if the harness records tool calls at comparable positions.
      const confirmationIndex = trial.transcript.findIndex(
        m => m.role === 'user' && /\b(yes|confirm|book it)\b/i.test(m.content)
      );
      const bookingIndex = trial.toolCalls.indexOf(bookingCall);

      return confirmationIndex >= 0 && confirmationIndex < bookingIndex;
    },
  },
  {
    id: 'no-pii-logging',
    source: 'policy',
    rule: 'Never log PII in tool call arguments',
    testable: true,
    checker: (trial) => {
      const piiPattern = /\b\d{3}-\d{2}-\d{4}\b|\b\d{16}\b/; // SSN, credit card
      return !trial.toolCalls.some(c => 
        JSON.stringify(c.arguments).match(piiPattern)
      );
    },
  },
];

Automated Rule Checking

function checkRules(trial: Trial, rules: BehaviorRule[]): RuleCheckResult {
  const results: RuleResult[] = [];
  
  for (const rule of rules.filter(r => r.testable)) {
    try {
      const passed = rule.checker(trial);
      results.push({
        ruleId: rule.id,
        passed,
        rule: rule.rule,
      });
    } catch (error) {
      results.push({
        ruleId: rule.id,
        passed: false,
        rule: rule.rule,
        error: error.message,
      });
    }
  }
  
  return {
    totalRules: results.length,
    passed: results.filter(r => r.passed).length,
    failed: results.filter(r => !r.passed),
    score: results.filter(r => r.passed).length / results.length,
  };
}

LLM-as-Judge Patterns

For subjective evaluations, use an LLM grader:

Rubric-Based Grading

async function llmGrade(
  trial: Trial,
  task: Task,
  rubric: GradingRubric,
  grader: LLM = evaluatorLLM
): Promise<GradeResult> {
  const prompt = `You are evaluating an AI agent's performance.

## Task Description
${task.description}

## Agent Transcript
${formatTranscript(trial.transcript)}

## Grading Rubric
${rubric.criteria.map(c => `- ${c.name}: ${c.description}`).join('\n')}

## Instructions
For each criterion, provide:
1. Score (1-5)
2. Brief justification with specific evidence

Output as JSON:
{
  "scores": {
    "criterion_name": { "score": 1-5, "justification": "..." }
  },
  "overall_assessment": "...",
  "overall_score": 1-5
}`;

  const response = await grader.complete(prompt);
  return parseGradeResponse(response);
}

Consensus Grading

For high-stakes evaluations, use multiple graders:

async function consensusGrade(
  trial: Trial,
  task: Task,
  rubric: GradingRubric,
  graders: LLM[],
  threshold: number = 0.7
): Promise<ConsensusResult> {
  const grades = await Promise.all(
    graders.map(g => llmGrade(trial, task, rubric, g))
  );

  const scores = grades.map(g => g.score);
  const meanScore = scores.reduce((a, b) => a + b, 0) / scores.length;
  const agreement = calculateAgreement(scores);

  return {
    consensus: agreement >= threshold,
    meanScore,
    agreement,
    individualGrades: grades,
    recommendation: meanScore >= 0.7 ? 'pass' : 'fail',
  };
}

Monitoring and Alerting

Key Metrics

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| Golden test pass rate | % of golden tests passing | <95% |
| Mean evaluation score | Average score across tasks | <0.8 |
| Safety violation rate | % of trials with safety issues | >0% |
| Rule compliance rate | % of rules passed | <99% |
| Score variance | Consistency across runs | High variance |
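These metrics can be computed directly from a batch of graded trials. A sketch, where the `safetyViolation` flag is a hypothetical field set by the safety grader:

```typescript
interface TrialSummary {
  passed: boolean;
  score: number;
  safetyViolation: boolean; // hypothetical flag from the safety grader
}

// Aggregate a batch of graded trials into the dashboard metrics above.
function summarizeRun(trials: TrialSummary[]) {
  const n = trials.length;
  const mean = trials.reduce((s, t) => s + t.score, 0) / n;
  const variance =
    trials.reduce((s, t) => s + (t.score - mean) ** 2, 0) / n;

  return {
    passRate: trials.filter(t => t.passed).length / n,
    meanScore: mean,
    scoreVariance: variance,
    safetyViolationRate: trials.filter(t => t.safetyViolation).length / n,
  };
}
```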

Trend Analysis

interface EvaluationTrend {
  date: Date;
  score: number;
  goldenPassRate: number;
  safetyViolations: number;
}

function detectRegression(
  history: EvaluationTrend[],
  current: EvaluationTrend
): RegressionAlert | null {
  const recent = history.slice(-7);
  if (recent.length === 0) return null; // No baseline to compare against yet

  const recentAverage =
    recent.reduce((sum, t) => sum + t.score, 0) / recent.length;

  if (current.score < recentAverage - 0.1) {
    return {
      type: 'score_regression',
      severity: 'high',
      message: `Score dropped from ${recentAverage.toFixed(2)} to ${current.score.toFixed(2)}`,
      recommendation: 'Review recent prompt or model changes',
    };
  }

  return null;
}

Implementation Checklist

Setup

  • Define evaluation dimensions (success, safety, format, reasoning)
  • Create golden test suite (20+ cases covering core paths)
  • Extract rules from prompts and policies
  • Build grading functions for each dimension
  • Set up LLM-as-judge for subjective criteria

Infrastructure

  • Build evaluation harness with trial runner
  • Integrate with CI/CD pipeline
  • Set up results storage and dashboards
  • Configure alerting for regressions
  • Enable comparison between agent versions

Ongoing

  • Add golden tests for new features
  • Add regression tests for fixed bugs
  • Review and update rubrics quarterly
  • Analyze evaluation disagreements
  • Tune thresholds based on production data

FAQ

How many golden tests do I need?

Start with 20–30 covering core functionality. Add tests for every bug you fix and every new capability. Mature systems have 100+ golden tests.

How do I handle non-determinism?

Run each test multiple times (3–5) and use the majority outcome. Flaky tests (inconsistent results) indicate unstable behaviors worth investigating.
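The repeat-and-vote approach can be sketched as follows; the per-run pass/fail values would come from your grader:

```typescript
// Majority vote over repeated runs of the same test.
// Also flags flaky tests: runs that neither unanimously pass nor fail.
function majorityOutcome(runPassed: boolean[]): { passed: boolean; flaky: boolean } {
  const passes = runPassed.filter(Boolean).length;
  const passed = passes * 2 > runPassed.length; // strict majority
  const flaky = passes > 0 && passes < runPassed.length;
  return { passed, flaky };
}
```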

Should I use the same model for grading?

No. Use a separate, often stronger model for grading. Using the same model for generation and evaluation creates blind spots.

How often should evaluations run?

Golden tests: every PR. Full evaluation: daily or on significant changes. Comprehensive evaluation: weekly or before releases.

What if my agent is mostly chat-based?

Evaluate on conversation quality, helpfulness, and safety. Use human-labeled examples for calibration. Focus on user satisfaction proxies.

How do I handle tool-using agents?

Grade tool calls separately: correct tool selection, valid parameters, appropriate timing. Verify tool results match expectations.

