TnsAI

Evaluation

The Evaluation module provides a three-layer system for measuring agent quality: evaluators that score responses, a benchmark engine that runs test datasets, and reporting tools for quality gates, trend analysis, and regression detection.

Quick Start

This example shows the core evaluation workflow: define a test dataset, run it through your agent with multiple evaluators, and check the results against a quality gate.

// Define test dataset
TestDataset dataset = TestDataset.fromClasspath("qa-benchmark.json");

// Build benchmark runner
BenchmarkRunner runner = BenchmarkRunner.builder()
    .evaluator(new ResponseQualityEvaluator())
    .evaluator(new ToolSelectionEvaluator())
    .evaluator(new CostEfficiencyEvaluator(0.10, 30_000))
    .agentFunction(testCase -> myAgent.chat(testCase.getInput()))
    .parallelism(4)
    .build();

// Run benchmark
BenchmarkResult result = runner.run(dataset);

// Check quality gate
QualityGate gate = new QualityGate(QualityGateConfig.builder()
    .defaultThreshold(0.7)
    .maxRegressionPercent(5.0)
    .build());
QualityGate.Verdict verdict = gate.evaluate(result);
System.out.println(verdict.summary());

Evaluators

Evaluators are the scoring components that measure specific aspects of agent quality. All evaluators implement the Evaluator SPI interface and produce normalized scores in the [0.0, 1.0] range, making them directly comparable.

Response Quality

The ResponseQualityEvaluator measures how well the agent's response matches the expected output by checking keyword overlap, sentence coverage, and precision. This is the most general-purpose evaluator.

Evaluator eval = new ResponseQualityEvaluator();
EvaluationResult result = eval.evaluate(input);
// Metrics: relevance, completeness, accuracy

Scoring weights:

  • Keyword overlap: 40% (relevance)
  • Sentence coverage: 35% (completeness)
  • Precision: 25% (accuracy)
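The weighting above is a linear combination of the three sub-scores. The sketch below shows how such a combination would work; ResponseQualityWeights is a hypothetical helper written for illustration, not part of the library:

```java
public class ResponseQualityWeights {

    // Combine the three documented sub-scores with the 40/35/25 weights.
    // Each sub-score is assumed to already be normalized to [0.0, 1.0].
    public static double combine(double keywordOverlap,
                                 double sentenceCoverage,
                                 double precision) {
        return 0.40 * keywordOverlap
             + 0.35 * sentenceCoverage
             + 0.25 * precision;
    }
}
```

Because all sub-scores are in [0.0, 1.0] and the weights sum to 1.0, the combined score stays in [0.0, 1.0] and remains comparable across evaluators.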

Tool Selection

The ToolSelectionEvaluator checks whether the agent called the correct tools, whether it missed any expected tools, and whether it called them in the right order. This is essential for evaluating agents that use tools as part of their reasoning.

Evaluator eval = new ToolSelectionEvaluator();
EvaluationResult result = eval.evaluate(input);
// Metrics: precision, recall, order, expected_count, actual_count

Scoring weights:

  • Tool correctness (precision): 30%
  • Tool recall: 40%
  • Execution order (LCS matching): 30%
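The execution-order component can be sketched with a standard longest-common-subsequence (LCS) computation. ToolOrderScore is an illustrative helper, and normalizing by the expected sequence length is an assumption, not necessarily the library's exact formula:

```java
import java.util.List;

public class ToolOrderScore {

    // Score how well the actual tool-call order matches the expected order:
    // LCS length between the two sequences, divided by the expected length.
    public static double orderScore(List<String> expected, List<String> actual) {
        if (expected.isEmpty()) {
            return 1.0; // nothing expected, order is trivially correct
        }
        int[][] lcs = new int[expected.size() + 1][actual.size() + 1];
        for (int i = 1; i <= expected.size(); i++) {
            for (int j = 1; j <= actual.size(); j++) {
                lcs[i][j] = expected.get(i - 1).equals(actual.get(j - 1))
                        ? lcs[i - 1][j - 1] + 1
                        : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
            }
        }
        return (double) lcs[expected.size()][actual.size()] / expected.size();
    }
}
```

LCS tolerates insertions and omissions while still rewarding correct relative ordering, which is why it suits tool-sequence comparison better than exact position matching.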

Instruction Following

The InstructionFollowingEvaluator checks whether the agent followed specific instructions in the prompt, such as output format requirements (JSON, bullet points), length limits, and required/forbidden keywords.

Evaluator eval = new InstructionFollowingEvaluator();
EvaluationResult result = eval.evaluate(input);
// Metrics: checks_total, checks_passed, format_*, keyword_coverage

Checks for: format constraints (JSON, bullet points), word/character limits, required/forbidden keywords.
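The keyword checks can be sketched as simple substring tests over the response. KeywordChecks is a hypothetical helper for illustration; the evaluator's actual matching rules (e.g. tokenization, stemming) may differ:

```java
import java.util.List;

public class KeywordChecks {

    // Fraction of required keywords that appear in the response
    // (case-insensitive substring match -- an assumption for this sketch).
    public static double requiredCoverage(String response, List<String> required) {
        if (required.isEmpty()) {
            return 1.0;
        }
        String lower = response.toLowerCase();
        long hits = required.stream()
                .filter(k -> lower.contains(k.toLowerCase()))
                .count();
        return (double) hits / required.size();
    }

    // True if any forbidden keyword appears in the response.
    public static boolean containsForbidden(String response, List<String> forbidden) {
        String lower = response.toLowerCase();
        return forbidden.stream().anyMatch(k -> lower.contains(k.toLowerCase()));
    }
}
```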

Cost Efficiency

The CostEfficiencyEvaluator measures whether the agent stayed within cost and latency budgets. This is important for production deployments where you need to balance quality against operational costs.

// $0.10 cost budget, 30s latency budget
Evaluator eval = new CostEfficiencyEvaluator(0.10, 30_000);
EvaluationResult result = eval.evaluate(input);
// Metrics: cost_score, token_efficiency, latency_score, cost_usd, latency_ms

Scoring weights:

  • Cost vs budget: 40%
  • Token efficiency (output/input ratio): 30%
  • Latency vs budget: 30%
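One way the 40/30/30 weighting could combine budget usage into a score is sketched below. CostEfficiencyWeights is an illustrative helper, and the linear ramps (score 1.0 at zero usage, 0.0 at budget) are an assumption, not the library's documented formula:

```java
public class CostEfficiencyWeights {

    // Combine cost, token efficiency, and latency into one score using
    // the documented 40/30/30 weights. Sub-scores are clamped to [0, 1].
    public static double combine(double costUsd, double costBudgetUsd,
                                 double tokenEfficiency,
                                 long latencyMs, long latencyBudgetMs) {
        double costScore = clamp(1.0 - costUsd / costBudgetUsd);
        double latencyScore = clamp(1.0 - (double) latencyMs / latencyBudgetMs);
        return 0.40 * costScore
             + 0.30 * clamp(tokenEfficiency)
             + 0.30 * latencyScore;
    }

    private static double clamp(double v) {
        return Math.max(0.0, Math.min(1.0, v));
    }
}
```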

Test Datasets

Test datasets are collections of test cases that define inputs, expected outputs, and evaluation criteria. You define them in JSON and load them at benchmark time.

{
  "name": "QA Benchmark",
  "testCases": [
    {
      "id": "tc-001",
      "input": "What is the capital of France?",
      "expectedOutput": "Paris is the capital of France.",
      "expectedToolSequence": ["search"],
      "instructions": "Answer in one sentence.",
      "tags": ["geography", "simple"]
    }
  ]
}

Loading Datasets

Datasets can be loaded from the filesystem or classpath, and filtered by tag to run subsets of your test suite.

// From file
TestDataset dataset = TestDataset.fromFile(Path.of("benchmarks/qa.json"));

// From classpath
TestDataset dataset = TestDataset.fromClasspath("benchmarks/qa.json");

// Filter by tag
TestDataset filtered = dataset.filterByTag("geography");

Benchmark Runner

The BenchmarkRunner orchestrates the entire evaluation process: it takes a dataset, runs each test case through your agent function, collects the responses, and scores them with every registered evaluator.

Building

Configure the runner with evaluators, your agent function, and a parallelism level for concurrent execution.

BenchmarkRunner runner = BenchmarkRunner.builder()
    .evaluator(new ResponseQualityEvaluator())
    .evaluator(new ToolSelectionEvaluator())
    .evaluator(new InstructionFollowingEvaluator())
    .evaluator(new CostEfficiencyEvaluator(0.10, 30_000))
    .agentFunction(testCase -> {
        long start = System.currentTimeMillis();
        String response = agent.chat(testCase.getInput());
        long latency = System.currentTimeMillis() - start;
        return new AgentOutput(response, latency, 0.005, List.of("search"));
    })
    .parallelism(4)  // Run 4 test cases in parallel
    .build();

Running

After running a benchmark, you can inspect per-case scores and aggregate statistics across all evaluators.

BenchmarkResult result = runner.run(dataset);

// Per-case results
for (var caseResult : result.getCaseResults()) {
    System.out.printf("%s: %.2f%n", caseResult.testCaseId(), caseResult.overallScore());
}

// Aggregate scores
Map<String, Map<String, Double>> aggregates = result.getAggregateScores();
// e.g., {"response_quality": {"mean": 0.85, "median": 0.87, "p95": 0.95, "stddev": 0.08}}

Statistical Analysis

The StatisticalAnalyzer computes per-evaluator statistics so you can understand the distribution of scores, not just averages.

Metric                  Description
mean                    Average score across all test cases
median                  Middle value
p5 / p95                5th and 95th percentiles
min / max               Score range
stddev                  Standard deviation
CI95Lower / CI95Upper   95% confidence interval
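The statistics in the table are standard. The sketch below shows how mean, standard deviation, and the 95% confidence interval would be computed for a score sample; ScoreStats is a hypothetical helper, and the normal approximation (mean ± 1.96 · stddev / √n) is an assumption about how CI95 is derived:

```java
import java.util.Arrays;

public class ScoreStats {

    public static double mean(double[] scores) {
        return Arrays.stream(scores).average().orElse(0.0);
    }

    // Population standard deviation of the scores.
    public static double stddev(double[] scores) {
        double m = mean(scores);
        double variance = Arrays.stream(scores)
                .map(x -> (x - m) * (x - m))
                .sum() / scores.length;
        return Math.sqrt(variance);
    }

    // [CI95Lower, CI95Upper] via the normal approximation.
    public static double[] ci95(double[] scores) {
        double half = 1.96 * stddev(scores) / Math.sqrt(scores.length);
        double m = mean(scores);
        return new double[] { m - half, m + half };
    }
}
```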

Quality Gates

Quality gates define the minimum acceptable scores for your agent. Use them in CI/CD pipelines to automatically block deployments when agent quality drops below your standards.

QualityGateConfig config = QualityGateConfig.builder()
    .defaultThreshold(0.7)                              // All evaluators must score >= 0.7
    .evaluatorThreshold("response_quality", 0.8)        // Response quality needs 0.8
    .evaluatorThreshold("cost_efficiency", 0.6)         // Cost can be lower
    .maxRegressionPercent(5.0)                           // Max 5% drop from baseline
    .minPassRate(0.9)                                    // 90% of cases must pass
    .build();

QualityGate gate = new QualityGate(config);
QualityGate.Verdict verdict = gate.evaluate(result);

if (verdict.passed()) {
    System.out.println("Quality gate PASSED");
} else {
    verdict.failures().forEach(f -> System.err.println("FAIL: " + f));
    verdict.warnings().forEach(w -> System.out.println("WARN: " + w));
}

Baseline & Regression Detection

Baselines let you save benchmark results and compare future runs against them. This is how you detect regressions -- if a code change causes agent quality to drop more than the allowed percentage, the quality gate fails.

// Save baseline
BaselineStore store = new BaselineStore(Path.of("baselines/"));
store.save(result);

// Load the most recent baseline for this dataset and gate against it.
// When maxRegressionPercent is set in the config, the gate compares
// the new result against the baseline automatically.
Optional<Baseline> latest = store.loadLatest(datasetName);
if (latest.isPresent()) {
    QualityGate gate = new QualityGate(config);
    QualityGate.Verdict verdict = gate.evaluate(result);
}

Trend Analysis

Trend analysis goes beyond single-run comparisons by analyzing quality over many benchmark runs. It uses linear regression to determine whether each metric is improving, degrading, or stable over time.

TrendReport report = TrendAnalyzer.analyze(baselines);
// report.metricTrends() — Per-metric trend data (slope, direction, change)
// MetricTrend.direction() — IMPROVING, DEGRADING, or STABLE
// MetricTrend.slope()     — Linear regression slope (positive = improving)
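The linear-regression slope mentioned above is an ordinary least-squares fit over (run index, score) points. TrendSlope is an illustrative helper, not the analyzer's actual code:

```java
public class TrendSlope {

    // Ordinary least-squares slope over scores indexed by run number
    // (0, 1, 2, ...). A positive slope indicates an improving trend.
    public static double slope(double[] scores) {
        int n = scores.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += i;
            sumY += scores[i];
            sumXY += i * scores[i];
            sumXX += (double) i * i;
        }
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }
}
```

In a real analyzer the slope would typically be compared against a small threshold to classify the trend as IMPROVING, DEGRADING, or STABLE rather than reacting to any nonzero value.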

Reporting

Export benchmark results as JSON for integration with dashboards, CI/CD systems, or custom analysis tools. The report also provides drill-down methods to quickly find your best and worst performing test cases.

EvaluationReport report = new EvaluationReport(result);
report.exportJson(Path.of("reports/benchmark-result.json"));

// Drill-down analysis
report.getWorstCases(5);  // Bottom 5 performing test cases
report.getBestCases(5);   // Top 5 performing test cases

G-Eval (LLM-as-Judge)

G-Eval uses an LLM to judge agent responses, which is more flexible than rule-based evaluators for subjective quality dimensions like clarity, helpfulness, or factual accuracy. It follows the G-Eval paper's two-step process: first generate evaluation criteria via chain-of-thought reasoning, then score the response.

GEvalEvaluator

The GEvalEvaluator is a general-purpose LLM-as-judge evaluator. You provide evaluation criteria in natural language, and it uses the LLM to generate evaluation steps and then score the response.

GEvalEvaluator geval = GEvalEvaluator.builder()
    .llm(llmClient)
    .criteria("Evaluate for factual accuracy, completeness, and clarity")
    .build();

EvaluationResult result = geval.evaluate(input);
// result.score()     -- 0.0-1.0 overall score
// result.reasoning() -- CoT explanation of the score

Two-step process:

  1. CoT Generation: The LLM generates detailed evaluation steps based on the criteria
  2. Scoring: The LLM applies those steps to produce a final score with explanation

Agentic Evaluators

These LLM-based evaluators are designed specifically for agent workflows. They judge whether the agent completed its task, used the right tools, and followed its plan -- things that are difficult to measure with simple text comparison.

// Task completion -- did the agent achieve the stated goal?
Evaluator taskEval = new TaskCompletionEvaluator(llmClient);

// Tool correctness -- did the agent call the right tools with correct arguments?
Evaluator toolEval = new ToolCorrectnessEvaluator(llmClient);

// Plan adherence -- did the agent follow its plan?
Evaluator planEval = new PlanAdherenceEvaluator(llmClient);

Evaluator                  Measures                                       Key Metrics
TaskCompletionEvaluator    Whether the agent completed the stated goal    completion_score, goal_achieved, reasoning
ToolCorrectnessEvaluator   Tool selection accuracy and argument quality   tool_precision, arg_quality, unnecessary_calls
PlanAdherenceEvaluator     How closely the agent followed its plan        adherence_score, skipped_steps, added_steps

Use with the benchmark runner:

BenchmarkRunner runner = BenchmarkRunner.builder()
    .evaluator(new GEvalEvaluator(llmClient, "accuracy and completeness"))
    .evaluator(new TaskCompletionEvaluator(llmClient))
    .evaluator(new ToolCorrectnessEvaluator(llmClient))
    .evaluator(new PlanAdherenceEvaluator(llmClient))
    .agentFunction(testCase -> agent.chat(testCase.getInput()))
    .build();

Self-Improving Evaluation Loop (Auto-Harness)

The auto-harness creates a continuous improvement loop: it runs benchmarks, analyzes failures, automatically generates new test cases targeting weak areas, and re-evaluates. This means your test suite gets smarter over time, discovering edge cases you might not have thought of.

AutoHarnessLoop

The AutoHarnessLoop orchestrates the self-improving cycle. You configure it with a benchmark runner, dataset, and the pipeline components, then call run() to start the improvement loop.

AutoHarnessLoop loop = AutoHarnessLoop.builder()
    .benchmarkRunner(runner)
    .dataset(dataset)
    .failureMiner(new FailureMiner())
    .clusterer(new FailureClusterer())
    .caseGenerator(new EvalCaseGenerator(llmClient))
    .regressionGate(new RegressionGate(baselineStore))
    .maxIterations(5)
    .improvementThreshold(0.02)  // stop if improvement < 2%
    .build();

AutoHarnessResult result = loop.run();
result.getIterationCount();      // How many improvement cycles ran
result.getFinalScore();          // Score after all iterations
result.getGeneratedCases();      // New test cases discovered
result.getRegressionsPassed();   // Whether all regression gates passed

Pipeline Components

The auto-harness pipeline consists of four components that work together. Each can also be used independently.

FailureMiner -- Extracts failing test cases from benchmark results:

FailureMiner miner = new FailureMiner();
List<FailureCase> failures = miner.mine(benchmarkResult, 0.5); // threshold

FailureClusterer -- Groups failures by root cause pattern:

FailureClusterer clusterer = new FailureClusterer();
List<FailureCluster> clusters = clusterer.cluster(failures);
// Each cluster: pattern description, failure count, representative examples

EvalCaseGenerator -- Uses an LLM to generate new test cases targeting discovered failure patterns:

EvalCaseGenerator generator = new EvalCaseGenerator(llmClient);
List<TestCase> newCases = generator.generate(clusters, 10); // 10 cases per cluster

RegressionGate -- Ensures new changes do not regress existing quality:

RegressionGate gate = new RegressionGate(baselineStore);
RegressionGate.Verdict verdict = gate.check(newResult);
if (!verdict.passed()) {
    verdict.regressions().forEach(r ->
        System.err.println("Regression: " + r.evaluator() + " dropped " + r.delta()));
}

Custom Evaluators

If the built-in evaluators do not cover your needs, you can create your own by implementing the Evaluator interface. Return a score between 0.0 and 1.0 along with any metrics you want to track.

public class HallucinationEvaluator implements Evaluator {

    @Override
    public String name() {
        return "hallucination_check";
    }

    @Override
    public EvaluationResult evaluate(EvaluationInput input) {
        double score = checkForHallucinations(input.agentResponse(), input.expectedOutput());
        return EvaluationResult.of(name(), score, Map.of(
            "hallucination_count", countHallucinations(input),
            "factual_accuracy", score
        ));
    }
}

Register with the benchmark runner:

BenchmarkRunner runner = BenchmarkRunner.builder()
    .evaluator(new HallucinationEvaluator())
    .evaluator(new ResponseQualityEvaluator())
    .agentFunction(testCase -> agent.chat(testCase.getInput()))
    .build();
