Advanced Evaluation
This guide covers the specialized evaluator families introduced in 0.3.0: RAG evaluators for retrieval-augmented generation pipelines, multi-turn evaluators for conversation quality, safety evaluators for harmful content detection, and the trace-eval bridge that connects observability with evaluation.
All evaluators in this guide use the LLM-as-judge pattern via the LLMJudge functional interface. See Evaluation overview for the base evaluator framework, benchmark runner, quality gates, and auto-harness.
LLMJudge Interface
Every advanced evaluator takes an LLMJudge instance, which is a @FunctionalInterface that sends a prompt and returns the LLM's text response. This decouples evaluation from any specific LLM client.
@FunctionalInterface
public interface LLMJudge {
String judge(String prompt);
}
// Plug in any LLM client
LLMJudge judge = prompt -> myLlmClient.chat(prompt);
Evaluator SPI
All evaluators implement com.tnsai.evaluation.spi.Evaluator:
public interface Evaluator {
String name();
EvaluationResult evaluate(EvaluationInput context);
}
EvaluationInput is a record carrying the full evaluation context:
| Field | Type | Description |
|---|---|---|
| userInput | String | The user's query |
| agentResponse | String | The agent's response to evaluate |
| expectedOutput | String | Ground-truth expected answer |
| expectedToolSequence | List<String> | Expected tool call order |
| actualToolSequence | List<String> | Actual tool calls made |
| instructions | String | Instructions the agent was given |
| latencyMs | long | Response latency in milliseconds |
| costUsd | double | Cost of the LLM call in USD |
| inputTokens | int | Input token count |
| outputTokens | int | Output token count |
| metadata | Map<String, Object> | Arbitrary metadata (retrieved docs, conversation history, etc.) |
Build inputs with the fluent builder:
Evaluator.EvaluationInput input = Evaluator.EvaluationInput.builder()
.userInput("What causes tides?")
.agentResponse("Tides are caused by gravitational pull of the Moon.")
.expectedOutput("Tides are caused by the gravitational pull of the Moon and Sun.")
.metadata("retrieved_documents", List.of(doc1, doc2))
.build();
EvaluationResult
Every evaluator returns an EvaluationResult record with a normalized score in [0.0, 1.0]:
public record EvaluationResult(
String evaluatorName,
double score,
String details,
Map<String, Double> metrics,
Instant timestamp
) {
// Factory methods
static EvaluationResult of(String name, double score, String details, Map<String, Double> metrics);
static EvaluationResult pass(String name, String details); // score = 1.0
static EvaluationResult fail(String name, String details); // score = 0.0
boolean passed(double threshold);
}
RAG Evaluators
Package: com.tnsai.evaluation.evaluators.rag
RAG evaluators measure retrieval-augmented generation quality across four dimensions: faithfulness, contextual precision, contextual recall, and answer relevancy. All except AnswerRelevancyEvaluator require retrieved_documents in the metadata as a List<String>; AnswerRelevancyEvaluator uses only userInput and agentResponse.
FaithfulnessEvaluator
Measures whether the agent's response is grounded in the retrieved documents. Uses a 2-step LLM-as-judge process:
- Extract factual claims from the response
- Verify each claim against the retrieved context
Score: supported_claims / total_claims (1.0 = fully faithful, 0.0 = fully hallucinated)
var evaluator = new FaithfulnessEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
.agentResponse("Paris is the capital of France and has 2.1 million people.")
.metadata("retrieved_documents", List.of(
"Paris is the capital and most populous city of France.",
"The population of Paris is approximately 2.1 million."
))
.build();
EvaluationResult result = evaluator.evaluate(input);
// result.score() -> 1.0 (both claims supported)
// result.metrics(): supported_claims, total_claims, hallucinated_claims
Metrics returned:
| Metric | Description |
|---|---|
| supported_claims | Number of claims verified against context |
| total_claims | Total factual claims extracted |
| hallucinated_claims | Claims not supported by context |
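The two-step process described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the Judge interface is a local stand-in for LLMJudge so the example compiles on its own, and the EXTRACT_CLAIMS / VERIFY_CLAIM prompt formats, the one-claim-per-line reply, the YES/NO reply, and the score of 1.0 for a response with no claims are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FaithfulnessSketch {
    /** Local stand-in for the library's LLMJudge (assumption: same single-method shape). */
    @FunctionalInterface
    interface Judge {
        String judge(String prompt);
    }

    /** Step 1: ask the judge for factual claims, assumed to come back one per line. */
    static List<String> extractClaims(Judge judge, String response) {
        String reply = judge.judge("EXTRACT_CLAIMS\n" + response);
        List<String> claims = new ArrayList<>();
        for (String line : reply.split("\n")) {
            if (!line.isBlank()) {
                claims.add(line.strip());
            }
        }
        return claims;
    }

    /** Step 2: ask the judge whether one claim is supported; the reply is assumed to start with YES or NO. */
    static boolean isSupported(Judge judge, String claim, List<String> documents) {
        String prompt = "VERIFY_CLAIM\nClaim: " + claim + "\nContext:\n" + String.join("\n", documents);
        return judge.judge(prompt).strip().toUpperCase(Locale.ROOT).startsWith("YES");
    }

    /** Score = supported_claims / total_claims. */
    static double score(Judge judge, String response, List<String> documents) {
        List<String> claims = extractClaims(judge, response);
        if (claims.isEmpty()) {
            return 1.0; // assumption: nothing to verify counts as faithful
        }
        int supported = 0;
        for (String claim : claims) {
            if (isSupported(judge, claim, documents)) {
                supported++;
            }
        }
        return (double) supported / claims.size();
    }
}
```

With a stub judge that returns two claims and confirms only one of them, score(...) returns 0.5, which corresponds to supported_claims = 1, total_claims = 2, and hallucinated_claims = 1.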
ContextualPrecisionEvaluator
Measures whether the retrieved documents are relevant to the query. Uses weighted precision -- irrelevant documents ranked higher are penalized more heavily.
For each document, the LLM judges relevance (YES/NO). The score uses the formula: sum of precision@k for each relevant document at position k, divided by total relevant count.
Score: Weighted precision (1.0 = all relevant docs ranked first, 0.0 = no relevant docs)
var evaluator = new ContextualPrecisionEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
.userInput("What causes tides?")
.expectedOutput("Gravitational pull of the Moon and Sun causes tides.")
.metadata("retrieved_documents", List.of(
"Tides are caused by gravitational forces of the Moon and Sun.",
"The Pacific Ocean is the largest ocean on Earth.",
"Spring tides occur when the Moon and Sun are aligned."
))
.build();
EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): relevant_docs, total_docs, naive_precision
Metrics returned:
| Metric | Description |
|---|---|
| relevant_docs | Number of documents judged relevant |
| total_docs | Total documents evaluated |
| naive_precision | Simple relevant / total ratio (without ranking weight) |
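As a worked illustration of the weighted-precision formula, the sketch below computes the weighted score and naive_precision from a list of per-document relevance verdicts in rank order. The class and method names are hypothetical, and returning 0.0 when no document is relevant is inferred from the score description rather than confirmed against the library.

```java
import java.util.List;

final class WeightedPrecisionSketch {
    /**
     * Sums precision@k over every rank k that holds a relevant document,
     * then divides by the number of relevant documents.
     */
    static double weightedPrecision(List<Boolean> relevantByRank) {
        int relevantSoFar = 0;
        double sum = 0.0;
        for (int i = 0; i < relevantByRank.size(); i++) {
            if (relevantByRank.get(i)) {
                relevantSoFar++;
                sum += (double) relevantSoFar / (i + 1); // precision@k with k = i + 1
            }
        }
        return relevantSoFar == 0 ? 0.0 : sum / relevantSoFar;
    }

    /** Plain relevant / total ratio, ignoring rank. */
    static double naivePrecision(List<Boolean> relevantByRank) {
        if (relevantByRank.isEmpty()) {
            return 0.0; // assumption: no documents means no precision
        }
        long relevant = relevantByRank.stream().filter(r -> r).count();
        return (double) relevant / relevantByRank.size();
    }
}
```

If the judge marks the first and third of three documents as relevant, the verdicts are [true, false, true]: precision@1 = 1/1 and precision@3 = 2/3, so the weighted score is (1 + 2/3) / 2 ≈ 0.83, while naive_precision is 2/3 ≈ 0.67. Moving the irrelevant document to the last rank, [true, true, false], raises the weighted score to 1.0 and leaves naive_precision unchanged.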
ContextualRecallEvaluator
Measures whether all relevant information needed for the expected answer was actually retrieved. Extracts key facts from the expected output and checks how many are attributable to the retrieved documents.
Score: attributed_facts / total_facts (1.0 = all facts covered, 0.0 = none covered)
Requires: Both retrieved_documents in metadata and a non-empty expectedOutput.
var evaluator = new ContextualRecallEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
.expectedOutput("Tides are caused by the Moon's gravity. Spring tides happen during full and new moons.")
.metadata("retrieved_documents", List.of(
"The Moon's gravitational pull is the primary cause of ocean tides."
// Missing: spring tide information -> recall will be less than 1.0
))
.build();
EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): attributed_facts, total_facts, missing_facts
Metrics returned:
| Metric | Description |
|---|---|
| attributed_facts | Facts from expected output found in retrieved docs |
| total_facts | Total key facts extracted from expected output |
| missing_facts | Facts not covered by any retrieved document |
AnswerRelevancyEvaluator
Measures whether the agent's response actually addresses the user's query. Scores on three normalized dimensions:
- Directness: Does the response directly answer the question?
- Completeness: Does it cover all aspects of the query?
- Focus: Does it avoid irrelevant tangents?
Each dimension is scored 1-5 by the LLM, then normalized to [0.0, 1.0] and averaged.
Score: Average of normalized directness, completeness, and focus.
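The normalize-then-average step can be sketched as below. The text does not say which mapping takes a 1-5 rating to [0.0, 1.0]; this sketch assumes min-max scaling, (raw - 1) / 4, so that 1 maps to 0.0 and 5 maps to 1.0. The library may use a different mapping, and the class and method names here are hypothetical.

```java
import java.util.Locale;

final class RelevancyScoreSketch {
    /** Maps a 1-5 rating onto [0.0, 1.0] (assumed min-max scaling). */
    static double normalize(int raw) {
        if (raw < 1 || raw > 5) {
            throw new IllegalArgumentException("expected a rating from 1 to 5, got " + raw);
        }
        return (raw - 1) / 4.0;
    }

    /** Averages the three normalized dimensions. */
    static double score(int directness, int completeness, int focus) {
        return (normalize(directness) + normalize(completeness) + normalize(focus)) / 3.0;
    }

    /** Formats a details string in the same shape as the one shown in this guide. */
    static String details(int directness, int completeness, int focus) {
        return String.format(Locale.ROOT,
                "directness=%d/5 completeness=%d/5 focus=%d/5 score=%.2f",
                directness, completeness, focus, score(directness, completeness, focus));
    }
}
```

Under this assumption, ratings of 5/5/5 give 1.0, 3/3/3 give 0.5, and 1/1/1 give 0.0.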
var evaluator = new AnswerRelevancyEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
.userInput("What is the boiling point of water?")
.agentResponse("Water boils at 100 degrees Celsius at standard atmospheric pressure.")
.build();
EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): directness, completeness, focus
// result.details() -> "directness=5/5 completeness=5/5 focus=5/5 score=1.00"
Using RAG Evaluators Together
For a comprehensive RAG evaluation, combine all four evaluators:
LLMJudge judge = prompt -> llmClient.chat(prompt);
var evaluators = List.of(
new FaithfulnessEvaluator(judge),
new ContextualPrecisionEvaluator(judge),
new ContextualRecallEvaluator(judge),
new AnswerRelevancyEvaluator(judge)
);
BenchmarkRunner runner = BenchmarkRunner.builder()
.evaluators(evaluators)
.agentFunction(testCase -> ragAgent.query(testCase.getInput()))
.build();
Multi-Turn Evaluators
Package: com.tnsai.evaluation.evaluators.multiturn
Multi-turn evaluators assess conversation quality across multiple exchanges. All require conversation_history in metadata as a List<Map<String, String>> with "role" and "content" keys.
KnowledgeRetentionEvaluator
Measures whether the agent retains information from earlier conversation turns. Uses a 2-step process:
- Extract key facts established in earlier turns
- Check if the agent recalls those facts in later turns
Score: retained_facts / total_facts (1.0 = perfect retention, 0.0 = no retention)
Requires: At least 2 turns in conversation_history.
var evaluator = new KnowledgeRetentionEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
.metadata("conversation_history", List.of(
Map.of("role", "user", "content", "My name is Alice and I work at Acme Corp."),
Map.of("role", "assistant", "content", "Nice to meet you, Alice! How can I help?"),
Map.of("role", "user", "content", "Can you summarize what you know about me?"),
Map.of("role", "assistant", "content", "You're Alice and you work at Acme Corp.")
))
.build();
EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): retained_facts, total_facts, forgotten_facts
ConversationCompletenessEvaluator
Measures whether a multi-turn conversation achieved its stated goal. Uses a 1-5 scale:
| Score | Meaning |
|---|---|
| 1 | Goal not addressed at all |
| 2 | Goal partially acknowledged but not resolved |
| 3 | Goal partially resolved |
| 4 | Goal mostly resolved with minor gaps |
| 5 | Goal fully achieved |
Score: Normalized to [0.0, 1.0] from the raw 1-5 scale.
Requires: Both conversation_history and conversation_goal (a String) in metadata.
var evaluator = new ConversationCompletenessEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
.metadata("conversation_goal", "Help the user book a flight to Paris")
.metadata("conversation_history", List.of(
Map.of("role", "user", "content", "I need to fly to Paris next week"),
Map.of("role", "assistant", "content", "I found flights on Tuesday and Thursday. Which do you prefer?"),
Map.of("role", "user", "content", "Tuesday please"),
Map.of("role", "assistant", "content", "Booked! Your flight departs Tuesday at 10am. Confirmation: ABC123.")
))
.build();
EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): raw_score, normalized_score
// result.details() -> "completeness=5/5 score=1.00"
TurnRelevancyEvaluator
Measures whether the last assistant turn is relevant to the preceding conversation context. Scores on three dimensions:
- Context alignment: Does the response align with the conversation so far?
- Query addressing: Does it address the most recent user message?
- Coherence: Is it logically consistent with prior turns?
Each dimension is scored 1-5, normalized and averaged.
Requires: At least 2 turns with at least one assistant turn.
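A precondition check of this kind could look like the sketch below, which validates a conversation_history value and locates the last assistant turn. It is an illustration only: the class and method names are hypothetical, and how the library itself reports an invalid history is not covered here.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

final class HistoryCheckSketch {
    /**
     * Returns the index of the last assistant turn, or empty when the history
     * has fewer than 2 turns or contains no assistant turn at all.
     */
    static OptionalInt lastAssistantTurn(List<Map<String, String>> history) {
        if (history == null || history.size() < 2) {
            return OptionalInt.empty();
        }
        for (int i = history.size() - 1; i >= 0; i--) {
            if ("assistant".equals(history.get(i).get("role"))) {
                return OptionalInt.of(i);
            }
        }
        return OptionalInt.empty();
    }
}
```

Everything before the returned index is the preceding context, and the turn at that index is the response being judged.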
var evaluator = new TurnRelevancyEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
.metadata("conversation_history", List.of(
Map.of("role", "user", "content", "Tell me about quantum computing"),
Map.of("role", "assistant", "content", "Quantum computing uses qubits..."),
Map.of("role", "user", "content", "How does that compare to classical computing?"),
Map.of("role", "assistant", "content", "Unlike classical bits that are 0 or 1, qubits can be in superposition...")
))
.build();
EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): context_alignment, query_addressing, coherence
// result.details() -> "context_alignment=5/5 query_addressing=5/5 coherence=5/5 score=1.00"
Safety Evaluators
Package: com.tnsai.evaluation.evaluators.safety
Safety evaluators detect harmful content in agent responses. All use inverted scoring: 1.0 = safe, 0.0 = unsafe.
ToxicityEvaluator
Detects toxic, harmful, or offensive content across five categories:
| Category | What it detects |
|---|---|
| INSULT | Demeaning or belittling language |
| THREAT | Threatening or intimidating language |
| OBSCENITY | Profane or vulgar language |
| HATE_SPEECH | Language targeting protected groups |
| HARASSMENT | Persistent targeting or bullying |
The LLM scores overall toxicity from 0 (none) to 5 (severe), and classifies each category as PRESENT or ABSENT. The evaluator understands context -- quoting toxic content to explain it is treated differently from generating it.
Score: Inverted: 1.0 - (toxicity_score / 5.0) (1.0 = clean, 0.0 = severely toxic)
var evaluator = new ToxicityEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
.agentResponse("Here is the information you requested about climate change.")
.build();
EvaluationResult result = evaluator.evaluate(input);
// result.score() -> ~1.0 (clean)
// result.metrics(): raw_toxicity_score, insult, threat, obscenity, hate_speech, harassment
Metrics returned: raw_toxicity_score (0-5) plus per-category flags (1.0 = present, 0.0 = absent).
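The inverted formula is small enough to check by hand. The sketch below applies 1.0 - (raw / 5.0) to a raw 0-5 severity; the class and method names are hypothetical, and rejecting values outside 0-5 is an assumption about input handling, not documented behavior.

```java
final class InvertedScoreSketch {
    /** Converts a raw 0-5 severity (0 = none, 5 = severe) into a safety score in [0.0, 1.0]. */
    static double invertedScore(int rawSeverity) {
        if (rawSeverity < 0 || rawSeverity > 5) {
            throw new IllegalArgumentException("expected a severity from 0 to 5, got " + rawSeverity);
        }
        return 1.0 - (rawSeverity / 5.0);
    }
}
```

A raw severity of 0 gives 1.0, 1 gives 0.8, and 5 gives 0.0, so a safety threshold of 0.8 leaves room for a raw severity of at most 1.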
BiasEvaluator
Detects demographic, cultural, or ideological bias across six categories:
| Category | What it detects |
|---|---|
| GENDER | Stereotypes or assumptions based on gender |
| RACE | Racial or ethnic stereotypes |
| AGE | Age-based stereotypes or assumptions |
| RELIGION | Religious bias or assumptions |
| POLITICAL | Political ideology presented as fact |
| SOCIOECONOMIC | Class-based assumptions or stereotypes |
Score: Inverted: 1.0 - (bias_score / 5.0) (1.0 = no bias, 0.0 = severely biased)
The evaluator also considers the user's query for context -- a biased response to a question about bias may be appropriate.
var evaluator = new BiasEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
.userInput("What are common career paths?")
.agentResponse("Common career paths include engineering, medicine, law, and education.")
.build();
EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): raw_bias_score, gender_bias, race_bias, age_bias, religion_bias, political_bias, socioeconomic_bias
HallucinationEvaluator
Detects hallucinated content by checking factual claims against provided context. Unlike FaithfulnessEvaluator (which is RAG-specific), this evaluator works with any context source and classifies claims into three categories:
| Classification | Meaning |
|---|---|
| SUPPORTED | Claim is backed by the provided context |
| CONTRADICTED | Claim conflicts with the provided context |
| FABRICATED | Claim has no basis in the context at all |
Context sources (checked in order): metadata.get("context") as String or List<String>, then metadata.get("retrieved_documents"). If no context is provided, the evaluator checks for internal contradictions and invented references.
Score: supported / total (1.0 = no hallucination, 0.0 = fully hallucinated). The ratio already counts supported claims, so higher means safer without a separate inversion step.
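To make the three-way classification concrete, the sketch below tallies per-claim verdicts and derives the score. The Verdict enum mirrors the classifications in the table above; the class name, method names, and the score of 1.0 for an empty claim list are assumptions for illustration.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class HallucinationTallySketch {
    enum Verdict { SUPPORTED, CONTRADICTED, FABRICATED }

    /** Counts how many claims fall into each classification (zero-filled). */
    static Map<Verdict, Integer> tally(List<Verdict> verdicts) {
        Map<Verdict, Integer> counts = new EnumMap<>(Verdict.class);
        for (Verdict v : Verdict.values()) {
            counts.put(v, 0);
        }
        for (Verdict v : verdicts) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    /** Score = supported / total; contradicted and fabricated claims both lower it. */
    static double score(List<Verdict> verdicts) {
        if (verdicts.isEmpty()) {
            return 1.0; // assumption: no claims means nothing was hallucinated
        }
        return (double) tally(verdicts).get(Verdict.SUPPORTED) / verdicts.size();
    }
}
```

For a response with one claim judged SUPPORTED and one judged CONTRADICTED, the tally is supported = 1, contradicted = 1, fabricated = 0, and the score is 0.5.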
var evaluator = new HallucinationEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
.agentResponse("The product costs $99 and ships in 2 days.")
.metadata("context", "Product price: $99. Shipping: 5-7 business days.")
.build();
EvaluationResult result = evaluator.evaluate(input);
// "2 days" contradicts "5-7 business days" -> score < 1.0
// result.metrics(): supported, contradicted, fabricated, total_claims
Without context (internal consistency check):
var input = Evaluator.EvaluationInput.builder()
.agentResponse("The study by Smith et al. (2024) in Nature found that...")
.build();
// Checks for invented citations, self-contradictions, fabricated claims
Combining Safety Evaluators
Run all safety evaluators as a guard rail in production:
var safetyEvaluators = List.of(
new ToxicityEvaluator(judge),
new BiasEvaluator(judge),
new HallucinationEvaluator(judge)
);
double safetyThreshold = 0.8;
for (Evaluator eval : safetyEvaluators) {
EvaluationResult result = eval.evaluate(input);
if (!result.passed(safetyThreshold)) {
log.warn("Safety check failed: {} scored {}", eval.name(), result.score());
}
}
Trace-Eval Bridge
Package: com.tnsai.evaluation.bridge
The trace-eval bridge connects the observability layer (TnsAI.Quality traces) with the evaluation layer. It adapts completed AgentTrace spans into EvaluationInput records, runs evaluators, annotates the trace with scores, and reports failures.
Architecture
AgentTrace ──> TraceToEvalAdapter ──> EvaluationInput
                                           │
                                      Evaluator[] ──> EvaluationResult[]
                                           │
                                      EvalSpanAnnotator ──> Score on trace
                                           │
                                      onFailure callback ──> AutoHarness / alerting
TraceEvalBridge
The main entry point. Orchestrates the full pipeline: adapt, evaluate, annotate, report.
public final class TraceEvalBridge {
// Process a completed trace through all evaluators
public List<EvaluationResult> process(AgentTrace trace);
// Builder pattern
public static Builder builder();
// Failed evaluation record for downstream processing
public record FailedEvaluation(
String traceId,
String agentId,
String evaluatorName,
double score,
String details,
Evaluator.EvaluationInput input
) {}
}
Builder API:
| Method | Description |
|---|---|
| evaluator(Evaluator) | Add an evaluator to the pipeline |
| evaluators(List<Evaluator>) | Add multiple evaluators |
| failureThreshold(double) | Score below this triggers the failure callback (default: 0.5) |
| onFailure(Consumer<FailedEvaluation>) | Callback for scores below the threshold |
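The interaction between failureThreshold and onFailure can be sketched as follows. This is a simplified illustration, not the bridge's implementation: NamedEvaluator and Failure are hypothetical stand-ins for Evaluator and FailedEvaluation, the input is reduced to a plain string, and the annotation step is omitted. The strict "below" comparison follows the table's wording.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.ToDoubleFunction;

final class BridgeThresholdSketch {
    /** Hypothetical stand-in for an evaluator: a name plus a scoring function. */
    record NamedEvaluator(String name, ToDoubleFunction<String> scorer) {}

    /** Hypothetical, trimmed-down counterpart of FailedEvaluation. */
    record Failure(String evaluatorName, double score) {}

    /**
     * Runs every evaluator, collects the scores in order, and invokes the
     * callback once for each score strictly below the threshold.
     */
    static List<Double> process(String input,
                                List<NamedEvaluator> evaluators,
                                double failureThreshold,
                                Consumer<Failure> onFailure) {
        List<Double> scores = new ArrayList<>();
        for (NamedEvaluator evaluator : evaluators) {
            double score = evaluator.scorer().applyAsDouble(input);
            scores.add(score);
            if (score < failureThreshold) {
                onFailure.accept(new Failure(evaluator.name(), score));
            }
        }
        return scores;
    }
}
```

In this sketch a score exactly equal to the threshold does not trigger the callback, and every evaluator still runs after an earlier one fails, so the returned list always has one score per evaluator.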
Usage:
var bridge = TraceEvalBridge.builder()
.evaluator(new FaithfulnessEvaluator(judge))
.evaluator(new ToxicityEvaluator(judge))
.evaluator(new HallucinationEvaluator(judge))
.failureThreshold(0.5)
.onFailure(failure -> {
log.warn("Low score on trace {}: {} = {}",
failure.traceId(), failure.evaluatorName(), failure.score());
alertingService.notify(failure);
})
.build();
// Process a completed trace
List<EvaluationResult> results = bridge.process(completedTrace);
TraceToEvalAdapter
Converts an AgentTrace into an EvaluationInput by extracting the last user message, assistant response, tool call sequences, and latency from trace observations.
public final class TraceToEvalAdapter {
public Evaluator.EvaluationInput adapt(AgentTrace trace);
}
Extraction logic:
- User input: Extracted from GENERATION observation input
- Agent response: Extracted from GENERATION observation output
- Tool sequence: Collected from SPAN observation names
- Latency: Computed from GENERATION observation start/end times
- Metadata: Includes trace_id, agent_id, session_id, plus all trace metadata
Returns null if the trace has no chat observations.
EvalSpanAnnotator
Writes evaluation scores back onto the AgentTrace as Score objects for observability dashboards.
public final class EvalSpanAnnotator {
public void annotate(AgentTrace trace, List<EvaluationResult> results);
}
Each evaluation result is written as a numeric score with the key eval.<evaluatorName> and source ScoreSource.HEURISTIC:
// Internally calls:
trace.addScore(Score.numeric("eval.faithfulness", 0.95, ScoreSource.HEURISTIC));
trace.addScore(Score.numeric("eval.toxicity", 1.0, ScoreSource.HEURISTIC));
Production Pipeline Example
Wire the bridge into your agent's trace completion hook for continuous evaluation:
// Set up once
LLMJudge judge = prompt -> evaluationLlm.chat(prompt);
var bridge = TraceEvalBridge.builder()
.evaluator(new FaithfulnessEvaluator(judge))
.evaluator(new ContextualRecallEvaluator(judge))
.evaluator(new ToxicityEvaluator(judge))
.evaluator(new BiasEvaluator(judge))
.evaluator(new HallucinationEvaluator(judge))
.failureThreshold(0.6)
.onFailure(failure -> autoHarness.recordFailure(failure))
.build();
// On every completed trace
agent.setTraceCompletionHook(trace -> {
List<EvaluationResult> results = bridge.process(trace);
// Scores are now on the trace for dashboards
// Failures trigger auto-harness test generation
});
Evaluator Summary
| Evaluator | Package | Score Meaning | Required Metadata |
|---|---|---|---|
| FaithfulnessEvaluator | rag | 1.0 = grounded | retrieved_documents |
| ContextualPrecisionEvaluator | rag | 1.0 = relevant docs ranked high | retrieved_documents |
| ContextualRecallEvaluator | rag | 1.0 = all facts retrieved | retrieved_documents + expectedOutput |
| AnswerRelevancyEvaluator | rag | 1.0 = directly addresses query | (none, uses userInput + agentResponse) |
| KnowledgeRetentionEvaluator | multiturn | 1.0 = perfect recall | conversation_history |
| ConversationCompletenessEvaluator | multiturn | 1.0 = goal achieved | conversation_history + conversation_goal |
| TurnRelevancyEvaluator | multiturn | 1.0 = perfectly relevant | conversation_history |
| ToxicityEvaluator | safety | 1.0 = clean | (none, uses agentResponse) |
| BiasEvaluator | safety | 1.0 = no bias | (none, uses agentResponse) |
| HallucinationEvaluator | safety | 1.0 = no hallucination | context or retrieved_documents (optional) |
Advanced: Evaluation Hooks
The evaluation hook system provides lifecycle callbacks during agent execution for metric collection without modifying agent code. The contracts live in tnsai-core (com.tnsai.eval.hooks); the implementation lives in tnsai-quality.
EvalHook Interface
EvalHook defines callback methods invoked at key points during agent execution. All methods have default no-op implementations, so you only override what you need.
public interface EvalHook {
// Agent lifecycle
default void onAgentStart(EvalContext ctx, String agentId, String sessionId) {}
default void onAgentStop(EvalContext ctx, String reason) {}
default void onError(EvalContext ctx, Throwable error, String phase) {}
// Chat lifecycle
default void onBeforeChat(EvalContext ctx, String message) {}
default void onAfterChat(EvalContext ctx, String response, long latencyMs) {}
// Tool lifecycle
default void onBeforeToolCall(EvalContext ctx, String toolName, Map<String, Object> arguments) {}
default void onAfterToolCall(EvalContext ctx, String toolName, Object result,
boolean success, long latencyMs) {}
// Goal tracking
default void onGoalCompleted(EvalContext ctx, String goalId, boolean success,
Map<String, Object> details) {}
// Memory access
default void onMemoryAccess(EvalContext ctx, String operation, String key,
int resultCount, long latencyMs) {}
// Inter-agent communication
default void onAgentCommunication(EvalContext ctx, String fromAgent, String toAgent,
String messageType, long latencyMs) {}
// Planning events
default void onPlanGenerated(EvalContext ctx, String goalId,
List<PlanStep> steps, long latencyMs) {}
default void onPlanStepExecuted(EvalContext ctx, String actionName,
boolean success, long latencyMs) {}
default void onPlanCompleted(EvalContext ctx, boolean success,
int totalSteps, int executedSteps, long totalLatencyMs) {}
default void onPlanFailed(EvalContext ctx, String goalId, String reason) {}
}
Lifecycle flow:
onAgentStart()
|
onBeforeChat() ----+
| | (loop)
onBeforeToolCall() |
| |
onAfterToolCall() |
| |
onAfterChat() <----+
|
onPlanGenerated()
|
onPlanStepExecuted() --+
| | (loop)
onPlanCompleted() <----+
|
onGoalCompleted()
|
onAgentStop()
EvalHookManager
EvalHookManager (com.tnsai.eval.hooks in tnsai-quality) is the concrete implementation of EvalHandle. It maintains a CopyOnWriteArrayList of hooks and dispatches events to all registered hooks. Errors in one hook do not affect others.
EvalHookManager manager = new EvalHookManager();
manager.addHook(new LatencyHook());
manager.addHook(new QualityHook());
EvalContext ctx = EvalContext.create("session-1", "agent-1");
manager.fireOnAgentStart(ctx, "agent-1", "session-1");
manager.fireOnBeforeChat(ctx, "Hello");
// ... agent execution ...
manager.fireOnAfterChat(ctx, "Hi there!", 150);
manager.fireOnAgentStop(ctx, "COMPLETED");
Key methods:
| Method | Description |
|---|---|
| addHook(EvalHook) | Register a hook |
| removeHook(EvalHook) | Remove a hook, returns true if found |
| clearHooks() | Remove all hooks |
| hookCount() | Number of registered hooks |
| setEnabled(boolean) | Enable/disable all hook execution |
| isEnabled() | Check if hooks are enabled |
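The dispatch behavior described for EvalHookManager (a CopyOnWriteArrayList of hooks, with one hook's error not affecting the others) follows a common pattern, sketched below. This is not the library's code: hooks are reduced to Consumer<String> for brevity, the class name is hypothetical, and catching RuntimeException only and returning a failure count are choices made for this sketch.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

final class HookDispatchSketch {
    // Iteration works on a snapshot, so hooks may be added or removed while an event is firing.
    private final List<Consumer<String>> hooks = new CopyOnWriteArrayList<>();
    private volatile boolean enabled = true;

    void addHook(Consumer<String> hook) {
        hooks.add(hook);
    }

    boolean removeHook(Consumer<String> hook) {
        return hooks.remove(hook);
    }

    int hookCount() {
        return hooks.size();
    }

    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    /** Delivers the event to every hook and returns how many hooks threw. */
    int fire(String event) {
        if (!enabled) {
            return 0;
        }
        int failures = 0;
        for (Consumer<String> hook : hooks) {
            try {
                hook.accept(event);
            } catch (RuntimeException e) {
                failures++; // isolate the failure so the remaining hooks still run
            }
        }
        return failures;
    }
}
```

CopyOnWriteArrayList suits this use because hook lists are typically read on every event and modified rarely; each write copies the backing array, while reads need no locking.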
EvalHandle SPI
EvalHandle is the SPI interface in tnsai-core. The Agent class uses EvalHandle without depending on tnsai-quality directly. When tnsai-quality is on the classpath, DefaultEvalHandleFactory is discovered via ServiceLoader and returns an EvalHookManager. When absent, EvalHandle.NOOP silently ignores all operations.
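For readers unfamiliar with ServiceLoader-based discovery, the general pattern looks like the sketch below. Handle and Factory are hypothetical stand-ins for EvalHandle and EvalHandle.Factory, and the body of discover() is an assumption about how such a method could be written, not the library's source.

```java
import java.util.ServiceLoader;

final class DiscoverySketch {
    /** Hypothetical stand-in for EvalHandle, with a shared no-op instance. */
    interface Handle {
        void onEvent(String name);

        Handle NOOP = name -> { };
    }

    /** Hypothetical stand-in for EvalHandle.Factory. */
    interface Factory {
        Handle create();

        /** Returns the first registered provider, or null when none is on the classpath. */
        static Factory discover() {
            return ServiceLoader.load(Factory.class).findFirst().orElse(null);
        }
    }

    /** Falls back to the no-op handle when no provider was found. */
    static Handle resolve() {
        Factory factory = Factory.discover();
        return (factory != null) ? factory.create() : Handle.NOOP;
    }
}
```

On the classpath, a provider is registered by listing its implementation class in a META-INF/services file named after the fully qualified service interface. With no such file present, resolve() returns the no-op handle, which is what keeps the core module usable without the optional one.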
// Factory discovery (handled internally by Agent)
EvalHandle.Factory factory = EvalHandle.Factory.discover();
EvalHandle handle = (factory != null) ? factory.create() : EvalHandle.NOOP;
EvalContext
Thread-safe container for collecting evaluation metrics during agent execution. Supports numeric metrics with statistical aggregation, counters, metadata, and real-time event streaming.
EvalContext ctx = EvalContext.create("session-123", "research-agent");
// Record metrics
ctx.recordMetric("latency", 150);
ctx.recordMetric("accuracy", 0.95);
ctx.recordMetric("latency", 200, Map.of("phase", "toolcall"));
ctx.incrementCounter("tool_calls");
ctx.incrementCounter("tokens", 1500);
// Metadata
ctx.setSpec("model", "claude-sonnet-4-20250514");
ctx.getSpec("model"); // -> "claude-sonnet-4-20250514"
// Statistics
EvalContext.MetricStats stats = ctx.getStats("latency");
// stats.count(), stats.min(), stats.max(), stats.average()
// stats.p50(), stats.p90(), stats.p99()
// Real-time streaming
ctx.addListener(event ->
System.out.println(event.name() + " = " + event.value()));
// Export
Map<String, Object> report = ctx.toMap();
ctx.complete("SUCCESS");
MetricStats is a record with fields: name, count, min, max, sum, average, p50, p90, p99.
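This guide does not say how p50, p90, and p99 are computed. One common definition is the nearest-rank percentile, sketched below as an assumption; the library may interpolate or use another method, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class PercentileSketch {
    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static double percentile(double[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no samples recorded");
        }
        if (p < 0 || p > 100) {
            throw new IllegalArgumentException("percentile must be between 0 and 100");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With the two latency samples recorded in the example above (150 and 200), this definition gives a p50 of 150 and a p99 of 200. Percentiles over so few samples say little; they become informative once a metric has many recordings.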