Observability
Quick Start
TnsAI provides structured logging out of the box. Create a logger for your agent and attach key-value metadata to every log entry for easy filtering and searching in your logging backend.
// Structured logging
AgentLogger logger = AgentLogger.forAgent("agent-001");
logger.info("Processing request")
.with("userId", userId)
.with("action", "search")
.log();
OpenTelemetry Tracing
Distributed tracing with automatic span management:
TnsAITelemetry telemetry = TnsAITelemetry.builder()
.serviceName("my-agent-service")
.serviceVersion("1.0.0")
.otlpEndpoint("http://localhost:4317")
.build();
OpenTelemetryTracer tracer = new OpenTelemetryTracer(telemetry.getTracer());
try (SpanScope scope = tracer.startSpan("process-request")) {
scope.setAttribute("user.id", userId);
scope.setAttribute("request.type", "search");
// ... work happens here ...
scope.setSuccess();
} // Span automatically closed
// Convenience wrapper
String result = tracer.trace("fetch-data", () -> fetchData(query));
Prometheus Metrics
TnsAI provides pre-defined counters and histograms that track agent actions, LLM calls, token usage, and latency. These integrate with OpenTelemetry and can be exported to any compatible metrics backend.
OpenTelemetryMetrics metrics = new OpenTelemetryMetrics(meter);
// Record agent actions
metrics.recordActionSuccess("search", 150); // type, duration ms
metrics.recordActionFailure("write", 50, "timeout");
// Record LLM calls
metrics.recordLlmCall("openai", "gpt-4o", 1000, 500, 2300); // provider, model, prompt tokens, completion tokens, latency ms
metrics.recordLlmCallFailure("anthropic", "claude-3", 1500, "rate_limit");
// Gauges
metrics.setActiveAgents(5);
Available metrics:
| Metric | Type | Description |
|---|---|---|
| tnsai.agent.actions.total | Counter | Total actions executed |
| tnsai.agent.actions.success | Counter | Successful actions |
| tnsai.agent.actions.failed | Counter | Failed actions |
| tnsai.agent.action.duration | Histogram | Action execution time (ms) |
| tnsai.llm.calls.total | Counter | Total LLM calls |
| tnsai.llm.tokens.prompt | Counter | Input tokens consumed |
| tnsai.llm.tokens.completion | Counter | Output tokens generated |
| tnsai.llm.latency | Histogram | LLM response time (ms) |
| tnsai.agent.active | Gauge | Currently active agents |
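As a rough mental model of how these instruments differ, counters only ever accumulate, while histograms keep duration samples for percentile math. The sketch below is hypothetical plain Java for illustration only; the real implementation delegates to OpenTelemetry instruments.

```java
import java.util.*;

// Hypothetical sketch of what the counters and histograms above track;
// not the library's implementation.
class MiniMetrics {
    final Map<String, Long> counters = new HashMap<>();
    final Map<String, List<Long>> histograms = new HashMap<>();

    void increment(String name) {
        counters.merge(name, 1L, Long::sum);
    }

    void record(String name, long valueMs) {
        histograms.computeIfAbsent(name, k -> new ArrayList<>()).add(valueMs);
    }

    // Mirrors recordActionSuccess(type, durationMs): bump the totals
    // and record the duration sample.
    void recordActionSuccess(String type, long durationMs) {
        increment("tnsai.agent.actions.total");
        increment("tnsai.agent.actions.success");
        record("tnsai.agent.action.duration", durationMs);
    }
}
```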
Structured Logging
The AgentLogger provides a fluent API for emitting structured log entries with arbitrary key-value metadata. It also includes convenience methods for common agent events like action starts, completions, and LLM calls.
AgentLogger logger = AgentLogger.forAgent("agent-001");
// Fluent API
logger.info("Tool executed")
.with("tool", "brave_search")
.with("duration", 350)
.log();
// Convenience methods
logger.logActionStart("searchPapers", params);
logger.logActionComplete("searchPapers", 1200, true);
logger.logLLMCall("gpt-4o", 500, 2100);
// Timed execution
String result = logger.timed("database-query", () -> db.query(sql));
Themed Logging
Themed loggers wrap SLF4J and add fun, themed prefixes to log messages based on the event type. Seven built-in themes, 24 event types, and custom theme support.
// Create with theme
ThemedLogger logger = ThemedLogger.create("my-agent", LogThemes.STAR_WARS);
// Log events — each gets a themed prefix
logger.actionStart("Searching for documents");
// Output: [LIGHTSABER ON] Searching for documents
logger.actionComplete("Found 5 documents");
// Output: [THE FORCE SUCCEEDS] Found 5 documents
logger.error("Connection failed");
// Output: [SITH LORD DETECTED] Connection failed
// Startup/shutdown banners
logger.logStartup();
// Output: A long time ago in a galaxy far, far away... TnsAI awakens.
Factory methods:
// With explicit theme
ThemedLogger logger = ThemedLogger.create("agent-001", LogThemes.LOTR);
// Default theme (no prefix)
ThemedLogger logger = ThemedLogger.create("agent-001");
// From class name
ThemedLogger logger = ThemedLogger.forClass(MyAgent.class, LogThemes.MATRIX);
Available themes:
| Theme | Name | Example prefix |
|---|---|---|
| LogThemes.DEFAULT | "default" | (no prefix) |
| LogThemes.STAR_WARS | "starwars" | [LIGHTSABER ON], [CONSULTING YODA] |
| LogThemes.LOTR | "lotr" | [QUEST BEGINS], [CONSULTING GANDALF] |
| LogThemes.MATRIX | "matrix" | [ENTERING THE MATRIX], [CALLING ORACLE] |
| LogThemes.PIRATE | "pirate" | [SETTING SAIL], [CONSULTING DAVY JONES] |
| LogThemes.TURKISH | "turkish" | [IS BASLIYOR], [AKIL DANIYOR] |
| LogThemes.EMOJI | "emoji" | Lightning, Robot, Trophy emojis |
24 event types (LogEvent enum):
| Group | Events |
|---|---|
| Agent lifecycle | AGENT_START, AGENT_STOP, AGENT_IDLE |
| Actions | ACTION_START, ACTION_COMPLETE, ACTION_ERROR |
| Planning | PLAN_START, PLAN_COMPLETE, GOAL_ACHIEVED, GOAL_FAILED |
| LLM | LLM_CALL, LLM_RESPONSE, LLM_ERROR, TOOL_CALL |
| Communication | MESSAGE_SENT, MESSAGE_RECEIVED |
| State | STATE_CHANGE, BELIEF_UPDATE |
| General | INFO, WARNING, ERROR, DEBUG, SUCCESS, FAILURE |
Custom themes and runtime switching:
// Implement LogTheme interface
LogTheme custom = new LogTheme() {
public String name() { return "custom"; }
public String format(LogEvent event, String message) {
return "[MY-PREFIX] " + message;
}
};
// Register for lookup by name
LogThemes.register(custom);
LogTheme theme = LogThemes.get("custom");
// Change theme at runtime
logger.setTheme(LogThemes.PIRATE);
// List and check themes
String[] names = LogThemes.listThemes();
boolean exists = LogThemes.hasTheme("starwars");
Prometheus Exporter
If you use Prometheus for monitoring, this exporter spins up a lightweight HTTP server that serves metrics in the Prometheus text format. It syncs with the Metrics singleton and exposes all predefined agent and LLM counters.
// Create and start
PrometheusMetricsExporter exporter = PrometheusMetricsExporter.builder()
.port(9090)
.build();
exporter.start();
// Metrics available at http://localhost:9090/metrics
// Record metrics
exporter.recordActionSuccess("search", 150); // type, duration ms
exporter.recordActionFailure("write", "timeout", 50);
exporter.recordLlmCall("openai", "gpt-4o", 1000, 500, 2300);
exporter.recordLlmCallFailure("anthropic", "claude-3", "rate_limit", 1500);
exporter.setActiveAgents(5);
// One-liner startup
PrometheusMetricsExporter exporter = PrometheusMetricsExporter.createAndStart();
// Shutdown
exporter.stop();
Environment variables:
| Variable | Default | Description |
|---|---|---|
| TNSAI_PROMETHEUS_PORT | 9090 | HTTP server port |
| TNSAI_PROMETHEUS_ENABLED | true | Enable/disable exporter |
// Configure from environment
PrometheusMetricsExporter exporter = PrometheusMetricsExporter.builder()
.fromEnvironment()
.build();
Predefined Prometheus metrics:
| Metric | Type | Labels |
|---|---|---|
| tnsai_agent_actions_total | Counter | status |
| tnsai_agent_actions_success_total | Counter | -- |
| tnsai_agent_actions_failed_total | Counter | error_type |
| tnsai_llm_calls_total | Counter | provider, model, status |
| tnsai_llm_calls_failed_total | Counter | provider, model, error_type |
| tnsai_llm_prompt_tokens_total | Counter | provider, model |
| tnsai_llm_completion_tokens_total | Counter | provider, model |
| tnsai_agent_active | Gauge | -- |
| tnsai_agent_action_duration_seconds | Histogram | action_type |
| tnsai_llm_latency_seconds | Histogram | provider, model |
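Scraping the /metrics endpoint returns the standard Prometheus text exposition format. A plausible excerpt (label and sample values are illustrative, not actual output):

```text
# HELP tnsai_llm_calls_total Total LLM calls
# TYPE tnsai_llm_calls_total counter
tnsai_llm_calls_total{provider="openai",model="gpt-4o",status="success"} 42.0
# HELP tnsai_agent_active Currently active agents
# TYPE tnsai_agent_active gauge
tnsai_agent_active 5.0
```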
Custom metrics:
Counter counter = exporter.getOrCreateCounter("my_counter", "My help text", "label1");
Gauge gauge = exporter.getOrCreateGauge("my_gauge", "My help text", "label1");
Histogram histogram = exporter.getOrCreateHistogram("my_histogram", "My help text", "label1");
LLM Instrumentation
This module provides OpenTelemetry instrumentation specifically for LLM calls. It follows the GenAI semantic conventions, which means your traces are compatible with observability tools like Jaeger, Zipkin, and Grafana Tempo without any custom mapping.
LlmInstrumentation llmInstr = new LlmInstrumentation(telemetry.getTracer());
// Trace a chat completion
ChatResponse response = llmInstr.traceChat("openai", "gpt-4", () -> {
return client.chat(message);
});
// With request parameters
ChatResponse response = llmInstr.traceChat("openai", "gpt-4", 4096, 0.7, () -> {
return client.chat(message);
});
// Trace streaming
Stream<ChatChunk> stream = llmInstr.traceStreamChat("anthropic", "claude-3", () -> {
return client.streamChat(message);
});
// Trace embeddings
List<float[]> embeddings = llmInstr.traceEmbeddings("openai", "text-embedding-3", 10, () -> {
return client.embed(inputs);
});
GenAI semantic convention attributes:
| Attribute | Description |
|---|---|
| gen_ai.system | LLM provider (openai, anthropic, etc.) |
| gen_ai.request.model | Model name |
| gen_ai.request.max_tokens | Max tokens requested |
| gen_ai.request.temperature | Temperature setting |
| gen_ai.request.top_p | Top-p setting |
| gen_ai.usage.prompt_tokens | Prompt tokens used |
| gen_ai.usage.completion_tokens | Completion tokens generated |
| gen_ai.usage.total_tokens | Total tokens |
| gen_ai.response.finish_reasons | Finish reason (stop, length, error) |
| gen_ai.operation.name | Operation type (chat, stream_chat, embeddings) |
TnsAI-specific attributes:
| Attribute | Description |
|---|---|
| tnsai.agent.id | Agent identifier |
| tnsai.agent.name | Agent name |
| tnsai.message.length | Input message length |
| tnsai.tool_use.enabled | Whether tool use is enabled |
| tnsai.tool_calls.count | Number of tool calls |
Recording token usage and tool events on a span:
Span span = Span.current();
llmInstr.recordTokenUsage(span, 150, 50);
llmInstr.recordResponse(span, 150, 50, "stop");
llmInstr.recordToolUse(span, 3);
llmInstr.recordToolCallEvent(span, "brave_search", true);
Span builder for full control:
Span span = llmInstr.spanBuilder("openai", "gpt-4")
.operation("chat")
.maxTokens(4096)
.temperature(0.7)
.topP(0.9)
.agent("agent-001", "ResearchAgent")
.messageLength(500)
.toolUseEnabled(true)
.start();
Agent Instrumentation
Agent instrumentation adds tracing and metrics to agent operations without modifying the Agent class itself. It wraps calls like chat, action execution, and tool calls with OpenTelemetry spans, giving you end-to-end visibility into what your agents are doing.
// Initialize
AgentInstrumentation instrumentation = new AgentInstrumentation(telemetry);
// Or with custom tracer/metrics
AgentInstrumentation instrumentation = new AgentInstrumentation(tracer, metrics);
// Register agents (increments active agent gauge)
instrumentation.registerAgent(agent);
Traced operations:
// Trace chat
String response = instrumentation.traceChat(agent, "Hello", () -> {
return agent.chat("Hello");
});
// Trace action execution
Object result = instrumentation.traceAction(agent, "searchPapers", params, () -> {
return agent.executeAction("searchPapers", params);
});
// Trace tool call
String output = instrumentation.traceToolCall(agent, "brave_search", args, () -> {
return tool.execute(args);
});
// Trace LLM call with GenAI conventions
ChatResponse resp = instrumentation.traceLlmCall(agent, "openai", "gpt-4", () -> {
return llmClient.chat(messages);
});
Span attributes per operation:
| Operation | Attributes |
|---|---|
| agent.chat | agent.id, agent.name, message.length, message.preview |
| agent.action | agent.id, agent.name, action.name, action.parameters.count |
| agent.tool_call | agent.id, tool.name, tool.arguments.count |
| agent.llm_call | agent.id, gen_ai.system, gen_ai.request.model, gen_ai.operation.name |
Recording metrics directly:
instrumentation.recordLlmTokens("openai", "gpt-4", 150, 50, 2300);
instrumentation.recordLlmFailure("openai", "gpt-4", 1500, "rate_limit");
instrumentation.unregisterAgent(agent); // decrements active agent gauge
Agent-Specific Tracing
While the above instrumentation gives you OpenTelemetry-level visibility, agent-specific tracing goes deeper. It captures the full execution trace of a single agent run, including every LLM call, tool invocation, and guardrail check, along with quality scores you attach manually or automatically.
AgentTrace
An AgentTrace represents a complete execution recording for one agent run. You start a trace, add observations as the agent works, and then complete it. This is the foundation for evaluation and debugging.
AgentTrace trace = AgentTrace.start("agent-001", "research-task");
// Add observations during execution
trace.addObservation(Observation.generation("llm-call")
.input("Summarize the document")
.output("The document describes...")
.duration(Duration.ofMillis(2300))
.metadata(Map.of("model", "claude-sonnet-4", "tokens", 1500))
.build());
trace.addObservation(Observation.span("tool-call")
.input(Map.of("tool", "brave_search", "query", "quantum computing"))
.output("Results: ...")
.build());
trace.addObservation(Observation.event("guardrail-check")
.metadata(Map.of("guardrail", "pii-filter", "passed", true))
.build());
trace.complete();
Observation Types
Observations are the building blocks of a trace. Each one records a specific event or operation during the agent's execution, and they come in three types depending on what happened.
| Type | Constant | Description |
|---|---|---|
| SPAN | Observation.span() | Timed execution block (LLM call, tool execution) |
| GENERATION | Observation.generation() | LLM text generation with input/output |
| EVENT | Observation.event() | Point-in-time event (guardrail check, state change) |
Each observation supports: input, output, duration, metadata, parentId (for nesting), and status.
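To illustrate how parentId yields a nested structure, here is a minimal stand-in model (hypothetical types for illustration, not the library's classes): a flat list of observations can be regrouped into a tree by parent id.

```java
import java.util.*;

// Hypothetical stand-in: each observation references its parent by id,
// so a flat list can be regrouped into a tree (null parentId = root).
class Obs {
    final String id, parentId, name;
    Obs(String id, String parentId, String name) {
        this.id = id; this.parentId = parentId; this.name = name;
    }

    static Map<String, List<Obs>> childrenByParent(List<Obs> all) {
        Map<String, List<Obs>> tree = new HashMap<>();
        for (Obs o : all) {
            tree.computeIfAbsent(o.parentId, k -> new ArrayList<>()).add(o);
        }
        return tree;
    }
}
```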
Score
Scores let you attach quality measurements to a trace. You can use them for manual evaluation (did the agent answer correctly?) or automated evaluation (did the output contain PII?). Scores come in three types: numeric, boolean, and categorical.
// Numeric score
trace.addScore(Score.numeric("relevance", 0.92));
// Boolean score
trace.addScore(Score.bool("contains_pii", false));
// Categorical score
trace.addScore(Score.categorical("sentiment", "positive"));
| Factory Method | Score Type | Value |
|---|---|---|
| Score.numeric(name, value) | Numeric | 0.0-1.0 double |
| Score.bool(name, value) | Boolean | true/false |
| Score.categorical(name, value) | Categorical | String category |
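One way to picture the three score types is a single value object with per-type factories; the sketch below is hypothetical (including the 0.0-1.0 range check on numeric scores), not the library's Score class.

```java
// Hypothetical sketch: one value object covering the three score types,
// validating the documented 0.0-1.0 range for numeric scores.
class SimpleScore {
    final String name;
    final Object value;
    private SimpleScore(String name, Object value) { this.name = name; this.value = value; }

    static SimpleScore numeric(String name, double v) {
        if (v < 0.0 || v > 1.0)
            throw new IllegalArgumentException("numeric score must be in [0.0, 1.0]");
        return new SimpleScore(name, v);
    }
    static SimpleScore bool(String name, boolean v) { return new SimpleScore(name, v); }
    static SimpleScore categorical(String name, String v) { return new SimpleScore(name, v); }
}
```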
TraceContext (ThreadLocal Nesting)
TraceContext uses a ThreadLocal to track which trace is active on the current thread. This means you can start nested spans anywhere in your code and they automatically link to the parent span, without passing the trace object through every method call.
// Set current trace for this thread
TraceContext.set(trace);
// Get current trace (returns Optional)
AgentTrace current = TraceContext.current().orElseThrow();
// Nested spans auto-link to parent
try (var scope = TraceContext.startSpan("sub-operation")) {
// This span's parentId is automatically set to the enclosing span
doWork();
}
// Clear when done
TraceContext.clear();
Guardrail SPI
Guardrails are safety checks that evaluate agent behavior during or after execution. Implement the Guardrail interface to define your own checks, such as ensuring the agent does not leak sensitive data or produce harmful content.
public interface Guardrail {
GuardrailResult evaluate(AgentTrace trace);
String name();
}
PiiGuardrail
The PiiGuardrail is a built-in guardrail that scans agent outputs for personally identifiable information such as email addresses, phone numbers, and social security numbers. Use it to prevent your agents from accidentally leaking sensitive data.
PiiGuardrail piiGuard = new PiiGuardrail();
GuardrailResult result = piiGuard.evaluate(trace);
if (!result.passed()) {
System.out.println("PII detected: " + result.findings());
// Findings include: type (EMAIL, PHONE, SSN, etc.), location, severity
}
Wire guardrails into the agent:
Agent agent = AgentBuilder.create()
.model("claude-sonnet-4")
.guardrail(new PiiGuardrail())
.build();
Health Checks
Health checks let you monitor the operational status of your application's components (database, cache, LLM providers, etc.). TnsAI provides an aggregated health registry that runs checks concurrently, handles timeouts gracefully, and produces HTTP-ready summaries suitable for load balancer probes.
HealthStatus levels:
| Status | Severity | Description |
|---|---|---|
| UP | 0 | Component functioning normally |
| DEGRADED | 1 | Functioning with reduced capability |
| UNKNOWN | 2 | Status cannot be determined |
| DOWN | 3 | Not functioning |
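The severity column drives aggregation: when results are combined, the worse (higher-severity) status wins. A plain-Java sketch of that rule (hypothetical, mirroring what combine() is documented to do):

```java
// Hypothetical sketch of worst-status aggregation, mirroring the
// severity ordering in the table above.
enum Status {
    UP(0), DEGRADED(1), UNKNOWN(2), DOWN(3);

    final int severity;
    Status(int severity) { this.severity = severity; }

    // The combined status of two components is whichever is worse.
    Status combine(Status other) {
        return this.severity >= other.severity ? this : other;
    }
}
```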
HealthCheckResult factory methods:
HealthCheckResult.up() // healthy
HealthCheckResult.up("All good") // healthy with message
HealthCheckResult.down() // unhealthy
HealthCheckResult.down("Connection refused") // unhealthy with message
HealthCheckResult.down(exception) // unhealthy from exception
HealthCheckResult.degraded("High latency") // degraded
HealthCheckResult.unknown("Check timed out") // unknown
HealthCheckResult.of(status, message) // any status
// Add details (immutable — returns new instance)
HealthCheckResult result = HealthCheckResult.up()
.withDetail("connections", 10)
.withDetail("latency_ms", 50)
.withDuration(duration);
// Inspect
result.getStatus(); // HealthStatus enum
result.getMessage(); // String or null
result.getDetails(); // Map<String, Object>
result.isHealthy(); // true if UP or DEGRADED
result.getDuration(); // Duration or null
result.toMap(); // Map for JSON serialization
result.combine(other); // returns result with worse status
HealthIndicator interface:
// Implement for custom components
public class DatabaseHealth implements HealthIndicator {
public String getName() { return "database"; }
public HealthCheckResult check() {
return conn.isValid(5) ? HealthCheckResult.up() : HealthCheckResult.down();
}
}
// Lambda shorthand — from boolean
HealthIndicator simple = HealthIndicator.of("cache", () -> cache.isConnected());
// Lambda shorthand — from supplier
HealthIndicator detailed = HealthIndicator.of("cache", () -> {
return HealthCheckResult.up().withDetail("size", cache.size());
});
// Async and timeout support (default methods)
CompletableFuture<HealthCheckResult> future = indicator.checkAsync();
HealthCheckResult result = indicator.checkWithTimeout(Duration.ofSeconds(2));
HealthRegistry:
HealthRegistry registry = HealthRegistry.getInstance(); // singleton
// Register indicators
registry.register(new ResourceHealthIndicator());
registry.register("custom", HealthIndicator.of("custom", () -> true));
// Check health
Optional<HealthCheckResult> single = registry.check("database");
HealthCheckResult all = registry.checkAll(); // default 30s timeout
HealthCheckResult all = registry.checkAll(Duration.ofSeconds(10)); // custom timeout
CompletableFuture<HealthCheckResult> async = registry.checkAllAsync();
HealthCheckResult matched = registry.checkMatching("db-*"); // wildcard
// Quick status (2s timeout per indicator)
HealthStatus status = registry.getQuickStatus(); // UP, DOWN, DEGRADED, or UNKNOWN
// HTTP-ready summary
Map<String, Object> summary = registry.getHealthSummary();
// { "status": "UP", "timestamp": "...", "totalDurationMs": 45,
// "components": { "database": {...}, "cache": {...} } }
// Management
registry.getIndicatorNames(); // Set<String>
registry.getIndicatorCount(); // int
registry.unregister("old");
registry.clear();
registry.shutdown(); // call on app shutdown
ResourceHealthIndicator -- monitors heap memory, CPU load, threads, and deadlocks:
// Default thresholds (85% memory, 90% CPU)
registry.register(new ResourceHealthIndicator());
// Custom thresholds
registry.register(new ResourceHealthIndicator(0.9, 0.9));
// Direct queries
ResourceHealthIndicator res = new ResourceHealthIndicator();
double heapUsage = res.getHeapUsageRatio(); // 0.0-1.0
int threads = res.getThreadCount();
boolean deadlock = res.hasDeadlock();
Reports details: heap.used.mb, heap.max.mb, heap.usage.percent, nonHeap.used.mb, cpu.availableProcessors, cpu.systemLoadAverage, cpu.loadPerProcessor, threads.current, threads.peak, threads.deadlocked. Returns DEGRADED on high memory/CPU, DOWN on deadlock.
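The underlying data comes from the JDK's own management beans. A self-contained sketch of similar checks using plain java.lang.management (illustrative only, not the ResourceHealthIndicator source):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Plain-JDK sketch of the kind of probes ResourceHealthIndicator runs.
class ResourceProbe {
    // Heap usage as a 0.0-1.0 ratio; getMax() may be -1 if undefined.
    static double heapUsageRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();
        return max > 0 ? (double) heap.getUsed() / max : 0.0;
    }

    static int threadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    // findDeadlockedThreads() returns null when no deadlock exists.
    static boolean hasDeadlock() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids != null && ids.length > 0;
    }
}
```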
MCP Transports
The MCP module provides 5 transport implementations plus an OAuth decorator and an auto-detection utility for connecting to MCP servers over stdio, HTTP, SSE, and bidirectional streaming.
Security
The Quality module provides a layered security framework: annotation-driven access control and encryption (`SecurityEnforcer`), content moderation (`PatternBasedModerator`), prompt injection detection (`PromptInjectionDetector`), sandboxed execution, audit logging, and input validation (`ValidationService`).
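As a flavor of what pattern-based detection involves, here is a toy sketch (not the PromptInjectionDetector implementation; real detectors combine many signals, not a single regex):

```java
import java.util.regex.Pattern;

// Toy sketch of pattern-based prompt-injection screening;
// the phrases below are illustrative examples only.
class InjectionScreen {
    private static final Pattern SUSPICIOUS = Pattern.compile(
        "(?i)(ignore (all )?previous instructions|disregard the system prompt|you are now)");

    static boolean looksSuspicious(String input) {
        return SUSPICIOUS.matcher(input).find();
    }
}
```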