LLM Observability
Capture every LLM call as a typed LLMCallLog event — prompt, response, token usage, cost, streaming timing, errors, full context. One publish call per request, decorator-shaped so any provider works without modification.
See also: Cost Tracking, the older CostAwareLLMClient + BudgetManager system focused on spend control. Use observability for full call telemetry; use cost tracking when you need budget enforcement at the client edge.
Why a typed event
SLF4J debug lines and OTel span attributes answer "did this call happen?" but not "which prompt did agent X send in session Y?", "what did the LLM reply?", or "what did this turn cost in USD, and which tenant does that cost belong to?". LLM-ops tools such as LangFuse, Helicone, and Phoenix are built around per-call telemetry as a first-class object. LLMCallLog has the same shape, native to TnsAI.
Quick Start
Wrap any LLMClient with CapturingLLMClient. The default publisher emits one structured SLF4J line per call:
import com.tnsai.llm.observability.CapturingLLMClient;
import com.tnsai.llm.observability.JsonLLMPricingRegistry;
import com.tnsai.llm.observability.Slf4jLLMCallPublisher;
LLMClient base = LLMClientFactory.create("openai", "gpt-4o", 0.7f);
LLMClient observed = new CapturingLLMClient(
    base,
    JsonLLMPricingRegistry.defaultRegistry(), // 7 providers, 14+ models
    new Slf4jLLMCallPublisher());
Agent agent = AgentBuilder.create()
    .role(new MyRole())
    .llm(observed)
    .build();

Every chat / streamChat call now logs:
INFO com.tnsai.llm.callLog - llm.call provider=openai model=gpt-4o elapsedMs=842 \
    promptTokens=312 completionTokens=89 cachedTokens=0 totalTokens=401 \
    costUSD=0.00168 pricingTable=2026-05 finishReason=STOP streamed=false tools=2

Failures log at WARN with errorClass, errorMessage, and httpStatus.
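For ad-hoc analysis it can help to pull the key=value pairs back out of a captured log line. The following is a minimal sketch, not part of TnsAI: the class name CallLogLineSketch is hypothetical, and it assumes values contain no whitespace (which may not hold for errorMessage on WARN lines).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits one llm.call log line into its key=value fields. */
final class CallLogLineSketch {
    /** Tokens without '=' (such as the "llm.call" marker) are skipped. */
    static Map<String, String> parseFields(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }
}
```

A production pipeline would more likely consume the typed LLMCallLog through a publisher than re-parse text; this only suits quick inspection of existing logs.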
What gets captured
LLMCallLog is a typed record carrying:
| Field | Description |
|---|---|
| callId | UUID; primary key for joining the call to downstream events |
| startedAt / completedAt / elapsed | Wall-clock timing |
| provider / model / endpoint | Routing |
| prompt | Messages, system prompt, parameters, prompt-cache markers |
| tools | ToolSurface: names, schemas, SHA-256 hash for cache correlation |
| response | Content, tool calls, reasoning content (o1 / Claude thinking) |
| usage | Prompt / completion / cached / reasoning / total tokens |
| cost | CostEstimate: prompt / completion / cached-discount / total USD |
| finishReason | STOP, LENGTH, TOOL_CALL, CONTENT_FILTER |
| streamMetrics | TTFT + chunk count for streaming calls |
| error | ErrorInfo for failed calls; the error is re-thrown after capture |
| context | Full EventContext: tenant, agent, role, capability, session |
| retryAttempt | Retry counter |
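The capture-then-rethrow behaviour of the decorator can be illustrated with a minimal sketch. Every type here (ChatClientStub, CapturedCall, CapturingClientSketch) is a hypothetical stand-in rather than the TnsAI API, and the real CapturingLLMClient records far more than timing and an error class.

```java
import java.time.Duration;
import java.util.function.Consumer;

/** Stand-in for an LLM client; the real LLMClient interface is richer. */
interface ChatClientStub {
    String chat(String prompt);
}

/** Stand-in for the captured event: prompt, reply, timing, and error class only. */
record CapturedCall(String prompt, String reply, Duration elapsed, String errorClass) {}

/** Decorator sketch: times the call, publishes exactly one event, re-throws failures. */
final class CapturingClientSketch implements ChatClientStub {
    private final ChatClientStub delegate;
    private final Consumer<CapturedCall> publisher;

    CapturingClientSketch(ChatClientStub delegate, Consumer<CapturedCall> publisher) {
        this.delegate = delegate;
        this.publisher = publisher;
    }

    @Override
    public String chat(String prompt) {
        long start = System.nanoTime();
        String reply;
        try {
            reply = delegate.chat(prompt);
        } catch (RuntimeException e) {
            // Capture the failure first, then let the caller see the original exception.
            publisher.accept(new CapturedCall(prompt, null, elapsedSince(start), e.getClass().getName()));
            throw e;
        }
        publisher.accept(new CapturedCall(prompt, reply, elapsedSince(start), null));
        return reply;
    }

    private static Duration elapsedSince(long startNanos) {
        return Duration.ofNanos(System.nanoTime() - startNanos);
    }
}
```

Because the wrapper implements the same interface as the client it wraps, callers do not need to change, which is what lets one decorator sit in front of any provider.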
Pricing Registry
JsonLLMPricingRegistry loads versioned rate cards from classpath JSON:
JsonLLMPricingRegistry pricing = JsonLLMPricingRegistry.defaultRegistry();
// loads /pricing/2026-05.json (7 providers, 14+ models)

Default coverage: openai (GPT-4o, GPT-4o-mini, o1-preview), anthropic (Claude Sonnet 4, Opus 4, Haiku 4.5), google (Gemini 2.0 Flash, Pro), mistral (Large, Small), groq (Llama 3.3 70B, Mixtral 8x7B), cohere (Command R+, R), ollama (wildcard at zero, for local models).
Bring your own rate card for enterprise-negotiated pricing or new providers:
import java.math.BigDecimal;

LLMPricingRegistry custom = new InMemoryLLMPricingRegistry("contract-2026-05");
custom.register("openai", "gpt-4o", new ModelPricing(
    BigDecimal.valueOf(0.0015), // promptPer1k (negotiated)
    BigDecimal.valueOf(0.0005), // cachedPer1k
    BigDecimal.valueOf(0.006),  // completionPer1k
    null));                     // reasoningPer1k
LLMClient observed = new CapturingLLMClient(base, custom, new Slf4jLLMCallPublisher());

The pricingTable field on every LLMCallLog records which rate-card version generated the cost, so historical estimates don't shift when rates change later.
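To make the per-1k arithmetic concrete, here is a minimal sketch of how a cost estimate could be derived from a rate card. RateCard and CostSketch are hypothetical names, not the TnsAI ModelPricing or CostEstimate types, and the sketch assumes that cached tokens are a subset of prompt tokens billed at the cached rate. Reasoning tokens are left out.

```java
import java.math.BigDecimal;

/** Hypothetical stand-in for a per-1k-token rate card (USD). */
record RateCard(BigDecimal promptPer1k, BigDecimal cachedPer1k, BigDecimal completionPer1k) {}

final class CostSketch {
    /**
     * Assumption: cachedTokens is counted inside promptTokens, so only the
     * uncached remainder is billed at the full prompt rate.
     */
    static BigDecimal estimateUsd(RateCard card, long promptTokens, long cachedTokens, long completionTokens) {
        long uncachedTokens = promptTokens - cachedTokens;
        BigDecimal prompt = card.promptPer1k().multiply(BigDecimal.valueOf(uncachedTokens));
        BigDecimal cached = card.cachedPer1k().multiply(BigDecimal.valueOf(cachedTokens));
        BigDecimal completion = card.completionPer1k().multiply(BigDecimal.valueOf(completionTokens));
        // Rates are per 1,000 tokens, so shift the decimal point three places left.
        return prompt.add(cached).add(completion).movePointLeft(3);
    }
}
```

With the negotiated rates above and the token counts from the Quick Start log line (312 prompt, 0 cached, 89 completion), this gives 312 × 0.0015 / 1000 + 89 × 0.006 / 1000 = 0.000468 + 0.000534 = 0.001002 USD. Using BigDecimal throughout avoids the rounding drift that double arithmetic would introduce when many small costs are summed.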
Streaming Capture
For streaming calls, the decorator captures StreamMetrics:
public record StreamMetrics(
    Instant firstChunkAt,
    Duration timeToFirstToken, // operator's #1 latency metric
    long chunkCount,
    Duration interChunkP50,    // p50/p99 are zero in 0.9.x; histogram-friendly
    Duration interChunkP99     // counts ship now, percentiles in a follow-up
) {}

TTFT (time to first token) is the metric you graph for user-perceived latency.
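As an illustration of what these fields measure, the sketch below derives TTFT and inter-chunk percentiles from a list of chunk arrival times. StreamTimingSketch is a hypothetical helper, not how the decorator is implemented, and the nearest-rank percentile method is an assumption chosen for simplicity.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class StreamTimingSketch {
    /** Time from request start to the first chunk; zero if nothing arrived. */
    static Duration timeToFirstToken(Instant startedAt, List<Instant> chunkArrivals) {
        if (chunkArrivals.isEmpty()) {
            return Duration.ZERO;
        }
        return Duration.between(startedAt, chunkArrivals.get(0));
    }

    /** Nearest-rank percentile (0-100) over the gaps between consecutive chunks. */
    static Duration interChunkPercentile(List<Instant> chunkArrivals, double percentile) {
        List<Duration> gaps = new ArrayList<>();
        for (int i = 1; i < chunkArrivals.size(); i++) {
            gaps.add(Duration.between(chunkArrivals.get(i - 1), chunkArrivals.get(i)));
        }
        if (gaps.isEmpty()) {
            return Duration.ZERO; // fewer than two chunks: no gaps to measure
        }
        Collections.sort(gaps);
        int rank = (int) Math.ceil(percentile / 100.0 * gaps.size());
        return gaps.get(Math.max(rank, 1) - 1);
    }
}
```

For chunks arriving 300, 340, 400, and 500 ms after the request starts, TTFT is 300 ms and the gaps are 40, 60, and 100 ms, so the nearest-rank p50 is 60 ms and the p99 is 100 ms.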
Tool Surface Hashing
When the LLM call advertises tools, ToolSurface carries the names + JSON schemas plus a SHA-256 hash of the canonical sorted-key form:
public record ToolSurface(
    List<String> toolNames,
    List<String> toolSchemas,
    String surfaceHash
) {}

Same surfaceHash across calls = identical tool set = prompt-cache friendly. Use the hash to identify cacheable trajectories in your dashboards.
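The idea behind an order-independent surface hash can be sketched as follows. SurfaceHashSketch is a hypothetical helper, and its canonicalization (sort tools by name, then hash each name and schema string separated by a zero byte) is an assumption for illustration; the exact canonical form TnsAI uses is not specified here, so the hashes will not match real surfaceHash values.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class SurfaceHashSketch {
    /** Hashes tool names and schema strings in sorted-name order, returning lowercase hex. */
    static String surfaceHash(Map<String, String> schemaByToolName) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            // TreeMap iteration is sorted by key, so insertion order cannot affect the hash.
            for (Map.Entry<String, String> tool : new TreeMap<>(schemaByToolName).entrySet()) {
                sha256.update(tool.getKey().getBytes(StandardCharsets.UTF_8));
                sha256.update((byte) 0); // separator so "ab"+"c" differs from "a"+"bc"
                sha256.update(tool.getValue().getBytes(StandardCharsets.UTF_8));
                sha256.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha256.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

This sketch treats each schema as an opaque string. A real canonical form would also need to sort the keys inside each JSON schema, otherwise two semantically identical schemas with different key order would hash differently.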
Custom Publisher
LLMCallPublisher is a single-method functional interface. Build your own to push to LangFuse, Helicone, Phoenix, or a custom sink:
public final class LangFusePublisher implements LLMCallPublisher {
    // langfuseClient: a LangFuse client held by this publisher (declaration and construction not shown)
    @Override
    public void publish(LLMCallLog call) {
        // Convert LLMCallLog → LangFuse trace + generation
        langfuseClient.trace()
            .name(call.callId())
            .metadata(Map.of(
                "provider", call.provider(),
                "model", call.model(),
                "tenant", call.context().tenantId().orElse("default")))
            .generation(g -> g
                .input(call.prompt().messages())
                .output(call.response().content())
                .usage(call.usage())
                .totalCost(call.cost().totalUSD()))
            .submit();
    }
}

The publisher contract requires publish not to throw: observability failures must never block the agent's hot path.
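One way to honour the no-throw contract when the underlying sink can fail is to wrap it. The sketch below uses hypothetical stand-ins (CallLogStub, CallPublisherStub, NonThrowingPublisher) instead of the real LLMCallLog and LLMCallPublisher types, and it counts dropped events rather than logging them to stay self-contained.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Stand-in for the call record; the real LLMCallLog carries many more fields. */
record CallLogStub(String callId) {}

/** Stand-in mirroring the single-method publisher shape described above. */
@FunctionalInterface
interface CallPublisherStub {
    void publish(CallLogStub call);
}

/** Wraps any publisher so that sink failures are counted instead of propagated. */
final class NonThrowingPublisher implements CallPublisherStub {
    private final CallPublisherStub delegate;
    private final AtomicLong dropped = new AtomicLong();

    NonThrowingPublisher(CallPublisherStub delegate) {
        this.delegate = delegate;
    }

    @Override
    public void publish(CallLogStub call) {
        try {
            delegate.publish(call);
        } catch (RuntimeException e) {
            // Swallow the failure: losing one telemetry event is cheaper than failing the agent call.
            dropped.incrementAndGet();
        }
    }

    /** Number of events the delegate failed to accept. */
    long droppedCount() {
        return dropped.get();
    }
}
```

Exposing the dropped count as a metric keeps silent telemetry loss visible. A network-backed sink would probably also want to hand events to a bounded queue so that slow submissions do not add latency to the calling thread.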
Cost Attribution
LLMCallLog.context() carries the full EventContext — tenant, agent, role, capability, session, group. Aggregate cost in your downstream sink along any of these dimensions:
- Per tenant: billing
- Per agent: which agent is the budget hog
- Per role: which role's LLM allocation is tight
- Per capability: chatty vs terse @Capability implementations
- Per session: per-conversation cost for end-user billing
Multi-agent cost split per group member works the same way — group context propagates.
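A downstream sink can roll cost up along any of these dimensions with a single grouping function. The sketch below is an assumption-laden illustration: CostRow is a hypothetical flattened row that a sink might store per call, not the LLMCallLog or EventContext types.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical flattened row a sink might keep for each captured call. */
record CostRow(String tenantId, String agentId, String sessionId, BigDecimal totalUsd) {}

final class CostRollupSketch {
    /** Sums totalUsd per value of the chosen dimension (tenant, agent, session, ...). */
    static Map<String, BigDecimal> totalBy(List<CostRow> rows, Function<CostRow, String> dimension) {
        return rows.stream().collect(Collectors.groupingBy(
            dimension,
            Collectors.reducing(BigDecimal.ZERO, CostRow::totalUsd, BigDecimal::add)));
    }
}
```

Calling totalBy(rows, CostRow::tenantId) yields per-tenant billing totals, and swapping in CostRow::agentId or CostRow::sessionId gives the other views without changing the aggregation code.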
What's Not in the Default Publisher
Slf4jLLMCallPublisher deliberately does NOT log raw prompt or response text. Those can carry PII or secrets (user dictation, API keys passed as tool arguments, addresses in responses). A verbose dump belongs behind the redaction SPI from issue #80, on a separate publisher with explicit consumer opt-in.
Coverage Notes
- The decorator covers chat() and streamChat(). Multimodal chat(List<ContentPart> ...) and tool-aware streamChatWithSpec pass through without capture in 0.9.x; those paths see less production usage and will land with integration coverage in a follow-up.
- usage().promptTokens() is zero when the provider didn't populate the usage block (some local Ollama models). The cost estimate is then also zero, which is a meaningful "no usage data" signal, not a bug.
- The endpoint field is populated when the underlying client exposes its base URL; it falls back to an empty string otherwise.
See Also
- Cost Tracking: CostAwareLLMClient + BudgetManager for client-edge spend control
- Sampling: pair CapturingLLMClient with sampling decorators when you ship to a high-volume aggregator
- Providers: the 14 built-in LLM providers all work with the decorator unchanged