Capture every LLM call as a typed LLMCallLog event — prompt, response, token usage, cost, streaming timing, errors, full context. One publish call per request, decorator-shaped so any provider works without modification.

See also: Cost Tracking — the older CostAwareLLMClient + BudgetManager system focused on spend control. Use observability for full call telemetry; use cost tracking when you need budget enforcement at the client edge.

Why a typed event

SLF4J debug lines and OTel span attributes answer "did this call happen?" but not "which prompt did agent X send in session Y?", "what did the LLM reply?", or "what did this turn cost in USD, and which tenant does it belong to?". LLM-ops tools such as LangFuse, Helicone, and Phoenix are built around per-call telemetry as a first-class object. LLMCallLog has the same shape, native to TnsAI.

Quick Start

Wrap any LLMClient with CapturingLLMClient. The default publisher emits one structured SLF4J line per call:

import com.tnsai.llm.observability.CapturingLLMClient;
import com.tnsai.llm.observability.JsonLLMPricingRegistry;
import com.tnsai.llm.observability.Slf4jLLMCallPublisher;

LLMClient base = LLMClientFactory.create("openai", "gpt-4o", 0.7f);

LLMClient observed = new CapturingLLMClient(
        base,
        JsonLLMPricingRegistry.defaultRegistry(),  // 7 providers, 14+ models
        new Slf4jLLMCallPublisher());

Agent agent = AgentBuilder.create()
        .role(new MyRole())
        .llm(observed)
        .build();

Every chat() / streamChat() call now logs a line like this:

INFO  com.tnsai.llm.callLog - llm.call provider=openai model=gpt-4o elapsedMs=842 \
  promptTokens=312 completionTokens=89 cachedTokens=0 totalTokens=401 \
  costUSD=0.00168 pricingTable=2026-05 finishReason=STOP streamed=false tools=2

Failures log at WARN with errorClass, errorMessage, and httpStatus.

What gets captured

LLMCallLog is a typed record carrying:

Field	Description
`callId`	UUID — primary key for joining call to downstream events
`startedAt` / `completedAt` / `elapsed`	Wall-clock timing
`provider` / `model` / `endpoint`	Routing
`prompt`	Messages, system prompt, parameters, prompt-cache markers
`tools`	`ToolSurface` — names, schemas, SHA-256 hash for cache correlation
`response`	Content, tool calls, reasoning content (o1 / Claude thinking)
`usage`	Prompt / completion / cached / reasoning / total tokens
`cost`	`CostEstimate` — prompt / completion / cached-discount / total USD
`finishReason`	STOP, LENGTH, TOOL_CALL, CONTENT_FILTER
`streamMetrics`	TTFT + chunk count for streaming calls
`error`	`ErrorInfo` for failed calls — re-thrown after capture
`context`	Full `EventContext` — tenant, agent, role, capability, session
`retryAttempt`	Retry counter

Pricing Registry

JsonLLMPricingRegistry loads versioned rate cards from classpath JSON:

JsonLLMPricingRegistry pricing = JsonLLMPricingRegistry.defaultRegistry();
// loads /pricing/2026-05.json — 7 providers, 14+ models

Default coverage: openai (GPT-4o, GPT-4o-mini, o1-preview), anthropic (Claude Sonnet 4, Opus 4, Haiku 4.5), google (Gemini 2.0 Flash, Pro), mistral (Large, Small), groq (Llama 3.3 70B, Mixtral 8x7B), cohere (Command R+, R), ollama (wildcard at zero — local models).

Bring your own rate card for enterprise-negotiated pricing or new providers:

import java.math.BigDecimal;

// Rates are expressed per 1,000 tokens, in USD.
LLMPricingRegistry custom = new InMemoryLLMPricingRegistry("contract-2026-05");
custom.register("openai", "gpt-4o", new ModelPricing(
        BigDecimal.valueOf(0.0015),  // promptPer1k (negotiated)
        BigDecimal.valueOf(0.0005),  // cachedPer1k
        BigDecimal.valueOf(0.006),   // completionPer1k
        null));                      // reasoningPer1k (not set)

LLMClient observed = new CapturingLLMClient(base, custom, new Slf4jLLMCallPublisher());

The pricingTable field on every LLMCallLog records which rate-card version produced the cost, so historical estimates don't shift when rates change later.
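
As a rough illustration of how a per-1k rate card turns token counts into a USD figure, the sketch below recomputes a cost from the negotiated rates in the example above. CostSketch and its estimate method are hypothetical stand-ins, not TnsAI API, and the assumption that cached tokens are a subset of prompt tokens billed at the cached rate may not match how CostEstimate is actually computed.

```java
import java.math.BigDecimal;

// Hypothetical stand-in, not part of TnsAI: shows per-1k rate arithmetic only.
final class CostSketch {

    // Assumption: cachedTokens is a subset of promptTokens and is billed at cachedPer1k.
    static BigDecimal estimate(long promptTokens, long cachedTokens, long completionTokens,
                               BigDecimal promptPer1k, BigDecimal cachedPer1k,
                               BigDecimal completionPer1k) {
        BigDecimal uncached = BigDecimal.valueOf(promptTokens - cachedTokens).multiply(promptPer1k);
        BigDecimal cached = BigDecimal.valueOf(cachedTokens).multiply(cachedPer1k);
        BigDecimal completion = BigDecimal.valueOf(completionTokens).multiply(completionPer1k);
        // Sum the token-weighted rates, then scale from per-1k to per-token.
        return uncached.add(cached).add(completion).movePointLeft(3);
    }
}
```

With those rates, 312 prompt tokens, 0 cached tokens, and 89 completion tokens come to 0.001002 USD under this sketch.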

Streaming Capture

For streaming calls, the decorator captures StreamMetrics:

import java.time.Duration;
import java.time.Instant;

public record StreamMetrics(
    Instant firstChunkAt,
    Duration timeToFirstToken,    // operator's #1 latency metric
    long chunkCount,
    // Histogram-friendly inter-chunk percentiles. Both are zero in 0.9.x:
    // chunk counts ship now, percentiles in a follow-up.
    Duration interChunkP50,
    Duration interChunkP99
) {}

TTFT (time to first token) is the metric you graph for user-perceived latency.
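
To make the two populated metrics concrete, here is a minimal sketch of deriving TTFT and chunk count from chunk arrival times. StreamTimingSketch and its Timing record are hypothetical helpers for illustration; how CapturingLLMClient measures these internally is not shown in this page.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical helper, not TnsAI API: derives TTFT and chunk count from arrival times.
final class StreamTimingSketch {
    record Timing(Duration timeToFirstToken, long chunkCount) {}

    static Timing measure(Instant startedAt, List<Instant> chunkArrivals) {
        if (chunkArrivals.isEmpty()) {
            // No chunk ever arrived, so there is no first-token latency to report.
            return new Timing(Duration.ZERO, 0);
        }
        // TTFT is the gap between sending the request and receiving the first chunk.
        Duration ttft = Duration.between(startedAt, chunkArrivals.get(0));
        return new Timing(ttft, chunkArrivals.size());
    }
}
```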

Tool Surface Hashing

When the LLM call advertises tools, ToolSurface carries the tool names and JSON schemas, plus a SHA-256 hash of their canonical sorted-key form:

public record ToolSurface(
    List<String> toolNames,
    List<String> toolSchemas,
    String surfaceHash
) {}

Same surfaceHash across calls = identical tool set = prompt-cache friendly. Use the hash to identify cacheable trajectories in your dashboards.
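
The idea of an order-independent surface hash can be sketched as follows. This is an illustration under stated assumptions, not TnsAI's actual canonicalization: SurfaceHashSketch is a hypothetical class, schemas are assumed to already be in canonical sorted-key JSON form, and only the tool ordering is normalized here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not TnsAI API.
final class SurfaceHashSketch {

    // toolSchemas maps tool name -> schema JSON (assumed already canonical).
    static String hash(Map<String, String> toolSchemas) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            // TreeMap iterates in sorted name order, so registration order cannot change the hash.
            for (Map.Entry<String, String> entry : new TreeMap<>(toolSchemas).entrySet()) {
                sha256.update(entry.getKey().getBytes(StandardCharsets.UTF_8));
                sha256.update((byte) 0); // separator: keeps "ab"+"c" distinct from "a"+"bc"
                sha256.update(entry.getValue().getBytes(StandardCharsets.UTF_8));
                sha256.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha256.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", ex);
        }
    }
}
```

Two calls that advertise the same tools in a different order produce the same digest here, which is the property a dashboard needs to group cacheable trajectories.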

Custom Publisher

LLMCallPublisher is a single-method functional interface. Build your own to push to LangFuse, Helicone, Phoenix, or a custom sink:

public final class LangFusePublisher implements LLMCallPublisher {
    // langfuseClient is your LangFuse SDK client, injected elsewhere (not shown).

    @Override
    public void publish(LLMCallLog call) {
        try {
            // Convert LLMCallLog → LangFuse trace + generation
            langfuseClient.trace()
                    .name(call.callId())
                    .metadata(Map.of(
                            "provider", call.provider(),
                            "model", call.model(),
                            "tenant", call.context().tenantId().orElse("default")))
                    .generation(g -> g
                            .input(call.prompt().messages())
                            .output(call.response().content())
                            .usage(call.usage())
                            .totalCost(call.cost().totalUSD()))
                    .submit();
        } catch (RuntimeException e) {
            // publish must not throw: drop the failure (log it if you need visibility)
        }
    }
}

The publisher contract requires publish not to throw — observability failures must never block the agent's hot path.
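
One way to honor that contract for any sink is to wrap it in a guard that catches failures and reports them out of band. The sketch below is a generic illustration: SafePublisherSketch is hypothetical, and java.util.function.Consumer stands in for LLMCallPublisher so the example compiles without TnsAI on the classpath.

```java
import java.util.function.Consumer;

// Hypothetical wrapper, not TnsAI API. Consumer<T> stands in for LLMCallPublisher.
final class SafePublisherSketch {

    static <T> Consumer<T> neverThrowing(Consumer<T> delegate, Consumer<Throwable> onFailure) {
        return event -> {
            try {
                delegate.accept(event);
            } catch (RuntimeException e) {
                // Report the failure out of band; never let it reach the caller.
                try {
                    onFailure.accept(e);
                } catch (RuntimeException ignored) {
                    // A failing failure handler must not break the contract either.
                }
            }
        };
    }
}
```

In a real publisher the onFailure hook would typically increment a metric or write a rate-limited log line, so a broken sink stays visible without slowing the agent.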

Cost Attribution

LLMCallLog.context() carries the full EventContext — tenant, agent, role, capability, session, group. Aggregate cost in your downstream sink along any of these dimensions:

Per tenant — billing
Per agent — which agent is the budget hog
Per role — which role's LLM allocation is tight
Per capability — chatty vs terse @Capability implementations
Per session — per-conversation cost for end-user billing

Multi-agent cost split per group member works the same way — group context propagates.
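
A downstream rollup along any of these dimensions can be sketched as a group-and-sum. The CallCost record below is a hypothetical flattened row that a sink might project from LLMCallLog; its field names are assumptions for illustration, not the real EventContext accessors.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch, not TnsAI API.
final class CostRollupSketch {
    // Flattened row; a real sink would fill these from LLMCallLog.context() and cost().
    record CallCost(String tenant, String agent, BigDecimal totalUSD) {}

    // Sums totalUSD per value of the chosen dimension (tenant, agent, ...).
    static Map<String, BigDecimal> totalBy(List<CallCost> calls,
                                           Function<CallCost, String> dimension) {
        return calls.stream().collect(Collectors.groupingBy(
                dimension,
                TreeMap::new,
                Collectors.reducing(BigDecimal.ZERO, CallCost::totalUSD, BigDecimal::add)));
    }
}
```

Passing CallCost::tenant yields a per-tenant billing view, and CallCost::agent yields the per-agent view, from the same rows.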

What's Not in the Default Publisher

Slf4jLLMCallPublisher deliberately does NOT log raw prompt or response text. Those can carry PII and secrets (user dictation, API keys passed as tool arguments, addresses in responses). A verbose dump belongs behind the redaction SPI from issue #80, on a separate publisher that consumers opt into explicitly.

Coverage Notes

The decorator covers chat() and streamChat(). Multimodal chat(List<ContentPart> ...) and tool-aware streamChatWithSpec pass through without capture in 0.9.x. Those paths see less production usage; capture for them will land, with integration coverage, in a follow-up.
usage().promptTokens() is zero when the provider doesn't populate the usage block (some local Ollama models). The cost estimate is then zero as well, which is a meaningful "no usage data" signal, not a bug.
The endpoint field is populated when the underlying client exposes its base URL; otherwise it falls back to an empty string.
