LLM Caching

Reduce latency and cost with semantic response caching. The cache uses similarity matching so that near-identical prompts return cached responses without hitting the API.

Setup

Wrap any LLMClient with CachedLLMClient to enable semantic caching. The cache compares new prompts against stored ones using embedding similarity, so even rephrased questions can hit the cache.

LLMClient cached = CachedLLMClient.wrap(baseClient)
    .withCache(InMemorySemanticCache.builder()
        .embeddingProvider(new OpenAIEmbeddingProvider())
        .ttlSeconds(3600)
        .build())
    .highThreshold(0.95)   // Similarity >= 0.95: direct cache hit
    .lowThreshold(0.70)    // Similarity < 0.70: cache miss
    .cacheStreaming(true)   // Cache streaming responses
    .build();

How It Works

The cache uses a two-threshold system to decide when a stored response is "close enough" to reuse. This avoids returning stale answers for slightly different questions while still catching exact or near-exact duplicates.

  1. A prompt comes in
  2. The cache computes semantic similarity against stored prompts
  3. If similarity >= highThreshold (0.95) → direct hit, return cached response
  4. If similarity is between lowThreshold and highThreshold → gray zone, the match may be verified before reuse
  5. If similarity < lowThreshold (0.70) → cache miss, call LLM and store result
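
The decision itself reduces to a comparison against the two thresholds. The helper below is a minimal, self-contained sketch of that logic; the class and method names are illustrative only, since CachedLLMClient performs this internally:

import java.util.Optional;

// Illustrative two-threshold decision (hypothetical helper, not part of the library).
class ThresholdDecision {
    static final double HIGH_THRESHOLD = 0.95;
    static final double LOW_THRESHOLD = 0.70;

    // similarity: score of the closest stored prompt; verified: result of an optional gray-zone check
    static Optional<String> decide(double similarity, String cachedResponse, boolean verified) {
        if (similarity >= HIGH_THRESHOLD) {
            return Optional.of(cachedResponse);    // direct hit: reuse the cached response
        }
        if (similarity < LOW_THRESHOLD) {
            return Optional.empty();               // miss: call the LLM and store the result
        }
        return verified ? Optional.of(cachedResponse)   // gray zone: reuse only if verified
                        : Optional.empty();
    }
}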

Configuration

Tune the similarity thresholds to control the tradeoff between cache hit rate and answer accuracy. Lower thresholds give more hits but risk returning less relevant cached responses.

Parameter                  Default                Description
withCache(SemanticCache)   InMemorySemanticCache  Semantic cache implementation to use
highThreshold              0.95                   Similarity score for direct cache hit
lowThreshold               0.70                   Minimum similarity to consider a match
cacheStreaming             true                   Whether to cache streaming responses

TTL and max size are configured on the SemanticCache implementation (e.g., InMemorySemanticCache.builder().ttlSeconds(3600).build()).

Prompt Caching

Some providers (notably Anthropic) offer native prompt caching that lets you reuse previously processed system prompts, tools, and conversation prefixes at dramatically reduced cost. PromptCachingClient wraps any client and automatically adds the required cache control markers -- you do not need to modify your prompts manually.

Builder

Configure which parts of the request to cache and how many breakpoints to place in conversation history.

PromptCachingClient client = PromptCachingClient.builder()
    .client(anthropicClient)         // Required: LLM client to wrap
    .cacheSystemPrompt(true)         // Cache the system prompt (default: true)
    .cacheTools(true)                // Cache tool definitions (default: true)
    .cacheHistoryBreakpoints(2)      // Number of history cache points (default: 4, max: 4)
    .minTokensForCaching(1024)       // Minimum token threshold (default: 1024, min: 1024)
    .build();

How It Works

PromptCachingClient transparently modifies outgoing requests to add cache markers. Your application code uses the client like any other LLMClient -- the caching is invisible.

  • System prompt: Marked for caching so it is reused across requests without re-processing.
  • Tools: A cache_control marker is added to the last tool definition (per Anthropic's recommendation).
  • History: Cache breakpoints are distributed evenly across conversation history. With cacheHistoryBreakpoints(2) and 10 messages, breakpoints are placed at positions ~3 and ~6 (see the sketch below).
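
To make the breakpoint placement concrete, the sketch below shows one way N breakpoints can be spread evenly over M messages; the helper is illustrative only, since the actual placement is internal to PromptCachingClient:

// Illustrative even distribution of cache breakpoints over conversation history.
static int[] breakpointPositions(int messageCount, int breakpoints) {
    int[] positions = new int[breakpoints];
    for (int i = 1; i <= breakpoints; i++) {
        // Split the history into (breakpoints + 1) roughly equal segments
        positions[i - 1] = messageCount * i / (breakpoints + 1);
    }
    return positions;
}

// breakpointPositions(10, 2) -> [3, 6]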

Usage is identical to any LLMClient -- caching is automatic:

ChatResponse response = client.chat("Hello", systemPrompt, history, tools);

// Streaming works too
Stream<String> tokens = client.streamChat("Tell me more", systemPrompt, history, tools);
Stream<ChatChunk> chunks = client.streamChatWithSpec("Continue", systemPrompt, history, tools);

Cost Savings

Prompt caching provides significant cost savings, especially for applications with long system prompts or many tools. The initial cache write costs about 25% more than regular input tokens, but every subsequent cache read saves 90%.

Operation     Cost Impact
Cache read    90% cheaper than regular input tokens
Cache write   25% more expensive (one-time cost)
Cache TTL     5 minutes (refreshed on each use)

Over multiple requests with the same system prompt and tools, savings compound rapidly.
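
As a rough illustration using the multipliers above (the token and request counts here are assumed, not measured): a 2,000-token cached prefix reused across 10 requests costs one write at 1.25x plus nine reads at 0.10x, instead of ten full-price passes:

// Back-of-the-envelope estimate using the multipliers above (illustrative numbers only).
double prefixTokens = 2_000;
int requests = 10;

double withoutCache = prefixTokens * requests;                    // 20,000 token-equivalents
double withCache = prefixTokens * 1.25                            // one cache write (+25%)
                 + prefixTokens * 0.10 * (requests - 1);          // nine cache reads (-90%)

double savings = 1 - withCache / withoutCache;                    // ~0.785, i.e. ~78% cheaper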

Statistics

Monitor your cache effectiveness with built-in counters. Track hit rates and estimated savings to verify that caching is working as expected.

// Token counters
long readTokens = client.getTotalCacheReadTokens();
long creationTokens = client.getTotalCacheCreationTokens();
long requests = client.getRequestCount();

// Cache hit rate (0.0 to 1.0)
double hitRate = client.getCacheHitRate();

// Estimated savings as a fraction (e.g., 0.85 = 85% savings)
// Accounts for read savings (90%) minus creation overhead (25%)
double savings = client.getEstimatedSavings();

// Reset all counters
client.resetStats();

Per-response cache usage is also available on ChatResponse:

ChatResponse response = client.chat("Hello", systemPrompt, history, tools);

if (response.hasCacheUsage()) {
    response.getCacheReadInputTokens().ifPresent(
        tokens -> System.out.println("Cache read: " + tokens));
    response.getCacheCreationInputTokens().ifPresent(
        tokens -> System.out.println("Cache created: " + tokens));
}

Configuration Introspection

Inspect the current caching configuration at runtime, useful for debugging or logging the active settings.

client.isCacheSystemPromptEnabled();   // boolean
client.isCacheToolsEnabled();          // boolean
client.getCacheHistoryBreakpoints();   // int
client.getDelegate();                  // underlying LLMClient

Semantic Cache Interface

If the built-in InMemorySemanticCache does not fit your needs (for example, you want Redis-backed caching or persistence across restarts), implement the SemanticCache interface with your own storage backend.

public interface SemanticCache {
    Optional<String> get(String prompt);
    void put(String prompt, String response);
    void invalidate(String prompt);
    void clear();
}

Built-in: InMemorySemanticCache (thread-safe, LRU eviction).
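
For example, a minimal custom backend might look like the sketch below; it keys on exact prompt text for brevity, whereas a production implementation would store embeddings, match by similarity, and handle TTL and eviction:

import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a custom SemanticCache backend (exact-match only; illustrative).
public class MapBackedSemanticCache implements SemanticCache {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    @Override public Optional<String> get(String prompt) {
        return Optional.ofNullable(store.get(prompt));
    }

    @Override public void put(String prompt, String response) {
        store.put(prompt, response);
    }

    @Override public void invalidate(String prompt) {
        store.remove(prompt);
    }

    @Override public void clear() {
        store.clear();
    }
}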
