LLM Caching
Reduce latency and cost with semantic response caching. The cache uses similarity matching so that near-identical prompts return cached responses without hitting the API.
Setup
Wrap any LLMClient with CachedLLMClient to enable semantic caching. The cache compares new prompts against stored ones using embedding similarity, so even rephrased questions can hit the cache.
```java
LLMClient cached = CachedLLMClient.wrap(baseClient)
    .withCache(InMemorySemanticCache.builder()
        .embeddingProvider(new OpenAIEmbeddingProvider())
        .ttlSeconds(3600)
        .build())
    .highThreshold(0.95)   // Direct cache hit
    .lowThreshold(0.70)    // Similarity threshold
    .cacheStreaming(true)  // Cache streaming responses
    .build();
```

How It Works
The cache uses a two-threshold system to decide when a stored response is "close enough" to reuse. This avoids returning stale answers for slightly different questions while still catching exact or near-exact duplicates.
- A prompt comes in.
- The cache computes semantic similarity against stored prompts.
- If similarity >= `highThreshold` (0.95) → direct hit, return the cached response.
- If similarity is between `lowThreshold` and `highThreshold` → gray zone, the cache may verify before reusing.
- If similarity < `lowThreshold` (0.70) → cache miss, call the LLM and store the result.
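The steps above can be sketched as a small decision helper. `CacheDecision` and `ThresholdPolicy` are illustrative names for this sketch, not part of the library's API:

```java
// Sketch of the two-threshold decision; names here are illustrative only.
enum CacheDecision { HIT, GRAY_ZONE, MISS }

class ThresholdPolicy {
    private final double highThreshold;
    private final double lowThreshold;

    ThresholdPolicy(double highThreshold, double lowThreshold) {
        this.highThreshold = highThreshold;
        this.lowThreshold = lowThreshold;
    }

    CacheDecision decide(double similarity) {
        if (similarity >= highThreshold) return CacheDecision.HIT;       // return cached response
        if (similarity >= lowThreshold) return CacheDecision.GRAY_ZONE;  // may verify before reuse
        return CacheDecision.MISS;                                       // call the LLM, store result
    }
}
```

With the defaults (0.95 / 0.70), a similarity of 0.80 lands in the gray zone rather than being treated as either a hit or a miss.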
Configuration
Tune the similarity thresholds to control the tradeoff between cache hit rate and answer accuracy. Lower thresholds give more hits but risk returning less relevant cached responses.
| Parameter | Default | Description |
|---|---|---|
| `withCache(SemanticCache)` | `InMemorySemanticCache` | Semantic cache implementation to use |
| `highThreshold` | 0.95 | Similarity score for a direct cache hit |
| `lowThreshold` | 0.70 | Minimum similarity to consider a match |
| `cacheStreaming` | `true` | Whether to cache streaming responses |

TTL and max size are configured on the `SemanticCache` implementation (e.g., `InMemorySemanticCache.builder().ttlSeconds(3600).build()`).
Prompt Caching
Some providers (notably Anthropic) offer native prompt caching that lets you reuse previously processed system prompts, tools, and conversation prefixes at dramatically reduced cost. PromptCachingClient wraps any client and automatically adds the required cache control markers -- you do not need to modify your prompts manually.
Builder
Configure which parts of the request to cache and how many breakpoints to place in conversation history.
```java
PromptCachingClient client = PromptCachingClient.builder()
    .client(anthropicClient)       // Required: LLM client to wrap
    .cacheSystemPrompt(true)       // Cache the system prompt (default: true)
    .cacheTools(true)              // Cache tool definitions (default: true)
    .cacheHistoryBreakpoints(2)    // Number of history cache points (default: 4, max: 4)
    .minTokensForCaching(1024)     // Minimum token threshold (default: 1024, min: 1024)
    .build();
```

How It Works
PromptCachingClient transparently modifies outgoing requests to add cache markers. Your application code uses the client like any other LLMClient -- the caching is invisible.
- System prompt: Marked for caching so it is reused across requests without re-processing.
- Tools: A `cache_control` marker is added to the last tool definition (per Anthropic's recommendation).
- History: Cache breakpoints are distributed evenly across the conversation history. With `cacheHistoryBreakpoints(2)` and 10 messages, breakpoints are placed at positions ~3 and ~6.
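The even distribution can be sketched as follows; this is a hypothetical reconstruction of the placement rule (the library's exact positioning may differ slightly), but it reproduces the ~3 and ~6 example above:

```java
// Hypothetical sketch: place `breakpoints` cache markers evenly across
// `messages` history entries. Not the library's actual implementation.
class BreakpointPlanner {
    static java.util.List<Integer> positions(int breakpoints, int messages) {
        java.util.List<Integer> positions = new java.util.ArrayList<>();
        for (int i = 1; i <= breakpoints; i++) {
            positions.add(i * messages / (breakpoints + 1)); // integer division
        }
        return positions;
    }
}
```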
Usage is identical to any `LLMClient` -- caching is automatic:

```java
ChatResponse response = client.chat("Hello", systemPrompt, history, tools);

// Streaming works too
Stream<String> tokens = client.streamChat("Tell me more", systemPrompt, history, tools);
Stream<ChatChunk> chunks = client.streamChatWithSpec("Continue", systemPrompt, history, tools);
```

Cost Savings
Prompt caching provides significant cost savings, especially for applications with long system prompts or many tools. The initial cache write is slightly more expensive, but every subsequent read saves 90%.
| Operation | Cost Impact |
|---|---|
| Cache read | 90% cheaper than regular input tokens |
| Cache write | 25% more expensive (one-time cost) |
| Cache TTL | 5 minutes (refreshed on each use) |
Over multiple requests with the same system prompt and tools, savings compound rapidly.
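The table's multipliers make the compounding easy to work out. A back-of-envelope sketch (not the library's `getEstimatedSavings()` formula), assuming one cache write followed by reads on every subsequent request:

```java
// Back-of-envelope savings model using the multipliers from the table above:
// cache write = 1.25x and cache read = 0.10x of regular input-token cost.
class CacheSavings {
    // Fraction saved on `cachedTokens` across `requests` identical requests.
    static double estimate(long cachedTokens, int requests) {
        double withoutCache = (double) cachedTokens * requests;            // 1.0x every request
        double withCache = cachedTokens * (1.25 + 0.10 * (requests - 1));  // one write, then reads
        return 1.0 - withCache / withoutCache;
    }
}
```

For example, a 2,000-token cached prefix reused across 10 requests costs 2.15x the single-request price instead of 10x, a saving of 78.5%.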
Statistics
Monitor your cache effectiveness with built-in counters. Track hit rates and estimated savings to verify that caching is working as expected.
```java
// Token counters
long readTokens = client.getTotalCacheReadTokens();
long creationTokens = client.getTotalCacheCreationTokens();
long requests = client.getRequestCount();

// Cache hit rate (0.0 to 1.0)
double hitRate = client.getCacheHitRate();

// Estimated savings as a fraction (e.g., 0.85 = 85% savings)
// Accounts for read savings (90%) minus creation overhead (25%)
double savings = client.getEstimatedSavings();

// Reset all counters
client.resetStats();
```

Per-response cache usage is also available on `ChatResponse`:
```java
ChatResponse response = client.chat("Hello", systemPrompt, history, tools);
if (response.hasCacheUsage()) {
    response.getCacheReadInputTokens().ifPresent(
        tokens -> System.out.println("Cache read: " + tokens));
    response.getCacheCreationInputTokens().ifPresent(
        tokens -> System.out.println("Cache created: " + tokens));
}
```

Configuration Introspection
Inspect the current caching configuration at runtime, useful for debugging or logging the active settings.
```java
client.isCacheSystemPromptEnabled();  // boolean
client.isCacheToolsEnabled();         // boolean
client.getCacheHistoryBreakpoints();  // int
client.getDelegate();                 // underlying LLMClient
```

Semantic Cache Interface
If the built-in InMemorySemanticCache does not fit your needs (for example, you want Redis-backed caching or persistence across restarts), implement the SemanticCache interface with your own storage backend.
```java
public interface SemanticCache {
    Optional<String> get(String prompt);
    void put(String prompt, String response);
    void invalidate(String prompt);
    void clear();
}
```

Built-in: `InMemorySemanticCache` (thread-safe, LRU eviction).
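As a starting point for a custom backend, here is a minimal exact-match implementation. `MapSemanticCache` is a hypothetical name for this sketch; a real semantic backend (Redis, a vector store) would store embeddings and look up by similarity rather than exact key. The interface is repeated so the sketch compiles standalone:

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Repeated from the docs so this sketch is self-contained.
interface SemanticCache {
    Optional<String> get(String prompt);
    void put(String prompt, String response);
    void invalidate(String prompt);
    void clear();
}

// Hypothetical exact-match backend; swap the map for your own storage.
class MapSemanticCache implements SemanticCache {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    @Override public Optional<String> get(String prompt) {
        return Optional.ofNullable(store.get(prompt));
    }
    @Override public void put(String prompt, String response) { store.put(prompt, response); }
    @Override public void invalidate(String prompt) { store.remove(prompt); }
    @Override public void clear() { store.clear(); }
}
```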
Audio & Speech
The `WhisperClient` provides speech-to-text capabilities powered by OpenAI's Whisper model. It supports transcription in multiple languages and translation of non-English audio to English.
Cost Tracking
Monitor and control LLM spending across providers with built-in cost tracking, budget management, and model pricing data for 100+ models.