Resilience
TnsAI.Core provides a declarative resilience framework built on top of Resilience4j. The `@Resilience` annotation configures retry, circuit breaker, rate limiting, bulkhead isolation, timeout, and fallback policies for actions and roles. The `ResilienceExecutor` applies these policies in a layered pipeline and tracks terminal failures in a dead-letter queue.
@Resilience Annotation
The @Resilience annotation is the single entry point for declaring all resilience policies on an action or role. Apply it to a method to configure that specific action, or to a class to set defaults for all actions in the role. Method-level annotations override type-level ones.
```java
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.METHOD, ElementType.TYPE})
public @interface Resilience {
    Retry retry() default @Retry();
    CircuitBreaker circuitBreaker() default @CircuitBreaker();
    RateLimit rateLimit() default @RateLimit();
    int timeout() default 0;          // milliseconds, 0 = no timeout
    String fallback() default "";     // fallback method name (same signature)
    boolean bulkhead() default false; // thread pool isolation
    int maxConcurrent() default 10;   // max concurrent calls for bulkhead
}
```
@Retry
When a transient error occurs (network timeout, server 500), retrying after a short delay often succeeds. The @Retry sub-annotation configures automatic retries with exponential backoff (each retry waits longer) and optional jitter (randomized delays to avoid thundering herd problems).
```java
@interface Retry {
    int maxAttempts() default 0;       // 0 = no retries
    int backoffMs() default 1000;      // initial delay
    double multiplier() default 2.0;   // backoff multiplier
    int maxBackoffMs() default 30000;  // max delay cap
    Class<? extends Throwable>[] retryOn() default {};   // empty = retry on all
    Class<? extends Throwable>[] noRetryOn() default {}; // exclusions
    boolean jitter() default true;     // add randomization
}
```
Example:
```java
@Resilience(retry = @Retry(
    maxAttempts = 3,
    backoffMs = 500,
    multiplier = 2.0,
    maxBackoffMs = 10000,
    retryOn = {NetworkException.class, RateLimitException.class},
    jitter = true
))
public String fetchData(String query) { ... }
```
With `multiplier = 2.0` and `jitter = true`, the delays are approximately 500 ms, 1000 ms, and 2000 ms, with random jitter applied to each.
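As a minimal sketch of the schedule described above (a hypothetical helper, not the framework's internal code — the framework's jitter range, assumed here to be ±50%, may differ):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the exponential-backoff schedule: the delay grows by
// `multiplier` on each attempt and is capped at `maxBackoffMs`.
public class BackoffSchedule {

    // Base delay before retry attempt n (1-based), without jitter.
    static long baseDelayMs(int attempt, long backoffMs, double multiplier, long maxBackoffMs) {
        double delay = backoffMs * Math.pow(multiplier, attempt - 1);
        return Math.min((long) delay, maxBackoffMs);
    }

    // With jitter enabled, randomize the delay (here: factor in [0.5, 1.5)).
    static long jitteredDelayMs(long baseDelayMs) {
        double factor = 0.5 + ThreadLocalRandom.current().nextDouble();
        return (long) (baseDelayMs * factor);
    }
}
```

With `backoffMs = 500` and `multiplier = 2.0`, `baseDelayMs` yields 500, 1000, and 2000 for attempts 1–3, matching the sequence above.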
@CircuitBreaker
If a downstream service is down, retrying every request wastes resources and can cascade failures across your system. A circuit breaker tracks failures and, after a threshold is reached, "opens" the circuit to reject all calls immediately. After a cooldown period, it allows a few test calls through ("half-open") to check if the service has recovered.
```java
@interface CircuitBreaker {
    boolean enabled() default false;
    int failureThreshold() default 5;     // failures before opening
    int failureWindowMs() default 60000;  // window for counting failures
    int resetTimeoutMs() default 30000;   // wait before half-open
    int successThreshold() default 3;     // successes needed in half-open
    int failureRateThreshold() default 0; // percentage (0-100), alternative to count
}
```
Example:
```java
@Resilience(circuitBreaker = @CircuitBreaker(
    enabled = true,
    failureThreshold = 5,
    failureWindowMs = 60000,
    resetTimeoutMs = 30000,
    successThreshold = 3
))
public String callExternalApi() { ... }
```
States: Closed (normal) -> Open (reject all calls) -> Half-Open (allow `successThreshold` test calls) -> Closed (if the test calls succeed).
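The state transitions can be sketched as a small count-based state machine (a simplified illustration only — the framework delegates this to Resilience4j, which also handles the failure window and reset timer):

```java
// Minimal sketch of the Closed -> Open -> Half-Open -> Closed cycle
// described above. Timing is modeled by an explicit onResetTimeout() call.
public class CircuitSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // failures before opening
    private final int successThreshold;  // successes needed in half-open
    private State state = State.CLOSED;
    private int failures = 0;
    private int halfOpenSuccesses = 0;

    CircuitSketch(int failureThreshold, int successThreshold) {
        this.failureThreshold = failureThreshold;
        this.successThreshold = successThreshold;
    }

    // Open circuits reject calls immediately.
    boolean allowsCall() { return state != State.OPEN; }

    void recordFailure() {
        if (state == State.HALF_OPEN) {         // a failed test call re-opens
            state = State.OPEN;
            halfOpenSuccesses = 0;
            return;
        }
        if (++failures >= failureThreshold) state = State.OPEN;
    }

    void recordSuccess() {
        if (state == State.HALF_OPEN && ++halfOpenSuccesses >= successThreshold) {
            state = State.CLOSED;               // service recovered
            failures = 0;
            halfOpenSuccesses = 0;
        }
    }

    // Called once resetTimeoutMs has elapsed while open.
    void onResetTimeout() { if (state == State.OPEN) state = State.HALF_OPEN; }

    State state() { return state; }
}
```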
@RateLimit
Rate limiting prevents your application from overwhelming a downstream service or exceeding API quotas. The @RateLimit sub-annotation lets you cap the number of requests in a time window, with four different strategies for how that limit is enforced.
```java
@interface RateLimit {
    boolean enabled() default false;
    int maxRequests() default 100;
    int windowMs() default 60000; // 1 minute
    Strategy strategy() default Strategy.SLIDING_WINDOW;
}
```
Rate Limit Strategies
The four strategies differ in how smoothly they distribute requests over time. Choose based on whether your use case tolerates bursts or needs strict even pacing.
| Strategy | Description |
|---|---|
| FIXED_WINDOW | Counts requests in fixed time windows. Simple, but can allow bursts at window boundaries. |
| SLIDING_WINDOW | Counts requests in a sliding time window. Smoother rate control than fixed window. |
| TOKEN_BUCKET | Tokens are added at a fixed rate; each request consumes one token. Allows controlled bursts. |
| LEAKY_BUCKET | Requests are processed at a fixed rate; excess requests queue or are rejected. Smoothest output. |
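To make the TOKEN_BUCKET row concrete, here is a minimal token-bucket sketch (a hypothetical illustration, not the framework's implementation); time is passed in explicitly so the refill arithmetic is visible:

```java
// Token bucket: tokens refill continuously at maxRequests/windowMs, each
// admitted request consumes one token, and bursts up to the bucket
// capacity are allowed when tokens have accumulated.
public class TokenBucketSketch {
    private final long capacity;
    private final double refillPerMs;
    private double tokens;
    private long lastRefillMs;

    TokenBucketSketch(long maxRequests, long windowMs, long nowMs) {
        this.capacity = maxRequests;
        this.refillPerMs = (double) maxRequests / windowMs;
        this.tokens = maxRequests;   // start full
        this.lastRefillMs = nowMs;
    }

    // Returns true if a request is admitted at time nowMs.
    boolean tryAcquire(long nowMs) {
        tokens = Math.min(capacity, tokens + (nowMs - lastRefillMs) * refillPerMs);
        lastRefillMs = nowMs;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With a limit of 2 requests per second, two immediate calls succeed, a third is rejected, and a call 600 ms later succeeds again because 1.2 tokens have refilled.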
Example:
```java
@Resilience(rateLimit = @RateLimit(
    enabled = true,
    maxRequests = 60,
    windowMs = 60000,
    strategy = RateLimit.Strategy.TOKEN_BUCKET
))
public String queryLlm(String prompt) { ... }
```
Combining Policies
In production, you typically want multiple resilience layers working together. A single @Resilience annotation can configure retry, circuit breaker, rate limiting, timeout, bulkhead isolation, and a fallback method all at once.
```java
@Resilience(
    retry = @Retry(maxAttempts = 3, backoffMs = 1000),
    circuitBreaker = @CircuitBreaker(enabled = true, failureThreshold = 5, resetTimeoutMs = 30000),
    rateLimit = @RateLimit(enabled = true, maxRequests = 100, windowMs = 60000),
    timeout = 5000,
    bulkhead = true,
    maxConcurrent = 10,
    fallback = "fetchDataFallback"
)
public Result fetchData(String query) { ... }

// The fallback must have the same signature as the action
public Result fetchDataFallback(String query) {
    return Result.cached(query);
}
```
Type-level defaults apply to all actions in a role unless overridden:
```java
@Resilience(
    retry = @Retry(maxAttempts = 2),
    timeout = 10000
)
public class ApiRole extends Role {

    @Resilience(retry = @Retry(maxAttempts = 5)) // overrides the type-level retry
    public String criticalCall() { ... }

    // Uses the type-level defaults: 2 retries, 10 s timeout
    public String normalCall() { ... }
}
```
ResilienceExecutor
The ResilienceExecutor is the engine that applies all your resilience policies at runtime. It builds a layered pipeline where each policy wraps the next, using Resilience4j under the hood. You can also use it programmatically (without annotations) for ad-hoc resilient operations.
Pipeline Order
Policies are applied in a specific order, from outermost to innermost. Rate limiting is checked first (to avoid unnecessary work), then bulkhead isolation, then the circuit breaker, and finally retries wrap the actual operation.
```
Rate Limit -> Bulkhead -> Circuit Breaker -> Retry -> Operation
```
Each layer wraps the next. When an operation fails and exhausts all resilience layers, the failure is recorded in the dead-letter queue.
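The layering idea can be illustrated with plain `Supplier` wrappers (a sketch only — the executor itself builds this pipeline from Resilience4j decorators):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Each policy layer wraps the next; entering a layer is logged so the
// outermost-to-innermost order is visible when the pipeline runs.
public class PipelineSketch {
    static <T> Supplier<T> layer(String name, Supplier<T> inner, List<String> log) {
        return () -> {
            log.add(name);        // this layer runs before everything it wraps
            return inner.get();
        };
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        Supplier<String> operation = () -> { log.add("operation"); return "ok"; };
        // Built innermost-first: retry wraps the operation, then the circuit
        // breaker wraps retry, and so on out to the rate limiter.
        Supplier<String> pipeline =
            layer("rateLimit",
                layer("bulkhead",
                    layer("circuitBreaker",
                        layer("retry", operation, log), log), log), log);
        pipeline.get();
        return log; // rateLimit, bulkhead, circuitBreaker, retry, operation
    }
}
```

Because the rate limiter is outermost, a throttled request is rejected before it can consume a bulkhead slot, trip the breaker, or trigger retries.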
Construction
You can create a ResilienceExecutor with defaults, inject a custom dead-letter queue, or take full control over all Resilience4j registries.
```java
// Default registries + in-memory DLQ
ResilienceExecutor executor = new ResilienceExecutor();

// Custom dead-letter queue
ResilienceExecutor executorWithDlq = new ResilienceExecutor(myDlq);

// Full control over all registries
ResilienceExecutor fullyConfigured = new ResilienceExecutor(
    circuitBreakerRegistry,
    bulkheadRegistry,
    rateLimiterRegistry,
    deadLetterQueue
);
```
Programmatic Usage with ResilienceRequest
When you need resilience for a one-off operation that is not tied to an annotated action method, use ResilienceRequest to define the operation and its policies inline.
```java
ResilienceExecutor executor = new ResilienceExecutor();

String result = executor.execute(
    ResilienceRequest.<String>builder()
        .operationId("fetch-weather")
        .operation(() -> httpClient.get("https://api.weather.com/current"))
        .retryPolicy(RetryPolicy.defaultPolicy())
        .rateLimit(60, 60000)  // 60 requests per minute
        .bulkhead(5)           // max 5 concurrent calls
        .timeout(5000)         // 5 second timeout
        .fallbackStrategy(ex -> "Weather data unavailable")
        .build()
);
```
FallbackStrategy
When all retries are exhausted and the operation still fails, a fallback strategy provides a default value instead of throwing an exception. This is useful for returning cached data or a graceful degradation response.
```java
@FunctionalInterface
public interface FallbackStrategy<T> {
    T fallback(Exception exception);

    // Default: accepts all exceptions
    default boolean supports(Exception exception) { return true; }
}
```
Custom fallback with selective exception handling:
```java
FallbackStrategy<String> fallback = new FallbackStrategy<>() {
    @Override
    public String fallback(Exception exception) {
        return "Cached result";
    }

    @Override
    public boolean supports(Exception exception) {
        return exception instanceof NetworkException;
    }
};
```
Exception Handling
When a resilience policy rejects a request (rather than the underlying operation failing), the executor catches these Resilience4j-specific exceptions so you know which layer blocked the call.
| Exception | Meaning |
|---|---|
| CallNotPermittedException | Circuit breaker is open |
| BulkheadFullException | Max concurrent calls exceeded |
| RequestNotPermitted | Rate limit exceeded |
| TimeoutException | Operation timed out |
All failures are recorded in the dead-letter queue before being re-thrown.
DeadLetterQueue
When an operation fails even after retries, circuit breaker bypass, and fallback attempts, the failure is not silently discarded. Instead, it is recorded in a DeadLetterQueue (DLQ) so you can monitor terminal failures, alert on them, or replay the operations later.
Interface
The DLQ interface is intentionally simple: enqueue failed entries, and query them by operation ID or in bulk.
```java
public interface DeadLetterQueue {
    void enqueue(DeadLetterEntry entry);
    List<DeadLetterEntry> getEntries();
    List<DeadLetterEntry> getEntries(String operationId);
    int size();
}
```
DeadLetterEntry
Each DLQ entry captures everything you need to understand and potentially replay a failed operation: which operation failed, what exception occurred, when it happened, and any additional context as metadata.
```java
DeadLetterEntry entry = DeadLetterEntry.builder()
    .operationId("fetch-weather")
    .exceptionType("NetworkException")
    .exceptionMessage("Connection timed out")
    .timestamp(Instant.now())
    .metadata(Map.of("host", "api.weather.com", "port", 443))
    .build();
```
Fields:
- `id` -- auto-generated UUID
- `operationId` -- identifies the operation that failed
- `exceptionType` -- exception class name
- `exceptionMessage` -- exception message
- `timestamp` -- when the failure occurred
- `metadata` -- additional context (immutable map)
Accessing the DLQ
You can query the DLQ through the executor to get all failures, filter by operation ID, or check the total failure count.
```java
ResilienceExecutor executor = new ResilienceExecutor();

// After operations...
DeadLetterQueue dlq = executor.getDeadLetterQueue();

// All failures
List<DeadLetterEntry> allFailures = dlq.getEntries();

// Failures for a specific operation
List<DeadLetterEntry> weatherFailures = dlq.getEntries("fetch-weather");

// Total failure count
int failureCount = dlq.size();
```
The default implementation (InMemoryDeadLetterQueue) stores entries in memory. Implement the DeadLetterQueue interface for persistent storage (a database, Redis, etc.) and pass it to the ResilienceExecutor constructor.
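For illustration, a minimal in-memory implementation might look like the sketch below. `DeadLetterEntry` is simplified to a record here so the example is self-contained; the shipped `InMemoryDeadLetterQueue` may differ in detail:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Simplified stand-in for the framework's DeadLetterEntry.
record DeadLetterEntry(String operationId, String exceptionType, String exceptionMessage) {}

interface DeadLetterQueue {
    void enqueue(DeadLetterEntry entry);
    List<DeadLetterEntry> getEntries();
    List<DeadLetterEntry> getEntries(String operationId);
    int size();
}

class InMemoryDlqSketch implements DeadLetterQueue {
    // Copy-on-write keeps reads cheap and thread-safe for a queue that is
    // written on failures and read by monitoring code.
    private final List<DeadLetterEntry> entries = new CopyOnWriteArrayList<>();

    public void enqueue(DeadLetterEntry entry) { entries.add(entry); }

    public List<DeadLetterEntry> getEntries() { return List.copyOf(entries); }

    public List<DeadLetterEntry> getEntries(String operationId) {
        return entries.stream()
                .filter(e -> e.operationId().equals(operationId))
                .toList();
    }

    public int size() { return entries.size(); }
}
```

A persistent variant would implement the same four methods against a database or Redis and be passed to the `ResilienceExecutor` constructor.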
Code Examples
These examples show resilience in practice, from a fully annotated action to DLQ monitoring.
Action with Full Resilience
This example shows a real-world action with all resilience layers active: retries for network and rate-limit errors, a circuit breaker to stop calling a failing service, rate limiting to stay within API quotas, a timeout, and a fallback that returns cached data.
```java
@ActionSpec(type = ActionType.WEB_SERVICE, name = "fetchStockPrice")
@Resilience(
    retry = @Retry(
        maxAttempts = 3,
        backoffMs = 1000,
        multiplier = 2.0,
        retryOn = {NetworkException.class, RateLimitException.class}
    ),
    circuitBreaker = @CircuitBreaker(
        enabled = true,
        failureThreshold = 5,
        resetTimeoutMs = 30000
    ),
    rateLimit = @RateLimit(
        enabled = true,
        maxRequests = 120,
        windowMs = 60000,
        strategy = RateLimit.Strategy.SLIDING_WINDOW
    ),
    timeout = 10000,
    fallback = "fetchStockPriceFallback"
)
public double fetchStockPrice(String symbol) {
    return stockApi.getPrice(symbol);
}

public double fetchStockPriceFallback(String symbol) {
    return cacheStore.getLastKnownPrice(symbol);
}
```
Monitoring Failures via DLQ
In production, you want to know when operations are permanently failing. This example sets up a periodic check that logs all DLQ entries, which you can hook into your alerting system.
```java
ResilienceExecutor executor = new ResilienceExecutor();

// Periodic monitoring
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    DeadLetterQueue dlq = executor.getDeadLetterQueue();
    int count = dlq.size();
    if (count > 0) {
        logger.warn("Dead letter queue has {} entries", count);
        for (DeadLetterEntry entry : dlq.getEntries()) {
            logger.warn("  Failed: {} - {} at {}",
                entry.getOperationId(),
                entry.getExceptionType(),
                entry.getTimestamp());
        }
    }
}, 0, 1, TimeUnit.MINUTES);
```
Prompt Strategies
TnsAI includes a prompt enhancement system that applies proven prompting techniques to improve LLM response quality. The system is built around the `PromptStrategy` enum, `PromptEnhancer` builder, and `EnhancedPrompt` output.
Roles
A `Role` defines what an agent can do. Each role has an identity (name, goal, domain), a set of responsibilities, and discoverable actions. Roles generate the system prompt that instructs the LLM. Actions are methods annotated with `@ActionSpec` — they are discovered at runtime via reflection and routed to one of four executor types.