
Resilience

TnsAI.Core provides a declarative resilience framework built on top of Resilience4j. The `@Resilience` annotation configures retry, circuit breaker, rate limiting, bulkhead isolation, timeout, and fallback policies for actions and roles. The `ResilienceExecutor` applies these policies in a layered pipeline and tracks terminal failures in a dead-letter queue.

@Resilience Annotation

The @Resilience annotation is the single entry point for declaring all resilience policies on an action or role. Apply it to a method to configure that specific action, or to a class to set defaults for all actions in the role. Method-level annotations override type-level ones.

@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.METHOD, ElementType.TYPE})
public @interface Resilience {
    Retry retry() default @Retry();
    CircuitBreaker circuitBreaker() default @CircuitBreaker();
    RateLimit rateLimit() default @RateLimit();
    int timeout() default 0;             // milliseconds, 0 = no timeout
    String fallback() default "";         // fallback method name (same signature)
    boolean bulkhead() default false;     // thread pool isolation
    int maxConcurrent() default 10;       // max concurrent calls for bulkhead
}

@Retry

When a transient error occurs (network timeout, server 500), retrying after a short delay often succeeds. The @Retry sub-annotation configures automatic retries with exponential backoff (each retry waits longer) and optional jitter (randomized delays to avoid thundering herd problems).

@interface Retry {
    int maxAttempts() default 0;                            // 0 = no retries
    int backoffMs() default 1000;                           // initial delay
    double multiplier() default 2.0;                        // backoff multiplier
    int maxBackoffMs() default 30000;                       // max delay cap
    Class<? extends Throwable>[] retryOn() default {};      // empty = all
    Class<? extends Throwable>[] noRetryOn() default {};    // exclusions
    boolean jitter() default true;                          // add randomization
}

Example:

@Resilience(retry = @Retry(
    maxAttempts = 3,
    backoffMs = 500,
    multiplier = 2.0,
    maxBackoffMs = 10000,
    retryOn = {NetworkException.class, RateLimitException.class},
    jitter = true
))
public String fetchData(String query) { ... }

With multiplier = 2.0 and jitter = true, the delays before successive retries are approximately 500ms, 1000ms, and 2000ms, each with random jitter applied.
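
As a sketch, the backoff schedule above can be computed like this (BackoffSketch and delays are illustrative names, not the framework's internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of deriving retry delays from the @Retry parameters:
// exponential growth by `multiplier`, capped at `maxBackoffMs`, with
// optional +/-25% jitter.
class BackoffSketch {
    static List<Long> delays(int maxAttempts, long backoffMs, double multiplier,
                             long maxBackoffMs, boolean jitter) {
        List<Long> out = new ArrayList<>();
        double delay = backoffMs;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            long capped = (long) Math.min(delay, maxBackoffMs);
            if (jitter) {
                // Randomize within +/-25% to avoid synchronized retry storms.
                long spread = capped / 4;
                capped += ThreadLocalRandom.current().nextLong(-spread, spread + 1);
            }
            out.add(capped);
            delay *= multiplier;
        }
        return out;
    }
}
```

With jitter disabled, delays(3, 500, 2.0, 10000, false) yields 500, 1000, 2000, matching the schedule above.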

@CircuitBreaker

If a downstream service is down, retrying every request wastes resources and can cascade failures across your system. A circuit breaker tracks failures and, after a threshold is reached, "opens" the circuit to reject all calls immediately. After a cooldown period, it allows a few test calls through ("half-open") to check if the service has recovered.

@interface CircuitBreaker {
    boolean enabled() default false;
    int failureThreshold() default 5;        // failures before opening
    int failureWindowMs() default 60000;     // window for counting failures
    int resetTimeoutMs() default 30000;      // wait before half-open
    int successThreshold() default 3;        // successes needed in half-open
    int failureRateThreshold() default 0;    // percentage (0-100), alternative to count
}

Example:

@Resilience(circuitBreaker = @CircuitBreaker(
    enabled = true,
    failureThreshold = 5,
    failureWindowMs = 60000,
    resetTimeoutMs = 30000,
    successThreshold = 3
))
public String callExternalApi() { ... }

States: Closed (normal) -> Open (reject all calls) -> Half-Open (allow successThreshold test calls) -> Closed (if tests pass).
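
These transitions can be sketched as a small count-based state machine (a simplified stand-in; the real Resilience4j breaker also tracks time windows and failure rates):

```java
// Minimal sketch of the Closed -> Open -> Half-Open -> Closed cycle.
// Names and fields here are illustrative, not the framework's internals.
class BreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    State state = State.CLOSED;
    int failures = 0, halfOpenSuccesses = 0;
    final int failureThreshold, successThreshold;

    BreakerSketch(int failureThreshold, int successThreshold) {
        this.failureThreshold = failureThreshold;
        this.successThreshold = successThreshold;
    }

    void onFailure() {
        // A failure during half-open immediately reopens the circuit.
        if (state == State.HALF_OPEN) { state = State.OPEN; halfOpenSuccesses = 0; return; }
        if (++failures >= failureThreshold) state = State.OPEN;
    }

    void onResetTimeout() {  // resetTimeoutMs has elapsed
        if (state == State.OPEN) { state = State.HALF_OPEN; halfOpenSuccesses = 0; }
    }

    void onSuccess() {
        if (state == State.HALF_OPEN && ++halfOpenSuccesses >= successThreshold) {
            state = State.CLOSED;
            failures = 0;
        }
    }
}
```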

@RateLimit

Rate limiting prevents your application from overwhelming a downstream service or exceeding API quotas. The @RateLimit sub-annotation lets you cap the number of requests in a time window, with four different strategies for how that limit is enforced.

@interface RateLimit {
    boolean enabled() default false;
    int maxRequests() default 100;
    int windowMs() default 60000;                        // 1 minute
    Strategy strategy() default Strategy.SLIDING_WINDOW;

    enum Strategy { FIXED_WINDOW, SLIDING_WINDOW, TOKEN_BUCKET, LEAKY_BUCKET }
}

Rate Limit Strategies

The four strategies differ in how smoothly they distribute requests over time. Choose based on whether your use case tolerates bursts or needs strict even pacing.

  • FIXED_WINDOW -- Counts requests in fixed time windows. Simple but can allow bursts at window boundaries.
  • SLIDING_WINDOW -- Counts requests in a sliding time window. Smoother rate control than fixed window.
  • TOKEN_BUCKET -- Tokens are added at a fixed rate. Each request consumes one token. Allows controlled bursts.
  • LEAKY_BUCKET -- Requests are processed at a fixed rate. Excess requests queue or are rejected. Smoothest output.
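
For illustration, the TOKEN_BUCKET strategy can be sketched with a manually advanced clock (class and method names here are hypothetical, not the framework's limiter):

```java
// Sketch of a token bucket: tokens refill continuously at
// maxRequests / windowMs, each call consumes one token, and unused
// capacity allows short bursts up to `capacity`.
class TokenBucketSketch {
    final int capacity;
    final double refillPerMs;
    double tokens;
    long lastRefillMs;

    TokenBucketSketch(int maxRequests, long windowMs, long nowMs) {
        this.capacity = maxRequests;
        this.refillPerMs = (double) maxRequests / windowMs;
        this.tokens = maxRequests;   // start full
        this.lastRefillMs = nowMs;
    }

    boolean tryAcquire(long nowMs) {
        // Refill based on elapsed time, capped at bucket capacity.
        tokens = Math.min(capacity, tokens + (nowMs - lastRefillMs) * refillPerMs);
        lastRefillMs = nowMs;
        if (tokens >= 1) { tokens -= 1; return true; }
        return false;
    }
}
```

Passing the clock in explicitly keeps the sketch deterministic; a real limiter would read System.nanoTime() internally.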

Example:

@Resilience(rateLimit = @RateLimit(
    enabled = true,
    maxRequests = 60,
    windowMs = 60000,
    strategy = RateLimit.Strategy.TOKEN_BUCKET
))
public String queryLlm(String prompt) { ... }

Combining Policies

In production, you typically want multiple resilience layers working together. A single @Resilience annotation can configure retry, circuit breaker, rate limiting, timeout, bulkhead isolation, and a fallback method all at once.

@Resilience(
    retry = @Retry(maxAttempts = 3, backoffMs = 1000),
    circuitBreaker = @CircuitBreaker(enabled = true, failureThreshold = 5, resetTimeoutMs = 30000),
    rateLimit = @RateLimit(enabled = true, maxRequests = 100, windowMs = 60000),
    timeout = 5000,
    bulkhead = true,
    maxConcurrent = 10,
    fallback = "fetchDataFallback"
)
public Result fetchData(String query) { ... }

// Fallback must have same signature
public Result fetchDataFallback(String query) {
    return Result.cached(query);
}

Type-level defaults apply to all actions in a role unless overridden:

@Resilience(
    retry = @Retry(maxAttempts = 2),
    timeout = 10000
)
public class ApiRole extends Role {

    @Resilience(retry = @Retry(maxAttempts = 5))  // overrides type-level retry
    public String criticalCall() { ... }

    // Uses type-level defaults: 2 retries, 10s timeout
    public String normalCall() { ... }
}
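
The method-over-type precedence can be sketched with plain reflection, using a stand-in annotation (the framework's actual lookup may differ):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;

// Sketch of "method-level overrides type-level": check the method first,
// then fall back to its declaring class. @Tag stands in for @Resilience.
class ResolveSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @interface Tag { String value(); }

    @Tag("type-level")
    static class Api {
        @Tag("method-level")
        String criticalCall() { return ""; }

        String normalCall() { return ""; }
    }

    static String resolve(Method m) {
        Tag t = m.getAnnotation(Tag.class);
        if (t == null) t = m.getDeclaringClass().getAnnotation(Tag.class);
        return t.value();
    }
}
```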

ResilienceExecutor

The ResilienceExecutor is the engine that applies all your resilience policies at runtime. It builds a layered pipeline where each policy wraps the next, using Resilience4j under the hood. You can also use it programmatically (without annotations) for ad-hoc resilient operations.

Pipeline Order

Policies are applied in a specific order, from outermost to innermost. Rate limiting is checked first (to avoid unnecessary work), then bulkhead isolation, then the circuit breaker, and finally retries wrap the actual operation.

Rate Limit -> Bulkhead -> Circuit Breaker -> Retry -> Operation

Each layer wraps the next. When an operation fails and exhausts all resilience layers, the failure is recorded in the dead-letter queue.
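
The layering idea can be sketched as nested Supplier decorators, where the outermost wrapper runs first (the layer helper is illustrative, not a real Resilience4j API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Sketch of the pipeline order: each policy wraps the next as a Supplier
// decorator, so invoking the outermost supplier walks the layers in order
// before reaching the operation.
class PipelineSketch {
    static <T> Supplier<T> layer(String name, Supplier<T> inner, List<String> trace) {
        return () -> { trace.add(name); return inner.get(); };
    }

    static List<String> run() {
        List<String> trace = new ArrayList<>();
        Supplier<String> op = () -> { trace.add("operation"); return "ok"; };
        Supplier<String> pipeline =
            layer("rateLimit",
                layer("bulkhead",
                    layer("circuitBreaker",
                        layer("retry", op, trace), trace), trace), trace);
        pipeline.get();
        return trace;
    }
}
```

Tracing one call shows the order: rateLimit, bulkhead, circuitBreaker, retry, then the operation itself.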

Construction

You can create a ResilienceExecutor with defaults, inject a custom dead-letter queue, or take full control over all Resilience4j registries.

// Default registries + in-memory DLQ
ResilienceExecutor executor = new ResilienceExecutor();

// Custom dead-letter queue
ResilienceExecutor executor = new ResilienceExecutor(myDlq);

// Full control over all registries
ResilienceExecutor executor = new ResilienceExecutor(
    circuitBreakerRegistry,
    bulkheadRegistry,
    rateLimiterRegistry,
    deadLetterQueue
);

Programmatic Usage with ResilienceRequest

When you need resilience for a one-off operation that is not tied to an annotated action method, use ResilienceRequest to define the operation and its policies inline.

ResilienceExecutor executor = new ResilienceExecutor();

String result = executor.execute(
    ResilienceRequest.<String>builder()
        .operationId("fetch-weather")
        .operation(() -> httpClient.get("https://api.weather.com/current"))
        .retryPolicy(RetryPolicy.defaultPolicy())
        .rateLimit(60, 60000)           // 60 requests per minute
        .bulkhead(5)                    // max 5 concurrent calls
        .timeout(5000)                  // 5 second timeout
        .fallbackStrategy(ex -> "Weather data unavailable")
        .build()
);

FallbackStrategy

When all retries are exhausted and the operation still fails, a fallback strategy provides a default value instead of throwing an exception. This is useful for returning cached data or a graceful degradation response.

@FunctionalInterface
public interface FallbackStrategy<T> {
    T fallback(Exception exception);

    // Default: accepts all exceptions
    default boolean supports(Exception exception) { return true; }
}

Custom fallback with selective exception handling:

FallbackStrategy<String> fallback = new FallbackStrategy<>() {
    @Override
    public String fallback(Exception exception) {
        return "Cached result";
    }

    @Override
    public boolean supports(Exception exception) {
        return exception instanceof NetworkException;
    }
};

Exception Handling

When a resilience policy rejects a request (rather than the underlying operation failing), the executor catches these Resilience4j-specific exceptions so you know which layer blocked the call.

  • CallNotPermittedException -- circuit breaker is open
  • BulkheadFullException -- max concurrent calls exceeded
  • RequestNotPermitted -- rate limit exceeded
  • TimeoutException -- operation timed out

All failures are recorded in the dead-letter queue before being re-thrown.
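
The record-then-rethrow behavior can be sketched as follows (a plain string list stands in for the real DeadLetterQueue):

```java
import java.util.List;
import java.util.function.Supplier;

// Sketch of "record in the DLQ before re-throwing": the failure is logged
// to the queue, then propagated unchanged to the caller.
class RecordingSketch {
    static <T> T execute(String operationId, Supplier<T> op, List<String> dlq) {
        try {
            return op.get();
        } catch (RuntimeException ex) {
            // Record the terminal failure before propagating it.
            dlq.add(operationId + ": " + ex.getMessage());
            throw ex;
        }
    }
}
```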

DeadLetterQueue

When an operation fails even after retries, circuit breaker handling, and fallback attempts, the failure is not silently discarded. Instead, it is recorded in a DeadLetterQueue (DLQ) so you can monitor terminal failures, alert on them, or replay the operations later.

Interface

The DLQ interface is intentionally simple: enqueue failed entries, and query them by operation ID or in bulk.

public interface DeadLetterQueue {
    void enqueue(DeadLetterEntry entry);
    List<DeadLetterEntry> getEntries();
    List<DeadLetterEntry> getEntries(String operationId);
    int size();
}
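
A minimal in-memory implementation might look like this (Entry is a simplified stand-in for DeadLetterEntry, so the sketch stays self-contained):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an in-memory DLQ matching the interface above: append-only
// list with bulk and per-operation queries, synchronized for thread safety.
class DlqSketch {
    record Entry(String operationId, String exceptionType) {}

    private final List<Entry> entries = new ArrayList<>();

    synchronized void enqueue(Entry entry) { entries.add(entry); }

    synchronized List<Entry> getEntries() { return List.copyOf(entries); }

    synchronized List<Entry> getEntries(String operationId) {
        return entries.stream()
            .filter(e -> e.operationId().equals(operationId))
            .toList();
    }

    synchronized int size() { return entries.size(); }
}
```

Returning defensive copies from the getters keeps callers from mutating the queue's internal state.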

DeadLetterEntry

Each DLQ entry captures everything you need to understand and potentially replay a failed operation: which operation failed, what exception occurred, when it happened, and any additional context as metadata.

DeadLetterEntry entry = DeadLetterEntry.builder()
    .operationId("fetch-weather")
    .exceptionType("NetworkException")
    .exceptionMessage("Connection timed out")
    .timestamp(Instant.now())
    .metadata(Map.of("host", "api.weather.com", "port", 443))
    .build();

Fields:

  • id -- auto-generated UUID
  • operationId -- identifies the operation that failed
  • exceptionType -- exception class name
  • exceptionMessage -- exception message
  • timestamp -- when the failure occurred
  • metadata -- additional context (immutable map)

Accessing the DLQ

You can query the DLQ through the executor to get all failures, filter by operation ID, or check the total failure count.

ResilienceExecutor executor = new ResilienceExecutor();

// After operations...
DeadLetterQueue dlq = executor.getDeadLetterQueue();

// All failures
List<DeadLetterEntry> allFailures = dlq.getEntries();

// Failures for a specific operation
List<DeadLetterEntry> weatherFailures = dlq.getEntries("fetch-weather");

// Total failure count
int failureCount = dlq.size();

The default implementation (InMemoryDeadLetterQueue) stores entries in memory. Implement the DeadLetterQueue interface for persistent storage (database, Redis, etc.) and pass it to the ResilienceExecutor constructor.

Code Examples

These examples show resilience in practice, from a fully annotated action to DLQ monitoring.

Action with Full Resilience

This example shows a real-world action with all resilience layers active: retries for network and rate-limit errors, a circuit breaker to stop calling a failing service, rate limiting to stay within API quotas, a timeout, and a fallback that returns cached data.

@ActionSpec(type = ActionType.WEB_SERVICE, name = "fetchStockPrice")
@Resilience(
    retry = @Retry(
        maxAttempts = 3,
        backoffMs = 1000,
        multiplier = 2.0,
        retryOn = {NetworkException.class, RateLimitException.class}
    ),
    circuitBreaker = @CircuitBreaker(
        enabled = true,
        failureThreshold = 5,
        resetTimeoutMs = 30000
    ),
    rateLimit = @RateLimit(
        enabled = true,
        maxRequests = 120,
        windowMs = 60000,
        strategy = RateLimit.Strategy.SLIDING_WINDOW
    ),
    timeout = 10000,
    fallback = "fetchStockPriceFallback"
)
public double fetchStockPrice(String symbol) {
    return stockApi.getPrice(symbol);
}

public double fetchStockPriceFallback(String symbol) {
    return cacheStore.getLastKnownPrice(symbol);
}

Monitoring Failures via DLQ

In production, you want to know when operations are permanently failing. This example sets up a periodic check that logs all DLQ entries, which you can hook into your alerting system.

ResilienceExecutor executor = new ResilienceExecutor();

// Periodic monitoring
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    DeadLetterQueue dlq = executor.getDeadLetterQueue();
    int count = dlq.size();
    if (count > 0) {
        logger.warn("Dead letter queue has {} entries", count);
        for (DeadLetterEntry entry : dlq.getEntries()) {
            logger.warn("  Failed: {} - {} at {}",
                entry.getOperationId(),
                entry.getExceptionType(),
                entry.getTimestamp());
        }
    }
}, 0, 1, TimeUnit.MINUTES);
