Long-Running Runs
Multi-hour and multi-day agent executions need a categorically different runtime than minute-scale interactive sessions. Process crashes, runtime upgrades, transient API failures, and operator-initiated pauses all happen — the framework's reliability layer makes them survivable rather than catastrophic.
The com.tnsai.reliability package in tnsai-core ships the primitives:
Checkpoint+CheckpointStore— durable state at step boundariesResumableRun+DefaultResumableRun— execution driver that consumes the storesProgress+ProgressSink— observable mid-run state for live dashboardsRunConfig— duration / retries / cost / progress-timeout policy bundleOutcome+AbortReason— terminal-result sealed type
Pairs with the idempotency layer (tnsai-core.idempotency) so side-effecting steps don't double-execute on retry, and with cost governance so a runaway loop hits a configured ceiling instead of burning unbounded spend.
When to use
You probably want this if your agent run:
- takes longer than the longest interactive session you'd let the user wait on (rule of thumb: \> 5 minutes)
- calls external systems with non-trivial side effects (creates issues, posts messages, runs payments, kicks off CI builds)
- has a per-run budget worth enforcing during execution rather than after the fact
- needs to be inspectable from a separate process / dashboard while it's running
You probably don't need it for:
- single-LLM-call request handlers
- read-only research agents that always finish in seconds
- one-shot CLI utilities without resumption needs
Quick start
import com.tnsai.reliability.*;
import com.tnsai.idempotency.InMemoryIdempotencyStore;
import java.time.Duration;
import java.math.BigDecimal;
// 1. Pick stores. InMemory for tests; Filesystem / SQLite for persistence.
CheckpointStore checkpoints = new FilesystemCheckpointStore(Path.of("/var/tns/checkpoints"));
var idempotency = new InMemoryIdempotencyStore();
ProgressSink progress = new InMemoryProgressSink();
// 2. Define a state codec — encode/decode whatever your run threads
// through its steps. JSON is the recommended default; here we use UTF-8
// bytes for a string state.
DefaultResumableRun.StateCodec<String> codec =
new DefaultResumableRun.StateCodec<>() {
public byte[] encode(String s) { return s.getBytes(StandardCharsets.UTF_8); }
public String decode(byte[] b) { return new String(b, StandardCharsets.UTF_8); }
};
// 3. Compose the run. Each step is a UnaryOperator<S> with a name and a
// side-effecting flag. Side-effecting steps consult the idempotency
// store so a retry doesn't double-execute.
DefaultResumableRun<String> run = DefaultResumableRun.<String>builder()
.codec(codec)
.checkpointStore(checkpoints)
.idempotencyStore(idempotency)
.progressSink(progress)
.step(DefaultResumableRun.Step.pure("plan", state -> state + "\n[plan]"))
.step(DefaultResumableRun.Step.sideEffect("post-pr",
state -> { githubClient.createPr(...); return state + "\n[posted]"; }))
.step(DefaultResumableRun.Step.pure("verify", state -> state + "\n[verified]"))
.build();
// 4. Execute with a config bound by duration / retries / cost / progress timeout.
RunConfig config = RunConfig.builder()
.maxDuration(Duration.ofHours(8))
.costCeilingUSD(new BigDecimal("5.00"))
.checkpointInterval(Duration.ofMinutes(5))
.maxRetries(3)
.progressTimeout(Duration.ofMinutes(10))
.build();
Outcome<String> result = run.execute("start", config);The orchestrator returns one of four Outcome variants when the run terminates.
Outcome shape
Outcome<T> is a sealed interface — pattern-match exhaustively:
| Variant | When |
|---|---|
Outcome.Completed<T> | Run reached the end of its step list |
Outcome.Failed<T> | A step exhausted its retry budget; carries the throwable |
Outcome.Aborted<T> | Policy stopped the run (AbortReason); user-requested, budget exhausted, max-duration exceeded, progress timeout, retry-limit exceeded |
Outcome.Suspended<T> | Run yielded cleanly at a step boundary; resumable via the carried checkpointId |
Outcome<String> result = run.execute(input, config);
String final = switch (result) {
case Outcome.Completed<String> c -> c.result();
case Outcome.Failed<String> f -> { logger.error("Failed: {}", f.reason()); yield ""; }
case Outcome.Aborted<String> a -> { alert("Aborted: " + a.reason()); yield ""; }
case Outcome.Suspended<String> s -> { /* resume later via run.resume(s.runId()) */ yield ""; }
};Resume after a crash
After a process crash, runtime upgrade, or OOMError, the run is recoverable as long as the configured CheckpointStore survives:
// New process, same store backend.
DefaultResumableRun<String> run = /* ... same builder, same store paths ... */;
Outcome<String> result = run.resume(savedRunId);
// Resume loads the latest checkpoint, decodes the state via the codec,
// and continues from the next step.If run.resume(runId) finds no checkpoint for the id, it returns Outcome.Failed with a clear "no such run" message rather than silently starting a fresh run — confusing one for the other was the bug pattern this primitive prevents.
Side-effect safety: the kill-and-resume invariant
The unit test that defines the contract:
A side-effecting step that runs once and crashes mid-execution must not run a second time on resume — its cached result is replayed instead.
Implementation: DefaultResumableRun keys side-effecting steps by runId:stepIndex:stepName against the configured IdempotencyStore. On the original run the orchestrator records the step's return value via IdempotencyEntry.forSuccess(...); on resume the orchestrator checks the store before invoking the body. Cached hit → replay; miss → execute + record.
Concretely: a step that posts a GitHub PR, captured the PR id in its return state, then crashed before saving its checkpoint, will on resume:
- find the cached PR id in the idempotency store under
<runId>:<stepIndex>:post-pr - skip the body (no second
POST /pullsto GitHub) - thread the cached state forward into the next step
This is what makes hour-scale runs against rate-limited / cost-bearing external APIs feasible.
CheckpointStore selection
| Store | When |
|---|---|
InMemoryCheckpointStore | Tests, single-process workloads, "checkpointing-on" baseline. Doesn't survive restarts. |
FilesystemCheckpointStore | Single-process production. JSON files under <root>/<runId>/<stepIndex>.json. Atomic write-tmp-then-rename with fsync. |
SqliteCheckpointStore | Personal-fleet workloads (one binary on a laptop / NAS). Single SQLite file, durable across restarts. Optional xerial:sqlite-jdbc dep — pull only when used. |
RedisCheckpointStore | Multi-process fleet sharing a Redis instance. Per-checkpoint blob + per-run sorted set for O(log N) latest / range queries. Reuses the existing Jedis dep (no new artefact). |
S3CheckpointStore | Distributed deployments where checkpoints must outlive a single host — runs started on box A may resume on box B / region 2. JSON object per checkpoint at s3://<bucket>/<keyPrefix><runId>/<stepIndex>.json. Optional software.amazon.awssdk:s3 dep. |
Picking between distributed stores
| Concern | Redis | S3 |
|---|---|---|
| Latency | sub-ms intra-region | tens of ms (eventually-consistent regions: more) |
| Durability | RDB / AOF — operator-tunable | 11 nines — managed |
| Cost | proportional to memory + ops | proportional to storage + requests |
| Operations | needs a Redis cluster | managed by AWS |
| Region affinity | typically per-region cluster | cross-region replication available |
Use Redis for fast resume cycles (operator-paused runs picked up within minutes); use S3 for archival / cross-region durability. The two are not mutually exclusive — a layered stack with a Redis hot path and an S3 cold path is a reasonable production shape, but ships as a consumer-side composite rather than a built-in.
Cost ceiling enforcement
DefaultResumableRun.Builder#costMonitor is optional. Provide one and pair it with RunConfig.costCeilingUSD to make the orchestrator check spend before every step:
// Wire to the cost-governance store from tnsai-quality.
CostBudgetStore costStore = ...; // your spend ledger
DefaultResumableRun.CostMonitor monitor = () ->
Optional.of(costStore.currentSpend(
CostScope.tenant(tenantId), Duration.ofHours(24)));
DefaultResumableRun<S> run = DefaultResumableRun.<S>builder()
// ...
.costMonitor(monitor)
.build();
run.execute(input, RunConfig.builder()
.costCeilingUSD(new BigDecimal("10.00")) // $10 cap
.build());When currentSpend ≥ ceiling, the orchestrator:
- Saves a final checkpoint at the current state
- Publishes
Progress.RunAborted(BUDGET_EXHAUSTED) - Returns
Outcome.Aborted
The consumer can resume the run on a higher ceiling once it's available — the saved checkpoint preserves work-done-so-far.
Progress events
Progress is a sealed interface with eight variants. Subscribe to a run's events for live dashboards or progress-timeout enforcement:
ProgressSink.Subscription sub = sink.subscribe(runId, event -> {
switch (event) {
case Progress.StepStarted s -> dashboard.markStepRunning(s.stepIndex());
case Progress.StepCompleted s -> dashboard.markStepDone(s.stepIndex(), s.took());
case Progress.CheckpointSaved c -> dashboard.checkpointAt(c.checkpointId());
case Progress.CostUpdate c -> dashboard.spendTick(c.spentUSD());
case Progress.Heartbeat h -> dashboard.heartbeat(h.currentStep());
case Progress.RunCompleted r -> dashboard.markDone();
case Progress.RunFailed r -> dashboard.markFailed(r.reason());
case Progress.RunAborted r -> dashboard.markAborted(r.reason());
}
});
// Cancel when the dashboard tab closes.
sub.cancel();InMemoryProgressSink is the default. KafkaProgressSink and similar fan-out adapters are tracked as follow-ups; the SPI accepts third-party impls today.
RunConfig defaults
| Field | Default | Notes |
|---|---|---|
maxDuration | 8h | Hard wall-clock cap; MAX_DURATION_EXCEEDED on breach |
costCeilingUSD | empty | Optional; without it, runtime cost-ceiling checks are skipped |
checkpointInterval | 5 min | Caps worst-case replay window after a crash |
maxRetries | 3 | Per-step retry budget before Outcome.Failed |
progressTimeout | 10 min | "stuck" detector — abort if no Progress event in this window |
Use RunConfig.defaults() for an overnight-run-friendly starting point, or RunConfig.builder() for finer control.
Tool annotations
Tools that participate in resumable runs benefit from declaring their effect classification + idempotency expectation. Two annotations were added to @Tool for this:
import com.tnsai.annotations.*;
public class GithubTools {
@Tool(name = "github.create_pr",
description = "Open a pull request on a GitHub repo",
sideEffect = SideEffect.EXTERNAL,
idempotencyHint = IdempotencyHint.REQUIRED)
public PullRequest createPr(@ToolParam("title") String title, ...) { ... }
@Tool(name = "github.list_branches",
description = "List branches on a GitHub repo",
sideEffect = SideEffect.READ) // idempotencyHint defaults to NONE
public List<Branch> listBranches(...) { ... }
}SideEffect:
| Value | Meaning |
|---|---|
NONE | Pure function. Default. |
READ | Reads external state, doesn't mutate. |
WRITE | Mutates state inside the framework or a system the framework controls. |
EXTERNAL | Calls a third-party system whose effect we can't reverse. |
IdempotencyHint:
| Value | Meaning |
|---|---|
NONE | No tracking. Default. |
OPTIONAL | Track when caller supplies a key, otherwise dispatch unguarded. |
REQUIRED | Refuse to dispatch without an idempotency key. Reserve for payments / public posts / irreversible deletes. |
Both fields default to safe values so existing tools retain their current behaviour without modification — the annotations are purely additive metadata.
What's not in this layer
Out of scope for the framework primitives, tracked separately:
- Distributed transaction coordination (sagas) — checkpointing is local; cross-system consistency is a different problem.
- Run inspection UI — the
ProgressSinkAPI is enough for now; UI is downstream. - Backfill for runs without checkpointing — additive feature, not retroactive.
- Harness execution-loop refactor — the existing
AgentExecutorwill move ontoResumableRunas part of the harness work (TNS-291).
See Also
- Idempotency — the SPI consumed by side-effecting steps
- Cost Governance — the spend ledger the cost monitor wires into
- Resilience — companion retry / fallback / circuit-breaker layer for individual operations
- Error Handling — how step exceptions surface and propagate