Sandbox
Run untrusted code through a lightweight, isolated, fast primitive. The framework's answer to: how do I run LLM-generated code or untrusted shell commands without giving them access to the host's filesystem, network, or unbounded resources?
The com.tnsai.quality.sandbox package in tnsai-quality ships the primitives:
- Sandbox: interface with start / execute / stop / terminate, plus AutoCloseable
- SandboxSpec: declarative config (image, env, fsMounts, networkPolicy, resourceLimits, workingDir, warmable)
- SandboxResult: exit code, stdout/stderr, ResourceUsage, timedOut flag
- SandboxFactory: SPI for backend selection (process / container / wasm / future firecracker)
- ResourceLimits: cpu / memory / disk / timeout / maxProcesses
- NetPolicy: sealed, one of DenyAll / AllowList(hosts) / Inherit
- SandboxPool: warmable-instance reuse for high-concurrency workloads
- ObservedSandbox + SandboxExecutionListener: per-execute observability events
Pairs with Accountability (sandbox events correlate with AgentLiabilityRecords on the same correlationId) and the future tnsai-tools refactor (existing PythonExecutionTools / JsExecutionTools move to the SPI in a follow-up).
Why a separate primitive
Three forces motivate sandbox as a first-class layer:
- LLM-generated code is untrusted by definition. Even when an agent is benevolent, a hallucinated rm -rf / happens. A process boundary alone doesn't protect the host filesystem; we need an FS jail, env scrubbing, and timeout enforcement.
- Per-instance cost matters. Multi-agent fan-out (TNS-294 group), code review (TNS-291 deepsec harness), and Agents-of-Chaos benchmarks all push concurrent sandbox counts into the hundreds. Tencent CDB-style targets (60ms cold start, 5MB RAM) become the budget.
- Backend selection is deployment-specific. A laptop dev wants speed (ProcessSandbox); a CI runner wants real isolation (ContainerSandbox); a production microVM host wants Firecracker. The framework ships an SPI so the same calling code targets all three.
Quick start
import com.tnsai.quality.sandbox.*;
import java.time.Duration;
// 1. Pick a backend. preferred() auto-selects the highest-priority
// one available; explicit selection via byId(...) when you want
// a specific backend.
SandboxFactory factory = SandboxFactory.preferred();
// or: SandboxFactory factory = SandboxFactory.byId("container");
// 2. Build a spec. Defaults: deny-all network, standard resource
// limits (1 CPU, 256MB, 30s, 64 maxProcs), workingDir = backend
// default, warmable = false.
SandboxSpec spec = SandboxSpec.builder()
    .image("python:3.12-slim") // backend-specific; "" for ProcessSandbox
    .resourceLimits(ResourceLimits.standard())
    .networkPolicy(NetPolicy.denyAll())
    .build();
// 3. Run a command. The sandbox is created lazily by create();
// each execute() runs to completion or to the timeout budget.
try (Sandbox sb = factory.create(spec)) {
    SandboxResult r = sb.execute(Command.shell("python -c 'print(1+1)'"));
    System.out.println("exit=" + r.exitCode() + " stdout=" + r.stdoutString());
}

Backend choice tree
| Backend | Cold start | RAM/instance | Network isolation | Use when |
|---|---|---|---|---|
| process | ~50–150ms | ~10–30MB | Not enforced (host-shared) | Dev / CI / portable fallback |
| container | ~150–400ms | ~30–80MB | Real (--network=none) | Production code-exec, untrusted shell |
| wasm (v1 stub) | sub-50ms (target) | sub-MB (target) | Capability-gated | Pyodide / Wasmer adapter (follow-up) |
| firecracker (deferred) | ~125ms | ~5MB | microVM-isolated | Linux production at scale |
SandboxFactory.preferred() picks the highest-priority backend whose available() returns true:
| Backend | priority |
|---|---|
| process | 10 |
| container | 50 |
| wasm | 75 (when adapter ships) |
| firecracker | 100 (when adapter ships) |
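The selection rule above can be sketched in a few lines. This is only an illustration of "highest priority among available backends"; the Backend record and the preferred(List) helper are hypothetical stand-ins, not the real SandboxFactory API.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class BackendSelectionSketch {
    /** Hypothetical stand-in for a registered backend (not the real SandboxFactory type). */
    record Backend(String id, int priority, boolean available) {}

    /** Returns the available backend with the highest priority, or empty if none is available. */
    static Optional<Backend> preferred(List<Backend> registered) {
        return registered.stream()
                .filter(Backend::available)
                .max(Comparator.comparingInt(Backend::priority));
    }

    public static void main(String[] args) {
        List<Backend> registered = List.of(
                new Backend("process", 10, true),
                new Backend("container", 50, true),
                new Backend("wasm", 75, false),          // adapter not shipped yet
                new Backend("firecracker", 100, false)); // adapter not shipped yet
        // prints "container"
        System.out.println(preferred(registered).map(Backend::id).orElse("none"));
    }
}
```

With only the v1 backends available, a host with a container runtime resolves to container and a host without one falls back to process.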
Resource limits
ResourceLimits limits = new ResourceLimits(
    1.0,                    // cpuShares (1.0 = one full core)
    256,                    // memoryMb
    128,                    // diskMb (scratch space)
    Duration.ofSeconds(30), // timeout (per-execute wall clock)
    64);                    // maxProcesses (0 = no limit)

Validation rejects:
- cpuShares <= 0
- memoryMb <= 0
- diskMb < 0
- timeout zero or negative (a sandbox without a deadline is a footgun)
- maxProcesses < 0
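One way such validation could be expressed is a record with a compact constructor. The Limits record below is a hypothetical mirror of ResourceLimits, and the choice of IllegalArgumentException is an assumption; the real class may report violations differently.

```java
import java.time.Duration;

final class ResourceLimitsSketch {
    /** Hypothetical mirror of ResourceLimits; component order follows the constructor above. */
    record Limits(double cpuShares, int memoryMb, int diskMb, Duration timeout, int maxProcesses) {
        Limits {
            // !(x > 0) rejects zero, negatives, and NaN in a single check.
            if (!(cpuShares > 0)) throw new IllegalArgumentException("cpuShares must be > 0");
            if (memoryMb <= 0) throw new IllegalArgumentException("memoryMb must be > 0");
            if (diskMb < 0) throw new IllegalArgumentException("diskMb must be >= 0");
            if (timeout == null || timeout.isZero() || timeout.isNegative()) {
                throw new IllegalArgumentException("timeout must be positive");
            }
            if (maxProcesses < 0) throw new IllegalArgumentException("maxProcesses must be >= 0");
        }
    }

    public static void main(String[] args) {
        // Valid: same values as the example above.
        System.out.println(new Limits(1.0, 256, 128, Duration.ofSeconds(30), 64));
        try {
            new Limits(1.0, 256, 128, Duration.ZERO, 64);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```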
Presets:
- ResourceLimits.minimal(): 0.25 CPU / 64MB / 16MB disk / 5s / 32 procs (policy checks, regex eval)
- ResourceLimits.standard(): 1 CPU / 256MB / 256MB disk / 30s / 64 procs (typical code-exec)
- ResourceLimits.of(cpu, memMb, timeoutSec): convenience for the common shape
Network policy
NetPolicy.denyAll(); // recommended default
NetPolicy.allow(List.of("github.com", "api.openai.com:443"));
NetPolicy.inherit(); // sandbox inherits host network

| Policy | ProcessSandbox | ContainerSandbox |
|---|---|---|
| DenyAll | Logged, NOT enforced (JVM child inherits host network) | --network=none (real isolation) |
| AllowList | Logged, NOT enforced | Rejected at create time (deferred to follow-up) |
| Inherit | Default, no warning | --network=host |
ProcessSandbox is honest about its limits — it logs WARN at create time when the requested policy isn't enforceable, rather than silently downgrading. Real network isolation requires container or firecracker.
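The per-backend behaviour in the table can be modelled roughly as follows. The Policy hierarchy and both helper methods are hypothetical names used for illustration, and the exception type for the rejected AllowList case is an assumption; only the --network=none and --network=host flags come from the table.

```java
import java.util.List;

final class NetPolicySketch {
    /** Hypothetical model of the sealed NetPolicy hierarchy. */
    sealed interface Policy permits DenyAll, AllowList, Inherit {}
    record DenyAll() implements Policy {}
    record AllowList(List<String> hosts) implements Policy {}
    record Inherit() implements Policy {}

    /** Container backend: maps a policy to a network flag, or rejects it at create time. */
    static List<String> containerNetworkArgs(Policy policy) {
        if (policy instanceof DenyAll) return List.of("--network=none");
        if (policy instanceof Inherit) return List.of("--network=host");
        // AllowList enforcement is deferred, so the sketch refuses instead of downgrading.
        throw new UnsupportedOperationException("AllowList is not enforced by the container backend yet");
    }

    /** Process backend: true when the requested policy cannot be enforced and a WARN is due. */
    static boolean processBackendShouldWarn(Policy policy) {
        return !(policy instanceof Inherit);
    }

    public static void main(String[] args) {
        System.out.println(containerNetworkArgs(new DenyAll())); // prints [--network=none]
        System.out.println(processBackendShouldWarn(new DenyAll())); // prints true
    }
}
```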
Filesystem mounts
SandboxSpec.builder()
    .fsMount(FsMount.readOnly(Path.of("./inputs"), Path.of("/data")))
    .fsMount(FsMount.readWrite(Path.of("./outputs"), Path.of("/work")))
    // …
    .build();

| Backend | Read-only | Read-write |
|---|---|---|
| ProcessSandbox | Copy-in (host file → jail dir) | Rejected at create (copy-in can't propagate writes back) |
| ContainerSandbox | --mount type=bind,readonly | --mount type=bind |
Sandbox path MUST be absolute — the sandbox sees its filesystem rooted at /.
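As a rough illustration of the container column, the sketch below turns a mount into Docker-style --mount arguments and enforces the absolute-path rule. The Mount record and toContainerArgs helper are hypothetical; the sandbox path is kept as a String so the leading-slash check does not depend on the host platform's notion of an absolute path.

```java
import java.nio.file.Path;
import java.util.List;

final class FsMountSketch {
    /** Hypothetical stand-in for FsMount. */
    record Mount(Path hostPath, String sandboxPath, boolean readOnly) {
        Mount {
            // The sandbox sees its filesystem rooted at /, so the target must start there.
            if (sandboxPath == null || !sandboxPath.startsWith("/")) {
                throw new IllegalArgumentException("sandbox path must be absolute: " + sandboxPath);
            }
        }
    }

    /** Builds the bind-mount arguments a container backend might pass to the runtime. */
    static List<String> toContainerArgs(Mount mount) {
        String source = mount.hostPath().toAbsolutePath().normalize().toString();
        String spec = "type=bind,source=" + source + ",target=" + mount.sandboxPath()
                + (mount.readOnly() ? ",readonly" : "");
        return List.of("--mount", spec);
    }

    public static void main(String[] args) {
        System.out.println(toContainerArgs(new Mount(Path.of("./inputs"), "/data", true)));
        System.out.println(toContainerArgs(new Mount(Path.of("./outputs"), "/work", false)));
    }
}
```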
Pool reuse
For high-concurrency workloads, reuse warm instances through a pool:
SandboxPool pool = new SandboxPool(
    SandboxFactory.preferred(),
    spec.toBuilder().warmable(true).build(),
    /* maxSize */ 16,
    Duration.ofSeconds(2));
try (SandboxPool.Lease lease = pool.borrow()) {
    SandboxResult r = lease.execute(Command.of("python", "task.py"));
    // ... lease.close() returns sandbox to the pool
}
pool.close(); // drains every idle sandbox

When the pool is full and the borrow timeout elapses, it degrades gracefully by creating a non-pooled sandbox, so callers never block forever. An explicit Lease.terminate() evicts a sandbox the caller has reason to consider unhealthy.
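The fallback described above can be sketched with a semaphore and an idle queue. This is a minimal model under stated assumptions, not SandboxPool's implementation: PoolFallbackSketch, its Lease record, and giveBack are hypothetical names, and eviction and closing of non-pooled instances are left out.

```java
import java.time.Duration;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class PoolFallbackSketch<T> {
    /** A borrowed resource plus a flag saying whether it counts against the pool. */
    record Lease<R>(R resource, boolean pooled) {}

    private final Supplier<T> factory;
    private final Semaphore permits;
    private final ConcurrentLinkedQueue<T> idle = new ConcurrentLinkedQueue<>();
    private final Duration borrowTimeout;

    PoolFallbackSketch(Supplier<T> factory, int maxSize, Duration borrowTimeout) {
        this.factory = factory;
        this.permits = new Semaphore(maxSize);
        this.borrowTimeout = borrowTimeout;
    }

    Lease<T> borrow() {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(borrowTimeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while borrowing", e);
        }
        if (!acquired) {
            // Pool full and the wait timed out: hand out a one-off, non-pooled instance.
            return new Lease<>(factory.get(), false);
        }
        T warm = idle.poll(); // reuse a warm instance when one is idle
        return new Lease<>(warm != null ? warm : factory.get(), true);
    }

    void giveBack(Lease<T> lease) {
        if (!lease.pooled()) return; // a real pool would close the one-off instance here
        idle.offer(lease.resource());
        permits.release();
    }
}
```

The trade-off in this shape is that concurrency can briefly exceed maxSize under load, in exchange for a bounded wait on every borrow.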
Observability
Every execute() emits a SandboxExecutionEvent to the wired listener:
SandboxExecutionListener listener = event ->
    log.info("[sandbox] backend={} image={} exit={} wallClockMs={} cpuMs={}",
        event.backend(), event.image(), event.exitCode(),
        event.resourceUsage().wallClockMs(), event.resourceUsage().cpuMillis());
try (Sandbox sb = new ObservedSandbox(
        factory.create(spec),
        listener,
        factory.id())) {
    sb.execute(Command.shell("python -c 'print(1)'"));
}

The event carries the sandboxId, backend, image, argv, exit code, timedOut flag, ResourceUsage, and the network-policy class name (DenyAll / Inherit / AllowList). Listener exceptions are caught and logged, so observability outages never break execution.
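The listener-isolation guarantee can be illustrated with a small wrapper. The Event record and notifySafely helper are hypothetical, and java.util.logging stands in for whatever logger the framework actually uses.

```java
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

final class ListenerIsolationSketch {
    /** Hypothetical, trimmed-down stand-in for SandboxExecutionEvent. */
    record Event(String backend, int exitCode, boolean timedOut) {}

    private static final Logger LOG = Logger.getLogger(ListenerIsolationSketch.class.getName());

    /** Delivers the event; returns false (and logs) if the listener threw. Never rethrows. */
    static boolean notifySafely(Consumer<Event> listener, Event event) {
        try {
            listener.accept(event);
            return true;
        } catch (RuntimeException e) {
            LOG.log(Level.WARNING, "sandbox listener failed; execution result is unaffected", e);
            return false;
        }
    }

    public static void main(String[] args) {
        Event event = new Event("process", 0, false);
        notifySafely(e -> { throw new IllegalStateException("metrics sink down"); }, event);
        System.out.println("execution continued"); // reached despite the failing listener
    }
}
```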
Pairs with accountability
Sandbox events correlate with AgentLiabilityRecord entries on the same correlation id — a single audit timeline for "agent X attempted action Y inside sandbox Z, used N CPU-ms, exited with code C". Operators wire both listeners on the same agent and downstream consumers join on correlationId.
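A downstream consumer's join might look like the sketch below. Both record types and all of their fields are hypothetical simplifications; the source only states that the two streams share a correlationId.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CorrelationJoinSketch {
    /** Hypothetical, simplified views of the two event streams. */
    record LiabilityRecord(String correlationId, String agent, String action) {}
    record SandboxEvent(String correlationId, String sandboxId, long cpuMillis, int exitCode) {}

    /** Joins sandbox events to liability records on correlationId and renders one line per match. */
    static List<String> timeline(List<LiabilityRecord> records, List<SandboxEvent> events) {
        Map<String, LiabilityRecord> byId = new HashMap<>();
        for (LiabilityRecord r : records) {
            byId.put(r.correlationId(), r);
        }
        List<String> lines = new ArrayList<>();
        for (SandboxEvent e : events) {
            LiabilityRecord r = byId.get(e.correlationId());
            if (r == null) continue; // sandbox event without an accountability entry
            lines.add("agent " + r.agent() + " attempted " + r.action()
                    + " inside sandbox " + e.sandboxId()
                    + ", used " + e.cpuMillis() + " CPU-ms, exited with code " + e.exitCode());
        }
        return lines;
    }
}
```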
What's not in v1 (deferred to follow-ups)
- FirecrackerSandbox: Linux microVM backend; child issue
- WASM runtime adapters: Pyodide / Wasmer / Bun WASM; child issues per language
- AllowList enforcement on ContainerSandbox: needs custom network + iptables; v2
- GPU sandbox: model inference inside sandbox; v3
- Multi-tenant resource quota: per-tenant aggregate limits; v2
- Snapshot/restore: running sandbox state save; v2
- tnsai-tools refactor: PythonExecutionTools / JsExecutionTools move to the SPI; child issue (current implementations document the gap explicitly via "WARNING: not a sandbox" headers)
See also
- Accountability: sandbox events correlate with AgentLiabilityRecord for the same dispatch
- Approvals and Annotations: @ApprovalRequired works alongside sandboxing (approvals gate access; sandbox bounds the blast radius)
- Enforcement: SecurityEnforcer runs OUTSIDE the sandbox; sandbox is the inner ring