SkyPilot Sandboxes: Run Agent Code on Your Own Kubernetes, at Scale

Every agent, coding assistant, and RL pipeline eventually hits the same wall: the model wrote code, and now someone has to run it. Today, most teams hand that code to a hosted sandbox vendor, paying a multiple of raw compute to execute untrusted code on someone else’s machines while their prompts, test cases, and model outputs leave their cloud. Meanwhile, the Kubernetes cluster they already operate sits right there, capable of running 50,000 sandboxes at once. This post is about closing that gap: SkyPilot Sandboxes, a BYOC code execution layer, with a full RL post-training example and head-to-head benchmarks against Modal.

Two side-by-side bar charts of the cost per hour to run 50,000 sandboxes. General-purpose nodes (2 vCPU / 4 GB each): hosted $19,030 per hour vs SkyPilot BYOC $4,650 per hour, 4.1x cheaper. Burstable nodes (2 vCPU / 2 GB each): hosted $16,610 per hour vs SkyPilot BYOC $1,680 per hour, 9.9x cheaper. The full pricing math is worked out in the cost section below.

What is a sandbox, and why do you need one?

LLMs generate code. Whether it is an agent, a coding assistant, or an RL reward loop scoring the output of a half-trained model, at some point you have to run that code, and you cannot trust it. It can loop forever, exhaust memory, write files, spawn processes, or import something that tries to phone home. You need a disposable, isolated place to run it, and you usually need a lot of them at once.

Today that means reaching for a hosted sandbox vendor. It works, but the trade is real:

Cost. You pay the vendor’s per-sandbox rate on top of the compute you already own.
Privacy. Your code and data (the model’s output, your test cases, your prompts) leave your environment for a third party.
Latency for non-US users. The vendor runs in their regions. Reach them from somewhere else and every call pays a network-distance tax.

SkyPilot Sandboxes run on your own infra

A SkyPilot Sandbox is a lightweight, isolated pod you create on demand, run commands in, and tear down, running on the Kubernetes you already have (BYOC: bring your own cloud).

Per-pod isolation. Each sandbox is its own pod with a dedicated image, CPU, and memory. Code that misbehaves is contained to its pod, and the pod is destroyed when you are done.
Massively parallel. Launch many sandboxes in a single call and fan commands out across them concurrently.
Sub-second launches with warm pools. A pool keeps pre-provisioned pods idle and ready, so creating a sandbox claims a running pod instead of waiting on Kubernetes scheduling and an image pull. That cuts a single sandbox’s launch time by more than 50%.
Your infra, your data. Code and data never leave your cloud. If grading needs credentials (a private package index, a database for integration tests), they are injected from the SkyPilot Secrets Manager at create time, never baked into an image.
Modal-style API. create(), exec(), terminate(), each with an async sibling on .aio for massive fan-out. If you have used a hosted sandbox SDK, you already know this one.

import asyncio

import sky.sandbox

sb = sky.sandbox.create(image="python:3.12", cpus=1, memory_gb=2)
result = sb.exec("python", "-c", "print(2 + 2)")
print(result["stdout"])                   # "4" (also: stderr, exit_code)
sb.terminate()

# One call returns a LIST of isolated sandboxes.
sandboxes = sky.sandbox.create(image="python:3.12", num_sandboxes=100)
for sb in sandboxes:
    sb.exec("pytest", "-q", "tests/")

# Every entrypoint has an async sibling on `.aio`. The `await` lines below
# run inside a coroutine; `code` is the Python source string to execute.
sandboxes = await sky.sandbox.create.aio(image="python:3.12", num_sandboxes=64)
results = await asyncio.gather(
    *(sb.exec.aio("python", "-c", code) for sb in sandboxes))
await asyncio.gather(*(sb.terminate.aio() for sb in sandboxes))
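
When the fan-out grows to thousands of exec calls from one client process, it can help to cap how many are in flight at once. The following is a minimal sketch using only the standard library, not part of the SkyPilot SDK: `gather_bounded` and `fake_exec` are hypothetical names, and `fake_exec` stands in for a real `sb.exec.aio(...)` call so the example runs without a cluster.

```python
import asyncio

async def gather_bounded(coros, limit):
    """Await all coroutines, keeping at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*(run(c) for c in coros))

# Counters used only to observe the concurrency cap in this demo.
stats = {"in_flight": 0, "peak": 0}

async def fake_exec(i):
    # Hypothetical stand-in for `sb.exec.aio(...)`: sleeps instead of running code.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return {"stdout": str(i), "stderr": "", "exit_code": 0}

results = asyncio.run(
    gather_bounded([fake_exec(i) for i in range(20)], limit=5))
print(len(results), "results, peak concurrency:", stats["peak"])
```

With a real cluster, the list passed in would be the `sb.exec.aio(...)` coroutines from the snippet above; results still come back in input order.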

Example: RL-training a code-generation model, with sandboxed reward

Untrusted code at volume shows up most sharply in reinforcement learning. This example post-trains a code-generation LLM, a policy model that, given a programming problem, writes a Python function to solve it. The training goal is simple to state: make the model’s generated functions pass the tests more often.

On every training step, for every rollout in the batch, we execute code that a half-trained model just wrote (buggy, occasionally infinite-looping, untrusted by definition) and that execution sits on the critical path of training. This is the same shape of problem HuggingFace’s Open R1 hit when they used hosted sandboxes for their RL reward; here, the execution runs on your own Kubernetes cluster via SkyPilot Sandboxes.

We use a standard distributed RL layout: five services in a SkyPilot job group, talking over HTTP.

The Data Server serves prompts (MBPP-style problems with hidden tests) to the Rollout Server.
The Rollout Server (SGLang) has the current policy generate candidate solutions and sends them to the Sandbox Reward Server.
The Sandbox Reward Server scores each candidate. This is where sandboxes come in. For every batch it receives, it claims a batch of sandboxes from a warm pool, runs each candidate against its hidden tests in its own sandbox, and returns 1.0 (all tests passed) or 0.0 (anything else).
The Replay Buffer stores the scored rollouts that the PPO Trainer writes to it.
The PPO Trainer (GRPO) uses the rewards to update the policy, and the loop repeats.

The RL training loop: a data server sends coding problems to the rollout server (SGLang), which generates candidate code; the sandbox reward server runs each candidate in its own sandbox and returns a reward to the PPO trainer; the trainer updates the policy, writes scored rollouts to a replay buffer, and periodically syncs updated model weights back to the rollout server.

Inside the reward server. The PPO trainer already POSTs a batch of {prompt, response, tests} to the /batch_reward endpoint on the reward server. The only change from a string-matching reward server is what happens inside it: we run code. We create one sandbox per generated script and call the scoring function on each (sandbox, script) pair:

import asyncio
import sky.sandbox

async def score_batch(items):
    # One call returns a LIST of sandboxes, claimed from the warm pool.
    sandboxes = await sky.sandbox.create.aio(
        name="reward", num_sandboxes=len(items), pool=POOL_NAME)
    try:
        # Score every rollout concurrently, one sandbox each.
        rewards = await asyncio.gather(
            *(score_one(sb, item) for sb, item in zip(sandboxes, items)))
    finally:
        # ALWAYS tear sandboxes down, even if an exec raised above.
        await asyncio.gather(*(sb.terminate.aio() for sb in sandboxes),
                             return_exceptions=True)
    return list(rewards)

Scoring one rollout is where execution happens. We extract the code block, concatenate it with the setup code and hidden tests, and run that script in the sandbox. The reward is the cleanest possible signal: exit 0 means every test passed, and anything else (an assertion failure, a runtime error, an infinite loop that hits the timeout, a sandbox-level error) is reward 0.0. The rule is that a bad rollout must never raise out of the reward function; early in training, most rollouts are bad, and the loop has to keep going.

async def score_one(sb, item):
    code = extract_code(item.response)
    if not code or not item.tests:
        return RewardResponse(reward=0.0, passed=False)

    script = build_test_script(code, item.setup_code, item.tests)
    try:
        result = await asyncio.wait_for(
            sb.exec.aio("python", "-c", script),
            timeout=EXEC_TIMEOUT_SECONDS)
    except Exception:
        # A crash or a timeout (asyncio.TimeoutError is an Exception subclass)
        # is a 0.0 reward, never an exception that escapes and kills the batch.
        return RewardResponse(reward=0.0, passed=False)

    passed = result["exit_code"] == 0       # stdout / stderr / exit_code
    return RewardResponse(reward=1.0 if passed else 0.0, passed=passed)
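
The helpers `extract_code`, `build_test_script`, and `RewardResponse` are used above but not shown. One plausible shape for them, as a sketch under stated assumptions: the model wraps its answer in a Markdown code fence, `item.tests` is a list of assert statements (as in MBPP), and a local subprocess stands in for the sandbox so the reward rule can be exercised without a cluster. `score_locally` is a hypothetical name, and it provides no isolation, so it should only ever see trusted inputs.

```python
import re
import subprocess
import sys
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code fence, built here to avoid a literal one

@dataclass
class RewardResponse:
    reward: float
    passed: bool

def extract_code(response):
    """Return the first fenced code block in a model response, or "" if none."""
    pattern = FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE
    match = re.search(pattern, response, flags=re.DOTALL)
    return match.group(1).strip() if match else ""

def build_test_script(code, setup_code, tests):
    """Concatenate setup, candidate, and tests; a failing assert exits non-zero."""
    return "\n\n".join([setup_code, code, *tests])

def score_locally(response, setup_code, tests, timeout=5.0):
    """Local stand-in for the sandboxed exec: NO isolation, trusted inputs only."""
    code = extract_code(response)
    if not code or not tests:
        return RewardResponse(reward=0.0, passed=False)
    script = build_test_script(code, setup_code, tests)
    try:
        proc = subprocess.run([sys.executable, "-c", script],
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # An infinite loop that hits the timeout is a 0.0 reward.
        return RewardResponse(reward=0.0, passed=False)
    passed = proc.returncode == 0
    return RewardResponse(reward=1.0 if passed else 0.0, passed=passed)

good = f"Here you go:\n{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}"
bad = f"{FENCE}python\ndef add(a, b):\n    return a - b\n{FENCE}"
hang = f"{FENCE}python\ndef add(a, b):\n    while True:\n        pass\n{FENCE}"
tests = ["assert add(2, 2) == 4", "assert add(-1, 1) == 0"]

results = {
    "good": score_locally(good, "", tests),
    "bad": score_locally(bad, "", tests),
    "hang": score_locally(hang, "", tests, timeout=1.0),
}
for name, res in results.items():
    print(name, res.reward)
```

Only the passing candidate earns 1.0; the wrong answer and the infinite loop both collapse to 0.0 without raising, which is the behavior the reward function above depends on.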

The warm pool is created once when the server starts, and the shared session is released once when it stops:

sky.sandbox.create_pool(
    name=POOL_NAME, image="python:3.11-slim",
    cpus=1, memory_gb=2, replicas=8)
# ... on shutdown:
await sky.sandbox.aclose()

Swap this reward server in for a string-matching one and the rest of a standard GRPO pipeline does not change.

Performance: faster to first command, scales with your clusters

We benchmarked BYOC sandboxes against Modal Sandboxes, a managed, multi-tenant service hosted in Modal’s US infrastructure (internal benchmarks, June 2026). Three takeaways.

Scale is determined by your cluster. A single Kubernetes cluster sustained ~50,000 healthy sandboxes across 220 nodes. Add clusters to go higher, and SkyPilot routes requests to clusters with available capacity.

Time to first command is ~20% faster, with a much tighter tail. The metric that matters is how long until a command you run in a fresh sandbox comes back: create, then immediately exec. At p50, a SkyPilot sandbox completes create + first exec in ~1.0s vs Modal’s ~1.2s, and the tails diverge further (p99 ~1.5s vs ~2.0s). Modal’s create() returns quickly but hands back a not-yet-ready handle; readiness lands on the first exec, which is where its variance lives. SkyPilot front-loads readiness into create(), so the first exec is quick and predictable.

Create + first exec	p50	p99
SkyPilot (BYOC, warm pool)	~1.0s	~1.5s
Modal (US)	~1.2s	~2.0s

Run the benchmark yourself:

curl -fsSLO https://gist.githubusercontent.com/lloyd-brown/58bdefdea5ff15f1563efa81fbed272a/raw/benchmark.py
python benchmark.py

The benchmark: 200 create, exec, terminate cycles per platform, measuring wall time from create() until an echo returns:

import time

import modal

try:
    import sky.sandbox
except ImportError:
    sky = None  # no SkyPilot client? bench Modal only

print("Benchmarking SkyPilot + Modal" if sky else
      "No SkyPilot client found; benchmarking Modal only")

N = 200
app = modal.App.lookup("bench", create_if_missing=True)
image = modal.Image.debian_slim(python_version="3.12")

if sky:
    # Comparable slim Python 3.12 image; pre-provision warm capacity once.
    print("Creating warm pool (one-time, untimed)...")
    sky.sandbox.create_pool(name="bench", image="python:3.12-slim",
                            cpus=1, memory_gb=2, replicas=5, blocking=True)

# One untimed warmup cycle per platform, so one-time setup (Modal image
# resolution on first use, client session init) never lands in the numbers.
print("Warmup cycle per platform (untimed)...")
msb = modal.Sandbox.create("sleep", "infinity", app=app, image=image)
msb.exec("echo", "hi").wait()
msb.terminate()
if sky:
    sb = sky.sandbox.create(name="bench-warmup", pool="bench")
    sb.exec("echo", "hi")
    sb.terminate()

def pctl(xs, q):
    return sorted(xs)[round(q / 100 * (len(xs) - 1))]

print(f"Timing {N} create -> exec -> terminate cycles per platform...")
skypilot_s, modal_s = [], []
for i in range(N):
    if sky:
        t0 = time.perf_counter()
        sb = sky.sandbox.create(name=f"bench-{i}", pool="bench")   # exec-ready
        sb.exec("echo", "hi")
        skypilot_s.append(time.perf_counter() - t0)
        sb.terminate()

    t0 = time.perf_counter()
    msb = modal.Sandbox.create("sleep", "infinity", app=app, image=image)
    msb.exec("echo", "hi").wait()             # container readiness lands here
    modal_s.append(time.perf_counter() - t0)
    msb.terminate()

    if (i + 1) % 20 == 0:
        print(f"  {i + 1}/{N} cycles done", flush=True)

for name, xs in [("SkyPilot", skypilot_s), ("Modal", modal_s)]:
    if xs:
        print(f"{name}: p50 {pctl(xs, 50):.2f}s  p99 {pctl(xs, 99):.2f}s")

if sky:
    print("Cleaning up the warm pool...")
    sky.sandbox.delete_pool("bench")
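
A note on reading the percentiles: `pctl` is a nearest-rank percentile, so with N = 200 samples each reported figure is a single observation. The check below reuses the helper on stand-in data (the integers 0 to 199, so each value equals its sorted index) to show which sample each percentile selects.

```python
def pctl(xs, q):
    # Same nearest-rank helper as in the benchmark script.
    return sorted(xs)[round(q / 100 * (len(xs) - 1))]

# Stand-in "latencies": value i sits at sorted index i.
samples = list(range(200))

p50_index = pctl(samples, 50)
p99_index = pctl(samples, 99)
print(p50_index, p99_index)  # 100 197
```

The p99 lands on index 197, the third-largest of the 200 timings, so a handful of slow cycles can move it noticeably; raising N tightens the estimate.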

Latency stays local. Modal’s best-case US exec latency is genuinely low (~0.096s) when the client sits right next to its US region. Move that client to APAC and Modal jumps 3.9x to ~0.37s, essentially a fixed trans-Pacific round trip. Because BYOC sandboxes run in your own region, next to your users, that distance tax never appears.

Cost: up to 10x cheaper

On your own cluster you pay only for the machines. Here are two fully worked comparisons you can rerun with your own numbers, both against Modal’s per-core-second and per-GiB-second billing (published on-demand pricing, June 2026): a conservative one on general-purpose nodes, and a leaner one on burstable nodes that approaches 10x. The scenario for both: the 50,000 sandboxes a single cluster sustains, priced per hour for the whole fleet.

The conservative case, on general-purpose nodes: ~4x cheaper. Each sandbox gets 2 vCPUs and 4 GB of memory. Hosted:

50,000 x (2 cores x $0.00003942/core-s + 4 GiB x $0.00000672/GiB-s) x 3,600 s/hr
  = 50,000 x $0.38 per sandbox-hour
  = $19,030 per hour

On your own cluster, we run one sandbox per n4-standard-2 node (2 vCPUs, 8 GB), which leaves memory headroom for the kubelet and system pods. At $67.01 per month, or $0.093 per hour per node (GKE on-demand pricing, June 2026):

50,000 nodes x $0.093/hr
  = $4,650 per hour

The lean case, on burstable nodes: ~10x cheaper. Most sandbox workloads are idle-then-burst (run a snippet, grade a test, exit), which is exactly the load burstable instances are priced for. This time we size the sandbox leaner too: 2 vCPUs and 2 GB of memory, one per AWS t4g.medium (2 vCPUs, 4 GB, $0.0336/hr, EC2 on-demand pricing, June 2026):

Hosted: 50,000 x (2 cores x $0.00003942/core-s + 2 GiB x $0.00000672/GiB-s) x 3,600 s/hr
  = 50,000 x $0.33 per sandbox-hour
  = $16,610 per hour
BYOC:   50,000 x $0.0336/hr (one t4g.medium per sandbox)
  = $1,680 per hour

That is a ~9.9x reduction, with the caveat burstable instances always carry: sustained all-core load eventually exhausts CPU credits, so for continuously hot sandboxes use the general-purpose math above.
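
To rerun the comparison with your own numbers, the arithmetic above fits in a few lines. This is a sketch that reuses the rates quoted in this section; the function names and the `sandboxes_per_node` parameter are illustrative assumptions, and current vendor and cloud pricing should be substituted before relying on the output.

```python
SECONDS_PER_HOUR = 3600
CORE_SECOND_USD = 0.00003942  # hosted per-core-second rate quoted above
GIB_SECOND_USD = 0.00000672   # hosted per-GiB-second rate quoted above

def hosted_cost_per_hour(n, cores, mem_gib):
    """Fleet cost per hour under per-core-second and per-GiB-second billing."""
    per_second = cores * CORE_SECOND_USD + mem_gib * GIB_SECOND_USD
    return n * per_second * SECONDS_PER_HOUR

def byoc_cost_per_hour(n, node_usd_per_hour, sandboxes_per_node=1):
    """Fleet cost per hour when paying only for nodes (assumed packing knob)."""
    return n / sandboxes_per_node * node_usd_per_hour

N = 50_000
# name -> (cores per sandbox, GiB per sandbox, node price in USD per hour)
scenarios = {
    "general-purpose": (2, 4, 0.093),
    "burstable": (2, 2, 0.0336),
}

report = {}
for name, (cores, mem_gib, node_rate) in scenarios.items():
    hosted = hosted_cost_per_hour(N, cores, mem_gib)
    byoc = byoc_cost_per_hour(N, node_rate)
    report[name] = (hosted, byoc, hosted / byoc)
    print(f"{name}: hosted ${hosted:,.0f}/hr, BYOC ${byoc:,.0f}/hr, "
          f"{hosted / byoc:.1f}x")
```

With the rates above this reproduces the figures in the text: roughly $19,030 vs $4,650 per hour (4.1x) and $16,610 vs $1,680 per hour (9.9x).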

Either way, hosted costs a multiple of the underlying compute: about four times on general-purpose nodes, nearly ten on burstable ones. That multiple is the vendor’s margin, and on your own cluster you simply do not pay it.

The whole comparison in one table.

	SkyPilot Sandboxes (BYOC)	Hosted service (e.g. Modal)
Where it runs	Your own K8s cluster	Vendor’s regions
Create + first exec (p50)	~1.0s	~1.2s
Cost/hr for 50k sandboxes	$1,680 (burstable) to $4,650	$16,610 to $19,030
Exec latency	Local to your users	Low in-region; ~3.9x tax cross-region
Code & data	Never leave your cloud	Sent to a third party
Secrets	Injected from the Secrets Manager	Configured in the vendor dashboard

Takeaways

Untrusted, LLM-generated code needs a real execution environment: isolated, massively concurrent, fast to start. SkyPilot Sandboxes give you that on the Kubernetes clusters you already own: ~50,000 sandboxes on a single cluster and individual launches in under a second, with your code and data never leaving your cloud. The async SDK makes the fan-out a few lines, whether you’re scoring RL rollouts, running parallel evals, or giving coding agents disposable environments.

Try it:

Get access: SkyPilot Sandboxes is in limited early access. Sign up here, it takes 20 seconds.
Run the RL example: the full five-service pipeline lives in the SkyPilot repo, including a CPU-only connectivity test so you can verify the reward path before committing GPUs.
Read the docs: the Sandboxes guide covers the SDK, warm pools, and secrets injection.

To receive the latest updates, please star and watch the project’s GitHub repo, follow @skypilot_org, or join the SkyPilot community Slack.
