Your AI writes code. Now what?
If you’re building AI agents in 2025, you’ve probably wondered the same thing. Your LLM generates Python code that analyzes data, manipulates files, or calls APIs. But where does it run? Most people either pay for managed execution services where they don’t control the data flow, or just YOLO it and run everything locally.
Neither option scales, and both come with problems you probably haven’t thought about yet.
The managed route seems convenient until you get the bill. E2B charges $150/month plus usage fees that escalate fast when your agents start doing real work. Worse, vendor lock-in means you’re stuck when costs spiral or you need custom configurations. Plus, your sensitive data flows through someone else’s infrastructure, which makes compliance teams nervous.
The local Docker approach works until it doesn’t. Cold starts take forever, resource management becomes a nightmare at scale, and good luck sharing environments across your team.
There’s a third way: self-hosted sandboxes that give you the control of local execution with the scalability of the cloud. This is where SkyPilot Code Sandbox comes in.
The solution that actually makes sense
SkyPilot Code Sandbox combines three pieces that solve different parts of the puzzle:
- SkyPilot for multi-cloud orchestration and cost optimization
- llm-sandbox for secure code execution with Docker isolation
- Model Context Protocol (MCP) for standardized AI tool integration
Deploy once, use everywhere. Run in your own clouds to leverage special pricing, credits, or compliance requirements. Scale automatically when you need it, shut down when you don’t.
This isn’t just another sandbox - it’s infrastructure that adapts to your needs instead of forcing you to adapt to its limitations.
Why self-host when managed services exist?
The sandbox market has split into two camps, and both have problems:
Managed services like E2B offer convenience but come with variable costs, vendor lock-in, and limited customization. When Together.ai acquired CodeSandbox, it signalled growing competition, but also highlighted how dependent teams become on these platforms.
Local Docker solutions don’t scale and create sharing nightmares. You can’t easily distribute environments across team members, and performance degrades as complexity increases.
Self-hosted with SkyPilot gives you the best of both worlds. You control the infrastructure, data never leaves your environment, and costs scale predictably. SkyPilot users report 3-6x cost savings through intelligent spot instance management and multi-cloud deployment.
Works with your existing Kubernetes setup. If you already run K8s clusters, SkyPilot can use them directly. It doesn’t matter if you’re on EKS, GKE, on-premises hardware, or GPU providers like CoreWeave - SkyPilot handles the pod scheduling and resource management so you don’t have to wrestle with Kubernetes YAML files for AI workloads.
Plus, you’re not locked into a single vendor’s ecosystem. SkyPilot works across 16+ cloud providers, so you can chase the best prices or meet specific compliance requirements.
What you actually get
The demo, while not quite production-ready, gives you a substantial feature set:
- Universal integration through MCP servers that work with Claude Desktop, VS Code, and any MCP client
- Multi-language support for Python, JavaScript, Java, C++, Go, and R with automatic package installation
- Kubernetes deployment flexibility that works with your existing clusters or cloud-managed K8s services
- Team data sharing via S3 bucket mounts (read-only for security)
- Auto-scaling that provisions resources when needed and shuts down when idle
- Session persistence that maintains state between executions for better performance
The technical stack handles the complexity for you. Here’s how the session management works:
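A minimal sketch of the pooling idea, with illustrative names (the repo’s actual implementation differs in detail):

```python
# Illustrative sketch of session pooling, not the repo's exact code.
# Pool key: (language, frozenset of libraries) -> a warm, reusable session.
from dataclasses import dataclass

@dataclass
class Session:
    language: str
    libraries: frozenset
    # In practice, a handle to the running Docker container lives here.

pool: dict[tuple, Session] = {}

def get_session(language: str, libraries: list[str]) -> Session:
    key = (language, frozenset(libraries))
    if key in pool:
        return pool[key]  # warm hit: skip container startup entirely
    session = Session(language, frozenset(libraries))  # cold start
    pool[key] = session
    return session
```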

Sessions are pooled and reused intelligently. If you’re running Python code that needs NumPy, the system checks for an existing session with those libraries before spinning up a new one. This cuts startup time dramatically compared to cold Docker containers.
Performance benchmarks
To validate the performance claims, we ran head-to-head benchmarks against E2B and Modal using identical Python code execution tasks. The results reveal significant performance advantages for the self-hosted approach.
Benchmark methodology
Each platform executed five common code patterns, with three runs per pattern:
- Basic arithmetic operations
- Iterative calculations (summing ranges)
- String manipulations
- List comprehensions
- Module imports
All tests measured end-to-end execution time from API call to response, including any cold start penalties.
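A sketch of a harness along these lines (the /execute route and bearer-token auth match the service described later in this post; the snippet bodies are representative stand-ins rather than the exact test set):

```python
# Time end-to-end execution: POST a snippet to /execute, measure wall clock.
import time
import requests

SNIPPETS = {
    "arithmetic": "print(2 + 2)",
    "iteration": "print(sum(range(10_000)))",
    "strings": "print('-'.join(str(i) for i in range(100)))",
    "comprehension": "print(len([x * x for x in range(1000)]))",
    "imports": "import json, math; print(math.pi)",
}

def time_execution(endpoint: str, token: str, code: str) -> float:
    start = time.perf_counter()
    resp = requests.post(
        f"{endpoint}/execute",
        json={"code": code, "language": "python"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=60,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

# Each pattern ran three times per platform; report the mean and range.
```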
Performance results
| Platform | Average Execution Time | Range | Success Rate |
|---|---|---|---|
| SkyPilot Code Sandbox | 0.284s | 0.262s - 0.349s | 100% |
| E2B | 0.747s | 0.570s - 1.307s | 100% |
| Modal | 2.046s | 1.537s - 5.154s | 100% |
The self-hosted solution is 2.6x faster than E2B and 7.2x faster than Modal on average. More importantly, the performance is consistent - the SkyPilot solution shows minimal variance between runs, while Modal exhibits significant cold start penalties.
Modal optimizes for resource efficiency by aggressively spinning down idle containers, which saves costs but creates significant cold start penalties of 2-5+ seconds for simple operations. When your AI agent requests code execution, Modal often needs to provision a fresh container from scratch. The self-hosted SkyPilot solution uses session pooling with persistent containers that stay warm between executions. When you run Python code needing NumPy, it checks for existing sessions with those libraries already loaded rather than starting from zero, delivering consistent 300ms response times that feel instantaneous during AI conversations.
Why the performance difference matters
These aren’t synthetic benchmarks designed to make one solution look good. The test cases represent realistic AI agent workloads: quick calculations, data processing, and library imports that happen hundreds of times during development.
When your AI agent needs to execute code to answer a question, waiting 2+ seconds for a simple math operation breaks the conversational flow. The 300ms response time from the self-hosted solution feels instantaneous and maintains the interactive experience users expect.
Session pooling and persistent containers eliminate the cold start penalty that serverless execution environments suffer from. While E2B and Modal optimize for resource efficiency by spinning down idle containers, that optimization creates latency that is a poor fit for interactive AI applications.
Cost comparison: Self-hosting wins at scale
The performance advantage becomes even more compelling when you factor in the economics. Let’s break down the real costs across different usage patterns.
Infrastructure costs per hour
| Platform | Cost per Hour for 4 vCPUs |
|---|---|
| SkyPilot (AWS m6i.xlarge) | $0.1920 |
| E2B | $0.2016 ($0.000056/s) |
| Modal | $0.1886 ($0.0473/core/h) |
Real-world usage scenarios
Self-hosting costs are predictable since you pay for the underlying AWS infrastructure regardless of usage patterns.
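The math is straightforward: one m6i.xlarge at $0.1920/hour × 24 hours × 30 days ≈ $138/month, whether you run one execution or ten thousand.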
Light development (100 executions/day)
- SkyPilot: $138/month (AWS m6i.xlarge running 24/7)
- E2B: $150/month base + usage fees
- Modal: $8-15/month
For light usage, Modal appears cheaper but delivers 7x slower execution times.
Production AI agent (1,000 executions/day)
- SkyPilot: $138/month
- E2B: $200-250/month (base + usage fees)
- Modal: $50-100/month
The performance advantage of self-hosting becomes valuable for user-facing applications where response time matters.
Enterprise workload (10,000+ executions/day)
- SkyPilot: $276/month (2x AWS m6i.xlarge running 24/7)
- E2B: $650+/month
- Modal: $300-500/month
At enterprise scale, the fixed infrastructure cost of self-hosting provides significant savings while maintaining consistent performance regardless of execution volume.
The hidden costs of managed services
Beyond the obvious pricing differences, managed services carry hidden costs that only surface at scale:
Vendor lock-in migration costs when you need to switch providers or customize the execution environment. Self-hosted solutions use standard Docker containers that run anywhere.
Compliance overhead for regulated industries that require data residency controls or audit trails. Managed services add complexity to compliance frameworks, while self-hosted solutions keep everything within your existing cloud governance.
When managed services make sense
Self-hosting isn’t always the answer. Managed services excel for:
- Prototyping and early development where setup speed matters more than costs
- Infrequent usage patterns where you can’t justify maintaining infrastructure
- Teams without DevOps expertise who prefer vendor-managed operations
But once you’re running production AI agents with consistent workloads, the economics and compliance requirements tip in favour of self-hosted solutions that give you control over performance, costs, and security.
How easy is it to deploy?
Getting this running takes one command:
```bash
export AUTH_TOKEN=<YOUR_AUTH_TOKEN>
sky serve up -n code-executor src/code-execution-service.yaml --env AUTH_TOKEN --secret AUTH_TOKEN
```
SkyPilot handles the rest. It provisions compute resources, sets up networking, deploys your service, and configures auto-scaling. The YAML configuration is remarkably simple:
```yaml
# SkyPilot YAML to deploy a service.
name: code-executor

service:
  readiness_probe:
    path: /health
    headers:
      Authorization: Bearer $AUTH_TOKEN
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.5

resources:
  ports: 8080
  infra: aws # Or 'k8s' for Kubernetes clusters
  cpus: 4

file_mounts:
  /bucket_data:
    source: s3://skypilot-code-sandbox-bucket/

run: |
  python -m uvicorn api:app --host 0.0.0.0 --workers 4 --port 8080
```
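Those replica_policy numbers also tell you the capacity envelope: at 2.5 QPS per replica, the service scales from roughly 2.5 queries per second on one replica up to about 25 QPS at the ten-replica ceiling.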
That’s it. SkyPilot figures out the cheapest available resources, provisions them, and keeps your service running. If your primary region gets expensive, it can automatically migrate to cheaper zones.
MCP changes everything
The real power comes from Model Context Protocol integration. Since MCP standardized how AI agents interact with tools, you can use the same sandbox across development environments.
The MCP server implementation wraps the FastAPI backend:
```python
import json
from typing import Any, Optional

@mcp.tool()
async def execute_code(
    code: str,
    language: str = "python",
    libraries: Any = None,
    timeout: Optional[int] = 30,
    session_id: Optional[str] = None
) -> str:
    """
    Execute code in a sandboxed environment.

    IMPORTANT: Always reuse the same session_id for consecutive executions
    unless you need to change the language or libraries.
    """
    request_data = {
        "code": code,
        "language": language,
        "timeout": timeout or 30,
        "session_id": session_id
    }
    if libraries:  # only send the key when libraries were requested
        request_data["libraries"] = libraries
    result = await make_api_request("POST", "/execute", request_data)
    return json.dumps(result, indent=2)
```
Configure it once in Claude Desktop:
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
```json
{
  "mcpServers": {
    "code-execution-server": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/alex000kim/skypilot-code-sandbox.git",
        "mcp-server"
      ],
      "env": {
        "API_BASE_URL": "<YOUR_ENDPOINT>",
        "AUTH_TOKEN": "<YOUR_AUTH_TOKEN>"
      }
    }
  }
}
```
Now Claude can execute code through your self-hosted infrastructure. The same configuration works in VS Code with Agent Mode, Cursor, or any MCP client.
Team collaboration
The S3 read-only mount feature solves a problem most teams encounter at some point: how do you share files and datasets across team members without copying everything?
With traditional approaches, each developer needs local copies of data, or you end up with convoluted shared drive setups. This system mounts your S3 bucket directly into the execution environment:
```yaml
file_mounts:
  /bucket_data:
    source: s3://<your-bucket-name>
```
All team members access the same data without duplication. For additional security, the bucket itself can be configured with read-only access, on top of being mounted read-only into the Docker containers that llm-sandbox creates, so AI agents can’t accidentally modify your datasets. Cost efficiency comes from keeping a single copy in storage while any number of execution environments read from it.
Great for collaborative research projects where everyone needs access to the same datasets but you want to prevent accidental data corruption.
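Inside a session, the mounted bucket looks like an ordinary local directory; a quick illustration (the CSV path is hypothetical):

```python
# The S3 mount appears at /bucket_data inside the sandbox, read-only.
import pandas as pd

df = pd.read_csv("/bucket_data/experiments/results.csv")  # hypothetical file
print(df.describe())
```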
Putting it all together
The limitations of this demo
Let’s be honest about limitations. This is a proof-of-concept that shows what’s possible, not a production-ready solution.
Debugging and observability are minimal. When code fails, you get basic error messages but no sophisticated debugging tools.
Resource optimization is simple - you set CPU limits, but there’s no intelligent resource allocation based on workload characteristics.
Session isolation presents a security concern. The current POC implementation pools sessions globally, which means data could potentially leak between different users’ code executions. The issue could be addressed with user-scoped sessions (using user IDs in session keys), but this would introduce cold-start delays for each user’s first execution.
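Building on the pooling sketch from earlier, that fix is roughly a one-line change to the pool key (names remain illustrative):

```python
def get_user_session(user_id: str, language: str,
                     libraries: list[str]) -> Session:
    # Including user_id in the key prevents cross-user session reuse,
    # at the cost of one cold start per user per environment.
    key = (user_id, language, frozenset(libraries))
    if key not in pool:
        pool[key] = Session(language, frozenset(libraries))
    return pool[key]
```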
Enterprise features like audit trails, fine-grained permissions, and CI/CD integration aren’t there.
These gaps represent opportunities rather than fundamental problems. The architecture supports these features - they just need implementation.
The bigger picture
The code execution sandbox landscape is evolving fast. Over 1,000 MCP servers have been created since the protocol launched in November 2024. Major players like OpenAI, Google DeepMind, and Replit are adopting the standard.
This convergence around open standards indicates market maturation from experimental tools to production infrastructure. The combination of MCP for standardization, SkyPilot for orchestration, and proper isolation for security creates a viable solution for AI development tools.
Try it yourself
The complete implementation is available on GitHub. Setup takes about 10 minutes if you have cloud credentials configured.
You’ll need:
- SkyPilot installed (`pip install skypilot`)
- Valid cloud credentials (AWS, GCP, or Azure)
- An authentication token for the API
Clone the repo, set your auth token, and run:
```bash
export AUTH_TOKEN=<YOUR_AUTH_TOKEN>
sky serve up -n code-executor src/code-execution-service.yaml --env AUTH_TOKEN --secret AUTH_TOKEN
```
Get your endpoint:
```bash
sky serve status code-executor --endpoint
```
Configure your MCP client and start running AI-generated code on your own infrastructure.