Your AI writes code. Now what?
If you’re building AI agents in 2025, you’ve probably wondered the same thing. Your LLM generates Python code that analyzes data, manipulates files, or calls APIs. But where does it run? Most people either pay for managed execution services where they don’t control the data flow, or just YOLO it and run everything locally.
Neither option scales, and both come with problems you probably haven’t thought about yet.
The managed route seems convenient until you get the bill. E2B charges $150/month plus usage fees that escalate fast when your agents start doing real work. Worse, vendor lock-in means you’re stuck when costs spiral or you need custom configurations. Plus, your sensitive data flows through someone else’s infrastructure, which makes compliance teams nervous.
The local Docker approach works until it doesn’t. Cold starts take forever, resource management becomes a nightmare at scale, and good luck sharing environments across your team.
There’s a third way: self-hosted sandboxes that give you the control of local execution with the scalability of the cloud. This is where SkyPilot Code Sandbox comes in.
The solution that actually makes sense
SkyPilot Code Sandbox combines three pieces that solve different parts of the puzzle:
- SkyPilot for multi-cloud orchestration and cost optimization
- llm-sandbox for secure code execution with Docker isolation
- Model Context Protocol (MCP) for standardized AI tool integration
Deploy once, use everywhere. Run in your own clouds to leverage special pricing, credits, or compliance requirements. Scale automatically when you need it, shut down when you don’t.
This isn’t just another sandbox - it’s infrastructure that adapts to your needs instead of forcing you to adapt to its limitations.
Why self-host when managed services exist?
The sandbox market has split into two camps, and both have problems:
Managed services like E2B offer convenience but come with variable costs, vendor lock-in, and limited customization. When Together.ai acquired CodeSandbox, it signalled growing competition, but also highlighted how dependent teams become on these platforms.
Local Docker solutions don’t scale and create sharing nightmares. You can’t easily distribute environments across team members, and performance degrades as complexity increases.
Self-hosted with SkyPilot gives you the best of both worlds. You control the infrastructure, data never leaves your environment, and costs scale predictably. SkyPilot users report 3-6x cost savings through intelligent spot instance management and multi-cloud deployment.
Works with your existing Kubernetes setup. If you already run K8s clusters, SkyPilot can use them directly. It doesn’t matter if you’re on EKS, GKE, on-premises hardware, or GPU providers like CoreWeave - SkyPilot handles the pod scheduling and resource management so you don’t have to wrestle with Kubernetes YAML files for AI workloads.
Plus, you’re not locked into a single vendor’s ecosystem. SkyPilot works across 16+ cloud providers, so you can chase the best prices or meet specific compliance requirements.
What you actually get
The demo, while not quite production-ready, gives you a substantial feature set:
- Universal integration through MCP servers that work with Claude Desktop, VS Code, and any MCP client
- Multi-language support for Python, JavaScript, Java, C++, Go, and R with automatic package installation
- Kubernetes deployment flexibility that works with your existing clusters or cloud-managed K8s services
- Team data sharing via S3 bucket mounts (read-only for security)
- Auto-scaling that provisions resources when needed and shuts down when idle
- Session persistence that maintains state between executions for better performance
The technical stack handles the complexity for you. Here’s how the session management works:
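A minimal sketch of the pooling idea, with illustrative names (the repo’s actual implementation differs in detail):

```python
# Illustrative sketch of session pooling, not the repo's exact code.
# Pool key: (language, frozenset of libraries) -> a warm, reusable session.
from dataclasses import dataclass

@dataclass
class Session:
    language: str
    libraries: frozenset
    # In practice, a handle to the running Docker container lives here.

pool: dict[tuple, Session] = {}

def get_session(language: str, libraries: list[str]) -> Session:
    key = (language, frozenset(libraries))
    if key in pool:
        return pool[key]  # warm hit: skip container startup entirely
    session = Session(language, frozenset(libraries))  # cold start
    pool[key] = session
    return session
```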

Sessions are pooled and reused intelligently. If you’re running Python code that needs NumPy, the system checks for an existing session with those libraries before spinning up a new one. This cuts startup time dramatically compared to cold Docker containers.
Performance benchmarks
To validate the performance claims, we ran head-to-head benchmarks against E2B and Modal using identical Python code execution tasks. The results reveal significant performance advantages for the self-hosted approach.
Benchmark methodology
Each platform executed five common code patterns, with three runs per pattern:
- Basic arithmetic operations
- Iterative calculations (summing ranges)
- String manipulations
- List comprehensions
- Module imports
All tests measured end-to-end execution time from API call to response, including any cold start penalties.
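A sketch of a harness along these lines (the /execute route and bearer-token auth match the service described later in this post; the snippet bodies are representative stand-ins rather than the exact test set):

```python
# Time end-to-end execution: POST a snippet to /execute, measure wall clock.
import time
import requests

SNIPPETS = {
    "arithmetic": "print(2 + 2)",
    "iteration": "print(sum(range(10_000)))",
    "strings": "print('-'.join(str(i) for i in range(100)))",
    "comprehension": "print(len([x * x for x in range(1000)]))",
    "imports": "import json, math; print(math.pi)",
}

def time_execution(endpoint: str, token: str, code: str) -> float:
    start = time.perf_counter()
    resp = requests.post(
        f"{endpoint}/execute",
        json={"code": code, "language": "python"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=60,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

# Each pattern ran three times per platform; report the mean and range.
```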
Performance results
| Platform | Average Execution Time | Range | Success Rate |
|---|---|---|---|
| SkyPilot Code Sandbox | 0.284s | 0.262s - 0.349s | 100% |
| E2B | 0.747s | 0.570s - 1.307s | 100% |
| Modal | 2.046s | 1.537s - 5.154s | 100% |
The self-hosted solution is 2.6x faster than E2B and 7.2x faster than Modal on average. More importantly, the performance is consistent - the SkyPilot solution shows minimal variance between runs, while Modal exhibits significant cold start penalties.
Modal optimizes for resource efficiency by aggressively spinning down idle containers, which saves costs but creates significant cold start penalties of 2-5+ seconds for simple operations. When your AI agent requests code execution, Modal often needs to provision a fresh container from scratch. The self-hosted SkyPilot solution uses session pooling with persistent containers that stay warm between executions. When you run Python code needing NumPy, it checks for existing sessions with those libraries already loaded rather than starting from zero, delivering consistent 300ms response times that feel instantaneous during AI conversations.
Why the performance difference matters
These aren’t synthetic benchmarks designed to make one solution look good. The test cases represent realistic AI agent workloads: quick calculations, data processing, and library imports that happen hundreds of times during development.
When your AI agent needs to execute code to answer a question, waiting 2+ seconds for a simple math operation breaks the conversational flow. The 300ms response time from the self-hosted solution feels instantaneous and maintains the interactive experience users expect.
Session pooling and persistent containers eliminate the cold start penalty that serverless execution environments suffer from. While E2B and Modal optimize for resource efficiency by spinning down idle containers, that optimization creates latency that is a poor fit for interactive AI applications.
Cost comparison: Self-hosting wins at scale
The performance advantage becomes even more compelling when you factor in the economics. Let’s break down the real costs across different usage patterns.
Infrastructure costs per hour
| Platform | Cost per Hour for 4 vCPUs |
|---|---|
| SkyPilot (AWS m6i.xlarge) | $0.1920 |
| E2B | $0.2016 ($0.000056/s) |
| Modal | $0.1886 ($0.0473/core/h) |
Real-world usage scenarios
Self-hosting costs are predictable since you pay for the underlying AWS infrastructure regardless of usage patterns.
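The math is straightforward: one m6i.xlarge at $0.1920/hour × 24 hours × 30 days ≈ $138/month, whether you run one execution or ten thousand.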
Light development (100 executions/day)
- SkyPilot: $138/month (AWS m6i.xlarge running 24/7)
- E2B: $150/month base + usage fees
- Modal: $8-15/month
For light usage, Modal appears cheaper but delivers 7x slower execution times.
Production AI agent (1,000 executions/day)
- SkyPilot: $138/month
- E2B: $200-250/month (base + usage fees)
- Modal: $50-100/month
The performance advantage of self-hosting becomes valuable for user-facing applications where response time matters.
Enterprise workload (10,000+ executions/day)
- SkyPilot: $276/month (2x AWS m6i.xlarge running 24/7)
- E2B: $650+/month
- Modal: $300-500/month
At enterprise scale, the fixed infrastructure cost of self-hosting provides significant savings while maintaining consistent performance regardless of execution volume.
The hidden costs of managed services
Beyond the obvious pricing differences, managed services carry hidden costs that only surface at scale:
Vendor lock-in migration costs when you need to switch providers or customize the execution environment. Self-hosted solutions use standard Docker containers that run anywhere.
Compliance overhead for regulated industries that require data residency controls or audit trails. Managed services add complexity to compliance frameworks, while self-hosted solutions keep everything within your existing cloud governance.
When managed services make sense
Self-hosting isn’t always the answer. Managed services excel for:
- Prototyping and early development where setup speed matters more than costs
- Infrequent usage patterns where you can’t justify maintaining infrastructure
- Teams without DevOps expertise who prefer vendor-managed operations
But once you’re running production AI agents with consistent workloads, the economics and compliance requirements tip in favour of self-hosted solutions that give you control over performance, costs, and security.
How easy is it to deploy?
Getting this running takes one command:
```bash
export AUTH_TOKEN=<YOUR_AUTH_TOKEN>
sky serve up -n code-executor src/code-execution-service.yaml --env AUTH_TOKEN --secret AUTH_TOKEN
```
SkyPilot handles the rest. It provisions compute resources, sets up networking, deploys your service, and configures auto-scaling. The YAML configuration is remarkably simple:
```yaml
# SkyPilot YAML to deploy a service.
name: code-executor

service:
  readiness_probe:
    path: /health
    headers:
      Authorization: Bearer $AUTH_TOKEN
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.5

resources:
  ports: 8080
  infra: aws # Or 'k8s' for Kubernetes clusters
  cpus: 4

file_mounts:
  /bucket_data:
    source: s3://skypilot-code-sandbox-bucket/

run: |
  python -m uvicorn api:app --host 0.0.0.0 --workers 4 --port 8080
```
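Those replica_policy numbers also tell you the capacity envelope: at 2.5 QPS per replica, the service scales from roughly 2.5 queries per second on one replica up to about 25 QPS at the ten-replica ceiling.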
That’s it. SkyPilot figures out the cheapest available resources, provisions them, and keeps your service running. If your primary region gets expensive, it can automatically migrate to cheaper zones.
MCP changes everything
The real power comes from Model Context Protocol integration. Since MCP standardized how AI agents interact with tools, you can use the same sandbox across development environments.
The MCP server implementation wraps the FastAPI backend:
```python
import json
from typing import Any, Optional

@mcp.tool()
async def execute_code(
    code: str,
    language: str = "python",
    libraries: Any = None,
    timeout: Optional[int] = 30,
    session_id: Optional[str] = None
) -> str:
    """
    Execute code in a sandboxed environment.

    IMPORTANT: Always reuse the same session_id for consecutive executions
    unless you need to change the language or libraries.
    """
    request_data = {
        "code": code,
        "language": language,
        "timeout": timeout or 30,
        "session_id": session_id
    }
    if libraries:  # only send the key when libraries were requested
        request_data["libraries"] = libraries
    result = await make_api_request("POST", "/execute", request_data)
    return json.dumps(result, indent=2)
```
Configure it once in Claude Desktop:
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
```json
{
  "mcpServers": {
    "code-execution-server": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/alex000kim/skypilot-code-sandbox.git",
        "mcp-server"
      ],
      "env": {
        "API_BASE_URL": "<YOUR_ENDPOINT>",
        "AUTH_TOKEN": "<YOUR_AUTH_TOKEN>"
      }
    }
  }
}
```
Now Claude can execute code through your self-hosted infrastructure. The same configuration works in VS Code with Agent Mode, Cursor, or any MCP client.
Team collaboration
The S3 read-only mount feature solves a problem most teams encounter at some point: how do you share files and datasets across team members without copying everything?
With traditional approaches, each developer needs local copies of data, or you end up with convoluted shared drive setups. This system mounts your S3 bucket directly into the execution environment:
```yaml
file_mounts:
  /bucket_data:
    source: s3://<your-bucket-name>
```
All team members access the same data without duplication. For additional security, the bucket itself can be configured with read-only access, on top of being mounted read-only into the Docker containers that llm-sandbox creates, so AI agents can’t accidentally modify your datasets. Cost efficiency comes from keeping a single copy in storage while any number of execution environments read from it.
Great for collaborative research projects where everyone needs access to the same datasets but you want to prevent accidental data corruption.
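Inside a session, the mounted bucket looks like an ordinary local directory; a quick illustration (the CSV path is hypothetical):

```python
# The S3 mount appears at /bucket_data inside the sandbox, read-only.
import pandas as pd

df = pd.read_csv("/bucket_data/experiments/results.csv")  # hypothetical file
print(df.describe())
```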
Putting it all together
The limitations of this demo
Let’s be honest about limitations. This is a proof-of-concept that shows what’s possible, not a production-ready solution.
Debugging and observability are minimal. When code fails, you get basic error messages but no sophisticated debugging tools.
Resource optimization is simple - you set CPU limits, but there’s no intelligent resource allocation based on workload characteristics.
Session isolation presents a security concern. The current POC implementation pools sessions globally, which means data could potentially leak between different users’ code executions. The issue could be addressed with user-scoped sessions (using user IDs in session keys), but this would introduce cold-start delays for each user’s first execution.
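Building on the pooling sketch from earlier, that fix is roughly a one-line change to the pool key (names remain illustrative):

```python
def get_user_session(user_id: str, language: str,
                     libraries: list[str]) -> Session:
    # Including user_id in the key prevents cross-user session reuse,
    # at the cost of one cold start per user per environment.
    key = (user_id, language, frozenset(libraries))
    if key not in pool:
        pool[key] = Session(language, frozenset(libraries))
    return pool[key]
```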
Enterprise features like audit trails, fine-grained permissions, and CI/CD integration aren’t there.
These gaps represent opportunities rather than fundamental problems. The architecture supports these features - they just need implementation.
The bigger picture
The code execution sandbox landscape is evolving fast. Over 1,000 MCP servers have been created since the protocol launched in November 2024. Major players like OpenAI, Google DeepMind, and Replit are adopting the standard.
This convergence around open standards indicates market maturation from experimental tools to production infrastructure. The combination of MCP for standardization, SkyPilot for orchestration, and proper isolation for security creates a viable solution for AI development tools.
Try it yourself
The complete implementation is available on GitHub. Setup takes about 10 minutes if you have cloud credentials configured.
You’ll need:
- SkyPilot installed (`pip install skypilot`)
- Valid cloud credentials (AWS, GCP, or Azure)
- An authentication token for the API
Clone the repo, set your auth token, and run:
```bash
export AUTH_TOKEN=<YOUR_AUTH_TOKEN>
sky serve up -n code-executor src/code-execution-service.yaml --env AUTH_TOKEN --secret AUTH_TOKEN
```
Get your endpoint:
```bash
sky serve status code-executor --endpoint
```
Configure your MCP client and start running AI-generated code on your own infrastructure.