We are excited to bring you SkyPilot 0.11! This release introduces Pools for batch inference across clouds and Kubernetes, brings Managed Jobs Consolidation Mode to GA with 6x faster submission, and delivers enterprise-ready improvements supporting hundreds of AI engineers on a single API server instance.
Get it now:
uv pip install -U "skypilot>=0.11.0"
Or upgrade your team SkyPilot API server:
NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.11.0
helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
--set apiService.image=berkeleyskypilot/skypilot:$VERSION \
--version $VERSION --devel --reuse-values
[Beta] Pools: Batch inference across clouds & Kubernetes
SkyPilot now supports spawning a pool that launches workers across multiple clouds and Kubernetes clusters. Jobs can be scheduled on this pool and distributed to workers as they become available.
Key benefits include:
- Fully utilize your GPU capacity across clouds & Kubernetes
- Unified queue for jobs on all infrastructure
- Keep workers warm, scale elastically
- Step aside for higher priority jobs; reschedule when GPUs become available

Learn more in our blog post, documentation, and example.
[GA] Managed Jobs Consolidation Mode
Consolidation Mode is now generally available. It colocates the jobs controller with the API server, delivering:
- 6x faster job submission
- Consistent credentials across the API server and jobs controller
- Persistent managed jobs state on PostgreSQL
Enable it in your configuration:
# config.yaml
jobs:
  controller:
    consolidation_mode: true

We’ve also optimized the Managed Jobs controller to handle 2,000+ parallel jobs on a single 8-CPU controller - an 18x improvement in job capacity.

Enterprise-ready SkyPilot at large scale
SkyPilot 0.11 brings significant enterprise improvements (API server docs), enabling support for hundreds of AI engineers with a single SkyPilot API server instance.
SSO support with Microsoft Entra ID
Secure your SkyPilot deployment with enterprise single sign-on (auth docs):

Memory and performance improvements
We’ve significantly reduced the API server’s memory consumption and added OOM avoidance, along with CLI/SDK/Dashboard speedups when handling large numbers of clusters and jobs:


Comprehensive API server metrics
Monitor your SkyPilot deployment with detailed operational metrics:

Kubernetes improvements
SkyPilot 0.11 delivers robust Kubernetes support (docs):
- Robust SSH for SkyPilot clusters on Kubernetes
- Improved resource cleanup after termination
- Intelligent GPU name detection
- Retry on transient Kubernetes API server issues
- Improved volume support with labels, name validation, and SDK support (volumes docs)
Existing PVC support: reference a pre-existing Kubernetes PersistentVolumeClaim as a SkyPilot volume:
# volume.yaml
name: existing-pvc-name
type: k8s-pvc
infra: k8s/context1
use_existing: true
config:
  namespace: namespace
Ephemeral volumes: automatically create volumes when a cluster launches and delete them on teardown - ideal for temporary storage like caches and intermediate results:
# task.sky.yaml
file_mounts:
  /mnt/cache:
    size: 100Gi
CoreWeave and AMD GPU support
CoreWeave now officially supports SkyPilot, with InfiniBand, object storage, and autoscaling support (docs, example). See the CoreWeave blog for details.

AMD GPUs are fully supported on Kubernetes clusters with GPU detection and scheduling, dashboard metrics, and ROCm support (docs, example). See the AMD ROCm blog for details.

User Experience
SkyPilot templates
SkyPilot now ships predefined YAML templates for launching clusters with popular frameworks. Templates are automatically available on all new SkyPilot clusters.
Launch a multi-node Ray cluster with a single line:
run: |
  # One-line setup for a distributed Ray cluster
  ~/.sky/templates/ray/start_cluster
  # Submit your job
  python train.py
Improved Python SDK
The SkyPilot Python SDK (docs) is significantly improved with:
Type hints for better IDE support and code completion:

Log streaming for real-time job monitoring:
import sky

logs = sky.tail_logs(cluster_name, job_id, follow=True, preload_content=False)
for line in logs:
    if line is not None:
        if 'needle in the haystack' in line:
            print("found it!")
            break
logs.close()
Admin policy helpers (docs) for building policies programmatically:
resource_config = user_request.task.get_resource_config()
resource_config['use_spot'] = True
user_request.task.set_resources(resource_config)
GPU count in setup: New SKYPILOT_SETUP_NUM_GPUS_PER_NODE environment variable available during setup phase for configuring software based on GPU count (env vars docs).
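For instance, a Python helper invoked from the `setup` phase might branch on this variable. This is a minimal sketch; the "dependency set" choice here is a hypothetical use, not a SkyPilot API:

```python
import os

# SkyPilot exports SKYPILOT_SETUP_NUM_GPUS_PER_NODE during the setup phase.
# Fall back to 0 so the script also runs outside SkyPilot (e.g. local testing).
num_gpus = int(os.environ.get('SKYPILOT_SETUP_NUM_GPUS_PER_NODE', '0'))

# Hypothetical use: choose a dependency set based on GPU availability.
extras = 'gpu' if num_gpus > 0 else 'cpu'
print(f'Setting up {extras} dependencies for {num_gpus} GPU(s) per node')
```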
CI/CD integration
With the improved SDK, you can integrate SkyPilot with GitHub Actions and other orchestrators to automatically spin up your AI workloads:
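As one sketch of this pattern, a CI step can drive the SkyPilot CLI from a small Python script. The YAML filename and job name below are placeholders, and submission is skipped when the `sky` CLI is not on PATH:

```python
import shutil
import subprocess

def build_launch_cmd(task_yaml, job_name):
    """Build a non-interactive `sky jobs launch` invocation for CI."""
    return ['sky', 'jobs', 'launch', '--yes', '--name', job_name, task_yaml]

cmd = build_launch_cmd('train.sky.yaml', 'ci-nightly-train')
if shutil.which('sky'):
    # Inside CI with SkyPilot installed: submit the managed job.
    subprocess.run(cmd, check=True)
else:
    print('sky CLI not found; would run:', ' '.join(cmd))
```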

Native Git support
Use private git repositories directly as your SkyPilot workdir (docs). SkyPilot handles cloning and syncing automatically:
# task.sky.yaml
workdir:
  url: https://github.com/my-org/my-repo.git
  ref: 1234ab # commit hash or branch name
You can also use --git-url and --git-ref options with sky serve up. View commit hashes directly in the Dashboard.

Autostop based on SSH sessions
Configure autostop/autodown to wait for active SSH sessions in addition to running jobs (docs):
# task.sky.yaml
resources:
  autostop:
    wait_for: jobs_and_ssh
Distributed LLM examples
We released high-performance distributed training and serving examples, with checkpointing support:
- Kimi-K2 multi-node serving (docs, example)
- Torchtitan framework support (docs, example)
- Llama 4 Maverick 400B training on 16+ H200 GPUs (docs, example)
- OpenAI GPT-OSS pretraining and finetuning (docs, example)
- VeRL for agentic reinforcement learning (docs, example)

Other improvements
- Air-gapped deployments supported via private container registries in Helm charts
- Resolved OOM issues with the API server
- Pod status now visible during SkyPilot cluster launch on Kubernetes
- Post-provision commands on Kubernetes via post_provision_runcmd for custom pod initialization
- Fixed --retry-until-up to properly retry failed launches across all cloud zones
- GCP B200 spot instances now supported for cost-effective access to the latest NVIDIA GPUs
- AWS Trainium & Inferentia dynamic accelerator detection via AWS API
- Together AI instant cluster support for fast GPU access (docs)
- Seeweb cloud provider with Docker image support (docs)
Many more improvements and fixes in the full release notes.
Get started today
SkyPilot 0.11 makes running AI workloads across clouds and Kubernetes more efficient than ever. With Pools for batch inference, faster managed jobs, and enterprise-scale improvements, you can focus on building AI while SkyPilot handles the infrastructure.
Install SkyPilot 0.11:
uv pip install -U "skypilot>=0.11.0"
Or upgrade your team SkyPilot API server:
NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.11.0
helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
--set apiService.image=berkeleyskypilot/skypilot:$VERSION \
--version $VERSION --devel --reuse-values
Check out the documentation to get started.
Thanks to the SkyPilot community for contributing PRs and feedback that helped shape this release!
To receive the latest updates, please star and watch the project’s GitHub repo, follow @skypilot_org, or join the SkyPilot community Slack.
