We are excited to bring you SkyPilot 0.11! This release introduces Pools for batch inference across clouds and Kubernetes, brings Managed Jobs Consolidation Mode to GA with 6x faster submission, and delivers enterprise-ready improvements supporting hundreds of AI engineers on a single API server instance.

Get it now:

uv pip install -U "skypilot>=0.11.0"

Or upgrade your team SkyPilot API server:

NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.11.0

helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
  --set apiService.image=berkeleyskypilot/skypilot:$VERSION \
  --version $VERSION --devel --reuse-values

[Beta] Pools: Batch inference across clouds & Kubernetes

SkyPilot now supports spawning a pool that launches workers across multiple clouds and Kubernetes clusters. Jobs can be scheduled on this pool and distributed to workers as they become available.

Key benefits include:

  • Fully utilize your GPU capacity across clouds & Kubernetes
  • Unified queue for jobs on all infrastructure
  • Keep workers warm, scale elastically
  • Step aside for higher-priority jobs; reschedule when GPUs become available
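
To give a feel for the workflow, here is a minimal sketch of a pool definition. Treat the schema and commands as illustrative; the pools documentation linked below has the authoritative version:

# pool.yaml (sketch)
pool:
  workers: 4          # number of warm workers kept in the pool

resources:
  accelerators: L4:1  # workers can land on any cloud or Kubernetes cluster with L4s

setup: |
  uv pip install vllm

# Create the pool and submit jobs to it by name (exact commands: see the pools docs):
#   sky jobs pool apply -p my-pool pool.yaml
#   sky jobs launch --pool my-pool job.yaml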

Batch Inference Architecture

Learn more in our blog post, documentation, and example.

[GA] Managed Jobs Consolidation Mode

Consolidation Mode is now generally available. It lets the jobs controller run colocated with the API server, delivering:

  • 6x faster job submission
  • Consistent credentials across the API server and jobs controller
  • Persistent managed jobs state on PostgreSQL

Enable it in your configuration:

# config.yaml
jobs:
  controller:
    consolidation_mode: true

Managed Jobs Consolidation Mode

We’ve also optimized the Managed Jobs controller: 2,000+ parallel jobs can now run on a single 8-CPU controller, an 18x improvement in job capacity.

Managed Jobs Scalability

Enterprise-ready SkyPilot at large scale

SkyPilot 0.11 brings significant enterprise improvements (API server docs), enabling support for hundreds of AI engineers with a single SkyPilot API server instance.

SSO support with Microsoft Entra ID

Secure your SkyPilot deployment with enterprise single sign-on (auth docs):

Microsoft Entra ID SSO
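
As a rough sketch of the wiring, the Helm chart's OAuth2 Proxy settings can point at your Entra ID tenant. The value keys below are assumptions for illustration (the issuer URL format is the standard one for Entra ID); follow the auth docs for the exact configuration:

# values.yaml (sketch; key names are assumptions, see the auth docs)
ingress:
  oauth2-proxy:
    enabled: true
    oidc-issuer-url: https://login.microsoftonline.com/<tenant-id>/v2.0
    client-id: <entra-application-client-id>
    client-secret: <entra-client-secret>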

Memory and performance improvements

We’ve made significant reductions in the API server’s memory consumption and added OOM avoidance, along with CLI/SDK/Dashboard speedups when handling large numbers of clusters and jobs:

Memory Improvements

Performance Improvements

Comprehensive API server metrics

Monitor your SkyPilot deployment with detailed operational metrics:

API Server Metrics

Kubernetes improvements

SkyPilot 0.11 delivers robust Kubernetes support (docs):

  • Robust SSH for SkyPilot clusters on Kubernetes
  • Improved resource cleanup after termination
  • Intelligent GPU name detection
  • Retry on transient Kubernetes API server issues
  • Improved volumes: labels, name validation, and SDK support (volumes docs)

Existing PVC support: reference pre-existing Kubernetes PersistentVolumeClaims as SkyPilot volumes:

# volume.yaml
name: existing-pvc-name
type: k8s-pvc
infra: k8s/context1
use_existing: true
config:
  namespace: namespace
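
After the volume is registered, a task can mount it by name. The sketch below is illustrative; the registration command and the volumes: field follow the volumes docs, so verify the exact schema there:

# Register the volume (command per the volumes docs, assumed here):
#   sky volumes apply volume.yaml

# task.sky.yaml (sketch)
volumes:
  /mnt/data: existing-pvc-name   # mount the registered volume at /mnt/data

run: |
  ls /mnt/data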

Ephemeral volumes: automatically create volumes when a cluster launches and delete them on teardown - ideal for temporary storage like caches and intermediate results:

# task.sky.yaml
file_mounts:
  /mnt/cache:
    size: 100Gi

CoreWeave and AMD GPU support

CoreWeave now officially supports SkyPilot, with InfiniBand, object storage, and autoscaling support (docs, example). See the CoreWeave blog for details.

CoreWeave Integration

AMD GPUs are fully supported on Kubernetes clusters with GPU detection and scheduling, dashboard metrics, and ROCm support (docs, example). See the AMD ROCm blog for details.

AMD GPU Support

User Experience

SkyPilot templates

SkyPilot now ships predefined YAML templates for launching clusters with popular frameworks. Templates are automatically available on all new SkyPilot clusters.

Launch a multi-node Ray cluster with a single line:

run: |
  # One-line setup for a distributed Ray cluster
  ~/.sky/templates/ray/start_cluster

  # Submit your job
  python train.py  

Improved Python SDK

The SkyPilot Python SDK (docs) is significantly improved with:

Type hints for better IDE support and code completion:

Type Hints

Log streaming for real-time job monitoring:

import sky

# Stream logs from a running job (identified by cluster_name and job_id) and react to lines as they arrive.
logs = sky.tail_logs(cluster_name, job_id, follow=True, preload_content=False)
for line in logs:
    if line is not None and 'needle in the haystack' in line:
        print("found it!")
        break
logs.close()

Admin policy helpers (docs) for building policies programmatically:

# Inside an admin policy: rewrite the user's requested resources to use spot instances.
resource_config = user_request.task.get_resource_config()
resource_config['use_spot'] = True
user_request.task.set_resources(resource_config)

GPU count in setup: the new SKYPILOT_SETUP_NUM_GPUS_PER_NODE environment variable is available during the setup phase for configuring software based on the per-node GPU count (env vars docs).
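
For example, setup commands can branch on the detected GPU count; the accelerator type and packages below are just placeholders:

# task.sky.yaml
resources:
  accelerators: H100:8

setup: |
  echo "Setting up for $SKYPILOT_SETUP_NUM_GPUS_PER_NODE GPUs per node"
  if [ "$SKYPILOT_SETUP_NUM_GPUS_PER_NODE" -gt 1 ]; then
    uv pip install vllm  # install a multi-GPU serving stack only when needed
  fi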

CI/CD integration

With the improved SDK, you can integrate SkyPilot with GitHub Actions and other orchestrators to automatically spin up your AI workloads:
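
For instance, a GitHub Actions workflow can install SkyPilot, point the CLI at your team's API server, and launch a job on every push. The repository layout, secret name, and task file below are hypothetical, and the sketch uses the CLI for brevity (the Python SDK works equally well here):

# .github/workflows/launch.yaml (sketch)
name: launch-training
on:
  push:
    branches: [main]

jobs:
  launch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -U "skypilot>=0.11.0"
      # Connect to the team's SkyPilot API server (secret name is hypothetical).
      - run: sky api login -e ${{ secrets.SKYPILOT_API_SERVER_ENDPOINT }}
      # Launch the training task defined in this repo.
      - run: sky jobs launch -y train.sky.yaml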

CI/CD Integration

Native Git support

Use private git repositories directly as your SkyPilot workdir (docs). SkyPilot handles cloning and syncing automatically:

# task.sky.yaml
workdir:
  url: https://github.com/my-org/my-repo.git
  ref: 1234ab  # commit hash or branch name

You can also use --git-url and --git-ref options with sky serve up. View commit hashes directly in the Dashboard.

Git Commit in Dashboard

Autostop based on SSH sessions

Configure autostop/autodown to wait for active SSH sessions in addition to running jobs (docs):

# task.sky.yaml
resources:
  autostop:
    wait_for: jobs_and_ssh

Distributed LLM examples

We released high-performance distributed training examples with checkpointing support:

Distributed Training Examples

Other improvements

  • Air-gapped deployments supported via private container registries in Helm charts
  • Resolved OOM issues with the API server
  • Pod status is now visible while SkyPilot clusters launch on Kubernetes
  • Post-provision commands on Kubernetes with post_provision_runcmd for custom pod initialization (see the sketch after this list)
  • Fixed --retry-until-up to properly retry failed launches across all cloud zones
  • GCP B200 spot instances now supported for cost-effective access to the latest NVIDIA GPUs
  • AWS Trainium & Inferentia dynamic accelerator detection via the AWS API
  • Together AI instant cluster support for fast GPU access (docs)
  • Seeweb cloud provider with Docker image support (docs)
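
For the post-provision commands mentioned in the list above, the configuration might look roughly like the following; nesting under the kubernetes section of config.yaml is an assumption, so check the docs for the exact key path:

# config.yaml (sketch; key path is an assumption, see the docs)
kubernetes:
  post_provision_runcmd:
    - nvidia-smi                                    # e.g. sanity-check GPU visibility
    - echo "pod initialized" >> /tmp/provision.log  # example custom initialization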

Many more improvements and fixes in the full release notes.

Get started today

SkyPilot 0.11 makes running AI workloads across clouds and Kubernetes more efficient than ever. With Pools for batch inference, faster managed jobs, and enterprise-scale improvements, you can focus on building AI while SkyPilot handles the infrastructure.

Install SkyPilot 0.11:

uv pip install -U "skypilot>=0.11.0"

Or upgrade your team SkyPilot API server:

NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.11.0

helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
  --set apiService.image=berkeleyskypilot/skypilot:$VERSION \
  --version $VERSION --devel --reuse-values

Check out the documentation to get started.


Thanks to the SkyPilot community for contributing PRs and feedback that helped shape this release!

To receive the latest updates, please star and watch the project’s GitHub repo, follow @skypilot_org, or join the SkyPilot community Slack.