This is Part 2 of our series on the evolution of AI Job Orchestration. In Part 1, we explored how Neoclouds are democratizing GPU access but leaving the “last mile” unsolved. Now we’ll look at how AI-native orchestration tools are bridging that gap.
We Need an AI-Native Control Plane for Any Infrastructure
While Neoclouds, specialized GPU cloud providers, have solved the hardware accessibility problem by offering cost-effective, high-performance clusters with advanced networking like InfiniBand, they’ve left a critical gap: the orchestration layer. Most hand you access to a Kubernetes or Slurm cluster and assume you’ll magically figure out how to run your training and inference jobs on it - but Kubernetes wasn’t designed for the iterative, resource-intensive nature of ML workloads.
This is where SkyPilot enters the picture, and honestly, it feels like someone finally understood the assignment. Rather than forcing ML teams to become Kubernetes experts, SkyPilot provides an AI-native abstraction that makes GPU clusters actually usable for machine learning.
Why SkyPilot + Neoclouds Is a Game Changer
This combination solves a problem that neither piece could address alone. Neoclouds give you access to high-performance GPU clusters, but each provider operates in isolation. When capacity runs out, you’re stuck waiting or manually hunting for alternatives across different dashboards and APIs.
Imagine this scenario: Your team needs to launch a distributed training job requiring 16 H100s. You typically use Lambda Labs. But when you try to launch, Lambda’s cluster is fully occupied by another team’s month-long foundation model training run.
With vanilla Kubernetes, you’re dead in the water. You’d need to either wait indefinitely for Lambda’s capacity to free up, or spend hours reconfiguring your workload for a different provider - updating network settings, transferring data, and debugging environment differences.
With SkyPilot, you just specify:
```yaml
resources:
  infra: k8s
  accelerators: H100:8
num_nodes: 2
...
```
and it automatically discovers that Lambda’s cluster is at capacity, then seamlessly fails over to Nebius where 16 H100s are available. Your job launches immediately on Nebius’s InfiniBand-connected cluster, using the same training configuration that would have worked on Lambda. The failover is transparent - your training script doesn’t even know it’s running on a different provider.
SkyPilot orchestrates AI workloads across multiple Neocloud Kubernetes clusters with automatic failover capabilities.
This isn’t just convenience; it’s a fundamental shift in how you think about GPU capacity. Instead of being locked into the availability constraints of a single provider, SkyPilot transforms your entire portfolio of Neocloud accounts into one large, distributed compute fabric where capacity constraints from individual providers become irrelevant.
Automatic failover: SkyPilot seamlessly redirects workloads from capacity-constrained clusters to available resources without manual intervention.
Solving the Pain Points
Simplicity is king. Instead of wrestling with multiple Kubernetes manifests, SkyPilot lets you define your workload in a single, human-readable YAML file. Where you might need dozens of lines of Kubernetes configuration to specify resources, networking, and storage, SkyPilot distills it down to the essentials:
```yaml
# my-training-job.yaml
name: train-my-large-model

resources:
  infra: k8s
  accelerators: H100:8  # or e.g. H100:1, A100:8

# upload a working directory to remote ~/sky_workdir.
workdir: .

setup: |  # install dependencies
  uv pip install -r requirements.txt

run: |  # run multi-gpu training
  torchrun \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    main.py
```
That’s it. One short YAML file replaces several long Kubernetes manifests. The complexity is still there - SkyPilot is handling pod creation, resource allocation, and networking behind the scenes - but it’s abstracted away from the user (see “AI on Kubernetes Without the Pain”).
Interactivity is built-in. SkyPilot understands that ML development is interactive. SSH access, IDE integration, and Jupyter notebook support are first-class features. You can launch a cluster, SSH into it, and work on your code as if it were a local machine. No more rebuilding Docker images for every code change.
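For instance, a minimal interactive loop might look like this (the cluster name dev and the single-GPU request are just placeholders):

```bash
# Spin up a small interactive dev cluster on whatever infra has a free GPU.
sky launch -c dev --gpus H100:1

# SSH in as if it were a local machine; VSCode Remote and Jupyter attach the same way.
ssh dev
```

When you’re done iterating, sky down dev tears the cluster back down.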

Connecting VSCode to a remote development cluster. Source: https://blog.skypilot.co/ai-on-kubernetes/
ML-aware scheduling. SkyPilot knows about gang scheduling and can handle the complexities of distributed training out of the box. It’s not fighting against the nature of ML workloads - it’s designed around them.
Automated cost optimization and failover. This is where SkyPilot really shines. It can automatically find the cheapest available GPUs across all your enabled clouds and regions. If your primary cluster is full, it can burst to other providers to ensure your workloads are never blocked.
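As a hedged sketch of what that preference ordering can look like in a task file (the context names are illustrative, and the exact syntax for pinning a Kubernetes context depends on your SkyPilot version):

```yaml
resources:
  # Candidates are tried in order: if lambda-cluster has no free H100s,
  # SkyPilot falls back to nebius-cluster automatically.
  ordered:
    - infra: k8s/lambda-cluster
      accelerators: H100:8
    - infra: k8s/nebius-cluster
      accelerators: H100:8
```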
High-Level Workflow: Launching an AI Job on a Neocloud’s K8s with SkyPilot
Let me walk you through what this actually looks like in practice:
Step 1: Configure Your Infrastructure
Point SkyPilot to your Neocloud’s Kubernetes cluster by ensuring your kubeconfig is properly set up (see Kubernetes Deployment).
Web Console for Nebius AI, one of the Neocloud providers. Shown are Kubernetes dashboard, Compute Instances and Monitoring.
SkyPilot will automatically detect and enable the cluster. You’ll also want to configure cloud bucket mounting, NFS or shared filesystem access for your data.
This is typically a one-time setup that your infrastructure team handles. Once it’s done, ML engineers don’t need to think about it.
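As a sketch, that one-time wiring can be as small as listing the kubeconfig contexts SkyPilot is allowed to use in ~/.sky/config.yaml (the context names below are illustrative):

```yaml
# ~/.sky/config.yaml
kubernetes:
  # Restrict SkyPilot to these kubeconfig contexts; omit to use the current context.
  allowed_contexts:
    - nebius-cluster
    - lambda-cluster
```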
Step 2: Define Your Task
Write a simple YAML file specifying your code, setup requirements, and resource needs:
```yaml
# my-training-job.yaml
name: train-my-large-model

resources:
  infra: k8s
  accelerators: H100:8  # or e.g. H100:1, A100:8

# upload a working directory to remote ~/sky_workdir.
workdir: .

file_mounts:
  /my_data:
    source: s3://my-bucket/  # GCS, Azure Blob, R2 also supported
    mode: MOUNT

setup: |  # install dependencies
  uv pip install -r requirements.txt

run: |  # run multi-gpu training
  torchrun \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    main.py
```
This single file replaces what would be multiple Kubernetes manifests with complex resource specifications, init containers, and service definitions.
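If you later scale the same task to multiple nodes (for example, the 16×H100 scenario from earlier), only the node count and the run section need to grow. Here is a sketch using SkyPilot’s injected environment variables (the rendezvous port 8008 is an arbitrary choice):

```yaml
num_nodes: 2

run: |
  # SkyPilot exports the node IPs and this node's rank; use the first IP as the rendezvous host.
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --node_rank=$SKYPILOT_NODE_RANK \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    main.py
```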
Testing Your InfiniBand Setup
Want to know if your InfiniBand setup is actually working? Run an NCCL all-reduce test. This is the networking equivalent of a stress test - it exercises the exact communication patterns your distributed training will use.
The test pushes data between all GPUs in your cluster simultaneously, measuring both bandwidth and latency under realistic conditions. If you’re seeing close to theoretical bandwidth (think 300+ GB/s for H100s over InfiniBand), you’re in good shape. If not, something’s misconfigured.
Below is an example of how to run an NCCL all-reduce test with SkyPilot.
NCCL All-Reduce test with SkyPilot on a Neocloud’s Kubernetes
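As a rough single-node sanity check along those lines, a sketch might look like the following (the container image and build steps are assumptions, and a true inter-node InfiniBand test additionally needs num_nodes > 1 with an MPI-style launcher):

```yaml
# nccl-test.yaml -- illustrative sketch, not a drop-in multi-node benchmark.
name: nccl-all-reduce-test

resources:
  infra: k8s
  accelerators: H100:8
  image_id: docker:nvcr.io/nvidia/pytorch:24.05-py3  # assumed to ship CUDA + NCCL

run: |
  # Build NVIDIA's nccl-tests, then sweep message sizes from 1 MB to 8 GB
  # across all local GPUs and report algorithm/bus bandwidth.
  git clone https://github.com/NVIDIA/nccl-tests.git
  cd nccl-tests && make -j
  ./build/all_reduce_perf -b 1M -e 8G -f 2 -g $SKYPILOT_NUM_GPUS_PER_NODE
```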
Step 3: Launch with a Single Command
Run sky launch my-training-job.yaml. SkyPilot handles everything else: provisioning the pods, syncing your code, running the setup commands, and executing your job.
No need to manually create deployments, services, or persistent volumes. No need to debug why your pods are stuck in “Pending” state. SkyPilot abstracts all of that complexity.
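In practice that looks like this (the cluster name train is a placeholder):

```bash
# First launch: provisions pods, syncs the workdir, runs setup, starts the job.
sky launch -c train my-training-job.yaml

# Subsequent iterations: reuse the same cluster and skip provisioning/setup.
sky exec train my-training-job.yaml
```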
Step 4: Monitor and Manage with the SkyPilot Dashboard
Access your centralized management interface at http://<api-server-url>/dashboard. The SkyPilot Dashboard provides real-time visibility and control across your entire AI infrastructure.
It gives you immediate insight into resource utilization, job progress, and infrastructure health. Team members can see shared clusters and coordinate GPU usage without complex permissions setup.
Key Dashboard Views
- Clusters: Monitor resource utilization and status across all clouds.
- Jobs: Track training runs, check logs, and manage queued workloads.
- Workspaces: Manage team isolation and resource allocation. Create workspaces for different projects with configurable access permissions.
- Infrastructure: Unified view of connected cloud accounts and Kubernetes clusters across AWS, GCP, Azure, and Neoclouds.
- Users: Oversee user access, workspace assignments, and usage patterns.
CLI Integration
SkyPilot’s unified CLI interface works alongside the dashboard:
- sky status shows all your running clusters.
- sky queue displays your job queue across all clouds.
- sky logs streams logs from any job, regardless of where it’s running.
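For example (the cluster name train and job ID 1 are placeholders):

```bash
sky status         # all clusters, across clouds and Kubernetes contexts
sky queue train    # queued and running jobs on the "train" cluster
sky logs train 1   # stream logs for job 1, wherever it is running
```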
At the same time, you can still use kubectl commands or Kubernetes UI tools like k9s or Lens to inspect the underlying resources when you need that level of detail.
Multi-cluster orchestration that actually works
One of SkyPilot’s most powerful features is its ability to treat multiple Kubernetes clusters as a unified compute fabric. While vanilla Kubernetes locks you into a single cluster, SkyPilot gives you the freedom to work across your entire infrastructure ecosystem.
Unified visibility across clusters. Instead of juggling multiple kubectl contexts and dashboards, SkyPilot provides a single interface to see all your available compute resources:
```
$ sky check k8s
Checking credentials to enable infra for SkyPilot.
Checking compute credentials for kubernetes
  Kubernetes: enabled [compute]
    Allowed contexts:
    ├── nebius-cluster: enabled.
    └── lambda-cluster: enabled.

🎉 Enabled infra 🎉
  Kubernetes [compute]
    Allowed contexts:
    ├── nebius-cluster
    └── lambda-cluster
```
GPU discovery across your entire fleet. Need to find available H100s? SkyPilot searches across all your configured clusters automatically:
```
$ sky show-gpus --infra k8s
Kubernetes GPUs
GPU   UTILIZATION
H200  24 of 24 free
H100  24 of 24 free

Context: nebius-cluster
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
H100  1, 2, 4, 8                24 of 24 free

Context: lambda-cluster
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
H200  1, 2, 4, 8                24 of 24 free

Kubernetes per-node GPU availability
CONTEXT         NODE           GPU   UTILIZATION
nebius-cluster  <node_id-...>  H100  8 of 8 free
nebius-cluster  <node_id-...>  H100  8 of 8 free
nebius-cluster  <node_id-...>  H100  8 of 8 free
lambda-cluster  <node_id-...>  H200  8 of 8 free
lambda-cluster  <node_id-...>  H200  8 of 8 free
lambda-cluster  <node_id-...>  H200  8 of 8 free
```
Intelligent failover. When your primary cluster is at capacity, SkyPilot automatically fails over to the next available cluster with the resources you need. No manual intervention, no downtime, no frustrated ML engineers waiting for GPUs.
This multi-cluster orchestration means your team can focus on model development while SkyPilot handles the complexity of resource discovery and allocation across your entire infrastructure landscape.
Neocloud K8s + SkyPilot = ❤️
The Problem We Solved:
- Neoclouds democratized GPU access but left the “last mile” unsolved
- Vanilla Kubernetes creates friction for ML teams who need to iterate fast
- ML engineers shouldn’t need to become Kubernetes experts to train models
Why This Combination Works:
- Neoclouds provide cost-effective, high-performance GPU clusters with InfiniBand
- SkyPilot adds the missing AI-native control plane that ML teams actually want to use
- Together, they create a unified compute fabric across multiple providers
What You Get:
- One YAML file replaces dozens of Kubernetes manifests
- Automatic failover when your primary cluster hits capacity
- Built-in SSH access, IDE integration, and interactive development
- ML-aware scheduling that understands distributed training requirements
- Cost optimization across your entire Neocloud portfolio
The Bottom Line: Infrastructure should empower ML engineers, not bog them down. The fastest-iterating AI teams will win, and iteration speed is inversely correlated with infrastructure friction. SkyPilot + Neoclouds finally makes Kubernetes “just work” for machine learning.