This is Part 2 of our series on the evolution of AI Job Orchestration. In Part 1, we explored how Neoclouds are democratizing GPU access but leaving the “last mile” unsolved. Now we’ll look at how AI-native orchestration tools are bridging that gap.
We Need an AI-Native Control Plane for Any Infrastructure
While Neoclouds, specialized GPU cloud providers, have solved the hardware accessibility problem by offering cost-effective, high-performance clusters with advanced networking like InfiniBand, they’ve left a critical gap: the orchestration layer. Most hand you access to a Kubernetes or Slurm cluster and assume you’ll magically figure out how to run your training and inference jobs on it - but Kubernetes wasn’t designed for the iterative, resource-intensive nature of ML workloads.
This is where SkyPilot enters the picture, and honestly, it feels like someone finally understood the assignment. Rather than forcing ML teams to become Kubernetes experts, SkyPilot provides an AI-native abstraction that makes GPU clusters actually usable for machine learning.
Why SkyPilot + Neoclouds Is a Game Changer
This combination solves a problem that neither piece could address alone. Neoclouds give you access to high-performance GPU clusters, but each provider operates in isolation. When capacity runs out, you’re stuck waiting or manually hunting for alternatives across different dashboards and APIs.
Imagine this scenario: Your team needs to launch a distributed training job requiring 16 H100s. You typically use Lambda Labs. But when you try to launch, Lambda’s cluster is fully occupied by another team’s month-long foundation model training run.
With vanilla Kubernetes, you’re dead in the water. You’d need to either wait indefinitely for Lambda’s capacity to free up, or spend hours reconfiguring your workload for a different provider - updating network settings, transferring data, and debugging environment differences.
With SkyPilot, you just specify:
```yaml
resources:
  infra: k8s
  accelerators: H100:8
num_nodes: 2
...
```
and it automatically discovers that Lambda’s cluster is at capacity, then seamlessly fails over to Nebius where 16 H100s are available. Your job launches immediately on Nebius’s InfiniBand-connected cluster, using the same training configuration that would have worked on Lambda. The failover is transparent - your training script doesn’t even know it’s running on a different provider.
SkyPilot orchestrates AI workloads across multiple Neocloud Kubernetes clusters with automatic failover capabilities.
This isn’t just convenience; it’s a fundamental shift in how you think about GPU capacity. Instead of being locked into the availability constraints of a single provider, SkyPilot transforms your entire portfolio of Neocloud accounts into one large, distributed compute fabric where capacity constraints from individual providers become irrelevant.
Automatic failover: SkyPilot seamlessly redirects workloads from capacity-constrained clusters to available resources without manual intervention.
Solving the Pain Points
Simplicity is king. Instead of wrestling with multiple Kubernetes manifests, SkyPilot lets you define your workload in a single, human-readable YAML file. Where you might need dozens of lines of Kubernetes configuration to specify resources, networking, and storage, SkyPilot distills it down to the essentials:
```yaml
# my-training-job.yaml
name: train-my-large-model

resources:
  infra: k8s
  accelerators: H100:8  # or e.g. H100:1, A100:8

# upload a working directory to remote ~/sky_workdir.
workdir: .

setup: |  # install dependencies
  uv pip install -r requirements.txt

run: |  # run multi-gpu training
  torchrun \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    main.py
```
That’s it. One short YAML file replaces several long Kubernetes manifests. The complexity is still there - SkyPilot is handling pod creation, resource allocation, and networking behind the scenes - but it’s abstracted away from the user (see “AI on Kubernetes Without the Pain”).
Interactivity is built-in. SkyPilot understands that ML development is interactive. SSH access, IDE integration, and Jupyter notebook support are first-class features. You can launch a cluster, SSH into it, and work on your code as if it were a local machine. No more rebuilding Docker images for every code change.
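For instance, a minimal interactive loop might look like this (the cluster name dev and the single-GPU request are just placeholders):

```bash
# Spin up a small interactive dev cluster on whatever infra has a free GPU.
sky launch -c dev --gpus H100:1

# SSH in as if it were a local machine; VSCode Remote and Jupyter attach the same way.
ssh dev
```

When you’re done iterating, sky down dev tears the cluster back down.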

Connecting VSCode to a remote development cluster. Source: https://blog.skypilot.co/ai-on-kubernetes/
ML-aware scheduling. SkyPilot knows about gang scheduling and can handle the complexities of distributed training out of the box. It’s not fighting against the nature of ML workloads - it’s designed around them.
Automated cost optimization and failover. This is where SkyPilot really shines. It can automatically find the cheapest available GPUs across all your enabled clouds and regions. If your primary cluster is full, it can burst to other providers to ensure your workloads are never blocked.
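As a hedged sketch of what that preference ordering can look like in a task file (the context names are illustrative, and the exact syntax for pinning a Kubernetes context depends on your SkyPilot version):

```yaml
resources:
  # Candidates are tried in order: if lambda-cluster has no free H100s,
  # SkyPilot falls back to nebius-cluster automatically.
  ordered:
    - infra: k8s/lambda-cluster
      accelerators: H100:8
    - infra: k8s/nebius-cluster
      accelerators: H100:8
```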
High-Level Workflow: Launching an AI Job on a Neocloud’s K8s with SkyPilot
Let me walk you through what this actually looks like in practice:
Step 1: Configure Your Infrastructure
Point SkyPilot to your Neocloud’s Kubernetes cluster by ensuring your kubeconfig is properly set up (see Kubernetes Deployment).
Web Console for Nebius AI, one of the Neocloud providers. Shown are Kubernetes dashboard, Compute Instances and Monitoring.
SkyPilot will automatically detect and enable the cluster. You’ll also want to configure cloud bucket mounting, NFS or shared filesystem access for your data.
This is typically a one-time setup that your infrastructure team handles. Once it’s done, ML engineers don’t need to think about it.
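As a sketch, that one-time wiring can be as small as listing the kubeconfig contexts SkyPilot is allowed to use in ~/.sky/config.yaml (the context names below are illustrative):

```yaml
# ~/.sky/config.yaml
kubernetes:
  # Restrict SkyPilot to these kubeconfig contexts; omit to use the current context.
  allowed_contexts:
    - nebius-cluster
    - lambda-cluster
```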
Step 2: Define Your Task
Write a simple YAML file specifying your code, setup requirements, and resource needs:
```yaml
# my-training-job.yaml
name: train-my-large-model

resources:
  infra: k8s
  accelerators: H100:8  # or e.g. H100:1, A100:8

# upload a working directory to remote ~/sky_workdir.
workdir: .

file_mounts:
  /my_data:
    source: s3://my-bucket/  # GCS, Azure Blob, R2 also supported
    mode: MOUNT

setup: |  # install dependencies
  uv pip install -r requirements.txt

run: |  # run multi-gpu training
  torchrun \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    main.py
```
This single file replaces what would be multiple Kubernetes manifests with complex resource specifications, init containers, and service definitions.
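If you later scale the same task to multiple nodes (for example, the 16×H100 scenario from earlier), only the node count and the run section need to grow. Here is a sketch using SkyPilot’s injected environment variables (the rendezvous port 8008 is an arbitrary choice):

```yaml
num_nodes: 2

run: |
  # SkyPilot exports the node IPs and this node's rank; use the first IP as the rendezvous host.
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --node_rank=$SKYPILOT_NODE_RANK \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    main.py
```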
Testing Your InfiniBand Setup
Want to know if your InfiniBand setup is actually working? Run an NCCL all-reduce test. This is the networking equivalent of a stress test - it exercises the exact communication patterns your distributed training will use.
The test pushes data between all GPUs in your cluster simultaneously, measuring both bandwidth and latency under realistic conditions. If you’re seeing close to theoretical bandwidth (think 300+ GB/s for H100s over InfiniBand), you’re in good shape. If not, something’s misconfigured.
Below is an example of how to run an NCCL all-reduce test with SkyPilot.
NCCL All-Reduce test with SkyPilot on a Neocloud’s Kubernetes
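As a rough single-node sanity check along those lines, a sketch might look like the following (the container image and build steps are assumptions, and a true inter-node InfiniBand test additionally needs num_nodes > 1 with an MPI-style launcher):

```yaml
# nccl-test.yaml -- illustrative sketch, not a drop-in multi-node benchmark.
name: nccl-all-reduce-test

resources:
  infra: k8s
  accelerators: H100:8
  image_id: docker:nvcr.io/nvidia/pytorch:24.05-py3  # assumed to ship CUDA + NCCL

run: |
  # Build NVIDIA's nccl-tests, then sweep message sizes from 1 MB to 8 GB
  # across all local GPUs and report algorithm/bus bandwidth.
  git clone https://github.com/NVIDIA/nccl-tests.git
  cd nccl-tests && make -j
  ./build/all_reduce_perf -b 1M -e 8G -f 2 -g $SKYPILOT_NUM_GPUS_PER_NODE
```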
Step 3: Launch with a Single Command
Run sky launch my-training-job.yaml. SkyPilot handles everything else: provisioning the pods, syncing your code, running the setup commands, and executing your job.
No need to manually create deployments, services, or persistent volumes. No need to debug why your pods are stuck in “Pending” state. SkyPilot abstracts all of that complexity.
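In practice that looks like this (the cluster name train is a placeholder):

```bash
# First launch: provisions pods, syncs the workdir, runs setup, starts the job.
sky launch -c train my-training-job.yaml

# Subsequent iterations: reuse the same cluster and skip provisioning/setup.
sky exec train my-training-job.yaml
```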
Step 4: Monitor and Manage with the SkyPilot Dashboard
Access your centralized management interface at http://<api-server-url>/dashboard. The SkyPilot Dashboard provides real-time visibility and control across your entire AI infrastructure.
It gives you immediate insight into resource utilization, job progress, and infrastructure health. Team members can see shared clusters and coordinate GPU usage without complex permissions setup.
Key Dashboard Views
- Clusters: Monitor resource utilization and status across all clouds.
- Jobs: Track training runs, check logs, and manage queued workloads.
- Workspaces: Manage team isolation and resource allocation. Create workspaces for different projects with configurable access permissions.
- Infrastructure: Unified view of connected cloud accounts and Kubernetes clusters across AWS, GCP, Azure, and Neoclouds.
- Users: Oversee user access, workspace assignments, and usage patterns.
CLI Integration
SkyPilot’s unified CLI interface works alongside the dashboard:
- sky status shows all your running clusters.
- sky queue displays your job queue across all clouds.
- sky logs streams logs from any job, regardless of where it’s running.
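For example (the cluster name train and job ID 1 are placeholders):

```bash
sky status         # all clusters, across clouds and Kubernetes contexts
sky queue train    # queued and running jobs on the "train" cluster
sky logs train 1   # stream logs for job 1, wherever it is running
```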
At the same time, you can still use kubectl commands or Kubernetes UI tools like k9s or Lens to inspect the underlying resources when you need that level of detail.
Multi-cluster orchestration that actually works
One of SkyPilot’s most powerful features is its ability to treat multiple Kubernetes clusters as a unified compute fabric. While vanilla Kubernetes locks you into a single cluster, SkyPilot gives you the freedom to work across your entire infrastructure ecosystem.
Unified visibility across clusters. Instead of juggling multiple kubectl contexts and dashboards, SkyPilot provides a single interface to see all your available compute resources:
```
$ sky check k8s
Checking credentials to enable infra for SkyPilot.
Checking compute credentials for kubernetes
  Kubernetes: enabled [compute]
    Allowed contexts:
    ├── nebius-cluster: enabled.
    └── lambda-cluster: enabled.

🎉 Enabled infra 🎉
  Kubernetes [compute]
    Allowed contexts:
    ├── nebius-cluster
    └── lambda-cluster
```
GPU discovery across your entire fleet. Need to find available H100s? SkyPilot searches across all your configured clusters automatically:
```
$ sky show-gpus --infra k8s
Kubernetes GPUs
GPU   UTILIZATION
H200  24 of 24 free
H100  24 of 24 free

Context: nebius-cluster
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
H100  1, 2, 4, 8                24 of 24 free

Context: lambda-cluster
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
H200  1, 2, 4, 8                24 of 24 free

Kubernetes per-node GPU availability
CONTEXT         NODE           GPU   UTILIZATION
nebius-cluster  <node_id-...>  H100  8 of 8 free
nebius-cluster  <node_id-...>  H100  8 of 8 free
nebius-cluster  <node_id-...>  H100  8 of 8 free
lambda-cluster  <node_id-...>  H200  8 of 8 free
lambda-cluster  <node_id-...>  H200  8 of 8 free
lambda-cluster  <node_id-...>  H200  8 of 8 free
```
Intelligent failover. When your primary cluster is at capacity, SkyPilot automatically fails over to the next available cluster with the resources you need. No manual intervention, no downtime, no frustrated ML engineers waiting for GPUs.
This multi-cluster orchestration means your team can focus on model development while SkyPilot handles the complexity of resource discovery and allocation across your entire infrastructure landscape.
Neocloud K8s + SkyPilot = ❤️
The Problem We Solved:
- Neoclouds democratized GPU access but left the “last mile” unsolved
- Vanilla Kubernetes creates friction for ML teams who need to iterate fast
- ML engineers shouldn’t need to become Kubernetes experts to train models
Why This Combination Works:
- Neoclouds provide cost-effective, high-performance GPU clusters with InfiniBand
- SkyPilot adds the missing AI-native control plane that ML teams actually want to use
- Together, they create a unified compute fabric across multiple providers
What You Get:
- One YAML file replaces dozens of Kubernetes manifests
- Automatic failover when your primary cluster hits capacity
- Built-in SSH access, IDE integration, and interactive development
- ML-aware scheduling that understands distributed training requirements
- Cost optimization across your entire Neocloud portfolio
The Bottom Line: Infrastructure should empower ML engineers, not bog them down. The fastest-iterating AI teams will win, and iteration speed is inversely correlated with infrastructure friction. SkyPilot + Neoclouds finally makes Kubernetes “just work” for machine learning.