If you’re an infrastructure or MLOps engineer at a large company, you know the drill. The ML team comes to you with requirements that change weekly. They need GPUs yesterday, but the budget was set six months ago. They want to use the latest framework, but it breaks your carefully crafted Kubernetes deployments. They need to comply with data locality requirements while also optimizing for cost.
Sound familiar? You’re not alone, and there’s a better way.
There’s something ironic about the current state of AI infrastructure. We’re living through one of the most exciting periods in computing history - where a well-crafted prompt can generate poetry, code, or entire business plans - yet the very infrastructure powering this revolution feels like it was never designed with large-scale model training in mind.
It’s 2025, and we’re still asking AI Researchers and ML Engineers to become Kubernetes experts just to run a training job. Something’s got to give.
The GPU Gold Rush and the Rise of Neoclouds
Enter the AI Neoclouds
Let’s start with what’s actually working. Over the past few years, we’ve witnessed the emergence of so-called AI Neoclouds - specialized cloud providers like CoreWeave, Fluidstack, Lambda Labs, Nebius and Crusoe that have built their entire business model around one simple premise: rent GPUs, lots of them, and make them affordable.
These aren’t your traditional cloud providers trying to be everything to everyone. They’re laser-focused on solving the GPU scarcity problem in the AI community. While AWS and Google Cloud are still treating GPUs like precious gems hidden behind complex pricing tiers and availability zones, the Neoclouds are saying, “Here, take our H100s. We have thousands of them, and we’ll give you a better price than the hyperscalers.”
It’s a beautiful thing, really. The Neoclouds have democratized access to the computational power that was once the exclusive domain of Big Tech.
Neocloud landscape. Source: https://semianalysis.com/2024/10/03/ai-neocloud-playbook-and-anatomy/
The Neocloud Advantage: Built for AI Performance and Cost-Efficiency
While the hyperscalers are still treating high-performance networking like an afterthought, the Neoclouds have made InfiniBand (IB) a core part of their value proposition. Plus, they typically offer their high-performance GPU clusters at prices 30-50% (or more) below AWS or Google Cloud, making cutting-edge AI infrastructure much more accessible.
What InfiniBand Actually Does (And Why You Should Care)
IB isn’t just faster networking - it’s a fundamentally different approach to how GPUs talk to each other. Traditional Ethernet networking forces data to take a scenic route through the CPU and kernel, copying data multiple times and burning precious cycles on context switching. IB with GPUDirect RDMA says “why don’t we just let the GPUs talk directly?”
The result? Data flows directly between GPU memory and network interfaces at speeds that make Ethernet look like pigeon mail. We’re talking 400 Gb/s per port, with some Neocloud setups pushing 3.2 Tb/s of aggregate bandwidth per node. When you’re training a model with billions of parameters across dozens of GPUs, this isn’t a luxury - it’s a requirement. IB can speed up training by 2-3x (see 1, 2).

This is where the Neoclouds have made a strategic decision: CoreWeave, Lambda Labs, and Nebius have built their entire infrastructure around IB fabrics. Take Lambda’s 1-Click Clusters: every GPU node comes with NVIDIA Quantum-2 400 Gb/s InfiniBand in a rail-optimized topology, enabling peer-to-peer GPUDirect RDMA out of the box. CoreWeave’s or Nebius’s H100/H200 clusters? Same story. These aren’t marketing gimmicks - they’re architectural choices that recognize the reality of modern AI workloads.
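Want to check whether your nodes are actually using that fabric? A quick NCCL smoke test tells you in seconds. Here’s a minimal sketch (assuming a CUDA-enabled PyTorch environment and a hypothetical check_nccl_ib.py launched via torchrun); with NCCL_DEBUG=INFO, NCCL’s log shows whether it picked the IB transport or quietly fell back to plain sockets:

```python
# Minimal sketch: confirm that NCCL is using the InfiniBand transport.
# Assumes NVIDIA GPUs, a CUDA build of PyTorch, and OFED on the node.
# Launch (hypothetical filename): torchrun --nproc_per_node=8 check_nccl_ib.py
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")  # make NCCL print its transport selection

def main():
    dist.init_process_group(backend="nccl")  # torchrun provides RANK, WORLD_SIZE, MASTER_ADDR
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # One all-reduce is enough to trigger log lines such as
    # "NET/IB : Using [0]mlx5_0:1/IB" when GPUDirect RDMA over IB is in play.
    x = torch.ones(1024 * 1024, device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print("all_reduce OK - grep the NCCL INFO output for NET/IB vs NET/Socket")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```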
Kubernetes as the Foundation
Here’s where things get interesting… and complicated. Most Neoclouds have standardized on Kubernetes as their orchestration layer (for examples, see 1, 2, 3 and 4).
On paper, this makes perfect sense. Kubernetes is battle-tested, scalable, and has become the de facto standard for container orchestration. It’s what every infrastructure/MLOps team knows, and it’s what every enterprise expects.
However, here’s the thing about Kubernetes: it’s a brilliant solution for orchestrating scalable web services composed of hundreds of microservices, not necessarily for machine learning and AI workloads, especially at the sheer scale required to train modern, multi-billion-parameter models.
Making Kubernetes Play Nice with InfiniBand
Getting InfiniBand working on Kubernetes requires a few moving pieces, but the Neoclouds have done most of the heavy lifting for you.
Driver Setup: The OFED (OpenFabrics Enterprise Distribution) drivers need to be installed on every node. Most Neoclouds pre-install these at the kernel level, so you don’t have to worry about driver compatibility nightmares.
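If you ever do want to double-check, the kernel makes it easy to see what’s there. Here’s a minimal sketch (assuming the standard OFED/rdma-core sysfs layout; actual device names like mlx5_0 vary by provider and instance type) that lists the HCAs and their link state:

```python
# Minimal sketch: list InfiniBand HCAs and per-port state/rate via sysfs.
# Assumes OFED/rdma-core exposes /sys/class/infiniband; device names and
# port counts differ between Neoclouds and instance types.
from pathlib import Path

IB_ROOT = Path("/sys/class/infiniband")

def list_hcas() -> None:
    if not IB_ROOT.exists():
        print("No /sys/class/infiniband - are the OFED/rdma-core drivers loaded?")
        return
    for hca in sorted(IB_ROOT.iterdir()):
        for port in sorted((hca / "ports").iterdir()):
            state = (port / "state").read_text().strip()  # e.g. "4: ACTIVE"
            rate = (port / "rate").read_text().strip()    # e.g. "400 Gb/sec (4X NDR)"
            print(f"{hca.name} port {port.name}: {state}, {rate}")

if __name__ == "__main__":
    list_hcas()
```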
NCCL Configuration: Your containers need the right environment variables to discover and use the InfiniBand interfaces, e.g.:
- `NCCL_SOCKET_IFNAME` for the network interface
- `NCCL_IB_HCA` for the host channel adapter
- `UCX_NET_DEVICES` for UCX communication
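As a concrete (and deliberately hedged) example, this is roughly what that wiring looks like inside a training container - the interface and HCA names below are placeholders, since the real values depend on your provider’s images and network layout:

```python
# Minimal sketch: point NCCL and UCX at the InfiniBand fabric before any
# distributed initialization runs. The values are placeholders - check your
# Neocloud's documentation or node images for the real interface/HCA names.
import os

os.environ["NCCL_SOCKET_IFNAME"] = "eth0"        # interface for NCCL's bootstrap traffic
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"      # host channel adapters NCCL may use
os.environ["UCX_NET_DEVICES"] = "mlx5_0:1"       # device:port list for UCX-based libraries
os.environ["NCCL_DEBUG"] = "INFO"                # log which transport NCCL actually selects

# ...then hand off to the usual launcher, e.g. torchrun / torch.distributed.
```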
The beauty of working with Neoclouds is that they’ve already figured out these configuration details. Neocloud VMs come with NCCL and OFED pre-installed or with container images with the right drivers baked in. All so that you can train models instead of debugging configs.
However, solving the infrastructure complexity is only half the battle.
Why Kubernetes Still Fails Your ML Team
Despite the Neoclouds’ infrastructure advantages, there’s still a fundamental mismatch between Kubernetes and ML workloads that creates friction for your team.
The Problem: When Infrastructure Gets in the Way of Innovation
The fundamental mismatch is that Kubernetes was designed for stateless services that require horizontal scaling. Machine learning workloads are stateful and resource-intensive. It’s like trying to use a Formula 1 car to haul furniture - technically possible, but you’re going to have a bad time.
The Steep Learning Curve Nobody Asked For
Let’s be honest about what we’re asking ML engineers to do. We’re saying: “Hey, you know that PhD in computer vision you spent five years earning? Great! Now, please also become an expert in container networking, service meshes, and YAML syntax.”
The cognitive overhead is enormous. Every time an ML engineer needs to run an experiment, they’re forced to context-switch from thinking about model architectures and loss functions to worrying about resource quotas, pod affinities, and node labels. It’s like asking a chef to also be an expert in restaurant HVAC systems - technically related to their job, but not where their expertise adds the most value.
Built for Services, Not Experiments
The typical ML development workflow looks nothing like a traditional web service deployment. ML development is inherently iterative and interactive. You write some code, run an experiment, look at the results, tweak the hyperparameters, and repeat. This cycle might happen dozens of times in a single day.
Kubernetes, with its emphasis on immutable containers and declarative deployments, makes this workflow painful. Want to change a single line of code? Rebuild the Docker image, push it to a registry, update your deployment manifest, and wait for the rollout to complete. What should be a 30-second iteration becomes a 10-minute ordeal.
The interactive nature of ML work - SSHing into machines, running Jupyter notebooks, debugging with print statements - feels like swimming upstream against Kubernetes’s design philosophy. You end up with hacky workarounds like long-running "development" pods that defeat the purpose of using Kubernetes in the first place.
The Gang Scheduling Problem
Here’s a technical detail that perfectly illustrates the mismatch: distributed training requires gang scheduling. When training a large model across multiple GPUs, all resources must be allocated simultaneously. If you can only get 7 out of 8 GPUs, the job can’t start - it’s all or nothing.
Vanilla Kubernetes doesn’t understand this. Its default scheduler will happily allocate resources as they become available, leading to deadlocks where multiple jobs are each holding some resources and waiting for others. The result is expensive GPUs sitting idle while jobs queue indefinitely.
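To make the deadlock concrete, here’s a toy sketch (pure illustration, not how any real scheduler is implemented): two 8-GPU jobs competing for a 14-GPU pool each end up stuck holding 7 GPUs under greedy allocation, while all-or-nothing gang admission lets one of them run to completion:

```python
# Toy illustration of why gang scheduling matters - not a real scheduler,
# just greedy per-GPU allocation vs. all-or-nothing gang admission.

def greedy_allocation(total_gpus, jobs):
    """Hand out GPUs one at a time, round-robin, as they 'become available'."""
    free = total_gpus
    held = {name: 0 for name, _ in jobs}
    progress = True
    while free > 0 and progress:
        progress = False
        for name, need in jobs:
            if held[name] < need and free > 0:
                held[name] += 1
                free -= 1
                progress = True
    running = [name for name, need in jobs if held[name] == need]
    return held, running

def gang_admission(total_gpus, jobs):
    """Admit a job only if its full request fits; otherwise it waits in queue."""
    free = total_gpus
    running = []
    for name, need in jobs:
        if need <= free:
            free -= need
            running.append(name)
    return running

jobs = [("job-a", 8), ("job-b", 8)]  # two distributed training jobs, 8 GPUs each

held, running = greedy_allocation(14, jobs)
print("greedy:", held, "running:", running)  # both stuck at 7 GPUs, nothing runs
print("gang:  ", gang_admission(14, jobs))   # ['job-a'] runs; job-b waits for a full 8
```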
The Single-Cluster Prison
Perhaps the most frustrating limitation is Kubernetes’s single-cluster worldview. Each cluster is an island, and moving workloads between clusters requires manual intervention. In a world where GPU availability fluctuates wildly and prices vary significantly across regions and providers, this inflexibility is a significant competitive disadvantage.
Your team’s productivity shouldn’t grind to a halt because a region on your preferred Neocloud ran out of H100s. But with vanilla Kubernetes, that’s precisely what happens. You’re locked into the availability and pricing of a single cluster, unable to leverage the broader ecosystem of GPU providers.
Trying to Bridge the Gap: AI Schedulers and Slurm-on-K8s
The friction between Kubernetes and large-scale AI is a well-known problem. In response, several batch schedulers that run on top of Kubernetes have gained popularity, such as Kubeflow, Volcano, and the recently open-sourced KAI-scheduler, all of which aim to provide more robust batch scheduling capabilities. However, many of these are still generic container schedulers rather than truly AI-native tools.
Recognizing the deep familiarity many researchers have with traditional HPC schedulers, some Neoclouds have developed their own Slurm-on-Kubernetes solutions, like CoreWeave’s SUNK (“SlUrm oN Kubernetes”) or Soperator (a Kubernetes Operator for Slurm) by Nebius.
CoreWeave’s SUNK architecture. Source: https://slurm.schedmd.com/SLUG23/CoreWeave-SLUG23.pdf
However, Slurm itself is not an AI-native tool either; it was designed for HPC workloads long before the current AI boom. Its appeal is largely familiarity: many AI Scientists and ML Engineers know Slurm from their years in academia and find comfort in its interface, even if it’s not the ideal tool for the job. This non-native design introduces several practical challenges for modern AI workflows:
- No Native Multi-Cluster or Cloud Bursting Support: Traditional Slurm deployments are designed around single-cluster environments and lack native support for multi-cloud or hybrid cloud scenarios. Adding new clusters requires complex setups, and dynamic bursting to additional cloud resources when capacity is exhausted is not straightforward. This creates bottlenecks when teams need to scale beyond a single provider’s capacity or want to optimize costs across multiple cloud environments.
- Dependency Management: Maintaining an identical state of software dependencies and packages across every cluster node is a known limitation of traditional Slurm installations. To address this, providers like Nebius have had to implement a shared root file system to create a unified environment across all nodes.
- Difficulty with Containerization: Running modern, containerized AI jobs is not a trivial task in Slurm. It is often described as "challenging" and requires specialized plugins like `pyxis` to integrate container runtimes and achieve a reproducible environment.
- Lack of Native GUI and Monitoring: Slurm’s core components are managed via a command-line interface, and it lacks the built-in, user-friendly web UIs or monitoring dashboards that modern ML teams expect. Top-tier providers differentiate themselves by building custom UI/UX layers and integrating tools like Grafana for monitoring, a feature not native to Slurm itself.
- High Administrative Overhead: With Slurm-on-K8s solutions, DevOps and MLOps teams now have to manage both Kubernetes and Slurm, which introduces significant complexity. Slurm’s own configuration, in files like `slurm.conf`, can be very complex.
The Missing Piece: Making Neocloud Power Usable
Here’s the irony: Neoclouds have solved the hard problems - they’ve built massive GPU clusters with enterprise-grade networking at consumer-friendly prices. But they’ve left the last mile unsolved. They hand you Kubernetes or Slurm access and say “have fun.”
What’s missing is tooling that can harness the raw power of Neocloud infrastructure while actually being usable by ML teams. You need something that understands both the economics of Neocloud GPU arbitrage and the reality of ML workflows.
In part 2 of this series, we’ll explore how SkyPilot solves these challenges and creates a unified compute plane that finally makes AI infrastructure work for ML teams.