There's a debate playing out in AI infrastructure right now. On one side, we have researchers who trained on Slurm in grad school, comfortable with `sbatch train_model.sh` and the predictability of academic HPC clusters. On the other side, we have platform engineers who've spent the last several years mastering Kubernetes, building sophisticated cloud-native architectures for web-scale applications.
The problem? Modern AI workloads don’t fit cleanly into either world, and we’re watching both communities attempt increasingly creative solutions to bridge this gap.
The AI infrastructure landscape has diverged dramatically over the last few years. Meta has been running distributed training across 24,000-GPU clusters, while OpenAI scaled Kubernetes to 7,500 nodes for GPT-3 training back in 2021. Meanwhile, every startup with a decent model is burning through GPU credits trying to figure out whether to bet on Slurm's batch scheduling capabilities or Kubernetes' cloud-native flexibility.

The truth is neither tool was designed for this moment. Slurm emerged from the scientific computing world of ~2003, optimized for fixed clusters running long-running batch jobs where every CPU cycle mattered. Kubernetes was born at Google in 2014 to orchestrate stateless microservices that could scale horizontally and fail gracefully. Now both are being stretched to handle AI workloads that combine the resource intensity of HPC with the dynamic scaling needs of modern applications.
ML Teams and Slurm
In many AI labs, you’ll find researchers who swear by Slurm. There’s something deeply satisfying about the directness of it:
```bash
#!/bin/bash
#SBATCH --job-name=train-llama
#SBATCH --nodes=8
#SBATCH --gres=gpu:H100:8
#SBATCH --time=72:00:00
# ...

srun python train.py --model-size=70B
```
Slurm genuinely excels at what AI researchers need: gang scheduling, where all resources for a distributed job are allocated simultaneously. When you’re training a 70B parameter model across 64 GPUs, you can’t start with 63 and hope the last one shows up eventually. It’s all or nothing, and Slurm’s scheduler understands this fundamentally.
Note on terminology: what K8s calls “gang scheduling” (coordinated resource allocation) is just Slurm’s default scheduling behaviour. In contrast, Slurm uses “gang scheduling” to refer to a different concept of time-slicing multiple jobs on shared resources. This linguistic divergence reflects the different origins and priorities of these two ecosystems.
The resource allocation guarantees are equally important. Once Slurm allocates 8 H100s to your job, those GPUs are reserved for you until the job completes. No surprise evictions, no resource contention for GPU access. For a training run that might cost $50,000 in GPU hours, this allocation predictability is valuable.
But Slurm’s HPC roots also create significant friction for modern AI teams:
Lack of Resource Isolation: While Slurm guarantees you’ll get access to the resources you requested, it relies on user discipline to enforce those limits. A misconfigured job that requests 32GB of RAM but actually consumes 64GB can cause other jobs on the same node to crash with out-of-memory errors. In AI workloads, where memory usage can spike unpredictably during data loading or model initialization, this lack of enforcement creates some risks.
The Static Cluster Problem: Traditional Slurm clusters are fixed pools of resources. When your on-premise cluster hits capacity, you’re stuck waiting or manually provisioning additional nodes. There’s no elegant way to burst to the cloud when you need extra GPUs for a deadline.
Inference is an Afterthought: Slurm excels at batch training jobs but struggles with serving models for inference. Standing up a REST API or an auto-scaling inference service on Slurm feels awkward because Slurm was never designed for long-lived services.
Limited User Interface: Slurm lacks an official web UI, relying primarily on command-line interfaces. While there have been community attempts to build web dashboards (1, 2, 3), there’s no standardized, officially supported graphical interface for job management and cluster monitoring.
Dependency Hell: Keeping software environments consistent across hundreds of nodes requires careful orchestration. As one engineer at Nebius noted, “Slurm has a very annoying requirement: all nodes must be identical (Linux user and group IDs, software versions and the like).”
The Infrastructure Team’s Counter-Proposal: Cloud-Native Kubernetes
Platform engineers see these limitations and advocate for Kubernetes. And honestly, they have compelling arguments. Kubernetes was designed from day one for the dynamic, heterogeneous world of cloud computing.
Elastic Everything: Need more GPUs? Kubernetes' cluster autoscaler can spin up new nodes in minutes. Traffic to your inference service drops to zero at night? Scale down to zero and pay nothing (see the sketch after these points). This elasticity is particularly valuable given that a single GPU can cost $3-6 per hour, and an idle GPU is pure waste.
Unified Platform: The same Kubernetes cluster can handle training jobs, inference services, data pipelines, and monitoring dashboards. No need to maintain separate infrastructure stacks or learn multiple orchestration systems.
Rich Ecosystem: The cloud-native ecosystem around Kubernetes offers sophisticated tools for everything from GPU monitoring with DCGM to distributed training with Volcano. Projects like Kubeflow provide ML-specific abstractions on top of Kubernetes.
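To make the scale-to-zero point concrete: the built-in Horizontal Pod Autoscaler won't normally scale a Deployment below one replica, so teams typically add a component such as KEDA. Below is a minimal sketch, not a production config; it assumes an existing inference Deployment named `llm-inference`, and the schedule and replica counts are illustrative.

```yaml
# Hypothetical KEDA ScaledObject: keep the inference Deployment at 2 replicas
# during working hours and scale it to zero overnight.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
spec:
  scaleTargetRef:
    name: llm-inference      # assumed Deployment serving the model
  minReplicaCount: 0         # allow scale-to-zero outside the active window
  maxReplicaCount: 2
  triggers:
    - type: cron
      metadata:
        timezone: America/Los_Angeles
        start: "0 8 * * *"   # scale up at 08:00
        end: "0 20 * * *"    # scale back down at 20:00
        desiredReplicas: "2"
```

The same pattern works with request-based triggers instead of a fixed schedule; the point is that elasticity on Kubernetes is configuration you can opt into, whereas a static Slurm cluster can't hand idle GPUs back at all.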
The problem? Kubernetes treats AI workloads like any other containerized application, and AI workloads are decidedly not like web services.
Where Kubernetes Falls Short for AI

The Kubernetes Learning Curve. Source: r/kubernetes
The mismatch becomes apparent when you try to run serious AI workloads on vanilla Kubernetes. The default scheduler has no concept of gang scheduling, leading to deadlocks where multiple jobs each hold some GPUs while waiting for others. As Google’s team discovered, managing workloads across 50,000 TPU chips required building sophisticated extensions on top of Kubernetes.
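For a flavour of what those extensions look like, here is a rough sketch of a Volcano PodGroup that enforces all-or-nothing placement. The name and resource numbers are illustrative, and in practice the worker pods would also need `schedulerName: volcano` plus an annotation tying them to this group.

```yaml
# Illustrative Volcano PodGroup: no pod in the group starts until all
# 8 members (64 GPUs in total) can be placed at once.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llama-train-pg
spec:
  minMember: 8                # gang size: one pod per node
  minResources:
    nvidia.com/gpu: "64"      # don't admit the group until 64 GPUs are free
```

This is roughly what the Slurm script above gets for free with `--nodes=8`.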
The complexity tax is real. Compare a simple Slurm job script with a K8s manifest doing the same thing, and you’ll find that the K8s manifest is often at least 3x longer and is, arguably, 10x more difficult to write and read.

Running a distributed training job on Slurm vs K8s
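For readers without the figure handy, here is a heavily abridged sketch of what the earlier Slurm script might look like as a Kubeflow PyTorchJob. The image name is invented, and most of the real boilerplate (shared storage, environment variables, tolerations, node selectors, launch wiring) is omitted.

```yaml
# Abridged, illustrative Kubeflow PyTorchJob for an 8-node x 8-GPU training run.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: train-llama
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch             # the training operator expects this container name
              image: ghcr.io/example/llama-train:latest   # hypothetical image
              command: ["python", "train.py", "--model-size=70B"]
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 7
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: ghcr.io/example/llama-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
```

Even in this stripped-down form it dwarfs the six-line sbatch script, and a realistic manifest only grows from here.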
Interactive development, so crucial for ML research, goes against the declarative philosophy of Kubernetes. For example, imagine trying to connect a tool like Cursor or VSCode to a running Kubernetes pod for interactive debugging — it’s a painful journey involving port forwarding, pod execs, and YAML edits, rather than the seamless experience researchers expect.
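To see the contrast, consider what the rough Kubernetes equivalent of `salloc --gres=gpu:1` looks like: you write and apply a pod spec along these lines (the name and image are illustrative), then `kubectl exec` into it and set up port forwarding for your editor.

```yaml
# Illustrative "dev pod": grabs one GPU and idles so a researcher can exec in.
apiVersion: v1
kind: Pod
metadata:
  name: alice-dev-gpu
spec:
  restartPolicy: Never
  containers:
    - name: dev
      image: pytorch/pytorch:latest    # illustrative image
      command: ["sleep", "infinity"]   # keep the pod alive for interactive use
      resources:
        limits:
          nvidia.com/gpu: 1
```

On Slurm, `salloc --gres=gpu:1` or an `srun --pty bash` gets you to roughly the same place in one line.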
| Feature | Slurm | Kubernetes |
|---|---|---|
| Learning Curve | ✅ Simple bash scripts | ❌ Complex YAML manifests |
| Training Jobs | ✅ Gang scheduling built-in | ❌ Requires extensions |
| Interactive Development | 🟨 `salloc` for interactive sessions | ❌ Port forwarding, pod execs |
| Inference/Serving | ❌ Awkward for APIs | ✅ Designed for services |
| Resource Elasticity | ❌ Static clusters | ✅ Auto-scaling |
| Multi-Cloud | ❌ Single cluster/region | ✅ Cloud-agnostic |
Hybrid Solutions Offer Reconciliation
Recognizing that neither pure approach works perfectly, the industry has developed increasingly sophisticated hybrid solutions.
Slurm-on-Kubernetes Projects: CoreWeave's SUNK runs a full Slurm cluster inside Kubernetes, while Nebius's Soperator provides a Kubernetes operator that manages Slurm clusters as native resources. These solutions aim to give researchers the familiar Slurm interface while leveraging Kubernetes' cloud-native capabilities under the hood. They come with a significant trade-off, though: these implementations typically reserve entire Kubernetes nodes exclusively for Slurm, even when no Slurm jobs are running, making those nodes unavailable to other workloads and schedulers.
Advanced Batch Schedulers: Projects like Volcano, YuniKorn, and Kueue add sophisticated batch scheduling capabilities to Kubernetes. Volcano delivers 2x-4x scheduling performance improvements over default Kubernetes, while Kueue’s MultiKueue feature enables multi-cluster job dispatching.
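As a rough sketch of the Kueue workflow: assuming the platform team has already created a ClusterQueue with GPU quota and a LocalQueue named `team-a` in the namespace (both names are illustrative), a researcher submits an ordinary Kubernetes Job pointed at that queue, and Kueue keeps it suspended until the quota is available.

```yaml
# Illustrative Job submitted through Kueue; it stays suspended until the
# LocalQueue "team-a" has 8 GPUs of free quota, then Kueue unsuspends it.
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-llama
  labels:
    kueue.x-k8s.io/queue-name: team-a   # assumed LocalQueue created by the platform team
spec:
  suspend: true
  parallelism: 1
  completions: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: ghcr.io/example/finetune:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 8
```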
However, these solutions inherit the complexity of Kubernetes itself. While they address some of the scheduling limitations for AI workloads, they do not reduce the operational burden or steep learning curve that AI teams face when working directly with Kubernetes.
Train-on-Slurm, Serve-on-Kubernetes: Many organizations adopt a hybrid approach where training happens on Slurm clusters optimized for batch workloads, while inference services run on Kubernetes clusters optimized for dynamic scaling and high availability. However, this split approach introduces its own challenges: teams must maintain two separate infrastructure stacks, increasing operational overhead and maintenance burden. It can also lead to resource fragmentation, where GPUs and other resources are stranded on one system and can’t be easily shared or reallocated, reducing overall utilization and efficiency.

Moving Infrastructure Out of the Way
Are we optimizing for the wrong thing? The question isn’t whether Slurm or Kubernetes is “better” for AI workloads. The question is why AI teams should need to become infrastructure experts at all.
This is where tools like SkyPilot become interesting. Rather than forcing teams to choose between orchestration systems or build complex hybrid solutions, SkyPilot provides a unified interface that abstracts away the underlying infrastructure complexity. Whether your job runs on a Kubernetes cluster or across multiple cloud providers becomes an implementation detail rather than a fundamental architectural decision.
Remember that clean Slurm snippet from earlier? SkyPilot offers similar simplicity, but with multi-cloud portability and the ability to burst to any region or cloud:
```yaml
# One YAML file that works across any infrastructure (K8s or different clouds)
resources:
  accelerators: H100:8

num_nodes: 2

setup: |
  pip install torch torchvision

run: |
  torchrun --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE train.py
```
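Assuming the file is saved as `task.yaml`, a single `sky launch task.yaml` provisions the nodes, runs the setup and run sections, and falls back to the next candidate region or cloud if the first choice has no capacity.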
The system automatically discovers available resources, handles failover between providers, and manages the underlying orchestration complexity. When Lambda Labs runs out of H100s or H200s, it can fail over to Nebius. Unlike Slurm, which is tied to a specific cluster and region, SkyPilot enables seamless scaling across different regions and clouds. With over 2.5M downloads (according to https://pepy.tech/), SkyPilot represents a growing recognition that the future lies in abstraction rather than standardization.
AI Infrastructure Solutions: Operational Complexity vs Cloud-Native Flexibility
| Feature | Pure Slurm | Slurm-on-K8s | K8s + Volcano/Kueue | Train/Serve Split | Abstraction Layer |
|---|---|---|---|---|---|
| Multi-Cloud | ❌ | Limited | ✅ | Partial | ✅ |
| Burst to New Regions/Clouds | ❌ | ❌ | ❌ | ❌ | ✅ |
| Resource Utilization | High (per-node) | Low (nodes reserved for Slurm) | Medium | Low (resource fragmentation) | High (fine-grained, per-GPU) |
| Learning Curve | Low | Low | High | High | Low |
| Operational Overhead | Medium | High | High | High | Low |
The Path Forward
The Slurm vs Kubernetes debate is, at its core, a debate between specialized tools optimized for specific use cases and general-purpose platforms that handle diverse workloads. Both approaches have merit, and both will continue to evolve.
What’s clear is that the AI infra tooling landscape is far more nuanced than simple either-or choices. Microsoft and Google are investing heavily in making Kubernetes more AI-friendly, while SchedMD continues advancing Slurm’s cloud integration capabilities. However, the most successful organizations will be those that use abstractions that let AI teams focus on models rather than infrastructure.