Use Case Highlight: This is a technical deep-dive from Avataar’s engineering team. Learn how they solved GPU allocation inefficiencies and simplified multi-cloud management while scaling AI workloads for major enterprises like HP and Lowe’s.
At Avataar, we’re building specialized AI models for enterprise use cases. Our flagship product, Velocity, uses advanced computer vision and AI to help major companies like HP and Lowe’s automatically generate compelling product videos, shopping experiences, and marketing materials - transforming static catalogs into dynamic content at unprecedented scale.
Avataar's Velocity platform transforms static product images into dynamic videos and interactive shopping experiences.
As we’ve scaled our AI capabilities to serve our growing customer base, our infrastructure needs have become increasingly complex. We need:
- Cost-effective AI infrastructure that can scale with our business
- Multi-cloud capabilities for easy enterprise integration
- Flexibility to run workloads at any scale, from a single GPU to 1000s of GPUs
- Simplified management for our lean infra team
SLURM is not enough
For the past two years, SLURM has been our go-to solution for training infrastructure. It served us well in the early days, but as Avataar has grown, we’ve hit several critical limitations that were starting to impact our business.
Inelastic GPU provisioning was hurting our margins
Our SLURM cluster couldn’t dynamically scale to match our actual GPU requirements. We were locked into a fixed allocation model that required manual scaling and led to over-provisioning. For a growing company focused on margins, this inefficiency was becoming unsustainable.
The problem was most acute for small jobs. When we needed just 1-2 H100 GPUs for smaller fine-tuning runs, our cloud provider required us to provision entire 8-GPU nodes. We were paying for 8 GPUs while using only 1 or 2, turning what should have been cheap training runs into expensive ones.

Inefficient GPU allocation in Avataar’s previous SLURM setup. When only 1 GPU was needed for a small training run, the rigid provisioning model forced allocation of an entire 8-GPU p5.48xlarge node, leaving 7 GPUs idle - roughly $48/hour of wasted compute at $6.88 per GPU-hour.
Multi-cloud is a necessity, not just a nice-to-have
Multi-cloud infrastructure isn’t just a strategic advantage - it’s critical to our business growth. Without it, we simply can’t compete effectively or serve our customers properly.
GPU scarcity forces diversification. Relying on a single cloud provider is a recipe for project delays and missed deadlines. Different providers have varying GPU availability and pricing at different times. When AWS runs out of H100 capacity during critical training windows, we need immediate access to alternatives like RunPod or GCP. Multi-cloud flexibility protects us from this GPU scarcity.
Enterprises need multi-cloud. Many of our enterprise clients have specific data-access policies and compliance requirements. Some prefer or require that their workloads run within their own cloud environments. Being able to deploy on any cloud helps us better serve these enterprise customers and their cloud preferences.
Cloud credits represent real money that can’t be wasted. Like most growing companies, we receive credits from multiple cloud providers through startup programs and partnerships. These credits have expiration dates and usage restrictions. Without the ability to seamlessly move workloads between clouds, we were essentially leaving thousands of dollars on the table.
Supporting diverse workloads requires flexible infrastructure
Our business runs three distinct types of workloads, each with different infrastructure requirements:
Training workloads range from fine-tuning small language and vision models with specialized datasets to training large-scale Diffusion-Transformer video models that require significant compute resources.
Real-time inference involves serving video models and LLMs on EKS with strict latency requirements while optimizing for cost efficiency.
Batch rendering handles large-scale product video rendering using our Rust/WGPU renderer, which has different resource patterns than our ML workloads.
| Workload Type | Typical Duration | GPUs | Scaling Pattern | Priority |
|---|---|---|---|---|
| Training | Hours to days | 2-1000+ | Burst capacity | High performance |
| Real-time Inference | Continuous | 1-8 | Stable, low-latency | Cost efficiency |
| Batch Rendering | Minutes to hours | Variable | On-demand bursts | Flexible capacity |

Workload types at Avataar AI
The varied resource usage patterns and bursting needs of these workloads meant we needed infrastructure that could scale dynamically to match demand rather than requiring us to maintain separate, over-provisioned systems for each use case.
How SkyPilot Powers Avataar’s Multi-Cloud AI Infra
Moving from SLURM to SkyPilot has fundamentally changed how we approach AI infrastructure at Avataar. It has enabled us to build a truly flexible, cost-effective multi-cloud system.
Achieving 11x cost savings through smart resource allocation
The most immediate impact of adopting SkyPilot was a dramatic cost reduction: an 11x decrease in infrastructure costs compared to our previous SLURM setup.
The key breakthrough was being able to request exactly the resources we need. Instead of being forced into 8-GPU AWS nodes for small training runs, we can now provision just 2 H100s from cost-effective providers like RunPod:
| Provider | Instance Type | Hourly Cost | Cost per GPU-hour |
|---|---|---|---|
| AWS | p5.48xlarge (8x H100) | $55.04 | $6.88 |
| RunPod | 2x_H100_SECURE | $4.78 | $2.39 |
H100 pricing on AWS and RunPod. RunPod and other neoclouds can provide right-sized instances for our workloads, saving up to 11x in cost ($55.04/hour for a full AWS node vs. $4.78/hour for 2 GPUs on RunPod).
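For concreteness, here’s a minimal sketch of a SkyPilot task that requests exactly two H100s (the file name and training script are hypothetical; `resources` is SkyPilot’s standard task field):

```yaml
# finetune.yaml - hypothetical task file for a small fine-tuning run
resources:
  accelerators: H100:2   # request exactly 2 GPUs, not a full 8-GPU node
  cloud: runpod          # pin to RunPod; omit to let SkyPilot pick the cheapest provider

setup: pip install -r requirements.txt

run: python finetune.py  # hypothetical training entrypoint
```

Launching it is a single command (`sky launch finetune.yaml`), and dropping the `cloud` field lets SkyPilot’s optimizer choose the cheapest provider with the requested GPUs available.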
SkyPilot’s multi-provider access and elastic scaling capabilities allow us to seamlessly switch between AWS, RunPod, Azure, and GCP based on availability and cost, finding the most cost-effective resources for each workload.
Unlike SLURM’s rigid allocation model, we can now dynamically scale resources up and down based on demand, eliminating the GPU idling that was eating into our margins.
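As an illustrative sketch, this failover can be expressed with SkyPilot’s `resources.any_of` field (the candidate list and entrypoint below are hypothetical):

```yaml
# train.yaml - hypothetical; the job may run on any listed cloud,
# and SkyPilot picks among them based on price and availability
resources:
  accelerators: H100:2
  any_of:
    - cloud: runpod
    - cloud: aws
    - cloud: azure
    - cloud: gcp

run: python train.py  # hypothetical entrypoint
```

If the preferred provider is out of H100 capacity, the job falls through to the next candidate instead of sitting in a queue.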
Maintaining the SLURM experience with multi-cloud flexibility
From an MLOps perspective, SkyPilot’s centralization has been transformative for our infra team. Instead of managing multiple cloud consoles and navigating different documentation for each provider, we now use one centralized SkyPilot API server to monitor and manage all our cloud resources across AWS, RunPod, Azure, and GCP. Centralized credential management means no more individual engineers managing cloud credentials on their laptops or worrying about access control across multiple providers.
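As a sketch, pointing every engineer’s client at the shared API server is a small config change, assuming a deployed SkyPilot API server (the endpoint below is a placeholder):

```yaml
# ~/.sky/config.yaml on each engineer's machine
api_server:
  endpoint: http://skypilot-api.internal.example.com:46580  # placeholder endpoint
```

With this in place, cloud credentials live on the API server rather than on individual laptops.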
For our researchers, the transition has been seamless because SkyPilot preserved the familiar SLURM workflow our team loves. Researchers can SSH into their provisioned nodes to run and debug code, and they can submit jobs just like before. The key difference is that these familiar workflows now work across multiple clouds, giving researchers access to better GPU availability and pricing without any change to their day-to-day development experience.
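As a concrete example, an interactive development cluster can be brought up with a task like the hypothetical sketch below; SkyPilot keeps the cluster running after `run` finishes and sets up an SSH alias for it:

```yaml
# dev.yaml - hypothetical interactive dev box
resources:
  accelerators: H100:1

# The cluster stays up after this finishes, so researchers can SSH in to debug.
run: echo "dev box ready"
```

After `sky launch -c dev-box dev.yaml`, a researcher can `ssh dev-box` to poke around, and submit follow-up jobs with `sky exec dev-box <task>.yaml`, much like `srun` and `sbatch` on SLURM.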

Before and After: Avataar’s migration from a rigid single-cloud SLURM setup to SkyPilot’s flexible multi-cloud infrastructure. The transition maintained familiar workflows for researchers while unlocking access to AWS, RunPod, Azure, and GCP with centralized management and elastic scaling.
Multi-cloud helps our business strategy
Beyond cost savings and operational efficiency, SkyPilot has enabled strategic advantages that directly support our business growth. Multi-cloud capabilities have become essential for our enterprise go-to-market strategy. We can seamlessly run our ML workloads across multiple clouds, enabling us to offer our solutions in clients’ preferred cloud and reducing the integration time.
The ability to efficiently utilize cloud credits has unlocked previously wasted value. Because SkyPilot makes switching between clouds seamless, we can now fully utilize credits from Azure and GCP that would otherwise go unused. This has provided additional cost savings beyond our direct infrastructure efficiency gains.
"SkyPilot has significantly reduced infra management workload for us by unifying multi-cloud orchestration and making it easily programmable."
— Shubham Jain, AI & Engineering Lead, Avataar
Results: 11x cost reduction with multi-cloud flexibility, saving tens of hours per week
By moving off our SLURM cluster and onto SkyPilot, we’ve transformed our infrastructure approach:
11x lower costs with SkyPilot have made a significant impact on our margins. We’ve eliminated waste from over-provisioned GPU resources by gaining access to exactly what we need, when we need it.
Simplified multi-cloud management has been a game-changer for our lean infrastructure team. Our 3-person team can now manage all our cloud resources through a single interface, dramatically reducing operational overhead and allowing us to focus on strategic improvements rather than day-to-day infrastructure management.
Multi-cloud flexibility has positioned us for enterprise success. We can serve clients across multiple clouds and leverage cloud partnerships effectively, reducing time-to-market for new customer implementations.
"SkyPilot has helped us in both reducing costs and increasing productivity. ML engineers can run training across any number of GPUs and service providers via a single terminal, without copying any data or setting up environments.
Infra engineers are the biggest winners — SkyPilot saves them tens of hours per week."
— Shubham Jain, AI & Engineering Lead
As we continue scaling our video generation and agentic capabilities to serve more enterprise customers, SkyPilot provides the foundation for cost-effective, flexible infrastructure that supports both our technical requirements and business objectives. The platform allows our engineering team to focus on building AI products and serving customers rather than managing the complexities of multi-cloud infrastructure.