Use Case Highlight: This is a technical deep-dive from Abridge’s engineering team exploring their journey from a fragmented collection of SLURM clusters with capacity challenges to seamless distributed training with SkyPilot.

At Abridge, we’re on a mission to revolutionize healthcare documentation. Our GenAI platform for Clinical Conversations is currently deployed to more than 100 healthcare systems, helping clinicians generate clinical documentation from patient conversations. AI is at the core of everything we do - it’s what makes our platform possible.

Abridge’s GenAI platform for Clinical Conversations transforms patient conversations into structured and compliant clinical documentation.

As we’ve scaled from serving a handful of health systems to over 100, our AI infrastructure demands have grown exponentially. We need systems that can:

  • Scale with our business as we onboard new healthcare providers
  • Handle training of dozens of model variants across multiple cloud providers
  • Keep operational complexity low for our small infrastructure team

Building and maintaining AI infrastructure at healthcare scale is complex, and the stakes are high when healthcare systems depend on reliable service.

The Challenges of Scaling AI Infrastructure

Scaling our AI infrastructure was not without its challenges. We quickly discovered that our lean infrastructure team had to secure GPU capacity across multiple clouds and make those GPUs usable without delay.

Securing GPU capacity requires multi-cloud infrastructure

Abridge’s multi-cloud infra before SkyPilot. Each cloud had its own quirks and limitations, hurting usability and requiring inefficient manual scheduling and orchestration.

GPUs are hard to find. A single cloud provider couldn’t meet our GPU capacity needs, especially during peak training periods when multiple models needed to be trained and updated. We had to distribute our infrastructure across hyperscalers and neoclouds.

Each cloud required different setups - VMs with no orchestrator on one, Kubernetes with Ray on another, SLURM on a third. Each environment had its own quirks, deployment processes, networking configurations, and management overhead.

Managing these different environments became a maintenance nightmare. Engineers spent significant time debugging environment-specific issues instead of focusing on model development. For example, jobs that worked on SLURM would fail on Kubernetes clusters running on other clouds due to configuration differences.

Supporting interactive development and distributed training

We need infrastructure that can handle both interactive development and production-scale distributed training.

  • Interactive development: Our ML scientists, coming from SLURM backgrounds, need interactive environments (SSH access, Jupyter notebooks) for rapid experimentation - launching mini runs, tweaking code, and iterating on model architectures.

  • Large-scale distributed training: Production training runs need to scale across GPUs with fault tolerance and optimized data loading for multi-day training jobs.

Making both workflows work seamlessly across multiple clouds with different orchestrators was challenging. Each setup handled interactive sessions and distributed training differently, requiring us to maintain separate infrastructure stacks and increasing our operational complexity.

Precious engineer time and GPU hours wasted managing infra

Infrastructure diversity made setting up new training environments hard. A single environment could take days to bring up: installing CUDA drivers, configuring networking, setting up storage, installing dependencies, and debugging environment-specific issues.

As a result, GPU utilization and team productivity suffered from machines sitting idle while engineers worked through configuration problems.

Worse, if a job needed to be moved to a different cluster, the whole process had to be repeated.

"The ML Research team spent a long time moving their training jobs between SLURM clusters and across other orchestrators. They had to debug and test these jobs on each orchestrator separately and maintain different job configs.

Every minute spent moving jobs was time not spent on model training."

— Sisil Mehta, ML Platform Lead

How SkyPilot accelerates AI at Abridge

Unified experience across all clusters

SkyPilot transformed our fragmented infrastructure into a cohesive system:

Multi-cloud infra made simple with SkyPilot. A single unified interface provides access to GPUs across all our Kubernetes clusters. Jobs and development pods can be run anywhere without any migration overhead.

sky launch everywhere: Instead of maintaining separate deployment scripts for each provider, we now use one sky launch command that works across all our GPU clusters and clouds. SkyPilot automatically finds available GPUs and provisions resources.
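
For example, a single command like the following asks SkyPilot for 8 H100s and lets it place them on whichever of our clusters or clouds has capacity (the cluster and task file names here are illustrative):

# Request 8 H100s; SkyPilot picks a cloud or Kubernetes cluster with capacity
sky launch -c train-run --gpus H100:8 train.yaml

# Check which infrastructure the cluster landed on
sky status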

No more Slack scheduling: SkyPilot eliminates manual coordination for GPU allocation by automatically finding available resources across our multi-cloud setup.

Fast onboarding: New ML engineers learn one tool instead of multiple cloud-specific deployment systems, reducing ramp-up time from weeks to hours.

"Previously, the ML infra team was pinged daily on issues like 'Why is the job not being scheduled?', 'Why is scheduling taking so long?', 'Why did my job fail?', 'Why are my dependencies not installing correctly?' etc.

With SkyPilot we rarely get pinged. It just works!"

— Sisil Mehta, ML Platform Lead

SLURM-like experience with K8s-like flexibility

SkyPilot delivered the familiar experience our researchers wanted with the reliability our production workloads required:

  • Interactive development: sky launch --gpus H100:4 provides immediate SSH access to a GPU-enabled shell without complex setup. Just like srun --gres=gpu:4 --pty bash in SLURM, but works seamlessly across all our infrastructure.
  • Jupyter notebook hosting: We can spin up Jupyter notebooks directly on GPU clusters, enabling researchers to prototype with high-end hardware that wasn’t available locally.
  • Managed jobs: SkyPilot’s managed jobs provide the same convenience as SLURM’s job scheduler but work across all our infrastructure - automatic restarts on job failures, strong isolation, and reliable job management for long-running training jobs.
  • Model evals: Quick model evaluation became simple - we can deploy models as FastAPI services in minutes for testing. Unlike SLURM, which lacks native API endpoint support, SkyPilot makes it easy to expose models as services (see the sketch after this list).
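
To make the model-evals point concrete, below is a minimal sketch of the kind of task YAML this enables. The file name, script name, and dependencies are placeholders - the real task depends on the model under evaluation:

# serve.yaml (hypothetical) - expose an eval model as a FastAPI service
resources:
  accelerators: H100:1
  ports: 8000                    # open the port the API listens on

setup: |
  pip install fastapi uvicorn    # plus model-specific dependencies

run: |
  # eval_api.py is a placeholder FastAPI app that loads and serves the model
  uvicorn eval_api:app --host 0.0.0.0 --port 8000

Bringing it up is a single sky launch -c eval serve.yaml, and sky down eval tears it down once the evaluation is done.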

As one of our team members put it:

"Skypilot is pretty nice actually... I must admit even as a die hard slurm guy."

— John Giorgi, Research Scientist

Distributed training made easy

SkyPilot made distributed training work consistently across clouds without adding complexity to our code.

Researchers can use pure PyTorch and Hugging Face without wrappers or additional abstraction layers. SkyPilot seamlessly sets up multiple nodes, populates environment variables with the cluster topology information (number of nodes, GPUs, IP addresses) and kicks off the jobs.

This is a huge win since ML engineers don’t need to struggle with scaffolding for distributed training and it works with any package manager. It’s as simple as:

# SkyPilot YAML for distributed training
resources:
  accelerators: H100:8

num_nodes: 2

setup: |
  # torchtune provides the `tune run` launcher used below
  pip install torch torchtune transformers datasets

run: |
  # The first node in the cluster acts as the rendezvous host
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  tune run \
    --nnodes $SKYPILOT_NUM_NODES \
    --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_id $SKYPILOT_TASK_ID \
    --rdzv_backend c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    full_finetune_distributed \
    --config model_config.json \
    model_dir=/tmp/path_to_model
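
Assuming the YAML above is saved as train.yaml, kicking off the two-node run is a single command (the cluster name is just a placeholder):

sky launch -c train-2node train.yaml

SkyPilot provisions both nodes, runs the setup section on each, and starts the run section with the SKYPILOT_* variables already populated.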

Results: 10x faster development with reduced infrastructure overhead

Moving from SLURM to SkyPilot transformed how we develop AI models at scale:

  • 10x faster iteration cycles: Training runs that previously required a day of setup now launch in under 5 minutes. As one engineer noted: “I can now launch ten experiments in the time it used to take me to set up one.”
  • Moving clouds in minutes, not days: We moved from managing multiple cloud-specific systems to a single, consistent interface. Switching between clouds and Kubernetes clusters is as simple as changing a context in a kubeconfig.
  • Maximized GPU availability: We no longer face capacity constraints from individual cloud providers. SkyPilot automatically finds available resources across our multi-cloud infrastructure.
  • Reduced operational overhead: Our small infrastructure team supports a growing ML organization without proportional headcount increases. We focus on strategic improvements rather than firefighting operational issues.
  • Infra that scales with us: Adding new cloud providers or regions is as simple as adding credentials or a kubeconfig context (sketched below), reducing multi-week SLURM cluster setup projects to simple configuration changes.
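
As a rough sketch of that last point, onboarding a new Kubernetes cluster largely comes down to listing its kubeconfig context in SkyPilot’s config. The context names below are placeholders for our actual clusters:

# ~/.sky/config.yaml - let SkyPilot schedule across multiple Kubernetes contexts
kubernetes:
  allowed_contexts:
    - gke-training-cluster       # placeholder context names
    - eks-training-cluster
    - neocloud-h100-cluster

A quick sky check kubernetes confirms the new context is visible before jobs start landing on it.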

Most importantly, our researchers now focus on model development and healthcare AI innovation instead of infrastructure management. SkyPilot provides the foundation that lets us move quickly while maintaining reliability as we continue expanding our multi-cloud infrastructure to serve more healthcare systems.