AI training can last days, weeks, or even months. Unfortunately, this process is often prone to unexpected interruptions. Whether you want to use spot instances to save up to 3x in costs or keep a random GPU failure from ruining your day, saving your progress periodically with checkpoints is a crucial step toward building a fault-tolerant training process.
In this post, we explore techniques to accelerate AI model checkpointing on the cloud and how to easily achieve them in SkyPilot.
TL;DR:
- Use high-performance disks for writing checkpoints.
- Upload checkpoints to a cloud bucket to securely store the checkpoints.
- Use a local disk as a cache for the cloud bucket to speed up checkpointing by 9.6x.
Here’s a quick SkyPilot YAML configuration demonstrating this approach:
resources:
  accelerators: A100:8
  disk_tier: best

workdir: .

file_mounts:
  /checkpoints:
    source: gs://my-checkpoint-bucket
    mode: MOUNT_CACHED

run: |
  python train.py --outputs /checkpoints
Checkpointing is more than just saving files
Checkpointing might seem straightforward—just save the in-memory state to disk—but reality is more complex.
Specifically, checkpointing comes with a trade-off in performance. As the in-memory model must remain unchanged while it is being exported to disk, model training must be paused for the duration of checkpoint writing. Keeping the GPU idle while waiting for disk I/O reduces GPU utilization, leading to significantly higher costs (up to 30% higher as shown below).
The performance penalty of checkpointing can be mitigated by minimizing the time to write each checkpoint, and therefore the idle time of the GPUs. The time to write a checkpoint depends on two main factors: checkpoint size and write speed.

The size of the checkpoint to be written is proportional to the size of the model. As AI models continue to scale post-LLM boom, the size of checkpoints grows with them, making checkpoints more expensive to write. Without proper setup, you might find checkpointing occupying up to one-third of training time. Expensive GPUs remain idle during this time, waiting for checkpoint operations to complete.
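As a rough sanity check on how quickly checkpoints grow, here is a back-of-envelope sketch. The byte counts per parameter are assumptions about a typical mixed-precision setup (bf16/fp16 weights, fp32 Adam moments), not measurements:

# Rough checkpoint-size estimate for a 6.7B-parameter model (illustrative assumptions).
params = 6.7e9
weights_gb = params * 2 / 1e9                  # bf16/fp16 weights only: ~13 GB
with_optimizer_gb = params * (2 + 8) / 1e9     # + fp32 Adam moments: ~67 GB
print(f"weights only: ~{weights_gb:.0f} GB, with optimizer state: ~{with_optimizer_gb:.0f} GB")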
Tip 1: Use High-Performance Disks
Using high-end disks dramatically improves checkpoint performance. As shown in the figure, they significantly reduce checkpointing overhead, allowing more training compute within the same period.

SkyPilot makes selecting a high-performance disk simple:
sky launch --disk-tier best train.yaml
Or, specify it directly in the SkyPilot YAML, train.yaml:
resources:
  accelerators: A100:8
  disk_tier: best
Using disk_tier: best tells SkyPilot to choose the volume/disk with the highest supported performance. For example:
- On AWS, SkyPilot configures the VM to use an io2 volume
- On GCP, SkyPilot uses a pd-ssd volume
- On Azure, SkyPilot uses a Premium_LRS disk
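If you want to verify what a given disk tier actually delivers before a long run, a quick sequential-write benchmark is enough to spot a slow disk. Here is a minimal sketch; the file path, size, and chunk size are arbitrary assumptions:

import os
import time

def measure_write_throughput(path: str = "disk_bench.bin", size_gib: int = 4) -> float:
    """Time a large sequential write, roughly approximating a checkpoint dump."""
    chunk = os.urandom(64 * 1024 * 1024)          # 64 MiB of data per write
    n_chunks = size_gib * 1024 // 64
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())                      # make sure the data actually hits the disk
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_gib * 1024 / elapsed              # MiB/s

if __name__ == "__main__":
    print(f"sequential write: ~{measure_write_throughput():.0f} MiB/s")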
Persisting checkpoints in cloud buckets
Disks are ephemeral in a cloud-native environment. When a VM or a pod is deleted, any locally saved checkpoints can disappear with it. To truly persist checkpoints, they must be uploaded to remote storage whose lifecycle is separate from that of the compute resources. A cloud bucket is one such option for storing checkpoints remotely.
A cloud bucket behaves differently from common POSIX file systems, so it has a different set of APIs. Due to this difference, the PyTorch script for saving and loading checkpoints on disk no longer works directly on a cloud bucket.
import torch

def save_training_state(step: int, model: torch.nn.Module, optimizer: torch.optim.Optimizer):
    # A blocking write: training stays paused until torch.save returns.
    torch.save({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'step': step
    }, f"checkpoints/checkpoint_{step}.pt")
Tip 2: Upload checkpoints to a cloud bucket, no code changes
SkyPilot offers a simple option to mount a cloud bucket onto the pod/VM you created, so you can read and write to the bucket just as you would a normal path on disk.*
file_mounts:
  /checkpoints:
    source: gs://my-checkpoint-bucket
With this block in train.yaml, all files written to the path /checkpoints are directly synced to the specified cloud bucket: gs://my-checkpoint-bucket.
This works well for small models: a 1.3B LLM takes 24 seconds to write a checkpoint to /checkpoints. For larger models, however, a single checkpoint can take more than 8 minutes to save, which amounts to almost one-third of the compute time if a checkpoint is written every 30 minutes. The performance limitations of cloud bucket uploads make the bucket behave like a low-end disk, undoing the improvements from Tip 1.
| Parameter Size | Checkpoint Size (GB) | Checkpoint Latency (s) |
|---|---|---|
| 1.3B | 2.3 | 24 |
| 6.7B | 13 | 491 |
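A quick back-of-envelope calculation shows where the "almost one-third" above comes from, assuming one checkpoint every 30 minutes at the measured 491-second latency:

checkpoint_s = 491          # measured latency for the 6.7B checkpoint above
interval_s = 30 * 60        # one checkpoint every 30 minutes
overhead = checkpoint_s / interval_s
print(f"GPUs idle for {overhead:.0%} of each checkpoint interval")  # ~27%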
* The default mount mode does not support certain POSIX operations, such as random writes or appends.
Tip 3: Use a local disk as a cache for cloud buckets
In performance-sensitive systems, slow I/O operations can be masked by adding an intermediary cache. For example, CPUs rely on multiple levels of hardware caches to hide the latency of accessing RAM.
The same approach can be applied to mounting cloud buckets. SkyPilot recently released a new mounting mode for cloud buckets, MOUNT_CACHED, that unlocks a remarkable 9.6x performance boost.

This only requires a one-line change to the SkyPilot YAML, train.yaml:

file_mounts:
  /checkpoints:
    source: gs://my-checkpoint-bucket
    mode: MOUNT_CACHED
This approach combines the learnings from the previous two tips to provide the best of both worlds. A checkpoint is initially saved to a local high-performance disk, allowing training to continue as soon as possible. As the GPUs continue to churn away, the checkpoint is asynchronously uploaded to the cloud bucket for persistent storage.
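Conceptually, MOUNT_CACHED gives you a write-back cache: your script writes to /checkpoints as usual, and SkyPilot handles the local staging and background upload. If you were to hand-roll the same pattern yourself, it would look roughly like this sketch (this is not SkyPilot's implementation; the local directory and single-worker uploader are illustrative assumptions):

import shutil
from concurrent.futures import ThreadPoolExecutor

import torch

_uploader = ThreadPoolExecutor(max_workers=1)   # background upload, one at a time

def save_with_writeback(state: dict, step: int,
                        local_dir: str = "/local_nvme",
                        bucket_dir: str = "/mnt/bucket") -> None:
    local_path = f"{local_dir}/checkpoint_{step}.pt"
    torch.save(state, local_path)               # fast local write; training resumes here
    # Copy to the bucket-backed path in the background, off the critical path.
    _uploader.submit(shutil.copy, local_path, f"{bucket_dir}/checkpoint_{step}.pt")

With MOUNT_CACHED you get this behavior for free: just write to /checkpoints and SkyPilot performs the caching and upload for you.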


Bonus: Use asynchronous checkpointing in ML frameworks
Moving slow checkpointing off the critical path of training is a well-known idea. Some ML frameworks implement asynchronous checkpointing, e.g., PyTorch Distributed Asynchronous Checkpointing (dcp.async_save), which leverages RAM to make the blocking write complete faster. These are complementary to the approach discussed above, introducing another level of caching.
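A minimal sketch of what this can look like with PyTorch DCP (assumes PyTorch 2.3 or newer; the checkpoint directory is illustrative, and per the dcp documentation a multi-GPU job's process group should include a CPU backend such as gloo):

import torch
import torch.distributed.checkpoint as dcp

def async_checkpoint(step: int, model: torch.nn.Module,
                     optimizer: torch.optim.Optimizer, prev_future=None):
    """Kick off an asynchronous checkpoint and return a Future to wait on later."""
    # Finish the previous async save before starting a new one.
    if prev_future is not None:
        prev_future.result()
    state = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
    # dcp.async_save stages the state in CPU RAM and writes in the background,
    # so training resumes while the checkpoint lands on the MOUNT_CACHED path.
    # The step is encoded in the checkpoint directory name.
    return dcp.async_save(state, checkpoint_id=f"/checkpoints/step_{step}")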
Illustration of MOUNT_CACHED + async checkpointing

Conclusion
To accelerate model checkpointing on the cloud, use high-end disks as a cache and upload checkpoints to a cloud bucket asynchronously. SkyPilot's recent MOUNT_CACHED mode for cloud buckets boosts checkpointing performance by 9.6x with a simple one-line change to the SkyPilot YAML:
# Install via: pip install 'skypilot-nightly[aws,gcp,azure,kubernetes]'
resources:
  accelerators: A100:8
  disk_tier: best

workdir: .

file_mounts:
  /checkpoints:
    source: gs://my-checkpoint-bucket
    mode: MOUNT_CACHED

run: |
  python train.py --outputs /checkpoints
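Assuming the YAML above is saved as train.yaml, launching it is a single command (the cluster name here is an arbitrary choice):

sky launch -c ckpt-train train.yaml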
Next Steps
- Complete guide for model training with high-performance checkpointing.
- Guide for training on spot instances for 3x cost savings.
Acknowledgement: We thank Doyoung Kim for his significant effort on the initial design and implementation of MOUNT_CACHED, and Hriday Sheth for the early benchmarking and testing of the prototype.