AI training can last days, weeks, or even months. Unfortunately, this process is often prone to unexpected interruptions. Whether you want to use spot instances to save up to 3x in costs or keep a random GPU failure from ruining your day, saving your progress periodically with checkpoints is a crucial step toward building a fault-tolerant training process.
In this post, we explore techniques to accelerate AI model checkpointing on the cloud and how to easily achieve them in SkyPilot.
TL;DR:
- Use high-performance disks for writing checkpoints.
- Upload checkpoints to a cloud bucket to securely store the checkpoints.
- Use a local disk as a cache for the cloud bucket to speed up checkpointing by 9.6x.
Here’s a quick SkyPilot YAML configuration demonstrating this approach:
resources:
  accelerators: A100:8
  disk_tier: best

workdir: .

file_mounts:
  /checkpoints:
    source: gs://my-checkpoint-bucket
    mode: MOUNT_CACHED

run: |
  python train.py --outputs /checkpoints
Checkpointing is more than just saving files
Checkpointing might seem straightforward—just save the in-memory state to disk—but reality is more complex.
Specifically, checkpointing comes with a trade-off in performance. As the in-memory model must remain unchanged while it is being exported to disk, model training must be paused for the duration of checkpoint writing. Keeping the GPU idle while waiting for disk I/O reduces GPU utilization, leading to significantly higher costs (up to 30% higher as shown below).
The performance penalty of checkpointing can be mitigated by minimizing the time to write each checkpoint, and therefore the idle time of the GPUs. The time to write a checkpoint depends on two main factors: checkpoint size and write speed.

The size of the checkpoint to be written is proportional to the size of the model. As AI models continue to scale post-LLM boom, the size of checkpoints grows with them, making checkpoints more expensive to write. Without proper setup, you might find checkpointing occupying up to one-third of training time. Expensive GPUs remain idle during this time, waiting for checkpoint operations to complete.
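As a rough sanity check on how quickly checkpoints grow, here is a back-of-envelope sketch. The byte counts per parameter are assumptions about a typical mixed-precision setup (bf16/fp16 weights, fp32 Adam moments), not measurements:

# Rough checkpoint-size estimate for a 6.7B-parameter model (illustrative assumptions).
params = 6.7e9
weights_gb = params * 2 / 1e9                  # bf16/fp16 weights only: ~13 GB
with_optimizer_gb = params * (2 + 8) / 1e9     # + fp32 Adam moments: ~67 GB
print(f"weights only: ~{weights_gb:.0f} GB, with optimizer state: ~{with_optimizer_gb:.0f} GB")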
Tip 1: Use High-Performance Disks
Using high-end disks dramatically improves checkpoint performance. As shown in the figure, they significantly reduce checkpointing overhead, allowing more training compute within the same period.

SkyPilot makes selecting a high-performance disk simple:
sky launch --disk-tier best train.yaml
Or, specify it directly in the SkyPilot YAML, train.yaml:
resources:
  accelerators: A100:8
  disk_tier: best
Using disk_tier: best tells SkyPilot to choose the volume/disk with the highest supported performance. For example:
- On AWS, SkyPilot configures the VM to use an io2 volume
- On GCP, SkyPilot uses a pd-ssd volume
- On Azure, SkyPilot uses a Premium_LRS disk
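If you want to verify what a given disk tier actually delivers before a long run, a quick sequential-write benchmark is enough to spot a slow disk. Here is a minimal sketch; the file path, size, and chunk size are arbitrary assumptions:

import os
import time

def measure_write_throughput(path: str = "disk_bench.bin", size_gib: int = 4) -> float:
    """Time a large sequential write, roughly approximating a checkpoint dump."""
    chunk = os.urandom(64 * 1024 * 1024)          # 64 MiB of data per write
    n_chunks = size_gib * 1024 // 64
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())                      # make sure the data actually hits the disk
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_gib * 1024 / elapsed              # MiB/s

if __name__ == "__main__":
    print(f"sequential write: ~{measure_write_throughput():.0f} MiB/s")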
Persisting checkpoints in cloud buckets
Disks are ephemeral in a cloud-native environment. When a VM or a pod is deleted, any locally saved checkpoints can disappear with it. To truly persist checkpoints, they must be uploaded to remote storage whose lifecycle is separate from that of the compute resources. A cloud bucket is one such option for storing checkpoints remotely.
A cloud bucket behaves differently from common POSIX file systems, so it has a different set of APIs. Due to this difference, the PyTorch script for saving and loading checkpoints on disk no longer works directly on a cloud bucket.
import torch

def save_training_state(step: int, model: torch.nn.Module, optimizer: torch.optim.Optimizer):
    # A blocking write: training stays paused until torch.save returns.
    torch.save({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'step': step
    }, f"checkpoints/checkpoint_{step}.pt")
Tip 2: Upload checkpoints to a cloud bucket, no code changes
SkyPilot offers a simple option to mount a cloud bucket onto the pod/VM you created, so you can read and write to the bucket just as you would a normal path on disk.*
file_mounts:
  /checkpoints:
    source: gs://my-checkpoint-bucket
With this block in train.yaml, all files written to the path /checkpoints are directly synced to the specified cloud bucket: gs://my-checkpoint-bucket.
This works well for small models: a 1.3B LLM takes 24 seconds to write a checkpoint to /checkpoints. For larger models, however, a single checkpoint can take more than 8 minutes to save, which amounts to almost one-third of the compute time if a checkpoint is written every 30 minutes. The performance limitations of cloud bucket uploads make the bucket behave like a low-end disk, undoing the improvements from Tip 1.
| Parameter Size | Checkpoint Size (GB) | Checkpoint Latency (s) |
|---|---|---|
| 1.3B | 2.3 | 24 |
| 6.7B | 13 | 491 |
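A quick back-of-envelope calculation shows where the "almost one-third" above comes from, assuming one checkpoint every 30 minutes at the measured 491-second latency:

checkpoint_s = 491          # measured latency for the 6.7B checkpoint above
interval_s = 30 * 60        # one checkpoint every 30 minutes
overhead = checkpoint_s / interval_s
print(f"GPUs idle for {overhead:.0%} of each checkpoint interval")  # ~27%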
* The default mount mode does not support certain POSIX operations, such as random writes or appends.
Tip 3: Use a local disk as a cache for cloud buckets
In performance-sensitive systems, slow I/O operations can be masked by adding an intermediary cache. For example, CPUs rely on multiple levels of hardware caches to hide the latency of accessing RAM.
The same approach can be applied to mounting cloud buckets. SkyPilot recently released a new mounting mode for cloud buckets, MOUNT_CACHED, that unlocks a remarkable 9.6x performance boost.

This only requires a one-line change to the SkyPilot YAML, train.yaml:

file_mounts:
  /checkpoints:
    source: gs://my-checkpoint-bucket
    mode: MOUNT_CACHED
This approach combines the learnings from the previous two tips to provide the best of both worlds. A checkpoint is initially saved to a local high-performance disk, allowing training to continue as soon as possible. As the GPUs continue to churn away, the checkpoint is asynchronously uploaded to the cloud bucket for persistent storage.
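Conceptually, MOUNT_CACHED gives you a write-back cache: your script writes to /checkpoints as usual, and SkyPilot handles the local staging and background upload. If you were to hand-roll the same pattern yourself, it would look roughly like this sketch (this is not SkyPilot's implementation; the local directory and single-worker uploader are illustrative assumptions):

import shutil
from concurrent.futures import ThreadPoolExecutor

import torch

_uploader = ThreadPoolExecutor(max_workers=1)   # background upload, one at a time

def save_with_writeback(state: dict, step: int,
                        local_dir: str = "/local_nvme",
                        bucket_dir: str = "/mnt/bucket") -> None:
    local_path = f"{local_dir}/checkpoint_{step}.pt"
    torch.save(state, local_path)               # fast local write; training resumes here
    # Copy to the bucket-backed path in the background, off the critical path.
    _uploader.submit(shutil.copy, local_path, f"{bucket_dir}/checkpoint_{step}.pt")

With MOUNT_CACHED you get this behavior for free: just write to /checkpoints and SkyPilot performs the caching and upload for you.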


Bonus: Use asynchronous checkpointing in ML frameworks
Moving slow checkpointing off the critical path of training is a well-known idea. Some ML frameworks implement asynchronous checkpointing, e.g., PyTorch Distributed Asynchronous Checkpointing (dcp.async_save), which leverages RAM to make the blocking write complete faster. These are complementary to the approach discussed above, introducing another level of caching.
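A minimal sketch of what this can look like with PyTorch DCP (assumes PyTorch 2.3 or newer; the checkpoint directory is illustrative, and per the dcp documentation a multi-GPU job's process group should include a CPU backend such as gloo):

import torch
import torch.distributed.checkpoint as dcp

def async_checkpoint(step: int, model: torch.nn.Module,
                     optimizer: torch.optim.Optimizer, prev_future=None):
    """Kick off an asynchronous checkpoint and return a Future to wait on later."""
    # Finish the previous async save before starting a new one.
    if prev_future is not None:
        prev_future.result()
    state = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
    # dcp.async_save stages the state in CPU RAM and writes in the background,
    # so training resumes while the checkpoint lands on the MOUNT_CACHED path.
    # The step is encoded in the checkpoint directory name.
    return dcp.async_save(state, checkpoint_id=f"/checkpoints/step_{step}")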
Illustration of MOUNT_CACHED + async checkpointing

Conclusion
To accelerate model checkpointing on the cloud, use high-end disks as a cache and upload checkpoints to a cloud bucket asynchronously. SkyPilot's recent MOUNT_CACHED mode for cloud buckets boosts checkpointing performance by 9.6x with a simple one-line change to the SkyPilot YAML:
# Install via: pip install 'skypilot-nightly[aws,gcp,azure,kubernetes]'
resources:
  accelerators: A100:8
  disk_tier: best

workdir: .

file_mounts:
  /checkpoints:
    source: gs://my-checkpoint-bucket
    mode: MOUNT_CACHED

run: |
  python train.py --outputs /checkpoints
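Assuming the YAML above is saved as train.yaml, launching it is a single command (the cluster name here is an arbitrary choice):

sky launch -c ckpt-train train.yaml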
Next Steps
- Complete guide for model training with high-performance checkpointing.
- Guide for training on spot instances for 3x cost savings.
Acknowledgement: We thank Doyoung Kim for his significant effort on the initial design and implementation of MOUNT_CACHED, and Hriday Sheth for the early benchmarking and testing of the prototype.