We are excited to announce SkyPilot 0.3, bringing LLM support (Vicuna, LLaMA), new clouds (Lambda, IBM, Cloudflare R2), and enhanced production readiness.

v0.3 is the most substantial release in the project’s history to date. To install or upgrade, run pip install -U skypilot.

This post walks through the release highlights. For the full release notes, see here.

LLM support in SkyPilot

Vicuna

SkyPilot powered the finetuning of Vicuna, one of the most popular open-source LLM chatbots. The Vicuna team successfully used SkyPilot to train Vicuna models on the cloud via Managed Spot (auto-management of spot instances) for $300.

Vicuna was trained using SkyPilot, which automatically finds higher A100 availability across regions and clouds.

We collaborated closely with the Vicuna team to implement several system optimizations for LLMs, and we’re excited to bring them to you in SkyPilot 0.3. For example, we added a --disk-tier={low,medium,high} feature, which selects the “performance tier” of the disks attached to VMs. Using the high tier is crucial for the fast checkpointing of LLMs and large models.

To enable users to get started with their own LLMs, we have released runnable YAMLs and full instructions to reproduce the finetuning and serving of Vicuna. It’s especially exciting to hear from the community that SkyPilot has powered several new LLM projects!

LLaMA

LLaMA is the LLM family from Meta that has fueled the open LLM race since February 2023. Due to its importance, the SkyPilot team released:

New Clouds: delivering the highest GPU availability & zero-egress storage

SkyPilot 0.3 shipped support for three new clouds: Lambda Cloud (from Lambda Labs), IBM Cloud, and Cloudflare (for its R2 object store). This means SkyPilot now supports 6 major clouds in total, with more coming!

SkyPilot 0.3 now supports 6 cloud providers.

Lambda Cloud: We’ve added Lambda Cloud, the first specialized cloud for AI compute in the Sky ecosystem. Lambda delivers high-end GPUs at low prices:

sky show-gpus --cloud lambda --all

IBM Cloud: IBM contributed support for IBM Cloud, the first hyperscaler after AWS/Azure/GCP to be supported in SkyPilot. (Thank you, IBM contributors!)

You can access it using SkyPilot’s unified interface, just like any other supported cloud:

sky launch --cloud ibm --gpus V100 -c my-ibm-cluster

IBM cloud console showing a node launched by SkyPilot
SkyPilot auto-translates a high-level resource request (V100) into an actual VM on IBM (gx2-8x64x1v100).

Cloudflare R2, zero-egress fee object store: SkyPilot 0.3 now supports R2, an object store that charges zero fees for “egress” (transferring data out to Internet/other clouds).

For SkyPilot jobs, which may be dynamically placed on different regions or clouds to execute (based on changing availability/cost), R2 is an ideal solution for storing datasets so users don’t get charged unexpected egress fees.

It’s easy to get started with R2 thanks to its S3-compatible API. It even comes with a one-click tool to import data from S3. To learn more, see our setup docs and usage docs.

What do more clouds mean for AI workloads?

High GPU availability: SkyPilot’s approach to the current GPU shortage on the cloud is simple: find resources in more regions and clouds. The more choices, the more availability.

Crucially, SkyPilot entirely automates finding resources in many regions/clouds, with no user effort required:

SkyPilot’s auto-failover across regions and clouds to improve GPU availability
SkyPilot's auto-failover. Regions from all enabled clouds are sorted by price and attempted in that order. If a launch request is constrained to use a specific cloud, only that cloud's regions are used in failover. Ordering in figure is illustrative and may not reflect the most up-to-date prices.
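The failover order described in the caption can be sketched in a few lines of Python. This is only an illustration of the idea (sort by price, try in order, optionally restrict to one cloud), not SkyPilot's actual implementation: the Candidate class, failover_order, launch_with_failover, and the try_launch callback are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Candidate:
    """A (cloud, region) pair with an illustrative hourly price."""
    cloud: str
    region: str
    hourly_price: float


def failover_order(candidates: Iterable[Candidate],
                   cloud: Optional[str] = None) -> List[Candidate]:
    """Sort regions by price; keep only one cloud if the request pins it."""
    pool = [c for c in candidates if cloud is None or c.cloud == cloud]
    return sorted(pool, key=lambda c: c.hourly_price)


def launch_with_failover(candidates: Iterable[Candidate],
                         try_launch: Callable[[Candidate], bool],
                         cloud: Optional[str] = None) -> Optional[Candidate]:
    """Try each region in price order; return the first one that launches."""
    for candidate in failover_order(candidates, cloud):
        if try_launch(candidate):
            return candidate
    return None  # every candidate region was unavailable
```

Here try_launch stands in for whatever provisioning call reports success or failure for a region. Passing cloud="aws", for example, mirrors the case where a launch request is constrained to a single cloud.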

Seamless switching between clouds: Once a team has access to multiple clouds, ease-of-use becomes a top concern (interface compatibility, learning curves, or choosing which cloud to use for a particular job).

SkyPilot simplifies all of this by enabling seamless switching between clouds. As a demonstration, we’ve heard from many users who run jobs on both a hyperscaler cloud and a specialized AI cloud (such as Lambda) without friction.

Enhancing production-readiness

This release includes a suite of production-readiness features:

  • Private IP-only VPCs are now supported for AWS. This opt-in mode satisfies the security requirement that no SkyPilot-launched node exposes a public IP. To get started, add the following settings to ~/.sky/config.yaml:
    aws:
      vpc_name: <name>  # Name tag of the VPC(s) to use; can exist on many regions
      use_internal_ips: true
      ssh_proxy_command: ssh -W %h:%p -o StrictHostKeyChecking=no me@<jump server public ip>  # '-o ProxyCommand' option for any SSH connections
    
    This is an experimental feature. Community feedback is very welcome to improve it (let us know if you need this on other clouds)!
  • AWS SSO is now supported. No action is required to use it: simply log in via your usual method, such as aws sso login or Okta. Under the hood, SkyPilot makes all launched nodes assume a role from the client machine (so that they can access private S3 buckets, etc.).
  • User identity: Users can now freely switch between different AWS profiles or different GCP projects (“identities”). SkyPilot will remember which identity owns each cluster and protect it from unintentional operations.
  • Cluster leakage prevention is significantly improved. This minimizes any chance of runaway clusters (idle nodes are the top contributor to cloud overspending!).

Cloud costs: more savings and visibility

We also released two major new features that offer higher cloud cost savings & visibility:

  • New feature: Fine-grained optimizer. SkyPilot’s optimizer now operates at the level of zones, when applicable. This means for resources whose prices differ across zones (e.g., AWS’s spot VMs), it optimizes cost and provisions at the zone level to deliver the highest cost savings.
    • To see why this is useful, consider AWS’s g4dn.metal instance (T4:8): its spot price can differ by more than 40% across zones, and these gaps change constantly:
      Spot pricing on AWS can differ by >40% across zones.
    • With SkyPilot, users need not worry about manually picking the best region/zone—SkyPilot now automatically uses the latest pricing to ensure the largest savings.
  • New CLI sky cost-report reports the estimated costs of all clusters.
    New CLI `sky cost-report`.
    • Currently, it offers accurate cost tracking for clusters whose lifecycles are completely managed via sky launch/stop/start/down.
    • Showing accurate costs for clusters with autostop/autodown, or those terminated on the cloud, is a work in progress.
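To make the zone-level optimization concrete, here is a minimal Python sketch of choosing the cheapest zone from a table of spot prices and measuring how far apart the zones are. It is a simplified illustration, not SkyPilot's optimizer: cheapest_zone and price_spread are hypothetical helpers, and the spread is measured relative to the cheapest zone, which is just one possible definition.

```python
from typing import Dict, Tuple


def cheapest_zone(spot_prices: Dict[str, float]) -> Tuple[str, float]:
    """Return the (zone, hourly price) pair with the lowest spot price."""
    if not spot_prices:
        raise ValueError("no zone prices given")
    zone = min(spot_prices, key=spot_prices.get)
    return zone, spot_prices[zone]


def price_spread(spot_prices: Dict[str, float]) -> float:
    """Gap between the priciest and cheapest zone, relative to the cheapest.

    A return value of 0.4 means the priciest zone costs 40% more.
    """
    if not spot_prices:
        raise ValueError("no zone prices given")
    lowest = min(spot_prices.values())
    highest = max(spot_prices.values())
    return (highest - lowest) / lowest
```

Because spot prices change constantly, a selection like this only helps if the price table is refreshed before each launch, which is why using the latest pricing matters.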

Summary

SkyPilot’s goal is to allow apps to seamlessly use a collection of regions/clouds (the Sky) for higher resource availability, greater cost savings, and access to each cloud’s unique software and hardware.

Many months in the making, the v0.3 release is a significant step towards this goal.

There’s much more to be done. Join us on this journey to make running critical workloads (LLMs or others) on any region and any cloud simpler and cheaper than ever.


To receive the latest updates, please star and watch the project’s GitHub repo, follow @skypilot_org, or join the SkyPilot community Slack.