We are excited to announce SkyPilot 0.3, bringing LLM support (Vicuna, LLaMA), new clouds (Lambda, IBM, Cloudflare R2), and enhanced production readiness.
v0.3 is the most substantial release in the project’s history to date. To install or upgrade, run pip install -U skypilot.
This post walks through the release highlights. For the full release notes, see here.
LLM support in SkyPilot
Vicuna
SkyPilot powered the finetuning of Vicuna, one of the most popular open-source LLM chatbots. The Vicuna team successfully used SkyPilot to train Vicuna models on the cloud via Managed Spot (auto-management of spot instances) for $300.
We collaborated closely with the Vicuna team to implement several system optimizations for LLMs, and we’re excited to bring them to you in SkyPilot 0.3.
For example, we added a --disk-tier={low,medium,high} option, which selects the “performance tier” of the disks attached to VMs. Using the high tier is crucial for the fast checkpointing of LLMs and other large models.
To enable users to get started with their own LLMs, we have released runnable YAMLs and full instructions to reproduce the finetuning and serving of Vicuna. It’s especially exciting to hear from the community that SkyPilot has powered several new LLM projects!
LLaMA
LLaMA is the family of LLMs from Meta that has fueled the open LLM race since February 2023. Due to its importance, the SkyPilot team released:
- A simple chatbot for LLaMA, and
- A full guide and runnable YAMLs on how to serve a LLaMA LLM on any cloud with one command.
New Clouds: delivering the highest GPU availability & zero-egress storage
SkyPilot 0.3 shipped support for three new clouds: Lambda Cloud (from Lambda Labs), IBM Cloud, and Cloudflare (for its R2 object store). This means SkyPilot now supports 6 major clouds in total, with more coming!
Lambda Cloud: We’ve added Lambda Cloud, the first specialized cloud for AI compute in the Sky ecosystem. Lambda delivers high-end GPUs at low prices:
sky show-gpus --cloud lambda --all
IBM Cloud: IBM contributed support for IBM Cloud, the first hyperscaler after AWS/Azure/GCP to be supported in SkyPilot. (Thank you, IBM contributors!)
You can access it using SkyPilot’s unified interface, just like any other supported cloud:
sky launch --cloud ibm --gpus V100 -c my-ibm-cluster
Cloudflare R2, zero-egress fee object store: SkyPilot 0.3 now supports R2, an object store that charges zero fees for “egress” (transferring data out to Internet/other clouds).
For SkyPilot jobs, which may be dynamically placed on different regions or clouds to execute (based on changing availability/cost), R2 is an ideal solution for storing datasets so users don’t get charged unexpected egress fees.
It’s easy to get started with R2 thanks to its S3-compatible API. It even comes with a one-click tool to import data from S3. To learn more, see our setup docs and usage docs.
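To make the egress argument concrete, here is a minimal back-of-the-envelope sketch. The per-GB rate, dataset size, and transfer count are placeholder numbers chosen for illustration, not quoted prices from any provider; only the zero-fee case reflects what the text above says about R2.

```python
def egress_cost(dataset_gb: float, num_transfers: int, fee_per_gb: float) -> float:
    """Total fee for reading a dataset out of an object store `num_transfers` times."""
    return dataset_gb * num_transfers * fee_per_gb


# Placeholder rate for illustration only; check your provider's pricing page.
HYPOTHETICAL_FEE_PER_GB = 0.09
ZERO_EGRESS_FEE_PER_GB = 0.0

dataset_gb = 500
transfers = 4  # e.g. four jobs that ended up outside the bucket's home cloud

with_fee = egress_cost(dataset_gb, transfers, HYPOTHETICAL_FEE_PER_GB)
zero_fee = egress_cost(dataset_gb, transfers, ZERO_EGRESS_FEE_PER_GB)
print(f"store with egress fees: ${with_fee:.2f}; zero-egress store: ${zero_fee:.2f}")
```

The point of the sketch is that the fee scales with how often jobs land away from the data, which is hard to predict when placement is dynamic.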
What do more clouds mean for AI workloads?
High GPU availability: SkyPilot’s approach to the current GPU shortage on the cloud is simple: find resources in more regions and clouds. The more choices, the more availability.
Crucially, SkyPilot entirely automates finding resources in many regions/clouds, with no user effort required.
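The idea of searching more locations automatically can be sketched as a simple failover loop. This is an illustrative sketch only, not SkyPilot's actual implementation: `provision_with_failover`, `CapacityError`, the fake provisioner, and the location labels are all hypothetical names introduced here.

```python
from typing import Callable, Iterable, Optional, Tuple

Location = Tuple[str, str]  # (cloud, region); labels below are illustrative


class CapacityError(Exception):
    """Raised by the hypothetical provisioner when a location has no capacity."""


def provision_with_failover(
    candidates: Iterable[Location],
    try_provision: Callable[[Location], None],
) -> Optional[Location]:
    """Try each candidate in order; return the first that succeeds, else None."""
    for location in candidates:
        try:
            try_provision(location)
        except CapacityError:
            continue  # no capacity here; move on to the next candidate
        return location
    return None


# Fake provisioner standing in for real cloud API calls.
AVAILABLE = {("lambda", "region-c")}


def fake_provision(location: Location) -> None:
    if location not in AVAILABLE:
        raise CapacityError(f"no capacity in {location}")


candidates = [("aws", "region-a"), ("gcp", "region-b"), ("lambda", "region-c")]
chosen = provision_with_failover(candidates, fake_provision)
print(chosen)
```

Adding a cloud or region simply lengthens the candidate list, which is why more choices translate into more availability.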
Seamless switching between clouds: Once a team has access to multiple clouds, ease-of-use becomes a top concern (interface compatibility, learning curves, or choosing which cloud to use for a particular job).
SkyPilot simplifies all of this by enabling seamless switching between clouds. As a demonstration, we’ve heard from many users who run jobs on both a hyperscaler cloud and a specialized AI cloud (such as Lambda).
Enhancing production-readiness
This release includes a suite of production-readiness features:
- Private IP-only VPCs are now supported for AWS. This opt-in mode satisfies the security requirement that SkyPilot-launched nodes must not expose a public IP. To get started, write the following settings to the file ~/.sky/config.yaml:

      aws:
        vpc_name: <name>  # Name tag of the VPC(s) to use; can exist on many regions
        use_internal_ips: true
        ssh_proxy_command: ssh -W %h:%p -o StrictHostKeyChecking=no me@<jump server public ip>  # '-o ProxyCommand' option for any SSH connections

  This is an experimental feature. Community feedback is very welcome to improve it (let us know if you need this on other clouds)!
- AWS SSO is now supported. No actions are required to use it; simply log in via your usual method, such as aws sso login or Okta. Under the hood, SkyPilot makes all launched nodes assume a role from the client machine (so that they can access private S3 buckets, etc.).
- User identity: Users can now freely switch between different AWS profiles or different GCP projects (“identities”). SkyPilot will remember which identity owns each cluster and protect it from unintentional operations.
- Cluster leakage prevention is significantly improved. This minimizes any chance of run-away clusters (idle nodes are the top contributor to cloud overspending!).
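The user-identity protection in the list above can be pictured as an ownership check performed before any cluster operation. The sketch below is a simplified illustration under that assumption; `ClusterOwnership`, `IdentityMismatchError`, and the identity strings are hypothetical and do not describe SkyPilot's internal code.

```python
from typing import Dict


class IdentityMismatchError(Exception):
    """Raised when the active identity does not own the target cluster."""


class ClusterOwnership:
    """Hypothetical in-memory record of which identity launched each cluster."""

    def __init__(self) -> None:
        self._owner_of: Dict[str, str] = {}

    def record_launch(self, cluster: str, identity: str) -> None:
        self._owner_of[cluster] = identity

    def check(self, cluster: str, active_identity: str) -> None:
        """Raise if the cluster is known and owned by a different identity."""
        owner = self._owner_of.get(cluster)
        if owner is not None and owner != active_identity:
            raise IdentityMismatchError(
                f"{cluster!r} is owned by {owner!r}, "
                f"but the active identity is {active_identity!r}"
            )


ownership = ClusterOwnership()
ownership.record_launch("my-cluster", "aws-profile-a")
ownership.check("my-cluster", "aws-profile-a")  # same identity: allowed

try:
    ownership.check("my-cluster", "aws-profile-b")
    blocked = False
except IdentityMismatchError as err:
    blocked = True
    print(err)
```

Checking before acting means a user who switches profiles cannot accidentally stop or tear down a cluster that belongs to another identity.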
Cloud costs: more savings and visibility
We also released two major new features that offer higher cloud cost savings & visibility:
- New feature: Fine-grained optimizer. SkyPilot’s optimizer now operates at the level of zones, when applicable. This means for resources whose prices differ across zones (e.g., AWS’s spot VMs), it optimizes cost and provisions at the zone level to deliver the highest cost savings.
  - To see why this is useful: spot prices for AWS’s g4dn.metal instance (T4:8) can differ by more than 40% across zones, and these gaps change constantly.
  - With SkyPilot, users need not worry about manually picking the best region/zone; SkyPilot now automatically uses the latest pricing to ensure the largest savings.
- New CLI command: sky cost-report reports the estimated costs of all clusters.
  - Currently, it offers accurate cost tracking for clusters whose lifecycles are completely managed via sky launch/stop/start/down.
  - Showing accurate costs for clusters with autostop/autodown, or those terminated on the cloud, is work-in-progress.
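Both features in the list above reduce to small calculations, sketched here under stated assumptions. The zone names and prices are made-up placeholders, and `cheapest_zone` and `estimated_cost` are hypothetical helpers: the first picks the lowest-priced zone from a price table, the second multiplies an hourly price by the hours a cluster was up. Neither is SkyPilot's real optimizer or cost-report logic.

```python
from typing import Dict, List, Tuple


def cheapest_zone(spot_prices: Dict[str, float]) -> Tuple[str, float]:
    """Return the (zone, hourly price) pair with the lowest price."""
    zone = min(spot_prices, key=spot_prices.get)
    return zone, spot_prices[zone]


def estimated_cost(
    hourly_price: float, running_intervals: List[Tuple[float, float]]
) -> float:
    """Hourly price times total uptime.

    Each interval is (start_hour, end_hour) during which the cluster was up,
    e.g. between a launch/start and the following stop/down.
    """
    hours = sum(end - start for start, end in running_intervals)
    return hourly_price * hours


# Placeholder prices in $/hour for illustration only.
HYPOTHETICAL_SPOT_PRICES = {"zone-a": 2.50, "zone-b": 1.50, "zone-c": 2.00}

zone, price = cheapest_zone(HYPOTHETICAL_SPOT_PRICES)
# Up from hour 0 to 3, stopped, then up again from hour 10 to 12.
cost = estimated_cost(price, [(0.0, 3.0), (10.0, 12.0)])
print(zone, price, cost)
```

The interval-based estimate also hints at why autostop/autodown is harder to account for: the end of an interval is then decided on the cluster rather than by a client-side command.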
Summary
SkyPilot’s goal is to allow apps to seamlessly use a collection of regions/clouds (the Sky) for higher resource availability, cost savings, and access to each cloud’s unique software and hardware.
Many months in the making, the v0.3 release is a significant step towards this goal.
There’s much more to be done. Join us in this journey to make running critical workloads (LLMs or others) on any region, any cloud much simpler and cheaper than ever.
To receive the latest updates, please star and watch the project’s GitHub repo, follow @skypilot_org, or join the SkyPilot community Slack.