Want to host your own LLM Chatbot on any cloud of your choosing? Worried about GPU unavailability on the cloud, massive cloud bills, manually learning how to launch instances in different regions/clouds, or laboriously setting up the cloud instances?
It will automatically perform the following:
- Get a beefy GPU instance on AWS, GCP, Azure, or Lambda
- Set up the instance (download weights, install requirements in a Conda env, etc.)
- Launch a chatbot interface that we can connect to through our laptop’s browser
…and it does so while abstracting away all of the above infra burden and minimizing costs.
LLaMA is a set of Large Language Models (LLMs) recently released by Meta. Trained on more than 1 trillion tokens from public datasets, LLaMA achieves high quality and is space-efficient. You can fill out a form to request access from Meta to download the open model weights. In the steps below we assume either (1) you have an unexpired download URL, or (2) the weights have been downloaded and stored on the local machine.
SkyPilot is an open-source framework from UC Berkeley for seamlessly running machine learning on any cloud. With a simple CLI, users can easily launch many clusters and jobs, while substantially lowering their cloud bills. Currently, Lambda (low-cost GPU cloud), AWS, GCP, and Azure are supported. See docs to learn more.
- Install SkyPilot and check that cloud credentials exist:
pip install "skypilot[aws,gcp,azure,lambda]" # pick your clouds sky check
- Get the example folder:
git clone https://github.com/skypilot-org/skypilot.git cd skypilot/llm/llama-chatbots
- a. If you have an unexpired LLaMA URL from Meta, run:
export LLAMA_URL='https://' # Add URL sent by Meta. sky launch llama-65b.yaml -c llama -s --env LLAMA_URL=$LLAMA_URL
This will download the 65B model to the cloud instance. The setup process can take up to 30 minutes.
b. Otherwise, if you have LLaMA checkpoints stored on your machine: Ensure that they are organized in the following directory structure (identical to Meta’s official release):
<llama-dir> ├── 7B/ | |-- ... ├── 13B/ | |-- ... ├── 30B/ | |-- ... ├── 65B/ | |-- ... └── tokenizer.model `-- tokenizer_checklist.chk
Then, run the commands below:
export LLAMA_DIR='<llama-dir>' # Directory of your local LLaMA checkpoints. ln -s $(readlink -f $LLAMA_DIR) /tmp/llama sky launch llama-65b-upload.yaml -c llama -s
This will upload the 65B model to the cloud instance.
Either way, you will see a confirmation prompt like the following:You may see fewer clouds because SkyPilot uses only the clouds you have access to.
See below for many more commands to run!
- Open another terminal and run:
ssh -L 7681:localhost:7681 llama
- Open http://localhost:7681 in your browser and start chatting!
More commands to try
To launch your chatbot on different GPUs or clouds, SkyPilot makes it easy with just one argument change.
NOTE: To use checkpoints stored on your machine, replace
llama-*-upload.yaml in all commands (see instructions above).
Launch different GPUs with
--gpus <type>:<num> (see
sky show-gpus for all supported GPUs):
sky launch llama-65b.yaml --gpus A100:8 <other args> sky launch llama-7b.yaml --gpus A100:1 <other args> sky launch llama-7b.yaml --gpus V100:1 <other args>
Launch on different clouds with
|Launch on the cheapest cloud/region (automatically chosen!)|
|Launch on Lambda|
|Launch on GCP|
|Launch on AWS|
|Launch on Azure|
Use spot instances to save >3x costs with
sky launch llama-65b.yaml --use-spot <other args>
To use other model sizes, simply pass the correct YAML path to the command (the YAMLs have set the correct number of GPUs and checkpoint paths):
sky launch llama-7b.yaml -c llama-7b -s --env LLAMA_URL=$LLAMA_URL sky launch llama-13b.yaml -c llama-13b -s --env LLAMA_URL=$LLAMA_URL sky launch llama-30b.yaml -c llama-30b -s --env LLAMA_URL=$LLAMA_URL sky launch llama-65b.yaml -c llama-65b -s --env LLAMA_URL=$LLAMA_URL
To see details about these flags, see CLI docs or run
sky launch -h.
When you are done, you can stop or tear down the cluster:
- To stop the cluster, run
sky stop llama # or pass your custom name if you used "-c <other name>"
You can restart a stopped cluster and relaunch the chatbot (the
run section in YAML) with
sky launch llama-65b.yaml -c llama --no-setup
--no-setup flag: a stopped cluster preserves its disk contents so we can skip redoing the setup.
- To tear down the cluster (non-restartable), run
sky down llama # or pass your custom name if you used "-c <other name>"
To see your clusters, run
sky status, which is a single pane of glass for all your clusters across regions/clouds.
To learn more about various SkyPilot commands, see Quickstart.
Some caveat first. LLaMA models aren’t specifically finetuned for being used as a chatbot and we only did some basic priming of the model (
chat.py), so it is expected to experience not-so-great quality in your chats. See also Meta’s FAQ here.
That said, we expect LLaMAs/other open LLMs to rapidly advance in the near future. As open LLMs become more powerful, bigger, and more compute-hungry, the demand of flexibly finetuning and running them on a variety of cloud compute will dramatically increase.
And that is where SkyPilot comes in. This example shows three major benefits of using SkyPilot to run ML projects on the cloud:
Cloud portability & productivity: We’ve wrapped an existing ML project and launched it to the cloud of your choosing using a simple YAML and one command. Interacting with a simple CLI, users get cloud portability with one argument change.
SkyPilot also improves ML users' productivity of using the cloud. There’s no need to learn different clouds' consoles or APIs. No need to figure out the right instance types. And there’s no change to the actual project code for it to run.
Higher GPU availability: If a region or a whole cloud runs out of GPUs (increasingly common in today’s Large Models race), the only solution other than waiting is to go to more regions and clouds.
sky launch command makes this entirely automatic. It performs auto-failover behind the scenes. For each request the system loops through all enabled regions (or even clouds) to find available GPUs, and does so in the cheapest-price order.
Reducing cloud bills: GPUs can be very expensive on the cloud. SkyPilot reduces ML teams' costs by supporting
- low-cost GPU cloud (Lambda; >3x cheaper than AWS/Azure/GCP)
- spot instances (>3x cheaper than on-demand)
- automatically choosing the cheapest cloud/region/zone
- auto-stopping & auto-termination of instances (docs)
Congratulations! You have used SkyPilot to launch a LLaMA-based chatbot on the cloud with just one command. The system automatically handles setting up instances and it offers cloud portability, higher GPU availability, and cost reduction.
LLaMA chatbots are just one example app. To leverage these benefits for your own ML projects on the cloud, we recommend the Quickstart guide.