SkyPilot has always made it easy for individual AI/ML practitioners to run AI workloads across multiple clouds, Kubernetes clusters, and on-prem machines. It’s one command to spin up a cluster, run a job/service, and let SkyPilot optimize for cost and availability behind the scenes.
But until now, SkyPilot was very much a local system — everything had to run from a single laptop/machine. This limited our ability to support running jobs asynchronously, reconnecting from another device/service, or collaborating across a team.
Today, we’re introducing SkyPilot’s new client-server architecture. Think of it as moving from a “local system” model to a centralized API server you can connect to from anywhere. The server handles all resource orchestration and control-plane logic, while your local client sends lightweight requests. It’s a new way to use SkyPilot that unlocks powerful features for both individuals and teams.
A quick glimpse of SkyPilot before
In the old model, you’d install SkyPilot on your laptop and run a command like sky launch, and all the control logic, from provisioning resources to submitting jobs, ran locally. That worked great for individual use cases.
# Install SkyPilot with prerelease allowed for Azure dependencies
uv pip install --prerelease allow skypilot-nightly[all]
# Launch a GPU cluster with 8 H100 GPUs and more than 32 CPU cores
sky launch --gpus H100:8 --cpus 32+ nvidia-smi
But if you wanted to:
- Run many launches at once: You’d have to script them and wait for each launch’s synchronous control logic to finish.
- Interact with your jobs/clusters from different devices: Not possible, unless you copied over your entire SkyPilot state (~/.sky) and your secrets/cloud credentials (a la Terraform state files).
- Collaborate with a teammate: You’d have to share credentials or run all SkyPilot commands on the same machine, as there was no multi-tenancy support.
- Schedule a recurring workflow: You’d have to manage the SkyPilot state for clusters/jobs across different stages of a pipeline or multiple triggers of a cron job.
In short, the old model was simple but lacked multi-device and multi-user features.
Why a client-server architecture?
Decoupling SkyPilot into a client (your CLI or SDK) and a server (the brain) unlocks a range of benefits:

For individuals:
- Asynchronous execution: Launch many jobs quickly/asynchronously without being blocked by the submission process.
- Access SkyPilot from different devices: Start a job from your office and check it on your home laptop.
- Integrate with production workflow orchestrators: Schedule your workflow with an orchestrator while SkyPilot manages diverse compute/infra – no cumbersome SkyPilot state management anymore.
For teams:
- Centralized deployment & seamless onboarding: Set up one SkyPilot API server (in the cloud or on Kubernetes), and team members can onboard with a single endpoint.
- Multi-tenancy: Share clusters, jobs, and services securely among teammates.
- Unified view and management: Get a single view of all running clusters and jobs across the organization and all infra you have.
- Fault-tolerant and cloud-native deployment: The SkyPilot API server can be deployed cloud-natively and is fully fault tolerant, so your team’s workloads survive server failures.
The client-server model transforms SkyPilot from a single-user system into a scalable, multi-user platform, making it easier for individuals and teams to run and manage their workloads.
For individuals
1. Asynchronous, non-blocking execution
SkyPilot can now launch a job and return immediately, without tying up your terminal. You can press Ctrl+C after launching, and the launch will keep running in the background. Or, use the new CLI flag --async to submit multiple tasks without blocking:
for i in {1..10}; do
  sky jobs launch -y --async "echo 'Hello from job $i'"
done
This is a huge win for anyone doing hyperparameter searches or large experiment batches.
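The same submit-then-wait pattern is available from the Python SDK, where calls return a request ID you can wait on later. Here is a minimal sketch of that flow; submit and wait are stand-in stubs so the snippet runs without SkyPilot installed (the real calls are sky.launch and sky.stream_and_get), but the control flow is identical.

```python
import uuid

# Stand-ins for sky.launch / sky.stream_and_get, so this sketch runs
# without SkyPilot installed. The real calls follow the same shape:
# submit returns a request ID immediately, wait blocks for the result.
_results = {}

def submit(task: str) -> str:
    """Submit a task; return a request ID right away, never blocking."""
    request_id = str(uuid.uuid4())
    _results[request_id] = f"output of {task}"
    return request_id

def wait(request_id: str) -> str:
    """Block until the request finishes and return its result."""
    return _results[request_id]

# Submit all jobs up front -- nothing blocks between submissions.
request_ids = [submit(f"echo 'Hello from job {i}'") for i in range(1, 11)]

# Wait for results only when you actually need them.
results = [wait(rid) for rid in request_ids]
print(len(results))  # 10
```

Because submission is decoupled from completion, a batch of ten launches returns in roughly the time of one submission, not ten sequential provisioning cycles.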
2. Connect from anywhere
Once you have a remotely deployed SkyPilot API server, you’re no longer tied to one laptop. Kick off a job in the office and check logs or continue your work from your home desktop by connecting to the same API server endpoint.
The server is always on – it keeps managing your resources even if your local machine shuts down.
3. Integrate with production workflow orchestrators
Schedule recurring production workflows with Airflow, Modal, or Temporal and find all resources/jobs across diverse infra in a single place. With the remote API server, each stage in a workflow or each trigger of a cron job can connect to the same API server and share the same SkyPilot state, i.e., resources/jobs are shared without manual state management.
For example, you can now run an Airflow pipeline that periodically processes newly arrived data, retrains an AI model, and runs evaluation, with each stage connecting to the same SkyPilot API server to run its workload on diverse infrastructure, such as clouds or Kubernetes clusters.

See more details on the Airflow integration.
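Conceptually, each pipeline stage is just a client call against the shared API server endpoint. Below is a dry-run sketch that only prints the commands each stage would issue; the stage names, YAML files, and endpoint are hypothetical, and a real orchestrator task would actually execute the sky command (or use the SDK) against the shared endpoint.

```python
# Dry-run sketch: each stage issues a SkyPilot CLI call against the
# same API server, so all stages share clusters/jobs with no manual
# state handling. Stage names, YAML files, and endpoint are made up.
API_SERVER = "http://user:[email protected]"

stages = [
    ("preprocess", "sky jobs launch -y preprocess.yaml"),
    ("train", "sky jobs launch -y train.yaml"),
    ("eval", "sky jobs launch -y eval.yaml"),
]

commands = [f"[{name} via {API_SERVER}] {cmd}" for name, cmd in stages]
for line in commands:
    print(line)
```

The point of the sketch: no stage passes state files to the next; the shared server endpoint is the only piece of configuration each stage needs.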
For teams
1. Centralized deployment & seamless onboarding
Imagine having a single SkyPilot server for your team. Everyone just runs sky api login to connect. That’s it.
Early users have reported that this new centralized deployment is a big time saver. Your infra team sets up the server and credentials once, and the whole org can immediately start running AI clusters/jobs on multiple clouds and Kubernetes clusters — without giving individuals detailed credentials or permissions.
$ sky api login
Enter your SkyPilot API server endpoint: http://user:[email protected]
$ sky check

2. Multi-tenancy: share clusters, jobs, services
SkyPilot now supports multi-user resource sharing. Team members can see each other’s active jobs, check logs, and coordinate usage. You can spin up a GPU cluster, and others can use the same cluster or submit jobs to it. No extra overhead.
Here are several examples of how a SkyPilot cluster can be shared:
- Alice can launch a cluster for sharing
# Alice: Start a shared cluster
sky launch -c shared-gpu --gpus A10G:8
- Bob can check the cluster, ssh into it, or submit more jobs to the cluster
# Bob: Check all available clusters
sky status -u
Clusters
NAME USER LAUNCHED RESOURCES STATUS AUTOSTOP
shared-gpu alice 1 days ago 1x AWS(g5.48xlarge) UP -
# Bob: Interactively ssh into the cluster
ssh shared-gpu
# Bob: Submit additional jobs to the cluster
sky exec shared-gpu --gpus A10G:1 bob-1-gpu.yaml
The same can be done with managed jobs and services. All of this drives up resource utilization and improves engineering efficiency.
3. A single pane of glass: view and manage in one place
With a remote API server, you now get a single place to see all running clusters, jobs, and services across the organization:
- If a cluster is idle, you can shut it down or stop it.
- If a job is hogging resources, you can check the logs or intervene.
- If a service needs to be updated, the on-call member can update it.
This helps keep cloud usage efficient and the cloud budget in check.
- Check all clusters/jobs/services within the organization:
sky status -u
- Stop an idle cluster:
sky stop dev
- Show logs of a job:
sky jobs logs 3
- Update a service:
sky serve update -n deepseek service-v2.yaml
4. Fault-tolerant and cloud-native deployment
SkyPilot provides a simple cloud-native deployment of the API server via a single Helm chart, with fault tolerance out of the box. A quick start for the deployment can be found in the next section.
Note: If you don’t have a Kubernetes cluster, SkyPilot can also automatically set one up on your existing machines with a single command: sky local up (see more details here).
How to deploy
Deploying the SkyPilot API server is easy. In local mode, where you just run SkyPilot on your own machine, no additional action is needed: SkyPilot automatically deploys a local API server when the first CLI/SDK call is made.
A remote API server can be deployed on either cloud VMs or a Kubernetes cluster (recommended). For the latter, deploying our open-source Helm chart sets up a fault-tolerant SkyPilot API server that is immediately accessible at its endpoint.
You can find the detailed instructions for setting up a SkyPilot API server on Kubernetes here, but here is a quick glance at how it can be done:
- Deploy an API server with Helm chart:
# Add SkyPilot helm chart repository
helm repo add skypilot https://helm.skypilot.co
helm repo update
# Deploy the helm chart
NAMESPACE=skypilot
WEB_USERNAME=user
WEB_PASSWORD=password
AUTH_STRING=$(htpasswd -nb $WEB_USERNAME $WEB_PASSWORD)
helm upgrade --install skypilot skypilot/skypilot-nightly --devel \
--namespace $NAMESPACE \
--create-namespace \
--set ingress.authCredentials=$AUTH_STRING
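If you would rather keep settings in a file than pass --set flags, the same auth credential can live in a Helm values file. Only ingress.authCredentials is shown below, mirroring the flag above; the placeholder hash is illustrative, not a real credential.

```yaml
# values.yaml -- equivalent to --set ingress.authCredentials=$AUTH_STRING
ingress:
  authCredentials: "user:$apr1$examplehash"  # output of: htpasswd -nb user password
```

Then deploy with: helm upgrade --install skypilot skypilot/skypilot-nightly --devel --namespace skypilot --create-namespace -f values.yaml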
- Find the endpoint
RELEASE_NAME=skypilot # This should match the name used in helm install/upgrade
NODE_PORT=$(kubectl get svc ${RELEASE_NAME}-ingress-controller-np -n $NAMESPACE -o jsonpath='{.spec.ports[?(@.name=="http")].nodePort}')
NODE_IP=$(kubectl get nodes -o jsonpath='{ $.items[0].status.addresses[?(@.type=="ExternalIP")].address }')
ENDPOINT=http://${WEB_USERNAME}:${WEB_PASSWORD}@${NODE_IP}:${NODE_PORT}
echo $ENDPOINT
You will see an endpoint like: http://user:[email protected]:30050
- Let SkyPilot use the endpoint
sky api login -e $ENDPOINT
sky api info
Using SkyPilot API server: http://user:[email protected]:30050
├── Status: healthy, commit: 296a22e868b9bdf1faccbe3effbfb858a5a05905, version: 1.0.0-dev0
└── User: alice (1dca28cd)
- Optional: To set up cloud credentials in the API server, you can mount the credentials to the deployment. See more details here.
You are now all set with your remote SkyPilot API server that can be shared with your team!
💡 Hint: To quickly test a remote API server without a Kubernetes cluster, you can also follow the instructions for deploying the API server on a normal VM. See the SkyPilot docs.
Upgrade and migrate to new client-server architecture
When upgrading from a SkyPilot version <= 0.8.0 (before the client-server architecture), here are some notes on migration.
If you just want to run SkyPilot locally, nothing changes compared to before: SkyPilot automatically starts a local API server that handles all your requests. You don’t need to think about the API server at all, although after upgrading SkyPilot you may want to run sky api stop so that the new version takes effect.
For existing scripts written with the SkyPilot Python SDK, the main difference after upgrading is that execution is now asynchronous: SDK calls return a request ID, and you add sky.stream_and_get to wait for a request to finish. If blocking execution is required, see our detailed migration guide.
# Previously
handle, job_id = sky.launch(...)
# Now
request_id = sky.launch(...)
handle, job_id = sky.stream_and_get(request_id)
Next steps
Install SkyPilot and try it out:
# Install SkyPilot with prerelease allowed for Azure dependencies
uv pip install --prerelease allow skypilot-nightly[all]
# Launch a GPU cluster with 8 H100 GPUs and more than 32 CPU cores
sky launch --gpus H100:8 --cpus 32+ nvidia-smi
- Find more information about team deployment.
- Join our Slack workspace.
Don’t hesitate to reach out on Slack if you’d like help deploying a server for your team.