SkyPilot Endpoints: Production-Ready Inference on Every Cluster You Own

SkyPilot Endpoints is a next-gen LLM inference system built for production use in multi-cluster environments. A single YAML deploys the full serving stack - engine, autoscaler, gateway, certificates, metrics - and runs it across any number of Kubernetes clusters under one endpoint URL.

Multi-cluster inference made simple

GPU supply is limited, and teams take capacity wherever they can get it - across clouds, regions, and on-prem.

But the Kubernetes-native LLM serving stack today (KServe, llm-d, Dynamo) is single-cluster, and operating it across the resulting fleet compounds both deployment and maintenance cost.

SkyPilot Endpoints provides the cross-cluster control plane on top. It sees registered Kubernetes clusters as one pool and handles:

Placement. On deploy, SkyPilot selects a cluster with sufficient GPU capacity for the configured replica count, accounting for preferences (region, cost, availability) declared in the YAML.
Scaling. When autoscaling adds replicas beyond the home cluster’s capacity, additional replicas land on the next cluster with available GPUs.
Failure recovery. On cluster failure, replicas are recreated on healthy clusters. The endpoint URL does not change.
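
The three behaviors above can be pictured with a small Python sketch. It is an illustration under stated assumptions, not SkyPilot's implementation: the `Cluster` class, the `place` and `recover` helpers, the cluster names, and the "cheapest cluster first" preference are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical view of one registered Kubernetes cluster."""
    name: str
    free_gpus: int
    cost_per_gpu_hour: float  # illustrative preference signal
    healthy: bool = True
    replicas: list = field(default_factory=list)


def place(replica: str, gpus: int, clusters: list) -> str:
    """Put one replica on the cheapest healthy cluster with enough free GPUs."""
    candidates = [c for c in clusters if c.healthy and c.free_gpus >= gpus]
    if not candidates:
        raise RuntimeError(f"no cluster has {gpus} free GPUs for {replica}")
    target = min(candidates, key=lambda c: c.cost_per_gpu_hour)
    target.free_gpus -= gpus
    target.replicas.append(replica)
    return target.name


def recover(failed: Cluster, gpus: int, clusters: list) -> dict:
    """Mark a cluster failed and recreate its replicas on healthy clusters."""
    failed.healthy = False
    orphans, failed.replicas = failed.replicas, []
    return {r: place(r, gpus, clusters) for r in orphans}


clusters = [
    Cluster("cluster-a", free_gpus=16, cost_per_gpu_hour=2.0),
    Cluster("cluster-b", free_gpus=24, cost_per_gpu_hour=3.0),
]
# Three 8-GPU replicas: two fit on the cheaper cluster, the third spills over.
placements = [place(f"replica-{i}", 8, clusters) for i in range(3)]
print(placements)
# Losing cluster-a moves its replicas to the remaining healthy cluster.
moved = recover(clusters[0], 8, clusters)
print(moved)
```

In this toy model the caller only ever sees replica names; which cluster holds a replica is an internal detail, which is the property that lets the endpoint URL stay fixed.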

Clients see one endpoint URL; the infra team manages one spec across the fleet.

Deploy once. Place anywhere. Survive cluster failure.

Below, SkyPilot autoscales replicas across clusters behind a single endpoint URL. Click the health dot on any cluster to terminate it and watch the replicas migrate.

One YAML, one dashboard

The components of the modern LLM inference stack are great in isolation: inference engines (vLLM, SGLang, TensorRT-LLM), serving frameworks (KServe, llm-d, Dynamo), autoscaling (KEDA), KV cache-aware routing (Gateway API + Inference Extension), TLS (cert-manager), metrics and tracing (Prometheus, Alloy).

Assembling them in a performant configuration is tedious per-deployment work - engine tuning, autoscaling wired to the right Prometheus query, KV cache-aware routing rules, certificate plumbing - and keeping the stack alive through engine upgrades, CRD migrations, and version-compatibility checks is a recurring tax.

SkyPilot Endpoints replaces this assembly work with a single specification that deploys and manages inference across all your clusters. Here’s a minimal spec for an endpoint:

name: glm-prod
model: zai-org/GLM-5.2
resources:
  accelerators: B200:8
replicas: 2
routing: kv_cache_aware

$ sky endpoint up endpoint.yaml

Six lines to set up the whole stack from earlier - inference engine, serving framework, autoscaler, inference gateway, intelligent routing, metrics and more. SkyPilot sets up CRDs, wires inference metrics to Prometheus, installs KEDA when you turn on autoscaling, and gives you a public (or private) URL. It works on every cluster you own.

Optional fields cover production knobs:

engine: — choose between vLLM, SGLang and more. Passthrough for all engine flags (max_model_len, enforce_eager, …), or override the entrypoint for custom engines.
routing: — KV cache-aware routing using Gateway API Inference Extension or P2C.
prefill: — prefill/decode disaggregation (heterogeneous GPU types supported).
volumes: — Attach shared model cache across replicas for faster cold starts.
autoscaling: — scale on kv_cache_utilization, queue_depth or custom PromQL metrics with tunable up/down delays. Scale-to-zero supported.
Rolling updates, auth/TLS, gated-model auth and more.
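
For readers unfamiliar with the P2C routing option, here is a minimal Python sketch, assuming P2C refers to the standard power-of-two-choices technique: sample two replicas at random and send the request to the less loaded one. The `p2c_pick` function and the in-flight counters are hypothetical and say nothing about how the SkyPilot gateway is actually implemented.

```python
import random


def p2c_pick(in_flight: dict, rng: random.Random) -> str:
    """Sample two distinct replicas and return the one with fewer in-flight requests."""
    a, b = rng.sample(sorted(in_flight), 2)
    return a if in_flight[a] <= in_flight[b] else b


rng = random.Random(0)  # fixed seed so the run is repeatable
in_flight = {f"replica-{i}": 0 for i in range(4)}
for _ in range(1000):
    # Requests never complete in this toy run, so counters only grow.
    in_flight[p2c_pick(in_flight, rng)] += 1

spread = max(in_flight.values()) - min(in_flight.values())
print(in_flight, spread)
```

Comparing only two candidates keeps the routing decision cheap while still avoiding the hot spots that purely random routing produces. KV cache-aware routing goes further by also considering which replica already holds the request's prefix.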

The underlying stack builds on battle-tested open-source frameworks - KServe and llm-d. vLLM works out of the box; support for more inference engines is coming soon.

YAML in, dashboard out. One dashboard for the whole fleet — not one per cluster:

Overview — Pod health and replica spread across clusters.
Serving metrics — latency (TTFT, TPOT, end-to-end at p50/p95/p99), throughput (output tok/s, req/s), saturation (KV-cache util, queue depth, GPU util).
Logs — per-pod engine logs, including sidecars and init containers.
Chat playground — sanity-check the deployed model from your browser.

[Dashboard preview: endpoint my-endpoint (v3, Ready) serving zai-org/GLM-5.2 on 32× H100 80GB across CoreWeave and Nebius, with 4 / 4 replicas (2 prefill, 2 decode). Panels: pod list, recent events, E2E request latency (p50/p95/p99), token throughput (prefill/decode), time to first token (p95), time per output token (p95), live engine logs, and an OpenAI-compatible chat playground.]

Inference by day. Training by night.

The economics of running your own GPUs only work if you keep them busy. Unfortunately, most organizations don’t.

AI teams today split their GPU fleet into two fixed partitions: one for inference, one for training. However, inference demand is inherently spiky - it surges during business hours and drops at night.

But those GPUs earmarked for inference can’t be touched by training, even when they’re sitting idle at 3 AM. Meanwhile, the training partition can’t expand to absorb that idle capacity. You’re paying full price for hardware you can’t fully use.

SkyPilot improves your GPU utilization by providing a unified interface for both training and inference workloads. With Managed Jobs for training and the new SkyPilot Endpoints for inference, you manage both through the same system - and SkyPilot handles the dynamic GPU allocation automatically.

The key insight that drives utilization: training workloads can be preemptible; inference workloads are latency-sensitive.

Inference replicas get high priority. When they need to scale up, they get GPUs immediately.
Training jobs can be low priority. They use all available GPUs, but gracefully yield when inference needs more.
SkyPilot handles the job recovery. When a training job is preempted, SkyPilot automatically restarts it from its last checkpoint.
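
The priority rules above can be sketched as a toy GPU pool in Python. This is a simplified model under stated assumptions, not SkyPilot's scheduler: `TrainingJob`, `GpuPool`, the "preempt the newest job first" rule, and the `restarts` counter standing in for checkpoint recovery are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrainingJob:
    """Hypothetical preemptible training job."""
    name: str
    gpus: int
    running: bool = False
    restarts: int = 0  # times the job was resumed after a preemption


class GpuPool:
    """Toy shared pool: inference has priority, training fills what is left."""

    def __init__(self, total_gpus: int):
        self.total = total_gpus
        self.inference_gpus = 0
        self.jobs = []

    def free(self) -> int:
        used = sum(j.gpus for j in self.jobs if j.running)
        return self.total - self.inference_gpus - used

    def submit(self, job: TrainingJob) -> None:
        job.running = job.gpus <= self.free()
        self.jobs.append(job)

    def scale_inference(self, gpus: int) -> list:
        """Set inference demand, preempting training jobs until it fits."""
        was_running = {j.name for j in self.jobs if j.running}
        self.inference_gpus = gpus
        for job in reversed(self.jobs):  # preempt the newest jobs first
            if self.free() >= 0:
                break
            job.running = False
        if self.free() < 0:
            raise RuntimeError("inference demand exceeds the pool")
        self._resume_waiting()
        return [j.name for j in self.jobs
                if j.name in was_running and not j.running]

    def _resume_waiting(self) -> None:
        """Restart paused jobs that fit again."""
        for job in self.jobs:
            if not job.running and job.gpus <= self.free():
                job.running = True
                job.restarts += 1  # a real job would reload its checkpoint here


pool = GpuPool(total_gpus=16)
pool.scale_inference(2)
job_a, job_b = TrainingJob("train-a", 8), TrainingJob("train-b", 6)
pool.submit(job_a)
pool.submit(job_b)

preempted_on_burst = pool.scale_inference(8)   # traffic burst
preempted_after = pool.scale_inference(2)      # traffic subsides
print(preempted_on_burst, preempted_after, job_b.restarts)
```

The design point is that inference demand is applied first and training absorbs the consequences. Checkpointing is what makes that safe for the training side.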

Static Partitioning vs SkyPilot

Here are the two approaches on the same 16-H100 cluster:

The static side keeps 8 GPUs walled off for inference and 8 for training, no sharing.
SkyPilot’s unified pool treats all 16 as one shared resource - inference borrows training’s GPUs when load rises, and gives them back when it falls.

Watch what happens as the request rate climbs: the static side caps out at 8 GPUs and starts dropping queries; the unified pool absorbs the burst.
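
The arithmetic behind that comparison is simple enough to write down. The Python sketch below uses a made-up demand trace, measured in GPUs needed by inference, and a hypothetical `split_demand` helper; only the 16-GPU total and the 8/8 static split come from the text above.

```python
def split_demand(demand_gpus: int, inference_cap: int) -> tuple:
    """Return (GPUs' worth of demand served, GPUs' worth of demand dropped)."""
    served = min(demand_gpus, inference_cap)
    return served, demand_gpus - served


TOTAL_GPUS = 16
STATIC_INFERENCE_GPUS = 8  # static split: 8 for inference, 8 for training

# Hypothetical inference demand over time, measured in GPUs needed.
demand_trace = [2, 4, 8, 12, 14, 6, 2]

rows = []
for demand in demand_trace:
    static_served, static_dropped = split_demand(demand, STATIC_INFERENCE_GPUS)
    unified_served, unified_dropped = split_demand(demand, TOTAL_GPUS)
    rows.append({
        "demand": demand,
        "static_dropped": static_dropped,
        "static_idle_inference": STATIC_INFERENCE_GPUS - static_served,
        "unified_dropped": unified_dropped,
        "unified_training_gpus": TOTAL_GPUS - unified_served,
    })

for row in rows:
    print(row)
```

At low demand the static split leaves inference GPUs idle, and at high demand it drops work. In the unified pool, whatever inference does not need goes to training.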

In action: Sharing 16 H100s between inference and training

Below is a real trace from a SkyPilot deployment sharing 16 H100s between inference and training.

Both workloads start together: 14 GPUs running training jobs, 2 GPUs for inference.
Then we hit the endpoint with bursty traffic.

Watch what happens - each tile is one H100 GPU:

SkyPilot makes sure no GPUs sit idle. When inference needed capacity, it got it instantly. When it didn’t, training used every available GPU. Training jobs that were preempted automatically resumed from their last checkpoint - without any manual intervention from researchers or infra teams.

Designed for performance, reliability and observability

Top AI teams already run SkyPilot Endpoints in production to serve frontier models.

They choose it for its multi-cluster capabilities, but keep it for the performance, reliability and observability features that come with the stack:

High performance. KV cache-aware routing, prefill/decode disaggregation, speculative decoding, and KV offloading are available, along with LoRA support. Model caching cuts 235B-class cold starts to under 1 min.
Reliability. Automatic failure recovery. Autoscaling on KV-cache util, queue depth, RPS or your own metric. Scale-to-zero support. Rolling updates with deployment versioning. Builds on the battle-tested vLLM + llm-d + KServe stack.
Observability. Replicas, traffic, KV-cache util, request latency, TTFT, TPOT, live logs + tracing, OTel/Datadog/Fluentbit/Promtail support, an in-browser playground — all in one dashboard, across every cluster you own.
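
To make the autoscaling item concrete, here is a minimal Python sketch of queue-depth autoscaling with separate scale-up and scale-down delays and scale-to-zero. It is an assumption-laden illustration rather than a description of how SkyPilot or KEDA compute replica counts: the `DelayedAutoscaler` class, its parameters, and the numbers in the usage lines are hypothetical.

```python
import math
from typing import Optional


class DelayedAutoscaler:
    """Toy autoscaler: a change is applied only after the need persists for a delay."""

    def __init__(self, target_queue_per_replica: float, min_replicas: int,
                 max_replicas: int, up_delay_s: float, down_delay_s: float):
        self.target = target_queue_per_replica
        self.min = min_replicas
        self.max = max_replicas
        self.up_delay_s = up_delay_s
        self.down_delay_s = down_delay_s
        self.replicas = min_replicas
        self._pending_since: Optional[float] = None
        self._pending_direction = 0

    def desired(self, queue_depth: float) -> int:
        """Replicas needed to keep the per-replica queue at or below the target."""
        want = math.ceil(queue_depth / self.target)
        return max(self.min, min(self.max, want))

    def observe(self, now_s: float, queue_depth: float) -> int:
        """Feed one metric sample and return the replica count now in effect."""
        want = self.desired(queue_depth)
        direction = (want > self.replicas) - (want < self.replicas)
        if direction == 0:
            self._pending_since, self._pending_direction = None, 0
            return self.replicas
        if direction != self._pending_direction:
            # The need changed direction, so restart the delay timer.
            self._pending_since, self._pending_direction = now_s, direction
        delay = self.up_delay_s if direction > 0 else self.down_delay_s
        if now_s - self._pending_since >= delay:
            self.replicas = want
            self._pending_since, self._pending_direction = None, 0
        return self.replicas


scaler = DelayedAutoscaler(target_queue_per_replica=4, min_replicas=0,
                           max_replicas=8, up_delay_s=30, down_delay_s=300)
samples = [(0, 12), (30, 12), (60, 0), (200, 0), (360, 0)]
history = [scaler.observe(t, depth) for t, depth in samples]
print(history)
```

A short scale-up delay and a long scale-down delay is a common pairing: it reacts quickly to bursts while avoiding replica churn when traffic dips briefly.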

Early access

Want to try SkyPilot Endpoints? Request access and we’ll be in touch.
