SkyPilot Endpoints is a next-gen LLM inference system designed for production-ready inference in multi-cluster environments. A single YAML deploys the full serving stack - engine, autoscaler, gateway, certificates, metrics - and runs it across any number of Kubernetes clusters under one endpoint URL with a focus on performance and production-readiness.

Multi-cluster inference made simple

GPU supply is limited, and teams take capacity wherever they can get it - across clouds, regions, and on-prem.

But the Kubernetes-native LLM serving stack today (KServe, llm-d, Dynamo) is single-cluster, and operating it across the resulting fleet compounds both deployment and maintenance cost.

SkyPilot Endpoints provides the cross-cluster control plane on top. It sees registered Kubernetes clusters as one pool and handles:

  • Placement. On deploy, SkyPilot selects a cluster with sufficient GPU capacity for the configured replica count, accounting for preferences (region, cost, availability) declared in the YAML.
  • Scaling. When autoscaling adds replicas beyond the home cluster’s capacity, additional replicas land on the next cluster with available GPUs.
  • Failure recovery. On cluster failure, replicas are recreated on healthy clusters. The endpoint URL does not change.

Clients see one endpoint URL; the infra team manages one spec across the fleet.

Deploy once. Place anywhere. Survive cluster failure.

Below, SkyPilot autoscales replicas across clusters behind a single endpoint URL. Click the health dot on any cluster to terminate it and watch the replicas migrate.

Incoming traffic
0 RPS
↓ ↓ ↓ ↓ ↓ ↓Incoming traffic
Endpoint is idle. Drag the slider to send traffic.

One YAML, one dashboard

The components of the modern LLM inference stack are great in isolation: inference engines (vLLM, SGLang, TensorRT-LLM), serving frameworks (KServe, llm-d, Dynamo), autoscaling (KEDA), KV cache-aware routing (Gateway API + Inference Extension), TLS (cert-manager), metrics and tracing (Prometheus, Alloy).

Assembling them in a performant configuration is tedious per-deployment work - engine tuning, autoscaling wired to the right Prometheus query, KV cache-aware routing rules, certificate plumbing - and keeping the stack alive through engine upgrades, CRD migrations, and version-compatibility checks is a recurring tax.

SkyPilot Endpoints replaces it with a single specification that deploys and manages inference across all your clusters. Here’s a minimal spec for an endpoint:

name: glm-prod
model: zai-org/GLM-5.2
resources:
  accelerators: B200:8
replicas: 2
routing: kv_cache_aware
$ sky endpoint up endpoint.yaml

Six lines to set up the whole stack from earlier - inference engine, serving framework, autoscaler, inference gateway, intelligent routing, metrics and more. SkyPilot handles setting up CRDs, wires up inference metrics to prometheus, installs KEDA when you turn on autoscaling and gives you a public (or private) URL. Works on every cluster you own.

Optional fields cover production knobs:

  • engine: — choose between vLLM, SGLang and more. Passthrough for all engine flags (max_model_len, enforce_eager, …), or override the entrypoint for custom engines.
  • routing: — KV cache-aware routing using Gateway API Inference Extension or P2C.
  • prefill: — prefill/decode disaggregation (heterogeneous GPU types supported).
  • volumes: — Attach shared model cache across replicas for faster cold starts.
  • autoscaling: — scale on kv_cache_utilization, queue_depth or custom PromQL metrics with tunable up/down delays. Scale-to-zero supported.
  • Rolling updates, auth/TLS, gated-model auth and more.

The underlying stack builds on battle-tested open-source frameworks - KServe and llm-d. vLLM works out of the box, support for more inference engines coming soon.

YAML in, dashboard out. One dashboard for the whole fleet — not one per cluster:

  • Overview — Pod health and replica spread across clusters.
  • Serving metrics — latency (TTFT, TPOT, end-to-end at p50/p95/p99), throughput (output tok/s, req/s), saturation (KV-cache util, queue depth, GPU util).
  • Logs — per-pod engine logs, including sidecars and init containers.
  • Chat playground — sanity-check the deployed model from your browser.
app.skypilot.co/endpoints/my-endpoint
my-endpoint Ready v3 deployed 2h ago
Model
zai-org/GLM-5.2
MoE · FP8 · 32K ctx
Replicas
4 / 4
2 prefill · 2 decode
Fleet
32× H100 80GB
CoreWeave · Nebius
PodRoleRegionGPUUtilAge
my-endpoint-prefill-0PrefillCoreWeaveus-iad-18× H10088%2h
my-endpoint-decode-0DecodeCoreWeaveus-iad-18× H10074%2h
my-endpoint-prefill-1PrefillNebiuseu-north18× H10091%2h
my-endpoint-decode-1DecodeNebiuseu-north18× H10014%2m
2m Ready decode-1 · Pod is ready — joined decode pool
12m Pulled decode-1 · Image vllm/vllm-openai pulled
14m Scheduled decode-1 · Assigned to Nebius eu-north1 to satisfy autoscale
2h Deploy endpoint · v3 deployed (was v2) — rolling restart complete
Requests / s
42
TTFT
82 ms
TPOT
9 ms
KV cache
38%
E2E Request Latency
p50 p95 p99
400 ms 200 ms 0
−5m−3m−1mnow
Token Throughput
Prefill Decode
3K tok/s 1.5K 0
−5m−3m−1mnow
Time To First Token
p95
200 ms 100 ms 0
−5m−3m−1mnow
Time Per Output Token
p95
20 ms 10 ms 0
−5m−3m−1mnow
Live replica-0 ▾
↻ Refresh ⤓ Download
INFO 05-23 12:34:08 [api_server.py:512] vLLM API server v0.6.7 — model=zai-org/GLM-5.2
INFO 05-23 12:34:08 [config.py:1234] dtype=fp8, max_model_len=32768, tensor_parallel_size=8
INFO 05-23 12:34:09 [model_runner.py:891] Loading model weights · 8× H100 80GB
INFO 05-23 12:34:46 [worker.py:234] Weights loaded · 312 GiB sharded across 8 GPUs (37.4s)
INFO 05-23 12:34:47 [kv_cache.py:120] KV cache allocated · 512 blocks · 76.4 GiB / GPU
INFO 05-23 12:34:49 [api_server.py:678] Uvicorn running on http://0.0.0.0:8000
INFO 05-23 12:34:49 [engine.py:340] Accepting requests · max_num_seqs=256
INFO 05-23 12:34:51 [logger.py:50] POST /v1/chat/completions · 312 prompt tok
INFO 05-23 12:34:52 [metrics.py:230] Throughput · prompt 1.2K tok/s · gen 184 tok/s · KV 38%
WARN 05-23 12:34:53 [scheduler.py:88] Prefix cache miss for seq 0x9f3 (cold start)
INFO 05-23 12:34:54 [logger.py:50] POST /v1/chat/completions · 188 prompt tok
ERROR 05-23 12:34:55 [server.py:194] Client 10.0.0.71 disconnected mid-stream (seq 0xaa1)
INFO 05-23 12:34:56 [engine.py:445] Added seq_id=0xcc4 · running=4 waiting=0
INFO 05-23 12:34:57 [metrics.py:230] Throughput · prompt 1.4K tok/s · gen 220 tok/s · KV 41%
INFO 05-23 12:34:58 [logger.py:50] POST /v1/chat/completions · 256 prompt tok
INFO 05-23 12:34:58 [scheduler.py:142] Speculative decoding hit-rate 71% over last 200 tokens
INFO 05-23 12:34:59 [engine.py:445] Added seq_id=0xee2 · running=5 waiting=0
INFO 05-23 12:35:00 [router.py:71] KV-aware routing · sticky pod replica-1 for prefix 0x4c8
INFO 05-23 12:35:01 [metrics.py:230] Throughput · prompt 1.5K tok/s · gen 240 tok/s · KV 44%
INFO 05-23 12:35:02 [lora.py:88] Loaded LoRA adapter customer-A:v3 (32 MiB) into slot 2
INFO 05-23 12:35:03 [logger.py:50] POST /v1/chat/completions · 412 prompt tok · adapter=customer-A
WARN 05-23 12:35:04 [autoscaler.py:55] Queue depth > 4 for 30s — signaling KEDA scale-up
INFO 05-23 12:35:05 [engine.py:445] Added seq_id=0xff7 · running=6 waiting=1
INFO 05-23 12:35:06 [metrics.py:230] Throughput · prompt 1.7K tok/s · gen 268 tok/s · KV 47%
INFO 05-23 12:35:07 [logger.py:50] POST /v1/embeddings · 1.4K tok · client=ingestion-pipeline
INFO 05-23 12:35:08 [scheduler.py:218] Batched 18 requests · avg batch 6.0 · queue 0
Chat Playground
OpenAI-compatible chat interface for my-endpoint

Inference by day. Training by Night.

The economics of running your own GPUs only work if you keep them busy. Unfortunately, most organizations don’t.

AI teams today split their GPU fleet into two fixed partitions: one for inference, one for training. However, Inference demand is inherently spiky — it surges during business hours and drops at night.

But those GPUs earmarked for inference can’t be touched by training, even when they’re sitting idle at 3 AM. Meanwhile, the training partition can’t expand to absorb that idle capacity. You’re paying full price for hardware you can’t fully use.

SkyPilot improves your GPU utilization by providing a unified interface for both training and inference workloads. With Managed Jobs for training and the new SkyPilot Endpoints for inference, you manage both through the same system - and SkyPilot handles the dynamic GPU allocation automatically.

The key insight which drives utilization: training workloads can be preemptible; inference workloads are latency-sensitive.

  • Inference replicas get high priority. When they need to scale up, they get GPUs immediately.
  • Training jobs can be low priority. They use all available GPUs, but gracefully yield when inference needs more.
  • SkyPilot handles the job recovery. When a training job is preempted, SkyPilot automatically restarts it from its last checkpoint.

Static Partitioning vs SkyPilot

Here are the two approaches on the same 16-H100 cluster:

  • The static side keeps 8 GPUs walled off for inference and 8 for training, no sharing.
  • SkyPilot’s unified pool treats all 16 as one shared resource - inference borrows training’s GPUs when load rises, and gives them back when it falls.

Watch what happens as the request rate climbs: the static side caps out at 8 GPUs and starts dropping queries; the unified pool absorbs the burst.

0 RPS

Static Partitioning

Training partition 8 GPUs
no sharing
Inference partition 8 GPUs
7
GPUs idle
0
Queries dropped/s
SLO violation: inference demand exceeds 8-GPU partition. Queries are being dropped.

SkyPilot Unified

Unified pool 16 GPUs
0
GPUs idle
0
Queries dropped/s
All queries served. Training yields GPUs as needed.

In action: Sharing 16 H100s between inference and training

Below is a real trace from a SkyPilot deployment sharing 16 H100s between inference and training.

  • Both workloads start together: 14 GPUs running training jobs, 2 GPUs for inference.
  • Then we hit the endpoint with bursty traffic.

Watch what happens - each tile is one H100 GPU:

14
Training GPUs
0:00
Time (min:sec)
2
Inference GPUs
Training
Inference
Idle
Steady state: 14 GPUs running training jobs, 2 GPUs serving inference.
0:00 Burst starts Peak inference Scale down Recovery

SkyPilot makes sure no GPUs sit idle. When inference needed capacity, it got it instantly. When it didn’t, training used every available GPU. Training jobs that were preempted automatically resumed from their last checkpoint - without any manual intervention from researchers or infra teams.

Designed for performance, reliability and observability

SkyPilot Endpoints is already deployed in production for serving frontier models by top AI teams.

They choose it for its multi-cluster capabilities, but keep it for the performance, reliability and observability features that come with the stack:

  • High performance. KV cache-aware routing, prefill/decode disaggregation, speculative decoding and KV offloading available. LoRA support. Model caching cuts 235B-class cold starts to under 1 min.
  • Reliability. Automatic failure recovery. Autoscaling on KV-cache util, queue depth, RPS or your own metric. Scale-to-zero support. Rolling updates with deployment versioning. Builds on battle-tested vLLM + llm-d + KServe stack.
  • Observability. Replicas, traffic, KV-cache util, request latency, TTFT, TPOT, live logs + tracing, OTel/Datadog/Fluentbit/Promtail support, an in-browser playground — all in one dashboard, across every cluster you own.

Early access

Want to try SkyPilot Endpoints? Request access and we’ll be in touch.