Intro

Modern AI workloads typically run on large numbers of expensive GPUs. AI infrastructure teams try to get the most value from that spend by maximizing GPU utilization, but collecting telemetry at scale quickly becomes challenging. The complexity grows further when multiple users run simultaneous workloads across diverse infrastructure, such as Kubernetes clusters deployed on different cloud providers or on-premises.

SkyPilot cuts through this complexity by consolidating GPU observability into a unified dashboard. With SkyPilot’s GPU metrics, you get:

  • A comprehensive GPU usage overview across multiple Kubernetes clusters.
  • Detailed per-workload GPU utilization, memory consumption, and power usage from multiple users.

High-level GPU overview of an entire Kubernetes cluster: [screenshot: Kubernetes cluster overview]

Quickstart

Setting up SkyPilot’s GPU metrics is quick. Run the following command in the Kubernetes context where your SkyPilot API server is deployed:

helm upgrade --install skypilot skypilot/skypilot-nightly --devel \
  --namespace skypilot \
  --create-namespace \
  --reuse-values \
  --set apiService.metrics.enabled=true \
  --set prometheus.enabled=true \
  --set grafana.enabled=true
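
Once the upgrade finishes, a quick sanity check (a minimal sketch, assuming the skypilot namespace used above) is to list the pods in the release namespace; the Prometheus and Grafana pods should come up alongside the API server:

# Confirm the metrics components are running next to the API server
kubectl get pods --namespace skypilot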

If your SkyPilot API server is running in the same Kubernetes cluster as your GPUs, no further setup is required. Otherwise, run the following command in each of your GPU cluster Kubernetes contexts:

helm upgrade --install skypilot skypilot/skypilot-prometheus-server --devel \
  --namespace skypilot \
  --create-namespace
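
If you manage several GPU clusters, the same chart can be applied to each one by pointing Helm at the corresponding context. A minimal sketch, assuming hypothetical context names gpu-cluster-1 and gpu-cluster-2:

# Install the Prometheus scraper into each GPU cluster's Kubernetes context
for ctx in gpu-cluster-1 gpu-cluster-2; do
  helm upgrade --install skypilot skypilot/skypilot-prometheus-server --devel \
    --kube-context "$ctx" \
    --namespace skypilot \
    --create-namespace
done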

For more details on setting up GPU metrics in SkyPilot, see the setup guide in the SkyPilot documentation.

Use Cases

SkyPilot’s workload-level filtering on the clusters page lets ML engineers verify that individual jobs scale efficiently from a local machine to a large cluster. Per-job metrics make it easy to determine whether a workload is bottlenecked on memory, compute, or power, and to iterate quickly on batch sizes, model parallelism, or input pipeline strategies.

For example, during a Llama fine-tuning run, increasing the sequence length from 128 to 8192 boosted GPU utilization across NVIDIA H100 GPUs by roughly 15% (see the screenshots below). The run still falls short of full saturation, but it shows how SkyPilot makes it easy to spot under-utilization early and apply targeted adjustments to get closer to peak performance; a sketch of that relaunch loop follows the screenshots.

Llama fine-tuning with sequence length 128: [screenshot: low sequence length]

Llama fine-tuning with sequence length 8192: [screenshot: high sequence length]
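
Iterating on a knob like sequence length fits naturally into a SkyPilot workflow, since the same task can be relaunched with a different environment variable and the two runs compared side by side in the dashboard. The sketch below is illustrative; the task file, environment variable, and training command are assumptions rather than the exact setup from the runs above:

# Illustrative SkyPilot task; the script and flags are placeholders
cat > llama-finetune.yaml <<'EOF'
resources:
  accelerators: H100:8
envs:
  SEQ_LEN: "128"
run: |
  torchrun --nproc_per_node 8 train.py --seq-len $SEQ_LEN
EOF

# Relaunch with a longer sequence length, then compare GPU utilization
# for the two runs on the SkyPilot dashboard
sky launch -c llama-ft llama-finetune.yaml --env SEQ_LEN=8192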

The SkyPilot infra dashboard gives ML infra teams a comprehensive view across Kubernetes clusters, making it easy to spot idle resources and maximize hardware utilization. Real-time monitoring surfaces hardware issues such as XID errors, so teams can address reliability problems before they escalate. With workload-based filtering, infra teams can dig into exactly how GPU resources are allocated, supporting fair scheduling and user accountability.
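
For teams that want to build alerts or ad-hoc queries on top of the dashboard, the underlying data can also be pulled straight from the bundled Prometheus. A minimal sketch, assuming standard NVIDIA DCGM exporter metric names and a skypilot-prometheus-server service in the skypilot namespace (the service name, port, and label names are assumptions; adjust them to your deployment):

# Port-forward the bundled Prometheus locally
kubectl port-forward --namespace skypilot svc/skypilot-prometheus-server 9090:80

# In another shell: GPUs currently reporting a nonzero XID error code
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=DCGM_FI_DEV_XID_ERRORS > 0'

# Average GPU utilization grouped by pod, useful for per-workload accounting
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg by (pod) (DCGM_FI_DEV_GPU_UTIL)'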

Next Steps

SkyPilot’s unified Kubernetes GPU observability is available now and takes less than five minutes to set up. To get started, follow the SkyPilot GPU Metrics Setup Guide and start monitoring your multi-cluster workloads with ease.