When you’re training large language models or running distributed ML workloads, network performance can make or break your job. A poorly configured network can turn your multi-thousand-dollar GPU cluster into an expensive space heater. But configuring high-performance networking across different cloud providers and managed infrastructure platforms? That’s where things get really messy.
At SkyPilot, we’ve spent countless hours wrestling with the networking peculiarities of cloud providers offering VMs (Virtual Machines - individual server instances you can SSH into) and managed Kubernetes services (GKE, EKS, Nebius Managed Kubernetes, …). Each has its own way of doing things, its own gotchas, and its own performance characteristics. Whether you’re provisioning VMs or using managed Kubernetes clusters, the networking complexity is overwhelming.
Here’s what we learned along the way, and why you should care.
The Problem: Every Platform Does Networking Differently
Picture this: You have a distributed training job that works beautifully on cloud VMs with InfiniBand. You decide to move it to a managed Kubernetes service for better orchestration. Sounds simple. The performance shouldn’t change that much, right?
Wrong.
Each platform has its own high-performance networking story:
Cloud VMs:
- GCP: GPUDirect-TCPX for A3 High and GPUDirect-TCPXO for A3 Mega instances (Nvidia H100), GPUDirect-RDMA for A3 Ultra (Nvidia H200) and A4 (Nvidia B200) instances
- Nebius: InfiniBand with MLX5 adapters and UCX optimizations
Managed Kubernetes Services:
- GKE: All of the above GCP networking, plus Kubernetes complexity, pod networking layers, and GPU device plugin configurations
- Nebius Managed Kubernetes: InfiniBand setup plus Kubernetes networking and container orchestration overhead
The Manual Setup Nightmare
Before we built our network tier abstraction, setting up a cluster manually was extremely tedious and error-prone. You need to look at the cloud provider’s guide for your specific instance type (say, A4 instances for B200s), determine which machine type is best suited for your workload, then navigate a complex setup process.
This setup process includes ensuring the NIC is properly connected to the GPU, installing GPU drivers, and configuring dozens of environment variables. You have to run each command manually to get the network set up correctly, and if you miss a command or make a typo, you’ll need to debug or restart the entire setup process, which can take hours or days. Remember, you’re doing all this debugging on expensive high-end GPU instances: a 2-node H200:8 cluster costs around $100 per hour, so just 3 days of debugging at 8 hours per day would cost you ~$2,400 in compute charges alone.
After all that technical setup, you still need to run NCCL tests to verify the network is performing as expected and to confirm the environment is set up correctly for whatever workload you run. Then you have to repeat this entire process for every other instance type or cloud provider you want to use.
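For example, a typical manual verification pass uses NVIDIA’s nccl-tests. The commands below are an illustrative sketch, assuming Open MPI and CUDA are already installed; the hostnames, install paths, and propagated variables are placeholders you would adapt to your cluster:

# Build nccl-tests with MPI support (MPI_HOME/CUDA_HOME are assumptions; adjust to your install)
git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi CUDA_HOME=/usr/local/cuda

# Run an all-reduce benchmark across 2 nodes x 8 GPUs (node1/node2 are placeholder hostnames),
# adding -x for any platform-specific NCCL variables your setup requires
mpirun -np 16 -H node1:8,node2:8 \
  -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

If the reported bus bandwidth comes in well below what the NIC topology should deliver, something in the setup is misconfigured, and the debugging loop starts again.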
For managed Kubernetes deployments, you add yet another layer of complexity with pod networking, GPU device plugins, and container orchestration overhead.
Here’s what the manual configuration actually looks like across different platforms:
The full TCPX setup alone involves 25+ commands and environment variables.
For GCP H100 A3 High instances (GPUDirect-TCPX):
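# Install the NVIDIA GPU driver via COS extensions and make the driver libraries executable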
sudo cos-extensions install gpu -- --version=latest
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia
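# Launch the GPUDirect-TCPX receive-datapath-manager (tcpgpudmarxd) with access to all 8 GPUs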
docker run --pull=always --rm \
--name receive-datapath-manager \
--detach \
--privileged \
--cap-add=NET_ADMIN --network=host \
--volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia1:/dev/nvidia1 \
--device /dev/nvidia2:/dev/nvidia2 \
--device /dev/nvidia3:/dev/nvidia3 \
--device /dev/nvidia4:/dev/nvidia4 \
--device /dev/nvidia5:/dev/nvidia5 \
--device /dev/nvidia6:/dev/nvidia6 \
--device /dev/nvidia7:/dev/nvidia7 \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
--device /dev/nvidiactl:/dev/nvidiactl \
--env LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
--volume /run/tcpx:/run/tcpx \
--entrypoint /tcpgpudmarxd/build/app/tcpgpudmarxd \
us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd \
--gpu_nic_preset a3vm --gpu_shmem_type fd --uds_path "/run/tcpx" --setup_param "--verbose 128 2 0"
# Set environment variables
export NCCL_SOCKET_IFNAME=eth0
export NCCL_CROSS_NIC=0
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_SOCKET_NTHREADS=1
export NCCL_NET_GDR_LEVEL=PIX
export NCCL_DYNAMIC_CHUNK_SIZE=524288
export NCCL_P2P_PXN_LEVEL=0
export NCCL_P2P_NET_CHUNKSIZE=524288
export NCCL_P2P_PCI_CHUNKSIZE=524288
export NCCL_P2P_NVL_CHUNKSIZE=1048576
export NCCL_BUFFSIZE=8388608
export NCCL_MAX_NCHANNELS=8
export NCCL_MIN_NCHANNELS=8
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
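# GPUDirect-TCPX specifics: data-path NICs, control NIC, and per-NIC CPU core bindings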
export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4
export NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0
export NCCL_GPUDIRECTTCPX_TX_BINDINGS="eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177"
export NCCL_GPUDIRECTTCPX_RX_BINDINGS="eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/var/lib/tcpx/lib64"
export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=50000
export NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX="/run/tcpx"
export NCCL_GPUDIRECTTCPX_FORCE_ACK=0
For GCP H100 A3 Mega instances (GPUDirect-TCPXO):
Similar complexity, but with a somewhat different setup procedure: different environment variables and different network interface configurations. The setup involves another dozen-plus environment variables specific to TCPXO.
For GCP B200/H200 A4/A3 Ultra instances (GPUDirect-RDMA):
Yet another approach for even higher-end instances. Following the GPUDirect-RDMA guide requires setting up RDMA network parameters and managing specific NCCL environment variables for optimal performance.
And this is just for one cloud provider.
Imagine doing this for every cloud provider and every instance type, and keeping it all up to date as providers change their networking stacks.
But with SkyPilot, all of this complexity is abstracted away with a simple network_tier configuration.
Performance Difference
NCCL Performance Tests
We conducted NCCL performance tests on a GCP cluster with 2x a3-highgpu-8g (2x H100:8) instances to compare GPUDirect-TCPX vs standard networking:
Key insight: Performance benefits scale with message size, achieving up to 3.8x speedup for large messages that are common in distributed ML training.
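One way to approximate such a comparison yourself (a sketch of the methodology, not necessarily the exact benchmark harness behind these numbers) is to run the same nccl-tests binary twice on the cluster: once forcing NCCL’s built-in socket transport as the baseline, and once with the GPUDirect-TCPX plugin environment from the setup shown earlier:

# Baseline: force NCCL's plain TCP socket transport (no GPUDirect-TCPX)
mpirun -np 16 -H node1:8,node2:8 -x NCCL_NET=Socket \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

# GPUDirect-TCPX: propagate the plugin library path and NIC settings configured above
mpirun -np 16 -H node1:8,node2:8 \
  -x LD_LIBRARY_PATH -x NCCL_GPUDIRECTTCPX_SOCKET_IFNAME -x NCCL_GPUDIRECTTCPX_CTRL_DEV \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1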
SGLang Serving Performance
LLM serving performance (DeepSeek-R1-Distill-Llama-8B) on the same cluster configuration:
For LLM serving workloads, proper networking configuration delivers measurable performance improvements - 11.3% higher throughput and 8% lower latency.
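For reference, a serving run along these lines can be reproduced with SGLang’s launcher and its bundled benchmark script. The sketch below is an assumption about the shape of such a run rather than our exact benchmark configuration; the tensor-parallel degree, addresses, and ports are placeholders, and flag names can differ across SGLang releases:

# On node 0 (NODE0_IP is a placeholder): serve the model with tensor parallelism spanning both nodes
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --host 0.0.0.0 --tp 16 --nnodes 2 --node-rank 0 --dist-init-addr NODE0_IP:50000
# On node 1, run the same command with --node-rank 1

# From a client, once the server is up on the default port 30000:
python -m sglang.bench_serving --backend sglang \
  --host NODE0_IP --port 30000 --num-prompts 1000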
How SkyPilot Simplifies This
Instead of manually configuring networking for each cloud provider and infrastructure type, SkyPilot abstracts this complexity for both VMs and managed Kubernetes:
# skypilot.yaml - Works on VMs and managed Kubernetes
resources:
  network_tier: best  # Automatically configures your GPU cluster with the correct network settings for optimal performance
  ...

run: |
  ...
Adding network_tier: best is the one-line change that does the setup for you. That’s it!
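Launching follows the usual SkyPilot flow. Assuming the YAML above is saved as task.yaml and using a hypothetical cluster name:

# Provision the cluster and run the task; the network setup is applied automatically
sky launch -c my-cluster task.yaml

# Stream the job's output (e.g., NCCL test results or training logs)
sky logs my-cluster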
For more details, check out our Nebius InfiniBand documentation and see a complete example configuration.
Future Work
We’re continuing to improve the network tier system:
- Additional cloud providers: Adding support for other clouds with custom networking stacks (AWS EFA, AWS HyperPod, Azure InfiniBand, Oracle Cloud Infrastructure RoCE, Lambda Labs InfiniBand, CoreWeave, RunPod)
P.S. If You Need High-Performance Distributed Training, Try SkyPilot
We built this network tier system because we got tired of seeing users struggle with cloud networking configuration across both VMs and managed Kubernetes services. With SkyPilot, you can focus on your ML code instead of becoming a cloud configuration expert or Kubernetes networking specialist.
SkyPilot automatically finds the best instances across clouds and infrastructure types, configures high-performance networking (whether it’s VMs or managed Kubernetes), and scales your jobs efficiently. Whether you’re training LLMs on GCP VMs, running distributed inference on GKE, or using Nebius managed Kubernetes for large-scale data processing, we handle the infrastructure complexity so you can focus on what matters.
Give it a try and let us know how it goes in our Slack community!