Video analysis at scale is resource-intensive. Sports analytics companies process thousands of hours of footage. Security teams monitor multiple camera feeds. Content moderation systems analyze user-uploaded videos around the clock. Each of these use cases requires detecting and tracking objects across video frames - a task that was historically painful due to limited model capabilities and computational constraints.
The emergence of foundation models for video understanding has changed what’s possible. But running these models on large video archives still presents a practical challenge: how do you efficiently distribute processing across available GPU resources without spending more time on infrastructure plumbing than actual analysis?
SAM3 segmenting soccer players and the ball. For demo purposes, the video is sampled at 1 fps for faster inference.
SAM3: A foundation model for video segmentation
SAM3 (Segment Anything 3) is Meta’s latest addition to the Segment Anything family. The big change from previous versions is text-based prompting - you can now describe what you want to segment using natural language like “soccer player” or “ball” instead of clicking on objects or drawing bounding boxes. The model then detects, segments, and tracks those objects across all video frames.
SAM3 was trained on a dataset containing over 4 million annotated concepts, enabling it to handle 270,000 unique concepts - roughly 50x more than existing benchmarks. Meta reports it doubles the accuracy of previous systems on both image and video segmentation tasks.
For this example, we’ll use SAM3 to process a soccer video dataset from Kaggle, detecting and tracking players and balls throughout each video.
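Under the hood, the core workflow is only a handful of calls. Here's a condensed sketch using the Hugging Face transformers API - these are the same calls the full processing script uses later in this post, lightly simplified, with session options trimmed and `frames` assumed to be a list of decoded video frames:

```python
import torch
from transformers import Sam3VideoModel, Sam3VideoProcessor

model = Sam3VideoModel.from_pretrained("facebook/sam3").to("cuda", dtype=torch.bfloat16).eval()
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")

# `frames` is assumed to be a list of decoded video frames
session = processor.init_video_session(
    video=frames, inference_device="cuda", dtype=torch.bfloat16)
session = processor.add_text_prompt(inference_session=session, text=["soccer player", "ball"])

# Propagate the text prompts through the video, yielding per-frame outputs
with torch.no_grad():
    for out in model.propagate_in_video_iterator(inference_session=session):
        masks = processor.postprocess_outputs(session, out)
```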
The single-GPU bottleneck
You could run SAM3 on a single node. Here’s what that looks like with SkyPilot (installation):
```yaml
resources:
  accelerators: L40S:1

setup: |
  # Install dependencies and download dataset
  ...

run: |
  source .venv/bin/activate
  # Process all videos sequentially
  for video in /data/videos/*.mp4; do
    python process_segmentation.py "$video"
  done
```
Launch it:
sky launch -c sam3-single task.yaml
This works, but the dataset contains over 100 videos. Processing them sequentially on a single node would take a long time. If you have a deadline or need results quickly, you’ll want to parallelize across multiple nodes.
Even if you scale up within a single cluster, you may still be leaving GPU capacity idle on other clusters or clouds. Ideally, you’d use all available GPUs across your infrastructure - not just the ones on whichever cluster you happened to deploy to.
Distributed batch inference with SkyPilot pools
SkyPilot’s Pools feature lets you create a fleet of GPU workers that share the same environment. You define the setup once, and SkyPilot keeps the workers warm and ready to process jobs as they come in.

The key benefits for video processing workloads:
- No cold starts: SAM3 model weights and the video dataset are downloaded once during pool creation, not repeated for each job
- Unified job queue: Submit any number of jobs - SkyPilot distributes them across available workers automatically
- Multi-cloud flexibility: Use GPUs from Kubernetes clusters, AWS, or other providers in the same pool
Setting up multi-cloud infrastructure
Most organizations have GPU capacity scattered across different providers - a Kubernetes cluster on-prem, some reserved instances on AWS, maybe capacity from a neocloud provider. SkyPilot unifies access to all of these through a single interface.
First, check what infrastructure is available:
$ sky check
...
🎉 Enabled infra 🎉
AWS [compute, storage]
Kubernetes [compute]
Allowed contexts:
├── k8s-cluster-two
└── k8s-cluster-one
...
In my case, I have 2 Kubernetes clusters configured (k8s-cluster-one and k8s-cluster-two) as well as AWS.
You could just as easily add GCP, Azure, or a neocloud like Lambda Labs, Nebius, or CoreWeave.
Query GPU availability on each Kubernetes cluster:
$ sky show-gpus --infra k8s/k8s-cluster-one
Kubernetes GPUs
Context: k8s-cluster-one
GPU REQUESTABLE_QTY_PER_NODE UTILIZATION
H100 1, 2, 4, 8 8 of 8 free
L40S 1 2 of 2 free
Kubernetes per-node GPU availability
CONTEXT NODE GPU UTILIZATION
k8s-cluster-one computeinstance-e00bv1fw2gdqr3qjx8 L40S 1 of 1 free
k8s-cluster-one computeinstance-e00fbanz9gtw87fga9 H100 8 of 8 free
k8s-cluster-one computeinstance-e00rjv8ch8n7chabbb H100 8 of 8 free
k8s-cluster-one computeinstance-e00yqvp56bxk9r4jt5 L40S 1 of 1 free
$ sky show-gpus --infra k8s/k8s-cluster-two
Kubernetes GPUs
Context: k8s-cluster-two
GPU REQUESTABLE_QTY_PER_NODE UTILIZATION
L40S 1 3 of 3 free
Kubernetes per-node GPU availability
CONTEXT NODE GPU UTILIZATION
k8s-cluster-two computeinstance-e00ee5zmwmbtscq4ht L40S 1 of 1 free
k8s-cluster-two computeinstance-e00vc7e70xjdd3xgwg L40S 1 of 1 free
k8s-cluster-two computeinstance-e00xp4071zprmt6vwr L40S 1 of 1 free
We have 2 L40S GPUs on k8s-cluster-one, 3 on k8s-cluster-two, and AWS as a fallback. SkyPilot can use all of these together in a single pool.
Implementation
Pool configuration
The pool YAML defines worker infrastructure and shared setup (view on GitHub):
```yaml
pool:
  workers: 7

resources:
  accelerators: L40S:1

file_mounts:
  ~/.kaggle/kaggle.json: ~/.kaggle/kaggle.json
  /outputs:
    source: s3://my-skypilot-bucket

workdir: .

setup: |
  # Setup runs once on all workers (must be non-blocking)
  sudo apt-get update && sudo apt-get install -y unzip ffmpeg
  uv venv .venv --python 3.12
  source .venv/bin/activate
  uv pip install -r requirements.txt

  # Download soccer video dataset from Kaggle (store in S3 to avoid re-downloading)
  DATASET_PATH=/outputs/datasets/soccer-videos
  if [ ! -d "$DATASET_PATH" ]; then
    echo "Downloading dataset from Kaggle to S3..."
    mkdir -p /outputs/datasets
    kaggle datasets download shreyamainkar/football-soccer-videos-dataset --force
    unzip -q football-soccer-videos-dataset.zip -d $DATASET_PATH
    rm -f football-soccer-videos-dataset.zip
  fi
  echo "Setup complete!"
```
Job configuration
The job YAML defines the workload that runs on each worker (view on GitHub):
```yaml
name: sam3-segmentation-job

resources:
  accelerators: L40S:1

secrets:
  HF_TOKEN: null

run: |
  source .venv/bin/activate
  uv pip install -r requirements.txt
  echo "Job rank: ${SKYPILOT_JOB_RANK}/${SKYPILOT_NUM_JOBS}"

  # Get list of all videos
  VIDEO_DIR=/outputs/datasets/soccer-videos
  mapfile -t VIDEOS < <(find ${VIDEO_DIR} -name "*.mp4" | sort)
  TOTAL_VIDEOS=${#VIDEOS[@]}
  echo "Total videos: ${TOTAL_VIDEOS}"

  # Calculate start and end indices for this job
  CHUNK_SIZE=$((TOTAL_VIDEOS / SKYPILOT_NUM_JOBS))
  REMAINDER=$((TOTAL_VIDEOS % SKYPILOT_NUM_JOBS))
  START_IDX=$((SKYPILOT_JOB_RANK * CHUNK_SIZE))
  if [ ${SKYPILOT_JOB_RANK} -lt ${REMAINDER} ]; then
    START_IDX=$((START_IDX + SKYPILOT_JOB_RANK))
    CHUNK_SIZE=$((CHUNK_SIZE + 1))
  else
    START_IDX=$((START_IDX + REMAINDER))
  fi
  END_IDX=$((START_IDX + CHUNK_SIZE))
  echo "Processing videos ${START_IDX} to ${END_IDX}"

  # Process each video in this job's chunk
  for ((i=START_IDX; i<END_IDX; i++)); do
    video="${VIDEOS[$i]}"
    echo "Processing: $video"
    python process_segmentation.py "$video" --max-frames 50 || echo "Failed: $video"
  done
  echo "Job complete! Results saved to S3 bucket."
```
SkyPilot provides $SKYPILOT_JOB_RANK and $SKYPILOT_NUM_JOBS environment variables. Each job calculates which slice of videos it should process, ensuring work is evenly distributed without overlap.
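To make the arithmetic concrete, here's the same even-split logic as a small Python sketch (chunk_bounds is a hypothetical helper for local sanity-checking, not part of the repo):

```python
def chunk_bounds(total: int, num_jobs: int, rank: int) -> tuple[int, int]:
    """Mirror of the bash arithmetic in job.yaml: split `total` items into
    `num_jobs` contiguous chunks, giving the first `total % num_jobs` jobs
    one extra item each so no item is skipped or processed twice."""
    chunk, remainder = divmod(total, num_jobs)
    if rank < remainder:
        start = rank * chunk + rank
        chunk += 1
    else:
        start = rank * chunk + remainder
    return start, start + chunk

# 100 videos across 10 jobs: every job gets exactly 10 videos
assert [chunk_bounds(100, 10, r) for r in range(3)] == [(0, 10), (10, 20), (20, 30)]
# 103 videos across 10 jobs: the first 3 jobs each pick up one extra video
assert chunk_bounds(103, 10, 0) == (0, 11)
assert chunk_bounds(103, 10, 3) == (33, 43)
```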
Processing script
The Python script handles the actual segmentation; the key sections are shown below (view on GitHub):
"""SAM3 video segmentation for soccer players and ball."""
import cv2
import numpy as np
from PIL import Image
import torch
from transformers import Sam3VideoModel, Sam3VideoProcessor
PROMPTS = ["soccer player", "ball"]
PLAYER_COLOR = (255, 100, 100)
BALL_COLOR = (100, 255, 100)
def process_video(model, processor, video_path, output_dir, sample_fps=1, max_frames=0):
"""Run SAM3 segmentation on video and save results."""
video_name = Path(video_path).stem
frames, original_fps, output_fps = load_video_frames(video_path, sample_fps, max_frames)
print(f" {len(frames)} frames (sampled at {output_fps} fps from {original_fps} fps)")
# Initialize video session with SAM3
session = processor.init_video_session(
video=frames,
inference_device="cuda",
processing_device="cpu",
video_storage_device="cpu",
dtype=torch.bfloat16,
)
session = processor.add_text_prompt(inference_session=session, text=PROMPTS)
# Track objects through video
masks_by_frame = {}
with torch.no_grad():
for out in model.propagate_in_video_iterator(
inference_session=session, max_frame_num_to_track=len(frames)):
processed = processor.postprocess_outputs(session, out)
# Store masks for each frame...
# Overlay colored masks and save video
output_frames = []
for i, frame in enumerate(frames):
masks = masks_by_frame.get(i, {})
output_frames.append(overlay_masks(frame, masks, colors) if masks else frame)
save_video(output_frames, output_video_path, output_fps)
# Save metadata JSON
result = {
"video": video_name,
"frames_processed": len(frames),
"objects_detected": len(obj_to_prompt),
"players_detected": total_players,
"balls_detected": total_balls,
}
return result
def main():
print("Loading SAM3 model...")
model = Sam3VideoModel.from_pretrained("facebook/sam3").to("cuda", dtype=torch.bfloat16).eval()
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")
print("Model loaded!")
result = process_video(model, processor, video_path, output_dir, args.sample_fps, args.max_frames)
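The excerpt elides a few helpers (load_video_frames, overlay_masks, save_video). For reference, here's a minimal sketch of what an overlay_masks-style helper could look like, assuming each mask is a boolean numpy array matching the frame size; the repo's actual implementation may differ:

```python
import numpy as np

def overlay_masks(frame: np.ndarray, masks: dict, colors: dict, alpha: float = 0.5) -> np.ndarray:
    """Alpha-blend a colored overlay onto `frame` for every object mask.

    `masks` maps object id -> boolean (H, W) array; `colors` maps
    object id -> RGB tuple, e.g. PLAYER_COLOR or BALL_COLOR.
    """
    out = frame.astype(np.float32)
    for obj_id, mask in masks.items():
        color = np.array(colors[obj_id], dtype=np.float32)
        # Blend only the pixels covered by this object's mask
        out[mask] = (1 - alpha) * out[mask] + alpha * color
    return out.astype(np.uint8)
```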
Running the pipeline
Create the pool
Spin up 7 workers across both Kubernetes clusters and AWS:
sky jobs pool apply -p sam3-pool pool.yaml

Check pool status
Once the pool is created, verify that workers are ready:
$ sky jobs pool status sam3-pool
Pools
NAME VERSION UPTIME STATUS WORKERS
sam3-pool 1 13m 56s READY 7/7
Pool Workers
POOL_NAME ID VERSION LAUNCHED INFRA RESOURCES STATUS USED_BY
sam3-pool 1 1 16 mins ago Kubernetes (k8s-cluster-two) 1x(gpus=L40S:1, cpus=4, mem=16, ...) READY -
sam3-pool 2 1 15 mins ago AWS (us-east-1a) 1x(gpus=L40S:1, g6e.xlarge, ...) READY -
sam3-pool 3 1 15 mins ago AWS (us-east-1a) 1x(gpus=L40S:1, g6e.xlarge, ...) READY -
sam3-pool 4 1 16 mins ago Kubernetes (k8s-cluster-two) 1x(gpus=L40S:1, cpus=4, mem=16, ...) READY -
sam3-pool 5 1 16 mins ago Kubernetes (k8s-cluster-two) 1x(gpus=L40S:1, cpus=4, mem=16, ...) READY -
sam3-pool 6 1 16 mins ago Kubernetes (k8s-cluster-one) 1x(gpus=L40S:1, cpus=4, mem=16, ...) READY -
sam3-pool 7 1 16 mins ago Kubernetes (k8s-cluster-one) 1x(gpus=L40S:1, cpus=4, mem=16, ...) READY -
SkyPilot filled the pool using all 5 available L40S GPUs from both Kubernetes clusters, then provisioned 2 additional workers on AWS to reach the requested 7 workers.

Submit batch jobs
Submit 10 jobs to process the video dataset:
sky jobs launch --pool sam3-pool --num-jobs 10 --secret HF_TOKEN job.yaml
Seven jobs start immediately (one per worker), and the remaining three queue up.
$ sky jobs queue
Fetching managed job statuses...
Managed jobs
In progress tasks: 3 PENDING, 7 RUNNING
ID TASK NAME REQUESTED SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS POOL
169 - sam3-segmentation-job 1x[L40S:1] 4 mins ago 4m 51s - 0 PENDING sam3-pool
168 - sam3-segmentation-job 1x[L40S:1] 4 mins ago 4m 51s - 0 PENDING sam3-pool
167 - sam3-segmentation-job 1x[L40S:1] 4 mins ago 4m 51s 4m 44s 0 RUNNING sam3-pool (worker=6)
166 - sam3-segmentation-job 1x[L40S:1] 4 mins ago 4m 51s 4m 44s 0 RUNNING sam3-pool (worker=5)
165 - sam3-segmentation-job 1x[L40S:1] 4 mins ago 4m 51s 4m 46s 0 RUNNING sam3-pool (worker=4)
164 - sam3-segmentation-job 1x[L40S:1] 4 mins ago 4m 51s - 0 PENDING sam3-pool
163 - sam3-segmentation-job 1x[L40S:1] 4 mins ago 4m 51s 4m 49s 0 RUNNING sam3-pool (worker=2)
162 - sam3-segmentation-job 1x[L40S:1] 4 mins ago 4m 51s 4m 49s 0 RUNNING sam3-pool (worker=3)
161 - sam3-segmentation-job 1x[L40S:1] 4 mins ago 4m 51s 4m 46s 0 RUNNING sam3-pool (worker=1)
160 - sam3-segmentation-job 1x[L40S:1] 4 mins ago 4m 51s 4m 43s 0 RUNNING sam3-pool (worker=7)

View logs
Watch a specific job’s progress:
$ sky jobs logs 167
...
(sam3-segmentation-job, pid=3213) Model loaded!
(sam3-segmentation-job, pid=3213) Processing: 87
(sam3-segmentation-job, pid=3213) 50 frames (sampled at 1 fps from 25.0 fps)
(sam3-segmentation-job, pid=3213) 0%| | 0/50 [00:00<?, ?it/s]
100%|██████████| 50/50 [00:48<00:00, 1.03it/s]
...
Scale up
If you need results faster, add more workers:
sky jobs pool apply --pool sam3-pool --workers 15
sky jobs launch --pool sam3-pool --num-jobs 20 job.yaml
SkyPilot will provision additional workers from available infrastructure to meet the new count.
Cleanup
When finished, tear down the pool:
sky jobs pool down sam3-pool
Results
Processed videos and metadata are synced to S3:
$ aws s3 ls s3://my-skypilot-bucket/segmentation_results/ --recursive
2025-12-22 08:53:37 0 segmentation_results/
2025-12-22 08:54:22 0 segmentation_results/1/
2025-12-22 08:54:23 231 segmentation_results/1/1_metadata.json
2025-12-22 08:54:23 3041504 segmentation_results/1/1_segmented.mp4
2025-12-22 08:55:13 0 segmentation_results/10/
2025-12-22 08:55:13 234 segmentation_results/10/10_metadata.json
2025-12-22 08:55:13 4291581 segmentation_results/10/10_segmented.mp4
...
Each video gets a segmented output with colored overlays (red for players, green for balls) and a metadata JSON with detection statistics.
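Because every video also gets that small metadata JSON, aggregating detection statistics across the whole dataset takes only a few lines of Python. A minimal sketch, assuming you've synced the results down locally first:

```python
import json
from pathlib import Path

# Assumes the results were synced locally first, e.g.:
#   aws s3 sync s3://my-skypilot-bucket/segmentation_results ./segmentation_results
results = [json.loads(p.read_text())
           for p in sorted(Path("segmentation_results").glob("*/*_metadata.json"))]

# Sum up the per-video stats written by process_segmentation.py
total_frames = sum(r["frames_processed"] for r in results)
total_players = sum(r["players_detected"] for r in results)
total_balls = sum(r["balls_detected"] for r in results)
print(f"{len(results)} videos, {total_frames} frames processed, "
      f"{total_players} player detections, {total_balls} ball detections")
```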
How SkyPilot Pools unlocks capacity and boosts throughput
The main benefit of SkyPilot Pools is unlocking GPU capacity that would otherwise sit unused across different clusters and clouds. Here’s how throughput scales with the infrastructure in this example:
| Configuration | Available GPUs | Relative throughput |
|---|---|---|
| Single GPU instance | 1 | 1x |
| Single K8s cluster (k8s-cluster-one) | 2 L40S | 2x |
| Both K8s clusters | 5 L40S | 5x |
| Multi-cloud pool (K8s + AWS) | 7 L40S | 7x |
Without SkyPilot, you’d be limited to whichever cluster has the most available GPUs - in this case, just 3 on k8s-cluster-two. With Pools, you aggregate capacity across both Kubernetes clusters and burst to AWS when needed, achieving near-linear scaling.
This pattern becomes more valuable as workloads grow. If you need to process 1000 videos instead of 100, you can scale the pool to 20+ workers across multiple regions and clouds - something that would require significant custom orchestration otherwise.
Adapting for other use cases
The same pool-based pattern works for other video processing tasks:
Change text prompts: Edit PROMPTS in process_segmentation.py for different objects:
PROMPTS = ["person", "car", "traffic light"] # Traffic monitoring
PROMPTS = ["whale", "dolphin", "boat"] # Marine research
Adjust frame sampling: By default, the script samples 1 frame per second. For higher-fidelity tracking:
python process_segmentation.py video.mp4 --sample-fps 5
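The sampling itself happens in the script's load_video_frames helper, which isn't shown in the excerpt above. A minimal OpenCV sketch of the idea (the repo's version may differ):

```python
import cv2

def load_video_frames(video_path, sample_fps=1, max_frames=0):
    """Decode a video, keeping roughly `sample_fps` frames per second.

    Returns (frames, original_fps, output_fps); frames are RGB arrays.
    """
    cap = cv2.VideoCapture(str(video_path))
    original_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if fps metadata is missing
    step = max(1, round(original_fps / sample_fps))   # keep every `step`-th frame
    frames, idx = [], 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok or (max_frames and len(frames) >= max_frames):
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames, original_fps, original_fps / step
```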
Use different GPUs: Update the pool and job YAML files:
```yaml
resources:
  accelerators: H100:1  # More VRAM for longer videos
```
Non-video workloads: SkyPilot Pools work for any batch processing task, not just video. See the documentation for examples like batch text classification with vLLM and document OCR with DeepSeek OCR.
Resources
- SkyPilot Pools Documentation
- SAM3 on Hugging Face
- Complete example code - includes pool.yaml, job.yaml, and process_segmentation.py
- Soccer Videos Dataset