Embedding generation is at the heart of modern AI applications, from recommendation systems to retrieval-augmented generation (RAG). However, running batch inference for embedding generation on million- or trillion-scale datasets is not trivial.
Even after spending days developing a well-tuned batch inference script, getting hundreds or thousands of GPUs and running jobs on them in parallel is still a huge pain.

In this post, we share how we used SkyPilot to run large-scale embedding generation on ~30M records with a state-of-the-art LLM embedding model, speeding up the workload by 9x compared to a typical autoscaling Kubernetes setup (AWS EKS).
We open-sourced the code for running large-scale embedding generation here.
TL;DR: “forgotten” regions bring 9x speedup
By utilizing more than one region, we sped up embedding generation by 9x, reducing the time from 20 hours to 2.3 hours. As a bonus, we also cut costs by 61% along the way.
The following map shows the distribution of the jobs across the globe, spanning Europe, North America, Asia, Australia, and more:
We achieved this performance boost with SkyPilot, which automatically searches for resources across regions and manages the lifecycle of hundreds of jobs.
A key insight: popular regions, like US-East-1, have limited capacity, but “forgotten regions” across the globe have plenty of available capacity. By spreading workloads across these regions, you can access significantly more compute.
Let’s now dive into the details of the workloads.
Embeddings: Neural foundation of AI search
Embeddings form the cornerstone of contemporary AI search and retrieval systems. Unlike traditional keyword-based approaches (Google Search, etc.), embeddings enable AI to understand conceptual similarities between phrases like “breaking my lease” and “terminate rental agreement”, even when they share no common keywords.
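As a toy illustration of this idea (not part of our pipeline), the sketch below embeds the two phrases with a small off-the-shelf model, an arbitrary stand-in chosen for brevity, and compares them with cosine similarity:

```python
# Toy illustration: semantically related phrases land close together in embedding
# space even with zero keyword overlap. The small model is an arbitrary stand-in.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["breaking my lease", "terminate rental agreement"])
print(util.cos_sim(emb[0], emb[1]).item())  # prints a high similarity score
```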
How Tech Giants Use Embeddings
Generating embeddings for large datasets demands enormous computational resources. Consider these examples: a typical Nvidia L4 GPU running a 7-billion-parameter embedding model processes approximately 2,000 text tokens per second. Computing embeddings for a billion-item collection would require over 5.8 days on a single machine. A larger dataset, such as the falcon-refinedweb dataset with 600 billion text tokens, would take more than 9.5 years to process!
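The arithmetic behind these estimates is simple; here is a quick back-of-envelope sketch (the 2,000 tokens/s figure is the single-L4 throughput quoted above, and the one-billion-token total used in the first estimate is our assumption for illustration):

```python
# Back-of-envelope single-GPU estimates based on the throughput quoted above.
TOKENS_PER_SECOND = 2_000  # observed L4 throughput for a 7B-parameter embedding model

def processing_days(total_tokens: float) -> float:
    """Days needed to embed `total_tokens` on a single GPU."""
    return total_tokens / TOKENS_PER_SECOND / 86_400  # 86,400 seconds per day

print(f"{processing_days(1e9):.1f} days")           # ~5.8 days for ~1B tokens
print(f"{processing_days(600e9) / 365:.1f} years")   # ~9.5 years for falcon-refinedweb
```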
Real-world case study: Generating embeddings for Amazon reviews
We developed a large-scale embedding generation pipeline for a real-world case study on Amazon reviews, and open-sourced the code here.
We use the book partition of the Amazon Reviews 2023 dataset, containing ~30M Amazon book reviews, and generate embeddings for the reviews with Alibaba-NLP/gte-Qwen2-7B-instruct, a state-of-the-art specialized embedding LLM and one of the top models on the MTEB leaderboard.
We use a single L4 GPU on AWS (a g6.xlarge instance) to handle each embedding generation job.
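For reference, here is a minimal sketch of what a single worker's embedding step could look like. The actual compute_text_embeddings.py is in the linked repo; the sentence-transformers interface, half-precision loading, and batch size below are our own assumptions:

```python
# Minimal per-worker sketch (assumptions: sentence-transformers interface, fp16
# weights, batch size). The real compute_text_embeddings.py in the repo also
# handles data loading, checkpointing, and uploading results to S3.
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-7B-instruct",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.float16},  # 7B weights in fp16 fit a 24 GB L4
)

def embed_batch(texts: list[str]):
    # Normalized embeddings make cosine-similarity search a simple dot product.
    return model.encode(texts, batch_size=8, normalize_embeddings=True)
```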
Existing approach: Hitting the availability wall
With conventional approaches, embedding generation is confined to a single region, which limits how far it can scale:
- Kubernetes clusters (EKS, GKE, etc.): Kubernetes clusters have to be deployed in a single region, limited by regional GPU availability even with autoscaling.
- Cloud Batch Services (AWS Batch, GCP Batch, etc.): Cloud batch services are also limited by the number of GPUs you can provision in a single region.
- Slurm: Slurm is a popular job scheduler for HPC, but it is not designed for the cloud and does not scale easily, especially beyond a single region.
- Manual GPU provisioning: Manually provisioning 100+ GPUs and managing them is a nightmare and not scalable.

Using an autoscaling cluster in the AWS region ap-northeast-2, we could get at most 47 L4 GPUs (g6.xlarge instances) after 2 hours due to limited availability.
While constantly retrying to provision more GPUs, we hit the InsufficientInstanceCapacity error 2,955 times.
======= Job Status Summary =======
Total jobs: 128
Successfully provisioned or in progress: 47
Failed to provision: 81
======= Error Type Summary =======
InsufficientInstanceCapacity errors: 2955
- On-demand instances: 2955
Affected zones:
- ap-northeast-2: 2955 occurrences
Challenges for using conventional cloud services:
- Availability: Our quota in ap-northeast-2 allowed for 128 GPUs, but we could only get 47 due to availability constraints.
- Long processing time: Processing the full Amazon reviews dataset would take 20+ hours.
- High cost: On-demand pricing resulted in a high projected cost of ~$710.
Going beyond a single region
To mitigate the availability constraints, we use SkyPilot to distribute the workload across multiple regions, and it pays off with 9x more resources.
The following figure shows the throughput comparison between single- and multi-region approaches for Amazon reviews dataset embedding generation:

Note that we cut off the figure at 2 hours due to budget constraints, but we can already see a 9x speedup compared to the single-region approach (an autoscaling EKS cluster).
Going across multiple regions, we still hit availability errors in various regions, but SkyPilot automatically searches regions across the globe and retries provisioning, ultimately getting us 406 L4 GPUs with a token throughput of 797.7k tokens/s!
======= Job Status Summary =======
Total jobs: 406
Successfully provisioned or in progress: 406
Failed to provision: 0
Spot instances: 396
- Successfully provisioned or in progress: 396
- Failed to provision: 0
On-demand instances: 10
- Successfully provisioned or in progress: 10
- Failed to provision: 0
The full error type summary:
======= Error Type Summary =======
MaxSpotInstanceCountExceeded errors: 6673
- Spot instances: 6673
- On-demand instances: 0
Affected zones:
- ca-central-1b: 607 occurrences
- ca-central-1a: 606 occurrences
- ap-south-1a: 564 occurrences
- eu-west-2b: 520 occurrences
- eu-west-2a: 509 occurrences
- us-east-1d: 450 occurrences
- us-east-2b: 427 occurrences
- ap-south-1b: 425 occurrences
- us-east-1a: 414 occurrences
- eu-central-1b: 390 occurrences
- eu-central-1a: 374 occurrences
- us-east-2a: 334 occurrences
- us-west-2c: 319 occurrences
- us-west-2d: 313 occurrences
- us-east-1b: 211 occurrences
- us-west-2b: 136 occurrences
- us-east-2c: 74 occurrences
InsufficientInstanceCapacity errors: 6566
- Spot instances: 6360
- On-demand instances: 206
Affected zones:
- ap-northeast-2a: 600 occurrences
- ap-northeast-2d: 580 occurrences
- us-east-2c: 567 occurrences
- eu-west-3c: 560 occurrences
- eu-north-1a: 535 occurrences
- ap-southeast-2c: 494 occurrences
- eu-north-1b: 455 occurrences
- ap-southeast-2a: 435 occurrences
- eu-west-3b: 427 occurrences
- us-east-1b: 412 occurrences
- ap-northeast-1a: 299 occurrences
- us-west-2b: 281 occurrences
- ap-northeast-1c: 256 occurrences
- us-west-2: 206 occurrences
- us-east-2b: 189 occurrences
- eu-west-2a: 84 occurrences
- ap-south-1b: 84 occurrences
- us-east-2a: 58 occurrences
- us-west-2d: 16 occurrences
- ca-central-1a: 12 occurrences
- eu-west-2b: 6 occurrences
- eu-central-1b: 3 occurrences
- ca-central-1b: 3 occurrences
- us-east-1a: 2 occurrences
- ap-south-1a: 2 occurrences
VcpuLimitExceeded errors: 0
- Spot instances: 0
- On-demand instances: 0
Other errors: 0
- Spot instances: 0
- On-demand instances: 0
The full event and error log can be found here.
To summarize the results:
| Metric | Single Region | Multi-Region | Improvement |
|---|---|---|---|
| GPUs Actually Obtained | 47 | 406 | 9x more resources |
| Processing Time | 20 hours | 2.3 hours | 9x faster |
| Cost | $710 | $277 | 61% cheaper |
| Regions | 1 | 12 | Greater availability |
| Job Recovery | Manual | Automatic | Improved reliability |
Benefits of going beyond a single region:
- More resources: By leveraging resources across regions, we accessed 406 GPUs simultaneously with a token throughput of 364.4k tokens/s!
- Faster: Processing time for the same dataset dropped from 20 hours to just 2.3 hours (9x faster than the single-region approach).
- Cheaper: Spot instance usage reduced costs from $710 to $277 (61% savings).
Running large-scale jobs with SkyPilot
We use SkyPilot to distribute embedding workloads across multiple regions and utilize both on-demand and spot instances, significantly increasing available resources while reducing costs.
Running and scaling the batch inference jobs for embedding generation is easy with SkyPilot.
First, we define a SkyPilot YAML configuration for our embedding job:
```yaml
# compute_text_embeddings.yaml
workdir: .

resources:
  cpus: 4
  accelerators: L4
  cloud: aws
  # Notably, we don't specify a region to allow multi-region deployment
  any_of:
    - use_spot: true
    - use_spot: false

envs:  # Will be overridden by batch launcher script
  START_IDX: 0
  END_IDX: 10000

run: python compute_text_embeddings.py
```
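The any_of block tells SkyPilot that either spot or on-demand L4 instances are acceptable, so jobs can use cheap spot capacity wherever it exists and fall back to on-demand where it does not; this is how the run above ended up with 396 spot and 10 on-demand instances.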
To distribute the workload evenly, we use a batch launcher script that creates hundreds of individual worker jobs across multiple regions:
```python
# batch_compute_embeddings.py
import sky

def main():
    # ... initialization code ...
    task = sky.Task.from_yaml('compute_text_embeddings.yaml')

    # Launch jobs for each partition
    for job_rank in range(args.num_jobs):
        # Update environment variables based on partition method
        env_vars = {
            'START_IDX': str(args.start_idx),
            'END_IDX': str(args.end_idx),
        }
        task_copy = task.update_envs(env_vars)
        print(f"Launching job {job_rank}...")
        sky.jobs.launch(
            task_copy,
        )
```
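Each sky.jobs.launch call submits a managed job: SkyPilot finds available capacity across regions for it and, if a spot instance is preempted, automatically recovers the job on new capacity, which is where the automatic job recovery in the table above comes from.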
The complete code for running the large-scale embedding generation is open-sourced here.
Tip: As the Hugging Face dataset only supports iterator access, not indexing, we implement stride partitioning to ensure each job processes documents spread throughout the dataset (job 0 processes items {0, N, 2N, ...}; job 1 processes items {1, N+1, 2N+1, ...}).
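A minimal sketch of that stride logic (a hypothetical helper for illustration, not the repo's exact code) looks like this:

```python
# Hypothetical helper illustrating stride partitioning over a streaming dataset
# that only supports iteration (e.g., a Hugging Face dataset in streaming mode).
from itertools import islice

def iter_partition(dataset_iter, job_rank: int, num_jobs: int):
    """Yield every `num_jobs`-th record, starting at offset `job_rank`.

    With N jobs, job 0 sees items {0, N, 2N, ...}, job 1 sees {1, N+1, 2N+1, ...},
    so each job's share is spread evenly across the whole dataset.
    """
    yield from islice(dataset_iter, job_rank, None, num_jobs)
```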
Job queue
All jobs can be shown either in the UI (sky jobs dashboard) or via the CLI.
Embedding generation progress dashboard
We also built a custom dashboard to visualize the progress of the embedding generation:
What about egress cost?
When using a multi-region approach, the egress costs charged by cloud providers for data transfer between regions are a natural concern. However, for embedding generation workloads, these costs remain manageable, or even negligible, for several reasons:
- Embeddings are compact representations (typically 768-3072 dimensions per vector)
- The data flow is primarily one-directional (from compute nodes to storage)
- Our centralized S3 bucket approach minimizes cross-region transfers
In our experiment, the Amazon Reviews dataset (approximately 30 million reviews) generated around 120GB of embedding data, and the egress costs were less than $50 across all regions. The hundreds of dollars saved on compute far outweighed these transfer costs.

Conclusion
To run large-scale batch inference jobs, SkyPilot offers a simple and scalable way to distribute the workload across multiple regions (and even cloud providers). Utilizing the “forgotten regions” and spot instances increased the available resources by 9x and reduced costs by 61%:
- 9x Speedup: Processing that would have taken over 20 hours completed in about 2.3 hours
- 61% Cost Reduction: Total expenditure dropped from $710 to $277
- Scale Beyond Limits: Bypassed single-region availability constraints to access 406 GPUs simultaneously across 12 regions (compared to only ~40 in a single region)
- Enhanced Reliability: Automatic recovery from spot instance preemptions maintained processing continuity
Next steps
- Code for running embedding generation on Amazon reviews dataset: here
- Running many jobs with SkyPilot
- Deploy SkyPilot for your team: SkyPilot Team Deployment