Embedding generation is at the heart of modern AI applications, from recommendation systems to retrieval-augmented generation (RAG). However, running batch inference for embedding generation on million- or trillion-scale datasets is not trivial.
Even after spending days developing a well-tuned batch inference script, getting hundreds or thousands of GPUs and running jobs on them in parallel is still a huge pain.

In this post, we share how we used SkyPilot to run large-scale embedding generation on ~30M records with a state-of-the-art LLM embedding model, speeding up the workload by 9x compared to a typical autoscaling Kubernetes setup (AWS EKS).
We open-sourced the code for running large-scale embedding generation here.
TL;DR: “forgotten” regions bring 9x speedup
By utilizing more than one region, we sped up embedding generation by 9x, reducing the time from 20 hours to 2.3 hours. As a bonus, we also cut costs by 61% along the way.
The following map shows the distribution of the jobs across the globe, spanning Europe, North America, Asia, Australia, and more:
We achieved this performance boost with SkyPilot, which automatically searches for resources across regions and manages the lifecycle of hundreds of jobs.
A key insight: popular regions, like US-East-1, have limited capacity, but “forgotten regions” across the globe have plenty of available capacity. By spreading workloads across these regions, you can access significantly more compute.
Let’s now dive into the details of the workloads.
Embeddings: Neural foundation of AI search
Embeddings form the cornerstone of contemporary AI search and retrieval systems. Unlike traditional keyword-based approaches (Google Search, etc.), embeddings enable AI to understand conceptual similarities between phrases like “breaking my lease” and “terminate rental agreement”, even when they share no common keywords.
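As a toy illustration of this idea (not part of our pipeline), the sketch below embeds the two phrases with a small off-the-shelf model, an arbitrary stand-in chosen for brevity, and compares them with cosine similarity:

```python
# Toy illustration: semantically related phrases land close together in embedding
# space even with zero keyword overlap. The small model is an arbitrary stand-in.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["breaking my lease", "terminate rental agreement"])
print(util.cos_sim(emb[0], emb[1]).item())  # prints a high similarity score
```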
How Tech Giants Use Embeddings
Generating embeddings for large datasets demands enormous computational resources. Consider these examples: a typical Nvidia L4 GPU running a 7-billion-parameter embedding model processes approximately 2,000 text tokens per second. Computing embeddings for a billion-item collection would require over 5.8 days on a single machine. A larger dataset, such as the falcon-refinedweb dataset with 600 billion text tokens, would take more than 9.5 years to process!
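The arithmetic behind these estimates is simple; here is a quick back-of-envelope sketch (the 2,000 tokens/s figure is the single-L4 throughput quoted above, and the one-billion-token total used in the first estimate is our assumption for illustration):

```python
# Back-of-envelope single-GPU estimates based on the throughput quoted above.
TOKENS_PER_SECOND = 2_000  # observed L4 throughput for a 7B-parameter embedding model

def processing_days(total_tokens: float) -> float:
    """Days needed to embed `total_tokens` on a single GPU."""
    return total_tokens / TOKENS_PER_SECOND / 86_400  # 86,400 seconds per day

print(f"{processing_days(1e9):.1f} days")           # ~5.8 days for ~1B tokens
print(f"{processing_days(600e9) / 365:.1f} years")   # ~9.5 years for falcon-refinedweb
```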
Real-world case study: Generating embeddings for Amazon reviews
We developed a large-scale embedding generation pipeline for a real-world case study on Amazon reviews, and open-sourced the code here.
We use the book partition of the Amazon Reviews 2023 dataset, containing ~30M Amazon book reviews, and generate embeddings for the reviews with Alibaba-NLP/gte-Qwen2-7B-instruct, a state-of-the-art specialized embedding LLM and one of the top models on the MTEB leaderboard.
We use a single L4 GPU on AWS (a g6.xlarge instance) to handle each embedding generation job.
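For reference, here is a minimal sketch of what a single worker's embedding step could look like. The actual compute_text_embeddings.py is in the linked repo; the sentence-transformers interface, half-precision loading, and batch size below are our own assumptions:

```python
# Minimal per-worker sketch (assumptions: sentence-transformers interface, fp16
# weights, batch size). The real compute_text_embeddings.py in the repo also
# handles data loading, checkpointing, and uploading results to S3.
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-7B-instruct",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.float16},  # 7B weights in fp16 fit a 24 GB L4
)

def embed_batch(texts: list[str]):
    # Normalized embeddings make cosine-similarity search a simple dot product.
    return model.encode(texts, batch_size=8, normalize_embeddings=True)
```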
Existing approach: Hitting the availability wall
With conventional approaches, embedding generation is confined to a single region, which limits how far it can scale:
- Kubernetes clusters (EKS, GKE, etc.): Kubernetes clusters have to be deployed in a single region, limited by regional GPU availability even with autoscaling.
- Cloud Batch Services (AWS Batch, GCP Batch, etc.): Cloud batch services are also limited by the number of GPUs you can provision in a single region.
- Slurm: Slurm is a popular job scheduler for HPC, but it is not designed for the cloud and does not scale easily, especially beyond a single region.
- Manual GPU provisioning: Manually provisioning 100+ GPUs and managing them is a nightmare and not scalable.

Using an autoscaling cluster in the AWS region ap-northeast-2, we could get at most 47 L4 GPUs (g6.xlarge instances) after 2 hours due to limited availability.
While constantly retrying to provision more GPUs, we hit the InsufficientInstanceCapacity error 2,955 times.
======= Job Status Summary =======
Total jobs: 128
Successfully provisioned or in progress: 47
Failed to provision: 81
======= Error Type Summary =======
InsufficientInstanceCapacity errors: 2955
- On-demand instances: 2955
Affected zones:
- ap-northeast-2: 2955 occurrences
Challenges for using conventional cloud services:
- Availability: Our quota in ap-northeast-2 allowed for 128 GPUs, but we could only get 47 due to availability constraints.
- Long processing time: Processing the full Amazon reviews dataset would take 20+ hours.
- High cost: On-demand pricing resulted in a high projected cost of ~$710.
Going beyond a single region
To mitigate the availability constraints, we use SkyPilot to distribute the workload across multiple regions, and it pays off with 9x more resources.
The following figure shows the throughput comparison between single- and multi-region approaches for Amazon reviews dataset embedding generation:

Note that we cut off the figure at 2 hours due to budget constraints, but we can already see a 9x speedup compared to the single-region approach (an autoscaling EKS cluster).
Going across multiple regions, we still hit availability errors in various regions, but SkyPilot automatically searches regions across the globe and retries provisioning, ultimately getting us 406 L4 GPUs with a token throughput of 797.7k tokens/s!
======= Job Status Summary =======
Total jobs: 406
Successfully provisioned or in progress: 406
Failed to provision: 0
Spot instances: 396
- Successfully provisioned or in progress: 396
- Failed to provision: 0
On-demand instances: 10
- Successfully provisioned or in progress: 10
- Failed to provision: 0
The full error type summary:
======= Error Type Summary =======
MaxSpotInstanceCountExceeded errors: 6673
- Spot instances: 6673
- On-demand instances: 0
Affected zones:
- ca-central-1b: 607 occurrences
- ca-central-1a: 606 occurrences
- ap-south-1a: 564 occurrences
- eu-west-2b: 520 occurrences
- eu-west-2a: 509 occurrences
- us-east-1d: 450 occurrences
- us-east-2b: 427 occurrences
- ap-south-1b: 425 occurrences
- us-east-1a: 414 occurrences
- eu-central-1b: 390 occurrences
- eu-central-1a: 374 occurrences
- us-east-2a: 334 occurrences
- us-west-2c: 319 occurrences
- us-west-2d: 313 occurrences
- us-east-1b: 211 occurrences
- us-west-2b: 136 occurrences
- us-east-2c: 74 occurrences
InsufficientInstanceCapacity errors: 6566
- Spot instances: 6360
- On-demand instances: 206
Affected zones:
- ap-northeast-2a: 600 occurrences
- ap-northeast-2d: 580 occurrences
- us-east-2c: 567 occurrences
- eu-west-3c: 560 occurrences
- eu-north-1a: 535 occurrences
- ap-southeast-2c: 494 occurrences
- eu-north-1b: 455 occurrences
- ap-southeast-2a: 435 occurrences
- eu-west-3b: 427 occurrences
- us-east-1b: 412 occurrences
- ap-northeast-1a: 299 occurrences
- us-west-2b: 281 occurrences
- ap-northeast-1c: 256 occurrences
- us-west-2: 206 occurrences
- us-east-2b: 189 occurrences
- eu-west-2a: 84 occurrences
- ap-south-1b: 84 occurrences
- us-east-2a: 58 occurrences
- us-west-2d: 16 occurrences
- ca-central-1a: 12 occurrences
- eu-west-2b: 6 occurrences
- eu-central-1b: 3 occurrences
- ca-central-1b: 3 occurrences
- us-east-1a: 2 occurrences
- ap-south-1a: 2 occurrences
VcpuLimitExceeded errors: 0
- Spot instances: 0
- On-demand instances: 0
Other errors: 0
- Spot instances: 0
- On-demand instances: 0
The full event and error log can be found here.
To summarize the results:
| Metric | Single Region | Multi-Region | Improvement |
|---|---|---|---|
| GPUs Actually Obtained | 47 | 406 | 9x more resources |
| Processing Time | 20 hours | 2.3 hours | 9x faster |
| Cost | $710 | $277 | 61% cheaper |
| Regions | 1 | 12 | Greater availability |
| Job Recovery | Manual | Automatic | Improved reliability |
Benefits of going beyond a single region:
- More resources: By leveraging resources across regions, we accessed 406 GPUs simultaneously with a token throughput of 364.4k tokens/s!
- Faster: Processing time for the same dataset dropped from 20 hours to just 2.3 hours (9x faster than the single-region approach).
- Cheaper: Spot instance usage reduced costs from $710 to $277 (61% savings).
Running large-scale jobs with SkyPilot
We use SkyPilot to distribute embedding workloads across multiple regions and utilize both on-demand and spot instances, significantly increasing available resources while reducing costs.
Running and scaling the batch inference jobs for embedding generation is easy with SkyPilot.
First, we define a SkyPilot YAML configuration for our embedding job:
```yaml
# compute_text_embeddings.yaml
workdir: .

resources:
  cpus: 4
  accelerators: L4
  cloud: aws
  # Notably, we don't specify a region to allow multi-region deployment
  any_of:
    - use_spot: true
    - use_spot: false

envs:  # Will be overridden by batch launcher script
  START_IDX: 0
  END_IDX: 10000

run: python compute_text_embeddings.py
```
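The any_of block tells SkyPilot that either spot or on-demand L4 instances are acceptable, so jobs can use cheap spot capacity wherever it exists and fall back to on-demand where it does not; this is how the run above ended up with 396 spot and 10 on-demand instances.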
To distribute the workload evenly, we use a batch launcher script that creates hundreds of individual worker jobs across multiple regions:
```python
# batch_compute_embeddings.py
import sky

def main():
    # ... initialization code ...
    task = sky.Task.from_yaml('compute_text_embeddings.yaml')

    # Launch jobs for each partition
    for job_rank in range(args.num_jobs):
        # Update environment variables based on partition method
        env_vars = {
            'START_IDX': str(args.start_idx),
            'END_IDX': str(args.end_idx),
        }
        task_copy = task.update_envs(env_vars)
        print(f"Launching job {job_rank}...")
        sky.jobs.launch(
            task_copy,
        )
```
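Each sky.jobs.launch call submits a managed job: SkyPilot finds available capacity across regions for it and, if a spot instance is preempted, automatically recovers the job on new capacity, which is where the automatic job recovery in the table above comes from.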
The complete code for running the large-scale embedding generation is open-sourced here.
Tip: As the Hugging Face dataset only supports iterator access, not indexing, we implement stride partitioning to ensure each job processes documents spread throughout the dataset (job 0 processes items {0, N, 2N, ...}; job 1 processes items {1, N+1, 2N+1, ...}).
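A minimal sketch of that stride logic (a hypothetical helper for illustration, not the repo's exact code) looks like this:

```python
# Hypothetical helper illustrating stride partitioning over a streaming dataset
# that only supports iteration (e.g., a Hugging Face dataset in streaming mode).
from itertools import islice

def iter_partition(dataset_iter, job_rank: int, num_jobs: int):
    """Yield every `num_jobs`-th record, starting at offset `job_rank`.

    With N jobs, job 0 sees items {0, N, 2N, ...}, job 1 sees {1, N+1, 2N+1, ...},
    so each job's share is spread evenly across the whole dataset.
    """
    yield from islice(dataset_iter, job_rank, None, num_jobs)
```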
Job queue
All jobs can be shown either in the UI (sky jobs dashboard) or via the CLI.
Embedding generation progress dashboard
We also built a custom dashboard to visualize the progress of the embedding generation:
What about egress cost?
When using a multi-region approach, the egress costs charged by cloud providers for data transfer between regions are a natural concern. However, for embedding generation workloads, these costs remain manageable, or even negligible, for several reasons:
- Embeddings are compact representations (typically 768-3072 dimensions per vector)
- The data flow is primarily one-directional (from compute nodes to storage)
- Our centralized S3 bucket approach minimizes cross-region transfers
In our experiment, the Amazon Reviews dataset (approximately 30 million reviews) generated around 120GB of embedding data, and the egress costs were less than $50 across all regions. The hundreds of dollars saved on compute far outweighed these transfer costs.

Conclusion
To run large-scale batch inference jobs, SkyPilot offers a simple and scalable way to distribute the workload across multiple regions (and even cloud providers). Utilizing the “forgotten regions” and spot instances increased the available resources by 9x and reduced costs by 61%:
- 9x Speedup: Processing that would have taken over 20 hours completed in about 2.3 hours
- 61% Cost Reduction: Total expenditure dropped from $710 to $277
- Scale Beyond Limits: Bypassed single-region availability constraints to access 406 GPUs simultaneously across 12 regions (compared to only ~40 in a single region)
- Enhanced Reliability: Automatic recovery from spot instance preemptions maintained processing continuity
Next steps
- Code for running embedding generation on Amazon reviews dataset: here
- Running many jobs with SkyPilot
- Deploy SkyPilot for your team: SkyPilot Team Deployment