AWS Batch works well for traditional enterprise batch processing (see their case studies 1 and 2). But AI workloads have different requirements - they’re more interactive, need flexible GPU access, and benefit from simpler iteration cycles.
In this post, we explore how you would use AWS Batch to run AI batch inference at scale, and explain why it does not fit well with modern AI workloads. We’ll also show an alternative approach using SkyPilot.
Why AWS Batch struggles with AI workloads
AWS Batch launched in 2016 for traditional enterprise batch processing - ETL jobs, financial risk calculations, and genomics pipelines. It handles these workloads well.
But AI workloads have fundamentally different characteristics that AWS Batch struggles to address:
- Long end-to-end setup time: Complex setup and infrastructure configuration delays getting started
- GPU scarcity: Jobs are confined to a single region, where high-end GPU capacity is often scarce
- Long job completion time: Insufficient GPU availability limits parallelism, which stretches out job completion times
- Developer experience: ML engineers need interactive tools (SSH, Jupyter, debuggers), rapid iteration without infrastructure overhead, and the ability to focus on models rather than operations
Let's look at an example of how AWS Batch treats AI batch inference as an infrastructure configuration problem:
- Configure compute environments with complex JSON
- Set up job definitions with container requirements
- Manage job queues with priority ordering
- Hope p3.8xlarge instances are available in us-east-1
- Manually reconfigure for us-west-2 when they’re not
Example: Using AWS Batch for AI embedding generation
Let’s walk through a concrete example: large-scale embedding generation using the aws_batch_demo_embeddings project. This will show how AWS Batch’s infrastructure-first design creates friction for ML workloads.
The AWS Batch approach: Six required components
Before running a single job, AWS Batch requires configuring six interconnected components:
The official “getting started” flow requires 6 steps (the complete set of commands can be found here).
1. Create IAM roles (4 different roles: service role, instance role, spot fleet role, and job execution role). For example, creating the Batch service role:
# Service role for AWS Batch
aws iam create-role --role-name AWSBatchServiceRole \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"batch.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name AWSBatchServiceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole
Commands for the remaining 3 IAM roles (~15 lines):
# ECS instance role
aws iam create-role --role-name ecsInstanceRole \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name ecsInstanceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

# Spot fleet role
aws iam create-role --role-name aws-batch-spot-fleet-role \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"spotfleet.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

# Job execution role with EFS permissions
aws iam create-role --role-name BatchJobRole \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ecs-tasks.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam put-role-policy --role-name BatchJobRole --policy-name EFSAccess \
  --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["elasticfilesystem:ClientMount","elasticfilesystem:ClientWrite"],"Resource":"*"}]}'
2. Set up storage for the job output (filesystem + security groups + mount targets in every subnet). Creating the EFS alone takes 4 steps and more than 20 lines of commands:
# Create EFS filesystem
aws efs create-file-system --creation-token embeddings-efs \
  --tags "Key=Name,Value=embeddings-efs" --region $REGION
export EFS_ID=$(aws efs describe-file-systems \
  --query 'FileSystems[?Name==`embeddings-efs`].FileSystemId' \
  --output text --region $REGION)
Remaining EFS setup commands (~20 lines):
# Create security group for NFS
aws ec2 create-security-group --group-name batch-embeddings-sg \
  --description "Batch embeddings SG" --vpc-id $VPC_ID --region $REGION
SG_ID=$(aws ec2 describe-security-groups \
  --query 'SecurityGroups[?GroupName==`batch-embeddings-sg`].GroupId' \
  --output text --region $REGION)
aws ec2 authorize-security-group-ingress --group-id $SG_ID \
  --protocol tcp --port 2049 --source-group $SG_ID --region $REGION

# Create mount target in EVERY subnet (critical for job scheduling)
SUBNET_IDS=($(aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID" \
  --query 'Subnets[*].SubnetId' --output text --region $REGION))
for subnet in "${SUBNET_IDS[@]}"; do
  aws efs create-mount-target --file-system-id $EFS_ID \
    --subnet-id $subnet --security-groups $SG_ID --region $REGION
done
3. Configure the compute environment (instance types, scaling limits, spot configuration)
Compute environment configuration:
{ "type": "EC2", "minvCpus": 0, "maxvCpus": 256, "desiredvCpus": 0, "instanceTypes": ["g4dn.xlarge", "g4dn.2xlarge"], "subnets": ["subnet-xxx", "subnet-yyy", "subnet-zzz"], "securityGroupIds": ["sg-xxx"], "instanceRole": "arn:aws:iam::XXX:instance-profile/ecsInstanceRole", "tags": {"Name": "embeddings-batch-spot"}, "bidPercentage": 80, "spotIamFleetRole": "arn:aws:iam::XXX:role/aws-batch-spot-fleet-role" }
aws batch create-compute-environment \
  --compute-environment-name embeddings-compute-env-spot \
  --type MANAGED --state ENABLED \
  --service-role arn:aws:iam::$ACCOUNT_ID:role/AWSBatchServiceRole \
  --compute-resources file://compute-env.json \
  --region $REGION
4. Create a job queue (priority, compute environment ordering)
aws batch create-job-queue --job-queue-name embeddings-queue-mixed \
  --state ENABLED --priority 1 \
  --compute-environment-order order=1,computeEnvironment=embeddings-compute-env-spot \
  --region $REGION
5. Create a job definition with all the infrastructure-specific configs (container config, resource requirements, volume mounts)
Job definition configuration:
{ "jobDefinitionName": "embeddings-processor-job", "type": "container", "containerProperties": { "image": "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/embeddings-processor:latest", "vcpus": 4, "memory": 15000, "jobRoleArn": "arn:aws:iam::${ACCOUNT_ID}:role/BatchJobRole", "volumes": [{ "name": "efs-volume", "efsVolumeConfiguration": { "fileSystemId": "${EFS_ID}", "rootDirectory": "/" } }], "mountPoints": [{ "sourceVolume": "efs-volume", "containerPath": "/mnt/efs", "readOnly": false }] } }
6. Build the job container image and submit jobs (Docker build + push for every code change, then submit via the CLI). Every code change can take 5-15 minutes, which makes iteration slow.
# Build Docker image (typically 2-10 minutes)
docker build --platform linux/amd64 -t embeddings-processor .

# Tag and push to ECR (typically 3-5 minutes for multi-GB images)
docker tag embeddings-processor:latest \
  $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/embeddings-processor:latest
aws ecr get-login-password --region $REGION | \
  docker login --username AWS --password-stdin \
  $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker push \
  $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/embeddings-processor:latest

# Submit jobs with AWS Batch CLI
aws batch submit-job \
  --job-name "embeddings-${START}-${END}" \
  --job-queue "embeddings-queue-mixed" \
  --job-definition "embeddings-processor-job" \
  --container-overrides '{
    "command": [
      "python", "embeddings_processor.py",
      "--input-file", "data/dblp-v10.csv",
      "--output-file", "'$OUTPUT_FILE'",
      "--text-column", "'$TEXT_COLUMN'",
      "--model-name", "'$MODEL_NAME'",
      "--batch-size", "'$BATCH_SIZE'",
      "--start-idx", "'$START'",
      "--end-idx", "'$END'"
    ]
  }' \
  --region ${REGION:-us-east-1}
Complete job submission loop script:
# AWS Batch: launch_jobs.sh - Complex job submission loop
for ((i=0; i<NUM_JOBS; i++)); do
  START=$((i * CHUNK_SIZE))
  END=$(((i + 1) * CHUNK_SIZE))
  OUTPUT_FILE="results/embeddings_${START}_${END}.parquet"

  aws batch submit-job \
    --job-name "embeddings-${START}-${END}" \
    --job-queue "embeddings-queue-mixed" \
    --job-definition "embeddings-processor-job" \
    --container-overrides '{
      "command": [
        "python", "embeddings_processor.py",
        "--input-file", "data/dblp-v10.csv",
        "--output-file", "'$OUTPUT_FILE'",
        "--text-column", "'$TEXT_COLUMN'",
        "--model-name", "'$MODEL_NAME'",
        "--batch-size", "'$BATCH_SIZE'",
        "--start-idx", "'$START'",
        "--end-idx", "'$END'"
      ]
    }' \
    --region ${REGION:-us-east-1}
done
Monitor the jobs
After submission, you monitor job progress in the AWS Console:
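You can also check the queue programmatically. Below is a minimal boto3 sketch (an illustration, not part of the original walkthrough) that assumes the embeddings-queue-mixed queue and us-east-1 region used above; it counts jobs per state and surfaces any that are stuck in RUNNABLE:

# check_queue.py - quick programmatic view of the Batch queue (illustrative sketch)
import boto3

batch = boto3.client("batch", region_name="us-east-1")  # assumes the region used above
QUEUE = "embeddings-queue-mixed"

# Count jobs per state
for status in ["RUNNABLE", "STARTING", "RUNNING", "SUCCEEDED", "FAILED"]:
    page = batch.list_jobs(jobQueue=QUEUE, jobStatus=status)
    print(f"{status}: {len(page['jobSummaryList'])} job(s)")

# Jobs sitting in RUNNABLE usually signal capacity or configuration problems
runnable = batch.list_jobs(jobQueue=QUEUE, jobStatus="RUNNABLE")["jobSummaryList"]
if runnable:
    details = batch.describe_jobs(jobs=[j["jobId"] for j in runnable[:100]])  # API caps at 100 IDs
    for job in details["jobs"]:
        print(job["jobName"], job["status"], job.get("statusReason", ""))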
Key challenges with AWS Batch for ML workloads
After going through this setup, several fundamental issues become clear:
- Regional limitation: AWS Batch is fundamentally a regional scheduler. When us-east-1 runs out of GPUs, you must manually reconfigure for us-west-2. No multi-region or multi-cloud capability.
- No GPU support on Fargate: GPU resources aren’t supported on Fargate jobs, forcing you to manage EC2 instances
- No interactive debugging: ECS Exec isn’t supported by AWS Batch, removing SSH access for ML debugging
- Jobs stuck in RUNNABLE: Batch jobs frequently get stuck due to capacity or configuration issues - serious enough that AWS maintains a dedicated troubleshooting guide
- Limited observability: AWS provides a reference solution requiring CloudFormation/Lambda/CloudWatch just to monitor job delays
- EFS complexity: Shared data mounts require Amazon EFS, which must be pre-mounted on the AMI with mount targets in every subnet
SkyPilot: a job-centric alternative - focus on the AI workload, not the infra
Now let’s see how SkyPilot handles the same embedding generation workload. Instead of configuring infrastructure components, you define the task itself:
The single SkyPilot step, with none of the infrastructure configuration:
name: embeddings-job

resources:
  accelerators: {T4, L4}  # Try either T4 or L4, whichever is available
  any_of:
    - use_spot: true
    - use_spot: false  # Fallback to on-demand if spot unavailable

envs:
  JOB_START_IDX: "0"
  JOB_END_IDX: "10000"

file_mounts:
  /my_data:
    source: s3://my-bucket-for-skypilot-jobs  # Direct S3 access, no EFS needed

workdir: .

setup: |
  uv pip install --system -r requirements.txt
  sudo apt install unzip -y
  # Download dataset
  mkdir -p data
  curl -L -o data/research-papers-dataset.zip https://www.kaggle.com/api/v1/datasets/download/nechbamohammed/research-papers-dataset
  unzip -j data/research-papers-dataset.zip -d data/

run: |
  python embeddings_processor.py \
    --input-file data/dblp-v10.csv \
    --output-file results/embeddings_${JOB_START_IDX}_${JOB_END_IDX}.parquet \
    --text-column abstract \
    --model-name all-MiniLM-L6-v2 \
    --batch-size 1024 \
    --start-idx ${JOB_START_IDX} \
    --end-idx ${JOB_END_IDX}
What does SkyPilot eliminate for AI engineers, compared to AWS Batch?
- No ECR image management
- No compute environment configuration
- No job queues
- No IAM role creation
- No VPC/subnet configuration
- No EFS mount targets
Launching jobs is equally simple with SkyPilot:
CLI
# Launch a single job
sky jobs launch skypilot_embeddings.yaml \
--env JOB_START_IDX=0 \
--env JOB_END_IDX=10000
# Or launch multiple parallel jobs with a bash loop
for i in {0..9}; do
START=$((i * 1000))
END=$(((i + 1) * 1000))
sky jobs launch skypilot_embeddings.yaml \
--env JOB_START_IDX=$START \
--env JOB_END_IDX=$END
done
Python SDK
# skypilot_launcher.py - Clean parallel job submission
import sky
task = sky.Task.from_yaml('skypilot_embeddings.yaml')
for i in range(num_jobs):
start, end = calculate_job_range(total_records, i, num_jobs)
task_envs = task.update_envs({
'JOB_START_IDX': str(start),
'JOB_END_IDX': str(end),
'BATCH_SIZE': str(args.batch_size),
# other params
})
sky.jobs.launch(task_envs, name=f'embeddings-{start}-{end}')
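The calculate_job_range helper is not shown in the snippet above; a minimal even-split implementation (an assumption for illustration, not necessarily the project's actual code) could look like this:

# Assumed implementation of the range-splitting helper used above.
def calculate_job_range(total_records: int, job_index: int, num_jobs: int):
    """Return the [start, end) slice of records handled by job `job_index`."""
    chunk_size = (total_records + num_jobs - 1) // num_jobs  # ceiling division
    start = job_index * chunk_size
    end = min(start + chunk_size, total_records)
    return start, end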
The SkyPilot dashboard shows the submitted jobs running across multiple clouds, automatically finding available GPUs wherever they exist:
After the jobs complete, the resulting parquet files are written directly to S3:
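To sanity-check an output shard, you can read it straight from the bucket with pandas (requires pyarrow and s3fs). The bucket name comes from the YAML above; the exact object key is an assumption based on the output-file naming used in the run command:

# inspect_results.py - load one output shard directly from S3 (illustrative sketch)
import pandas as pd

# Key is assumed from the embeddings_<start>_<end>.parquet naming used above.
df = pd.read_parquet("s3://my-bucket-for-skypilot-jobs/results/embeddings_0_10000.parquet")
print(df.shape)
print(df.columns.tolist())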
AWS Batch vs SkyPilot: Comparison
Now that we’ve seen both approaches for the same embedding generation task, let’s compare them across three key dimensions: GPU access, performance & cost, and operational overhead.
1. Multi-cloud GPU access: single region vs global scale
Capability | AWS Batch | SkyPilot |
---|---|---|
Region scope | Single region only | Multi-region, multi-cloud automatic |
GPU unavailable? | Manually reconfigure for new region, no cross-region job view, manual load balancing across regions | Automatically finds GPUs elsewhere and load balances |
Cloud support | AWS only | 17+ clouds (Hyperscalers and Neoclouds) |
Real-world impact:
SkyPilot’s multi-region approach unlocks a key insight: popular regions like us-east-1 have limited capacity, but “forgotten regions” across the globe have tons of available capacity. By spreading workloads across these regions, you access significantly more compute.
In this case study, SkyPilot achieved 9x speedup (20 hours -> 2.3 hours) and 61% cost reduction by automatically distributing 406 jobs across 12 AWS regions globally when a single region could only provide 47 GPUs:
(For multi-node distributed jobs, SkyPilot also auto-configures the environment variables - master address, node rank, world size.)
2. Developer productivity: Fast iteration and interactive debugging
ML engineers need to iterate quickly on code and debug interactively. AWS Batch’s Docker-centric workflow and lack of SSH access create significant friction.
Challenge | AWS Batch | SkyPilot |
---|---|---|
Code iteration | 5-15 min (Docker rebuild + ECR push + update) | Seconds (direct code sync) |
Impact of 10 changes/day | Hours of waiting | Minutes total |
Interactive debugging | No SSH access (ECS Exec unsupported) | Full SSH, Jupyter, VSCode support |
Code iteration cycle comparison:
The difference between several minutes and a few seconds compounds quickly: 10 code changes in a day = hours of waiting with AWS Batch vs minutes with SkyPilot.
3. End-to-end time and cost savings
Setup time and operational overhead:
Metric | AWS Batch | SkyPilot |
---|---|---|
Steps to run a job | 6 steps (IAM, EFS, compute env, job queue, job def, ECR) | 1 step (YAML file) |
Initial setup time¹ | 2-5 days | 1-3 hours |
Debugging | 20-40 minutes (logs across multiple services) | 2-5 minutes (sky jobs logs) |
Cost optimization opportunities:
Cost Factor | AWS Batch | SkyPilot | Savings |
---|---|---|---|
Cloud selection | AWS only, single region | Multi-cloud arbitrage. Neoclouds offer much more attractive GPU pricing. | Find cheapest GPUs globally |
GPU allocation | Fixed instance sizes (e.g., 8 GPUs minimum) | Request exactly what you need (e.g., 2 GPUs if cloud supports it) | Eliminate over-provisioning |
Storage options | Amazon EFS required for shared storage | Flexible: S3, GCS, EFS, third-party (e.g., Cloudflare R2) | Use cheapest/best storage option |
Real-world example: 11x cost reduction and improved productivity
Avataar, an AI company that creates product videos from 2D images, achieved dramatic improvements with SkyPilot:
- 11x cost reduction in infrastructure expenses
- Hourly GPU costs dropped from $6.88 (AWS) to $2.39 (RunPod) through multi-cloud arbitrage
- Saved tens of hours per week on infrastructure management, boosting team productivity
- Enabled seamless multi-cloud deployment across AWS, Azure, GCP, Nebius, etc.
- Scaled from 1 to 1000+ GPUs as needed without infrastructure reconfiguration
When to use which system
We’ve seen how AWS Batch requires extensive infrastructure configuration while SkyPilot focuses on the task itself. We’ve compared their capabilities across GPU access, cost & performance, and operational overhead. Now, when should you use each?
Use AWS Batch if:
- Running traditional batch processing (nightly reports, ETL, financial calculations)
- You have dedicated infrastructure teams managing Terraform/CloudFormation
Use SkyPilot if:
- You’re doing ML/AI work that needs GPUs
- You want interactive development (SSH into jobs, Jupyter/VSCode support)
- You use experiment tracking tools (MLflow, W&B)
- You want faster iteration cycles without Docker rebuild overhead
- You want to increase the productivity of your data scientists/AI engineers on the cloud
- You need automatic cost optimization and multi-cloud GPU access
- You want access to new clouds & Neoclouds (RunPod, Nebius, Lambda Labs, etc.) for future-proofing
Many organizations use both: AWS Batch for production ETL pipelines with deep AWS integration, SkyPilot for ML training and inference that needs GPU flexibility and multi-cloud access.
Even if you’re deep in the AWS ecosystem, it’s worth trying SkyPilot for AI workloads to gain the benefits of faster iteration, automatic multi-region GPU access, and significant cost savings - while still running on AWS infrastructure when desired.
Conclusion
We started by exploring why AWS Batch, designed for traditional enterprise batch processing, struggles with AI workloads. Through a concrete embedding generation example, we saw how AWS Batch requires configuring six interconnected components before running a single job.
In contrast, SkyPilot’s job-centric approach lets you define the task itself - no infrastructure configuration needed. Our comparison across three dimensions showed clear advantages:
- Multi-cloud GPU access: Automatic distribution across 17+ clouds and all regions vs single-region limitation
- Time and cost: 11x cost reduction, seconds for code changes vs minutes, 1-3 hour setup vs 2-5 days
- Operational overhead: Simple task definition vs complex state management and permissions
For ML teams fighting GPU availability and infrastructure complexity, SkyPilot provides significant cost reduction, automatic multi-cloud GPU access, and faster iteration cycles - while AWS Batch remains better suited for traditional enterprise batch processing with deep AWS integration requirements.
Based on the experience of an AI engineer with cloud experience but no prior hands-on experience with either AWS Batch or SkyPilot. ↩︎