AWS Batch works well for traditional enterprise batch processing (see their case studies 1 and 2). But AI workloads have different requirements - they’re more interactive, need flexible GPU access, and benefit from simpler iteration cycles.
In this post, we explore how you would use AWS Batch to run AI batch inference at scale, and explain why it does not fit well with modern AI workloads. We’ll also show an alternative approach using SkyPilot.
Why AWS Batch struggles with AI workloads
AWS Batch launched in 2016 for traditional enterprise batch processing - ETL jobs, financial risk calculations, and genomics pipelines. It handles these workloads well.
But AI workloads have fundamentally different characteristics that AWS Batch struggles to address:
- Long end-to-end setup time: Complex setup and infrastructure configuration delays getting started
- GPU scarcity: Jobs are confined to a single region, where high-end GPU capacity is often scarce
- Long job completion time: Insufficient GPU availability limits parallelism, which stretches out job completion times
- Developer experience: ML engineers need interactive tools (SSH, Jupyter, debuggers), rapid iteration without infrastructure overhead, and the ability to focus on models rather than operations
Let's look at an example of how AWS Batch treats AI batch inference as an infrastructure configuration problem:
- Configure compute environments with complex JSON
- Set up job definitions with container requirements
- Manage job queues with priority ordering
- Hope p3.8xlarge instances are available in us-east-1
- Manually reconfigure for us-west-2 when they’re not
Example: Using AWS Batch for AI embedding generation
Let’s walk through a concrete example: large-scale embedding generation using the aws_batch_demo_embeddings project. This will show how AWS Batch’s infrastructure-first design creates friction for ML workloads.
The AWS Batch approach: Six required components
Before running a single job, AWS Batch requires configuring six interconnected components:
The official “getting started” flow requires 6 steps (the complete set of commands can be found here).
1. Create IAM roles (4 different roles: service role, instance role, spot fleet role, and job execution role). For example, creating the Batch service role:
# Service role for AWS Batch
aws iam create-role --role-name AWSBatchServiceRole \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"batch.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name AWSBatchServiceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole
Commands for the remaining 3 IAM roles (~15 lines):
# ECS instance role
aws iam create-role --role-name ecsInstanceRole \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name ecsInstanceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

# Spot fleet role
aws iam create-role --role-name aws-batch-spot-fleet-role \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"spotfleet.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

# Job execution role with EFS permissions
aws iam create-role --role-name BatchJobRole \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ecs-tasks.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam put-role-policy --role-name BatchJobRole --policy-name EFSAccess \
  --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["elasticfilesystem:ClientMount","elasticfilesystem:ClientWrite"],"Resource":"*"}]}'
2. Set up storage for the job output (filesystem + security groups + mount targets in every subnet). Creating the EFS alone takes 4 steps and more than 20 lines of commands:
# Create EFS filesystem
aws efs create-file-system --creation-token embeddings-efs \
  --tags "Key=Name,Value=embeddings-efs" --region $REGION
export EFS_ID=$(aws efs describe-file-systems \
  --query 'FileSystems[?Name==`embeddings-efs`].FileSystemId' \
  --output text --region $REGION)
Remaining EFS setup commands (~20 lines):
# Create security group for NFS
aws ec2 create-security-group --group-name batch-embeddings-sg \
  --description "Batch embeddings SG" --vpc-id $VPC_ID --region $REGION
SG_ID=$(aws ec2 describe-security-groups \
  --query 'SecurityGroups[?GroupName==`batch-embeddings-sg`].GroupId' \
  --output text --region $REGION)
aws ec2 authorize-security-group-ingress --group-id $SG_ID \
  --protocol tcp --port 2049 --source-group $SG_ID --region $REGION

# Create mount target in EVERY subnet (critical for job scheduling)
SUBNET_IDS=($(aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID" \
  --query 'Subnets[*].SubnetId' --output text --region $REGION))
for subnet in "${SUBNET_IDS[@]}"; do
  aws efs create-mount-target --file-system-id $EFS_ID \
    --subnet-id $subnet --security-groups $SG_ID --region $REGION
done
3. Configure the compute environment (instance types, scaling limits, spot configuration)
Compute environment configuration:
{ "type": "EC2", "minvCpus": 0, "maxvCpus": 256, "desiredvCpus": 0, "instanceTypes": ["g4dn.xlarge", "g4dn.2xlarge"], "subnets": ["subnet-xxx", "subnet-yyy", "subnet-zzz"], "securityGroupIds": ["sg-xxx"], "instanceRole": "arn:aws:iam::XXX:instance-profile/ecsInstanceRole", "tags": {"Name": "embeddings-batch-spot"}, "bidPercentage": 80, "spotIamFleetRole": "arn:aws:iam::XXX:role/aws-batch-spot-fleet-role" }
aws batch create-compute-environment \
  --compute-environment-name embeddings-compute-env-spot \
  --type MANAGED --state ENABLED \
  --service-role arn:aws:iam::$ACCOUNT_ID:role/AWSBatchServiceRole \
  --compute-resources file://compute-env.json \
  --region $REGION
4. Create a job queue (priority, compute environment ordering)
aws batch create-job-queue --job-queue-name embeddings-queue-mixed \
  --state ENABLED --priority 1 \
  --compute-environment-order order=1,computeEnvironment=embeddings-compute-env-spot \
  --region $REGION
5. Create a job definition with all the infrastructure-specific configs (container config, resource requirements, volume mounts)
Job definition configuration:
{ "jobDefinitionName": "embeddings-processor-job", "type": "container", "containerProperties": { "image": "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/embeddings-processor:latest", "vcpus": 4, "memory": 15000, "jobRoleArn": "arn:aws:iam::${ACCOUNT_ID}:role/BatchJobRole", "volumes": [{ "name": "efs-volume", "efsVolumeConfiguration": { "fileSystemId": "${EFS_ID}", "rootDirectory": "/" } }], "mountPoints": [{ "sourceVolume": "efs-volume", "containerPath": "/mnt/efs", "readOnly": false }] } }
6. Build the job container image and submit jobs (Docker build + push for every code change, then submit via the CLI). Every code change can take 5-15 minutes, which makes iteration slow.
# Build Docker image (typically 2-10 minutes)
docker build --platform linux/amd64 -t embeddings-processor .

# Tag and push to ECR (typically 3-5 minutes for multi-GB images)
docker tag embeddings-processor:latest \
  $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/embeddings-processor:latest
aws ecr get-login-password --region $REGION | \
  docker login --username AWS --password-stdin \
  $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker push \
  $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/embeddings-processor:latest

# Submit jobs with AWS Batch CLI
aws batch submit-job \
  --job-name "embeddings-${START}-${END}" \
  --job-queue "embeddings-queue-mixed" \
  --job-definition "embeddings-processor-job" \
  --container-overrides '{
    "command": [
      "python", "embeddings_processor.py",
      "--input-file", "data/dblp-v10.csv",
      "--output-file", "'$OUTPUT_FILE'",
      "--text-column", "'$TEXT_COLUMN'",
      "--model-name", "'$MODEL_NAME'",
      "--batch-size", "'$BATCH_SIZE'",
      "--start-idx", "'$START'",
      "--end-idx", "'$END'"
    ]
  }' \
  --region ${REGION:-us-east-1}
Complete job submission loop script:
# AWS Batch: launch_jobs.sh - Complex job submission loop
for ((i=0; i<NUM_JOBS; i++)); do
  START=$((i * CHUNK_SIZE))
  END=$(((i + 1) * CHUNK_SIZE))
  OUTPUT_FILE="results/embeddings_${START}_${END}.parquet"

  aws batch submit-job \
    --job-name "embeddings-${START}-${END}" \
    --job-queue "embeddings-queue-mixed" \
    --job-definition "embeddings-processor-job" \
    --container-overrides '{
      "command": [
        "python", "embeddings_processor.py",
        "--input-file", "data/dblp-v10.csv",
        "--output-file", "'$OUTPUT_FILE'",
        "--text-column", "'$TEXT_COLUMN'",
        "--model-name", "'$MODEL_NAME'",
        "--batch-size", "'$BATCH_SIZE'",
        "--start-idx", "'$START'",
        "--end-idx", "'$END'"
      ]
    }' \
    --region ${REGION:-us-east-1}
done
Monitor the jobs
After submission, you monitor job progress in the AWS Console:
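You can also check the queue programmatically. Below is a minimal boto3 sketch (an illustration, not part of the original walkthrough) that assumes the embeddings-queue-mixed queue and us-east-1 region used above; it counts jobs per state and surfaces any that are stuck in RUNNABLE:

# check_queue.py - quick programmatic view of the Batch queue (illustrative sketch)
import boto3

batch = boto3.client("batch", region_name="us-east-1")  # assumes the region used above
QUEUE = "embeddings-queue-mixed"

# Count jobs per state
for status in ["RUNNABLE", "STARTING", "RUNNING", "SUCCEEDED", "FAILED"]:
    page = batch.list_jobs(jobQueue=QUEUE, jobStatus=status)
    print(f"{status}: {len(page['jobSummaryList'])} job(s)")

# Jobs sitting in RUNNABLE usually signal capacity or configuration problems
runnable = batch.list_jobs(jobQueue=QUEUE, jobStatus="RUNNABLE")["jobSummaryList"]
if runnable:
    details = batch.describe_jobs(jobs=[j["jobId"] for j in runnable[:100]])  # API caps at 100 IDs
    for job in details["jobs"]:
        print(job["jobName"], job["status"], job.get("statusReason", ""))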
Key challenges with AWS Batch for ML workloads
After going through this setup, several fundamental issues become clear:
- Regional limitation: AWS Batch is fundamentally a regional scheduler. When us-east-1 runs out of GPUs, you must manually reconfigure for us-west-2. No multi-region or multi-cloud capability.
- No GPU support on Fargate: GPU resources aren’t supported on Fargate jobs, forcing you to manage EC2 instances
- No interactive debugging: ECS Exec isn’t supported by AWS Batch, removing SSH access for ML debugging
- Jobs stuck in RUNNABLE: Batch jobs frequently get stuck due to capacity or configuration issues - serious enough that AWS maintains a dedicated troubleshooting guide
- Limited observability: AWS provides a reference solution requiring CloudFormation/Lambda/CloudWatch just to monitor job delays
- EFS complexity: Shared data mounts require Amazon EFS, which must be pre-mounted on the AMI with mount targets in every subnet
SkyPilot: a job-centric alternative - focus on the AI workload, not the infra
Now let’s see how SkyPilot handles the same embedding generation workload. Instead of configuring infrastructure components, you define the task itself:
The single SkyPilot step, with none of the infrastructure configuration:
name: embeddings-job

resources:
  accelerators: {T4, L4}  # Try either T4 or L4, whichever is available
  any_of:
    - use_spot: true
    - use_spot: false  # Fallback to on-demand if spot unavailable

envs:
  JOB_START_IDX: "0"
  JOB_END_IDX: "10000"

file_mounts:
  /my_data:
    source: s3://my-bucket-for-skypilot-jobs  # Direct S3 access, no EFS needed

workdir: .

setup: |
  uv pip install --system -r requirements.txt
  sudo apt install unzip -y
  # Download dataset
  mkdir -p data
  curl -L -o data/research-papers-dataset.zip https://www.kaggle.com/api/v1/datasets/download/nechbamohammed/research-papers-dataset
  unzip -j data/research-papers-dataset.zip -d data/

run: |
  python embeddings_processor.py \
    --input-file data/dblp-v10.csv \
    --output-file results/embeddings_${JOB_START_IDX}_${JOB_END_IDX}.parquet \
    --text-column abstract \
    --model-name all-MiniLM-L6-v2 \
    --batch-size 1024 \
    --start-idx ${JOB_START_IDX} \
    --end-idx ${JOB_END_IDX}
What does SkyPilot eliminate for AI engineers, compared to AWS Batch?
- No ECR image management
- No compute environment configuration
- No job queues
- No IAM role creation
- No VPC/subnet configuration
- No EFS mount targets
Launching jobs is equally simple with SkyPilot:
CLI
# Launch a single job
sky jobs launch skypilot_embeddings.yaml \
--env JOB_START_IDX=0 \
--env JOB_END_IDX=10000
# Or launch multiple parallel jobs with a bash loop
for i in {0..9}; do
START=$((i * 1000))
END=$(((i + 1) * 1000))
sky jobs launch skypilot_embeddings.yaml \
--env JOB_START_IDX=$START \
--env JOB_END_IDX=$END
done
Python SDK
# skypilot_launcher.py - Clean parallel job submission
import sky
task = sky.Task.from_yaml('skypilot_embeddings.yaml')
for i in range(num_jobs):
start, end = calculate_job_range(total_records, i, num_jobs)
task_envs = task.update_envs({
'JOB_START_IDX': str(start),
'JOB_END_IDX': str(end),
'BATCH_SIZE': str(args.batch_size),
# other params
})
sky.jobs.launch(task_envs, name=f'embeddings-{start}-{end}')
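The calculate_job_range helper is not shown in the snippet above; a minimal even-split implementation (an assumption for illustration, not necessarily the project's actual code) could look like this:

# Assumed implementation of the range-splitting helper used above.
def calculate_job_range(total_records: int, job_index: int, num_jobs: int):
    """Return the [start, end) slice of records handled by job `job_index`."""
    chunk_size = (total_records + num_jobs - 1) // num_jobs  # ceiling division
    start = job_index * chunk_size
    end = min(start + chunk_size, total_records)
    return start, end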
The SkyPilot dashboard shows the submitted jobs running across multiple clouds, automatically finding available GPUs wherever they exist:
After the jobs complete, the resulting parquet files are written directly to S3:
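To sanity-check an output shard, you can read it straight from the bucket with pandas (requires pyarrow and s3fs). The bucket name comes from the YAML above; the exact object key is an assumption based on the output-file naming used in the run command:

# inspect_results.py - load one output shard directly from S3 (illustrative sketch)
import pandas as pd

# Key is assumed from the embeddings_<start>_<end>.parquet naming used above.
df = pd.read_parquet("s3://my-bucket-for-skypilot-jobs/results/embeddings_0_10000.parquet")
print(df.shape)
print(df.columns.tolist())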
AWS Batch vs SkyPilot: Comparison
Now that we’ve seen both approaches for the same embedding generation task, let’s compare them across three key dimensions: GPU access, performance & cost, and operational overhead.
1. Multi-cloud GPU access: single region vs global scale
Capability | AWS Batch | SkyPilot |
---|---|---|
Region scope | Single region only | Multi-region, multi-cloud automatic |
GPU unavailable? | Manually reconfigure for new region, no cross-region job view, manual load balancing across regions | Automatically finds GPUs elsewhere and load balances |
Cloud support | AWS only | 17+ clouds (Hyperscalers and Neoclouds) |
Real-world impact:
SkyPilot’s multi-region approach unlocks a key insight: popular regions like us-east-1 have limited capacity, but “forgotten regions” across the globe have tons of available capacity. By spreading workloads across these regions, you access significantly more compute.
In this case study, SkyPilot achieved 9x speedup (20 hours -> 2.3 hours) and 61% cost reduction by automatically distributing 406 jobs across 12 AWS regions globally when a single region could only provide 47 GPUs:
(For multi-node distributed jobs, SkyPilot also auto-configures the environment variables - master address, node rank, world size.)
2. Developer productivity: Fast iteration and interactive debugging
ML engineers need to iterate quickly on code and debug interactively. AWS Batch’s Docker-centric workflow and lack of SSH access create significant friction.
Challenge | AWS Batch | SkyPilot |
---|---|---|
Code iteration | 5-15 min (Docker rebuild + ECR push + update) | Seconds (direct code sync) |
Impact of 10 changes/day | Hours of waiting | Minutes total |
Interactive debugging | No SSH access (ECS Exec unsupported) | Full SSH, Jupyter, VSCode support |
Code iteration cycle comparison:
The difference between several minutes and a few seconds compounds quickly: 10 code changes in a day = hours of waiting with AWS Batch vs minutes with SkyPilot.
3. End-to-end time and cost savings
Setup time and operational overhead:
Metric | AWS Batch | SkyPilot |
---|---|---|
Steps to run a job | 6 steps (IAM, EFS, compute env, job queue, job def, ECR) | 1 step (YAML file) |
Initial setup time¹ | 2-5 days | 1-3 hours |
Debugging | 20-40 minutes (logs across multiple services) | 2-5 minutes (sky jobs logs) |
Cost optimization opportunities:
Cost Factor | AWS Batch | SkyPilot | Savings |
---|---|---|---|
Cloud selection | AWS only, single region | Multi-cloud arbitrage. Neoclouds offer much more attractive GPU pricing. | Find cheapest GPUs globally |
GPU allocation | Fixed instance sizes (e.g., 8 GPUs minimum) | Request exactly what you need (e.g., 2 GPUs if cloud supports it) | Eliminate over-provisioning |
Storage options | Amazon EFS required for shared storage | Flexible: S3, GCS, EFS, third-party (e.g., Cloudflare R2) | Use cheapest/best storage option |
Real-world example: 11x cost reduction and improved productivity
Avataar, an AI company that creates product videos from 2D images, achieved dramatic improvements with SkyPilot:
- 11x cost reduction in infrastructure expenses
- Hourly GPU costs dropped from $6.88 (AWS) to $2.39 (RunPod) through multi-cloud arbitrage
- Saved tens of hours per week on infrastructure management, boosting team productivity
- Enabled seamless multi-cloud deployment across AWS, Azure, GCP, Nebius, etc.
- Scaled from 1 to 1000+ GPUs as needed without infrastructure reconfiguration
When to use which system
We’ve seen how AWS Batch requires extensive infrastructure configuration while SkyPilot focuses on the task itself. We’ve compared their capabilities across GPU access, cost & performance, and operational overhead. Now, when should you use each?
Use AWS Batch if:
- Running traditional batch processing (nightly reports, ETL, financial calculations)
- You have dedicated infrastructure teams managing Terraform/CloudFormation
Use SkyPilot if:
- You’re doing ML/AI work that needs GPUs
- You want interactive development (SSH into jobs, Jupyter/VSCode support)
- You use experiment tracking tools (MLflow, W&B)
- You want faster iteration cycles without Docker rebuild overhead
- You want to increase the productivity of your data scientists/AI engineers on the cloud
- You need automatic cost optimization and multi-cloud GPU access
- You want access to new clouds & Neoclouds (RunPod, Nebius, Lambda Labs, etc.) for future-proofing
Many organizations use both: AWS Batch for production ETL pipelines with deep AWS integration, SkyPilot for ML training and inference that needs GPU flexibility and multi-cloud access.
Even if you’re deep in the AWS ecosystem, it’s worth trying SkyPilot for AI workloads to gain the benefits of faster iteration, automatic multi-region GPU access, and significant cost savings - while still running on AWS infrastructure when desired.
Conclusion
We started by exploring why AWS Batch, designed for traditional enterprise batch processing, struggles with AI workloads. Through a concrete embedding generation example, we saw how AWS Batch requires configuring six interconnected components before running a single job.
In contrast, SkyPilot’s job-centric approach lets you define the task itself - no infrastructure configuration needed. Our comparison across three dimensions showed clear advantages:
- Multi-cloud GPU access: Automatic distribution across 17+ clouds and all regions vs single-region limitation
- Time and cost: 11x cost reduction, seconds for code changes vs minutes, 1-3 hour setup vs 2-5 days
- Operational overhead: Simple task definition vs complex state management and permissions
For ML teams fighting GPU availability and infrastructure complexity, SkyPilot provides significant cost reduction, automatic multi-cloud GPU access, and faster iteration cycles - while AWS Batch remains better suited for traditional enterprise batch processing with deep AWS integration requirements.
Based on the experience of an AI engineer with cloud experience but no prior hands-on experience with either AWS Batch or SkyPilot. ↩︎