Enterprise AI tools like Microsoft Copilot, Glean, and Onyx are becoming popular in organizations of all sizes. These RAG-based systems can answer questions, summarize content, and pull insights from massive document repositories.
However, they have trouble with images and scanned documents, because these types of data are often unsupported by the embedding models that RAG pipelines use for semantic similarity search.
Many enterprises have decades of knowledge locked away in exactly these formats: scanned paper documents, legacy PDFs with weird layouts, technical drawings, archives from before anyone cared about “digital workflows.”

These documents often contain sensitive information - contracts, medical records, financial statements, internal reports. Shipping them off to a third-party OCR API is a no-go from a compliance perspective: HIPAA, GDPR, and internal data governance policies often mean you simply can’t send this data outside your infrastructure, so self-hosting OCR models becomes the only viable option. But traditional OCR doesn’t quite cut it here.
Multi-column layouts get merged into gibberish, tables lose their structure completely, and PDFs with invisible layers or annotations become unreadable. So you end up with valuable knowledge that might as well not exist, because your RAG system can’t see it.
Enter DeepSeek-OCR
DeepSeek OCR is a different beast from traditional OCR. Instead of the traditional sequential pipeline, it uses a vision-language model that understands a document as a whole - recognizing text, structure, and context simultaneously. This means multi-column layouts stay intact, tables preserve their structure, and the model outputs clean markdown that’s ready for RAG systems.
It also does context-aware text recognition. When text is illegible from ink stains or poor scanning quality, it infers the most likely word from context rather than outputting gibberish. For instance, in a damaged contract reading “The party agrees to ##### the premises by December 31st”, traditional OCR might return random characters, but DeepSeek OCR correctly infers “vacate” from the legal context.
Traditional OCR’s sequential five-stage pipeline - preprocessing, detection, layout analysis, recognition, and language correction - compounds errors at each stage and loses document structure along the way; DeepSeek OCR collapses all of it into a single model pass.
The model itself is great, but there’s a challenge: processing enterprise document archives with hundreds of thousands of pages on a single GPU would take weeks. You need a way to scale this efficiently across multiple machines.
Batch inference with SkyPilot Pools
To process large document archives efficiently, you need a scalable batch inference system. Most organizations already have GPU capacity scattered across their infrastructure - reserved instances on AWS, managed Kubernetes clusters from neoclouds like Nebius and CoreWeave, maybe some credits on GCP. These GPUs often sit idle between training runs or serving workloads. SkyPilot’s Pools feature lets you harness all of this capacity together, creating a unified pool of workers that spans multiple clouds and Kubernetes clusters.
With a pool of workers, you can fan out a large number of batch inference jobs and utilize all the GPUs available across your infrastructure.

The naive approach: single GPU processing
You could start by running OCR on a single GPU instance. Here’s a simple SkyPilot task that processes the entire Book-Scan-OCR dataset (full example):
resources:
  accelerators: L40S:1

setup: |
  # Install DeepSeek OCR and dependencies
  # Download the Book-Scan-OCR dataset
  ...

run: |
  source .venv/bin/activate
  # Process all images sequentially on one GPU
  python process_ocr.py --start-idx 0 --end-idx -1
Launch it with:
sky launch -c deepseek-ocr-single task.yaml
Unfortunately, for enterprise document archives with hundreds of thousands of scanned pages, this approach simply doesn’t scale. You’d be waiting days or weeks for results.
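To put rough numbers on that, here’s a quick back-of-envelope estimate in Python. The per-page latency and archive size are assumed figures for illustration, not benchmarks; actual throughput depends on page resolution, crop settings, and the GPU.

# Rough time estimate for a hypothetical 200k-page archive
# (5 s/page is an assumption, not a measured number).
pages = 200_000
secs_per_page = 5

single_gpu_days = pages * secs_per_page / 3600 / 24
print(f" 1 GPU:  ~{single_gpu_days:.1f} days")      # ~11.6 days
for workers in (3, 10, 30):
    print(f"{workers:>2} GPUs: ~{single_gpu_days / workers:.1f} days")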
This is where you need parallel batch inference across multiple GPUs to make OCR practical for large document collections.
Scaling batch inference with SkyPilot pools
Here we’ll look into SkyPilot’s Pools feature and how it enables scalable batch inference for DeepSeek OCR. Pools let you spin up a fleet of GPU workers that stay warm and ready to process document batches in parallel.
What are pools and why use them for batch inference?
A pool is a collection of GPU instances that share the same setup - dependencies, models, and datasets are installed once. Workers persist across jobs, so there are no cold starts and no re-downloading of gigabytes of model weights or datasets for every job.
Key benefits for batch inference workloads:
- Fully utilize GPU capacity: Pools allow you to utilize idle GPUs available across any of your clouds or Kubernetes clusters.
- Unified queue: Submit any number of jobs - SkyPilot automatically distributes work across available workers.
- Automatic recovery: Jobs that get preempted by higher-priority workloads are automatically rescheduled when GPUs become available.
- Dynamic submission: Add new jobs anytime without reconfiguring infrastructure.
- Warm workers and elastic scaling: Models stay loaded and ready - no setup delays between jobs. Scale workers up or down with a single command.
It’s like having your own batch inference cluster that you control with a single YAML file, but with the flexibility to use GPUs from any provider.
Implementation: batch OCR pipeline
Let’s build a production-ready OCR pipeline for the Book-Scan-OCR dataset of scanned news and book pages.
Step 1: pool configuration
With the new pools feature, we separate the pool infrastructure definition from the job specification. The pool YAML defines the shared worker environment (view on GitHub):
pool.yaml:
pool:
  workers: 3

resources:
  accelerators: L40S:1

file_mounts:
  ~/.kaggle/kaggle.json: ~/.kaggle/kaggle.json
  /outputs:
    source: s3://my-skypilot-bucket

workdir: .

setup: |
  # Setup runs once on all workers (must be non-blocking)
  sudo apt-get update && sudo apt-get install -y unzip
  uv venv .venv --python 3.12
  source .venv/bin/activate
  git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
  cd DeepSeek-OCR
  pip install kaggle
  uv pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  uv pip install vllm==0.8.5
  uv pip install flash-attn==2.7.3 --no-build-isolation
  uv pip install -r requirements.txt
  cd ..

  # Download dataset during setup (shared across all jobs)
  kaggle datasets download goapgo/book-scan-ocr-vlm-finetuning
  unzip -q book-scan-ocr-vlm-finetuning.zip -d book-scan-ocr
  echo "Setup complete!"
The job YAML defines the actual workload that runs on each worker (view on GitHub):
job.yaml:
name: deepseek-ocr-job

resources:
  accelerators: L40S:1

run: |
  # Calculate job range using SKYPILOT_JOB_RANK and SKYPILOT_NUM_JOBS
  source .venv/bin/activate
  echo "Job rank: ${SKYPILOT_JOB_RANK}/${SKYPILOT_NUM_JOBS}"

  # Count total images in the dataset
  IMAGE_DIR=./book-scan-ocr/Book-Scan-OCR/images
  TOTAL_IMAGES=$(find ${IMAGE_DIR} -name "*.jpg" -o -name "*.png" | wc -l)
  echo "Total images: ${TOTAL_IMAGES}"

  # Calculate start and end indices for this job
  CHUNK_SIZE=$((TOTAL_IMAGES / SKYPILOT_NUM_JOBS))
  REMAINDER=$((TOTAL_IMAGES % SKYPILOT_NUM_JOBS))

  # Calculate start index
  START_IDX=$((SKYPILOT_JOB_RANK * CHUNK_SIZE))
  if [ ${SKYPILOT_JOB_RANK} -lt ${REMAINDER} ]; then
    START_IDX=$((START_IDX + SKYPILOT_JOB_RANK))
    CHUNK_SIZE=$((CHUNK_SIZE + 1))
  else
    START_IDX=$((START_IDX + REMAINDER))
  fi
  END_IDX=$((START_IDX + CHUNK_SIZE))
  echo "Processing images ${START_IDX} to ${END_IDX}"

  # Pass indices to Python script via CLI arguments
  python process_ocr.py --start-idx ${START_IDX} --end-idx ${END_IDX}

  echo "Job complete! Results saved to S3 bucket."
Key components
- Pool configuration (pool.yaml): Defines the worker infrastructure and shared setup

  pool:
    workers: 3  # Number of parallel GPU instances

  setup: |
    # Runs once when each worker starts
    git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
    kaggle datasets download goapgo/book-scan-ocr-vlm-finetuning
    # Install dependencies, download models, etc.

- Job configuration (job.yaml): Defines the workload that executes on each worker

  run: |
    # Runs for each job submitted to the pool
    python process_ocr.py --start-idx ${START_IDX} --end-idx ${END_IDX}

- Separation of concerns: The pool YAML contains setup, file mounts, and infrastructure configuration. The job YAML contains only the run command and must match the pool’s resource requirements.

- Automatic work distribution: SkyPilot provides environment variables to split work across jobs

  run: |
    # Each job gets its own rank: 0, 1, 2, ...
    echo "Job rank: ${SKYPILOT_JOB_RANK}/${SKYPILOT_NUM_JOBS}"
    # Calculate which slice of images this job processes
    START_IDX=$((SKYPILOT_JOB_RANK * CHUNK_SIZE))
    END_IDX=$((START_IDX + CHUNK_SIZE))

- Cloud storage integration: Results sync to S3 automatically (e.g. for use in downstream RAG systems)

  file_mounts:
    /outputs:
      source: s3://my-skypilot-bucket  # Auto-synced
Step 2: processing script
The Python script takes start/end indices and processes its chunk (view on GitHub):
process_ocr.py:
"""
DeepSeek OCR Image Processing Script
Processes images from the Book-Scan-OCR dataset.
"""
import argparse
import json
from pathlib import Path
from transformers import AutoModel, AutoTokenizer
import torch
def main():
parser = argparse.ArgumentParser(description='Process OCR on image dataset')
parser.add_argument('--start-idx', type=int, required=True)
parser.add_argument('--end-idx', type=int, required=True)
args = parser.parse_args()
print(f"Processing range: {args.start_idx} to {args.end_idx}")
# Load DeepSeek OCR model
model_name = "deepseek-ai/deepseek-ocr"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(
model_name,
_attn_implementation='flash_attention_2',
trust_remote_code=True,
use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)
# Find and slice images
image_dir = Path.cwd() / "book-scan-ocr" / "Book-Scan-OCR" / "images"
output_dir = Path("/outputs/ocr_results")
output_dir.mkdir(parents=True, exist_ok=True)
all_image_files = sorted(image_dir.glob("*.jpg")) + sorted(image_dir.glob("*.png"))
image_files = all_image_files[args.start_idx:args.end_idx]
print(f"Processing {len(image_files)} images...")
results = []
for idx, img_path in enumerate(image_files, 1):
print(f"Processing {idx}/{len(image_files)}: {img_path.name}...")
try:
# Run OCR with grounding tag for structure awareness
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_output_dir = output_dir / img_path.stem
image_output_dir.mkdir(exist_ok=True)
ocr_result = model.infer(
tokenizer,
prompt=prompt,
image_file=str(img_path),
output_path=str(image_output_dir),
base_size=1024,
image_size=640,
crop_mode=True,
save_results=True,
test_compress=True
)
# Read the markdown result
mmd_file = image_output_dir / "result.mmd"
if mmd_file.exists():
with open(mmd_file, 'r', encoding='utf-8') as f:
ocr_text = f.read()
else:
ocr_text = "[OCR completed but result not found]"
# Save consolidated markdown at top level
md_file = output_dir / f"{img_path.stem}.md"
with open(md_file, 'w', encoding='utf-8') as f:
f.write(f"# {img_path.name}\n\n{ocr_text}\n")
# Save JSON metadata
result = {"image_name": img_path.name, "ocr_text": ocr_text}
results.append(result)
json_file = output_dir / f"{img_path.stem}_ocr.json"
with open(json_file, 'w', encoding='utf-8') as f:
json.dump(result, f, indent=2, ensure_ascii=False)
print(f"Saved markdown to {md_file}")
except Exception as e:
print(f"Error processing {img_path.name}: {e}")
results.append({"image_name": img_path.name, "error": str(e)})
# Save batch summary
summary_file = output_dir / f"results_{args.start_idx}_{args.end_idx}.json"
with open(summary_file, 'w', encoding='utf-8') as f:
json.dump(results, f, indent=2, ensure_ascii=False)
# Print summary
successful = sum(1 for r in results if "error" not in r)
print(f"\n{'='*60}")
print(f"Processing complete!")
print(f"Total: {len(results)} | Successful: {successful} | Failed: {len(results) - successful}")
print(f"Results saved to {output_dir}")
print('='*60)
if __name__ == "__main__":
main()
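Before fanning this out across the pool, it’s worth a quick smoke test on a small slice, e.g. `python process_ocr.py --start-idx 0 --end-idx 5` on any GPU machine with the same setup, to confirm the model loads and the outputs look right.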
Running the pipeline
Create the pool
Spin up 3 workers with the shared environment:
sky jobs pool apply -p deepseek-ocr-pool pool.yaml
SkyPilot will automatically select workers based on availability and cost, and they can be spread across your Kubernetes clusters and 20+ cloud providers for maximum flexibility.
If you want to restrict to a specific cloud, you can add --infra k8s or --infra aws flags (replace k8s/aws with your desired provider).
Output:
YAML to run: pool.yaml
Pool spec:
Worker policy: Fixed-size (3 workers)
Each pool worker will use the following resources (estimated):
Considered resources (1 node):
-------------------------------------------------------------------------------------------------------
INFRA INSTANCE vCPUs Mem(GB) GPUS COST ($) CHOSEN
-------------------------------------------------------------------------------------------------------
Kubernetes (my-cluster) - 4 16 L40S:1 0.00 ✔
Nebius (eu-north1) gpu-l40s-a_1gpu-8vcpu-32gb 8 32 L40S:1 1.55
AWS (us-east-1) g6e.xlarge 4 32 L40S:1 1.86
-------------------------------------------------------------------------------------------------------
🔍 Multiple Nebius instances satisfy L40S:1. The cheapest (gpus=L40S:1, cpus=4, mem=16, ...) is considered among: gpu-l40s-d_1gpu-16vcpu-96gb, gpu-l40s-d_1gpu-48vcpu-288gb, gpu-l40s-d_1gpu-32vcpu-192gb, gpu-l40s-a_1gpu-24vcpu-96gb, gpu-l40s-a_1gpu-32vcpu-128gb, gpu-l40s-a_1gpu-16vcpu-64gb, gpu-l40s-a_1gpu-8vcpu-32gb, gpu-l40s-a_1gpu-40vcpu-160gb.
🔍 Multiple AWS instances satisfy L40S:1. The cheapest (gpus=L40S:1, cpus=4, mem=16, ...) is considered among: g6e.16xlarge, g6e.xlarge, g6e.4xlarge, g6e.8xlarge, g6e.2xlarge.
...
Check & install cloud dependencies on controller: done.
✓ Setup completed. View logs: sky api logs -l sky-2025-11-10-17-37-53-737160/setup-*.log
⚙︎ Job submitted, ID: 1
Pool name: deepseek-ocr-pool
📋 Useful Commands
├── To submit jobs to the pool: sky jobs launch --pool deepseek-ocr-pool job.yaml
├── To submit multiple jobs: sky jobs launch --pool deepseek-ocr-pool --num-jobs 10 job.yaml
├── To check the pool status: sky jobs pool status deepseek-ocr-pool
├── To terminate the pool: sky jobs pool down deepseek-ocr-pool
└── To update the number of workers: sky jobs pool apply -p deepseek-ocr-pool --workers 5
✓ Successfully created pool 'deepseek-ocr-pool'.
Check pool status
See what your workers are doing:
sky jobs pool status deepseek-ocr-pool
Output:
Pools
NAME VERSION UPTIME STATUS WORKERS
deepseek-ocr-pool 1 10m 8s READY 3/3
Pool Workers
POOL_NAME ID VERSION LAUNCHED INFRA RESOURCES STATUS USED_BY
deepseek-ocr-pool 1 1 4 mins ago Nebius (eu-north1) 1x(gpus=L40S:1, gpu-l40s-a_1gpu..., ...) READY -
deepseek-ocr-pool 2 1 6 mins ago Kubernetes (my-cluster) 1x(gpus=L40S:1, cpus=4, mem=16, ...) READY -
deepseek-ocr-pool 3 1 6 mins ago Kubernetes (my-cluster) 1x(gpus=L40S:1, cpus=4, mem=16, ...) READY -

Note that I have set up access to my own K8s cluster with two L40S GPUs, Nebius cloud in Europe, and AWS in US East. So SkyPilot prioritized using my K8s cluster first, then Nebius for the remaining worker because it’s cheaper than AWS (by about $0.31/hour). As we’ll see below, if I scale up later, SkyPilot will automatically provision additional workers across different clouds to meet the desired count.
Once all workers show READY status, they’ve completed setup with models and dataset loaded.
Submit batch jobs
Submit 10 parallel jobs to process all images:
sky jobs launch --pool deepseek-ocr-pool --num-jobs 10 job.yaml
This submits 10 jobs to the pool. Since we have 3 workers, the first 3 jobs start immediately, each assigned to a worker. The remaining 7 jobs are queued and will automatically start as workers become available. Each job calculates its slice of images via $SKYPILOT_JOB_RANK, so the work gets evenly distributed across all 10 jobs.
sky dashboard

Watch progress
Check on your jobs:
sky jobs queue # or see the dashboard by running `sky dashboard`
Look at the logs:
$ sky jobs logs 2
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(deepseek-ocr-job, pid=2433) Job rank: 2/3
(deepseek-ocr-job, pid=2433) Total images: 156
(deepseek-ocr-job, pid=2433) Processing images 104 to 156
(deepseek-ocr-job, pid=2433) Processing range: 104 to 156
...
(deepseek-ocr-job, pid=2433) Processing 52 images...
(deepseek-ocr-job, pid=2433) Processing 1/52: india_news_p000104.jpg...
...
Scale the pool
Want to go faster? Scale up:
sky jobs pool apply --pool deepseek-ocr-pool --workers 10
If you have set up access to multiple clouds and K8s clusters, SkyPilot will automatically provision additional workers across different providers to meet your desired count. So your pool might look like this:
$ sky jobs pool status deepseek-ocr-pool
Pools
NAME VERSION UPTIME STATUS WORKERS
deepseek-ocr-pool 1 5m READY 10/10
Pool Workers
POOL_NAME ID VERSION LAUNCHED INFRA RESOURCES STATUS USED_BY
deepseek-ocr-pool 1 1 3 mins ago Nebius (eu-north1) 1x(gpus=L40S:1, gpu-l40s-a_1gpu..., ...) READY -
deepseek-ocr-pool 2 1 3 mins ago Kubernetes (my-cluster) 1x(gpus=L40S:1, cpus=4, mem=16, ...) READY -
deepseek-ocr-pool 3 1 3 mins ago Nebius (eu-north1) 1x(gpus=L40S:1, gpu-l40s-a_1gpu..., ...) READY -
deepseek-ocr-pool 4 1 3 mins ago Kubernetes (my-cluster) 1x(gpus=L40S:1, cpus=4, mem=16, ...) READY -
deepseek-ocr-pool 5 1 3 mins ago Nebius (eu-north1) 1x(gpus=L40S:1, gpu-l40s-a_1gpu..., ...) READY -
deepseek-ocr-pool 6 1 3 mins ago Nebius (eu-north1) 1x(gpus=L40S:1, gpu-l40s-a_1gpu..., ...) READY -
deepseek-ocr-pool 7 1 3 mins ago Nebius (eu-north1) 1x(gpus=L40S:1, gpu-l40s-a_1gpu..., ...) READY -
deepseek-ocr-pool 8 1 3 mins ago Nebius (eu-north1) 1x(gpus=L40S:1, gpu-l40s-a_1gpu..., ...) READY -
deepseek-ocr-pool 9 1 3 mins ago AWS (us-east-1a) 1x(gpus=L40S:1, g6e.xlarge, ...) READY -
deepseek-ocr-pool 10 1 3 mins ago AWS (us-east-1a) 1x(gpus=L40S:1, g6e.xlarge, ...) READY -

Then launch more jobs:
sky jobs launch --pool deepseek-ocr-pool --num-jobs 20 job.yaml
Results and integration
Once processing finishes, your S3 bucket has all the converted documents:
$ aws s3 ls s3://my-skypilot-bucket/ocr_results/
PRE india_news_p000000/
PRE india_news_p000001/
PRE india_news_p000002/
PRE india_news_p000003/
...
Inside each directory there’s an .md file with clean markdown text ready for RAG systems.
Point Glean, Onyx, or whatever pipeline you’re using at the bucket and you’re done.
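Before wiring the bucket into a downstream pipeline, a quick completeness check helps catch failed or skipped jobs. Here’s a minimal sketch using boto3 that counts the consolidated top-level .md files; the bucket name and prefix are the ones used in this example, so adjust them to your own.

import boto3

# Count the consolidated per-page .md files written by process_ocr.py.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

prefix = "ocr_results/"
md_pages = 0
for page in paginator.paginate(Bucket="my-skypilot-bucket", Prefix=prefix):
    for obj in page.get("Contents", []):
        rel = obj["Key"][len(prefix):]
        if rel.endswith(".md") and "/" not in rel:  # top-level .md files only
            md_pages += 1

print(f"Markdown pages in S3: {md_pages}")  # should match the number of input images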
Here’s an example of what the processing looks like. This scanned two-column news article:

gets converted into clean and structured markdown:

Once processed, you can point your RAG system (Onyx, Glean, etc.) to the S3 bucket and start asking questions about the documents. The previously inaccessible scanned content becomes part of your searchable knowledge base. For example, asking about the sample document above:
Onyx retrieves relevant context from the resulting markdown documents to answer questions about the scanned content.
Enterprise RAG systems can access digital documents directly, but scanned documents and legacy PDFs require OCR processing first. DeepSeek OCR combined with SkyPilot’s parallel GPU workers converts these unreadable images into clean markdown format. Once processed, all enterprise knowledge becomes searchable and accessible through the RAG system for Q&A, summarization, and analysis.
Why pools work well for batch inference
How does SkyPilot Pools compare to other batch inference approaches?
DIY scripts and SSH: You could manually partition data, SSH into each GPU node, and run jobs. This works for small runs but becomes a coordination nightmare at scale: no automatic job distribution, no failure recovery, and no visibility into what’s running where.
Kubernetes Jobs / Argo Workflows: These work well if all your GPUs are in one Kubernetes cluster. But if you have capacity spread across Hyperscalers, Neoclouds, or on-prem clusters, you’d need to manage workflows separately for each. SkyPilot unifies them into a single pool.
Cloud batch services (AWS Batch, GCP Batch): These vendor-specific services lock you into one cloud and one region. When GPU capacity runs out, you’re stuck manually reconfiguring for another region. They also require significant setup - IAM roles, compute environments, job queues, container images - before you can run anything. SkyPilot replaces this with a single YAML file that works across 17+ clouds and all regions.
Ray / Dask: Great for distributed computing within a cluster, but require you to provision and manage the underlying infrastructure yourself. SkyPilot handles both the infrastructure provisioning and the job orchestration.
The key difference: SkyPilot Pools give you a single control plane that spans multiple clouds and Kubernetes clusters, with warm workers that skip repeated setup costs. For batch inference workloads processing thousands of documents, this means higher GPU utilization and faster end-to-end throughput.
Beyond OCR: other batch inference use cases
The same batch inference pattern with SkyPilot pools works for other embarrassingly parallel workloads:
- Large-scale model inference: Process millions of samples for classification, embedding generation, or LLM inference
- Video processing: Batch transcription, scene detection, and content analysis across video archives
- Model training: Train multiple models with different hyperparameters simultaneously
- Scientific computing: Parameter sweeps, Monte Carlo simulations, and computational experiments
- ETL pipelines: Transform massive datasets in parallel across distributed workers
Any workload that can be split into independent batches benefits from this architecture.
Wrapping up
Modern OCR models like DeepSeek OCR solve the document understanding problem, and SkyPilot’s pools solve the batch inference scaling problem. Combine them, and you can make the AI systems in your organization more useful by unlocking the knowledge trapped in scanned documents.
The implementation is pretty straightforward - define your batch inference environment once, submit jobs with one command, and let SkyPilot handle the orchestration across multiple clouds. If you’ve got archives of scanned documents collecting dust, this gives you a practical way to make them searchable and useful again.
Resources
- SkyPilot Pools Documentation
- DeepSeek OCR GitHub
- Complete example code - includes pool.yaml, job.yaml, process_ocr.py, and sample output