Image search over a given dataset is an important application, e.g., finding all of my photos taken on a beach. People have been using AI techniques, such as semantic image search, to enable it.

In this blog post, we share our experience in constructing a large-scale image search database in a highly distributed and cost-effective way, reducing end-to-end generation time from 120 hours to 1 hour and the cost from $231 to $46.2. We distribute our batch data processing across multiple clouds.


Traditional image search relies on metadata like filenames, tags, or manually added descriptions. If you’ve ever tried finding a “sunset over mountains” in your photo library, you know how limiting this can be – unless you’ve meticulously tagged every photo, you’re out of luck.

Semantic search solves this by understanding the actual content and meaning within images. Here’s how it works:

  1. Vector Embedding: Neural networks (like OpenAI CLIP) convert images into high-dimensional vectors – typically 512 to 1024 dimensions. These vectors capture semantic features: colors, shapes, objects, and even abstract concepts.
  2. Similarity Matching: When you search, your query (text or image) gets converted into the same vector space. The system then finds images whose vectors are “closest” to your query vector, typically using cosine similarity or Euclidean distance.
  3. Neural Understanding: The magic happens in the neural network’s training. Models like CLIP are trained on millions of image-text pairs, learning to map related concepts close together in vector space. This means searching for “person walking dog” finds relevant images even if they were never tagged with those words.

For example, when you search for “sunset over mountains”:

  • The query text “sunset over mountains” → vector [0.1, 0.8, -0.3, ...]

  • Each database image → vector [0.2, 0.7, -0.4, ...]

  • System returns images whose vectors are most similar to the query vector
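To make the similarity matching concrete, here is a minimal sketch with made-up 4-dimensional vectors (real CLIP embeddings have 512+ dimensions); it ranks a few hypothetical image vectors against the query vector using cosine similarity:

import torch
import torch.nn.functional as F

# Hypothetical embeddings for illustration only.
query_vector = torch.tensor([0.1, 0.8, -0.3, 0.2])            # "sunset over mountains"
image_vectors = torch.tensor([
    [0.2, 0.7, -0.4, 0.1],    # a sunset photo
    [-0.6, 0.1, 0.5, -0.2],   # an unrelated photo
    [0.1, 0.75, -0.2, 0.3],   # another sunset photo
])

# Cosine similarity = dot product of L2-normalized vectors.
similarities = F.cosine_similarity(query_vector.unsqueeze(0), image_vectors, dim=-1)
print(similarities)                  # related photos score high, the unrelated one scores low
print(similarities.argmax().item())  # index of the best match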


How Does Semantic Image Search Work?

In this example, we use simplified code examples based on OpenAI's CLIP, a dual-encoder architecture with an embedding size of 512 that consists of a Vision Encoder (ViT) and a Text Encoder (Transformer).

Why not VLMs like DeepSeek-Janus? CLIP is trained with a contrastive objective that explicitly matches text and images, jointly optimizing both encoders so that paired texts and images land close together in a shared vector space. Some VLMs instead rely on different training objectives (e.g., masked token prediction or caption generation). Without a strong contrastive alignment, the learned embeddings can be less separable for direct retrieval tasks.
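To give a sense of what that contrastive text–image matching objective looks like, here is a minimal sketch of a CLIP-style symmetric contrastive loss (an illustration of the idea, not OpenAI's actual training code):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # image_embeds and text_embeds are (batch_size, embed_dim);
    # row i of each tensor comes from the same image-text pair.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch_size, batch_size) similarity matrix: entry [i, j] compares image i with text j.
    logits = image_embeds @ text_embeds.T / temperature

    # The correct pairing is the diagonal: image i should match text i, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.T, targets)
    return (loss_image_to_text + loss_text_to_image) / 2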

When building a search system with CLIP:

  1. Generating Image Embeddings: To make an image dataset searchable, we need to generate an embedding vector for each image in the dataset using the CLIP model, and store those embeddings in a database, called a vector database.
from typing import Any, List

import torch

def process_images(image_paths: List[str], clip_model: Any) -> List[torch.Tensor]:
    embeddings = []
    for path in image_paths:
        image = preprocess_image(path)  # Resize to 224x224 and normalize pixel values
        with torch.no_grad():
            image_embedding = clip_model.encode_image(image)  # shape: (1, embed_dim)
            embeddings.append(normalize(image_embedding))     # L2-normalize for cosine similarity
    return embeddings
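The helpers preprocess_image and normalize are not defined in the snippet above; one possible implementation, assuming the openai/CLIP package and the ViT-B/32 checkpoint, is:

import clip
import torch
import torch.nn.functional as F
from PIL import Image

# clip.load returns the model together with its matching image preprocessing transform.
model, preprocess = clip.load("ViT-B/32", device="cpu")

def preprocess_image(path: str) -> torch.Tensor:
    # Apply CLIP's own transform (resize, center-crop, normalize) and add a batch dimension.
    return preprocess(Image.open(path).convert("RGB")).unsqueeze(0)

def normalize(embedding: torch.Tensor) -> torch.Tensor:
    # L2-normalize so that dot products between embeddings equal cosine similarities.
    return F.normalize(embedding, dim=-1)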
  2. Query Processing: for every input query, the same CLIP model is used to generate the embedding for the input.

    def process_query(query: Union[str, bytes], clip_model: Any) -> torch.Tensor:
        if is_text_query(query):
            # Text query
            text = clip.tokenize(query)
            with torch.no_grad():
                vector_embeddings = clip_model.encode_text(text)
        else:
            # Image query
            image = preprocess_image(query)
            with torch.no_grad():
                vector_embeddings = clip_model.encode_image(image)
    
        return normalize(vector_embeddings)
    
  3. Vector Search: the similarity between the input embedding and the image embeddings in the database is then computed, and the top-k images are returned as the result of the search.

    def semantic_search(query_vector: torch.Tensor, image_vectors: List[torch.Tensor], k: int = 10):
        # Concatenate the per-image embeddings into a (num_images, embed_dim) matrix.
        image_matrix = torch.cat(image_vectors, dim=0)
        # All vectors are L2-normalized, so the dot product equals cosine similarity.
        similarities = torch.matmul(query_vector, image_matrix.T).squeeze(0)
        top_k = torch.topk(similarities, k)
        return top_k.indices
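Putting the three pieces together, a small end-to-end usage sketch (with placeholder image paths, assuming the model and helpers from the sketches above) might look like:

# Hypothetical dataset, for illustration only.
image_paths = ["photos/beach_001.jpg", "photos/dog_park.jpg", "photos/mountains.jpg"]

image_vectors = process_images(image_paths, model)               # embed the dataset
query_vector = process_query("sunset over mountains", model)     # embed the query
top_indices = semantic_search(query_vector, image_vectors, k=2)  # rank by similarity

for idx in top_indices:
    print(image_paths[int(idx)])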
    

How to Build the Vector Database with a Lot of Images?

The steps above seem quite straightforward, but the complexity quickly adds up at each stage when we have to deal with a million-scale dataset, especially due to the AI model's heavy demand for GPU compute.

For the rest of this blog post, we share our experience in constructing a large-scale vector database in a highly distributed and cost-effective way, reducing end-to-end generation time from 120 hours to 1 hour and the cost from $231 to $46.2. We distribute our batch data processing across multiple clouds. See the figure below for the general design of our pipeline, which contains three stages:

  1. Generating Image Embeddings
  2. Vector Database Ingestion
  3. Serving the Vector Database for Queries.


Figure: There are three stages for building an image search database: generating image embeddings, vector database ingestion, and handling queries.

Step 1: Generating Image Embeddings

Generating embeddings with large AI models for every image in the dataset is the most time-consuming and computationally expensive step. A typical cloud GPU, such as an NVIDIA L4, can process about 7 images per second with OpenCLIP. With one GPU, computing embeddings for 1 million images takes more than 40 hours; computing the largest LAION-5B dataset with 5 billion images would take more than 23 years!
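As a quick back-of-the-envelope check of these numbers (assuming a sustained 7 images per second on a single GPU):

IMAGES_PER_SECOND = 7  # approximate single-GPU throughput

hours_for_1m = 1_000_000 / IMAGES_PER_SECOND / 3600
years_for_5b = 5_000_000_000 / IMAGES_PER_SECOND / 3600 / 24 / 365

print(f"{hours_for_1m:.0f} hours for 1M images")   # ~40 hours
print(f"{years_for_5b:.0f} years for LAION-5B")    # ~23 years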

Scale it Up

A natural way to speed this up is to use many machines with GPUs to generate embeddings in parallel. Imagine you have access to 100 machines with similar GPUs: the processing time is cut to only about 1 hour! While this is an intuitive idea, in reality GPU resources are:

  • Scarce: Cloud service providers have limited GPU capacity in a single region, which limits the number of GPUs one can get.
  • Expensive: A GPU-equipped machine can be over 100 times more costly than a CPU-only counterpart.

To address scarcity, utilizing GPUs of different types across multiple regions and clouds can be really helpful, as it combines the available GPUs from multiple resource pools. Since embedding generation requires no communication among nodes, we can freely spread the compute across the globe.

To deal with the high cost of GPUs, we can leverage spot offerings on the cloud, which are generally 2-3x cheaper than normal on-demand offerings, though they can be preempted at any time, i.e., the workload can be interrupted.


Figure: An illustration of image vector computation with SkyPilot.

Launch the Jobs

We use SkyPilot to launch parallel embedding generation jobs across multiple regions and clouds, mixing on-demand and spot instances to get the best GPU availability and reduce costs. Specifically, we use SkyPilot Managed Jobs, which automatically grab GPUs across clouds and regions and recover jobs from failures and spot preemptions. The generated embeddings are stored in SkyPilot-managed buckets in Apache Parquet format, which lets us resume or merge partial results without re-processing everything.

We specify the GPU resources as a set of candidate choices, so SkyPilot can grab any of them to run the embedding generation job with CLIP. We statically partition the dataset into subsets and assign each subset to a separate job. SkyPilot is responsible for making sure each job finishes its partition, recovering it from failures by retrying the job.

See below for a simplified configuration example (the complete SkyPilot YAML can be found here):

name: clip-batch-compute-vectors

resources:
  accelerators: {T4:1, L4:1}
  memory: 32+
  use_spot: true # we use Spot VMs to save costs

envs:
  START_IDX: ${START_IDX}
  END_IDX: ${END_IDX}

file_mounts:
  /output:
    name: my-bucket-for-embedding-output
    mode: MOUNT

setup: |
  pip install numpy==1.26.4
  pip install torch==2.5.1 torchvision==0.20.1 ftfy regex tqdm  
	...

run: |
  python scripts/compute_vectors.py \
    --output-dir /output \
    --start-idx ${START_IDX} \
    --end-idx ${END_IDX} ...  

We start all jobs with a simple script that assigns partitions to every job:

import concurrent.futures

import sky

DATA_SIZE = 1_000_000
NUM_WORKERS = 40
per_worker_size = DATA_SIZE // NUM_WORKERS
task = sky.Task.from_yaml('task.yaml')

def start_job(start_idx, end_idx):
    sky.jobs.launch(
        task, 
        envs={
            "START_IDX": str(start_idx), 
            "END_IDX": str(end_idx)
        }, 
        detach_run=True
    )

with concurrent.futures.ThreadPoolExecutor() as executor:
    for start_idx in range(0, DATA_SIZE, per_worker_size):
        end_idx = min(DATA_SIZE, start_idx + per_worker_size)
        executor.submit(start_job, start_idx, end_idx)

After submitting all the embedding generation jobs with SkyPilot, we found the jobs ran successfully in parallel across different regions, including eastus2, westus3, etc., as shown in the figure below.

Figure: Embedding generation jobs running in parallel across multiple regions.

Step 2: Vector Database Ingestion

After the embedding vectors are generated for all images, we ingest the vectors into a vector database. The vector database builds internal structures to support storage and fast queries.

In this step, we use SkyPilot to launch a cloud virtual machine that gathers the (embedding, image) pairs from the Apache Parquet files in the SkyPilot-managed bucket, constructs a vector database, and stores the resulting database in another bucket (here is the SkyPilot YAML).

Different vector databases have different underlying storage mechanisms and tradeoffs (See various vector databases). In our example, we use ChromaDB.

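As a rough sketch of what this ingestion could look like with ChromaDB (assuming each Parquet file has id, image_path, and embedding columns; the exact schema in our pipeline may differ):

import glob

import chromadb
import pandas as pd

# Persist the database to local disk; the directory can then be uploaded to a bucket.
client = chromadb.PersistentClient(path="/vectordb")
collection = client.get_or_create_collection(
    name="image_embeddings",
    metadata={"hnsw:space": "cosine"},  # use cosine distance for retrieval
)

for parquet_file in glob.glob("/output/*.parquet"):
    df = pd.read_parquet(parquet_file)
    collection.add(
        ids=df["id"].astype(str).tolist(),
        embeddings=df["embedding"].tolist(),
        metadatas=[{"image_path": p} for p in df["image_path"]],
    )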

Step 3: Serving Your Vector Database

A vector database is only useful if you can query it. We now serve the database with an API endpoint that other applications (or your local client) can call to perform semantic search. We also designed a simple search website on top of the vector database. Check out the demo here. The detailed SkyPilot YAML can be found here.
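While the full serving setup lives in the linked SkyPilot YAML, here is a minimal sketch of what such a search endpoint could look like, using FastAPI, the same CLIP model, and the ChromaDB collection built in Step 2 (the endpoint path and response format are illustrative):

import chromadb
import clip
import torch
import torch.nn.functional as F
from fastapi import FastAPI

app = FastAPI()
client = chromadb.PersistentClient(path="/vectordb")
collection = client.get_collection("image_embeddings")
model, _ = clip.load("ViT-B/32", device="cpu")

@app.get("/search")
def search(query: str, k: int = 10):
    # Embed the text query with the same CLIP model used for the images.
    with torch.no_grad():
        text_embedding = model.encode_text(clip.tokenize([query]))
        text_embedding = F.normalize(text_embedding, dim=-1)
    results = collection.query(
        query_embeddings=text_embedding.tolist(),
        n_results=k,
    )
    # Return the image paths stored as metadata during ingestion.
    return {"results": results["metadatas"][0]}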


Read More

This guide shows how to build a large-scale image search database with SkyPilot and CLIP embeddings. By parallelizing embedding generation across multiple regions and using cost-effective GPU instances, processing time drops from 120 hours to 1 hour. Once ingested into a vector database, these embeddings enable fast and accurate image queries.

To try the full pipeline yourself, the detailed step-by-step tutorial can be found here.