Embedding Generation Pipelines

01

Why this matters

You have 100 million product descriptions, 50 million user-generated images, or 1 billion documents. To use them with vector search, every one needs an embedding: a vector of roughly 384 to 1536 dimensions. Calling an embedding model API for a single item is straightforward. Doing it efficiently for millions of items or more, keeping embeddings fresh as the source changes, and staying within budget is an engineering problem.

An embedding generation pipeline is the data infrastructure that produces, updates, and stores vectors at scale. It differs from a feature store, which targets structured features.

02

The four-stage pipeline

Source: text, images, or audio in S3, a database, or an event stream.
Chunking: split long text into 200-1000 token pieces, small enough for the model's input limit and focused enough to retrieve precisely. Images: resize or crop. Audio: split on silence.
Embed: call the model (OpenAI, Cohere, or in-house). One forward pass per chunk produces one vector.
Store: write vectors plus metadata to the vector DB, which updates its index incrementally.
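
The four stages can be sketched end to end in a few lines. This is only an illustration under loud assumptions: `toy_embed` is a hash-based stand-in for a real model, the store is a plain dict rather than a vector DB, chunks are measured in words rather than tokens, and every name here is hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class VectorRecord:
    item_id: str
    chunk_index: int
    text: str
    vector: list[float]


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Stage 2: split on whitespace into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(chunk: str, dim: int = 8) -> list[float]:
    """Stage 3 stand-in: a deterministic pseudo-vector from a hash, NOT a real model."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def run_pipeline(source: dict[str, str],
                 store: dict[tuple[str, int], VectorRecord]) -> int:
    """Stages 1-4: read items, chunk them, embed each chunk, upsert into the store."""
    written = 0
    for item_id, text in source.items():
        for index, chunk in enumerate(chunk_text(text)):
            # Keying by (item_id, chunk_index) makes re-runs overwrite, not duplicate.
            store[(item_id, index)] = VectorRecord(item_id, index, chunk, toy_embed(chunk))
            written += 1
    return written
```

Keying the store by item and chunk index is one way to make the write stage idempotent, which matters once the same item is re-embedded after an update.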

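For 100M items at an illustrative $0.0001 per embedding, the initial backfill costs 100M × $0.0001 = $10,000. Hosted APIs such as OpenAI's text-embedding-3-small bill per token, so the real per-embedding figure depends on chunk length and current pricing. Ongoing cost is incurred only for items that change.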
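
The arithmetic is simple enough to keep in a helper. The sketch below uses `Decimal` to avoid float rounding in dollar amounts. Both function names are hypothetical, and the per-token price in the usage note is an assumed example rather than a quoted rate.

```python
from decimal import Decimal


def price_per_embedding(tokens_per_chunk: int, price_per_million_tokens: str) -> Decimal:
    """Convert per-token API pricing into a per-chunk price."""
    return Decimal(tokens_per_chunk) * Decimal(price_per_million_tokens) / Decimal(1_000_000)


def backfill_cost(items: int, chunks_per_item: int, per_embedding: str) -> Decimal:
    """Total cost = items x chunks per item x price per embedding."""
    return Decimal(items) * chunks_per_item * Decimal(per_embedding)
```

For example, `backfill_cost(100_000_000, 1, "0.0001")` reproduces the $10,000 figure above, and `price_per_embedding(1000, "0.10")` shows that $0.0001 per embedding corresponds to a 1,000-token chunk at an assumed $0.10 per million tokens.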

~10 ms: embedding API call (one chunk)
~1000 chunks: batch size for throughput
$0.0001: per-chunk embedding cost
384-1536: typical embedding dim

03

Initial backfill vs incremental update

Backfill

One-shot, massive parallelism

A Spark or Beam job reads the source, splits items into chunks, calls the embedding API in parallel (thousands of concurrent requests), and writes the vectors. Expect hours to days for 100M items, and a one-time cost in the thousands of dollars.
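
On a single machine, the same batch-then-parallelize idea might look like the sketch below. It uses a thread pool instead of Spark or Beam, and `fake_embed_batch` is a hypothetical stand-in for one batched API request. A real job would also need retries, rate limiting, and checkpointing.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for one batched embedding request."""
    return [[float(len(text))] for text in batch]


def backfill(chunks: Iterable[str],
             embed_batch: Callable[[list[str]], list[list[float]]] = fake_embed_batch,
             batch_size: int = 1000,
             workers: int = 8) -> list[list[float]]:
    """Embed all chunks, sending batches concurrently and keeping input order."""
    vectors: list[list[float]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map submits every batch up front and yields results in order.
        # For a very large source, bound the number of in-flight batches instead.
        for result in pool.map(embed_batch, batched(chunks, batch_size)):
            vectors.extend(result)
    return vectors
```

Threads suit this workload because each request spends most of its time waiting on the network. The batch size of 1000 mirrors the throughput figure quoted earlier.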

Incremental

CDC + streaming

Change data capture (CDC) from the source DB → Kafka → embed worker → vector DB. New and updated items are re-embedded continuously, so steady-state cost scales with the number of changes.
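
A minimal sketch of the embed worker's core logic follows. The event shape (`op`, `id`, `text`) is hypothetical and not any particular CDC tool's format, and the in-memory dicts stand in for a vector DB. The content-hash check is an added assumption, not something the pipeline above requires: it skips re-embedding when an update did not change the text, which keeps per-update cost down.

```python
import hashlib
from typing import Callable


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class IncrementalEmbedder:
    """Applies change events one at a time, re-embedding only when text changed."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self.embed = embed
        self.hashes: dict[str, str] = {}           # item id -> hash of embedded text
        self.vectors: dict[str, list[float]] = {}  # stand-in for the vector DB
        self.embed_calls = 0

    def handle_event(self, event: dict) -> str:
        item_id = event["id"]
        if event["op"] == "delete":
            self.hashes.pop(item_id, None)
            self.vectors.pop(item_id, None)
            return "deleted"
        digest = content_hash(event["text"])
        if self.hashes.get(item_id) == digest:
            return "skipped"  # e.g. only a non-text column changed
        self.vectors[item_id] = self.embed(event["text"])
        self.hashes[item_id] = digest
        self.embed_calls += 1
        return "embedded"
```

In a real deployment the loop calling `handle_event` would be a Kafka consumer, and the hashes would live alongside the vectors as metadata.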

Practical pattern

One-time backfill on Day 0; incremental from Day 1. Most pipelines run both side by side — batch for backfilling new fields/models, streaming for source changes.

04

Deep dive — chunking is the real problem

"Embed the document" is usually the wrong unit of work. A long document covers many topics, and a single vector blurs them together (the text may also exceed the model's input limit). Instead, you chunk first, embed each chunk, and retrieve the most relevant chunks at query time. How you chunk largely determines retrieval quality.

Naive: split every 500 tokens. Bad — sentences get cut mid-thought.

Better: recursive splitting on natural boundaries — paragraphs first, then sentences, then tokens if needed. LangChain's RecursiveCharacterTextSplitter does this.
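
To make the idea concrete, here is a simplified recursive splitter written from scratch. It is not LangChain's implementation. It counts whitespace-separated words as a rough proxy for tokens, and the separator patterns are assumptions chosen for plain English prose.

```python
import re

# Coarsest boundary first: blank lines, then sentence ends, then any whitespace.
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]


def recursive_split(text: str, max_words: int, separators=SEPARATORS) -> list[str]:
    """Split text into chunks of at most max_words words, preferring natural boundaries."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words:
        return [text]
    if not separators:
        # No boundary left to try: fall back to a hard cut.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    pieces: list[str] = []
    for part in re.split(separators[0], text):
        pieces.extend(recursive_split(part, max_words, separators[1:]))

    # Greedily merge neighbouring pieces back together while they still fit.
    merged: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate.split()) <= max_words:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged
```

A sentence is only cut mid-thought when it is longer than `max_words` on its own. One simplification to be aware of: the merge step joins pieces with a single space, so paragraph breaks inside a chunk are not preserved.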

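Best: semantic chunking. Embed sentence by sentence, look for jumps in cosine distance between consecutive sentences to find topic shifts, and chunk at the jumps. It is slower, but retrieval quality can improve by roughly 20%, depending on the corpus and the metric.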
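
A minimal sketch of the distance-jump idea is shown here. `toy_embed` is a keyword-counting stand-in rather than a real sentence embedding model, and the fixed threshold of 0.5 is an assumption. A percentile of the observed distances is another way to pick the cut-off.

```python
import math
from typing import Callable, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(sentences: list[str],
                    embed: Callable[[str], Sequence[float]],
                    threshold: float = 0.5) -> list[str]:
    """Start a new chunk wherever consecutive sentences are far apart in embedding space."""
    if not sentences:
        return []
    vectors = [embed(sentence) for sentence in sentences]
    chunks = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(previous, current) > threshold:
            chunks.append([])  # topic shift detected: open a new chunk
        chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]


PET_WORDS = {"cat", "dog"}
FINANCE_WORDS = {"stock", "market"}


def toy_embed(sentence: str) -> list[float]:
    """Hypothetical 2-d 'embedding': counts of pet words and finance words."""
    words = [word.strip(".,!?").lower() for word in sentence.split()]
    return [float(sum(w in PET_WORDS for w in words)),
            float(sum(w in FINANCE_WORDS for w in words))]
```

With the toy embedding, two sentences about pets followed by two about the stock market split into two chunks at the topic change. The extra cost comes from embedding every sentence before embedding the final chunks.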

Chunk overlap (~50-100 tokens) helps too — context near chunk boundaries shows up in two chunks, retrieval picks whichever is more relevant. Storage cost is +10-20%.
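
Overlap is a sliding window whose step is smaller than its size. The sketch below assumes the text is already tokenized into a list, and the function names are hypothetical.

```python
def chunk_with_overlap(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def storage_overhead(size: int, overlap: int) -> float:
    """Approximate extra fraction of tokens stored because of overlap, for long texts."""
    return size / (size - overlap) - 1
```

For 500-token chunks with a 50-token overlap, `storage_overhead(500, 50)` comes out at about 11%, consistent with the lower end of the storage figure above.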

Real cost breakdown

"100M docs × 5 chunks each = 500M embeddings × $0.0001 = $50k initial. Recursive chunking with 100-token overlap. Stored in pgvector with HNSW indexing. Daily incremental embeds via CDC: 2M new docs × 5 chunks × $0.0001 = ~$1,000/day."

05

Hosted vs self-hosted embedding model

Choice	Pros	Cons
OpenAI / Cohere API	Strong quality out of the box, no infra, well-tested	Per-call cost, data leaves your network, rate limits
Self-hosted (sentence-transformers, BGE)	No per-call fee, data stays internal, no rate limits	GPU infra cost, quality depends on the model you pick, you own the ops
Cloud-managed (Vertex AI, SageMaker)	Managed infra running your model; cheaper than a per-token API at scale	Pricier than DIY

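Rough crossover point: ~10M embeddings/day. Below that, the hosted API usually wins on total cost of ownership (TCO); above it, self-hosted starts paying for itself. The exact point depends on API pricing and GPU utilization.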
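
The crossover is a break-even calculation between a per-embedding price and a mostly fixed daily cost. In the sketch below, the $1,000/day self-hosted figure in the usage note is an assumption: it is simply the value implied by a 10M/day crossover at the $0.0001 per-embedding price used in this article, not a measured number. The function names are hypothetical.

```python
def hosted_daily_cost(embeddings_per_day: float, price_per_embedding: float) -> float:
    """Hosted API: cost grows linearly with volume."""
    return embeddings_per_day * price_per_embedding


def self_hosted_daily_cost(embeddings_per_day: float,
                           fixed_daily_cost: float,
                           marginal_cost: float = 0.0) -> float:
    """Self-hosted: fixed GPU and ops cost plus an optional small marginal cost."""
    return fixed_daily_cost + embeddings_per_day * marginal_cost


def crossover_volume(fixed_daily_cost: float,
                     price_per_embedding: float,
                     marginal_cost: float = 0.0) -> float:
    """Daily volume at which hosted and self-hosted cost the same."""
    if price_per_embedding <= marginal_cost:
        raise ValueError("hosted must cost more per embedding for a crossover to exist")
    return fixed_daily_cost / (price_per_embedding - marginal_cost)
```

Under those assumptions, `crossover_volume(1000.0, 0.0001)` is about 10 million embeddings per day. Plugging in your own GPU and staffing costs moves the break-even point accordingly.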

06

Real-world pipelines

Notion AI

Per-workspace embeddings

Every page chunked + embedded on edit. Stored per-tenant. Powers semantic search inside Notion.

Stripe Docs / GitHub Search

Build-time embedding

Documentation embedded at build, vectors stored in repo or CDN. No streaming pipeline for static content.

Spotify recommendations

Track + user embeddings

Daily Spark job re-embeds tracks, user vectors updated on listen events. Vectors fed to ANN candidate generation.

Pinterest visual search

Image embeddings

Every pin is embedded on upload via an in-house ResNet-derived model. Sub-100ms reverse-image search at billions scale.

07

Used in problems

The recommendation algorithm depends on a fresh embedding pipeline. Typeahead can use embedding-based completion. The news feed scores posts by author and topic embeddings.
