Embedding Generation Pipelines

01

Why this matters

You have 100 million product descriptions, 50 million user-generated images, and 1 billion documents. To use them with vector search, each one needs an embedding: a vector of roughly 384 to 1536 dimensions. Calling an embedding model API for a single item is straightforward. Doing it efficiently for hundreds of millions of items, keeping embeddings fresh as the source changes, and staying within budget is an engineering problem.

An embedding generation pipeline is the data infrastructure that produces, updates, and stores vectors at scale. It differs from a feature store, which targets structured features.

02

The four-stage pipeline

  1. Source: text, images, or audio in S3, a database, or an event stream.
  2. Chunk: split long text into pieces of 200-1000 tokens, sized to fit model context windows and to keep each piece topically focused for retrieval. Images: resize or crop. Audio: split on silence.
  3. Embed: call the model (OpenAI, Cohere, or in-house). One forward pass per chunk yields one vector.
  4. Store: write vectors and metadata to a vector DB, which updates its index as writes arrive.
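The four stages can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `fake_embed` is a hypothetical placeholder that derives a vector from a hash instead of calling a model, chunking counts whitespace-separated words rather than tokens, and the "vector DB" is a plain dict.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class VectorRecord:
    item_id: str
    chunk_index: int
    text: str
    vector: list


def chunk_text(text, max_words=200):
    """Stage 2: fixed-size word windows (a stand-in for token-based chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def fake_embed(chunk, dim=8):
    """Stage 3 placeholder (hypothetical): deterministic pseudo-vector from a hash.
    Replace with a real model or API call."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def run_pipeline(source, store):
    """Stages 1-4: read items, chunk, embed, upsert keyed by (item_id, chunk_index)."""
    written = 0
    for item_id, text in source.items():
        for idx, chunk in enumerate(chunk_text(text)):
            store[(item_id, idx)] = VectorRecord(item_id, idx, chunk, fake_embed(chunk))
            written += 1
    return written
```

Keying the store by `(item_id, chunk_index)` makes a re-run an upsert rather than a duplicate write, which matters once the same item is re-embedded after an edit.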

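At $0.0001 per embedding, a 100M-item backfill costs 100M × $0.0001 = $10,000. Treat that rate as illustrative: hosted APIs such as OpenAI's text-embedding-3-small bill per token, so the real per-chunk price depends on chunk length. Incremental cost scales with the number of changed items.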
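The backfill arithmetic is simple enough to keep as a helper. The sketch below assumes a flat per-embedding price; `per_embedding_price` shows how to derive that number when a provider bills per token instead. All prices passed in are inputs you supply, not quoted rates.

```python
def backfill_cost(items, chunks_per_item, price_per_embedding):
    """Total cost of embedding every chunk of every item once."""
    return items * chunks_per_item * price_per_embedding


def per_embedding_price(tokens_per_chunk, price_per_million_tokens):
    """Convert per-token billing into an effective per-chunk price."""
    return tokens_per_chunk * price_per_million_tokens / 1_000_000


# 100M single-chunk items at $0.0001 each, as in the text.
print(round(backfill_cost(100_000_000, 1, 0.0001), 2))
```

Because cost is linear in chunk count, chunking decisions (size, overlap) feed directly into the bill.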

~10 ms: embedding API call (one chunk)
~1000 chunks: batch size for throughput
$0.0001: per-chunk embedding cost
384-1536: typical embedding dim

03

Initial backfill vs incremental update

Backfill

One-shot, massive parallelism

A Spark or Beam job reads the source, splits items into chunks, groups the chunks into batches, calls the embedding API in parallel (thousands of concurrent requests), and writes the vectors. Expect hours to days for 100M items and a one-time cost in the thousands of dollars.
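On a single machine the same shape (batch, fan out, collect) looks like the sketch below. It uses a thread pool instead of Spark or Beam, and `embed_batch` is a hypothetical stand-in for one batched API request; a real job would add retries, rate limiting, and a bound on in-flight work.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of up to `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def embed_batch(chunks):
    """Hypothetical stand-in for one batched embedding request."""
    return [[float(len(c)), float(sum(map(ord, c)) % 97)] for c in chunks]


def backfill(chunks, batch_size=1000, workers=8):
    vectors = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order, so vectors line up with chunks.
        # It also submits every batch up front, which is fine for a sketch
        # but would need bounding for a very large corpus.
        for result in pool.map(embed_batch, batched(chunks, batch_size)):
            vectors.extend(result)
    return vectors
```

Threads suit this workload because the time goes to waiting on network I/O, not to Python-level computation.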

Incremental

CDC + streaming

CDC from source DB → Kafka → embed worker → vector DB. New / updated items get re-embedded continuously. Steady-state cost is per-update.
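The embed worker at the end of that chain can be sketched as a loop over change events. The event shape (`op`, `id`, `text`) is an assumption made for illustration, the queue is an in-memory `deque` rather than Kafka, and `embed` is a hypothetical placeholder for the model call. The content hash is the important part: it skips replayed events and updates that did not touch the embedded text.

```python
import hashlib
from collections import deque


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed(text):
    """Hypothetical stand-in for the model call."""
    return [float(len(text))]


def process_events(events, vector_db, seen_hashes):
    """Drain change events; re-embed only items whose text actually changed."""
    embedded = 0
    while events:
        event = events.popleft()
        item_id = event["id"]
        if event["op"] == "delete":
            vector_db.pop(item_id, None)
            seen_hashes.pop(item_id, None)
            continue
        digest = content_hash(event["text"])
        if seen_hashes.get(item_id) == digest:
            continue  # replayed event or no change to the embedded text
        vector_db[item_id] = embed(event["text"])
        seen_hashes[item_id] = digest
        embedded += 1
    return embedded
```

Since steady-state cost is per update, every event the hash check filters out is an embedding call you do not pay for.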

Practical pattern

One-time backfill on Day 0; incremental from Day 1. Most pipelines run both side by side — batch for backfilling new fields/models, streaming for source changes.

04

Deep dive — chunking is the real problem

"Embed the document" is wrong for anything long. A single vector cannot represent everything a long document says, and embedding models cap input length. Chunk first, embed each chunk, and retrieve the most relevant chunks at query time. How you chunk determines retrieval quality.

Naive: split every 500 tokens. Bad — sentences get cut mid-thought.

Better: recursive splitting on natural boundaries — paragraphs first, then sentences, then tokens if needed. LangChain's RecursiveCharacterTextSplitter does this.
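The recursive idea fits in one function. This is a from-scratch sketch of the technique, not LangChain's implementation: it measures length in whitespace-separated words rather than tokens, and the separator list is an assumption.

```python
def recursive_split(text, max_words, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse into pieces still too long."""
    if len(text.split()) <= max_words:
        return [text.strip()] if text.strip() else []
    if not separators:
        # No natural boundary left: hard cut on word count.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate.split()) <= max_words:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current.strip())
            current = ""
        if len(piece.split()) <= max_words:
            current = piece
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, max_words, rest))
    if current:
        chunks.append(current.strip())
    return chunks
```

Note that separator text falling exactly on a chunk boundary is dropped, so a sentence that ends a chunk can lose its trailing period. Paragraphs that fit stay whole, and only oversized ones fall through to sentence or word splits.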

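Best: semantic chunking. Embed sentence by sentence, look for jumps in cosine distance between adjacent sentences to find topic shifts, and chunk at the jumps. Slower, because every sentence needs its own embedding pass, but retrieval improves by roughly 20%, a figure that varies with corpus and evaluation metric.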
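A toy version of the distance-jump idea is shown below. `toy_embed` is a bag-of-words stand-in for a real sentence-embedding model, and the fixed `threshold` is an assumption; real implementations often derive the cut-off from the distribution of distances in the document.

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a sentence-embedding model: lowercase bag-of-words counts."""
    return Counter(w.strip(".,!?").lower() for w in sentence.split())


def cosine_distance(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(sentences, threshold=0.8):
    """Start a new chunk wherever adjacent sentences are farther apart than `threshold`."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = toy_embed(sentences[0])
    for sentence in sentences[1:]:
        cur = toy_embed(sentence)
        if cosine_distance(prev, cur) > threshold:
            chunks.append([])  # topic shift: open a new chunk
        chunks[-1].append(sentence)
        prev = cur
    return chunks
```

With a real model the structure stays the same; only `toy_embed` and the distance function over dense vectors change.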

Chunk overlap (~50-100 tokens) helps too. Context near a chunk boundary appears in both neighbouring chunks, and retrieval picks whichever is more relevant. For 500-token chunks, storage and embedding cost rise by roughly 10-20%.
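Overlap is a sliding window whose step is smaller than its size. A minimal sketch, assuming the input is already tokenized into a list:

```python
def sliding_chunks(tokens, size=500, overlap=50):
    """Fixed-size windows where each window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

For 1,000 tokens with `size=500` and `overlap=50`, this stores 1,100 tokens across three chunks, which is where the extra storage cost comes from.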

Real cost breakdown

"100M docs × 5 chunks each = 500M embeddings × $0.0001 = $50k initial. Recursive chunking with 100-token overlap. Stored in pgvector with HNSW indexing. Daily incremental embeds via CDC: ~$1,000/day for 2M new docs (2M × 5 chunks × $0.0001)."

05

Hosted vs self-hosted embedding model

Choice | Pros | Cons
OpenAI / Cohere API | Best quality, no infra, well-tested | Per-call cost, data leaves your network, rate limits
Self-hosted (sentence-transformers, BGE) | Free per-call, data stays internal, no rate limits | GPU infra cost, slightly lower quality, you own the ops
Cloud-managed (Vertex AI, SageMaker) | Hybrid: managed infra, your model | Pricier than DIY but cheaper than per-token API at scale

Crossover point: ~10M embeddings/day. Below, hosted API wins on TCO. Above, self-hosted starts paying for itself.
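The crossover is a break-even calculation: the API bill grows with volume, while self-hosting is mostly a fixed daily cost. The sketch below makes that explicit. The $1,000/day self-hosting figure is a hypothetical input, chosen only because it reproduces a ~10M/day crossover at $0.0001 per embedding; plug in your own GPU and ops costs.

```python
def daily_api_cost(embeddings_per_day, price_per_embedding):
    return embeddings_per_day * price_per_embedding


def break_even_volume(fixed_daily_self_hosted, price_per_embedding, marginal_self_hosted=0.0):
    """Daily volume at which the hosted-API bill equals the self-hosted bill."""
    saving_per_embedding = price_per_embedding - marginal_self_hosted
    if saving_per_embedding <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_daily_self_hosted / saving_per_embedding


# Hypothetical: $1,000/day of GPU and ops cost vs. $0.0001 per API embedding.
print(round(break_even_volume(1_000, 0.0001)))
```

The result is sensitive to both inputs, so a cheaper API tier or poor GPU utilization moves the crossover substantially.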

06

Real-world pipelines

Notion AI

Per-workspace embeddings

Every page chunked + embedded on edit. Stored per-tenant. Powers semantic search inside Notion.

Stripe Docs / GitHub Search

Build-time embedding

Documentation embedded at build, vectors stored in repo or CDN. No streaming pipeline for static content.

Spotify recommendations

Track + user embeddings

Daily Spark job re-embeds tracks, user vectors updated on listen events. Vectors fed to ANN candidate generation.

Pinterest visual search

Image embeddings via in-house model

Every pin is embedded on upload by an in-house ResNet-derived model. Sub-100ms reverse-image search across billions of pins.

07

Used in problems

The recommendation algorithm depends on a fresh embedding pipeline. Typeahead can use embedding-based completion. The news feed scores posts by author and topic embeddings.
