You have 100 million product descriptions, 50 million user-generated images, 1 billion documents. To use them with vector search, every one needs an embedding — a 384–1536-dim vector. Calling an embedding model API for each is straightforward; doing it efficiently for millions+ of items, keeping embeddings fresh as the source updates, and not blowing your budget — that's an engineering problem.
An embedding generation pipeline is the data infrastructure that produces, updates, and stores vectors at scale. It differs from a feature store, which targets structured features.
02
The four-stage pipeline
Source: text/image/audio in S3, a database, or a stream of events.
Chunking: split long text into 200-1000 token pieces, sized to fit the embedding model's input limit and to keep each piece topically focused for retrieval. Images: resize / crop. Audio: split at silences.
Embed: call the model (OpenAI, Cohere, in-house). One forward pass per chunk → one vector.
Store: write vectors + metadata to the vector DB. The index updates on the fly.
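The four stages can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `VectorStore` is an in-memory stand-in for a vector DB, `fake_embed` is a hash-based placeholder for a real model call, and chunk size is counted in words rather than tokens. All names here are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class VectorStore:
    """In-memory stand-in for a vector DB: row id -> (vector, metadata)."""
    rows: dict = field(default_factory=dict)

    def upsert(self, row_id, vector, metadata):
        self.rows[row_id] = (vector, metadata)


def chunk(text, max_words=50):
    """Stage 2: split text into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def fake_embed(chunk_text, dim=8):
    """Stage 3 placeholder: a deterministic pseudo-vector derived from a hash.
    A real pipeline would call an embedding model here."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def run_pipeline(source, store, max_words=50):
    """Stage 1 iterates the source; stage 4 writes vectors plus metadata."""
    for doc_id, text in source.items():
        for n, piece in enumerate(chunk(text, max_words)):
            store.upsert(f"{doc_id}#{n}", fake_embed(piece), {"doc_id": doc_id, "chunk": n})


source = {"doc-1": "word " * 120, "doc-2": "short text"}
store = VectorStore()
run_pipeline(source, store)
print(len(store.rows))  # 3 chunks for doc-1 plus 1 for doc-2
```

Keying each row by document id plus chunk index means re-running the pipeline overwrites existing rows instead of duplicating them.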
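For 100M items at an assumed $0.0001 per embedding, the initial backfill costs 100M × $0.0001 = $10,000. Hosted APIs such as OpenAI's text-embedding-3-small bill per token, so the real per-embedding figure depends on chunk length. Ongoing cost scales with the number of changed items.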
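The backfill arithmetic is easy to parameterize. The sketch below expresses the $0.0001 per-embedding figure as $100 per million embeddings to keep the floating-point math exact; `chunks_per_item` is a hypothetical knob for sources where each item splits into several chunks.

```python
def embedding_cost(items, chunks_per_item=1, price_per_million=100.0):
    """Dollar cost of embedding `items` items.

    price_per_million=100.0 corresponds to $0.0001 per embedding,
    the illustrative rate used in the text (not a quoted vendor price).
    """
    embeddings = items * chunks_per_item
    return embeddings / 1_000_000 * price_per_million


backfill = embedding_cost(100_000_000)                       # one chunk per item
backfill_chunked = embedding_cost(100_000_000, chunks_per_item=5)
per_million_changed = embedding_cost(1_000_000)              # incremental cost per 1M changed items

print(backfill, backfill_chunked, per_million_changed)
```

Chunking multiplies the bill directly, so the chunks-per-item ratio is worth estimating on a sample before committing to a backfill.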
~10 ms: embedding API call (one chunk)
~1000 chunks: batch size for throughput
$0.0001: per-chunk embedding cost
512-1536: typical embedding dim
03
Initial backfill vs incremental update
Backfill
One-shot, massive parallelism
A Spark/Beam job reads the source, splits items into chunks, groups the chunks into batches, calls the embedding API in parallel (thousands of concurrent requests), and writes the vectors. Expect hours to days for 100M items. Cost: thousands of dollars, one time.
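On a single machine, the same batch-and-parallelize idea can be sketched with a thread pool. This is an illustration only: `embed_batch` is a hypothetical stand-in for one batched embedding API request, and a real Spark/Beam job would distribute the batches across workers and add retries and rate limiting.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_batch(texts):
    """Hypothetical stand-in for one batched embedding API request."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


def backfill(chunks, batch_size=1000, max_workers=8):
    """Embed all chunks with concurrent batched requests, preserving input order."""
    batches = list(batched(chunks, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order, so vectors line up with chunks.
        results = pool.map(embed_batch, batches)
        return [vec for batch_vectors in results for vec in batch_vectors]


chunks = [f"chunk {i}" for i in range(2500)]
vectors = backfill(chunks, batch_size=1000, max_workers=4)
print(len(vectors))
```

Threads suit this workload because the time is spent waiting on network calls; `max_workers` then acts as a simple cap on concurrent requests.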
Incremental
CDC + streaming
Change data capture (CDC) from the source DB → Kafka → embed worker → vector DB. New and updated items are re-embedded continuously. Steady-state cost scales with the update rate.
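The embed worker at the end of that chain can be sketched as a small event handler. The event shape (`op`, `id`, `text`), the in-memory dicts standing in for the vector DB, and the `embed` function are all assumptions for illustration; a real worker would consume from Kafka and write to an actual store. The content hash is one way to avoid paying for re-embedding when a change event did not touch the text.

```python
import hashlib
from collections import deque


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed(text):
    """Hypothetical stand-in for the embedding model call."""
    return [float(len(text))]


class EmbedWorker:
    def __init__(self):
        self.vectors = {}     # item id -> vector (stand-in for the vector DB)
        self.hashes = {}      # item id -> hash of the text that was last embedded
        self.embed_calls = 0  # counts model calls, to show what the hash check saves

    def handle(self, event):
        item_id = event["id"]
        if event["op"] == "delete":
            # Remove the stale vector so it can no longer be retrieved.
            self.vectors.pop(item_id, None)
            self.hashes.pop(item_id, None)
            return
        digest = content_hash(event["text"])
        if self.hashes.get(item_id) == digest:
            return  # text unchanged since the last embed; skip the model call
        self.vectors[item_id] = embed(event["text"])
        self.hashes[item_id] = digest
        self.embed_calls += 1


events = deque([
    {"op": "upsert", "id": "p1", "text": "red shoes"},
    {"op": "upsert", "id": "p2", "text": "blue hat"},
    {"op": "upsert", "id": "p1", "text": "red shoes"},          # duplicate, skipped
    {"op": "upsert", "id": "p1", "text": "red running shoes"},  # real change
    {"op": "delete", "id": "p2"},
])
worker = EmbedWorker()
while events:
    worker.handle(events.popleft())
print(worker.embed_calls, sorted(worker.vectors))
```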
Practical pattern
One-time backfill on Day 0; incremental from Day 1. Most pipelines run both side by side — batch for backfilling new fields/models, streaming for source changes.
04
Deep dive — chunking is the real problem
"Embed the document" is wrong. Documents are too long for a single embedding to capture meaning. You chunk first, embed each chunk, retrieve the most relevant chunks at query time. How you chunk determines retrieval quality.
Naive: split every 500 tokens. Bad — sentences get cut mid-thought.
Better: recursive splitting on natural boundaries — paragraphs first, then sentences, then tokens if needed. LangChain's RecursiveCharacterTextSplitter does this.
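Best: semantic chunking. Embed sentence by sentence and look for jumps in cosine distance between adjacent sentences, which signal topic shifts; chunk at the jumps. Slower, since every sentence needs its own embedding before chunking, but retrieval improves by roughly 20% (a rough figure; the gain depends on the corpus and the queries).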
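The semantic approach can be sketched as follows. To stay self-contained, `toy_embed` uses bag-of-words counts as a stand-in for a real sentence-embedding model, and the 0.8 distance threshold is an arbitrary value chosen for this toy data; in practice the threshold is tuned, often as a percentile of the observed distances.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Bag-of-words counts as a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def semantic_chunks(sentences, threshold=0.8):
    """Start a new chunk wherever adjacent sentences are farther apart than threshold."""
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sentence])    # distance jump: treat as a topic shift
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse on the mat.",
    "Quarterly revenue grew fast.",
    "Revenue growth beat quarterly forecasts.",
]
result = semantic_chunks(sentences)
for piece in result:
    print(piece)
```

Here the two cat sentences share most of their words and stay together, while the jump to the revenue sentences has no overlap at all and triggers a split.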
Chunk overlap (~50-100 tokens) helps too: context near a chunk boundary appears in both neighbouring chunks, so retrieval can pick whichever is more relevant. For ~500-token chunks the storage cost is roughly +10-20%.
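Recursive splitting plus overlap can be sketched without any library. This is not LangChain's implementation, just the same idea under simplifying assumptions: sizes are counted in words rather than tokens, the boundaries tried are paragraphs and then sentences, and overlap is added by prefixing each chunk with the tail of the previous one. All function names are hypothetical.

```python
import re

# Coarsest boundary first: blank-line paragraphs, then sentence ends.
SEPARATORS = (r"\n\s*\n", r"(?<=[.!?])\s+")


def split_pieces(text, max_words, separators=SEPARATORS):
    """Return pieces of at most max_words words, using the coarsest boundary that fits."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)] if words else []
    if not separators:
        # Last resort: hard split on word count.
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    pieces = []
    for part in re.split(separators[0], text):
        pieces.extend(split_pieces(part, max_words, separators[1:]))
    return pieces


def pack(pieces, max_words):
    """Greedily merge adjacent pieces into chunks of at most max_words words."""
    chunks, current = [], []
    for piece in pieces:
        size = sum(len(p.split()) for p in current)
        if current and size + len(piece.split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


def add_overlap(chunks, overlap_words):
    """Prefix every chunk after the first with the last overlap_words words of its predecessor."""
    out = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        tail = prev.split()[-overlap_words:] if overlap_words > 0 else []
        out.append(" ".join(tail + cur.split()))
    return out


def chunk_text(text, max_words, overlap_words=0):
    return add_overlap(pack(split_pieces(text, max_words), max_words), overlap_words)


text = ("Alpha one two three. Beta four five six.\n\n"
        "Gamma seven eight nine. Delta ten eleven twelve.")
by_paragraph = chunk_text(text, max_words=8, overlap_words=2)
by_sentence = chunk_text(text, max_words=5)
print(by_paragraph)
print(by_sentence)
```

With a budget of 8 words each paragraph fits whole, so the split lands on the paragraph break; at 5 words the splitter falls back to sentence boundaries. Note that the overlap is added on top of the budget, so overlapped chunks can exceed `max_words` by up to `overlap_words`.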
Real cost breakdown
"100M docs × 5 chunks each = 500M embeddings × $0.0001 = $50k initial. Recursive chunking with 100-token overlap. Stored in pgvector with HNSW indexing. Daily incremental embeds via CDC: ~$1,000/day for 2M new docs (10M chunks at the same rate)."
05
Hosted vs self-hosted embedding model
Choice | Pros | Cons
OpenAI / Cohere API | Best quality, no infra, well-tested | Per-call cost, data leaves your network, rate limits
Self-hosted (sentence-transformers, BGE) | Free per-call, data stays internal, no rate limits | GPU infra cost, slightly lower quality, you own the ops
Cloud-managed (Vertex AI, SageMaker) | Hybrid: managed infra, your model | Pricier than DIY but cheaper than per-token API at scale
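Crossover point: roughly 10M embeddings/day, as a rule of thumb. Below that, the hosted API wins on TCO; above it, self-hosted starts paying for itself. The exact threshold depends on GPU pricing, utilization, and the API's per-token rate.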
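A break-even estimate makes the crossover concrete. The sketch below assumes a hypothetical fixed cost of $1,000/day for self-hosted GPU infrastructure and ops (a number picked so the result lands on the ~10M/day figure, not a measured one) and a hosted price of $100 per million embeddings, i.e. the $0.0001 per-embedding rate used earlier.

```python
def break_even_per_day(self_hosted_fixed_per_day,
                       hosted_price_per_million,
                       self_hosted_price_per_million=0.0):
    """Daily embedding volume above which self-hosting is cheaper than the hosted API.

    Returns None when the hosted API is not more expensive per embedding,
    in which case self-hosting never pays off on cost alone.
    """
    margin = hosted_price_per_million - self_hosted_price_per_million
    if margin <= 0:
        return None
    return self_hosted_fixed_per_day / margin * 1_000_000


volume = break_even_per_day(self_hosted_fixed_per_day=1000.0, hosted_price_per_million=100.0)
print(volume)
```

Plugging in real GPU quotes and the current API rate moves the threshold substantially, which is why it is worth recomputing rather than quoting.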
06
Real-world pipelines
Notion AI
Per-workspace embeddings
Every page chunked + embedded on edit. Stored per-tenant. Powers semantic search inside Notion.
Stripe Docs / GitHub Search
Build-time embedding
Documentation embedded at build, vectors stored in repo or CDN. No streaming pipeline for static content.
Spotify recommendations
Track + user embeddings
Daily Spark job re-embeds tracks, user vectors updated on listen events. Vectors fed to ANN candidate generation.
Pinterest visual search
Image embeddings via an in-house model
Every pin embedded on upload via in-house ResNet-derived model. Sub-100ms reverse-image search at billions scale.
07
Used in problems
Recommendation algorithm depends on a fresh embedding pipeline. Typeahead can leverage embedding-based completion. News feed scores posts by author/topic embeddings.