You ship a new homepage. Did it improve conversion or hurt it? Eyeballing the dashboard for a week proves nothing — traffic patterns shift hour to hour, day to day. A/B testing is the discipline of statistically comparing variant A vs variant B with enough rigor to actually know.
A/B testing is different from feature flags (which are about safe rollout) and from canary deploys (which are about not breaking things). It is about measuring causal impact on user behavior. Companies such as Netflix, Booking, Airbnb, and Meta run hundreds of experiments simultaneously.
02
The architecture
Experiment definition. Variant A (control), variant B (treatment), traffic split (50/50 typical), success metric, expected effect size, max duration.
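Bucketing. Each user is deterministically hashed to A or B. The simplest scheme is user_id mod 100, but it puts the same users in the same bucket in every experiment. Hashing user_id + experiment_id keeps a user's assignment stable within an experiment while making assignments independent across experiments.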
Exposure logging. Every time a user enters the experiment, log (user_id, experiment_id, variant). This is the source of truth for "who saw what."
Outcome logging. Every event the experiment cares about — click, purchase, sign-up — flows into a metrics store.
Analysis. A statistical test (t-test, chi-square) on outcome differences, optionally with CUPED applied first for variance reduction. Did B beat A with p < 0.05?
Decision. Promote B, kill B, run longer, or iterate.
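The bucketing step above can be sketched in a few lines. This is a minimal illustration, not a production assignment service: the function name `assign_variant`, the `experiment_id:user_id` key format, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, treatment_pct: int = 50) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment).

    Hashing the experiment id together with the user id means the same user
    can land in different buckets in different experiments, while always
    getting the same variant within one experiment.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an integer and reduce to a bucket in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "B" if bucket < treatment_pct else "A"


# Same inputs always give the same variant.
print(assign_variant("user-42", "homepage-redesign"))
```

Because the assignment is a pure function of its inputs, no lookup table is needed; the exposure log remains the record of which users actually entered the experiment.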
03
Three statistical regimes
| Approach | How it decides | When to use |
|---|---|---|
| Fixed-horizon t-test | Run for N days, then compare means with p-value < 0.05 | Standard. Most experiments. Plan duration ahead. |
| Sequential testing | Continuously analyze; stop early when significant; control α across peeks | When time is precious; lets you stop bad experiments early |
| Multi-armed bandit | Allocate more traffic to better-performing variant continuously | When opportunity cost is high (ad bidding), but loses statistical rigor |
Most product teams should default to sequential testing. It's nearly as rigorous as fixed-horizon and stops bad experiments faster. Bandits are great for revenue-direct decisions; not great when you need to learn why something worked.
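For reference, the fixed-horizon comparison described above can be sketched for a conversion metric as a pooled two-proportion z-test (the large-sample counterpart of the t-test for binary outcomes). The function name and the example counts are hypothetical, and the sketch assumes the sample size was fixed before the data was looked at.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 500/10,000 conversions in A, 560/10,000 in B.
z, p = two_proportion_ztest(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Run once at the planned end of the experiment, this keeps the false-positive rate at the nominal α. Running it repeatedly and stopping on the first significant result does not.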
04
Deep dive — sample size math
The interview question every infra-aware PM asks: "How long do we need to run this?" Answer requires the formula:
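N per arm ≈ 16 × σ² / Δ²
where:
σ = standard deviation of the metric
Δ = minimum detectable effect (the smallest absolute change you care about)
The factor 16 assumes a two-sided α = 0.05 and 80% power: 2 × (1.96 + 0.84)² ≈ 15.7, rounded up.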
Practical example. Conversion baseline 5%, so σ² = p(1 − p) = 0.05 × 0.95 = 0.0475 (σ ≈ 0.218). You want to detect a 5% relative lift, which is 0.25 percentage points absolute, so Δ = 0.0025 and Δ² = 0.00000625. N = 16 × 0.0475 / 0.00000625 = 121,600 users per arm.
At 10k new users entering your test each day with a 50/50 split, each arm gains 5k users per day, so that's about 25 days. Most teams massively under-power experiments, get false negatives, and conclude "no effect" when there genuinely was one that the experiment was too small to detect.
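The arithmetic above can be reproduced with a short script. This is a sketch under stated assumptions: a binary conversion metric with variance p(1 − p), a two-sided test, and new users arriving every day. The helper names `sample_size_per_arm` and `days_needed` are made up for the example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per arm to detect the given relative lift in a rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = baseline * (1 - baseline)           # sigma^2 for a 0/1 metric
    delta = baseline * relative_lift               # absolute effect size
    return ceil(2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2)


def days_needed(n_per_arm: int, daily_users: int, arms: int = 2) -> int:
    """Days to fill every arm when daily traffic is split evenly across arms."""
    return ceil(n_per_arm * arms / daily_users)


n = sample_size_per_arm(baseline=0.05, relative_lift=0.05)
print(n, days_needed(n, daily_users=10_000))
```

With exact z-values the multiplier is about 15.7 rather than 16, so the result lands slightly below the rule-of-thumb figure of 121,600. Note that the required days scale with the total across both arms, since each arm only receives half of the daily traffic.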
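CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance by adjusting each user's outcome for that user's pre-experiment behavior. The reduction equals the squared correlation between the pre-experiment covariate and the outcome; 30-70% is a commonly quoted range. That can save weeks per test. Standard at FAANG.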
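A minimal sketch of the CUPED adjustment follows, assuming a single pre-experiment covariate per user (for example, the same metric measured before the experiment started). The function name `cuped_adjust` and the synthetic data are illustrative only. In a real analysis the coefficient would be estimated on both arms pooled, so that the same adjustment is applied to control and treatment.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return y adjusted by the pre-experiment covariate x.

    y: in-experiment metric per user; x: pre-experiment metric per user.
    theta = cov(x, y) / var(x) is the coefficient that minimizes variance.
    Assumes x is not constant (var(x) > 0).
    """
    mean_x, mean_y = fmean(x), fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    theta = cov_xy / var_x
    # Subtracting theta * (x - mean_x) leaves the mean of y unchanged.
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


# Synthetic demo: the outcome is strongly predicted by pre-period behavior.
rng = random.Random(0)
pre = [rng.gauss(10.0, 3.0) for _ in range(5000)]
post = [p + rng.gauss(0.0, 1.0) for p in pre]
adjusted = cuped_adjust(post, pre)
print(f"variance before: {variance(post):.2f}, after: {variance(adjusted):.2f}")
```

The adjusted values have the same mean as the originals but lower variance, so the difference between arms is estimated more precisely from the same number of users.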
The peeking problem
Looking at your test repeatedly and stopping when "significant" inflates the false-positive rate from 5% to roughly 14% with 5 equally spaced peeks, and it keeps climbing with more looks (about 25% by 20 peeks). This is the #1 mistake. Either commit to fixed-horizon and don't look, or use sequential tests with proper alpha-spending. Never eyeball-peek.
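The inflation can be checked with a small simulation of A/A tests, where there is no true effect. This sketch assumes equally sized batches of data between looks and an uncorrected threshold of |z| > 1.96 at every look; the function name and simulation count are arbitrary choices for the example.

```python
import random
from math import sqrt


def false_positive_rate(looks: int, sims: int = 20_000,
                        z_crit: float = 1.96, seed: int = 0) -> float:
    """Share of null experiments declared significant at any of `looks` peeks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        total = 0.0
        for k in range(1, looks + 1):
            # Each batch contributes an independent N(0, 1) increment, so the
            # cumulative z-statistic after k batches is total / sqrt(k).
            total += rng.gauss(0.0, 1.0)
            if abs(total / sqrt(k)) > z_crit:
                hits += 1
                break  # the experimenter stops at the first "significant" peek
    return hits / sims


for looks in (1, 5, 20):
    print(looks, round(false_positive_rate(looks), 3))
```

With a single look the rate stays near the nominal 5%. With repeated looks at the same threshold it rises well above that, which is the gap that alpha-spending methods are designed to close.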
05
Architecture patterns
Production A/B platform components:
Assignment service — sub-millisecond bucket lookup. Often runs at the edge.
Exposure logger — high-throughput event sink (Kafka), often deduped per (user, experiment).
Outcome pipeline — event stream joined to exposures via point-in-time logic.
Analysis engine — Spark / Snowflake jobs computing daily metrics, p-values, confidence intervals per experiment.
Experiment registry — config service holding all running experiments + their definitions.
UI dashboard — for PMs to see results, segment by user dimension, decide.
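The exposure dedup and point-in-time join from the component list can be sketched in memory as below. The tuple layouts, the function name `attribute_outcomes`, and the integer timestamps are assumptions for illustration. A production pipeline would run the same logic as a stream or warehouse join.

```python
from collections import defaultdict


def attribute_outcomes(exposures, outcomes):
    """Count exposed users and post-exposure events per (experiment, variant).

    exposures: iterable of (user_id, experiment_id, variant, ts)
    outcomes:  iterable of (user_id, event_name, ts)
    """
    # Dedupe: keep only the earliest exposure per (user, experiment).
    first_exposure = {}
    for user_id, exp_id, variant, ts in exposures:
        key = (user_id, exp_id)
        if key not in first_exposure or ts < first_exposure[key][1]:
            first_exposure[key] = (variant, ts)

    stats = defaultdict(lambda: {"users": 0, "events": 0})
    by_user = defaultdict(list)
    for (user_id, exp_id), (variant, ts) in first_exposure.items():
        stats[(exp_id, variant)]["users"] += 1
        by_user[user_id].append((exp_id, variant, ts))

    # Point-in-time join: an outcome counts only if it happened at or after
    # the user's first exposure to that experiment.
    for user_id, _event, ts in outcomes:
        for exp_id, variant, exposed_at in by_user.get(user_id, []):
            if ts >= exposed_at:
                stats[(exp_id, variant)]["events"] += 1
    return dict(stats)


exposures = [("u1", "exp1", "A", 10), ("u1", "exp1", "A", 20), ("u2", "exp1", "B", 15)]
outcomes = [("u1", "purchase", 5), ("u1", "purchase", 12),
            ("u2", "purchase", 30), ("u3", "purchase", 1)]
print(attribute_outcomes(exposures, outcomes))
```

In this example the purchase by u1 at time 5 is ignored because it precedes the exposure, and u3 is ignored because that user never entered the experiment.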
06
Real-world platforms
Statsig
Modern SaaS A/B + flags
Combined feature flag + A/B platform. Used by OpenAI, Notion, Brex. Supports CUPED, sequential testing.
Optimizely
Original commercial
Pioneered the space. Now enterprise-focused. Handles personalization + experimentation in one product.
Eppo / Split
Modern alternatives
Both focus on stats rigor (CUPED, sequential). Pricier than feature-flag-only tools but rigor justifies the cost.
Internal at FAANG
All build their own
Netflix XP, Meta Deltoid, LinkedIn T-Rex, Airbnb ERF. Each team customizes for their scale + culture.
07
Used in problems
News feed runs hundreds of A/B tests on ranker variants. E-commerce tests checkout flows, pricing displays, recs widgets. Recommendation algorithm uses A/B to evaluate model changes. Notification system tests message variants for engagement.