A/B Testing Platform

01

Why this matters

You ship a new homepage. Did it improve conversion or hurt it? Eyeballing the dashboard for a week proves nothing — traffic patterns shift hour to hour, day to day. A/B testing is the discipline of statistically comparing variant A vs variant B with enough rigor to actually know.

A/B testing is different from feature flags, which are about safe rollout, and from canary deploys, which are about not breaking things. It is about measuring causal impact on user behavior. Companies such as Netflix, Booking.com, Airbnb, and Meta run hundreds of experiments simultaneously.

02

The architecture

Experiment definition. Variant A (control), variant B (treatment), traffic split (50/50 typical), success metric, expected effect size, max duration.
Bucketing. Each user is deterministically assigned to A or B, for example by hash(user_id) mod 100. Hashing user_id + experiment_id is better: a user always sees the same variant within one experiment, while assignments stay independent across experiments.
Exposure logging. Every time a user enters the experiment, log (user_id, experiment_id, variant). This is the source of truth for "who saw what."
Outcome logging. Every event the experiment cares about — click, purchase, sign-up — flows into a metrics store.
Analysis. Statistical test (t-test or chi-square, optionally with CUPED for variance reduction) on outcome differences. Did B beat A with p < 0.05?
Decision. Promote B, kill B, run longer, or iterate.
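The bucketing and exposure-logging steps above can be sketched in a few lines of Python. This is a minimal illustration, not any particular platform's implementation: the function names `assign_variant` and `exposure_record`, the choice of SHA-256, and the `treatment_pct` parameter are all assumptions made for the example.

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, treatment_pct: int = 50) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment).

    Hashing experiment_id together with user_id keeps a user's variant
    stable within one experiment and uncorrelated across experiments.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "B" if bucket < treatment_pct else "A"

def exposure_record(user_id: str, experiment_id: str) -> tuple:
    """Build the (user_id, experiment_id, variant) tuple to log on exposure."""
    return (user_id, experiment_id, assign_variant(user_id, experiment_id))

if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(exposure_record("user-123", "homepage-redesign"))
```

Because the assignment is a pure function of the two IDs, the assignment service needs no per-user storage; the exposure log is still required to know who actually entered the experiment.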

03

Three statistical regimes

Approach	How it decides	When to use
Fixed-horizon t-test	Run for N days, then compare means with p-value < 0.05	Standard. Most experiments. Plan duration ahead.
Sequential testing	Continuously analyze; stop early when significant; control α across peeks	When time is precious; lets you stop bad experiments early
Multi-armed bandit	Allocate more traffic to better-performing variant continuously	When opportunity cost is high (ad bidding) — but loses statistical rigor

Most product teams should default to sequential testing. It's nearly as rigorous as fixed-horizon and stops bad experiments faster. Bandits are great for revenue-direct decisions; not great when you need to learn why something worked.
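For the fixed-horizon row of the table, the comparison at the end of the run can be as simple as a two-proportion z-test on conversion counts. The sketch below uses only the Python standard library; the function name and the example counts are hypothetical, and a real analysis engine would typically add confidence intervals and variance reduction on top.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates.

    Returns (absolute lift of B over A, z statistic, p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value

if __name__ == "__main__":
    # Hypothetical counts: 500/10,000 conversions in A, 600/10,000 in B.
    lift, z, p = two_proportion_z_test(500, 10_000, 600, 10_000)
    print(f"lift={lift:.4f} z={z:.2f} p={p:.4f}")
```

This test is only valid if it is run once, at the planned end of the experiment; running it repeatedly on accumulating data is what the sequential row of the table is designed to handle.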

04

Deep dive — sample size math

The interview question every infra-aware PM asks: "How long do we need to run this?" The answer requires this rule-of-thumb formula, which assumes a two-sided α of 0.05 and 80% power:

N per arm ≈ 16 × σ² / Δ²

where:
  σ = standard deviation of the metric
  Δ = minimum detectable effect (the smallest change you care about)

Practical example. Conversion baseline 5%, so σ² = p(1 − p) = 0.05 × 0.95 = 0.0475 (σ ≈ 0.218). You want to detect a 5% relative lift, i.e. 0.25 percentage points absolute, so Δ = 0.0025. N = 16 × 0.0475 / 0.00000625 ≈ 121,600 users per arm.
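The arithmetic above can be checked with a short script. The constant 16 is a rounded form of 2 × (z for two-sided α + z for power)², which is about 15.7 for α = 0.05 and 80% power; the sketch below computes both the rule of thumb and the unrounded version. The function names are made up for this example.

```python
from statistics import NormalDist

def rule_of_thumb_n(sigma_sq: float, delta: float) -> float:
    """N per arm using the rounded constant 16."""
    return 16 * sigma_sq / delta ** 2

def exact_n(sigma_sq: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """N per arm using 2 * (z_{alpha/2} + z_{power})^2 instead of 16."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = std_normal.inv_cdf(power)          # about 0.84 for 80% power
    return 2 * (z_alpha + z_power) ** 2 * sigma_sq / delta ** 2

if __name__ == "__main__":
    baseline = 0.05
    sigma_sq = baseline * (1 - baseline)  # variance of a Bernoulli metric
    delta = baseline * 0.05               # 5% relative lift = 0.0025 absolute
    print(round(rule_of_thumb_n(sigma_sq, delta)))
    print(round(exact_n(sigma_sq, delta)))
```

The unrounded constant gives a slightly smaller N (roughly 119,000 here), so the rule of thumb errs on the conservative side.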

At 10k daily users entering the test, a 50/50 split puts 5k new users into each arm per day, so that's about 24 days (assuming each day's users are new to the experiment). Most teams massively under-power experiments, get false negatives, and conclude "no effect" when there genuinely was one that the experiment was too small to detect.

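CUPED (Controlled-experiment Using Pre-Experiment Data) adjusts each user's outcome for that user's pre-experiment behavior. It can reduce variance by 30-70%, depending on how strongly pre-experiment behavior correlates with the metric, which can save weeks per test. Standard at FAANG.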
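A minimal sketch of the CUPED adjustment, assuming a single pre-experiment covariate X per user (for example, the same metric measured before the test started). The adjusted outcome is Y − θ(X − mean(X)) with θ = cov(X, Y) / var(X). The data here is synthetic and the function name is hypothetical; production systems usually estimate θ on pooled data from both arms.

```python
import random
from statistics import fmean, variance

def cuped_adjust(y: list, x: list) -> list:
    """Return CUPED-adjusted outcomes: y_i - theta * (x_i - mean(x))."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic users: in-experiment metric is correlated with pre-experiment metric.
    pre = [rng.gauss(10, 3) for _ in range(5000)]
    post = [0.8 * xi + rng.gauss(0, 2) for xi in pre]
    adjusted = cuped_adjust(post, pre)
    reduction = 1 - variance(adjusted) / variance(post)
    print(f"variance reduction: {reduction:.0%}")
```

The adjustment leaves the mean of the metric unchanged and removes the part of the variance explained by the covariate, so the achievable reduction is bounded by the squared correlation between X and Y.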

The peeking problem

Looking at your test repeatedly and stopping as soon as it looks "significant" inflates the false-positive rate from 5% to roughly 14% with 5 peeks, and toward 25% with around 20. This is the #1 mistake. Either commit to fixed-horizon and don't look, or use sequential tests with proper alpha-spending. NEVER eyeball-peek.
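The inflation is easy to reproduce with a small A/A simulation, where there is no real difference and every "significant" result is a false positive. The sketch below is a simplification under stated assumptions: data arrives in equal-sized batches, the test statistic is a z-score checked against ±1.96 after each batch, and the experiment stops at the first crossing. The function name and parameters are illustrative.

```python
import random
from math import sqrt

def false_positive_rate(peeks: int, trials: int = 20000,
                        z_crit: float = 1.96, seed: int = 0) -> float:
    """Fraction of A/A experiments declared significant at any of `peeks` looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        cumulative = 0.0
        for k in range(1, peeks + 1):
            # Under the null, each equal-sized batch adds a standard normal
            # increment to the running (unnormalized) test statistic.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / sqrt(k)  # z-score using all data seen so far
            if abs(z) > z_crit:
                hits += 1  # stopped early on a spurious "win"
                break
    return hits / trials

if __name__ == "__main__":
    for looks in (1, 5, 20):
        print(looks, false_positive_rate(looks))
```

With a single look the rate stays near the nominal 5%; with repeated looks it climbs well above that, which is the gap that alpha-spending corrections are meant to close.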

05

Architecture patterns

Production A/B platform components:

Assignment service — sub-millisecond bucket lookup. Often runs at the edge.
Exposure logger — high-throughput event sink (Kafka), often deduped per (user, experiment).
Outcome pipeline — event stream joined to exposures via point-in-time logic.
Analysis engine — Spark / Snowflake jobs computing daily metrics, p-values, confidence intervals per experiment.
Experiment registry — config service holding all running experiments + their definitions.
UI dashboard — for PMs to see results, segment by user dimension, decide.
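The exposure-dedup and point-in-time join in the list above can be sketched in memory as follows. This is an illustration only: a production pipeline would do this over a stream or in a warehouse, and the tuple layouts, function names, and the `event` filter are assumptions for the example. The key rule is that an outcome counts for an experiment only if it happens at or after the user's first exposure.

```python
from collections import defaultdict

def first_exposures(exposures):
    """Dedupe exposures to the earliest one per (user_id, experiment_id).

    exposures: iterable of (user_id, experiment_id, variant, timestamp).
    """
    first = {}
    for user_id, experiment_id, variant, ts in exposures:
        key = (user_id, experiment_id)
        if key not in first or ts < first[key][1]:
            first[key] = (variant, ts)
    return first

def conversions_by_variant(exposures, outcomes, experiment_id, event="purchase"):
    """Return {variant: (converted_users, exposed_users)} for one experiment.

    outcomes: iterable of (user_id, event_name, timestamp).
    """
    first = first_exposures(exposures)
    exposed = defaultdict(set)
    converted = defaultdict(set)
    for (user_id, exp_id), (variant, _ts) in first.items():
        if exp_id == experiment_id:
            exposed[variant].add(user_id)
    for user_id, event_name, ts in outcomes:
        hit = first.get((user_id, experiment_id))
        # Point-in-time rule: ignore outcomes that precede first exposure.
        if event_name == event and hit is not None and ts >= hit[1]:
            converted[hit[0]].add(user_id)
    return {v: (len(converted[v]), len(exposed[v])) for v in exposed}
```

Counting distinct users rather than raw events keeps the unit of analysis aligned with the unit of assignment, which is what the statistical tests assume.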

06

Real-world platforms

Statsig

Modern SaaS A/B + flags

Combined feature flag + A/B platform. Used by OpenAI, Notion, Brex. Supports CUPED, sequential testing.

Optimizely

Original commercial

Pioneered the space. Now enterprise-focused. Handles personalization + experimentation in one product.

Eppo / Split

Modern alternatives

Both focus on stats rigor (CUPED, sequential). Pricier than feature-flag-only tools but rigor justifies the cost.

Internal at FAANG

All build their own

Netflix XP, Meta Deltoid, LinkedIn T-Rex, Airbnb ERF. Each team customizes for their scale + culture.

07

Used in problems

A news feed runs hundreds of A/B tests on ranker variants. An e-commerce site tests checkout flows, pricing displays, and recommendation widgets. A recommendation system uses A/B tests to evaluate model changes. A notification system tests message variants for engagement.
