Real-time Fraud Detection

A system that scores every incoming payment / account action in under 100 ms and decides: approve, review, or block. The hard parts: a low-latency feature store that can answer "how many card swipes has this user made in the last 5 minutes?" in < 10 ms; an online ML model co-located with features returning a risk score in single-digit milliseconds; and a feedback loop that closes the gap between a human-confirmed chargeback (days later) and retraining the model, without giving fraudsters a week's head start. Stripe Radar, PayPal, and Visa all run systems like this at 10K+ tx/sec.

⚡ Core: Online Features + ML Scoring + Feedback · < 100 ms decision · ~10K tx/sec · Low false-positive · Explainable (some)

Requirements

Functional

Score each payment / login / account-action in real-time; return risk score
Decision = {approve, review (2FA / manual), block} based on score thresholds
ML model plus rule engine — explicit rules for hard blocks (known-bad email domain, sanctioned country)
Per-merchant / per-product thresholds (risk appetite varies)
Ingest chargeback / dispute labels for feedback; retrain model weekly
Explainability: surface top contributing features for reviewers
Bulk queries for analyst investigations; post-hoc scoring of historical txns

Non-Functional

p99 scoring latency < 100 ms end-to-end; p50 < 30 ms
99.99% availability — payments block if this fails
Scale to 10K tx/sec sustained; 30K peak (holiday spikes)
Feature freshness: user's recent actions visible < 1 s
Low false-positive rate — every false block is a pissed-off customer
Graceful degrade: on outage, fall back to conservative rule-based decisions

Scale Estimation

Transactions / sec (peak): ~30K. Stripe ~10K avg; Visa authorizes up to ~65K/sec globally at peaks.
Feature count / score: ~200–500. User + card + merchant + session + velocity + graph features.
Feature store QPS: ~150K. 30K tx × avg 5 feature lookups (batched); must respond in < 10 ms.
Model inference latency: < 5 ms. GBDT typical; neural nets cached and batched if used.
Chargeback rate: ~0.1–1%. Industry baseline; the goal is to reduce it without hurting approval rate.
Label latency: 1–90 days. Chargebacks resolve slowly; training data lag is a fundamental constraint.

API Design

POST /v1/score

Score a transaction. Body: {tx_id, user_id, amount, merchant_id, card_token, ip, device_id, ua, ...}. Returns {risk_score: 0.0–1.0, decision: approve|review|block, reason_codes: [...], model_version}. Latency budget 100 ms.

POST /v1/feedback/label

Attach outcome label to past transaction: {tx_id, label: chargeback|disputed|legitimate}. Labels flow into training data. Used by chargeback pipeline + manual review.

GET /v1/transactions/{tx_id}/explain

Return SHAP-style feature contributions for a past scoring decision. Used by manual reviewers + compliance.
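As a rough illustration of what such a response might be built from, the sketch below ranks per-feature contributions by absolute magnitude and keeps the top few as reason codes. The `top_reasons` helper, the feature names, and the contribution values are all hypothetical; in practice the contributions would come from SHAP or tree-derived attributions rather than being written by hand.

```python
def top_reasons(contributions, k=3):
    """Return the k features with the largest absolute contribution to the score."""
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return [{"feature": name, "contribution": value} for name, value in ranked[:k]]


# Hypothetical per-feature contributions for one past scoring decision.
example_contributions = {
    "card_tx_count_5m": 0.31,
    "account_age_days": -0.05,
    "accounts_sharing_device": 0.18,
    "amount_vs_user_avg": 0.02,
}
reasons = top_reasons(example_contributions, k=2)
```

Ranking by absolute value keeps features that pushed the score down as well as up, which reviewers may also want to see.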

POST /v1/rules

Add/edit explicit rule. Body: {condition: "user.country == 'XY' AND amount > 500", action: block}. Rules evaluated alongside ML; hard blocks short-circuit.
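One way to picture rule evaluation with hard-block short-circuiting is the minimal sketch below. It assumes rules have already been parsed into Python predicates; the `Rule` class, `evaluate_rules` function, and the sample rule are illustrative names, and parsing the string condition shown in the request body is out of scope here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[dict], bool]  # predicate over the transaction dict
    action: str                        # "block" or "review"


def evaluate_rules(tx: dict, rules: list) -> Optional[Rule]:
    """Return the first matching block rule, else the first matching review rule, else None."""
    first_review = None
    for rule in rules:
        if rule.condition(tx):
            if rule.action == "block":
                return rule  # hard block: stop evaluating immediately
            if first_review is None:
                first_review = rule
    return first_review


# Hypothetical rule mirroring the example condition above.
rules = [
    Rule("xy_high_amount", lambda tx: tx["country"] == "XY" and tx["amount"] > 500, "block"),
    Rule("large_amount", lambda tx: tx["amount"] > 2000, "review"),
]
```

A `None` result means no rule fired and the transaction proceeds to ML scoring.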

POST /v1/score/batch

Offline batch scoring for analyst investigations or model-shadow evaluation. Runs against current or specified model version.

Architecture

Four services in the hot path: scoring service (orchestrator), feature store (online + offline), rule engine (fast short-circuit), model service (inference). A parallel streaming pipeline keeps online features fresh from the event bus. An offline training pipeline consumes labeled outcomes + features to retrain models weekly.

[Figure: hot path + offline training pipeline]

Request Flow — Step Through

Payment (auth request) → Scoring svc (orchestrator) → Rule engine (hard-block check) → Feature fetch (parallel KV reads) → Model (GBDT inference) → Decision (threshold compare) → Log + train (async to Kafka)


Deep Dive — Hot Path + Feature Freshness + Feedback

Hot-path in < 100 ms. The scoring service orchestrates:

Rule engine (1–5 ms). Short-circuit for obvious blocks: known-bad IP, sanctioned country, velocity thresholds exceeded. Rules are fast and deterministic; ML isn't needed to decline a card that has been used in 10 countries within 10 minutes.
Feature fetch (10–30 ms). Parallel reads: online features (Redis), profile store (card history, user account age), graph store (device ↔ card ↔ IP clusters). All KVs keyed by entity_id.
Feature engineering (2–5 ms). Combine raw features into model inputs — ratios, deltas, categorical encoding. Done in-process in the scoring service.
Model inference (2–10 ms). GBDT (XGBoost/LightGBM) models dominate production fraud because they're fast, explainable, and perform well on tabular features. Neural nets occasionally used for specific signals (transaction-text embeddings, sequence models) but orchestrated alongside, not replacing.
Decision (1 ms). Compare score to thresholds (per-merchant). Return {score, decision, reason_codes}.
Log everything async (not on critical path): tx + features + score + decision to Kafka → offline store for training.
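The steps above can be sketched as a small orchestrator. This is a toy under stated assumptions: the three `fetch_*` coroutines stand in for real Redis, profile, and graph reads; `toy_model` is a hand-written stand-in for a GBDT; and the feature names and thresholds are invented for illustration.

```python
import asyncio


async def fetch_online_features(tx):
    await asyncio.sleep(0)  # stand-in for a Redis read
    return {"card_tx_count_5m": 7}


async def fetch_profile(tx):
    await asyncio.sleep(0)  # stand-in for a profile-store read
    return {"account_age_days": 3}


async def fetch_graph(tx):
    await asyncio.sleep(0)  # stand-in for a graph-store read
    return {"accounts_sharing_device": 4}


def toy_model(features):
    """Hypothetical scoring function; a real system would call a trained model."""
    score = 0.1 + 0.08 * features["card_tx_count_5m"] + 0.05 * features["accounts_sharing_device"]
    if features["account_age_days"] < 7:
        score += 0.1
    return min(score, 1.0)


def decide(score, thresholds):
    """Map a risk score to approve / review / block using per-merchant thresholds."""
    if score >= thresholds["block"]:
        return "block"
    if score >= thresholds["review"]:
        return "review"
    return "approve"


async def score_transaction(tx, thresholds):
    # The three reads run concurrently, so latency is the slowest read, not the sum.
    online, profile, graph = await asyncio.gather(
        fetch_online_features(tx), fetch_profile(tx), fetch_graph(tx)
    )
    features = {**online, **profile, **graph}
    score = toy_model(features)
    return {"risk_score": score, "decision": decide(score, thresholds)}


result = asyncio.run(score_transaction({"tx_id": "t1"}, {"review": 0.5, "block": 0.9}))
```

Rule evaluation and async logging are left out to keep the sketch short; they would sit before the fetch and after the return, respectively.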

Feature freshness — the velocity problem. Fraudsters exploit time gaps. If a stolen card has been used 5 times in the last minute and we only see "lifetime card velocity" computed nightly, we miss it. Online features need sub-second updates.

Pattern: stream processor reads every transaction event; maintains sliding-window counts per entity (user, card, IP, device) at multiple granularities (1 min, 5 min, 1 hr). Written to Redis keyed by {entity_id}:{window}:{stat}. The scoring service reads these in parallel with the user/merchant profile.
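A minimal in-memory sketch of the windowed counters follows. It assumes events for an entity arrive in timestamp order and uses a plain dict where a real deployment would write to Redis; the `VelocityCounter` class and the `count` stat name are illustrative.

```python
from collections import defaultdict, deque

# Window name -> length in seconds, matching the granularities mentioned above.
WINDOWS = {"1m": 60, "5m": 300, "1h": 3600}


class VelocityCounter:
    def __init__(self):
        self._events = defaultdict(deque)  # entity_id -> timestamps, oldest first

    def record(self, entity_id, ts):
        """Register one event for the entity at time ts (seconds)."""
        self._events[entity_id].append(ts)

    def counts(self, entity_id, now):
        """Return {"<entity_id>:<window>:count": n} for every configured window."""
        events = self._events[entity_id]
        longest = max(WINDOWS.values())
        # Evict events that have fallen out of the longest window.
        while events and events[0] <= now - longest:
            events.popleft()
        return {
            f"{entity_id}:{name}:count": sum(1 for ts in events if ts > now - seconds)
            for name, seconds in WINDOWS.items()
        }


counter = VelocityCounter()
for ts in (0, 100, 290, 295):
    counter.record("card_1", ts)
snapshot = counter.counts("card_1", now=300)
```

Keeping raw timestamps gives exact counts, which is affordable per entity because each entity has few recent events.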

Exact counts at entity granularity are fine (a user has few recent transactions). But aggregates across entities benefit from probabilistic sketches for bounded memory: HyperLogLog for distinct counts (e.g., "how many distinct cards used this IP in the last hour") and Count-Min Sketches for per-key frequencies.
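For reference, a Count-Min Sketch keeps approximate per-key frequencies in fixed memory and never underestimates a count. The version below is a minimal sketch, not a tuned implementation: the width, depth, and key format are arbitrary choices for illustration, and window expiry (for example, one sketch per time bucket) is not shown.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        """Yield one (row, column) per hash function; each row uses a different salt."""
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        """Upper-biased estimate: collisions can only inflate a cell, so take the minimum."""
        return min(self.table[row][col] for row, col in self._cells(key))


cms = CountMinSketch()
for _ in range(5):
    cms.add("ip:203.0.113.7")
```

Memory stays at `width × depth` counters regardless of how many distinct keys are seen; the cost is occasional overestimation when keys collide in every row.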

Scoring Sequence: where the 100 ms budget goes (Mermaid)

sequenceDiagram
    participant C as Caller (payments)
    participant S as Scoring svc
    participant R as Rule engine
    participant F as Feature store
    participant M as Model svc
    C->>S: POST /v1/score {tx}
    S->>R: evaluate hard rules (~3 ms)
    alt hard block
        R-->>S: BLOCK + reason
        S-->>C: {decision: block}
    else no hard block
        par parallel fetch
            S->>F: online features (~15 ms)
            S->>F: profile + graph (~10 ms)
        end
        S->>S: engineer inputs (~3 ms)
        S->>M: infer (~5 ms)
        M-->>S: score 0.0–1.0
        S->>S: decision by threshold
        S-->>C: {score, decision, explain}
    end
    S->>S: async emit to Kafka (off critical path)

Graph features — finding rings. Single transactions look innocent; the ring doesn't. Graph features capture network-level signals: "how many other accounts share this device fingerprint?" "Has this IP been used by accounts that later charged back?" Stored in a graph DB or a materialized graph (adjacency lists) in Bigtable. Updated by a graph updater consuming the event stream.
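A materialized adjacency structure for one such signal might look like the sketch below. It is an in-memory toy: the `DeviceGraph` class and feature names are hypothetical, and a production system would back this with a graph DB or Bigtable rows rather than Python sets.

```python
from collections import defaultdict


class DeviceGraph:
    def __init__(self):
        self.accounts_by_device = defaultdict(set)  # device_id -> account_ids seen on it
        self.charged_back = set()                   # accounts with a confirmed chargeback

    def link(self, account_id, device_id):
        """Record that an account was seen on a device (called by the graph updater)."""
        self.accounts_by_device[device_id].add(account_id)

    def mark_chargeback(self, account_id):
        self.charged_back.add(account_id)

    def features(self, account_id, device_id):
        """Network-level features for one (account, device) pair."""
        others = self.accounts_by_device.get(device_id, set()) - {account_id}
        return {
            "accounts_sharing_device": len(others),
            "chargeback_accounts_on_device": len(others & self.charged_back),
        }


graph = DeviceGraph()
for account in ("acct_a", "acct_b", "acct_c"):
    graph.link(account, "device_1")
graph.mark_chargeback("acct_b")
ring_features = graph.features("acct_a", "device_1")
```

The same shape extends to IP and card linkage by keeping one adjacency map per identifier type.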

Feedback loop. Three label sources:

Chargebacks (1–90 days later): highest signal, but also the highest latency. Comes from payment network reports.
User reports — "I didn't make this transaction." Fast (hours), but noisy (users sometimes forget).
Analyst review — internal team labels borderline cases flagged by the model.

All labels flow into a join table (tx_id, features, label). A weekly training job pulls the last N days of labeled data, retrains GBDT, shadow-tests on recent traffic, then canary-deploys to a small % of tx. If metrics hold, promote to 100%.
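The join step might be sketched as follows. The record layout, the `build_training_rows` name, and the 90-day lookback are assumptions for illustration. The sketch also assumes that `chargeback` maps to the positive class, `legitimate` to the negative class, and that `disputed` rows are skipped as ambiguous; a real pipeline may treat disputes differently.

```python
from datetime import datetime, timedelta


def build_training_rows(scored, labels, now, lookback_days=90):
    """Join logged scoring records with outcome labels for the last N days."""
    cutoff = now - timedelta(days=lookback_days)
    rows = []
    for tx_id, record in scored.items():
        if record["scored_at"] < cutoff:
            continue  # outside the training window
        label = labels.get(tx_id)
        if label not in ("chargeback", "legitimate"):
            continue  # unlabeled or ambiguous: outcome not usable yet
        rows.append({
            "tx_id": tx_id,
            "features": record["features"],  # features exactly as logged at scoring time
            "is_fraud": int(label == "chargeback"),
        })
    return rows


now = datetime(2024, 6, 1)
scored = {
    "a": {"scored_at": datetime(2024, 5, 1), "features": {"amount": 900}},
    "b": {"scored_at": datetime(2024, 5, 20), "features": {"amount": 20}},
    "c": {"scored_at": datetime(2023, 1, 1), "features": {"amount": 50}},
    "d": {"scored_at": datetime(2024, 5, 10), "features": {"amount": 35}},
    "e": {"scored_at": datetime(2024, 5, 15), "features": {"amount": 70}},
}
labels = {"a": "chargeback", "c": "legitimate", "d": "legitimate", "e": "disputed"}
training_rows = build_training_rows(scored, labels, now)
```

Training on the features as logged at scoring time, rather than recomputing them later, is what keeps offline and online features consistent.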

The 1–90 day label delay is fundamental. You can't retrain on today's data today. Partial mitigation: heuristic labels (e.g., "3 chargebacks on this card in 30 days = probably fraud") applied to recent transactions whose outcomes are not yet known. Unsupervised anomaly detection supplements the supervised model on the most recent window, where labels have not yet arrived.

Interview answer

"Scoring service orchestrates: rule engine short-circuits hard blocks, parallel feature fetch (online features + profile + graph) < 30 ms, feature engineering, GBDT model inference < 10 ms, threshold decision. Features kept fresh by stream processor (Flink) updating Redis-backed windowed stats on every event — sub-second freshness. Feedback from chargebacks + user reports + analyst review feeds weekly retrain via shadow/canary/promote. Graph features capture ring structure. Explainability via SHAP for reviewers. Degrade gracefully to rule-only on model outage."

Tradeoffs & Design Choices

GBDT vs deep learning. GBDT wins for tabular features: faster training, more explainable (per-feature contributions), runs in a few ms. Deep learning wins for unstructured inputs (text, sequence, images). Modern fraud systems use GBDT in the hot path; deep models run async for enrichment.
Precision vs recall. Block more = catch more fraud + annoy more real customers. Every org picks a point on this curve based on business pain. Reviews (2FA) are the middle ground — expensive but not blocking.
Rules + ML together. Pure ML is brittle against newly emerging fraud patterns that the model hasn't seen yet; a rule can block a known-bad pattern as soon as it is identified. Rules are also maintainable by analysts. Hybrid is the default; don't propose "ML-only."
Shared feature store vs per-model features. Shared store = consistent offline/online features across models (training-serving skew minimized). Per-model features = team autonomy, but drift risk. Large orgs go shared; small orgs can tolerate per-model.
Synchronous block vs post-auth review. Synchronous needed for payment auth. Account security (logins, password resets) can tolerate asynchronous (score, flag for challenge on next action) — relaxes latency.

Failure Modes

💥

Model service outage blocks all payments

Scoring svc calls model, model down, scoring times out; payment auth blocks; merchants lose revenue.

→ Mitigation: graceful degrade. On model timeout, fall back to a rule-engine-only decision with a conservative threshold, optionally using a feature-based heuristic score computed directly in the scoring svc. Payments keep flowing instead of failing outright.
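A timeout-with-fallback wrapper could look like the sketch below. The `score_with_fallback` name, the 50 ms timeout, and the idea of passing in a precomputed `rule_score` are assumptions; the point is only that the model call is bounded and never propagates a failure to the payment path.

```python
import asyncio


async def score_with_fallback(call_model, rule_score, timeout_s=0.05):
    """Try the model within a deadline; on timeout or connection failure use the rule score."""
    try:
        score = await asyncio.wait_for(call_model(), timeout=timeout_s)
        return {"risk_score": score, "source": "model"}
    except (asyncio.TimeoutError, ConnectionError):
        # Conservative rule-based score keeps payments flowing during the outage.
        return {"risk_score": rule_score, "source": "rules_fallback"}


async def slow_model():
    await asyncio.sleep(1)  # simulates a hung model service
    return 0.9


async def healthy_model():
    return 0.2


degraded = asyncio.run(score_with_fallback(slow_model, rule_score=0.7, timeout_s=0.01))
normal = asyncio.run(score_with_fallback(healthy_model, rule_score=0.7))
```

Returning the `source` alongside the score lets downstream logging and dashboards show how much traffic was decided in degraded mode.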

📉

Concept drift — model quality degrades over time

Fraud patterns shift (new scam style); model was trained on old data; false-negative rate climbs.

→ Mitigation: weekly retrains; drift monitoring (compare score distribution + feature distribution to training); alert when drift exceeds threshold; fast emergency retrain path.
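One common way to compare a live score distribution to the training distribution is the Population Stability Index (PSI); it is named here as one option, not as the method this design prescribes. The sketch assumes scores lie in [0, 1], both samples are non-empty, and uses ten equal-width bins; the alerting threshold is left to the operator.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""
    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int(v * bins), bins - 1)  # a score of 1.0 falls in the last bin
            counts[idx] += 1
        total = len(values)
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training_scores = [0.05] * 90 + [0.95] * 10
live_scores = [0.05] * 50 + [0.95] * 50
drift = psi(training_scores, live_scores)
```

The same function can be applied per feature after scaling each feature into [0, 1] or by swapping in feature-specific bin edges.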

🥊

Adversarial evasion

Fraudsters probe the system, learn features that flag them, adjust behavior to just below threshold.

→ Mitigation: don't expose score/reason to the end user; rotate features; favor features that depend on graph structure, which are harder for an individual fraudster to evade; keep human reviewers in the loop for novel patterns.

🔒

False positive cascade

A rule change or model change suddenly blocks 5× more users; CS overwhelmed; revenue drops.

→ Mitigation: shadow evaluation before rollout; canary 1% → 10% → 100% with metrics gates; automated rollback on false-positive rate spike; change management process for new rules.
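A staged canary with an automated rollback gate might be sketched like this. Hashing `tx_id` into 100 buckets is one simple way to get a stable, deterministic split; the function names and the 1.5× block-rate ratio are assumptions for illustration.

```python
import hashlib


def in_canary(tx_id, percent):
    """Deterministically route roughly `percent`% of transactions to the new model."""
    digest = hashlib.sha256(tx_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


def should_rollback(baseline_block_rate, canary_block_rate, max_ratio=1.5):
    """Trip the gate when the canary blocks disproportionately more than the baseline."""
    return canary_block_rate > baseline_block_rate * max_ratio


sample_ids = [f"tx_{i}" for i in range(1000)]
canary_1 = {t for t in sample_ids if in_canary(t, 1)}
canary_10 = {t for t in sample_ids if in_canary(t, 10)}
```

Because the bucket depends only on `tx_id`, any transaction in the 1% stage is also in the 10% stage, so widening the canary never flips earlier traffic back to the old model.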

🕳️

Feature store lag > 1 second

Streaming pipeline behind; scoring svc reads stale velocity counts; fresh burst of fraud passes through.

→ Mitigation: lag monitoring on Flink consumer; scoring svc checks feature-age staleness and inflates velocity-related signals when stale; spare capacity in stream processor; alert on > N second lag.
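The staleness check could be as simple as the sketch below, which scales a velocity count upward once the feature is older than the freshness target. The linear inflation, its rate, and the cap are invented parameters; the right adjustment depends on how the model consumes the signal.

```python
def adjust_for_staleness(velocity_count, feature_age_s, max_age_s=1.0,
                         inflation_per_s=0.5, cap=3.0):
    """Inflate a velocity count when its underlying feature is older than max_age_s."""
    if feature_age_s <= max_age_s:
        return velocity_count  # fresh enough: use as-is
    # Grow linearly with the extra age, bounded so a long outage cannot block everyone.
    factor = min(1.0 + inflation_per_s * (feature_age_s - max_age_s), cap)
    return velocity_count * factor


fresh = adjust_for_staleness(4, feature_age_s=0.5)
stale = adjust_for_staleness(4, feature_age_s=3.0)
very_stale = adjust_for_staleness(4, feature_age_s=100.0)
```

This presumes each feature is written with a timestamp so the scoring svc can compute its age at read time.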

🔐

PII / feature store leak

Fraud features often include sensitive data (IP, device, email domain). Breach is a compliance nightmare.

→ Mitigation: encrypt at rest + in transit; strict RBAC on feature store; audit logs on all reads; hash-on-write for raw identifiers; regular access reviews.
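Hash-on-write can be illustrated with a keyed hash, so that raw identifiers never reach the feature store but equal inputs still map to equal keys. The `pseudonymize` helper is a hypothetical name, and key management (storage, rotation) is assumed to live in a secrets manager outside this sketch.

```python
import hashlib
import hmac


def pseudonymize(raw_identifier, secret_key):
    """Return a keyed SHA-256 digest of the identifier, suitable as a feature-store key."""
    return hmac.new(secret_key, raw_identifier.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-key-do-not-use-in-production"  # placeholder; load from a secrets manager
device_key = pseudonymize("device-1234", key)
```

A keyed HMAC is used rather than a bare hash because identifiers such as IP addresses have a small enough space that an unkeyed hash could be reversed by enumeration.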

Interview Tips

Rules + ML hybrid. Pure-ML is wrong. Mention rule engine short-circuit as a core design piece.
Online + offline feature store. Same features for training + serving. Training-serving skew is a classic production bug; show you know about it.
Label delay is fundamental. Acknowledge that chargebacks take days to months to arrive (1–90 days). Don't propose "train on today's data, serve tomorrow" without caveats.
Graceful degrade. Payment auth can't hang. Fall back to rules + conservative threshold on any dependency failure. Payments > precision.
Graph features matter. Single-tx features miss rings. Shared-device-fingerprint clusters are a cheap and effective signal.
Explainability isn't nice-to-have. Compliance requires it. SHAP or tree-derived feature contributions are the standard.

Evolution

MVP — rule engine + analyst review

Hand-written if-then rules. Analysts manually review flagged transactions. Works at low scale; quickly overwhelmed as volume grows.

Supervised ML on batch features

Logistic regression or GBDT trained on historical chargeback data. Features computed from yesterday's batch. Good baseline but misses velocity attacks.

Online feature store + streaming

Features updated in near-real-time via Flink. Velocity counts, session behavior. Sub-second freshness catches fast fraud bursts.

Graph features + ensemble models

Device / IP / card linkage captures ring structure. Separate models per use case (card-present, card-not-present, account-takeover). Ensemble scoring.

Neural nets + LLM enrichment

Sequence models on transaction history; LLM-based features for text fields (merchant names, transaction descriptions). Self-supervised representation learning from large unlabeled data.

