System Design — 06

Payment Gateway

Design a system like Stripe or Amazon Pay that orchestrates multi-party financial transactions across unreliable networks — guaranteeing money is never lost, duplicated, or stuck in limbo.

Idempotency · Exactly-Once · Saga Pattern · PCI-DSS · Double-Entry Ledger
01

Problem Statement

Design a payment gateway that sits between a customer clicking "Pay Now" and money actually moving between bank accounts. This isn't just "process a credit card": it's a multi-party financial relay across 6 participants (customer, merchant, gateway, processor, card network, issuing bank) where every handoff is a network call that can fail, time out, or return ambiguous results.

The system must process millions of financial transactions per day while guaranteeing exactly-once semantics at the business level. Most systems tolerate inconsistency — a tweet appearing late is fine, a leaderboard being slightly stale is acceptable. Payments are different: this is the one system where every failure mode has a direct financial consequence.

Core question: How do you guarantee that money is never lost, duplicated, or stuck in limbo — even when networks fail, services crash, and third-party providers go down?

What Makes This Unique

Idempotency Is Existential

A network timeout after calling Visa means you don't know if the charge went through. Without bulletproof idempotency, retries cause double charges — not a bug, a lawsuit.

State Machine Is the Product

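A payment isn't a single action; it's a lifecycle: INITIATED → AUTHORIZED → CAPTURED → SETTLED (or FAILED, REFUNDED, DISPUTED). Every transition must be auditable, and money movements are undone only through explicit compensating transitions (a void or a refund), never by editing history.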

Distributed Saga Across Organizations

You can't wrap Visa, your DB, and Kafka in a single transaction. You need sagas, compensation, and reconciliation across systems you don't control.

Compliance Shapes Architecture

PCI-DSS mandates network segmentation for card data. The tokenization vault runs in an isolated environment. Legal constraints drive technical decisions.
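
As a rough illustration of the payment lifecycle named above, the allowed transitions can be encoded as a lookup table so that illegal jumps are rejected and every accepted change is recorded. The exact edge set below (for example, which states may move to FAILED or DISPUTED) is an assumption for illustration; the text only names the states.

```python
from enum import Enum


class State(Enum):
    INITIATED = "initiated"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    FAILED = "failed"
    REFUNDED = "refunded"
    DISPUTED = "disputed"


# Assumed transition table; a real gateway would derive it from processor contracts.
ALLOWED = {
    State.INITIATED: {State.AUTHORIZED, State.FAILED},
    State.AUTHORIZED: {State.CAPTURED, State.FAILED},
    State.CAPTURED: {State.SETTLED, State.REFUNDED},
    State.SETTLED: {State.REFUNDED, State.DISPUTED},
    State.FAILED: set(),
    State.REFUNDED: set(),
    State.DISPUTED: set(),
}


class Payment:
    def __init__(self):
        self.state = State.INITIATED
        self.history = []  # append-only audit trail of (previous, new) pairs

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.history.append((self.state, new_state))
        self.state = new_state


payment = Payment()
payment.transition(State.AUTHORIZED)
payment.transition(State.CAPTURED)
```

Keeping the table as data rather than scattered `if` statements makes it easier to audit which transitions exist at all.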

02

Requirements

Functional Requirements

  • Authorize + Capture (two-phase): Authorize puts a hold on funds; capture takes the money when the item ships. Separate operations because Amazon doesn't know if an item is in stock until the warehouse confirms.
  • Refunds (full and partial): A customer returns 2 of 3 items — refund $30 of $50 back to the original payment method. Track refund state independently (pending → processing → succeeded).
  • Idempotency on every mutation: Every API call that changes state accepts a client-provided idempotency key. Same key twice returns the cached result — no duplicate processing.
  • Retry and recover from ambiguous failures: When a processor call times out, query-back before retrying. Never double-charge, never lose a successful auth.
  • Webhook notifications: Push events (authorized, captured, refunded, disputed) to merchants with retry logic and HMAC signature verification.
  • Tokenization: Never store raw card numbers. Client-side SDK tokenizes card data in an isolated Card Data Environment (CDE).
  • Multi-currency support: Customer pays in AED, merchant settles in USD. Exchange rate applied at capture time, not auth time.
  • Fraud detection (pre-authorization): Synchronous fraud checks — velocity, AVS, CVV, device fingerprinting — before sending to the processor.
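
The webhook requirement above mentions HMAC signature verification. A minimal sketch of how signing and verifying might look, using only the Python standard library; the secret format, the timestamp-plus-body message layout, and the function names are assumptions, not a documented gateway API.

```python
import hashlib
import hmac


def sign_webhook(secret: bytes, timestamp: str, body: bytes) -> str:
    # Binding the timestamp into the signature lets a receiver reject stale replays
    # (the freshness check itself is left to the receiver).
    message = timestamp.encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, timestamp: str, body: bytes, signature: str) -> bool:
    expected = sign_webhook(secret, timestamp, body)
    # compare_digest runs in constant time to avoid leaking the signature via timing.
    return hmac.compare_digest(expected, signature)


secret = b"whsec_example"  # hypothetical per-merchant secret
body = b'{"event": "payment.captured", "id": "pay_9f8e7d6c5b4a"}'
sig = sign_webhook(secret, "1718000000", body)
```

The merchant recomputes the signature over the raw request bytes and compares; any change to the body or timestamp makes verification fail.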

Non-Functional Requirements

  • Availability: 99.999% (five nines) — payment downtime = lost revenue. One minute during Prime Day could mean millions in abandoned carts.
  • Consistency: CP over AP for payment state — a ghost charge (processor charged but no record in our system) is worse than temporary unavailability.
  • Latency: P99 < 2 seconds end-to-end. The processor call alone takes 300ms–1.2s, leaving ~700ms budget for internal processing.
  • Throughput: 50,000 TPS design target (2.5x Prime Day peak). Payment systems can't gracefully degrade by dropping requests.
  • Auditability: Every state change produces an immutable audit log — who, what, when, previous state, new state. Legally required. Append-only, no deletes.
  • PCI-DSS: Network segmentation, HSM-backed encryption, quarterly vulnerability scans. The CDE is a walled garden.
  • Reconciliation: End-of-day, our records must match exactly what the banks say. Double-entry bookkeeping — every cent accounted for.
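
The double-entry rule in the reconciliation requirement can be sketched in a few lines: a transaction is only accepted if its debits equal its credits, so the ledger as a whole always sums to zero. The account names and the `Ledger` class are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account: str
    debit: int = 0   # integer cents
    credit: int = 0  # integer cents


class Ledger:
    def __init__(self):
        self.entries = []  # append-only, entries are never edited or deleted

    def post(self, entries):
        # Reject any transaction whose debits and credits do not balance.
        if sum(e.debit for e in entries) != sum(e.credit for e in entries):
            raise ValueError("unbalanced transaction")
        self.entries.extend(entries)

    def balance(self, account):
        return sum(e.debit - e.credit for e in self.entries if e.account == account)


ledger = Ledger()
# Capture of $29.99, recorded against two hypothetical accounts.
ledger.post([
    Entry("processor_receivable", debit=2999),
    Entry("merchant_payable", credit=2999),
])
```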

Key insight: Payment state requires strong consistency (CP). Reporting, analytics, and merchant dashboards can tolerate eventual consistency (AP). Apply the right model to the right data.

03

Scale Estimation

All numbers derived from Amazon's known order volume: ~1.5B orders/quarter → 17M orders/day. Each order triggers ~5 payment API calls (auth, capture, status checks, potential retry/refund), giving ~85M payment events/day.

  • Steady-State TPS: ~1,000
  • Prime Day Peak TPS: 20,000
  • Design Target TPS: 50,000
  • Transaction Storage: ~40 TB/yr
  • Audit Log Storage: ~46 TB/yr
  • Idempotency Hot Store: ~20 GB
  • Internal Latency Budget: < 700ms
  • API Server Instances: 50–100

Numbers That Drive Architecture

  • 50K TPS → Stateless, horizontally scaled API tier behind load balancers.
  • 20 GB idempotency store → Redis cluster with 48-hour TTL: fits in memory, needs sub-ms latency.
  • 40 TB/year, never deletable → Tiered storage: hot (30 days, primary DB), warm (1 year, read replicas), cold (1–7 years, S3 + Athena).
  • 1,200ms P99 from processors → Async where possible: auth is sync (customer waiting), capture and settlement can be async.
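
The throughput figures in this section follow from simple arithmetic, reproduced here as a sanity check. The 90-day quarter is an assumption, which is why the results land slightly below the rounded 17M, 85M and ~1,000 figures quoted above.

```python
ORDERS_PER_QUARTER = 1_500_000_000
DAYS_PER_QUARTER = 90            # assumed quarter length
CALLS_PER_ORDER = 5
SECONDS_PER_DAY = 86_400

orders_per_day = ORDERS_PER_QUARTER / DAYS_PER_QUARTER
events_per_day = orders_per_day * CALLS_PER_ORDER
steady_tps = events_per_day / SECONDS_PER_DAY

PRIME_DAY_PEAK_TPS = 20_000
design_target_tps = PRIME_DAY_PEAK_TPS * 2.5   # headroom factor from the requirements

print(f"orders/day  ~ {orders_per_day / 1e6:.1f}M")
print(f"events/day  ~ {events_per_day / 1e6:.1f}M")
print(f"steady TPS  ~ {steady_tps:.0f}")
print(f"design TPS  = {design_target_tps:.0f}")
```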

04

API Design

Every endpoint is designed around one principle: the caller must never be uncertain about what happened to their money. Idempotency keys on every mutation, explicit state transitions, and clear error semantics.

Authorize Payment
POST /v1/payments/authorize
Authorization: Bearer sk_live_merchant_abc123
Idempotency-Key: ord_2024_0609_xyz789

{
  "amount": 4999,              // cents — never floating point
  "currency": "USD",
  "payment_method": {
    "type": "card",
    "token": "tok_visa_4242"   // tokenized, never raw PAN
  },
  "capture_mode": "manual",    // "manual" (Amazon) or "automatic" (coffee shop)
  "merchant_reference": "order_20250609_001"
}

→ 201 Created
{
  "id": "pay_9f8e7d6c5b4a",
  "status": "authorized",
  "amount_authorized": 4999,
  "amount_captured": 0,
  "amount_refunded": 0,
  "authorization_expires_at": "2025-06-16T14:30:00Z",
  "fraud_check": { "risk_score": 12, "risk_level": "low" },
  "processor_response": { "code": "approved", "auth_code": "A1B2C3" }
}

Capture Payment (at shipment)
POST /v1/payments/pay_9f8e7d6c5b4a/capture
Idempotency-Key: cap_shipment_001

{ "amount": 2999 }             // partial capture — only 1 of 2 items shipped

→ 200 OK
{
  "status": "partially_captured",
  "amount_authorized": 4999,
  "amount_captured": 2999,
  "amount_remaining": 2000
}

Refund
POST /v1/payments/pay_9f8e7d6c5b4a/refund
Idempotency-Key: refund_rma_456

{
  "amount": 2999,
  "reason": "customer_returned"  // required for dispute defense
}

→ 200 OK
{
  "status": "partially_refunded",
  "refunds": [{
    "id": "ref_def456",
    "status": "pending",         // refunds take 5–10 business days
    "estimated_arrival": "2025-06-18"
  }]
}

Error Response
{
  "error": {
    "type": "card_declined",
    "code": "insufficient_funds",
    "decline_code": "do_not_retry",    // vs "retry_allowed"
    "message": "The card has insufficient funds."
  }
}

Key design decisions: Amounts in cents (integers, never floats) to avoid rounding errors. Three separate amount fields (authorized, captured, refunded) for partial operations. decline_code tells merchants whether retrying is futile or might work.
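
To make the "three separate amount fields" decision concrete, here is a small sketch that tracks authorized, captured and refunded amounts as integer cents and rejects over-captures and over-refunds. The class and method names are illustrative and not part of the API above.

```python
from dataclasses import dataclass


@dataclass
class PaymentAmounts:
    authorized: int      # all amounts are integer cents
    captured: int = 0
    refunded: int = 0

    def capture(self, amount: int) -> None:
        if amount <= 0 or self.captured + amount > self.authorized:
            raise ValueError("capture exceeds authorized amount")
        self.captured += amount

    def refund(self, amount: int) -> None:
        if amount <= 0 or self.refunded + amount > self.captured:
            raise ValueError("refund exceeds captured amount")
        self.refunded += amount


amounts = PaymentAmounts(authorized=4999)
amounts.capture(2999)   # partial capture: one of two items shipped
amounts.refund(1000)    # partial refund against the captured amount

# Binary floats cannot represent 0.1 exactly, so this comparison is True.
float_mismatch = (0.1 + 0.2) != 0.3
```

Integer cents keep every comparison exact, which is what makes the `captured <= authorized` and `refunded <= captured` checks trustworthy.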

05

High-Level Architecture

Three distinct paths with different latency and consistency requirements: Sync Hot Path (authorization, customer waiting, <2s), Async Warm Path (capture/refund, merchant triggered, seconds of delay OK), and Batch Cold Path (settlement/reconciliation, daily, correctness over speed).

[Architecture diagram. Components: Client (Mobile / Web); API Gateway (Rate Limit · Auth); Idempotency (Redis Cluster); Orchestrator (Saga Coordinator); Fraud Service (< 80ms sync); Token Vault (CDE · HSM); Processor Router (Smart Routing); Visa / MC; Adyen; Payment DB (CockroachDB · ACID); Outbox Poller (At-Least-Once); Kafka (Event Bus); Webhooks (HMAC Signed); Ledger (Double-Entry); Settlement (Daily Reconciliation). Edge labels: HTTPS, Dedup, Validate, Route, Detokenize, Risk Score, ACID Write + Outbox, Publish.]

Component Summary

Component | Tech | Why
API Gateway | Nginx / Envoy | Edge policy, rate limiting, auth, TLS termination
Idempotency Store | Redis Cluster (6 nodes) | Sub-ms dedup with SET NX, 48hr TTL, ~20 GB
Orchestrator | Go/Java, 50–100 instances | Stateless saga coordinator for the payment lifecycle
Fraud Service | Python/Go + ML model | Sync risk scoring in <80ms, tiered checks
Token Vault (CDE) | HSM + encrypted DB | PCI-DSS isolated card storage, field-level encryption
Processor Router | Go (low latency) | Multi-processor abstraction, cost/availability routing
Payment DB | CockroachDB | Source of truth, ACID transactions, strong consistency
Outbox Poller | Go/Java, 3–5 instances | Reliable event publishing; eliminates dual-write
Event Bus | Kafka (9+ brokers) | Durable async event backbone, partition ordering
Webhook Service | Go/Node, 10–15 instances | Signed merchant notifications with exponential backoff
Ledger | PostgreSQL | Double-entry bookkeeping: debits must equal credits
Settlement Engine | Spark (batch) | Daily reconciliation against processor settlement files

Request Flow — Step Through
Client → API Gateway → Redis (Dedup) → Orchestrator → Fraud Service → Token Vault → Processor Router → Payment DB + Outbox → Kafka → Webhooks
06

Deep Dive — Idempotency & Exactly-Once Processing

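Exactly-once payment processing isn't one mechanism; it's three layers working together. Miss any one and the guarantee breaks. At 50K TPS with a 0.1% timeout rate, 50 requests per second end in an unknown state, which is about 4.3 million ambiguous transactions per day if that rate is sustained (and still roughly 86,000 per day at the ~1,000 TPS steady state). In each of them you don't know if the charge went through.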

Layer 1 — Client-Side Idempotency Key

The merchant generates a unique key per intended payment action and sends it with every request (including retries). The key represents intent, not the request — e.g., order_123_authorize. The merchant generates it (not us) because only they know whether a request is a retry or a new payment.

Layer 2 — Gateway-Side Deduplication (Redis Protocol)

sequenceDiagram
    participant C as Client
    participant R as Redis
    participant O as Orchestrator
    participant P as Processor
    participant DB as Payment DB
    C->>R: SET NX idempotency:K
    alt Key is NEW
        R-->>O: NX succeeded
        O->>O: Validate request
        O->>P: Charge $49.99
        P-->>O: Approved (auth_code)
        O->>DB: BEGIN: UPDATE payment + INSERT outbox COMMIT
        DB-->>O: Committed
        O->>R: SET K = {status: completed, response: ...}
        O-->>C: 201 Authorized
    else Key EXISTS (completed)
        R-->>O: Cached response found
        O->>O: Verify params hash matches
        O-->>C: 200 Return cached result
    else Key EXISTS (processing)
        R-->>O: Status = processing
        alt Recent (< 30s)
            O-->>C: 202 Accepted (Retry-After: 2s)
        else Stale (> 30s)
            O->>DB: Check for existing payment record
            alt Found in DB
                O->>R: Update to completed
                O-->>C: 200 Return DB result
            else Not in DB
                O->>P: Query-back: did you process this?
                O->>O: Re-process or record result
            end
        end
    end
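
The main branches of the protocol above can be sketched with an in-memory stand-in for Redis. This is a simplified illustration, not the gateway's implementation: `IdempotencyStore` mimics `SET ... NX` with a dictionary, the 422 status for a parameter mismatch is an assumption (the diagram only says the hash is verified), and the stale-key recovery path (> 30s) is omitted.

```python
import hashlib
import json


class IdempotencyStore:
    """In-memory stand-in for the Redis cluster; set_nx mimics SET ... NX."""

    def __init__(self):
        self._data = {}

    def set_nx(self, key, value):
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def params_hash(params):
    return hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()


def handle(store, key, params, process):
    digest = params_hash(params)
    if store.set_nx(key, {"status": "processing", "hash": digest}):
        response = process(params)          # processor call + DB commit happen here
        store.set(key, {"status": "completed", "hash": digest, "response": response})
        return 201, response
    record = store.get(key)
    if record["hash"] != digest:
        return 422, {"error": "idempotency key reused with different parameters"}
    if record["status"] == "completed":
        return 200, record["response"]      # cached result, no second charge
    return 202, {"retry_after_seconds": 2}  # another request is still in flight


calls = []


def fake_process(params):
    calls.append(params)
    return {"status": "authorized", "amount": params["amount"]}


store = IdempotencyStore()
first = handle(store, "ord_1_authorize", {"amount": 4999}, fake_process)
retry = handle(store, "ord_1_authorize", {"amount": 4999}, fake_process)
```

The retry returns the cached response and `fake_process` runs only once, which is the property the layer exists to provide.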

The Crash Point Analysis

The order of operations matters. If we crash at each step, here's what happens:

Crash Point | State | Recovery
After validation | Redis: processing, DB: nothing | Retry re-processes safely
After fraud check | Redis: processing, DB: nothing | Retry re-runs fraud (stateless)
After token lookup | Redis: processing, DB: nothing | Retry re-fetches (read-only)
After processor call | Redis: processing, DB: nothing, Visa may have charged | DANGER ZONE → Layer 3
After DB write | Redis: processing, DB: has record | Retry finds DB record, updates Redis
After Redis update | Redis: completed, DB: has record | Normal path, no issue

Layer 3 — Processor-Level Deduplication & Recovery

When we detect the dangerous crash point (after processor call, before DB write), we use three strategies in priority order:

Strategy 1: Processor Idempotency Keys

Most modern processors support their own idempotency. We forward our key — if we retry, the processor returns the original result. No double charge.

Strategy 2: Query-Back

Ask the processor: "Did you process charge X?" before retrying. Adds ~300ms but prevents double-charging with legacy processors.

Strategy 3: Pessimistic Reconciliation

If query-back also fails: don't retry. Mark as "ambiguous." Daily reconciliation resolves it. We'd rather fail a payment than double-charge.
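
The priority order of the three strategies can be written as a small decision function. This is a sketch under stated assumptions: `retry_with_key` and `query_back` are hypothetical callables standing in for processor client methods, each returning True (charged) or False (not charged), or raising `TimeoutError`.

```python
from enum import Enum


class Outcome(Enum):
    CHARGED = "charged"
    NOT_CHARGED = "not_charged"
    AMBIGUOUS = "ambiguous"


def resolve_after_timeout(key, supports_idempotency, retry_with_key, query_back):
    """Decide what happened to a charge whose processor response was lost."""
    if supports_idempotency:
        try:
            # Strategy 1: safe to resend, the processor dedups on our key.
            return Outcome.CHARGED if retry_with_key(key) else Outcome.NOT_CHARGED
        except TimeoutError:
            pass  # fall through to query-back
    try:
        # Strategy 2: ask before doing anything that could charge again.
        return Outcome.CHARGED if query_back(key) else Outcome.NOT_CHARGED
    except TimeoutError:
        # Strategy 3: do not retry; leave it for daily reconciliation.
        return Outcome.AMBIGUOUS
```

Note that the function never resends a charge to a processor that lacks idempotency support, which matches the "rather fail a payment than double-charge" stance.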

The Four Invariants

1. For any key K, at most one processor charge is made — even across crashes and retries.
2. If a client receives success, the payment is durably recorded in the DB.
3. If a client receives failure, either no charge was made, or it will be auto-reversed via reconciliation.
4. No idempotency key is permanently locked — every "processing" state either completes, times out, or expires.

07

Key Design Decisions & Tradeoffs

SQL vs NoSQL for Payment State

✓ Chosen

SQL (PostgreSQL / CockroachDB)

ACID transactions for state + outbox in one commit. FOR UPDATE row locks prevent partial capture race conditions. Invariants such as captured ≤ authorized are enforced declaratively with a CHECK constraint.

✗ Alternative

DynamoDB / Cassandra

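Near-infinite horizontal scale without manual sharding, but weaker transactional guarantees: Cassandra offers only lightweight transactions scoped to a single partition, and DynamoDB transactions are capped in the number of items they can touch and cost extra capacity. You'd reimplement much of the consistency logic in application code, which is risky for money.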
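
A minimal sketch of the kind of database-enforced invariant the SQL option makes easy. SQLite (from the Python standard library) stands in for PostgreSQL or CockroachDB here, and a single atomic UPDATE replaces the `FOR UPDATE` lock, which SQLite does not support; the table layout is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        id TEXT PRIMARY KEY,
        amount_authorized INTEGER NOT NULL,
        amount_captured INTEGER NOT NULL DEFAULT 0,
        CHECK (amount_captured <= amount_authorized)
    )
""")
conn.execute("INSERT INTO payments (id, amount_authorized) VALUES ('pay_1', 4999)")
conn.commit()


def capture(conn, payment_id, amount):
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE payments SET amount_captured = amount_captured + ? WHERE id = ?",
                (amount, payment_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an over-capture


ok = capture(conn, "pay_1", 2999)
rejected = capture(conn, "pay_1", 2999)  # would total 5998 > 4999
captured = conn.execute("SELECT amount_captured FROM payments").fetchone()[0]
```

The second capture fails inside the database itself, so no application code path can leave the row in an over-captured state.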

Consistency Model for Payment State

✓ Chosen

Strong Consistency (CP)

Writes require quorum. A ghost charge (processor charged but our DB has no record) is catastrophic. We'd rather return 503 than lose track of money.

✗ Alternative

Eventual Consistency (AP)

Higher availability, but two replicas could disagree on payment status. A payment simultaneously "captured" and "failed" is unacceptable.

Event Delivery: Transactional Outbox vs CDC vs Dual Write

✓ Chosen

Transactional Outbox

Event INSERT in same ACID transaction as state update. Poller publishes to Kafka. You control event schema, it's decoupled from DB schema, and no dual-write risk.

✗ Alternative

CDC (Debezium) / Dual Write

CDC couples event schema to DB schema. Dual write has no atomicity — DB write succeeds but Kafka publish fails means downstream never learns about the payment.
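
The outbox idea can be sketched with SQLite standing in for the payment database and a plain callable standing in for the Kafka producer. Table and column names are hypothetical. The state change and the event row commit together, and the poller marks a row as published only after handing it off, which gives at-least-once delivery.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO payments VALUES ('pay_1', 'initiated');
""")


def mark_authorized(db, payment_id):
    with db:  # both writes commit together or not at all
        db.execute("UPDATE payments SET status = 'authorized' WHERE id = ?", (payment_id,))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "payment.authorized", "id": payment_id}),),
        )


def poll_once(db, publish):
    rows = db.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(payload)  # may be re-sent after a crash: at-least-once delivery
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


sent = []
mark_authorized(db, "pay_1")
first_poll = poll_once(db, sent.append)
second_poll = poll_once(db, sent.append)
```

Because a crash between `publish` and the UPDATE causes a re-send, consumers need to deduplicate on the event identity.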

Single vs Multi-Processor Routing

✓ Chosen

Multi-Processor with Smart Routing

Cost optimization saves ~$310M/year at Amazon scale. Failover if one processor has an outage. Authorization rate optimization recovers 2–3% of declines via cascading.

✗ Alternative

Single Processor (e.g., Stripe only)

One integration, one contract, simple. But your SLA is capped by their reliability, and you're locked into their pricing at scale.
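
One simple way to picture cost and availability routing is to filter out unhealthy processors and order the rest by fee, with later entries acting as failover candidates. The processor names and fee numbers below are made up, and real routing would also weigh authorization rates, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    fee_bps: int        # fee in basis points; illustrative numbers only
    healthy: bool = True


def route(processors):
    """Return healthy processors ordered cheapest first."""
    candidates = [p for p in processors if p.healthy]
    if not candidates:
        raise RuntimeError("no healthy processor available")
    return sorted(candidates, key=lambda p: p.fee_bps)


processors = [
    Processor("processor_a", fee_bps=29),
    Processor("processor_b", fee_bps=25),
    Processor("processor_c", fee_bps=27, healthy=False),
]
order = [p.name for p in route(processors)]
```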

Two-Phase Auth+Capture vs Single Charge

✓ Chosen

Separate Auth + Capture

Amazon charges at shipment, not checkout. Supports partial captures for split shipments. Voids are free and instant (vs refunds costing fees and taking 5–10 days).

✗ Alternative

Single Charge Endpoint

Simpler API — one call does everything. Fine for instant-delivery businesses (coffee shop). But can't handle e-commerce fulfillment flows.

Fraud Check: Synchronous vs Async

✓ Chosen

Tiered Sync (Fast-Path Optimization)

80% of transactions skip expensive Tier 3 checks (low-risk). Average fraud latency: ~44ms instead of 80ms for all. Post-auth async review catches stragglers.

✗ Alternative

Full Sync for All / Fully Async

Full sync adds 80ms to every transaction. Fully async means you've authorized fraud and must reverse — terrible UX and costly chargebacks.

08

What Can Go Wrong

🔴 Processor Timeout (Most Common)

Charge sent to Visa, response never arrives. Customer may or may not be charged. Never retry blindly. Query-back the processor first. If query-back also fails, mark as "ambiguous" and resolve in daily reconciliation. We'd rather fail a payment than double-charge — a failed payment loses one sale; a double charge loses trust.

🔴 DB Primary Failure During Payment Write

Processor approved, DB crashes before COMMIT. With synchronous replication: if COMMIT was replicated, the new primary has the record. If not, the retry uses the same idempotency key — processor returns original result, we write it successfully on the second attempt. Without sync replication: ghost charges — customer charged but no record in our system. Catastrophic.

🟡 Redis Cluster Failure (Idempotency Layer Down)

Fall back to DB-based dedup: SELECT 1 FROM payments WHERE idempotency_key = K. Slower (~15ms vs 1ms) but functional. There is a small window where two simultaneous requests could both pass the DB check; processor idempotency (Layer 3) is the safety net. Flag all payments during the Redis outage for manual review.

🟡 Cascade Failure Under Load (Death Spiral)

Fraud service gets slow → orchestrator retries → retries double load → fraud gets slower → everything collapses. Prevention: Circuit breakers (trip at 50% failure rate, stop calling fraud entirely), retry budgets (cap at 10% of traffic), bulkhead isolation (separate thread pools for auth/capture/refund), load shedding (reject status checks before captures before auths).
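
A minimal circuit breaker along the lines described above: it tracks a rolling window of outcomes and, once the failure rate reaches 50%, stops calling the dependency and returns a fallback instead. The window size and minimum sample count are assumptions, and the half-open recovery probe that production breakers use is left out for brevity.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, window=20, failure_threshold=0.5, min_calls=10):
        self.results = deque(maxlen=window)   # rolling window of True/False outcomes
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls            # assumed: avoid tripping on tiny samples

    def record(self, success):
        self.results.append(success)

    @property
    def is_open(self):
        if len(self.results) < self.min_calls:
            return False
        failures = self.results.count(False)
        return failures / len(self.results) >= self.failure_threshold

    def call(self, fn, fallback):
        if self.is_open:
            return fallback()                 # stop calling the struggling service
        try:
            result = fn()
        except Exception:
            self.record(False)
            return fallback()
        self.record(True)
        return result


def failing():
    raise TimeoutError("fraud service slow")


breaker = CircuitBreaker()
for _ in range(10):
    breaker.call(failing, lambda: "fallback")
```

After ten consecutive failures the breaker is open, so further calls skip the fraud service entirely instead of adding to its load.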

🟡 Outbox Poller Lag

Payments process correctly but events don't reach Kafka. Webhooks stop, settlement queue isn't fed. Recovery: Poller is stateless — restart picks up pending events. Scale horizontally with partitioned polling. Monitor outbox depth: alert at 1,000 pending, page at 10,000.

🟢 Reconciliation Mismatch

Our ledger says $12,847,293; processor says $12,847,518. Diff: $225. Common causes: timing boundary (different settlement days), currency conversion drift, lost outbox events. Resolution: Automated matching catches 99.5%, automated rules resolve 80% of exceptions, remaining 0.1% goes to manual review with 48hr SLA. Double-entry ledger self-diagnoses whether mismatch is internal or external.
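
Automated matching can be pictured as a comparison of two maps keyed by transaction id, one from our ledger and one from the processor's settlement file. The record shape and the sample data are illustrative assumptions; real settlement files carry many more fields.

```python
def reconcile(ours, theirs):
    """Compare {transaction_id: amount_cents} maps from our ledger and the processor file."""
    missing_at_processor = sorted(set(ours) - set(theirs))
    missing_in_ledger = sorted(set(theirs) - set(ours))
    amount_mismatch = sorted(t for t in set(ours) & set(theirs) if ours[t] != theirs[t])
    return {
        "missing_at_processor": missing_at_processor,
        "missing_in_ledger": missing_in_ledger,
        "amount_mismatch": amount_mismatch,
        "net_difference_cents": sum(theirs.values()) - sum(ours.values()),
    }


ours = {"pay_1": 4999, "pay_2": 2999, "pay_3": 1500}
theirs = {"pay_1": 4999, "pay_2": 3099, "pay_4": 2000}
report = reconcile(ours, theirs)
```

Splitting the exceptions into three buckets matters because each has a different likely cause: timing boundaries, lost events, or conversion drift.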

🟢 Fraud Service Degradation

Tiered fallback: Level 1 (degraded, P99 > 80ms) — full check for high-risk only, fast-path low-risk with async review. Level 2 (critical, error rate > 30%) — blocklist check only, $500 hard limit, async review within 2 minutes. Level 3 (complete failure) — blocklist only (local cache), hard decline > $200, page VP of Engineering.
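
The degradation levels above can be expressed as two small functions: one maps service health to a level, the other applies that level's amount limits. The thresholds come from the text; the function names, the health inputs and the treatment of levels 0 and 1 are assumptions for illustration.

```python
def fraud_fallback_level(p99_ms, error_rate, service_up=True):
    """Map fraud-service health to the degradation level described above."""
    if not service_up:
        return 3
    if error_rate > 0.30:
        return 2
    if p99_ms > 80:
        return 1
    return 0


def allow_without_full_check(level, amount_cents, blocklisted):
    """Decide whether a payment may proceed on the reduced checks for this level."""
    if blocklisted:
        return False
    if level == 3:
        return amount_cents <= 200_00   # hard decline above $200
    if level == 2:
        return amount_cents <= 500_00   # $500 hard limit
    return True  # levels 0 and 1 still run scoring elsewhere
```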

Anti-patterns

🚫
Single DB transaction wrapping bank call

Holding row locks across an external call can block those rows for as long as the call hangs, potentially minutes.

✓ Better: Auth/capture are separate external calls; saga with idempotency key per step.
🚫
Retry failed payments blindly

Double-charge = angry customers + chargebacks.

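✓ Better: One idempotency key per intended payment action, reused on every retry of that action; dedup on (key, user). Retry only errors known to be safe, and on timeouts query-back the processor before retrying.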
🚫
Store card numbers directly

PCI nightmare + breach exposure.

✓ Better: Tokenize via provider (Stripe, Braintree); store only the token + last-4 + exp.
09

Interview Tips

💡
Lead with the auth/capture split.
Ask "Should we support separate auth and capture?" as your first clarifying question. This immediately signals domain knowledge — most candidates treat payment as a single "charge" operation. Explain why: Amazon charges at shipment, not checkout.
Make idempotency your deep dive.
Say: "The most interesting challenge is exactly-once processing. Let me walk through the three layers." Then present the crash scenario — what happens when Visa's response is lost. This demonstrates edge-case thinking and distributed systems mastery.
🎯
Draw incrementally, not all at once.
Start with 4 boxes (client → gateway → orchestrator → processor). Add each component only when a requirement demands it. "I need idempotency, so I'll add Redis here." This shows you reason from requirements, not pattern-match.
🧠
Say "CP over AP" and explain why.
"I'm choosing strong consistency for payment state because a ghost charge — where the processor charged but our system has no record — is worse than temporary unavailability. I'd rather return 503 than lose track of money."
🔥
Mention reconciliation — most candidates don't.
"Every day, we verify our ledger matches the processor's settlement file. We use double-entry bookkeeping — if debits ≠ credits, something is wrong." This shows you understand the financial dimension, not just the technical one.
⚠️
Don't forget PCI-DSS and tokenization.
"Card data never touches our main infrastructure. A client-side SDK sends card details directly to an isolated tokenization vault. Our API servers only see opaque tokens." Without this, your design fails a compliance audit on day one.
📐
Amounts in cents, never floats.
"I store amounts as integers in cents because 0.1 + 0.2 ≠ 0.3 in floating point. At 500M transactions, rounding errors compound into real reconciliation mismatches." Small detail that signals real-world experience.
11

Evolution

How this design grows from MVP to planet-scale. Every stage is triggered by a specific pain point — not by anticipating scale you don't have yet. The only things you must get right from day one: the API contract (auth/capture split) and the idempotency guarantee.

1

MVP — Single Merchant, Single Processor (10 TPS)

Single API server → PostgreSQL → Stripe. Idempotency via DB unique constraint. No Redis, no Kafka, no outbox. Use Stripe's tokenization (Elements) so you never see card data. Critical: Build auth/capture split and idempotency contract from day one — these API decisions are near-impossible to change later.

2

Growing — Multiple Merchants (500 TPS)

Add Redis for idempotency (DB unique constraint doesn't scale past ~200 TPS). Add webhook delivery + transactional outbox (merchants need push notifications). Build fraud rules engine after first chargeback incident. Separate token vault to reduce Stripe per-transaction fees. System splits into sync path and async path — this separation is permanent.

3

Scaling — Multi-Processor (5K TPS)

Add multi-processor routing (cost optimization + failover after first processor outage). Build double-entry ledger and automated daily reconciliation. Shard DB by merchant_id. ML-based fraud scoring replaces rules engine. Add circuit breakers and retry budgets after first cascade failure.

4

Reliable — Multi-Region (20K TPS)

Active-active across 3 regions with CockroachDB for global strong consistency. PCI-DSS Level 1 validation (card brands require it for merchants above roughly 6M transactions per year, and for service providers at considerably lower volumes). 3D Secure / SCA for European PSD2 compliance. Dedicated dispute management system. Regional fraud models, because fraud patterns differ by geography.

5

Planet-Scale — Financial Platform (50K+ TPS)

Payment platform as a product — other companies build on your gateway. Direct card network integrations bypassing third-party processors. Real-time streaming settlement replacing daily batch. Embedded finance: lending, BNPL, insurance built on payment data. AI-powered fraud that adapts in real-time.
