Exercise · Social & Feed

Ad Click Aggregator

Whiteboard exercise. Try the problem cold, then reveal the rubric to self-score.

Out of 10 points45 min whiteboardReference solution →
01

Prompt

10 billion clicks per day, counted in real-time for billing and analytics. Every click is money. The hard parts: a streaming aggregation pipeline that groups clicks by campaign, ad, geo, and minute with exactly-once semantics, inline fraud detection that filters bot clicks, click farms, and SDK spoofing before they pollute billing, and an OLAP query layer that answers "how many clicks did campaign X get in the last 24 hours?" in under a second. Google Ads, Meta Ads, Amazon DSP — same pipeline, different scale.

Time budget: 45 min whiteboard. Draw architecture, estimate numbers, discuss tradeoffs.

02

Hints (progressive — click to reveal)

Hint 1

Lead with the pipeline. "Kafka ingest, inline fraud ML, Flink streaming aggregation, ClickHouse OLAP, Postgres billing." Name each component and why it's there. This shows you can design end-to-end streaming systems.

Hint 2

Exactly-once is the money question. "Flink checkpoints + Kafka consumer offsets + ClickHouse ReplacingMergeTree for idempotent writes." Double-counting = overcharging = lawsuit. Under-counting = revenue leak. Get this right.

Hint 3

Fraud detection is not optional. In ad tech, ~30% of clicks are fraudulent. If you skip fraud detection, your billing system is fundamentally broken. Name the three types: bot fingerprint, click farm, SDK spoofing.

03

Rubric — 10 points

  • +2 Lead with the pipeline. "Kafka ingest, inline fraud ML, Flink streaming aggregation, ClickHouse OLAP, Postgres billing." Name each component and why it's there. This shows you can design end-to-end streaming systems.
  • +2 Exactly-once is the money question. "Flink checkpoints + Kafka consumer offsets + ClickHouse ReplacingMergeTree for idempotent writes." Double-counting = overcharging = lawsuit. Under-counting = revenue leak. Get this right.
  • +2 Fraud detection is not optional. In ad tech, ~30% of clicks are fraudulent. If you skip fraud detection, your billing system is fundamentally broken. Name the three types: bot fingerprint, click farm, SDK spoofing.
  • +2 Explain the late event strategy. "Watermark-based windowing with 5-min allowed lateness. Beyond 5 min: dead-letter queue + nightly batch reconciliation." This shows you understand real-world event-time challenges.
  • +1 Distinguish storage tiers. "Raw clicks to S3 (audit), aggregates to ClickHouse (analytics), billing rows to Postgres (invoicing)." Each store is chosen for its strength. Using one store for everything fails.
  • +1 Know your numbers. 10B clicks/day = 115K/sec avg. 500 bytes/click = 5 TB/day raw. Pre-aggregated = ~100 MB/day. These numbers justify every architecture choice.

Self-score: tally the points you would have mentioned unprompted. 7+ is interview-ready on this problem.

04

Red flags (things that tank the interview)

  • Count all clicks including bots — "we'll filter later"
  • Aggregate in batch hourly only — "real-time isn't needed"
  • Store raw clicks in Postgres — "it's our standard DB"