Whiteboard exercise. Try the problem cold, then reveal the rubric to self-score.
Out of 10 points · 60 min whiteboard
01
Prompt
1M hosts each run an agent that collects CPU, memory, disk, and custom application metrics every 10-15 seconds. That is ~100M data points/sec flowing into your ingest pipeline. The hard parts: a time-series database that stores 10M unique series with columnar compression (delta-of-delta timestamps, XOR-encoded floats), a query engine that fans out across time-partitioned shards and returns aggregated results in under 1 second over weeks of data, and an alert evaluator that runs thousands of threshold and anomaly-detection rules every 15-60 seconds without missing a single breach. Datadog, Prometheus, New Relic, and VictoriaMetrics all follow broadly this architecture with different trade-offs.
Time budget: 60 min whiteboard. Draw the architecture, estimate numbers, discuss trade-offs.
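For the "estimate numbers" step, the arithmetic can be sketched as below. This is a rough back-of-envelope calculation, not a sizing guide. The value `SERIES_PER_HOST = 1_000` is an assumption chosen so that the total matches the prompt's ~100M points/sec at a 10-second interval; the 16-byte raw size and ~1.37 bytes/point compressed size are the figures quoted in the hints.

```python
def points_per_second(hosts: int, series_per_host: int, interval_s: float) -> float:
    """Ingest rate: every series on every host emits one point per interval."""
    return hosts * series_per_host / interval_s


def bytes_per_day(rate_pps: float, bytes_per_point: float) -> float:
    """Storage written per day at a given ingest rate and point size."""
    return rate_pps * bytes_per_point * 86_400


HOSTS = 1_000_000        # from the prompt
INTERVAL_S = 10          # lower end of the prompt's 10-15 s range
SERIES_PER_HOST = 1_000  # assumption, not stated in the prompt

rate = points_per_second(HOSTS, SERIES_PER_HOST, INTERVAL_S)
raw_per_day = bytes_per_day(rate, 16)           # 8-byte timestamp + 8-byte float
compressed_per_day = bytes_per_day(rate, 1.37)  # Gorilla-style average

print(f"{rate:,.0f} points/sec")
print(f"{raw_per_day / 1e12:.1f} TB/day raw")
print(f"{compressed_per_day / 1e12:.1f} TB/day compressed")
```

Under these assumptions the sketch gives 100M points/sec, roughly 138 TB/day raw and about 12 TB/day compressed. The same formula applied to 10M series at a 10-second interval gives 1M points/sec, so the prompt's series count and ingest rate likely describe different scopes; it is worth stating on the whiteboard which one you are sizing for.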
02
Hints (progressive)
Hint 1
Start with the data model. "A time series is identified by (metric_name + set of tag key-value pairs). Each series has a stream of (timestamp, float_value) tuples." This grounds everything.
Hint 2
Name Gorilla encoding explicitly. "Delta-of-delta for timestamps, XOR for float values — from Facebook's 2015 paper. Gets ~1.37 bytes/point vs 16 bytes raw." Shows depth beyond "we use a TSDB."
Hint 3
Cardinality is the interview differentiator. Most candidates describe ingest and query. Few mention cardinality explosion. Say: "The most dangerous production issue is unbounded tag cardinality — one bad deploy tagging with request_id creates millions of series and OOMs the cluster."
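To make Hint 2 concrete, here is a minimal sketch of the two transforms behind Gorilla-style encoding. It only computes the values that the encoder would then bit-pack (delta-of-delta for timestamps, XOR of consecutive float bit patterns); the variable-length bit packing itself is omitted, and all function names are illustrative.

```python
import struct


def delta_of_deltas(timestamps: list[int]) -> list[int]:
    """Second-order differences; perfectly regular intervals yield zeros."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(x: float) -> int:
    """IEEE 754 double reinterpreted as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values: list[float]) -> list[int]:
    """XOR of each value's bits with the previous value's bits."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


def meaningful_bits(x: int) -> int:
    """Width of the span from the highest to the lowest set bit (0 if x == 0)."""
    if x == 0:
        return 0
    trailing_zeros = (x & -x).bit_length() - 1
    return x.bit_length() - trailing_zeros


# Scrapes roughly every 10 s with a little jitter.
print(delta_of_deltas([1000, 1010, 1020, 1031, 1040]))  # [0, 1, -2]

# A repeated value XORs to 0; a nearby value differs in only a few bits.
for x in xor_stream([12.0, 12.0, 24.0, 15.5]):
    print(f"{x:016x} meaningful_bits={meaningful_bits(x)}")
```

The point to make on the whiteboard is that both transforms turn typical metric streams into mostly zeros or short bit spans, which is what lets the average cost per point fall far below the 16 raw bytes.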
03
Rubric — 10 points
+2 Start with the data model. "A time series is identified by (metric_name + set of tag key-value pairs). Each series has a stream of (timestamp, float_value) tuples." This grounds everything.
+2 Name Gorilla encoding explicitly. "Delta-of-delta for timestamps, XOR for float values — from Facebook's 2015 paper. Gets ~1.37 bytes/point vs 16 bytes raw." Shows depth beyond "we use a TSDB."
+2 Cardinality is the interview differentiator. Most candidates describe ingest and query. Few mention cardinality explosion. Say: "The most dangerous production issue is unbounded tag cardinality — one bad deploy tagging with request_id creates millions of series and OOMs the cluster."
+2 Separate ingest path from query path. Draw them as independent pipelines that share TSDB storage. This shows you understand read/write isolation — writers don't block readers.
+1 Address "who monitors the monitoring?" Dead-man's switch: alert evaluator emits a heartbeat; a separate watchdog in a different failure domain checks it. This is the kind of operational nuance that impresses.
+1 Mention the push vs pull trade-off proactively. "Prometheus pulls, Datadog pushes. Push is better for ephemeral containers. Pull is simpler for static fleets. Modern systems support both via the OpenTelemetry Collector."
Self-score: tally the points you would have mentioned unprompted. 7+ is interview-ready on this problem.
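The dead-man's switch from the rubric can be sketched as a small watchdog that tracks the evaluator's last heartbeat. This is an illustrative sketch only: the class name, the `max_silence_s` parameter, and the choice to treat "never heard a heartbeat" as dead are assumptions, and time is passed in explicitly so the logic is easy to test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Watchdog:
    """Runs in a different failure domain from the alert evaluator."""

    max_silence_s: float                      # hypothetical tolerance before paging
    last_heartbeat_s: Optional[float] = None  # None until the first heartbeat

    def record_heartbeat(self, now_s: float) -> None:
        """Called whenever the evaluator's heartbeat arrives."""
        self.last_heartbeat_s = now_s

    def evaluator_is_dead(self, now_s: float) -> bool:
        """True if the evaluator has been silent for too long."""
        if self.last_heartbeat_s is None:
            return True  # assumption: no heartbeat yet counts as a failure
        return now_s - self.last_heartbeat_s > self.max_silence_s


watchdog = Watchdog(max_silence_s=120)
watchdog.record_heartbeat(now_s=100)
print(watchdog.evaluator_is_dead(now_s=200))  # False: 100 s of silence
print(watchdog.evaluator_is_dead(now_s=221))  # True: 121 s of silence
```

The key design choice is inversion: the watchdog alerts on the absence of a signal, so it still fires when the evaluator, its queue, or its whole region is down.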
04
Red flags (things that tank the interview)
Storing every metric data point as a row in Postgres
Allowing unbounded tag cardinality (user_id, request_id as metric tags)
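One way to guard against the cardinality red flag is to cap the number of distinct series per metric at ingest. The sketch below is a simplified, in-memory illustration: `series_key`, `CardinalityLimiter`, and the per-metric limit are hypothetical names and choices, and a production system would more likely use approximate counting and per-tenant limits.

```python
from collections import defaultdict


def series_key(metric: str, tags: dict[str, str]) -> str:
    """Canonical series identity: metric name plus tags sorted by key."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric}{{{tag_part}}}"


class CardinalityLimiter:
    """Rejects new series once a metric reaches its series budget."""

    def __init__(self, max_series_per_metric: int) -> None:
        self.max_series_per_metric = max_series_per_metric
        self._seen: dict[str, set[str]] = defaultdict(set)

    def admit(self, metric: str, tags: dict[str, str]) -> bool:
        key = series_key(metric, tags)
        seen = self._seen[metric]
        if key in seen:
            return True   # existing series: always accepted
        if len(seen) >= self.max_series_per_metric:
            return False  # new series over budget: drop, and ideally alert
        seen.add(key)
        return True


limiter = CardinalityLimiter(max_series_per_metric=2)
for request_id in ("r1", "r2", "r3"):
    print(limiter.admit("http.latency", {"request_id": request_id}))
```

With a limit of 2, the first two `request_id` values are admitted and the third is rejected, while points for already-known series keep flowing. That containment is the behavior to describe: a bad deploy degrades one metric instead of exhausting memory across the cluster.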