Exercise · Infrastructure

Metrics & Monitoring

Whiteboard exercise. Try the problem cold, then reveal the rubric to self-score.

Out of 10 points · 60 min whiteboard

Prompt

1M hosts each running an agent that collects CPU, memory, disk, and custom application metrics every 10-15 seconds. That is ~100M data points/sec flowing into your ingest pipeline. The hard parts: a time-series database that stores 10M unique series with columnar compression (delta-of-delta timestamps, XOR-encoded floats), a query engine that fans out across time-partitioned shards and returns aggregated results in <1 second over weeks of data, and an alert evaluator that runs thousands of threshold and anomaly-detection rules every 15-60 seconds without missing a single breach. Datadog, Prometheus, New Relic, VictoriaMetrics — same architecture, different trade-offs.

Time budget: 60 min whiteboard. Draw architecture, estimate numbers, discuss tradeoffs.
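For the "estimate numbers" part, a back-of-envelope calculation along these lines may help. It is only a sketch: the host count and ingest rate come from the prompt, the 16-byte raw and ~1.37-byte compressed point sizes come from the Gorilla figures quoted in the hints, and the 10-second interval and the function name `estimate` are assumptions made for illustration.

```python
# Back-of-envelope sizing for the prompt's numbers (illustrative sketch only).
HOSTS = 1_000_000               # from the prompt
POINTS_PER_SEC = 100_000_000    # from the prompt
INTERVAL_S = 10                 # assumption: lower end of the 10-15 s interval
RAW_BYTES = 16                  # 8-byte timestamp + 8-byte float
GORILLA_BYTES = 1.37            # average bytes/point quoted for Gorilla encoding
SECONDS_PER_DAY = 86_400


def estimate():
    """Return implied per-host series count and daily storage in TB."""
    series_per_host = POINTS_PER_SEC * INTERVAL_S / HOSTS
    raw_tb_per_day = POINTS_PER_SEC * RAW_BYTES * SECONDS_PER_DAY / 1e12
    compressed_tb_per_day = POINTS_PER_SEC * GORILLA_BYTES * SECONDS_PER_DAY / 1e12
    return {
        "series_per_host": series_per_host,
        "raw_tb_per_day": raw_tb_per_day,
        "compressed_tb_per_day": compressed_tb_per_day,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions, 100M points/sec works out to about 1,000 series per host, roughly 138 TB/day raw and roughly 12 TB/day compressed. A thousand series per host also implies on the order of a billion active series fleet-wide, so it is worth saying out loud how that relates to the 10M unique series figure in the prompt.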

Hints (progressive)

Hint 1

Start with the data model. "A time series is identified by (metric_name + set of tag key-value pairs). Each series has a stream of (timestamp, float_value) tuples." This grounds everything.
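One possible way to make that data model concrete is sketched below. The class and method names (`SeriesKey`, `InMemoryStore`, `append`) are hypothetical and chosen for illustration; a real TSDB would use a compact series ID and an inverted index on tags rather than a Python dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesKey:
    """Identity of a time series: metric name plus an unordered set of tags."""
    metric_name: str
    tags: frozenset  # frozenset of (tag_key, tag_value) pairs


def series_key(metric_name, tags):
    # A frozenset makes the key independent of tag ordering.
    return SeriesKey(metric_name, frozenset(tags.items()))


class InMemoryStore:
    """Toy store: maps each series key to its (timestamp, value) stream."""

    def __init__(self):
        self._series = {}

    def append(self, metric_name, tags, timestamp, value):
        key = series_key(metric_name, tags)
        self._series.setdefault(key, []).append((timestamp, float(value)))

    def points(self, metric_name, tags):
        return list(self._series.get(series_key(metric_name, tags), []))

    def series_count(self):
        return len(self._series)
```

Two writes with the same metric name and the same tags, in any order, land in the same series; changing any tag value creates a new series, which is exactly where cardinality comes from.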

Hint 2

Name Gorilla encoding explicitly. "Delta-of-delta for timestamps, XOR for float values — from Facebook's 2015 paper. Gets ~1.37 bytes/point vs 16 bytes raw." Shows depth beyond "we use a TSDB."
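To see why this compresses so well, here is a simplified sketch of the two transforms. It computes only the delta-of-delta and XOR values and deliberately omits the variable-length bit-packing that the real encoding applies afterwards; the function names are made up for this example.

```python
import struct


def delta_of_delta(timestamps):
    """Second-order differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(x):
    """Reinterpret a float as its 64-bit IEEE 754 integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """XOR each value's bit pattern with the previous value's."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


if __name__ == "__main__":
    # Regular 10 s scrapes with one late sample: mostly zeros.
    print(delta_of_delta([1000, 1010, 1020, 1030, 1041]))
    # A value that rarely changes: XOR is zero until it moves.
    print([hex(x) for x in xor_stream([42.0, 42.0, 42.0, 42.5])])
```

With a regular collection interval most delta-of-delta values are 0, and a slowly changing metric XORs to 0 against its predecessor. The encoder spends a single bit on each of those zero cases, which is how the average gets close to one byte per point.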

Hint 3

Cardinality is the interview differentiator. Most candidates describe ingest and query. Few mention cardinality explosion. Say: "The most dangerous production issue is unbounded tag cardinality — one bad deploy tagging with request_id creates millions of series and OOMs the cluster."
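A common mitigation is to cap the number of distinct series per metric at ingest. The following is a minimal sketch of that idea, not any particular vendor's implementation; `CardinalityGuard`, its `admit` method, and the limit value are hypothetical.

```python
class CardinalityGuard:
    """Rejects new series for a metric once a per-metric limit is reached."""

    def __init__(self, max_series_per_metric):
        self.max_series_per_metric = max_series_per_metric
        self._seen = {}  # metric name -> set of tag sets already admitted

    def admit(self, metric_name, tags):
        tag_set = frozenset(tags.items())
        seen = self._seen.setdefault(metric_name, set())
        if tag_set in seen:
            return True   # existing series keep flowing
        if len(seen) >= self.max_series_per_metric:
            return False  # new series would exceed the cap: drop it
        seen.add(tag_set)
        return True


if __name__ == "__main__":
    guard = CardinalityGuard(max_series_per_metric=3)
    # Simulate a bad deploy that tags every point with a unique request_id.
    accepted = sum(
        guard.admit("http.latency", {"request_id": str(i)}) for i in range(10)
    )
    print(f"accepted {accepted} of 10 new series")
```

The design choice here is that series admitted before the limit was hit continue to be accepted, so a cardinality explosion degrades only the offending metric rather than taking down ingest for everything else.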

Rubric — 10 points

+2 Start with the data model. "A time series is identified by (metric_name + set of tag key-value pairs). Each series has a stream of (timestamp, float_value) tuples." This grounds everything.
+2 Name Gorilla encoding explicitly. "Delta-of-delta for timestamps, XOR for float values — from Facebook's 2015 paper. Gets ~1.37 bytes/point vs 16 bytes raw." Shows depth beyond "we use a TSDB."
+2 Cardinality is the interview differentiator. Most candidates describe ingest and query. Few mention cardinality explosion. Say: "The most dangerous production issue is unbounded tag cardinality — one bad deploy tagging with request_id creates millions of series and OOMs the cluster."
+2 Separate ingest path from query path. Draw them as independent pipelines that share TSDB storage. This shows you understand read/write isolation — writers don't block readers.
+1 Address "who monitors the monitoring?" Dead-man's switch: alert evaluator emits a heartbeat; a separate watchdog in a different failure domain checks it. This is the kind of operational nuance that impresses.
+1 Mention push vs pull trade-off proactively. "Prometheus pulls, Datadog pushes. Push is better for ephemeral containers. Pull is simpler for static fleets. Modern systems support both via the OpenTelemetry Collector."
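The dead-man's switch from the rubric can be sketched in a few lines. This is an illustrative outline with hypothetical names (`Watchdog`, `heartbeat`, `evaluator_is_silent`); time is passed in explicitly so the logic is easy to test, and in practice the watchdog would run in a separate failure domain and page through a channel that does not depend on the monitoring system itself.

```python
class Watchdog:
    """Flags the alert evaluator as silent if its heartbeat goes stale."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat = None  # no heartbeat seen yet

    def heartbeat(self, now):
        """Called whenever the evaluator's heartbeat arrives."""
        self.last_heartbeat = now

    def evaluator_is_silent(self, now):
        """True if no heartbeat has arrived within the timeout."""
        if self.last_heartbeat is None:
            return True
        return now - self.last_heartbeat > self.timeout_s


if __name__ == "__main__":
    dog = Watchdog(timeout_s=120)
    dog.heartbeat(now=1_000)
    print(dog.evaluator_is_silent(now=1_060))  # within the timeout
    print(dog.evaluator_is_silent(now=1_200))  # heartbeat is stale
```

The key property is that the alert fires on the absence of a signal, so a crashed or partitioned evaluator cannot suppress it.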

Self-score: tally the points you would have mentioned unprompted. 7+ is interview-ready on this problem.

Red flags (things that tank the interview)

Store every metric data point as a row in Postgres
Allow unbounded tag cardinality (user_id, request_id as metric tags)
Alert on raw noisy metrics without smoothing
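As a counterpoint to the last red flag, here is one simple way to smooth before alerting: evaluate the threshold against a moving average instead of the latest raw sample. This is a sketch under assumed parameters; `SmoothedThresholdRule`, the window size, and the threshold are hypothetical, and real systems may instead require a condition to hold for a minimum duration.

```python
from collections import deque


class SmoothedThresholdRule:
    """Fires when the moving average over a full window exceeds a threshold."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self._samples = deque(maxlen=window)  # keeps only the last `window` samples

    def observe(self, value):
        """Record a sample and return True if the rule is firing."""
        self._samples.append(value)
        if len(self._samples) < self._samples.maxlen:
            return False  # not enough data to judge yet
        mean = sum(self._samples) / len(self._samples)
        return mean > self.threshold


if __name__ == "__main__":
    rule = SmoothedThresholdRule(threshold=90, window=3)
    for cpu in [50, 50, 99, 95, 96]:
        print(cpu, rule.observe(cpu))
```

In this run the single spike to 99 does not fire because the window average is still low; the rule fires only once the high readings are sustained across the whole window.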