Metrics & Monitoring

1M hosts each running an agent that collects CPU, memory, disk, and custom application metrics every 10-15 seconds. That is ~100M data points/sec flowing into your ingest pipeline. The hard parts: a time-series database that stores 10M unique series with columnar compression (delta-of-delta timestamps, XOR-encoded floats), a query engine that fans out across time-partitioned shards and returns aggregated results in <1 second over weeks of data, and an alert evaluator that runs thousands of threshold and anomaly-detection rules every 15-60 seconds without missing a single breach. Datadog, Prometheus, New Relic, VictoriaMetrics — same architecture, different trade-offs.

Core: Agents + TSDB + Query Engine + Alerting1M hosts10M unique series100M pts/sec ingestSub-second queries
02

Requirements

Functional
  • Agents on every host collect system + custom app metrics (CPU, mem, disk, request latency, queue depth, etc.)
  • Batch ingest via POST /metrics/ingest — each payload: metric_name, tags, value, timestamp
  • Flexible query language: avg(cpu.usage{host=X, region=us-east}) over arbitrary time ranges
  • Dashboard builder: users compose panels of queries, refresh live every 10-30 seconds
  • Alerting: threshold rules (CPU > 90% for 5 min) and anomaly detection (rolling mean + 3 sigma)
  • Notification channels: PagerDuty, Slack, email, webhooks on alert fire/resolve
  • Tag-based filtering: every metric has key-value tags (host, region, env, service) for slicing
  • Downsampling: automatic resolution reduction for older data to bound storage growth
Non-Functional
  • Ingest throughput: 100M data points/sec sustained
  • Query latency: < 1 second for 1-hour range over 10K series
  • Cardinality: handle 10M unique time series without OOM
  • Retention: 7 days full resolution, 30 days at 5-min, 1 year at 1-hour
  • Alert evaluation latency < 60 seconds end-to-end (metric emitted to alert fired)
  • 99.9% ingest availability — dropping metrics = blind spot during incidents
  • Horizontally scalable: add writer/reader nodes to handle more series and queries
  • Multi-tenant isolation: one org's cardinality bomb must not affect other tenants
  • Durable: no data loss on single-node failure; replicated writes across availability zones
03

Scale Estimation

Reporting hosts
~1M
each running an agent pushing metrics every 10-15 sec
Unique time series
~10M
each combo of (metric_name, tag_key=tag_value) is one series
Ingest rate
~100M pts/sec
1M hosts x ~100 metrics/host x 1 sample/10s
Query latency
< 1 sec
fan-out across time shards, merge, return
Storage (7-day raw)
~60 TB
100M pts/sec x 10 bytes x 86400s x 7 / compression ~10x
Alert rules
~50K
evaluated every 15-60 sec; each runs a TSDB query
Compression ratio
~12:1
Gorilla encoding: ~1.37 bytes/point vs 16 bytes raw (8B ts + 8B float)
Concurrent dashboards
~10K
each auto-refreshing every 10-30 sec = ~500 queries/sec sustained
04

API Design

POST/api/v1/metrics/ingest

Batch ingest. Body: [{metric_name, tags: {host, region, ...}, value: float, timestamp: epoch_sec}, ...]. Agents batch 100-500 samples per request. Returns {accepted: N}. Authenticated via API key per org.

GET/api/v1/query?expr=avg(cpu.usage{host=X})&start=T1&end=T2&step=60

Time-series query. Expression uses PromQL-like syntax: aggregation functions, label matchers, rate/increase transforms. Returns {series: [{tags, datapoints: [[ts, val], ...]}]}.

POST/api/v1/alerts

Create alert rule. Body: {name, expr: "avg(cpu.usage{env=prod}) > 90", for: "5m", severity: "critical", notify: ["pagerduty://team-infra", "slack://#alerts"]}. Supports threshold + anomaly detection rules.

GET/api/v1/dashboards/{id}

Fetch dashboard definition. Returns {panels: [{title, queries: [...], viz_type: "line|heatmap|gauge"}]}. Frontend renders and auto-refreshes every 10-30 sec.

GET/api/v1/alerts/{id}/history?start=T1&end=T2

Alert state history. Returns [{timestamp, state: "ok|pending|firing|resolved", value}]. Useful for post-incident review: "when did the alert first fire? how long was it pending?"

GET/api/v1/metadata/metrics?search=cpu

Search available metric names and their tag keys. Returns [{metric_name, type: "gauge|counter|histogram", tags: ["host","region",...], series_count}]. Powers autocomplete in dashboard builder.

05

Architecture

Two primary data paths: ingest path (agents push metrics through a gateway into Kafka, then TSDB writers compress and store in time-bucketed blocks) and query path (user queries fan out to TSDB readers across time shards, merge results). A separate alert evaluator runs on a cron loop, executing queries and firing notifications on threshold breach.

Ingest path detail: Each agent batches 100-500 samples and POSTs to the ingest gateway. The gateway validates the payload (metric name format, tag count limits, cardinality checks), authenticates via API key, and publishes to Kafka. Kafka is partitioned by hash(metric_name + sorted_tags) so all samples for the same series land on the same partition — this gives TSDB writers write locality. Writers consume from Kafka, buffer in memory for the current time block (2 hours), and periodically flush compressed chunks to disk.

Query path detail: The query gateway parses the expression, resolves label matchers against the metadata index to find matching series IDs, determines which time blocks overlap the requested time range, and fans out read requests to TSDB readers that own those blocks. Each reader decompresses the relevant chunks, computes partial aggregates, and streams results back. The gateway merges partial results and returns the final answer. For very wide time ranges, the gateway automatically routes to downsampled data.

Alert path detail: The alert evaluator is a stateful cron service that reads rules from a Postgres-backed rules DB. Each evaluation cycle, it partitions rules across evaluator instances (sharded by rule ID hash for horizontal scaling). Each instance runs its assigned rules as TSDB queries, tracks state transitions (ok/pending/firing/resolved), and publishes firing alerts to the notification service. The notification service handles deduplication, grouping (multiple related alerts into one page), and channel routing.

Metrics & Monitoring ArchitectureSVG
Agents (1M)CPU, mem, custom Ingest GatewayLB + auth + validate Kafkapartition by metric hash TSDB Writersencode + compress TSDB Storagetime-bucketed blockscolumnar, Gorilla enc Metadata Indexseries ID → tags lookup Usersdashboards + queries Query Gatewayparse + plan + route TSDB Readersfan-out time shards Alert Evaluatorcron 15-60s cycle Alert Rules DBthreshold + anomaly Notification svcPagerDuty/Slack/email Compaction + Downsampling: merge blocks, reduce resolution for older data
Request Flow — Step Through
Agent · collects metricsIngest Gateway · LB + auth + validateKafka · partition by metric hashTSDB Writer · Gorilla encode + writeTSDB Storage · time-bucketed blocksQuery Gateway · fan-out + mergeAlert Evaluator · cron checks rules
Click Next Step to walk through the request flow.
06

Deep Dive — TSDB Storage, Cardinality & Alerting

(a) TSDB storage model. Data is partitioned by time into blocks — typically 2-hour windows. Within each block, series are stored in columnar format with aggressive compression:

-- Gorilla encoding (Facebook, 2015):
-- Timestamps: delta-of-delta encoding
--   t0=1000, t1=1010, t2=1020, t3=1030
--   deltas: 10, 10, 10 → delta-of-delta: 0, 0, 0 → 2 bits each
-- Values: XOR encoding for IEEE 754 floats
--   v0=72.5, v1=72.6 → XOR = small diff → encode leading/trailing zeros
-- Result: ~1.37 bytes per data point (vs 16 bytes raw)

Compaction merges small blocks into larger ones periodically — for example, six 2-hour blocks merge into one 12-hour block with a single sorted index. This reduces the number of blocks the query engine must open for wide time ranges. Downsampling reduces resolution for older data: 1-min resolution for 7 days, 5-min for 30 days, 1-hour for 1 year. Downsampling pre-computes min/max/sum/count per interval so that queries on old data can still compute accurate aggregates. This keeps storage bounded while preserving long-term trends for capacity planning and SLA reporting.

(b) Cardinality explosion. Each unique combination of (metric_name, tag_key=tag_value) creates one time series. If someone adds user_id as a tag with 1M unique values — instant 1M series per metric. This kills ingestion rate, bloats the metadata index, and causes OOM on the TSDB.

// Bad: http.requests{user_id=U123456} → 1M series
// Good: http.requests{endpoint=/api/users, status=200} → ~500 series
// Solution: cardinality limit per metric (e.g., max 10K series)
//   + tag-value sampling for high-cardinality dimensions
//   + reject metrics that exceed cardinality budget at ingest gateway

(c) Alert evaluation loop. Every 15-60 seconds, the alert evaluator reads all active rules from the rules DB, executes each as a TSDB query, and compares the result against the threshold. For anomaly detection: compute rolling mean + standard deviation over a training window (e.g., same hour last 7 days), alert if current value exceeds mean + 3 sigma. The evaluator maintains state for each rule: pending (condition met but not yet for the required duration), firing (condition sustained past the for duration), and resolved (condition no longer met). State transitions trigger notifications.

// Alert evaluation pseudocode (runs every eval_interval):
// for each rule in active_rules:
//   result = tsdb.query(rule.expr, now - rule.window, now)
//   if result > rule.threshold:
//     if rule.state == "pending" && elapsed > rule.for_duration:
//       rule.state = "firing"
//       notify(rule.channels, "FIRING", result)
//     elif rule.state == "ok":
//       rule.state = "pending"; rule.pending_since = now
//   else:
//     if rule.state == "firing":
//       notify(rule.channels, "RESOLVED", result)
//     rule.state = "ok"
Metric Lifecycle — Ingest to AlertMermaid
sequenceDiagram participant A as Agent (host) participant IG as Ingest Gateway participant K as Kafka participant W as TSDB Writer participant T as TSDB Storage participant QG as Query Gateway participant AE as Alert Evaluator participant N as Notification svc A->>IG: POST /metrics/ingest (batch) IG->>K: publish to partition(metric_hash) K->>W: consume batch W->>T: encode (Gorilla) + write to 2h block Note over T: time-bucketed columnar storage AE->>QG: query "avg(cpu.usage{env=prod})" QG->>T: fan-out to 1h of time blocks T-->>QG: compressed chunks QG-->>AE: avg = 94.2% AE->>AE: 94.2 > threshold 90 for 5m? AE->>N: FIRE critical alert N->>N: PagerDuty + Slack + email

(d) Query fan-out and merge. A query like avg(cpu.usage{env=prod})[1h] requires reading data from multiple time blocks (e.g., if block size is 2 hours, a 1-hour query touches 1 block). But across 10K hosts, each block has 10K series entries. The query gateway identifies relevant series via the metadata index (inverted index: tag=value maps to series IDs), then fans out read requests to TSDB readers that own the relevant blocks. Each reader decompresses its chunk, computes a partial aggregate, and returns it. The gateway merges partial results into the final answer.

(e) Metadata index design. The metadata index is an inverted index: for each tag key-value pair, it stores the set of series IDs that contain that tag. Lookups are intersections of posting lists — identical to a search engine. For cpu.usage{host=web-01, region=us-east}, intersect the posting list for host=web-01 with region=us-east to find matching series IDs. This index lives in memory for fast lookup and is persisted to disk for durability.

// Inverted index structure:
// metric_name="cpu.usage"  → [series_1, series_2, ..., series_N]
// host="web-01"            → [series_1, series_47, series_902]
// region="us-east"         → [series_1, series_2, series_47, ...]
//
// Query: cpu.usage{host=web-01, region=us-east}
// → intersect([series_1,s_2,...], [s_1,s_47,s_902], [s_1,s_2,s_47,...])
// → [series_1]  ← only series matching ALL labels
Interview answer

"Agents on every host batch metrics and POST to an ingest gateway. The gateway publishes to Kafka partitioned by metric hash. TSDB writers consume batches, encode with Gorilla compression (delta-of-delta timestamps, XOR floats — ~1.37 bytes/point), and write to 2-hour time-bucketed blocks. Queries fan out across relevant time blocks, decompress, aggregate, and return in <1 second. Alert evaluator runs on a 15-60 second cron, executes each rule as a TSDB query, and fires notifications via PagerDuty/Slack when thresholds are breached. Cardinality is bounded by per-metric limits at the ingest gateway."

07

Anti-patterns

-
Store every metric data point as a row in Postgres

Write amplification: each INSERT updates indexes, WAL, MVCC. No columnar compression. At 100M pts/sec you would need 1000+ Postgres nodes. Querying 1 hour of data scans billions of rows with no time-locality optimization. B-tree indexes on timestamp don't help — you still read full rows.

Better: Purpose-built TSDB with time-bucketed columnar storage and Gorilla encoding. 10x compression, 100x faster range scans. Time-partitioned blocks allow dropping old data by deleting a directory.
-
Allow unbounded tag cardinality (user_id, request_id as metric tags)

1M unique user_ids as a tag = 1M time series per metric. Metadata index OOMs. The inverted index (tag-value to series-ID mapping) grows unboundedly. Ingestion slows as every new data point creates a new series. Queries matching on that metric must scan millions of posting lists.

Better: Cardinality limits per metric (e.g., max 10K series). Reject or sample high-cardinality tag values at the ingest gateway. High-cardinality dimensions belong in distributed tracing or logs, not metrics.
-
Alert on raw noisy metrics without smoothing

CPU spikes to 95% for 3 seconds, drops to 40%, spikes again. Raw threshold alert fires and resolves every 30 seconds — flapping. On-call engineer mutes the alert after the third page. When a real sustained incident happens, the muted alert never fires. Alert fatigue kills incident response.

Better: Alert on sustained conditions — "avg(cpu.usage) > 90% for 5 minutes." Use the for clause to require the condition to hold across multiple evaluation cycles. For anomaly detection, use rolling mean + 3 sigma over historical baselines (same hour, same day-of-week).
08

Tradeoffs & Design Choices

  • Push (Datadog) vs Pull (Prometheus). Push: agents POST metrics to a central gateway. Scales better for ephemeral containers (short-lived pods don't need to be discovered). Pull: monitoring server scrapes registered targets. Simpler for static fleets — no agent config needed, server controls scrape interval. Hybrid: Prometheus push-gateway for short-lived jobs.
  • Pre-aggregation vs raw storage. Pre-agg (compute avg/p99 at ingest) saves storage and speeds queries, but loses the ability to re-aggregate differently later. Raw storage preserves flexibility — you can compute any aggregation at query time — but costs 10-100x more storage. Most systems store raw for recent data, pre-agg for older data.
  • Purpose-built TSDB (Prometheus, VictoriaMetrics) vs general OLAP (ClickHouse). TSDB: optimized for time-range scans, Gorilla compression, built-in downsampling. OLAP: more flexible query language, handles higher cardinality, but less compression for regular time-series patterns. Datadog uses a custom TSDB; Uber Observability uses ClickHouse.
  • Single global TSDB vs per-region shards. Global: simpler queries across regions. Per-region: lower ingest latency, data locality, but cross-region queries require fan-out and merge. Most production systems use per-region with a global query aggregator.
  • Gorilla compression vs general-purpose (LZ4/Snappy). Gorilla achieves ~1.37 bytes/point for regular time series. LZ4 on raw floats gets ~4-5 bytes/point. Gorilla wins for uniform-interval numeric data; LZ4 wins for irregular or string data.
  • Alert on metrics vs alert on logs. Metric alerts are cheap (evaluate one number against a threshold). Log-based alerts require full-text search per evaluation — 100x more expensive. Use metrics for known failure modes (CPU, latency, error rate); logs for unknown/exploratory debugging.
  • In-memory hot tier vs disk-only. Keep the most recent block (current 2-hour window) in memory for fast writes and low-latency queries. Older blocks on SSD. Oldest blocks on object storage (S3). Tiered storage balances cost with query performance.
  • PromQL vs SQL for metrics. PromQL is purpose-built for time-series: rate(), histogram_quantile(), label-based selection are first-class. SQL requires verbose CTEs and window functions for the same operations. But SQL is universally known — ClickHouse and TimescaleDB bet on SQL familiarity lowering the learning curve. Trade-off: expressiveness vs accessibility.
  • Separate alert evaluator vs co-located with query engine. Separate process: can scale independently, failure doesn't affect user queries. Co-located: lower latency (no network hop for rule queries), simpler deployment. Most production systems use a separate evaluator for isolation.
09

Interview Tips

  1. Start with the data model. "A time series is identified by (metric_name + set of tag key-value pairs). Each series has a stream of (timestamp, float_value) tuples." This grounds everything.
  2. Name Gorilla encoding explicitly. "Delta-of-delta for timestamps, XOR for float values — from Facebook's 2015 paper. Gets ~1.37 bytes/point vs 16 bytes raw." Shows depth beyond "we use a TSDB."
  3. Cardinality is the interview differentiator. Most candidates describe ingest and query. Few mention cardinality explosion. Say: "The most dangerous production issue is unbounded tag cardinality — one bad deploy tagging with request_id creates millions of series and OOMs the cluster."
  4. Separate ingest path from query path. Draw them as independent pipelines that share TSDB storage. This shows you understand read/write isolation — writers don't block readers.
  5. Address "who monitors the monitoring?" Dead-man's switch: alert evaluator emits a heartbeat; a separate watchdog in a different failure domain checks it. This is the kind of operational nuance that impresses.
  6. Mention push vs pull trade-off proactively. "Prometheus pulls, Datadog pushes. Push is better for ephemeral containers. Pull is simpler for static fleets. Modern systems support both via OpenTelemetry Collector."
10

Failure Modes

-
Ingest pipeline lag (Kafka consumer behind)
TSDB writers can't keep up with ingest rate. Kafka lag grows. Dashboards show data from 10 minutes ago. Users think systems are healthy when they're not.
Mitigation: auto-scale writer pool based on Kafka consumer lag. Drop oldest unprocessed data if lag exceeds threshold (better to have recent data than complete history).
-
TSDB disk full — compaction stopped
Old blocks not compacted or downsampled. Disk fills up. Writers halt. New metrics dropped.
Mitigation: retention policy auto-deletes blocks older than TTL. Monitoring on disk usage with alert at 80%. Separate disk for WAL vs blocks.
-
Cardinality bomb from a bad deploy
A new deploy tags every metric with request_id (unique per request). 100K requests/sec = 100K new series/sec. Metadata index OOMs. Ingest fails for all tenants.
Mitigation: per-metric cardinality limit enforced at ingest gateway. Circuit breaker: if a metric exceeds 10K series in 1 minute, reject new tag combos. Alert the team that deployed it.
-
Alert evaluator crash — silent failure
Alert evaluator process dies. No alerts fire. Production goes down but nobody gets paged. The monitoring system itself has no monitoring.
Mitigation: dead-man's switch — alert evaluator emits a heartbeat metric every cycle. A separate watchdog (different failure domain) alerts if heartbeat is missing. "Who watches the watchmen?"
-
Query timeout on wide time ranges
User queries "avg CPU over 90 days" across 10K hosts. Query fans out to thousands of blocks. TSDB readers OOM or time out. Meanwhile, the query gateway holds the connection open, consuming resources.
Mitigation: query cost estimator rejects queries that would scan > N blocks. For wide ranges, use downsampled data automatically (1-hour resolution for >30 days). Query timeout + circuit breaker. Provide a "max query cost" per user/org.
-
Agent flood after network partition heals
Network partition isolates 100K hosts for 10 minutes. When it heals, all 100K agents retry simultaneously with buffered data. Ingest gateway sees a 10x spike. Kafka partitions lag. Dashboards show a gap followed by a spike.
Mitigation: agents use jittered retry with exponential backoff. Ingest gateway applies per-agent rate limiting. Kafka partitions are sized for 2-3x normal peak. Agents can drop oldest buffered data if buffer exceeds memory limit — recent data is more valuable.
-
Multi-tenant noisy neighbor
One org sends 10x their normal volume (new deploy with verbose metrics). Their traffic saturates shared Kafka partitions and TSDB writers. Other orgs experience ingest lag and stale dashboards.
Mitigation: per-org rate limits at the ingest gateway. Dedicated Kafka topics or partitions for large tenants. Usage-based quotas with burst allowance. Alert on per-org ingestion rate anomalies.
12

Evolution

1

Nagios + RRDtool (early 2000s)

Static host list, manual config per check. RRDtool stores fixed-size round-robin databases — automatically downsamples old data by overwriting. Works for 50 servers with manual configuration. Falls apart at 500 hosts: every new host requires editing a config file and restarting the daemon.

2

Graphite + StatsD (2010s)

Push model via UDP. StatsD aggregates counters/timers/gauges client-side, ships pre-aggregated values to Graphite. Whisper DB stores one file per metric (fixed-size, round-robin). Better than Nagios, but single-node storage bottleneck. No native clustering — Carbon Relay adds sharding but no replication.

3

Prometheus + PromQL (2015+)

Pull-based scraping of /metrics HTTP endpoints. Local TSDB with Gorilla-inspired compression. PromQL provides powerful query language with rate(), histogram_quantile(), and label-based filtering. Single-node by design; Thanos and Cortex add horizontal scaling, long-term storage on S3, and global query view across clusters.

4

Managed SaaS — Datadog / New Relic (2018+)

Push model with hosted agents and auto-instrumentation. Managed TSDB at massive scale, ML-based anomaly detection, correlation across metrics/traces/logs. Zero operational burden for the customer. Cost scales with data volume — can get expensive at 10M+ series.

5

OpenTelemetry convergence (2023+)

Unified SDK for metrics + traces + logs — one instrumentation library instead of three vendor-specific ones. Vendor-neutral collection pipeline (OTel Collector) with processors and exporters. Any backend (Prometheus, Datadog, Grafana Cloud, Honeycomb) via config change. The industry standard going forward, ending vendor lock-in at the collection layer.

Next up