(a) TSDB storage model. Data is partitioned by time into blocks — typically 2-hour windows. Within each block, series are stored in columnar format with aggressive compression:
-- Gorilla encoding (Facebook, 2015):
-- Timestamps: delta-of-delta encoding
-- t0=1000, t1=1010, t2=1020, t3=1030
-- deltas: 10, 10, 10 → delta-of-delta: 0, 0, 0 → 2 bits each
-- Values: XOR encoding for IEEE 754 floats
-- v0=72.5, v1=72.6 → XOR = small diff → encode leading/trailing zeros
-- Result: ~1.37 bytes per data point (vs 16 bytes raw)
Compaction merges small blocks into larger ones periodically — for example, six 2-hour blocks merge into one 12-hour block with a single sorted index. This reduces the number of blocks the query engine must open for wide time ranges. Downsampling reduces resolution for older data: 1-min resolution for 7 days, 5-min for 30 days, 1-hour for 1 year. Downsampling pre-computes min/max/sum/count per interval so that queries on old data can still compute accurate aggregates. This keeps storage bounded while preserving long-term trends for capacity planning and SLA reporting.
(b) Cardinality explosion. Each unique combination of (metric_name, tag_key=tag_value) creates one time series. If someone adds user_id as a tag with 1M unique values — instant 1M series per metric. This kills ingestion rate, bloats the metadata index, and causes OOM on the TSDB.
// Bad: http.requests{user_id=U123456} → 1M series
// Good: http.requests{endpoint=/api/users, status=200} → ~500 series
// Solution: cardinality limit per metric (e.g., max 10K series)
// + tag-value sampling for high-cardinality dimensions
// + reject metrics that exceed cardinality budget at ingest gateway
(c) Alert evaluation loop. Every 15-60 seconds, the alert evaluator reads all active rules from the rules DB, executes each as a TSDB query, and compares the result against the threshold. For anomaly detection: compute rolling mean + standard deviation over a training window (e.g., same hour last 7 days), alert if current value exceeds mean + 3 sigma. The evaluator maintains state for each rule: pending (condition met but not yet for the required duration), firing (condition sustained past the for duration), and resolved (condition no longer met). State transitions trigger notifications.
// Alert evaluation pseudocode (runs every eval_interval):
// for each rule in active_rules:
// result = tsdb.query(rule.expr, now - rule.window, now)
// if result > rule.threshold:
// if rule.state == "pending" && elapsed > rule.for_duration:
// rule.state = "firing"
// notify(rule.channels, "FIRING", result)
// elif rule.state == "ok":
// rule.state = "pending"; rule.pending_since = now
// else:
// if rule.state == "firing":
// notify(rule.channels, "RESOLVED", result)
// rule.state = "ok"
Metric Lifecycle — Ingest to AlertMermaid
sequenceDiagram
participant A as Agent (host)
participant IG as Ingest Gateway
participant K as Kafka
participant W as TSDB Writer
participant T as TSDB Storage
participant QG as Query Gateway
participant AE as Alert Evaluator
participant N as Notification svc
A->>IG: POST /metrics/ingest (batch)
IG->>K: publish to partition(metric_hash)
K->>W: consume batch
W->>T: encode (Gorilla) + write to 2h block
Note over T: time-bucketed columnar storage
AE->>QG: query "avg(cpu.usage{env=prod})"
QG->>T: fan-out to 1h of time blocks
T-->>QG: compressed chunks
QG-->>AE: avg = 94.2%
AE->>AE: 94.2 > threshold 90 for 5m?
AE->>N: FIRE critical alert
N->>N: PagerDuty + Slack + email
(d) Query fan-out and merge. A query like avg(cpu.usage{env=prod})[1h] requires reading data from multiple time blocks (e.g., if block size is 2 hours, a 1-hour query touches 1 block). But across 10K hosts, each block has 10K series entries. The query gateway identifies relevant series via the metadata index (inverted index: tag=value maps to series IDs), then fans out read requests to TSDB readers that own the relevant blocks. Each reader decompresses its chunk, computes a partial aggregate, and returns it. The gateway merges partial results into the final answer.
(e) Metadata index design. The metadata index is an inverted index: for each tag key-value pair, it stores the set of series IDs that contain that tag. Lookups are intersections of posting lists — identical to a search engine. For cpu.usage{host=web-01, region=us-east}, intersect the posting list for host=web-01 with region=us-east to find matching series IDs. This index lives in memory for fast lookup and is persisted to disk for durability.
// Inverted index structure:
// metric_name="cpu.usage" → [series_1, series_2, ..., series_N]
// host="web-01" → [series_1, series_47, series_902]
// region="us-east" → [series_1, series_2, series_47, ...]
//
// Query: cpu.usage{host=web-01, region=us-east}
// → intersect([series_1,s_2,...], [s_1,s_47,s_902], [s_1,s_2,s_47,...])
// → [series_1] ← only series matching ALL labels
Interview answer
"Agents on every host batch metrics and POST to an ingest gateway. The gateway publishes to Kafka partitioned by metric hash. TSDB writers consume batches, encode with Gorilla compression (delta-of-delta timestamps, XOR floats — ~1.37 bytes/point), and write to 2-hour time-bucketed blocks. Queries fan out across relevant time blocks, decompress, aggregate, and return in <1 second. Alert evaluator runs on a 15-60 second cron, executes each rule as a TSDB query, and fires notifications via PagerDuty/Slack when thresholds are breached. Cardinality is bounded by per-metric limits at the ingest gateway."