Distributed Logging Framework

Design a centralized logging system that ingests hundreds of thousands of events per second from thousands of services, stores them durably with compression, and enables full-text search across billions of log lines in seconds.

Write-HeavyAppend-OnlyFull-Text SearchStreamingTiered Storage
01

Problem Statement

Think of a single server running your app. Something goes wrong — a user can't check out. You ssh in, run tail -f /var/log/app.log, find the error, fix it. Simple.

Now your company grows to 2,000 microservices across 100,000 containers. A user reports "I placed an order but never got a confirmation email." To debug this, you'd need to trace what happened across the API Gateway, Auth Service, Order Service, Payment Service, and Notification Service — each running on dozens of servers. You don't know which specific server handled this user's request.

What Breaks at Scale

Ephemeral Infrastructure

On Kubernetes, containers are killed and restarted constantly. When a container dies, its local logs vanish. The evidence is destroyed before you even know there's a bug.

No Correlation

Even if you could search all servers, logs have different formats, clock skew, and no way to link "this auth log" to "this payment log" for the same user request.

No Proactive Detection

You're only looking at logs after a user complains. If the payment service has been throwing errors for 20 minutes, nobody knows without centralized alerting.

Compliance & Audit

Regulations (SOX, GDPR, PCI-DSS) require retaining certain logs for months or years. Individual servers can't guarantee that — disks fill up, servers get decommissioned.

Core question: How do you build a system that ingests 500K+ log events per second, stores them durably for weeks to years, and lets engineers search across all of them in seconds?

Why This Is Architecturally Interesting

This is fundamentally a write-heavy, append-only, time-series-ish problem with a full-text search requirement bolted on. Most systems optimize for either writes OR search, not both. The competing forces:

  • Insane write volume — 500K+ events/sec. Most databases buckle under this write load.
  • Full-text search — Engineers want to search for specific error messages, user IDs buried in log lines, stack trace fragments across petabytes.
  • Time-series nature — Almost every query includes a time range, so storage must be optimized for time-partitioned access.
  • 100:1 write-to-read ratio — Most logs are never read, but when they are (during an incident at 3am), search must be fast.
  • Cost pressure — At 20+ TB/day raw data, the difference between indexing everything vs indexing selectively is millions of dollars per year.

What We're NOT Building

Not a metrics system (numeric time-series like CPU = 73% — different storage model). Not distributed tracing (request flow across services via trace IDs — overlaps but has its own data model). Not a SIEM (security analysis layer on top of logs). These are all related "pillars of observability" — but each warrants its own design.

02

Requirements

Functional Requirements

  • Log Ingestion — Accept log events from thousands of services via lightweight agents on each host. Each log carries: timestamp, service name, severity level, message body, and arbitrary key-value metadata.
  • Full-Text Search — Query logs by time range + structured filters (service, severity) + keyword search in message body. Support boolean operators, phrase queries, and wildcard matching.
  • Live Tail — Real-time streaming of logs matching a filter, like tail -f across the entire fleet. Must work with zero indexing delay.
  • Alerting Rules — Define threshold-based rules (e.g., "alert if service=payments AND level=ERROR exceeds 100/min"). Evaluate in real-time, fire notifications via PagerDuty/Slack.
  • Retention Policies — Hot tier (fast search, 7 days), Warm tier (slower search, 30 days), Cold tier (archive only, 1 year+), with automatic tiering.
  • Multi-Tenancy — Team-level isolation and per-tenant ingestion quotas to prevent noisy-neighbor problems.

Non-Functional Requirements

  • Write Throughput: 500K events/sec sustained, 2M/sec burst — this is a write-dominated system
  • Ingestion-to-Searchable Latency: < 30 seconds — near real-time, not real-time (live tail handles the real-time case)
  • Search Latency: P99 < 5 seconds for queries spanning 1 hour of data
  • Durability: Zero log loss once acknowledged by the ingestion layer
  • Availability: 99.9% — the logging system must be more available than the systems it monitors
  • Storage Efficiency: Compression is survival — logs compress 10:1, which is the difference between manageable and bankrupt

Key tension: 500K writes/sec demands append-only async ingestion, but full-text search demands pre-built indexes. The entire architecture navigates this conflict.

Requirements Conflict Matrix

RequirementPushes TowardConflicts With
500K writes/secAppend-only, batched, async indexingSearch latency (async = delay)
Full-text searchInverted index (Elasticsearch-style)Storage cost, write throughput
<30s to searchableFrequent index flushesEfficient segment sizes
DurabilityKafka replication, acks=allWrite latency (more acks = slower)
Cost efficiencyCompress everything, index lessSearch speed (less index = slower)
Live tailDirect Kafka consumptionSeparate infrastructure path
03

Scale Estimation

Everything derived from first principles. Starting assumptions: 2,000 microservices, each running ~50 instances, each emitting ~5 log lines/sec. Average log event size: 500 bytes.

500K/sec
Sustained write load
2M/sec
Peak burst (4x)
43B/day
Events per day
21.6 TB/day
Raw data volume
2.2 TB/day
Compressed (10:1)
250 MB/sec
Ingestion bandwidth
100:1
Write-to-read ratio
~$12K/mo
Storage cost (all tiers)

Storage Per Tier

TierDurationDailyTotalMediaEst. Cost/Mo
Hot7 days2.8 TB~20 TBNVMe SSD~$3,000
Warm30 days2.2 TB~66 TBHDD / EBS~$3,300
Cold365 days2.2 TB~800 TBS3 / GCS~$5,600

Kafka Sizing

Derivation

At 250 MB/sec ingest, each broker handles ~100 MB/sec → 6 brokers minimum. With 24h retention and replication factor 3: 21.6 TB × 3 = 64.8 TB total Kafka storage. ~200 partitions across the topic, partitioned by service name with overflow hashing for hot services. Three independent consumer groups: Indexer, Tail Service, Alert Evaluator.

Indexer Sizing

Derivation

Each indexer instance handles ~75K events/sec (tokenization + index building). Sustained: 500K ÷ 75K = 7 instances minimum. Peak capacity (2M/sec): 27 instances. Fleet sized at 30 indexers with auto-scaling. Each flushes a segment every 30 seconds containing ~2.25M events (~112 MB compressed), producing ~20,000 segments/day requiring background compaction.

Key insight: At 500K writes/sec, this system is an order of magnitude beyond what any single database can handle. The architecture isn't chosen by preference — it's forced by the numbers.

04

API Design

Three core APIs serving three different personas: the log agent (machine, write-heavy), the debugging engineer (human, search-heavy), and the engineer watching live logs (human, streaming).

POST /v1/ingest — Log Ingestion (The Firehose)
POST /v1/ingest
Authorization: Bearer <service-api-key>
Content-Type: application/json
Content-Encoding: zstd

{
  "source": {
    "service": "payment-service",
    "host": "ip-10-0-3-42",
    "region": "us-east-1",
    "environment": "production"
  },
  "logs": [
    {
      "ts": "2025-06-15T14:23:01.123Z",
      "level": "ERROR",
      "msg": "Failed to charge card ending 4242: CARD_DECLINED",
      "meta": {
        "trace_id": "abc-123-def",
        "user_id": "u_98765",
        "error_code": "CARD_DECLINED",
        "latency_ms": 342
      }
    }
    // ... up to 1000 events per batch
  ]
}

→ 202 Accepted
{ "accepted": 1000, "failed": 0, "lag_hint": "normal" }

Why 202 Accepted?

Not 201 Created. The log is in Kafka (durable) but not yet searchable. 202 honestly communicates: "received and durably stored, but processing is ongoing."

Why Batch?

At 500K events/sec, individual HTTP calls would mean 500K TCP connections/sec. Batching 1,000 events per call reduces this to 500 calls/sec — trivial.

Why Separate Source?

All logs in a batch share the same source metadata. Extracting it saves ~100 bytes × 1,000 events = 100 KB per batch. At 500 batches/sec, that's 50 MB/sec saved.

Why lag_hint?

Cooperative backpressure. When the system is under pressure, agents can increase batch sizes, reduce flush frequency, or drop DEBUG-level logs.

POST /v1/search — Full-Text Search
POST /v1/search
Authorization: Bearer <user-token>

{
  "query": "card declined timeout",
  "filters": {
    "service": ["payment-service", "checkout-service"],
    "level": ["ERROR", "WARN"],
    "meta.trace_id": "abc-123-def"
  },
  "time_range": {
    "from": "2025-06-15T14:00:00Z",
    "to": "2025-06-15T15:00:00Z"
  },
  "sort": "desc",
  "limit": 100,
  "cursor": null
}

→ 200 OK
{
  "hits": 847,
  "scanned": 1240000,
  "took_ms": 1230,
  "logs": [ ... ],
  "next_cursor": "eyJsYXN0X3RzIjoi..."
}

Why POST for search?

Complex queries with nested filters and long query strings hit URL length limits. POST body has no practical size limit — same approach Elasticsearch uses.

Cursor-based pagination

Offset pagination ("skip 5000 results") is wasteful at depth. Cursor pagination uses an opaque token to seek directly to the next position. Every page is equally fast.

POST /v1/aggregate — Dashboard Queries
POST /v1/aggregate
{
  "time_range": { "from": "...", "to": "..." },
  "filters": { "environment": "production" },
  "group_by": ["service", "level"],
  "interval": "5m",
  "metric": "count"
}

→ Buckets of counts per group per time interval

Separate endpoint because execution is fundamentally different — aggregation reads only structured columns, never touching message bodies. Pre-computed rollup tables can answer these queries reading just a few hundred rows instead of scanning billions of events.

GET /v1/tail — Live Tail (SSE Stream)
GET /v1/tail?service=payment-service&level=ERROR
Authorization: Bearer <user-token>
Accept: text/event-stream

→ Server-Sent Events stream:
data: {"ts":"...","service":"payment-service","level":"ERROR","msg":"Failed to charge..."}
data: {"ts":"...","service":"payment-service","level":"ERROR","msg":"Timeout connecting..."}
: heartbeat

Why SSE over WebSocket?

Unidirectional (server pushes to client — no data sent back after connection). Auto-reconnection built into the spec. HTTP-native — works through all proxies and load balancers without special configuration. Heartbeats every 15 seconds keep connections alive.

POST /v1/alerts — Alert Rule Management
POST /v1/alerts
{
  "name": "Payment errors spike",
  "condition": {
    "filter": { "service": "payment-service", "level": "ERROR" },
    "threshold": 100, "window": "5m", "comparison": "above"
  },
  "actions": [
    { "type": "pagerduty", "severity": "critical" },
    { "type": "slack", "channel": "#payments-oncall" }
  ],
  "cooldown": "15m"
}

Supporting: GET/PUT/DELETE /v1/alerts/{id}, POST /v1/alerts/{id}/mute

Authentication Model

APIMethodPatternAuthThroughput
IngestPOSTRequest/ResponseAPI key (service)500K events/sec
SearchPOSTRequest/ResponseUser token (OAuth)500 queries/min
AggregatePOSTRequest/ResponseUser token100 queries/min
TailGETSSE streamUser token200 connections
AlertsCRUDRequest/ResponseUser tokenLow

Key insight: The ingestion API and search API are completely decoupled. They share no synchronous dependency. Kafka and the indexer are the asynchronous bridge between them.

05

High-Level Architecture

Every component exists because the numbers demanded it. The pipeline flows left-to-right: Agents → Kafka → Consumers (Indexer, Tail, Alerts) → Storage → Query Service → UI.

Agents 100K hosts Gateway Validate + Route Kafka Durable Buffer 6 brokers, RF=3 Indexer 30 workers Tail Service SSE → Clients Alert Eval Sliding windows Hot Store NVMe SSD, 7d Warm Store HDD, 30d Cold Store S3/GCS, 365d Query Service Scatter-Gather Metadata DB PostgreSQL Tier Manager Compact + Move HTTP+zstd Produce Consume Consume Consume Flush Age out Age out Read

Component Roles

Agents (100K hosts)

Lightweight daemons on every host. Watch log files, batch events, compress with zstd, ship to gateway via HTTP. Buffer locally on disk during outages. Priority-based dropping when buffer is full (DEBUG first, ERROR last).

Ingestion Gateway

Stateless HTTP service. Validates batches, normalizes timestamps, enforces per-tenant rate limits, publishes to Kafka. Shields Kafka from 100K direct agent connections.

Kafka (6+ brokers)

The backbone. Decouples producers from consumers. 24-hour retention for replay. Three independent consumer groups: Indexer, Tail Service, Alert Evaluator. Partitioned by service name with overflow hashing.

Indexer (30 workers)

Consumes from Kafka, tokenizes messages, builds inverted index segments with roaring bitmaps, compresses with zstd, flushes every 30 seconds. The most compute-intensive component.

Index Store (Hot/Warm/Cold)

Hot: NVMe SSD with full inverted index (7 days). Warm: HDD with bloom filters only (30 days). Cold: S3 archive (365 days). Segments replicated to 2 nodes.

Query Service

Stateless scatter-gather. Consults metadata DB to find relevant segments, prunes via bloom filters, fans out parallel sub-queries to index nodes, merges results. Hard timeout at 30 seconds.

Tail Service

Reads directly from Kafka (bypasses index). Evaluates in-memory filters per connected client. Pushes matches via SSE. Compiled filter trees for efficient evaluation at 500K events/sec × 200 clients.

Alert Evaluator

Another Kafka consumer. Maintains sliding window counters per alert rule (2.4 KB per rule). Fires notifications when thresholds are breached. Runs in hot-standby pair for HA.

Failure Isolation

Component DownImpactWhy Contained
One agentThat host's logs buffer locallyAgents are independent
Gateway instanceAgents retry, other instances serveStateless, load-balanced
One Kafka brokerZero — replicas serveRF=3, min.insync=2
Indexer fleetSearch shows stale dataKafka buffers 24 hours
Index store nodeQueries route to replicaSegments replicated 2x
Query serviceSearch down, ingestion fineCompletely separate path
Tail serviceLive tail down, search fineIndependent consumer

Core principle: No single component failure stops log ingestion. Kafka absorbs downstream failures. The worst case is stale search results, never data loss.

Request Flow — Step Through
App (stdout)AgentGatewayKafkaIndexerIndex StoreQuery ServiceEngineer
Click Next Step to walk through the request flow.
06

Deep Dive — Log Storage & Indexing

The single hardest problem: how do you store 43 billion log events per day and make any of them findable in seconds?

Why Naive Approaches Collapse

PostgreSQL?

~50K inserts/sec max per node. You'd need 50+ shards. And LIKE '%connection refused%' is a full table scan on 43B rows — hours per query.

Just grep files?

Works for one service ("grep this 150 MB file" = 1.5 sec). But "find 'timeout' across ALL services" means grepping 3 TB of raw data. 30+ seconds even parallelized.

The Inverted Index — Core Data Structure

Instead of "for each log, list its words," flip it: for each word, list which logs contain it.

Building an inverted index from 4 log events
Event 0: "Payment failed: card declined by processor"
Event 1: "Connection timeout to payment processor"
Event 2: "User login successful from Dubai"
Event 3: "Payment retry succeeded after card reauthorization"

Inverted Index:
  payment     → [0, 1, 3]    card       → [0, 3]
  failed      → [0]          timeout    → [1]
  processor   → [0, 1]       login      → [2]
  connection  → [1]          dubai      → [2]

Query: "payment card"
  1. Look up "payment" → [0, 1, 3]
  2. Look up "card"    → [0, 3]
  3. Intersect (AND)   → [0, 3]  ← found without scanning any text!

The Segment Structure

We don't build one giant index. Instead, we build many small self-contained segments, each covering a 30-second time window from one indexer (~2.25M events, ~112 MB compressed).

segment-20250615-142300-idx07-0042/
├── meta.json           ← time range, services, event count
├── columns/
│   ├── timestamp.col   ← delta-encoded sorted timestamps
│   ├── service.col     ← dictionary-encoded service names
│   └── level.col       ← dictionary-encoded levels
├── index/
│   ├── terms.fst       ← FST term dictionary (~200 KB)
│   └── postings.dat    ← roaring bitmap posting lists
├── bloom/
│   ├── service.bloom   ← "contains payment-service?" (~60 KB)
│   └── terms.bloom     ← "contains word 'timeout'?"
└── raw/
    └── messages.zst    ← block-addressable compressed messages

Key Data Structures

FST (Finite State Transducer) — Term Dictionary

Compressed trie mapping each term to its posting list offset. 50,000 terms in ~200 KB. Supports prefix queries ("pay*" → payment, payload, payee) at zero extra cost. Same structure used by Lucene/Elasticsearch.

Roaring Bitmaps — Posting Lists

Compressed lists of event IDs per term. Divide ID space into 65,536-ID chunks, use optimal encoding per chunk (sorted array for sparse, bitmap for dense, run-length for consecutive). 5–20× compression over raw arrays. AND/OR/NOT operations work directly on compressed form — no decompression needed.

Block-Addressable Compression

Raw messages compressed in ~10,000-event blocks. Each block decompressible independently. To fetch event 45,231, decompress only block 4 (~500 KB) instead of the entire 112 MB file.

Bloom Filters — First Line of Defense

Probabilistic "does this segment contain X?" check. ~60 KB per segment. If bloom says "no," skip the entire segment. Eliminates 90%+ of segments before any real work. Zero false negatives.

Query Execution — End to End

sequenceDiagram participant E as Engineer participant QS as Query Service participant MD as Metadata DB participant N1 as Index Node 1 participant N2 as Index Node 2 participant N3 as Index Node 3 E->>QS: POST /v1/search (payment-service, ERROR, "timeout", last 1h) QS->>MD: Which segments cover 14:00-15:00? MD-->>QS: 1,200 segments across 8 nodes Note over QS: Bloom filter pruning: 1,200 → 40 segments (97% reduction) par Parallel fan-out QS->>N1: Search 12 segments QS->>N2: Search 15 segments QS->>N3: Search 13 segments end Note over N1,N3: Per segment: columnar filter → inverted index lookup → bitmap AND → fetch matches N1-->>QS: 230 matching events N2-->>QS: 412 matching events N3-->>QS: 205 matching events Note over QS: Merge-sort by timestamp, return top 100 QS-->>E: 847 hits, 1,230ms, paginated results

The Three Schools — Compared

DimensionElasticsearchGrafana LokiClickHouse
PhilosophyIndex every wordIndex only labels, grep contentColumnar + bloom filters
Full-text speedExcellent (ms)Poor (seconds)Good (sub-second)
Structured filtersGoodExcellentExcellent
Write throughput~100K/sec/node~500K/sec/node~300K/sec/node
Storage costHigh (index overhead)Very lowLow
AggregationModeratePoorExcellent

Our Choice: The Hybrid Approach

Hot tier (7 days): Full inverted index. 95% of queries hit recent data — sub-second full-text search.
Warm/Cold tier: Label index + bloom filters + chunk scan (Loki-style). Slower (5–30s) but dramatically cheaper. Queries on old data are rare and less urgent.

Segment Lifecycle

PhaseAgeIndex TypeSearch SpeedAction
BirthT=0Full inverted + columnar<1sIndexer flushes to hot NVMe
Hot0–7dFull inverted + columnar<1sServes most queries
Compaction7dStripped to bloom only<30sMerge 60 segments → 1, move to warm
Warm7–30dBloom + label index<30sLarger compacted segments on HDD
Cold30–365dMetadata onlyMinutesUpload to S3, on-demand download
Delete>365dPurge from S3 + metadata DB

Compaction — Why It Matters

Hot tier produces ~20,000 segments/day. Without merging, queries would fan out across thousands of tiny files. Compaction merges segments within time windows: Level 0 (raw, 30s) → Level 1 (10 min) → Level 2 (1 hour) → Level 3 (1 day for warm). Same concept as LSM-tree compaction in LevelDB/RocksDB.

07

Key Design Decisions & Tradeoffs

Push vs Pull Collection

✓ Chosen

Push (Agents Send)

Agents batch, compress, and push logs to the gateway. Near-instant latency. Adding new hosts requires zero central configuration. Agents handle local buffering and backpressure.

✗ Alternative

Pull (System Scrapes)

Central collector polls 100K hosts. Inherent polling delay. 50K poll requests/sec just for discovery. Each host needs read-cursor state. Breaks down at log scale (works for Prometheus metrics because volumes are lower).

Kafka Buffer vs Direct-to-Store

✓ Chosen

Kafka Intermediate Buffer

Decouples ingestion from indexing. If indexer is down, zero logs lost (Kafka buffers 24h). Enables multiple independent consumers (indexer, tail, alerts). Supports replay for re-indexing.

✗ Alternative

Direct-to-Store

Fewer components, ~10ms less latency. But gateway becomes coupled to indexer speed. No multiple consumers. No replay. A slow indexer blocks ingestion, which blocks agents, which drops logs.

Indexing Depth — The Big Decision

✓ Chosen

Hybrid: Full Index (Hot) + Bloom (Warm/Cold)

Full inverted index on 7-day hot tier covers 95% of queries with sub-second search. Warm/cold uses cheap bloom+scan. Saves ~600 GB/day of index storage compared to indexing everything. Best balance of speed and cost.

✗ Alternatives

Full Index Everywhere / No Index

Full index everywhere: 3–6 TB/day of index overhead, massive cost. No index (Loki-style): 10x cheaper but broad full-text search takes 30+ seconds. Both extremes have significant downsides.

Partitioning: Time-First vs Service-First

✓ Chosen

Time-First Partitioning

Almost every query includes a time range — "last hour" immediately narrows to ~1,200 segments. Retention is time-based (delete all before date X = trivial). Compaction is natural by time window.

✗ Alternative

Service-First Partitioning

Optimal for single-service queries. But cross-service queries (trace_id=abc) require scanning every service partition. Complicates retention and compaction. Bloom filters mitigate the per-service pruning gap.

SSE vs WebSocket for Live Tail

✓ Chosen

Server-Sent Events (SSE)

Unidirectional (server → client). Auto-reconnection built into spec. HTTP-native — works through all proxies. Simpler server-side, fewer failure modes. Heartbeats keep connections alive.

✗ Alternative

WebSocket

Full-duplex, allows dynamic filter updates without reconnecting. But adds WebSocket upgrade complexity, manual reconnect logic, proxy compatibility issues — all for a feature (dynamic filters) that's rarely used.

Sync vs Async Indexing

✓ Chosen

Async Indexing (~30s Lag)

Ingestion layer is simple and fast (accept, validate, Kafka). Indexer batch-processes 2.25M events per segment — far more efficient than one-at-a-time. If indexer crashes, ingestion continues unaffected.

✗ Alternative

Synchronous Indexing

Logs searchable within 1 second. But requires massive Elasticsearch clusters at 500K events/sec. Refresh operations become bottlenecks during bursts. In-memory buffer increases data loss risk on crash.

Compression Algorithm

✓ Chosen

Zstd (Dictionary-Trained)

10–15:1 compression on logs at 200 MB/sec compress speed. Dictionary trained on log patterns learns common phrases, timestamp formats, field names. Best balance of ratio and speed.

✗ Alternatives

LZ4 / Gzip

LZ4: faster (500 MB/sec) but only 3–4:1 — adds 3.2 TB/day extra storage. Gzip: good ratio (5–6:1) but painfully slow (50 MB/sec) — 70% of indexer CPU on compression alone.

Shared vs Isolated Multi-Tenancy

✓ Chosen

Shared Storage + Logical Isolation

One Kafka cluster, one indexer fleet, one index store. Resource-efficient (unused capacity available to all). Simpler operations (1 cluster, not N). Enables cross-tenant queries for platform teams.

✗ Alternative

Physical Isolation Per Tenant

Perfect blast-radius isolation. A noisy tenant can never impact another. But N× operational burden. Unused capacity per tenant is wasted. Cross-tenant queries require fan-out across clusters.

08

What Can Go Wrong

Kafka Consumer Lag — Indexer Falls Behind

Cause: Write burst (bad deploy), indexer crash, slow storage. Impact: Logs appear in live tail (reads from Kafka) but not in search — confusing and erodes trust. Detection: Monitor kafka_consumer_lag_seconds, alert if >60s for >5min. Handling: Short-term: Kafka buffers 24h, indexer auto-catches up. Medium: auto-scale indexer fleet (Kafka rebalances partitions). Nuclear: enter degraded mode — skip inverted index, build only label indexes (3× throughput at reduced search quality).

Agent Disk Buffer Full — Silent Log Loss

Cause: Network partition, all gateways down. Agent buffers to local disk (1 GB cap). Impact: Most dangerous failure — logs silently disappear. No error in the logging UI. Detection: Agent heartbeats every 30s to separate health-check endpoint. Gap detection (if payment-service drops from 1K/sec to 0, something's wrong). On recovery, agent reports dropped count. Mitigation: Priority-based dropping — DEBUG first, then INFO, keep ERROR as long as possible.

Index Store Node Failure

Cause: Hardware failure, OOM kill, disk corruption. Impact: With replication factor 2, queries route to replica seamlessly — user doesn't notice. Recovery: Re-sync when node returns. If permanently lost, copy segments from surviving replicas. Worst case (both replicas lost): rebuild from Kafka (if within 24h) or cold storage.

Query of Death — Pathological Query Takes Down Query Layer

Cause: Broad unfiltered query ("search *.* for last 7 days"), misconfigured dashboard widget, automated script. Impact: All search queries slow down — logging UI unresponsive during incidents when it's needed most. Handling: 5 layers of defense: (1) query admission control (reject if >10K segments), (2) per-query resource limits (30s timeout, 4 GB memory, 10M events max), (3) per-user concurrency limits (max 10 concurrent), (4) circuit breaker (shed load at high error rate), (5) query quarantine (block known-bad patterns).

Kafka Cluster Failure — The Worst Case

Impact: Entire pipeline stops. Agents buffer locally, no new data in search, tail goes silent, alerts stop. Prevention: Brokers across 3+ AZs with rack-aware replication. Rolling upgrades one broker at a time. N+2 capacity. Recovery: Agents drain local buffers when Kafka recovers. Indexer catches up from committed offsets. Full recovery is automatic. DR option: Secondary Kafka cluster via MirrorMaker 2 in different region.

Hot Storage Full — Indexer Stalls

Cause: Unexpected volume spike, tier manager failed silently, retention policy misconfigured. Detection: Monitor hot_storage_used_pct — warn at 80%, critical at 90%, page at 95%. Handling: Auto: accelerate migration (move segments older than 5 days early). Emergency: force-migrate to 3 days, disable inverted index (5× less space), alert ops to provision storage.

Cascading Self-Logging — Feedback Loop

Cause: The logging system's own components emit error logs, which flow through the pipeline, increasing load, generating more errors. Handling: (1) Separate monitoring stack for logging infrastructure (Prometheus + Grafana, not itself). (2) Rate limit internal log sources (100 events/sec cap). (3) Circuit breakers on error logging — log once, then count: "Connection refused (12,847 occurrences in 60s)."

Clock Skew — Events With Wrong Timestamps

Impact: Events land in wrong time partitions. Correlation across services breaks (response appears before request). Handling: Dual timestamps — event_ts (from app, potentially skewed) and ingest_ts (from gateway, NTP-synced). Partition by ingest_ts, display event_ts. Flag events with |skew| > 30s. Pad query time ranges by ±60s internally.

Overarching principle: Every failure has a blast radius, and the architecture contains it. Kafka absorbs ingestion failures. Replication absorbs storage failures. Query limits absorb query failures. No single failure cascades to total system outage.

09

Interview Tips

💡
Start with the single-server baseline.
Begin with tail -f /var/log/app.log. Then explain what breaks at scale: ephemeral containers, no correlation, no alerting. This shows you think from first principles, not pattern-match from a memorized design.
Lead with the write-to-read ratio.
Say "This is a 100:1 write-to-read system" early. This single fact drives every architectural choice — async indexing, append-only storage, Kafka buffering. It shows you understand the dominant access pattern, not just the components.
🎯
Derive the numbers, don't memorize them.
Walk through: "2,000 services × 50 instances × 5 logs/sec = 500K events/sec. At 500 bytes each, that's 250 MB/sec..." Interviewers care about the derivation methodology, not the exact number.
🔑
Explain why Kafka, not just that Kafka.
Don't just draw Kafka in the diagram. Explain the three reasons it's here: decoupling (indexer can crash without losing data), multiple consumers (indexer + tail + alerts from one stream), replay (re-index if something goes wrong). This turns "I know Kafka" into "I understand distributed system design."
💰
Make the cost argument for tiered storage.
"Full inverted index on all data = 3–6 TB/day of index overhead. At $0.10/GB/month on SSD, that's $18K/month just for index storage. Tiered approach drops this to $3K/month by indexing only the hot tier." Interviewers love when candidates think about cost — it shows real-world awareness.
⚠️
Know the Elasticsearch vs Loki tradeoff cold.
If the interviewer asks "why not just use Elasticsearch?" — explain: it indexes everything (great for search, expensive at scale). Loki indexes only labels (cheap, but full-text search is slow). Your hybrid approach takes the best of both. This shows you understand the design space, not just one tool.
🧩
Separate the live tail path from the search path.
Many candidates forget that live tail can't go through the index (30-second lag). It needs a direct Kafka consumer path. Mentioning this unprompted demonstrates architectural depth and awareness of real-time requirements.
11

Evolution

How this design grows from MVP to planet-scale.

1

MVP — Single Node, Files + Grep

One server. Agents send logs via HTTP. Append to time-partitioned files. Search = grep. Works for a small team with <5,000 events/sec. No indexing, no replication, no tiering. Ship fast, validate the product.

2

Growth — Kafka + Basic Indexing

Add Kafka as the ingestion buffer to decouple producers from consumers. Introduce a simple indexer that builds label-based indexes (service, level, time). Search by structured filters is fast; full-text still requires scanning. Add basic alerting as a Kafka consumer. Handles ~50K events/sec.

3

Scale — Inverted Index + Tiered Storage

Build inverted index on hot tier for full-text search. Introduce segment-based storage with roaring bitmaps and bloom filters. Add warm tier (HDD) and cold tier (S3) with automatic tiering via the Tier Manager. Segment compaction. Handles 500K events/sec.

4

Enterprise — Multi-Tenancy + Advanced Query

Per-tenant quotas and isolation at ingestion and query layers. Rich query DSL (boolean operators, regex, numeric ranges). Pre-computed rollup tables for dashboard aggregations. Query cost estimation and admission control. Live tail with compiled filter trees.

5

Planet-Scale — Multi-Region + ML

Cross-region replication via MirrorMaker 2 for disaster recovery. Region-local ingestion with global query federation. ML-powered anomaly detection on the alert evaluator stream. Log pattern clustering to auto-discover new error types. Cost: $500K+/month in infrastructure, but logging is now a competitive advantage.

Next up