Distributed Logging Framework — System Design

01

Problem Statement

Think of a single server running your app. Something goes wrong — a user can't check out. You ssh in, run tail -f /var/log/app.log, find the error, fix it. Simple.

Now your company grows to 2,000 microservices across 100,000 containers. A user reports "I placed an order but never got a confirmation email." To debug this, you'd need to trace what happened across the API Gateway, Auth Service, Order Service, Payment Service, and Notification Service — each running on dozens of servers. You don't know which specific server handled this user's request.

What Breaks at Scale

Ephemeral Infrastructure

On Kubernetes, containers are killed and restarted constantly. When a container dies, its local logs vanish. The evidence is destroyed before you even know there's a bug.

No Correlation

Even if you could search every server, the logs use different formats, their timestamps suffer from clock skew, and nothing links "this auth log" to "this payment log" for the same user request.

No Proactive Detection

You're only looking at logs after a user complains. If the payment service has been throwing errors for 20 minutes, nobody knows without centralized alerting.

Compliance & Audit

Regulations (SOX, GDPR, PCI-DSS) require retaining certain logs for months or years. Individual servers can't guarantee that — disks fill up, servers get decommissioned.

Core question: How do you build a system that ingests 500K+ log events per second, stores them durably for weeks to years, and lets engineers search across all of them in seconds?

Why This Is Architecturally Interesting

This is fundamentally a write-heavy, append-only, time-series-ish problem with a full-text search requirement bolted on. Most systems optimize for either writes OR search, not both. The competing forces:

Insane write volume — 500K+ events/sec. Most databases buckle under this write load.
Full-text search — Engineers want to search for specific error messages, user IDs buried in log lines, stack trace fragments across petabytes.
Time-series nature — Almost every query includes a time range, so storage must be optimized for time-partitioned access.
100:1 write-to-read ratio — Most logs are never read, but when they are (during an incident at 3am), search must be fast.
Cost pressure — At 20+ TB/day raw data, the difference between indexing everything vs indexing selectively is millions of dollars per year.

What We're NOT Building

Not a metrics system (numeric time-series like CPU = 73% — different storage model). Not distributed tracing (request flow across services via trace IDs — overlaps but has its own data model). Not a SIEM (security analysis layer on top of logs). These are all related "pillars of observability" — but each warrants its own design.

02

Requirements

Functional Requirements

Log Ingestion — Accept log events from thousands of services via lightweight agents on each host. Each log carries: timestamp, service name, severity level, message body, and arbitrary key-value metadata.
Full-Text Search — Query logs by time range + structured filters (service, severity) + keyword search in message body. Support boolean operators, phrase queries, and wildcard matching.
Live Tail — Real-time streaming of logs matching a filter, like tail -f across the entire fleet. Must work with zero indexing delay.
Alerting Rules — Define threshold-based rules (e.g., "alert if service=payments AND level=ERROR exceeds 100/min"). Evaluate in real-time, fire notifications via PagerDuty/Slack.
Retention Policies — Hot tier (fast search, 7 days), Warm tier (slower search, 30 days), Cold tier (archive only, 1 year+), with automatic tiering.
Multi-Tenancy — Team-level isolation and per-tenant ingestion quotas to prevent noisy-neighbor problems.

Non-Functional Requirements

Write Throughput: 500K events/sec sustained, 2M/sec burst — this is a write-dominated system
Ingestion-to-Searchable Latency: < 30 seconds — near real-time, not real-time (live tail handles the real-time case)
Search Latency: P99 < 5 seconds for queries spanning 1 hour of data
Durability: Zero log loss once acknowledged by the ingestion layer
Availability: 99.9% — the logging system must be more available than the systems it monitors
Storage Efficiency: Compression is survival — logs compress 10:1, which is the difference between manageable and bankrupt

Key tension: 500K writes/sec demands append-only async ingestion, but full-text search demands pre-built indexes. The entire architecture navigates this conflict.

Requirements Conflict Matrix

Requirement	Pushes Toward	Conflicts With
500K writes/sec	Append-only, batched, async indexing	Search latency (async = delay)
Full-text search	Inverted index (Elasticsearch-style)	Storage cost, write throughput
<30s to searchable	Frequent index flushes	Efficient segment sizes
Durability	Kafka replication, acks=all	Write latency (more acks = slower)
Cost efficiency	Compress everything, index less	Search speed (less index = slower)
Live tail	Direct Kafka consumption	Separate infrastructure path

03

Scale Estimation

Everything derived from first principles. Starting assumptions: 2,000 microservices, each running ~50 instances, each emitting ~5 log lines/sec. Average log event size: 500 bytes.

Metric	Value
Sustained write load	500K/sec
Peak burst (4x)	2M/sec
Events per day	43B/day
Raw data volume	21.6 TB/day
Compressed (10:1)	2.2 TB/day
Ingestion bandwidth	250 MB/sec
Write-to-read ratio	100:1
Storage cost (all tiers)	~$12K/mo
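The headline figures follow directly from the starting assumptions. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes); the variable names are illustrative only:

```python
# Back-of-envelope derivation of the headline numbers from the stated assumptions.
SERVICES = 2_000
INSTANCES_PER_SERVICE = 50
LINES_PER_SEC = 5            # log lines per instance per second
EVENT_BYTES = 500            # average event size
COMPRESSION_RATIO = 10       # the 10:1 figure used throughout
SECONDS_PER_DAY = 86_400

events_per_sec = SERVICES * INSTANCES_PER_SERVICE * LINES_PER_SEC
events_per_day = events_per_sec * SECONDS_PER_DAY
ingest_mb_per_sec = events_per_sec * EVENT_BYTES / 1e6
raw_tb_per_day = events_per_day * EVENT_BYTES / 1e12
compressed_tb_per_day = raw_tb_per_day / COMPRESSION_RATIO

print(f"{events_per_sec:,} events/sec")                   # 500,000 events/sec
print(f"{events_per_day / 1e9:.1f}B events/day")          # 43.2B events/day
print(f"{ingest_mb_per_sec:.0f} MB/sec ingest")           # 250 MB/sec ingest
print(f"{raw_tb_per_day:.1f} TB/day raw")                 # 21.6 TB/day raw
print(f"{compressed_tb_per_day:.2f} TB/day compressed")   # 2.16 TB/day compressed
```

Changing any one assumption (say, 10 lines/sec per instance) scales every downstream number linearly, which is why the assumptions are worth stating first.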

Storage Per Tier

Tier	Duration	Daily	Total	Media	Est. Cost/Mo
Hot	7 days	2.8 TB	~20 TB	NVMe SSD	~$3,000
Warm	30 days	2.2 TB	~66 TB	HDD / EBS	~$3,300
Cold	365 days	2.2 TB	~800 TB	S3 / GCS	~$5,600

Kafka Sizing

Derivation

At 250 MB/sec ingest, each broker handles ~100 MB/sec → 6 brokers minimum. With 24h retention and replication factor 3: 21.6 TB × 3 = 64.8 TB total Kafka storage. ~200 partitions across the topic, partitioned by service name with overflow hashing for hot services. Three independent consumer groups: Indexer, Tail Service, Alert Evaluator.

Indexer Sizing

Derivation

Each indexer instance handles ~75K events/sec (tokenization + index building). Sustained: 500K ÷ 75K = 7 instances minimum. Peak capacity (2M/sec): 27 instances. Fleet sized at 30 indexers with auto-scaling. Each flushes a segment every 30 seconds containing ~2.25M events (~112 MB compressed), producing ~20,000 segments/day requiring background compaction.
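The indexer figures can be checked the same way. This sketch assumes the ~75K events/sec per-instance rate and 30-second flush interval quoted above, and that segments per day are counted for the sustained fleet running at capacity:

```python
import math

SUSTAINED_EVENTS_PER_SEC = 500_000
PEAK_EVENTS_PER_SEC = 2_000_000
PER_INDEXER_EVENTS_PER_SEC = 75_000
FLUSH_INTERVAL_SEC = 30
EVENT_BYTES = 500
COMPRESSION_RATIO = 10

# Instances needed, rounded up because a fraction of an indexer cannot be run.
sustained_indexers = math.ceil(SUSTAINED_EVENTS_PER_SEC / PER_INDEXER_EVENTS_PER_SEC)
peak_indexers = math.ceil(PEAK_EVENTS_PER_SEC / PER_INDEXER_EVENTS_PER_SEC)

# One segment per indexer per flush interval.
events_per_segment = PER_INDEXER_EVENTS_PER_SEC * FLUSH_INTERVAL_SEC
segment_mb = events_per_segment * EVENT_BYTES / COMPRESSION_RATIO / 1e6
segments_per_day = sustained_indexers * (86_400 // FLUSH_INTERVAL_SEC)

print(sustained_indexers, peak_indexers)   # 7 27
print(events_per_segment, segment_mb)      # 2250000 112.5
print(segments_per_day)                    # 20160
```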

Key insight: At 500K writes/sec, this system is an order of magnitude beyond what any single database can handle. The architecture isn't chosen by preference — it's forced by the numbers.

04

API Design

Three core APIs serving three different personas: the log agent (machine, write-heavy), the debugging engineer (human, search-heavy), and the engineer watching live logs (human, streaming).

POST /v1/ingest — Log Ingestion (The Firehose)

POST /v1/ingest
Authorization: Bearer <service-api-key>
Content-Type: application/json
Content-Encoding: zstd

{
  "source": {
    "service": "payment-service",
    "host": "ip-10-0-3-42",
    "region": "us-east-1",
    "environment": "production"
  },
  "logs": [
    {
      "ts": "2025-06-15T14:23:01.123Z",
      "level": "ERROR",
      "msg": "Failed to charge card ending 4242: CARD_DECLINED",
      "meta": {
        "trace_id": "abc-123-def",
        "user_id": "u_98765",
        "error_code": "CARD_DECLINED",
        "latency_ms": 342
      }
    }
    // ... up to 1000 events per batch
  ]
}

→ 202 Accepted
{ "accepted": 1000, "failed": 0, "lag_hint": "normal" }
      

Why 202 Accepted?

Not 201 Created. The log is in Kafka (durable) but not yet searchable. 202 honestly communicates: "received and durably stored, but processing is ongoing."

Why Batch?

At 500K events/sec, one HTTP call per event would mean 500K requests/sec hitting the gateway. Batching 1,000 events per call reduces this to 500 calls/sec, which is trivial.

Why Separate Source?

All logs in a batch share the same source metadata. Extracting it saves ~100 bytes × 1,000 events = 100 KB per batch. At 500 batches/sec, that's 50 MB/sec saved.

Why lag_hint?

Cooperative backpressure. When the system is under pressure, agents can increase batch sizes, reduce flush frequency, or drop DEBUG-level logs.
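A sketch of how an agent might react to lag_hint. Only the value "normal" appears in the API example; the values "elevated" and "critical", the AgentConfig fields, and the default numbers are assumptions made for illustration:

```python
from dataclasses import dataclass, replace

MAX_BATCH = 1000  # per-batch cap from the ingest API

@dataclass(frozen=True)
class AgentConfig:
    batch_size: int = 500             # events per POST (assumed default)
    flush_interval_sec: float = 1.0   # assumed default
    min_level: str = "DEBUG"          # lowest severity still shipped

def apply_lag_hint(cfg: AgentConfig, lag_hint: str) -> AgentConfig:
    """Return the config the agent should use for its next flush."""
    if lag_hint == "normal":
        return AgentConfig()                        # back to defaults
    if lag_hint == "elevated":                      # hypothetical value
        return replace(
            cfg,
            batch_size=min(cfg.batch_size * 2, MAX_BATCH),
            flush_interval_sec=cfg.flush_interval_sec * 2,
        )
    if lag_hint == "critical":                      # hypothetical value
        # Largest batches, and stop shipping DEBUG-level logs.
        return replace(cfg, batch_size=MAX_BATCH, min_level="INFO")
    return cfg                                      # unknown hint: change nothing
```

The agent stays in control of its own behavior: the gateway only advises, so an old agent that ignores the field keeps working.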

POST /v1/search — Full-Text Search

POST /v1/search
Authorization: Bearer <user-token>

{
  "query": "card declined timeout",
  "filters": {
    "service": ["payment-service", "checkout-service"],
    "level": ["ERROR", "WARN"],
    "meta.trace_id": "abc-123-def"
  },
  "time_range": {
    "from": "2025-06-15T14:00:00Z",
    "to": "2025-06-15T15:00:00Z"
  },
  "sort": "desc",
  "limit": 100,
  "cursor": null
}

→ 200 OK
{
  "hits": 847,
  "scanned": 1240000,
  "took_ms": 1230,
  "logs": [ ... ],
  "next_cursor": "eyJsYXN0X3RzIjoi..."
}
      

Why POST for search?

Complex queries with nested filters and long query strings hit URL length limits. POST body has no practical size limit — same approach Elasticsearch uses.

Cursor-based pagination

Offset pagination ("skip 5000 results") is wasteful at depth. Cursor pagination uses an opaque token to seek directly to the next position. Every page is equally fast.
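The next_cursor value in the response above looks like base64-encoded JSON beginning with a last_ts field. A minimal sketch of such a cursor, assuming a hypothetical last_id tiebreaker and ISO-8601 UTC timestamps in one fixed format (so they sort lexicographically); a real index would seek to the position rather than filter a list:

```python
import base64
import json

def encode_cursor(last_ts: str, last_id: str) -> str:
    payload = json.dumps({"last_ts": last_ts, "last_id": last_id},
                         separators=(",", ":"))
    return base64.urlsafe_b64encode(payload.encode()).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))

def next_page(events: list, cursor, limit: int):
    """events: dicts with 'ts' and 'id', sorted newest-first."""
    if cursor:
        c = decode_cursor(cursor)
        key = (c["last_ts"], c["last_id"])
        # Keep only events strictly older than the last one already returned.
        events = [e for e in events if (e["ts"], e["id"]) < key]
    page = events[:limit]
    has_more = len(events) > limit
    nxt = encode_cursor(page[-1]["ts"], page[-1]["id"]) if has_more else None
    return page, nxt
```

Because the cursor carries the position itself, page 500 costs the same as page 1; there is no growing skip count.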

POST /v1/aggregate — Dashboard Queries

POST /v1/aggregate
{
  "time_range": { "from": "...", "to": "..." },
  "filters": { "environment": "production" },
  "group_by": ["service", "level"],
  "interval": "5m",
  "metric": "count"
}

→ Buckets of counts per group per time interval
      

Separate endpoint because execution is fundamentally different — aggregation reads only structured columns, never touching message bodies. Pre-computed rollup tables can answer these queries reading just a few hundred rows instead of scanning billions of events.

GET /v1/tail — Live Tail (SSE Stream)

GET /v1/tail?service=payment-service&level=ERROR
Authorization: Bearer <user-token>
Accept: text/event-stream

→ Server-Sent Events stream:
data: {"ts":"...","service":"payment-service","level":"ERROR","msg":"Failed to charge..."}
data: {"ts":"...","service":"payment-service","level":"ERROR","msg":"Timeout connecting..."}
: heartbeat
      

Why SSE over WebSocket?

Unidirectional (the server pushes to the client, and no data is sent back after connection). Auto-reconnection is built into the spec. HTTP-native, so it works through most proxies and load balancers without special configuration, provided they do not buffer the response. Heartbeats every 15 seconds keep connections alive.
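The wire format is simple enough to sketch. In SSE each event is one or more "data:" lines followed by a blank line, and a line starting with a colon is a comment that clients ignore, which is what makes it usable as a heartbeat. The helper names below are illustrative, and heartbeat timing is left out:

```python
import json
from typing import Iterable, Iterator

def sse_event(log: dict) -> str:
    # One event: a "data:" line terminated by a blank line.
    return "data: " + json.dumps(log, separators=(",", ":")) + "\n\n"

def sse_heartbeat() -> str:
    # Comment line; keeps idle connections open without delivering an event.
    return ": heartbeat\n\n"

def matches(log: dict, filters: dict) -> bool:
    return all(log.get(k) == v for k, v in filters.items())

def tail_stream(logs: Iterable[dict], filters: dict) -> Iterator[str]:
    """Yield SSE frames for logs that pass the client's filter."""
    for log in logs:
        if matches(log, filters):
            yield sse_event(log)
```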

POST /v1/alerts — Alert Rule Management

POST /v1/alerts
{
  "name": "Payment errors spike",
  "condition": {
    "filter": { "service": "payment-service", "level": "ERROR" },
    "threshold": 100, "window": "5m", "comparison": "above"
  },
  "actions": [
    { "type": "pagerduty", "severity": "critical" },
    { "type": "slack", "channel": "#payments-oncall" }
  ],
  "cooldown": "15m"
}

Supporting: GET/PUT/DELETE /v1/alerts/{id}, POST /v1/alerts/{id}/mute
      

Authentication Model

API	Method	Pattern	Auth	Throughput
Ingest	POST	Request/Response	API key (service)	500K events/sec
Search	POST	Request/Response	User token (OAuth)	500 queries/min
Aggregate	POST	Request/Response	User token	100 queries/min
Tail	GET	SSE stream	User token	200 connections
Alerts	CRUD	Request/Response	User token	Low

Key insight: The ingestion API and search API are completely decoupled. They share no synchronous dependency. Kafka and the indexer are the asynchronous bridge between them.

05

High-Level Architecture

Every component exists because the numbers demanded it. The pipeline flows left-to-right: Agents → Kafka → Consumers (Indexer, Tail, Alerts) → Storage → Query Service → UI.

Component Roles

Agents (100K hosts)

Lightweight daemons on every host. Watch log files, batch events, compress with zstd, ship to gateway via HTTP. Buffer locally on disk during outages. Priority-based dropping when buffer is full (DEBUG first, ERROR last).

Ingestion Gateway

Stateless HTTP service. Validates batches, normalizes timestamps, enforces per-tenant rate limits, publishes to Kafka. Shields Kafka from 100K direct agent connections.

Kafka (6+ brokers)

The backbone. Decouples producers from consumers. 24-hour retention for replay. Three independent consumer groups: Indexer, Tail Service, Alert Evaluator. Partitioned by service name with overflow hashing.

Indexer (30 workers)

Consumes from Kafka, tokenizes messages, builds inverted index segments with roaring bitmaps, compresses with zstd, flushes every 30 seconds. The most compute-intensive component.

Index Store (Hot/Warm/Cold)

Hot: NVMe SSD with full inverted index (7 days). Warm: HDD with bloom filters and a label index, no inverted index (30 days). Cold: S3 archive (365 days). Segments replicated to 2 nodes.

Query Service

Stateless scatter-gather. Consults metadata DB to find relevant segments, prunes via bloom filters, fans out parallel sub-queries to index nodes, merges results. Hard timeout at 30 seconds.

Tail Service

Reads directly from Kafka (bypasses index). Evaluates in-memory filters per connected client. Pushes matches via SSE. Compiled filter trees for efficient evaluation at 500K events/sec × 200 clients.

Alert Evaluator

Another Kafka consumer. Maintains sliding window counters per alert rule (2.4 KB per rule). Fires notifications when thresholds are breached. Runs in hot-standby pair for HA.
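A sliding window counter for one alert rule can be sketched as a fixed ring of per-second counts. This is one possible layout, not necessarily the one the sizing figure above is based on; the class name and the one-second bucket width are assumptions:

```python
class SlidingWindowCounter:
    """Counts events over the last window_sec seconds using a fixed ring."""

    def __init__(self, window_sec: int = 300):
        self.window_sec = window_sec
        self.counts = [0] * window_sec     # one slot per second
        self.stamps = [-1] * window_sec    # which second each slot holds

    def add(self, ts: int, n: int = 1) -> None:
        i = ts % self.window_sec
        if ts < self.stamps[i]:
            return                         # too late: slot already reused
        if ts > self.stamps[i]:
            self.stamps[i] = ts            # slot rolls over to a new second
            self.counts[i] = 0
        self.counts[i] += n

    def total(self, now: int) -> int:
        lo = now - self.window_sec
        return sum(c for c, s in zip(self.counts, self.stamps) if lo < s <= now)

def breached(counter: SlidingWindowCounter, now: int, threshold: int) -> bool:
    return counter.total(now) > threshold
```

Memory per rule is fixed regardless of event rate, which is what lets a single evaluator hold many rules in memory.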

Failure Isolation

Component Down	Impact	Why Contained
One agent	That host's logs buffer locally	Agents are independent
Gateway instance	Agents retry, other instances serve	Stateless, load-balanced
One Kafka broker	Zero (replicas serve)	RF=3, min.insync.replicas=2
Indexer fleet	Search shows stale data	Kafka buffers 24 hours
Index store node	Queries route to replica	Segments replicated 2x
Query service	Search down, ingestion fine	Completely separate path
Tail service	Live tail down, search fine	Independent consumer

Core principle: No single component failure stops log ingestion. Kafka absorbs downstream failures. The worst case is stale search results, never data loss.

Request Flow — Step Through

App (stdout)→Agent→Gateway→Kafka→Indexer→Index Store→Query Service→Engineer


06

Deep Dive — Log Storage & Indexing

The single hardest problem: how do you store 43 billion log events per day and make any of them findable in seconds?

Why Naive Approaches Collapse

PostgreSQL?

~50K inserts/sec max per node. You'd need 10 shards for the sustained 500K/sec and 40 for the 2M/sec burst, before any headroom. And LIKE '%connection refused%' is a full table scan on 43B rows, which means hours per query.

Just grep files?

Works for one service ("grep this 150 MB file" = 1.5 sec). But "find 'timeout' across ALL services" means grepping 3 TB of raw data. 30+ seconds even parallelized.

The Inverted Index — Core Data Structure

Instead of "for each log, list its words," flip it: for each word, list which logs contain it.

Building an inverted index from 4 log events

Event 0: "Payment failed: card declined by processor"
Event 1: "Connection timeout to payment processor"
Event 2: "User login successful from Dubai"
Event 3: "Payment retry succeeded after card reauthorization"

Inverted Index:
  payment     → [0, 1, 3]    card       → [0, 3]
  failed      → [0]          timeout    → [1]
  processor   → [0, 1]       login      → [2]
  connection  → [1]          dubai      → [2]

Query: "payment card"
  1. Look up "payment" → [0, 1, 3]
  2. Look up "card"    → [0, 3]
  3. Intersect (AND)   → [0, 3]  ← found without scanning any text!
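
The same four events, indexed in a few lines. This is a toy sketch: the tokenizer (lowercase, split on non-alphanumerics) is an assumption, and Python sets stand in for compressed posting lists:

```python
import re
from collections import defaultdict

EVENTS = [
    "Payment failed: card declined by processor",
    "Connection timeout to payment processor",
    "User login successful from Dubai",
    "Payment retry succeeded after card reauthorization",
]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(events: list[str]) -> dict[str, set[int]]:
    index = defaultdict(set)
    for event_id, message in enumerate(events):
        for term in tokenize(message):
            index[term].add(event_id)
    return index

def search_and(index: dict[str, set[int]], query: str) -> list[int]:
    """Return IDs of events containing every query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(EVENTS)
print(search_and(index, "payment card"))   # [0, 3]
```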
      

The Segment Structure

We don't build one giant index. Instead, we build many small self-contained segments, each covering a 30-second time window from one indexer (~2.25M events, ~112 MB compressed).

segment-20250615-142300-idx07-0042/
├── meta.json           ← time range, services, event count
├── columns/
│   ├── timestamp.col   ← delta-encoded sorted timestamps
│   ├── service.col     ← dictionary-encoded service names
│   └── level.col       ← dictionary-encoded levels
├── index/
│   ├── terms.fst       ← FST term dictionary (~200 KB)
│   └── postings.dat    ← roaring bitmap posting lists
├── bloom/
│   ├── service.bloom   ← "contains payment-service?" (~60 KB)
│   └── terms.bloom     ← "contains word 'timeout'?"
└── raw/
    └── messages.zst    ← block-addressable compressed messages
      

Key Data Structures

FST (Finite State Transducer) — Term Dictionary

Compressed trie mapping each term to its posting list offset. 50,000 terms in ~200 KB. Supports prefix queries ("pay*" → payment, payload, payee) at zero extra cost. Same structure used by Lucene/Elasticsearch.

Roaring Bitmaps — Posting Lists

Compressed lists of event IDs per term. Divide ID space into 65,536-ID chunks, use optimal encoding per chunk (sorted array for sparse, bitmap for dense, run-length for consecutive). 5–20× compression over raw arrays. AND/OR/NOT operations work directly on compressed form — no decompression needed.
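The per-chunk encoding choice can be sketched with a simplified size model: 2 bytes per value for an array, a fixed 8 KB for a bitmap, and 4 bytes per run plus a 2-byte count for run-length. The function names are illustrative, and a real Roaring library handles this internally:

```python
from collections import defaultdict

CHUNK_BITS = 16        # 2**16 = 65,536 IDs per chunk
BITMAP_BYTES = 8192    # 65,536 bits

def container_kind(lows: list[int]) -> str:
    """Pick the cheapest encoding for one chunk's sorted low-16-bit values."""
    runs = 1 + sum(1 for a, b in zip(lows, lows[1:]) if b != a + 1)
    sizes = {
        "array": 2 * len(lows),    # sorted 16-bit values
        "bitmap": BITMAP_BYTES,    # one bit per possible value
        "run": 2 + 4 * runs,       # run count + (start, length) pairs
    }
    return min(sizes, key=sizes.get)

def layout(ids) -> dict[int, str]:
    """Map each chunk key (high bits of the ID) to its container kind."""
    chunks = defaultdict(list)
    for i in sorted(set(ids)):
        chunks[i >> CHUNK_BITS].append(i & 0xFFFF)
    return {key: container_kind(lows) for key, lows in chunks.items()}

print(layout(range(100_000)))          # {0: 'run', 1: 'run'}
print(layout([1, 70_000, 200_000]))    # {0: 'array', 1: 'array', 3: 'array'}
print(layout(range(0, 65_536, 2)))     # {0: 'bitmap'}
```

Under this model an array stops being cheaper than a bitmap beyond 4,096 values, which matches the cutover Roaring uses.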

Block-Addressable Compression

Raw messages compressed in ~10,000-event blocks. Each block decompressible independently. To fetch event 45,231, decompress only block 4 (~500 KB) instead of the entire 112 MB file.
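A sketch of block addressing. It uses zlib from the standard library as a stand-in for zstd, and assumes messages contain no newline characters so that a newline can serve as the separator inside a block:

```python
import zlib

BLOCK_EVENTS = 10_000   # events per independently compressed block

def write_blocks(messages: list[str]) -> list[bytes]:
    blocks = []
    for start in range(0, len(messages), BLOCK_EVENTS):
        chunk = "\n".join(messages[start:start + BLOCK_EVENTS]).encode()
        blocks.append(zlib.compress(chunk))   # stand-in for zstd
    return blocks

def fetch_event(blocks: list[bytes], event_id: int) -> str:
    # Locate the block and the position inside it, then decompress only that block.
    block_no, offset = divmod(event_id, BLOCK_EVENTS)
    lines = zlib.decompress(blocks[block_no]).decode().split("\n")
    return lines[offset]

messages = [f"event {i}" for i in range(50_000)]
blocks = write_blocks(messages)
print(divmod(45_231, BLOCK_EVENTS))    # (4, 5231)
print(fetch_event(blocks, 45_231))     # event 45231
```

Block size is the tuning knob: larger blocks compress better, smaller blocks waste less work per point lookup.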

Bloom Filters — First Line of Defense

Probabilistic "does this segment contain X?" check. ~60 KB per segment. If bloom says "no," skip the entire segment. Eliminates 90%+ of segments before any real work. Zero false negatives.
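A minimal bloom filter sized at 60 KB to match the figure above. The hash scheme (two values taken from one SHA-256 digest and combined by double hashing) and the choice of 7 hash positions are assumptions for illustration:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 8 * 60 * 1024, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str) -> list[int]:
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd, so steps never stall
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

A query planner would call might_contain("timeout") on each candidate segment's filter and skip every segment that returns False.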

Query Execution — End to End

sequenceDiagram
    participant E as Engineer
    participant QS as Query Service
    participant MD as Metadata DB
    participant N1 as Index Node 1
    participant N2 as Index Node 2
    participant N3 as Index Node 3
    E->>QS: POST /v1/search (payment-service, ERROR, "timeout", last 1h)
    QS->>MD: Which segments cover 14:00-15:00?
    MD-->>QS: 1,200 segments across 8 nodes
    Note over QS: Bloom filter pruning: 1,200 → 40 segments (97% reduction)
    par Parallel fan-out
        QS->>N1: Search 12 segments
        QS->>N2: Search 15 segments
        QS->>N3: Search 13 segments
    end
    Note over N1,N3: Per segment: columnar filter → inverted index lookup → bitmap AND → fetch matches
    N1-->>QS: 230 matching events
    N2-->>QS: 412 matching events
    N3-->>QS: 205 matching events
    Note over QS: Merge-sort by timestamp, return top 100
    QS-->>E: 847 hits, 1,230ms, paginated results
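The final merge step can be sketched with a k-way merge. It assumes each index node returns its matches already sorted newest-first, with timestamps in one sortable format; merge_top_k is an illustrative name:

```python
import heapq
from itertools import islice

def merge_top_k(node_results: list[list[dict]], k: int = 100) -> list[dict]:
    """Merge per-node result lists (each sorted newest-first) and keep the top k."""
    merged = heapq.merge(*node_results, key=lambda e: e["ts"], reverse=True)
    return list(islice(merged, k))   # stop as soon as k results are produced
```

Because the merge is lazy, the Query Service only pulls as many events from each node's list as it needs to fill the page.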

The Three Schools — Compared

Dimension	Elasticsearch	Grafana Loki	ClickHouse
Philosophy	Index every word	Index only labels, grep content	Columnar + bloom filters
Full-text speed	Excellent (ms)	Poor (seconds)	Good (sub-second)
Structured filters	Good	Excellent	Excellent
Write throughput	~100K/sec/node	~500K/sec/node	~300K/sec/node
Storage cost	High (index overhead)	Very low	Low
Aggregation	Moderate	Poor	Excellent

Our Choice: The Hybrid Approach

Hot tier (7 days): Full inverted index. 95% of queries hit recent data — sub-second full-text search.
Warm/Cold tier: Label index + bloom filters + chunk scan (Loki-style). Slower (5–30s) but dramatically cheaper. Queries on old data are rare and less urgent.

Segment Lifecycle

Phase	Age	Index Type	Search Speed	Action
Birth	T=0	Full inverted + columnar	<1s	Indexer flushes to hot NVMe
Hot	0–7d	Full inverted + columnar	<1s	Serves most queries
Compaction	7d	Stripped to bloom + label index	<30s	Merge 60 segments → 1, move to warm
Warm	7–30d	Bloom + label index	<30s	Larger compacted segments on HDD
Cold	30–365d	Metadata only	Minutes	Upload to S3, on-demand download
Delete	>365d	—	—	Purge from S3 + metadata DB

Compaction — Why It Matters

Hot tier produces ~20,000 segments/day. Without merging, queries would fan out across thousands of tiny files. Compaction merges segments within time windows: Level 0 (raw, 30s) → Level 1 (10 min) → Level 2 (1 hour) → Level 3 (1 day for warm). Same concept as LSM-tree compaction in LevelDB/RocksDB.

07

Key Design Decisions & Tradeoffs

Push vs Pull Collection

✓ Chosen

Push (Agents Send)

Agents batch, compress, and push logs to the gateway. Near-instant latency. Adding new hosts requires zero central configuration. Agents handle local buffering and backpressure.

✗ Alternative

Pull (System Scrapes)

Central collector polls 100K hosts. Inherent polling delay. 50K poll requests/sec just for discovery. Each host needs read-cursor state. Breaks down at log scale (works for Prometheus metrics because volumes are lower).

Kafka Buffer vs Direct-to-Store

✓ Chosen

Kafka Intermediate Buffer

Decouples ingestion from indexing. If indexer is down, zero logs lost (Kafka buffers 24h). Enables multiple independent consumers (indexer, tail, alerts). Supports replay for re-indexing.

✗ Alternative

Direct-to-Store

Fewer components, ~10ms less latency. But gateway becomes coupled to indexer speed. No multiple consumers. No replay. A slow indexer blocks ingestion, which blocks agents, which drops logs.

Indexing Depth — The Big Decision

✓ Chosen

Hybrid: Full Index (Hot) + Bloom (Warm/Cold)

Full inverted index on 7-day hot tier covers 95% of queries with sub-second search. Warm/cold uses cheap bloom+scan. Saves ~600 GB/day of index storage compared to indexing everything. Best balance of speed and cost.

✗ Alternatives

Full Index Everywhere / No Index

Full index everywhere: 3–6 TB/day of index overhead, massive cost. No index (Loki-style): 10x cheaper but broad full-text search takes 30+ seconds. Both extremes have significant downsides.

Partitioning: Time-First vs Service-First

✓ Chosen

Time-First Partitioning

Almost every query includes a time range — "last hour" immediately narrows to ~1,200 segments. Retention is time-based (delete all before date X = trivial). Compaction is natural by time window.

✗ Alternative

Service-First Partitioning

Optimal for single-service queries. But cross-service queries (trace_id=abc) require scanning every service partition. Complicates retention and compaction. Bloom filters mitigate the per-service pruning gap.

SSE vs WebSocket for Live Tail

✓ Chosen

Server-Sent Events (SSE)

Unidirectional (server → client). Auto-reconnection built into spec. HTTP-native, so it works through most proxies. Simpler server-side, fewer failure modes. Heartbeats keep connections alive.

✗ Alternative

WebSocket

Full-duplex, allows dynamic filter updates without reconnecting. But adds WebSocket upgrade complexity, manual reconnect logic, proxy compatibility issues — all for a feature (dynamic filters) that's rarely used.

Sync vs Async Indexing

✓ Chosen

Async Indexing (~30s Lag)

Ingestion layer is simple and fast (accept, validate, Kafka). Indexer batch-processes 2.25M events per segment — far more efficient than one-at-a-time. If indexer crashes, ingestion continues unaffected.

✗ Alternative

Synchronous Indexing

Logs searchable within 1 second. But requires massive Elasticsearch clusters at 500K events/sec. Refresh operations become bottlenecks during bursts. In-memory buffer increases data loss risk on crash.

Compression Algorithm

✓ Chosen

Zstd (Dictionary-Trained)

10–15:1 compression on logs at 200 MB/sec compress speed. Dictionary trained on log patterns learns common phrases, timestamp formats, field names. Best balance of ratio and speed.

✗ Alternatives

LZ4 / Gzip

LZ4: faster (500 MB/sec) but only 3–4:1 — adds 3.2 TB/day extra storage. Gzip: good ratio (5–6:1) but painfully slow (50 MB/sec) — 70% of indexer CPU on compression alone.

Shared vs Isolated Multi-Tenancy

✓ Chosen

Shared Storage + Logical Isolation

One Kafka cluster, one indexer fleet, one index store. Resource-efficient (unused capacity available to all). Simpler operations (1 cluster, not N). Enables cross-tenant queries for platform teams.

✗ Alternative

Physical Isolation Per Tenant

Perfect blast-radius isolation. A noisy tenant can never impact another. But N× operational burden. Unused capacity per tenant is wasted. Cross-tenant queries require fan-out across clusters.

08

What Can Go Wrong

Kafka Consumer Lag — Indexer Falls Behind

Cause: Write burst (bad deploy), indexer crash, slow storage. Impact: Logs appear in live tail (reads from Kafka) but not in search — confusing and erodes trust. Detection: Monitor kafka_consumer_lag_seconds, alert if >60s for >5min. Handling: Short-term: Kafka buffers 24h, indexer auto-catches up. Medium: auto-scale indexer fleet (Kafka rebalances partitions). Nuclear: enter degraded mode — skip inverted index, build only label indexes (3× throughput at reduced search quality).

Agent Disk Buffer Full — Silent Log Loss

Cause: Network partition, all gateways down. Agent buffers to local disk (1 GB cap). Impact: Most dangerous failure — logs silently disappear. No error in the logging UI. Detection: Agent heartbeats every 30s to separate health-check endpoint. Gap detection (if payment-service drops from 1K/sec to 0, something's wrong). On recovery, agent reports dropped count. Mitigation: Priority-based dropping — DEBUG first, then INFO, keep ERROR as long as possible.
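Priority-based dropping can be sketched as one queue per severity, evicting from the least important non-empty queue when the buffer is over capacity. This sketch caps the buffer by event count rather than bytes, and the position of WARN in the drop order is an assumption:

```python
from collections import deque

DROP_ORDER = ["DEBUG", "INFO", "WARN", "ERROR"]   # dropped first ... dropped last

class PriorityBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity                   # max buffered events
        self.queues = {lvl: deque() for lvl in DROP_ORDER}
        self.dropped = {lvl: 0 for lvl in DROP_ORDER}
        self.size = 0

    def add(self, level: str, event: str) -> None:
        self.queues[level].append(event)
        self.size += 1
        while self.size > self.capacity:
            self._drop_one()

    def _drop_one(self) -> None:
        # Evict the oldest event of the least important level that has any.
        for lvl in DROP_ORDER:
            if self.queues[lvl]:
                self.queues[lvl].popleft()
                self.dropped[lvl] += 1
                self.size -= 1
                return
```

The dropped counters are what the agent would report on recovery, so the loss is at least visible after the fact.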

Index Store Node Failure

Cause: Hardware failure, OOM kill, disk corruption. Impact: With replication factor 2, queries route to replica seamlessly — user doesn't notice. Recovery: Re-sync when node returns. If permanently lost, copy segments from surviving replicas. Worst case (both replicas lost): rebuild from Kafka (if within 24h) or cold storage.

Query of Death — Pathological Query Takes Down Query Layer

Cause: Broad unfiltered query ("search *.* for last 7 days"), misconfigured dashboard widget, automated script. Impact: All search queries slow down — logging UI unresponsive during incidents when it's needed most. Handling: 5 layers of defense: (1) query admission control (reject if >10K segments), (2) per-query resource limits (30s timeout, 4 GB memory, 10M events max), (3) per-user concurrency limits (max 10 concurrent), (4) circuit breaker (shed load at high error rate), (5) query quarantine (block known-bad patterns).

Kafka Cluster Failure — The Worst Case

Impact: Entire pipeline stops. Agents buffer locally, no new data in search, tail goes silent, alerts stop. Prevention: Brokers across 3+ AZs with rack-aware replication. Rolling upgrades one broker at a time. N+2 capacity. Recovery: Agents drain local buffers when Kafka recovers. Indexer catches up from committed offsets. Full recovery is automatic. DR option: Secondary Kafka cluster via MirrorMaker 2 in different region.

Hot Storage Full — Indexer Stalls

Cause: Unexpected volume spike, tier manager failed silently, retention policy misconfigured. Detection: Monitor hot_storage_used_pct — warn at 80%, critical at 90%, page at 95%. Handling: Auto: accelerate migration (move segments older than 5 days early). Emergency: force-migrate to 3 days, disable inverted index (5× less space), alert ops to provision storage.

Cascading Self-Logging — Feedback Loop

Cause: The logging system's own components emit error logs, which flow through the pipeline, increasing load, generating more errors. Handling: (1) Separate monitoring stack for logging infrastructure (Prometheus + Grafana, not itself). (2) Rate limit internal log sources (100 events/sec cap). (3) Circuit breakers on error logging — log once, then count: "Connection refused (12,847 occurrences in 60s)."
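The "log once, then count" idea from item (3) can be sketched as a small coalescer. ErrorCoalescer is a hypothetical name; this version emits the summary only when the next occurrence arrives after the window has closed, whereas a production version would likely also flush on a timer:

```python
class ErrorCoalescer:
    def __init__(self, window_sec: int = 60):
        self.window_sec = window_sec
        self.state: dict[str, list] = {}   # message -> [window_start, count]

    def record(self, message: str, now: float) -> list[str]:
        """Return the log lines that should actually be emitted."""
        out = []
        entry = self.state.get(message)
        if entry is not None and now - entry[0] >= self.window_sec:
            if entry[1] > 1:
                # Window closed: summarize everything that was suppressed.
                out.append(f"{message} ({entry[1]:,} occurrences in {self.window_sec}s)")
            entry = None
        if entry is None:
            self.state[message] = [now, 1]   # first occurrence opens a window
            out.append(message)
        else:
            entry[1] += 1                    # repeat inside the window: count only
        return out
```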

Clock Skew — Events With Wrong Timestamps

Impact: Events land in wrong time partitions. Correlation across services breaks (response appears before request). Handling: Dual timestamps — event_ts (from app, potentially skewed) and ingest_ts (from gateway, NTP-synced). Partition by ingest_ts, display event_ts. Flag events with |skew| > 30s. Pad query time ranges by ±60s internally.
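The dual-timestamp handling is short enough to sketch directly. The function names and the record layout are assumptions; the 30-second flag threshold and 60-second padding come from the text above:

```python
from datetime import datetime, timedelta, timezone

SKEW_FLAG = timedelta(seconds=30)   # flag events whose skew exceeds this
QUERY_PAD = timedelta(seconds=60)   # widen query ranges by this on each side

def stamp(event_ts: datetime, ingest_ts: datetime) -> dict:
    """Keep both timestamps; storage partitions by ingest_ts, the UI shows event_ts."""
    skew = event_ts - ingest_ts
    return {
        "event_ts": event_ts,
        "ingest_ts": ingest_ts,
        "skew_flagged": abs(skew) > SKEW_FLAG,
    }

def padded_range(start: datetime, end: datetime) -> tuple[datetime, datetime]:
    # Segments are selected with the padded range; results are then filtered
    # back to the exact range the user asked for.
    return start - QUERY_PAD, end + QUERY_PAD
```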

Overarching principle: Every failure has a blast radius, and the architecture contains it. Kafka absorbs ingestion failures. Replication absorbs storage failures. Query limits absorb query failures. No single failure cascades to total system outage.

09

Interview Tips

💡

Start with the single-server baseline.
Begin with tail -f /var/log/app.log. Then explain what breaks at scale: ephemeral containers, no correlation, no alerting. This shows you think from first principles, not pattern-match from a memorized design.

⚡

Lead with the write-to-read ratio.
Say "This is a 100:1 write-to-read system" early. This single fact drives every architectural choice — async indexing, append-only storage, Kafka buffering. It shows you understand the dominant access pattern, not just the components.

🎯

Derive the numbers, don't memorize them.
Walk through: "2,000 services × 50 instances × 5 logs/sec = 500K events/sec. At 500 bytes each, that's 250 MB/sec..." Interviewers care about the derivation methodology, not the exact number.

🔑

Explain why Kafka, not just that Kafka.
Don't just draw Kafka in the diagram. Explain the three reasons it's here: decoupling (indexer can crash without losing data), multiple consumers (indexer + tail + alerts from one stream), replay (re-index if something goes wrong). This turns "I know Kafka" into "I understand distributed system design."

💰

Make the cost argument for tiered storage.
"Full inverted index on all data = 3–6 TB/day of index overhead. At $0.10/GB/month on SSD, that's $18K/month just for index storage. Tiered approach drops this to $3K/month by indexing only the hot tier." Interviewers love when candidates think about cost — it shows real-world awareness.

⚠️

Know the Elasticsearch vs Loki tradeoff cold.
If the interviewer asks "why not just use Elasticsearch?" — explain: it indexes everything (great for search, expensive at scale). Loki indexes only labels (cheap, but full-text search is slow). Your hybrid approach takes the best of both. This shows you understand the design space, not just one tool.

🧩

Separate the live tail path from the search path.
Many candidates forget that live tail can't go through the index (30-second lag). It needs a direct Kafka consumer path. Mentioning this unprompted demonstrates architectural depth and awareness of real-time requirements.

10

Evolution

How this design grows from MVP to planet-scale.

1

MVP — Single Node, Files + Grep

One server. Agents send logs via HTTP. Append to time-partitioned files. Search = grep. Works for a small team with <5,000 events/sec. No indexing, no replication, no tiering. Ship fast, validate the product.

2

Growth — Kafka + Basic Indexing

Add Kafka as the ingestion buffer to decouple producers from consumers. Introduce a simple indexer that builds label-based indexes (service, level, time). Search by structured filters is fast; full-text still requires scanning. Add basic alerting as a Kafka consumer. Handles ~50K events/sec.

3

Scale — Inverted Index + Tiered Storage

Build inverted index on hot tier for full-text search. Introduce segment-based storage with roaring bitmaps and bloom filters. Add warm tier (HDD) and cold tier (S3) with automatic tiering via the Tier Manager. Segment compaction. Handles 500K events/sec.

4

Enterprise — Multi-Tenancy + Advanced Query

Per-tenant quotas and isolation at ingestion and query layers. Rich query DSL (boolean operators, regex, numeric ranges). Pre-computed rollup tables for dashboard aggregations. Query cost estimation and admission control. Live tail with compiled filter trees.

5

Planet-Scale — Multi-Region + ML

Cross-region replication via MirrorMaker 2 for disaster recovery. Region-local ingestion with global query federation. ML-powered anomaly detection on the alert evaluator stream. Log pattern clustering to auto-discover new error types. Cost: $500K+/month in infrastructure, but logging is now a competitive advantage.
