Problem Statement
Think of a single server running your app. Something goes wrong — a user can't check out. You ssh in, run tail -f /var/log/app.log, find the error, fix it. Simple.
Now your company grows to 2,000 microservices across 100,000 containers. A user reports "I placed an order but never got a confirmation email." To debug this, you'd need to trace what happened across the API Gateway, Auth Service, Order Service, Payment Service, and Notification Service — each running on dozens of servers. You don't know which specific server handled this user's request.
What Breaks at Scale
Ephemeral Infrastructure
On Kubernetes, containers are killed and restarted constantly. When a container dies, its local logs vanish. The evidence is destroyed before you even know there's a bug.
No Correlation
Even if you could search all servers, logs have different formats, clock skew, and no way to link "this auth log" to "this payment log" for the same user request.
No Proactive Detection
You're only looking at logs after a user complains. If the payment service has been throwing errors for 20 minutes, nobody knows without centralized alerting.
Compliance & Audit
Regulations (SOX, GDPR, PCI-DSS) require retaining certain logs for months or years. Individual servers can't guarantee that — disks fill up, servers get decommissioned.
Core question: How do you build a system that ingests 500K+ log events per second, stores them durably for weeks to years, and lets engineers search across all of them in seconds?
Why This Is Architecturally Interesting
This is fundamentally a write-heavy, append-only, time-series-ish problem with a full-text search requirement bolted on. Most systems optimize for either writes OR search, not both. The competing forces:
- Insane write volume — 500K+ events/sec. Most databases buckle under this write load.
- Full-text search — Engineers want to search for specific error messages, user IDs buried in log lines, stack trace fragments across petabytes.
- Time-series nature — Almost every query includes a time range, so storage must be optimized for time-partitioned access.
- 100:1 write-to-read ratio — Most logs are never read, but when they are (during an incident at 3am), search must be fast.
- Cost pressure — At 20+ TB/day raw data, the difference between indexing everything vs indexing selectively is millions of dollars per year.
What We're NOT Building
Not a metrics system (numeric time-series like CPU = 73% — different storage model). Not distributed tracing (request flow across services via trace IDs — overlaps but has its own data model). Not a SIEM (security analysis layer on top of logs). These are all related "pillars of observability" — but each warrants its own design.