Service Mesh

01

Why this matters

You have 50 microservices in production. Every service-to-service call needs: TLS encryption, mTLS auth, retries, timeouts, circuit breaking, observability, traffic routing. Implementing all of that in every service, in every language is impossible to maintain. Half your services do retries differently; some forgot mTLS; nobody can tell where the latency cliff is.

A service mesh moves all of that into a sidecar proxy that runs alongside each service. Application code becomes simple HTTP/gRPC calls; the sidecar handles everything operational. One config file in one place changes behavior everywhere.
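To make the duplication concrete, here is a minimal Python sketch of the retry logic each service would otherwise carry itself, once per language. The names `TransientError` and `call_with_retries` are hypothetical, and the exponential backoff policy is an assumption for illustration, not something a particular mesh prescribes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with exponential backoff.

    The delay doubles after each failed attempt: base, 2*base, 4*base, ...
    `sleep` is injectable so the policy can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With a mesh, this policy lives in sidecar configuration instead, so the application makes one plain call and every service gets the same behavior.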

02

The architecture in one picture

Each pod runs your service container plus a sidecar proxy (Envoy is the dominant choice). All traffic in/out of the pod goes through the sidecar. The sidecar handles: TLS termination, mTLS to peers, retries, timeouts, circuit breaking, traffic shaping, request logging, distributed tracing.
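Of the sidecar's jobs, circuit breaking is the least obvious. A rough Python sketch of the classic consecutive-failure breaker follows. This is an illustration of the general pattern, not Envoy's implementation, and `CircuitBreaker`, `CircuitOpenError`, and the thresholds are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the peer."""


class CircuitBreaker:
    """Consecutive-failure circuit breaker (illustrative, hypothetical API)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: fail fast instead of piling load on a sick peer.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The point of putting this in the sidecar is that the caller's code never sees the state machine; it just gets a fast error while the peer recovers.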

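A central control plane (Istiod, Linkerd's control plane, Consul) pushes configuration to all sidecars: who can talk to whom, what timeouts apply where, how to route traffic. Envoy-based sidecars receive this configuration over xDS, Envoy's family of discovery APIs; Linkerd's proxy is not Envoy and uses its own control-plane API instead.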
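The control-plane/data-plane split can be sketched as a toy in-process model. This is a deliberately simplified illustration, not xDS: real control planes stream typed resources over gRPC, and the `MeshConfig`, `Sidecar`, and `ControlPlane` names and fields here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MeshConfig:
    """Hypothetical, heavily reduced stand-in for mesh policy."""
    version: int
    timeout_ms: int
    max_retries: int


@dataclass
class Sidecar:
    name: str
    config: Optional[MeshConfig] = None

    def apply(self, config: MeshConfig) -> None:
        # Ignore stale pushes so an out-of-order update cannot roll back policy.
        if self.config is None or config.version > self.config.version:
            self.config = config


class ControlPlane:
    """Holds the desired config and fans it out to every registered sidecar."""

    def __init__(self) -> None:
        self._sidecars: List[Sidecar] = []
        self._current: Optional[MeshConfig] = None

    def register(self, sidecar: Sidecar) -> None:
        self._sidecars.append(sidecar)
        if self._current is not None:
            sidecar.apply(self._current)  # late joiners get current policy

    def publish(self, timeout_ms: int, max_retries: int) -> MeshConfig:
        version = 1 if self._current is None else self._current.version + 1
        self._current = MeshConfig(version, timeout_ms, max_retries)
        for sidecar in self._sidecars:
            sidecar.apply(self._current)
        return self._current
```

One `publish` call changes the timeout for every sidecar, which is the "one place changes behavior everywhere" property in miniature.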

Service Mesh Topology (Mermaid)
flowchart LR
  CP[Control Plane<br/>Istiod / Linkerd]
  subgraph PodA[Pod A]
    A[App A] --- SA[Sidecar]
  end
  subgraph PodB[Pod B]
    SB[Sidecar] --- B[App B]
  end
  subgraph PodC[Pod C]
    SC[Sidecar] --- C[App C]
  end
  SA -- mTLS + retries + circuit breaker --> SB
  SA -- mTLS --> SC
  CP -. xDS config .-> SA
  CP -. xDS config .-> SB
  CP -. xDS config .-> SC
03

What the mesh actually delivers

  • Zero-trust networking — mTLS for every service-to-service call, automatic certificate rotation. App code talks plaintext HTTP locally; the sidecar wraps it.
  • Traffic shaping — "send 5% of traffic to the v2 canary." Configured centrally, applied at the data plane, no code changes.
  • Resilience policies — retries, timeouts, circuit breakers, rate limits. One YAML changes them everywhere.
  • Observability — uniform metrics (success rate, P99, RPS) per service edge and automatically generated trace spans, with no app-side metrics code. For end-to-end traces the app still has to forward the trace headers it receives.
  • Authorization — "service A can call service B's /orders endpoint but not /admin." Enforced at the sidecar, not the app.
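Two of the items above, traffic shaping and authorization, reduce to small decisions the sidecar makes per request. A hedged Python sketch of both follows. The function names, the rule shape, and the service names are hypothetical, and real meshes express these as declarative policy rather than code.

```python
import random
from typing import Dict, Iterable, Tuple

# (source service, destination service, allowed path prefix) -- hypothetical shape
Rule = Tuple[str, str, str]


def pick_version(weights: Dict[str, int], rng: random.Random) -> str:
    """Weighted routing: choose a backend version in proportion to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def is_allowed(source: str, dest: str, path: str, rules: Iterable[Rule]) -> bool:
    """Default-deny authorization: allow only if some rule matches the call."""
    for rule_source, rule_dest, prefix in rules:
        path_matches = path == prefix or path.startswith(prefix + "/")
        if source == rule_source and dest == rule_dest and path_matches:
            return True
    return False


if __name__ == "__main__":
    rng = random.Random()
    print(pick_version({"v1": 95, "v2": 5}, rng))  # "v2" about 5% of the time

    rules = [("service-a", "service-b", "/orders")]
    print(is_allowed("service-a", "service-b", "/orders/42", rules))
    print(is_allowed("service-a", "service-b", "/admin", rules))
```

Because both decisions happen in the proxy, shifting the canary from 5% to 20% or revoking access to an endpoint is a config change, not a redeploy.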
04

The cost of all this magic

Service mesh isn't free. Real costs:

  • Latency tax — every request goes through 2 extra proxy hops (out via your sidecar, in via the peer's). Adds 1-3ms per call.
  • Memory tax — each pod runs an Envoy (~50-100 MB). At 1,000 pods that is 50-100 GB just for sidecars.
  • Operational complexity — the mesh itself is a distributed system. Istio has a reputation for being hard to debug. A misconfigured mesh policy can take down inter-service communication everywhere at once.
  • Yet another layer — when something is slow, the failure could be in app, sidecar, control plane, or peer's sidecar. Diagnosis surface area grows.
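The latency and memory taxes are easy to estimate for your own fleet. A back-of-the-envelope Python sketch, using the per-call and per-sidecar ranges quoted above; the helper names and the five-call chain are assumptions for illustration.

```python
def sidecar_memory_gb(pods: int, mb_per_sidecar: float) -> float:
    """Total sidecar memory in decimal GB (1 GB = 1000 MB)."""
    return pods * mb_per_sidecar / 1000


def added_latency_ms(sequential_calls: int, ms_per_call: float) -> float:
    """Extra latency on a request path that makes calls one after another."""
    return sequential_calls * ms_per_call


if __name__ == "__main__":
    # 1,000 pods at 50-100 MB per sidecar.
    print(sidecar_memory_gb(1000, 50), sidecar_memory_gb(1000, 100))  # 50.0 100.0

    # Hypothetical request that fans through 5 sequential service calls,
    # each paying the 1-3 ms proxy overhead.
    print(added_latency_ms(5, 1), added_latency_ms(5, 3))  # 5 15
```

The per-call overhead looks small in isolation; it is the depth of the call chain that decides whether it shows up in your P99.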
Don't reach for it too early

If you have 5 services, you don't need a mesh — implement retries + tracing in a shared library. Consider a mesh once you have ≥ 20 services in multiple languages, or strict zero-trust requirements.

05

Deep dive — sidecar vs sidecar-less

The sidecar pattern is dominant but not universal. Recent alternatives:

Linkerd's micro-proxy — a Rust-based proxy ~10× lighter than Envoy. Same model, fraction of the resource cost.

Istio Ambient Mesh — the new architecture (2023+). Replaces per-pod sidecars with shared per-node ztunnels for L4 (mTLS) + optional waypoint proxies for L7. Cuts memory cost dramatically; trade-off is more complex routing.

eBPF-based meshes (Cilium Service Mesh) — implement mesh capabilities in the kernel via eBPF. No userland proxies in the data path. Lowest latency overhead. Newer; less mature than Istio/Linkerd.

The trajectory: the original "sidecar everywhere" pattern is being replaced by lighter approaches. By 2027 most production meshes will likely be ambient-style or eBPF-based. The capabilities stay the same; the overhead shrinks.

The interview answer

"At sufficient microservice scale, a mesh is essential — uniform mTLS, retries, observability without per-service code. We picked Linkerd for its lower overhead vs Istio. Tradeoff: extra latency (~2ms/call), extra memory (~50MB/pod). Worth it once we passed ~30 services."

06

Real-world

Istio

Most-deployed mesh

Envoy data plane + Istiod control plane. Feature-rich; complex. Used by Google, IBM, big enterprises.

Linkerd

Lighter alternative

Rust-based micro-proxy. ~10× less memory than Istio. Simpler config. CNCF graduated.

Consul Connect

HashiCorp's mesh

Tight Vault + Consul integration. Multi-platform (VMs + containers). Common in HashiCorp shops.

Cilium Service Mesh

eBPF-based

Sidecar-free; mesh logic in kernel. Lowest overhead. Newest; production-ready as of 2023+.

07

Used in problems

The news feed design runs its microservices inside Istio for mTLS + uniform observability. Uber's massive microservice fleet uses an Envoy-based mesh internally. The e-commerce design uses a mesh for traffic shaping during checkout-flow experiments.
