LinkedIn MutualConnection Search

01

Problem Statement

Given any two users on a professional social network like LinkedIn, efficiently find and display their shared (mutual) connections. It sounds simple — a set intersection. User A's connections ∩ User B's connections. You could write it in one line of SQL.

But at LinkedIn's scale, that one line becomes one of the hardest problems in social networking infrastructure. Before the page finishes rendering, the system has already answered: "You and this person share 47 connections — here are their faces." That query ran in under 50ms, against a graph with over 1 billion nodes and 250 billion edges.

Mutual connections don't just appear on profiles. They show up on search result cards (10–20 per page), in "People You May Know" feeds, in connection request screens, in messaging previews. A single user session might trigger 50–100 mutual connection lookups without the user even thinking about it.

Core question: How do you make set intersection feel instant at billion-user scale, when the underlying graph is constantly changing and every page view demands multiple intersection queries?

The Four Sub-Problems

The Intersection Problem

Computing the overlap between two potentially large sets (power users have 30,000+ connections) without scanning everything at query time.

The Freshness Problem

Connections change constantly (~5M new connections/day). Stale mutual counts erode user trust in the product.

The Fan-out Problem

A single page load can require mutual computation against dozens of profiles simultaneously — search results, PYMK, feed cards.

The Asymmetry Problem

The viewer's connection list is reused across all lookups in a session, but each target profile's list is different. This reuse is the key optimization.

02

Requirements

Functional Requirements

FR1 — Get Mutual Connections: Given viewer + target, return the list of shared connections with profile data, ranked by relevance
FR2 — Get Mutual Count: Lightweight count-only query for bulk surfaces (search cards, PYMK tiles)
FR3 — Paginated Retrieval: Cursor-based pagination over potentially hundreds of mutual connections
FR4 — Real-Time-ish Updates: New connections reflected in mutual counts within 30 seconds
FR5 — Bulk Mutual Resolution: Batch API for 10–50 profiles in a single request, reusing viewer's connection list

Non-Functional Requirements

NFR1 — Latency: sub-50ms p99 for count, sub-100ms for full list
NFR2 — Throughput: ~120K–140K peak QPS (80% count-only, 20% full list)
NFR3 — Consistency: Eventual, with read-your-own-writes for the active user
NFR4 — Availability: 99.99% — graceful degradation (show profile without mutuals vs. fail)
NFR5 — Storage: ~5TB adjacency data for 1B users, must fit in memory for hot path
NFR6 — Scalability: Horizontal — adding users solved by adding machines

Out of Scope

2nd-degree connections (friends-of-friends), PYMK ranking algorithm, connection suggestions explanations, privacy controls, and blocking/restricting.

03

Scale Estimation

The numbers here directly force the architecture. By the end, we'll know whether precomputation is feasible and whether adjacency lists fit in memory.

1B

Total Users

60M

Daily Active Users

~500

Avg Connections/User

250B

Total Graph Edges

~140K

Peak QPS

5M/day

New Connections

~1.2 TB

Active User Adj. Data

10¹⁸

Possible Pairs (impossible)

Key Derivation

60M DAU × 50 lookups/session = 3B lookups/day = ~35K avg QPS. Peak 3–4× = ~140K QPS. Split: 110K count-only + 28K full-list. Precomputing all pairs: 10¹⁸ entries — impossible. Even friend-of-friend pairs: 250 trillion. Query-time intersection is mandatory.

Critical conclusion: Adjacency lists must live in memory (200K+ list fetches/sec would crush disk). Redis cluster of ~20–40 nodes for active users is non-negotiable.

04

API Design

Three public endpoints reflecting the three access patterns. Viewer identity is implicit (auth token) — you can only ask "what are MY mutuals with target", preventing graph scraping.

GET — Full Mutual List (Profile View)

GET /v1/connections/mutual?target_user_id={id}&cursor={cursor}&limit=10&sort=relevance

Response: {
  "mutual_connections": [
    { "user_id": "u_83a2f1", "name": "Priya Sharma",
      "headline": "Staff Engineer at Google",
      "profile_photo_url": "https://...",
      "mutual_context": { "shared_company": "Google" } }
  ],
  "total_count": 47,
  "next_cursor": "eyJsYX...",
  "has_more": true
}

GET — Count Only (Search Cards, PYMK)

GET /v1/connections/mutual/count?target_user_id={id}

Response: {
  "target_user_id": "u_83a2f1",
  "mutual_count": 47,
  "sample_mutuals": [                    // 2–3 faces for social proof
    { "user_id": "u_7cd312", "name": "Priya Sharma", "profile_photo_url": "..." }
  ]
}

POST — Batch Counts (Search Results, Feed)

POST /v1/connections/mutual/batch
Body: { "target_user_ids": ["u_001", ..., "u_020"], "include_samples": true }

// Viewer's adjacency list loaded ONCE — intersected against all 20 targets
// Bloom filter pre-check eliminates ~70% of full list loads
Response: {
  "results": [
    { "target_user_id": "u_001", "mutual_count": 47, "sample_mutuals": [...] },
    { "target_user_id": "u_002", "mutual_count": 0, "sample_mutuals": [] }
  ]
}

Internal Event — Connection Changed

Topic: connection_events (Kafka)
{ "event_type": "connection_created",
  "user_a_id": "u_83a2f1", "user_b_id": "u_44e891",
  "timestamp": "2025-06-15T14:32:01Z" }

→ Invalidate adj + bloom cache for both users
→ Proactively warm cache for both users

05

High-Level Architecture

The read path is optimized around two insights: the viewer's adjacency list is reused across batch queries, and most profile pairs share zero mutuals (skippable via Bloom filter).

Request Flow — Step Through

Client→API Gateway→Mutual Service→Redis (Bloom)→Bloom Check→Redis (Adj)→Intersect→Profile Svc→Response

Click Next Step to walk through the request flow.

06

Deep Dive — Bloom Filter Pre-check & Set Intersection

The most impactful optimization isn't a faster intersection algorithm — it's avoiding the intersection entirely for the ~70% of profile pairs that share zero mutuals.

The Adaptive Intersection Engine

The system picks the best algorithm at runtime based on set sizes:

Scenario	Algorithm	Why	Time
Zero mutuals (70%)	Bloom pre-check	Skip intersection entirely	~0.0001ms
Small × Small (300×300)	Sorted merge	Sequential access, cache-friendly	~0.002ms
Small × Large (300×25K)	Galloping search	O(\|S\| × log\|L\|) beats O(S+L)	~0.01ms
Large × Large (25K×25K)	Roaring bitmap AND	SIMD bitwise ops, 3× faster	~0.05ms
Batch (1 viewer × 20 targets)	Hash-set probe	Build viewer set once, probe N times	~0.02ms total

Bloom Filter — The Zero-Mutual Fast Path

A Bloom filter is a probabilistic data structure that answers "is X possibly in this set?" If it says no, the element is definitely not present. If it says yes, the element is probably present (with a tunable false positive rate).

The Compound FPR Trap

With 1% per-element FPR and 500 viewer IDs probed against a target bloom, the chance of at least one false positive is 1 − 0.99⁵⁰⁰ = 99.3%. The bloom is nearly useless as a yes/no gate! Solution: count hits vs. expected false positives. If hits ≤ expected_FP × 1.5, declare zero mutuals. This gives ~90% skip rate on true-zero pairs.

Statistical Threshold Approach

bloom_check(viewer_list, target_bloom, fpr=0.01):
    hits = sum(1 for id in viewer_list if target_bloom.might_contain(id))
    expected_fp = len(viewer_list) × fpr        // e.g. 500 × 0.01 = 5
    
    if hits ≤ expected_fp × 1.5:                // ≤ 7.5 → likely zero mutuals
        return ZERO_MUTUALS                     // Skip full intersection
    else:
        return NEEDS_INTERSECTION               // Load full list, intersect

Bloom Filter Sizing

Connections	FPR	Filter Size	Hash Functions	vs Full List
200	1%	240 bytes	7	15× smaller
500	1%	598 bytes	7	6.7× smaller
5,000	1%	5.9 KB	7	6.7× smaller
25,000	1%	29.3 KB	7	6.7× smaller

Batch Pipeline with Bloom Pre-check

sequenceDiagram participant C as Client participant M as Mutual Service participant RB as Redis (Bloom) participant RA as Redis (Adj) participant P as Profile Service C->>M: POST /batch (20 targets) M->>RA: GET viewer adjacency list RA-->>M: 500 IDs (4KB) M->>RB: MGET 20 target bloom filters RB-->>M: 20 blooms (~12KB total) Note over M: Probe viewer IDs against
each bloom. 14 return ZERO.
6 return NEEDS_INTERSECTION. M->>RA: MGET 6 target adj lists RA-->>M: 6 lists (~24KB) Note over M: Hash-set intersect viewer
set against 6 targets. M->>P: Enrich 12 sample mutuals P-->>M: Names + photos M-->>C: 20 results (14 zero, 6 with counts)

Impact: Bloom Saves 60% of Redis Operations

Without bloom: 21 Redis calls (1 viewer + 20 target lists), 84KB transferred. With bloom: 8 Redis calls (1 viewer + 1 MGET blooms + 6 selected lists), 40KB transferred. At 100K+ batch QPS, that's hundreds of thousands of saved Redis calls per second.

07

Users notice wrong counts. "~50" undermines trust. Exact intersection is fast enough (sub-50ms) for display surfaces.

08

What Can Go Wrong

Redis Shard Failure

One shard down → 5% of single queries affected, but 40% of batch requests (probability at least 1 of 20 targets is on dead shard). Mitigation: Replica promotion in 5–15s, DB fallback with rate limiting, client renders profiles without mutuals for affected targets.

Kafka Consumer Lag — Stale Mutual Counts

30-minute lag → 208K users with stale adjacency lists (0.35% of DAU). Most won't notice, but "just connected" users will. Mitigation: 1-hour TTL safety net bounds staleness. Freshness tokens for read-your-own-writes bypass. Consumer lag alerting.

Bloom Filter Corruption

If bloom becomes a subset of reality (missing bits), it returns false "zero mutuals" — the dangerous case. Mitigation: Atomic rebuild (adj + bloom together via Redis pipeline). Bloom TTL shorter than adj TTL (45min vs 60min). FP-rate monitoring detects corrupted filters.

Power-User Thundering Herd

Influencer's 240KB adjacency list TTL expires. 500 requests simultaneously miss cache, all hit Graph DB. Mitigation: Singleflight (request coalescing) — 500 requests collapse into 1 DB read. VIP cache warming pre-refreshes before TTL expiry. Stale-while-revalidate.

Graph DB Overload from Cache Miss Storm

Redis eviction event → thousands of cache misses → DB receives 50K QPS (sized for 1K). Mitigation: Rate-limit DB fallback at 5K QPS. Tiered degradation: return nulls for uncached targets. Token-bucket controlled cache backfill.

Celebrity Hot-Key in Redis

Viral CEO profile → 1M reads/sec on single Redis shard (capacity: 100K). Latency spikes for all keys on that shard. Mitigation: Local LRU cache (30s TTL) on service instances. Hot-key detection → auto-replicate key to 8 shards. Redis read replicas.

Event Pipeline Data Loss

Kafka event lost → connection exists in DB but cache never invalidated. Mitigation: Transactional outbox pattern (event written in same DB transaction as connection). Hourly reconciliation job samples 10K users, rebuilds mismatched caches.

09

Interview Tips

💡

Lead with the combinatorial impossibility.
Within 2 minutes, before drawing any box: "Precomputing all pairs is 10¹⁸ entries. Each new connection invalidates thousands. So query-time intersection is the only option — the entire design follows from that constraint."

⚡

Name the three access patterns unprompted.
Single profile view (100ms budget), bulk count for search cards (50ms, batch of 20), PYMK ranking pipeline (seconds, thousands of candidates). Each needs a different optimization. When you say "the viewer's list is reused across all 20," you've revealed the key insight.

🎯

Don't say "just use a graph database."
Neo4j can express the query elegantly but can't hit 100K QPS at sub-50ms. It's the persistence layer, not the query engine. The hot path runs through Redis with app-layer intersection.

🔢

Know the Bloom filter compound FPR trap.
1% per-element FPR × 500 probes = 99.3% compound FP rate. The bloom is useless as a yes/no gate. Bring up the statistical threshold approach proactively — if the interviewer catches this before you do, it's a red flag.

📊

Derive scale from product behavior, not thin air.
"60M DAU × 50 lookups/session × 3–4× peak = ~140K QPS." Trace the logic so they can verify. Split count vs. list — shows you understand the different performance profiles.

🏗️

Show the single-machine baseline first.
"1M users × 500 connections × 8 bytes = 4GB. Fits in one machine. Sorted merge takes 5μs. We distribute for storage (5TB), not CPU." Demonstrates you add complexity only when data forces it.

🧪

Prepare for "now add 2nd-degree connections."
"That's union of 500 connection lists (250K candidates), not intersection of 2. Can't be real-time — needs batch precomputation. Our mutual system handles the real-time layer; 2nd-degree is a batch layer on top."

10

Evolution

How this design grows from MVP to planet-scale. Each transition is triggered by a specific bottleneck — not arbitrary growth.

1

MVP — SQL JOIN, No Caching (<500K users)

Single PostgreSQL database. SELECT friend_id FROM connections c1 JOIN connections c2 ON c1.friend_id = c2.friend_id. Works in <10ms for small lists. Build and ship in a day. Trigger: P95 latency exceeds 100ms as connection density grows.

2

Cache Layer — Redis + Batch API (500K–10M users)

Redis caches adjacency lists. App-layer intersection. Batch endpoint (viewer-list reuse). Direct cache invalidation on connection changes. 2–3 Redis nodes. Trigger: DB fallback overload from growing cache-miss traffic.

3

Events + Bloom Filters (10M–100M users)

Kafka decouples invalidation. Bloom filters skip 70% of list loads. Adaptive intersection algorithm. Profile Service for enrichment. Trigger: Power-user latency spikes, Graph DB needs sharding.

4

Sharding + Power-User Optimizations (100M–500M users)

Sharded Graph DB (5–10 primaries). VIP cache warming. Singleflight for thundering herds. Roaring bitmaps for large-large intersections. Local LRU cache for hot keys. Trigger: PYMK pipeline cost, multi-region latency requirements.

5

Planet Scale — Multi-Region + Approximate Methods (500M–1B+ users)

Regional Redis clusters (US, Europe, APAC, India). MinHash signatures for PYMK scoring. Tiered storage (hot/warm/cold). Community-based graph partitioning. CSR in-memory graph index for hottest users. You're LinkedIn.

📺

References & Videos

People You May Know — Graph Algorithms

Gaurav Sen · 15 min

Real-Time Graph Serving at LinkedIn

LinkedIn Engineering

Problem Statement

The Four Sub-Problems

The Intersection Problem

The Freshness Problem

The Fan-out Problem

The Asymmetry Problem

Requirements

Functional Requirements

Non-Functional Requirements

Out of Scope

Scale Estimation

Key Derivation

API Design

High-Level Architecture

Deep Dive — Bloom Filter Pre-check & Set Intersection

The Adaptive Intersection Engine

Bloom Filter — The Zero-Mutual Fast Path

The Compound FPR Trap

Bloom Filter Sizing

Batch Pipeline with Bloom Pre-check

Impact: Bloom Saves 60% of Redis Operations

Key Design Decisions & Tradeoffs

1. Query-Time Intersection vs. Precomputed Tables

Query-Time Intersection

Precomputed Mutual Table

2. Serialized Sorted Arrays vs. Redis ZSET

Sorted Byte Arrays

Redis ZSET

3. Application-Layer vs. Redis-Side Intersection

Application Layer (Mutual Service)

Redis SINTER

4. Bloom Filter: Statistical Threshold vs. Ultra-Low FPR

Statistical Threshold (1% FPR)

Ultra-Low FPR (0.01%)

5. Split Storage vs. Combined Keys

Separate Keys (adj + bloom)

Combined Key

6. Exact vs. Approximate Counts

Exact for Display, MinHash for Ranking

Approximate Everywhere

What Can Go Wrong

Redis Shard Failure

Kafka Consumer Lag — Stale Mutual Counts

Bloom Filter Corruption

Power-User Thundering Herd

Graph DB Overload from Cache Miss Storm

Celebrity Hot-Key in Redis

Event Pipeline Data Loss

Interview Tips

Similar Problems

Twitter / News Feed

Search Autocomplete

Rate Limiter

Recommendation Engine

Real-Time Leaderboards

People You May Know

Evolution

MVP — SQL JOIN, No Caching (<500K users)

Cache Layer — Redis + Batch API (500K–10M users)

Events + Bloom Filters (10M–100M users)

Sharding + Power-User Optimizations (100M–500M users)

Planet Scale — Multi-Region + Approximate Methods (500M–1B+ users)

References & Videos

Twitter / News Feed

Search Autocomplete

SQL vs NoSQL