Concept · Reliability

Graceful Degradation

01

Why this matters

Your recommendation service is down. Does your homepage return a 500, or does it just skip the "Recommended for you" section and still render? The first is a hard failure; the second is graceful degradation. Same failure, dramatically different user experience.

Graceful degradation is the discipline of designing so that partial-system failure produces partial-feature loss rather than total outage. It pairs tightly with circuit breakers, which detect the failing dependency and trigger the switch to a fallback.

02

The fallback ladder

When a downstream call fails, pick the best still-available response:

  1. Cached stale data. "Here's yesterday's recommendations." User doesn't notice.
  2. Computed default. "Here's the trending list instead." Non-personalized but useful.
  3. Simpler feature. "Search still works; filters disabled."
  4. Empty-but-valid response. Component renders blank, rest of page works.
  5. Friendly error. "Can't show this right now" + a retry button.
  6. Hard failure. Last resort. Only for unrecoverable dependencies.

Each step down the ladder trades user experience for keeping the system responsive. The art is mapping each dependency to an appropriate fallback before it fails.
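The ladder can be sketched as an ordered list of attempts, where the caller takes the first rung that produces a result. This is a minimal illustration, not a production implementation: `first_available`, `live_recs`, `cached_recs`, and `trending` are hypothetical names, and the outage and cache miss are simulated.

```python
from typing import Callable, Optional, Sequence, Tuple

# A rung is a (name, fetch) pair; fetch returns a list, or None on a miss.
Rung = Tuple[str, Callable[[], Optional[list]]]


def first_available(rungs: Sequence[Rung]) -> Tuple[str, list]:
    """Walk the ladder top-down and return (rung_name, payload) from the first rung that works."""
    for name, fetch in rungs:
        try:
            result = fetch()
        except Exception:
            # This rung failed; step down. A real system would also log or emit a metric here.
            continue
        if result is not None:
            return name, result
    # Empty-but-valid response: the component renders blank, the rest of the page works.
    return "empty", []


def live_recs():
    raise TimeoutError("recs service unavailable")  # simulated outage


def cached_recs():
    return None  # simulated cache miss


def trending():
    return ["top-seller-1", "top-seller-2"]  # computed default


rung, items = first_available([
    ("live", live_recs),
    ("stale-cache", cached_recs),
    ("trending", trending),
])
print(rung, items)
```

Returning the rung name alongside the payload lets the caller record how far down the ladder each request went, which is useful when deciding whether a fallback is being hit more often than expected.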

03

Patterns in practice

Feature flags for hot-disable. When a downstream starts misbehaving, flip a feature flag off instantly. The caller short-circuits to the fallback without deploying code. Stripe, LinkedIn, and Netflix all wire critical features this way.
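A kill switch of this kind might look like the sketch below. `FlagStore` is a hypothetical in-memory stand-in for a real flag service, and the flag name `recs.personalized` is invented for illustration. The point is that the downstream is never called once the flag is off.

```python
class FlagStore:
    """Hypothetical in-memory flag store; a real one would poll or subscribe to a config service."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name, default=True):
        return self._flags.get(name, default)


flags = FlagStore()
calls = []  # records which users actually reached the downstream

TRENDING = ["trending-1", "trending-2"]


def fetch_personalized(user_id):
    """Stand-in for the real call to the recommendation service."""
    calls.append(user_id)
    return [f"pick-for-{user_id}"]


def recommendations(user_id):
    if not flags.is_enabled("recs.personalized"):
        return TRENDING  # short-circuit: the downstream is not called at all
    return fetch_personalized(user_id)


print(recommendations("u1"))            # flag on: calls the downstream
flags.set("recs.personalized", False)   # on-call engineer flips the switch
print(recommendations("u2"))            # flag off: served from the fallback
```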

Timeout-first, not retry-first. Every downstream call has a tight deadline (200ms, not 30s). If the response is not back in time, skip it and use the fallback. Don't let slow dependencies cascade into slow pages for users.
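One way to enforce such a deadline in Python is to run the call on a thread pool and bound the wait with `Future.result(timeout=...)`. The sketch below assumes a blocking client; `call_with_deadline`, `slow_recs`, and `fast_recs` are hypothetical names, and the 0.5s sleep simulates a slow dependency.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool so that a timed-out call does not block the caller on shutdown.
_pool = ThreadPoolExecutor(max_workers=8)

TRENDING = ["trending-1", "trending-2"]


def call_with_deadline(fn, fallback, deadline_s=0.2):
    """Run fn on the pool; return its result, or the fallback if it is late or raises."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        # cancel() only stops a task that has not started; a running call keeps its thread.
        future.cancel()
        return fallback
    except Exception:
        return fallback


def slow_recs():
    time.sleep(0.5)  # simulated slow dependency, well past the 200ms deadline
    return ["personalized-1"]


def fast_recs():
    return ["personalized-1"]


print(call_with_deadline(slow_recs, TRENDING))  # late: falls back to trending
print(call_with_deadline(fast_recs, TRENDING))  # on time: personalized result
```

Note the limitation in the comment: the abandoned call still occupies a worker until it finishes, so the underlying client should carry its own network timeout as well.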

Test the fallback too. Outages are at their worst when the primary fails AND the fallback was never rehearsed. Chaos engineering (force-fail dependency X; verify the fallback engages correctly) is how you keep fallbacks real.
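At the unit-test level, the same idea can be rehearsed by injecting a failure into the client and asserting on what the caller returns. The sketch below uses `unittest.mock.patch.object` from the standard library; `RecsClient` and `homepage_recs` are hypothetical names, and this is a small-scale stand-in for fault injection in a real environment, not a replacement for it.

```python
from unittest import mock

TRENDING = ["trending-1", "trending-2"]


class RecsClient:
    """Stand-in for a real recommendation-service client."""

    def top_for(self, user_id):
        return [f"pick-for-{user_id}"]


def homepage_recs(client, user_id):
    try:
        return client.top_for(user_id)
    except Exception:
        return TRENDING  # fallback under test


def test_fallback_engages_when_recs_fail():
    client = RecsClient()
    # Force-fail the dependency for the duration of the block.
    with mock.patch.object(client, "top_for", side_effect=ConnectionError("injected")):
        assert homepage_recs(client, "u1") == TRENDING
    # Once the patch is lifted, the primary path works again.
    assert homepage_recs(client, "u1") == ["pick-for-u1"]


test_fallback_engages_when_recs_fail()
```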

Degrade cheap, not expensive. Serving trending-for-all to a million users is cheap; recomputing personalized recs synchronously is not. If the recs service is down, the fallback must be a pre-warmed global ranking, not a synchronous emergency calculation.
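A pre-warmed fallback can be as simple as a snapshot that a background job refreshes, so the request path only ever reads it. In this sketch `PrewarmedRanking` and `compute_global_ranking` are hypothetical, and the scheduler that would call `refresh()` periodically is left out.

```python
class PrewarmedRanking:
    """Holds a global ranking computed off the request path."""

    def __init__(self, compute):
        self._compute = compute
        self._snapshot = []

    def refresh(self):
        # Run this from a background job or cron, never from a user request.
        self._snapshot = list(self._compute())

    def get(self):
        # Constant-time read; no downstream calls, no recomputation.
        return self._snapshot


compute_calls = {"n": 0}


def compute_global_ranking():
    """Stand-in for the expensive aggregation over all users."""
    compute_calls["n"] += 1
    return ["trending-1", "trending-2"]


ranking = PrewarmedRanking(compute_global_ranking)
ranking.refresh()  # done ahead of time, before any outage

# During an outage, many requests read the same snapshot.
responses = [ranking.get() for _ in range(1000)]
print(compute_calls["n"], responses[0])
```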

04

Service-level example — Amazon product page

Component                     | Normal             | Fallback
Product title & price         | DB lookup          | Hard requirement: fail page if unavailable
Customer reviews              | Reviews service    | Cached; if > 24h old, show anyway
"Frequently bought together"  | Recs service       | Hide the widget
Inventory / shipping ETA      | Inventory service  | "Usually ships in 1-2 days" static
User's order history link     | Account service    | Show link anyway; user clicks → handle failure there

Only title + price is a hard dependency. Everything else has a fallback. That is why Amazon rarely shows a blank product page, even during outages.
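The per-component mapping can be expressed as data plus one assembly function. The sketch below is illustrative only and does not describe Amazon's actual code: `build_product_page`, `SOFT_FALLBACKS`, `HardDependencyError`, and the fetcher names are all hypothetical.

```python
class HardDependencyError(Exception):
    """Raised when a component the page cannot live without is unavailable."""


CACHED_REVIEWS = ["(cached) Great product"]

# One fallback per soft dependency, decided before anything fails.
SOFT_FALLBACKS = {
    "reviews": CACHED_REVIEWS,                    # stale is acceptable
    "bought_together": None,                      # None means "hide the widget"
    "shipping_eta": "Usually ships in 1-2 days",  # static text
    "order_history_link": "/orders",              # show the link; handle failure on click
}


def build_product_page(fetchers):
    """Assemble the page; only title/price failure is allowed to fail the whole page."""
    try:
        page = {"title_price": fetchers["title_price"]()}
    except Exception as exc:
        raise HardDependencyError("title/price unavailable") from exc
    for name, fallback in SOFT_FALLBACKS.items():
        try:
            page[name] = fetchers[name]()
        except Exception:
            page[name] = fallback
    return page


def down():
    raise ConnectionError("service unavailable")  # simulated outage


page = build_product_page({
    "title_price": lambda: ("Widget", "9.99"),
    "reviews": lambda: ["Great product"],
    "bought_together": down,
    "shipping_eta": down,
    "order_history_link": lambda: "/orders",
})
print(page)
```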

05

Deep dive — the static-fallback trick

For top-of-funnel pages (homepage, category browse), Amazon famously generates a static HTML fallback every few minutes. If the dynamic tier is overwhelmed, the CDN/edge serves the static version: no personalization, but the site stays up. Users see something instead of a 503.

Generalized: every dynamic page should have a stale-OK story.

  • Product pages — cached HTML with default recs. User sees the product; recs widget may be empty.
  • Search — cached popular-query results. "Exact query didn't complete; here's what most people searched today."
  • Checkout — no fallback. Payment flow must be exact; if it can't complete, show "try again."

Identify which pages must be fresh and which can survive staleness. Usually the answer is that most pages can. The hard real-time requirement lives only in transactional flows.
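The fresh-versus-stale decision can be sketched as a small routing function. In practice this logic usually lives in CDN or edge configuration rather than application code; the Python below only illustrates the decision, and `serve`, `SNAPSHOTS`, `TRANSACTIONAL_PATHS`, and `OriginOverloaded` are hypothetical names.

```python
class OriginOverloaded(Exception):
    """Signals that the dynamic tier cannot answer right now."""


# Pre-generated pages, refreshed every few minutes by a separate job.
SNAPSHOTS = {
    "/": "<html>static homepage</html>",
    "/category/books": "<html>static books page</html>",
}

# Flows that must be exact and therefore have no stale fallback.
TRANSACTIONAL_PATHS = {"/checkout"}


def serve(path, render_dynamic):
    """Return (status, body): dynamic if possible, snapshot if allowed, else 503."""
    try:
        return 200, render_dynamic(path)
    except OriginOverloaded:
        if path not in TRANSACTIONAL_PATHS and path in SNAPSHOTS:
            return 200, SNAPSHOTS[path]  # stale but useful
        return 503, "Please try again."


def overloaded(path):
    raise OriginOverloaded(path)  # simulated overload


print(serve("/", overloaded))
print(serve("/checkout", overloaded))
```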

Production rule

"For every synchronous call to another service, answer: if this fails, what do we show the user?" If the answer is "a 500 page," you haven't designed the degradation. If the answer is "we hide that widget and log the incident," you have.

06

Real-world

Netflix

Hundreds of fallbacks

If recs down → show trending. If artwork down → show default poster. If titles down → show 503 (rare; title metadata is replicated heavily).

Amazon static fallback

Pre-generated pages

Top-N product and category pages are snapshotted to S3 every few minutes. The CDN serves them when the origin is overloaded.

LinkedIn

Feature flags for every dependency

Each feature has a kill switch. When a service misbehaves, an on-call engineer flips the flag and the feature disables cluster-wide in seconds.

Google Search

Progressive fallback

If spellcheck down → skip "did you mean." If personalization down → show generic ranking. If core index down → 500 (no graceful option).

07

Used in problems

  • News feed falls back to "trending" if the ranker is slow.
  • YouTube/Netflix falls back to default posters if metadata is stale.
  • E-commerce hides optional widgets when their services degrade.
  • Notification system uses channel-level fallbacks (SMS fails → email; email fails → in-app).
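The channel-level fallback for notifications can be sketched as an ordered list of senders tried in turn. The function and channel names here are hypothetical, and the SMS and email failures are simulated.

```python
def notify(message, channels):
    """Try each (name, send) pair in order; return the name of the channel that delivered, or None."""
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception:
            continue  # this channel failed; try the next one
    return None


delivered = []  # messages that reached the in-app inbox


def send_sms(message):
    raise ConnectionError("SMS gateway down")  # simulated failure


def send_email(message):
    raise ConnectionError("SMTP relay down")  # simulated failure


def send_in_app(message):
    delivered.append(message)


used = notify("Your order shipped", [
    ("sms", send_sms),
    ("email", send_email),
    ("in_app", send_in_app),
])
print(used, delivered)
```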
