Concept · Scaling

Autoscaling

01

Why this matters

Traffic is spiky. Your e-commerce site does 1,000 RPS on a Tuesday afternoon and 15,000 RPS during Black Friday. Static provisioning is either wasteful (pay for peak capacity 24/7) or risky (provision for average, die during spikes). Autoscaling, meaning adding and removing instances based on live load, is the ops practice that replaces "capacity planning" with "set up the metric and walk away."

It looks simple ("scale when CPU > 70%"). It isn't. Tune it wrong and you get thrash, slow response to spikes, unnecessary AWS bills, or cascading failures.

02

The three strategies

Reactive

Scale when a metric crosses a threshold

"Scale up when CPU > 70%; scale down when CPU < 30%." Simple. Fast enough for most workloads. Metric lag + instance startup time mean real spikes hurt — you're always a minute behind.

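The reactive rule above can be sketched as a pure decision function. This is a minimal illustration, not any particular autoscaler's implementation; the function name and the one-instance step size are assumptions.

```python
def reactive_decision(cpu_percent: float,
                      scale_up_above: float = 70.0,
                      scale_down_below: float = 30.0) -> int:
    """Return +1 (add an instance), -1 (remove one), or 0 (hold)."""
    if cpu_percent > scale_up_above:
        return 1
    if cpu_percent < scale_down_below:
        return -1
    return 0


# The rule only reacts once a sample above the threshold has arrived,
# which is why a reactive policy always trails the spike itself.
samples = [45.0, 50.0, 85.0, 90.0, 25.0]
decisions = [reactive_decision(s) for s in samples]
```

Nothing in this function knows how long a new instance takes to boot, so the lag described above comes on top of whatever delay the metric pipeline adds.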
Predictive / scheduled

Pre-scale for known patterns

"Every weekday at 8:30am, traffic triples — scale up at 8:15am." Uses historical traffic shapes (ML or manual schedules). Great when patterns are stable. Useless for unexpected viral spikes.

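A manual schedule like the one described can be sketched as a lookup from wall-clock time to instance count. The schedule entries, the baseline of 4 instances, and the 18:00 end time are illustrative assumptions; only the 8:15am weekday pre-scale comes from the text.

```python
from datetime import datetime, time

# Hypothetical schedule: (weekdays, start, end, instance count).
# Monday is 0 and Sunday is 6, matching datetime.weekday().
SCHEDULE = [
    (range(0, 5), time(8, 15), time(18, 0), 12),  # weekday business hours
]
BASELINE_INSTANCES = 4


def scheduled_capacity(now: datetime) -> int:
    """Return the instance count the schedule asks for at `now`."""
    for weekdays, start, end, instances in SCHEDULE:
        if now.weekday() in weekdays and start <= now.time() < end:
            return instances
    return BASELINE_INSTANCES
```

A schedule like this typically sets a floor rather than the final size, so a reactive or target tracking policy can still add capacity on top of it when an unplanned spike arrives.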
Target tracking

Maintain a metric at a setpoint
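
"Keep average CPU at 50%." The autoscaler uses feedback control, loosely PID-like, to add or remove instances until the metric returns to the setpoint. Smoother than pure reactive; production-ready. On AWS, a target tracking policy on an Auto Scaling group (for example on CPU or ALB request count per target) is a common default in 2025.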

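One way to picture target tracking is a proportional rule: resize the fleet by the ratio of the observed metric to the setpoint. The Kubernetes HPA documents a formula of this shape; how AWS computes its adjustments internally is not specified here, so treat this as a simplified sketch. The function name and the default bounds are assumptions.

```python
import math


def target_tracking_desired(current_instances: int,
                            current_metric: float,
                            target_metric: float,
                            min_instances: int = 1,
                            max_instances: int = 100) -> int:
    """Proportional rule: resize so the average metric returns to the target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_instances * current_metric / target_metric)
    # Clamp to the group's bounds so a bad metric cannot run away.
    return max(min_instances, min(max_instances, raw))


# 10 instances averaging 80% CPU against a 50% setpoint -> 16 instances.
desired = target_tracking_desired(10, 80.0, 50.0)
```

Real autoscalers usually add a tolerance band and a stabilization window around this rule so that small metric wobbles do not trigger resizing.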
03

What metric to scale on

| Metric | Reflects | Catch |
| --- | --- | --- |
| CPU usage | Compute-bound work | Irrelevant for I/O-bound services (80% idle waiting for DB) |
| Request count / RPS | Traffic volume | Ignores request cost variance: one expensive request ≠ one cheap one |
| Latency P99 | User-perceived slowness | Scales after users already felt pain |
| Queue depth | Backlog of pending work | Only meaningful if you have an actual queue in front |
| Active connections | Load on persistent-connection services | Best for WebSocket / DB proxy tiers |
| Custom (e.g., GPU util) | Specific bottleneck | Requires custom metric pipeline |

For a typical web API, request count per instance is usually the cleanest signal: CPU is noisy and latency lags too far behind. Combine them: primary metric = RPS/instance; secondary = P99 latency as a safety trigger.
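A rough sketch of that combination follows: size the fleet from RPS per instance, and let a P99 breach force a scale-up regardless. The target of 100 RPS per instance, the 500 ms limit, and the step of 2 instances are illustrative assumptions, not recommendations.

```python
import math


def desired_instances(total_rps: float,
                      p99_latency_ms: float,
                      current_instances: int,
                      target_rps_per_instance: float = 100.0,
                      p99_limit_ms: float = 500.0,
                      safety_step: int = 2) -> int:
    # Primary signal: enough instances to hold RPS/instance at the target.
    by_rps = max(1, math.ceil(total_rps / target_rps_per_instance))
    # Safety trigger: if P99 breaches the limit, force a scale-up even when
    # the RPS signal says capacity is sufficient.
    if p99_latency_ms > p99_limit_ms:
        return max(by_rps, current_instances + safety_step)
    return by_rps
```

The latency branch only ever pushes capacity up. Scale-down stays governed by the RPS signal, which keeps the lagging metric from driving removals.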

04

Deep dive — the four operational traps

1. Scale-up latency. Instance boot + warmup (JVM, connection pool, cache population) = 30–120 seconds. Your traffic doubled 10 seconds ago; you still have old capacity. Mitigation: target utilization below 100% (60–70%) so headroom absorbs spikes while new instances boot. Keep a warm pool in critical systems.

2. Thrashing. CPU alternates between 75% and 65% around a single 70% threshold. A threshold-based autoscaler adds, then removes, then adds again. Cooldown periods (wait 5 min after scaling before scaling again) and hysteresis (scale up at 75%, scale down at 45%, so the two thresholds differ) prevent this.

3. Cascading scale. API tier scales up → hits the database → DB saturates → everyone's latency spikes → autoscaler adds more API instances → DB dies faster. Every downstream must either scale too or have backpressure. Circuit breakers are your friend.

4. Cost explosions. A bug makes every request take 10× longer. CPU stays high; autoscaler adds instances; bill doubles overnight. Set hard maxes. Every autoscale group needs an upper bound — even if "upper bound hit" means users see 503, that's cheaper than a $50k surprise.
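The headroom in trap 1 and the hard max in trap 4 both come down to simple arithmetic. The sketch below assumes utilization scales linearly with traffic, which real services only approximate; the function names and the example price are hypothetical.

```python
def spike_headroom(target_utilization: float) -> float:
    """Traffic multiple the current fleet absorbs before hitting 100%."""
    return 1.0 / target_utilization


def clamp_capacity(desired: int, min_instances: int, max_instances: int) -> int:
    """Enforce the group's hard bounds, whatever the metric asks for."""
    return max(min_instances, min(max_instances, desired))


def worst_case_hourly_cost(max_instances: int,
                           price_per_instance_hour: float) -> float:
    """Upper bound on spend per hour once the hard max is in place."""
    return max_instances * price_per_instance_hour


# Running at a 60% target leaves room for roughly a 1.67x spike
# while new instances are still booting.
headroom = round(spike_headroom(0.6), 2)
```

With a hard max of 40 instances at a hypothetical 0.25 per instance-hour, the worst case is 10.0 per hour no matter what a buggy metric requests.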

Figure: Thrashing, single threshold vs. hysteresis + cooldown. Top panel: a single 70% threshold thrashes every 2 min (repeated +1 and -1 actions). Bottom panel: scale up at 75%, scale down at 45%, with a 5-minute cooldown; one +1, then stable.

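To make the thrashing trap concrete, here is a small simulation comparing a single 70% threshold with hysteresis plus cooldown. It is a toy model under stated assumptions: one step is one minute, CPU% is modelled as load divided by instance count, and the load series is made up to put a 10-instance fleet at 75% and 65% on alternate minutes.

```python
def simulate(loads, scale_up_at, scale_down_at, cooldown_steps, instances=10):
    """Replay a load series and return (number of scaling actions, final size)."""
    actions = 0
    last_action_step = None
    for step, load in enumerate(loads):
        cpu = load / instances
        in_cooldown = (last_action_step is not None
                       and step - last_action_step < cooldown_steps)
        if in_cooldown:
            continue  # ignore the metric until the cooldown expires
        if cpu >= scale_up_at:
            instances += 1
        elif cpu < scale_down_at and instances > 1:
            instances -= 1
        else:
            continue  # inside the dead band: hold
        actions += 1
        last_action_step = step
    return actions, instances


# 20 minutes of load alternating between 75% and 65% for 10 instances.
loads = [750, 650] * 10

single_threshold = simulate(loads, scale_up_at=70, scale_down_at=70, cooldown_steps=0)
hysteresis = simulate(loads, scale_up_at=75, scale_down_at=45, cooldown_steps=5)
```

In this model the single threshold scales on every one of the 20 steps and ends where it started, while the hysteresis policy scales up once and then holds at 11 instances.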
05

Real-world

AWS Auto Scaling Groups

Target tracking by default

Set a target (e.g., 60% CPU) and AWS adjusts capacity. Supports custom CloudWatch metrics. Battle-tested.

Kubernetes HPA

Horizontal Pod Autoscaler

Scales pod count based on metrics (CPU, memory, custom). Combined with Cluster Autoscaler for node-level scaling.

KEDA

Event-driven autoscaling

Kubernetes add-on. Scales based on queue length, Kafka lag, Pub/Sub backlog. Perfect for worker pools.
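The idea behind backlog-driven scaling can be sketched as dividing the queue length by a per-worker target. This is a simplification for illustration and not KEDA's actual code; the function name, the target of 50 messages per worker, and the bounds are assumptions.

```python
import math


def workers_for_backlog(queue_length: int,
                        target_per_worker: int = 50,
                        min_workers: int = 0,
                        max_workers: int = 20) -> int:
    """Size the pool so each worker owns about `target_per_worker` messages."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    wanted = math.ceil(queue_length / target_per_worker)
    return max(min_workers, min(max_workers, wanted))
```

A minimum of zero lets an idle pool drain away entirely when the queue is empty, and the maximum keeps a sudden backlog from requesting an unbounded number of workers.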

Lambda / Cloud Run

Per-request autoscaling

Serverless platforms autoscale at the request level. Trade-off: cold starts on scale-up. Great for variable workloads.

06

Used in problems

News feed API tier autoscales on RPS. YouTube/Netflix encoding workers autoscale on queue depth. E-commerce checkout tier pre-scales for known events (Black Friday, product launches).
