Traffic is spiky. Your e-commerce site does 1,000 RPS on a Tuesday afternoon and 15,000 RPS during Black Friday. Static provisioning is either wasteful (pay for peak capacity 24/7) or risky (provision for average, die during spikes). Autoscaling, meaning adding and removing instances based on live load, is the ops practice that replaces "capacity planning" with "set up the metric and walk away."
It looks simple ("scale when CPU > 70%"). It isn't. Tune it wrong and you get thrash, slow response to spikes, unnecessary AWS bills, or cascading failures.
02
The three strategies
Reactive
Scale when a metric crosses a threshold
"Scale up when CPU > 70%; scale down when CPU < 30%." Simple. Fast enough for most workloads. Metric lag + instance startup time mean real spikes hurt — you're always a minute behind.
Predictive / scheduled
Pre-scale for known patterns
"Every weekday at 8:30am, traffic triples — scale up at 8:15am." Uses historical traffic shapes (ML or manual schedules). Great when patterns are stable. Useless for unexpected viral spikes.
Target tracking
Maintain a metric at a setpoint
"Keep average CPU at 50%." The autoscaler uses PID-like control to add/remove instances. Smoother than pure reactive; production-ready. AWS ALB + target tracking is the default in 2025.
03
What metric to scale on
Metric | Reflects | Catch
CPU usage | Compute-bound work | Irrelevant for I/O-bound services (80% idle waiting for DB)
Request count / RPS | Traffic volume | Ignores request cost variance: one expensive request ≠ one cheap one
Latency P99 | User-perceived slowness | Scales after users already felt pain
Queue depth | Backlog of pending work | Only meaningful if you have an actual queue in front
Active connections | Load on persistent-connection services | Best for WebSocket / DB proxy tiers
Custom (e.g., GPU util) | Specific bottleneck | Requires custom metric pipeline
For a typical web API, request count per instance is usually the cleanest signal: CPU is noisy, and latency lags too far behind. Combine the two: primary metric = RPS/instance; secondary = P99 latency as a safety trigger.
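One way to combine a primary RPS-per-instance signal with a P99 safety trigger is to compute a capacity from each and take the larger. In this sketch the per-instance target of 100 RPS, the 500 ms latency limit, and the 20% safety step are all hypothetical values chosen for illustration.

```python
import math


def plan_capacity(total_rps, p99_ms, current_instances,
                  target_rps_per_instance=100, p99_limit_ms=500, safety_step_pct=20):
    """Primary signal: RPS per instance. Secondary: P99 latency as a safety trigger."""
    by_rps = max(1, math.ceil(total_rps / target_rps_per_instance))
    if p99_ms > p99_limit_ms:
        # Latency breach: add a fixed percentage even if RPS looks healthy,
        # since expensive requests can overload instances at normal RPS.
        by_latency = current_instances + math.ceil(current_instances * safety_step_pct / 100)
        return max(by_rps, by_latency)
    return by_rps
```

The latency branch only ever adds capacity; scale-in decisions stay with the RPS signal so a brief latency dip cannot remove instances.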
04
Deep dive — the four operational traps
1. Scale-up latency. Instance boot + warmup (JVM, connection pool, cache population) = 30–120 seconds. Your traffic doubled 10 seconds ago; you still have old capacity. Mitigation: target utilization below 100% (60–70%) so headroom absorbs spikes while new instances boot. Keep a warm pool in critical systems.
2. Thrashing. CPU alternates between 75% and 65%. Threshold-based autoscaler adds, then removes, then adds... Cooldown periods (wait 5 min after scaling before scaling again) and hysteresis (scale up at 75%, scale down at 45% — different thresholds) prevent this.
3. Cascading scale. API tier scales up → hits the database → DB saturates → everyone's latency spikes → autoscaler adds more API instances → DB dies faster. Every downstream must either scale too or have backpressure. Circuit breakers are your friend.
4. Cost explosions. A bug makes every request take 10× longer. CPU stays high; autoscaler adds instances; bill doubles overnight. Set hard maxes. Every autoscale group needs an upper bound — even if "upper bound hit" means users see 503, that's cheaper than a $50k surprise.
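Two of the traps above reduce to simple arithmetic. The sketch below shows how much of a spike a given target utilization can absorb (trap 1) and how a hard ceiling bounds what the autoscaler may request (trap 4). It assumes utilization grows linearly with traffic, which real services only approximate, and the function names are illustrative.

```python
def absorbable_spike(target_utilization):
    """Traffic multiple the current fleet can take before reaching 100% utilization.

    Assumes utilization scales linearly with traffic.
    """
    return 1.0 / target_utilization


def clamp_capacity(desired, min_instances, max_instances):
    """Bound the requested capacity; also report whether the ceiling was hit."""
    capped = desired > max_instances
    bounded = max(min_instances, min(desired, max_instances))
    return bounded, capped
```

At a 60% target the fleet can absorb roughly a 1.67x spike while new instances boot; at 50% it can absorb 2x. The `capped` flag is worth alerting on, since hitting the ceiling means either real demand or a runaway bug.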
Figure: Thrashing, single threshold vs. hysteresis + cooldown
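The effect of hysteresis and cooldown on thrashing can be seen by counting scaling actions over a series of CPU samples. This is a deliberately simplified sketch: it does not model how each scaling action changes CPU, and the sample values and cooldown length are assumptions for illustration.

```python
def count_scaling_actions(cpu_samples, up_at, down_at, cooldown_samples=0):
    """Count scale-up/scale-down actions a threshold autoscaler would take."""
    actions = 0
    since_last = cooldown_samples  # lets the first sample act immediately
    for cpu in cpu_samples:
        since_last += 1
        if since_last <= cooldown_samples:
            continue  # still cooling down from the previous action
        if cpu > up_at or cpu < down_at:
            actions += 1
            since_last = 0
    return actions


# CPU alternating between 75% and 65%, ten samples in total.
samples = [75, 65] * 5

single_threshold = count_scaling_actions(samples, up_at=70, down_at=70)
with_hysteresis = count_scaling_actions(samples, up_at=75, down_at=45)
with_cooldown = count_scaling_actions(samples, up_at=70, down_at=70, cooldown_samples=5)
```

With a single 70% threshold every sample triggers an action (10 in total). The 75%/45% band triggers none because CPU never leaves it, and a five-sample cooldown on the single threshold cuts the count to 2.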
05
Real-world
AWS Auto Scaling Groups
Target tracking recommended
Set a target (e.g., 60% CPU) and AWS adjusts capacity. Supports custom CloudWatch metrics. Battle-tested.
Kubernetes HPA
Horizontal Pod Autoscaler
Scales pod count based on metrics (CPU, memory, custom). Combine with Cluster Autoscaler for node-level scaling.
KEDA
Event-driven autoscaling
Kubernetes add-on. Scales based on queue length, Kafka lag, Pub/Sub backlog. Perfect for worker pools.
Lambda / Cloud Run
Per-request autoscaling
Serverless platforms autoscale at the request level. Trade-off: cold starts on scale-up. Great for variable workloads.
06
Used in problems
News feed API tier autoscales on RPS. YouTube/Netflix encoding workers autoscale on queue depth. E-commerce checkout tier pre-scales for known events (Black Friday, product launches).
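Queue-depth scaling for a worker pool, as in the encoding example above, usually comes down to dividing the backlog by the number of jobs one worker should own. The sketch below assumes a hypothetical target of jobs per worker and a hard maximum; it is not the API of KEDA or any other tool.

```python
import math


def workers_for_backlog(queue_depth, jobs_per_worker, min_workers=0, max_workers=50):
    """Size a worker pool so each worker owns at most `jobs_per_worker` pending jobs."""
    wanted = math.ceil(queue_depth / jobs_per_worker)
    # Keep the result inside the configured bounds, including scale-to-zero.
    return max(min_workers, min(wanted, max_workers))
```

An empty queue scales the pool to the minimum (zero here), 95 pending jobs at 10 per worker asks for 10 workers, and a very large backlog stops at the 50-worker ceiling.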