Concept · Data Structures

URL Encoding & Base62

01

Why this matters

"Make me a short URL." Sounds trivial. The shortcode must be: unique, URL-safe, short (6-8 chars), and ideally non-guessable. Plus, you need to mint billions of them at thousands per second without coordination overhead. The encoding choice — Base62 vs Base64 vs hash-based — drives the whole system design. The classic URL-shortener interview question is really about this trade-off.

02

Three approaches

StrategyHowProsCons
Hash + truncateSHA256 of URL → take first N chars Base62Deterministic — same URL always returns the same codeCollisions inevitable; need DB lookup to confirm uniqueness, retry if collision
Counter + Base62Auto-increment counter, encode in Base62Guaranteed unique; predictable length progressionCodes are sequential / guessable; counter is a hot key
Pre-generated poolGenerate batch of random codes, store, hand outRandom + collision-free; no counter contentionNeed a maintenance job to refill; storage overhead
03

Why Base62

Alphabet: 0-9 a-z A-Z = 62 characters. All URL-safe (no escaping). Each character carries log₂(62) ≈ 5.95 bits.

  • 6 chars = 62⁶ ≈ 56 billion codes
  • 7 chars = 62⁷ ≈ 3.5 trillion codes
  • 8 chars = 62⁸ ≈ 218 trillion codes

Why not Base64? Base64 includes + and / which aren't URL-safe — they'd need percent-encoding. URL-safe Base64 (- and _) is fine but visually ugly. Base62 reads like a license plate.

Why not hex (Base16)? Each char is only 4 bits → twice as long for the same key space. Base62 is the sweet spot of "URL-safe + readable + dense."

Snowflake 64-bit ID generator
package snowflake

import (
    "sync"
    "time"
)

const (
    epoch     int64 = 1577836800000 // 2020-01-01 UTC ms
    machineBits = 10
    seqBits     = 12
    maxSeq      = (1 << seqBits) - 1
)

type Node struct {
    mu       sync.Mutex
    machine  int64 // 0..1023
    lastMs   int64
    seq      int64
}

func (n *Node) NextID() int64 {
    n.mu.Lock(); defer n.mu.Unlock()
    now := time.Now().UnixMilli() - epoch
    if now == n.lastMs {
        n.seq = (n.seq + 1) & maxSeq
        if n.seq == 0 {
            for now <= n.lastMs { now = time.Now().UnixMilli() - epoch }
        }
    } else { n.seq = 0 }
    n.lastMs = now
    return (now << (machineBits+seqBits)) | (n.machine << seqBits) | n.seq
}

// 4096 IDs/ms/machine × 1024 machines = 4B IDs/sec ceiling
04

Deep dive — the counter problem and how to scale it

Naive: a single Postgres counter. SELECT nextval('shortcode_seq'). Encode in Base62. Insert. Done.

Problem: every shortener creation takes a round-trip to one DB. At 10k QPS, the counter is a hot row. Single point of contention.

Range-batch trick — each app server reserves a range of 10,000 IDs from the central counter. It hands out IDs locally without DB calls. When the range runs low, it requests another range. Coordination overhead drops 10,000×. This is how Twitter Snowflake-style ID generators work.

Snowflake IDs — 64 bits split into: timestamp (41 bits) + machine ID (10 bits) + sequence-within-ms (12 bits). Each server generates locally with no coordination. Roughly time-ordered. Trivially Base62-encodable to ~11 chars.

Pre-generated random pool — for non-guessable codes (interview ask: "make codes hard to enumerate"), a worker pre-generates batches of cryptographically-random 7-char Base62 strings, deduplicates against existing codes, stores them as "available." App grabs one per insertion. No contention; no enumeration attack.

Production mix

Bit.ly + TinyURL use predictable counter-based codes for free tier, custom random codes for premium. Snowflake-style timestamp IDs are common when ordering matters (Twitter post IDs, Discord snowflakes).

Twitter Snowflake — 64-bit ID Layout SVG
64-bit ID = 1 sign bit + 41 timestamp + 10 machine + 12 sequence 0 sign timestamp (ms since epoch) 41 bits ~69 years machine 10 bits 1024 nodes sequence 12 bits 4096 / ms Per-machine throughput 4096 IDs/ms × 1000 ms = 4.1M IDs/sec/machine ×1024 machines = 4.2B IDs/sec cluster-wide No coordination · roughly time-ordered · ~11 chars Base62
05

Real-world

Bit.ly

Counter + Base62

Sequential by default; non-sequential / vanity codes for paid plans.

YouTube video IDs

Random 11-char Base64

11 chars × 6 bits = 66 bits of namespace. Non-guessable; videos can be unlisted via secret URL.

Twitter Snowflake

Timestamp + machine + seq

64-bit IDs, base-10 in URLs. Enables cursor pagination by time without an extra column.

Imgur / Reddit

5-7 char Base36 / Base62

Mix of counter-based (older content) and random (newer). Reddit visible IDs use Base36.

06

Used in problems

URL shortener is the canonical problem. Pastebin uses the same encoding tricks. Any "shareable link" feature (Google Drive shares, Notion pages, Stripe payment links) builds on these patterns.

Next up