The Philosophy of Distributed Systems
Core Concepts
We will cover the fundamental concepts of distributed systems, including consistency models, fault tolerance, and building resilient APIs.
Consistency Models
Distributed systems are just computers arguing over whose clock is lying.
Strong Consistency
Linearizability
Linearizability guarantees that every operation appears to take effect instantaneously at some point between its invocation and completion.
- Reads see the latest committed write.
- Clients do not need to reason about replica lag.
- The price is coordination.
async def write(value):
    # Wait for acknowledgments from a majority of replicas (3 of 5 here)
    # before confirming the write to the client.
    acks = await replicate_to_majority(value)
    return {"ok": acks >= 3}
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F as Follower
    C->>L: write(x=1)
    L->>F: replicate(x=1)
    F-->>L: ack
    L-->>C: committed
Practical Implications
In practice, strong consistency requires coordination between nodes, which limits throughput and increases latency under network partitions.
Trade-offs in Banking Systems
Banks historically chose consistency over availability — a double-spend is worse than a brief outage.
Trade-offs in Social Media
Social feeds tolerate eventual consistency; showing a post 200ms late is acceptable.
Implementation Approaches
Two-phase commit (for atomic commitment) and Paxos (for consensus) are the canonical coordination protocols behind strongly consistent systems.
Sequential Consistency
Sequential consistency relaxes the real-time requirement while still preserving program order within each process.
The quorum overlap rule, $W + R > N$, quantifies this coordination: read and write sets must share at least one replica.
Causal Consistency
Causal consistency ensures that causally related operations are seen in the same order by all nodes.
Causal vs Sequential
Causal consistency is strictly weaker than sequential — it only orders events that are causally linked, not all events globally.
Vector Clocks
Vector clocks track causal history by assigning each process a logical counter, allowing nodes to determine if two events are causally related or concurrent.
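As a sketch, a vector clock can be a dict from process id to counter; the helper names below (`happened_before`, `concurrent`) are illustrative, not a library API.

```python
def happened_before(a: dict, b: dict) -> bool:
    """True if the event with clock `a` causally precedes the one with clock `b`:
    every component of `a` is <= the matching component of `b`, and a != b."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

def concurrent(a: dict, b: dict) -> bool:
    """Neither clock precedes the other: the events are concurrent."""
    return not happened_before(a, b) and not happened_before(b, a)

p = {"P": 2, "Q": 1}   # P's clock after seeing Q's first event
q = {"P": 0, "Q": 2}   # Q's clock, unaware of P's events

print(happened_before({"P": 1}, p))  # True: P's first event precedes p
print(concurrent(p, q))              # True: neither dominates the other
```

Missing entries default to zero, so clocks from processes with disjoint histories compare correctly.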
Eventual Consistency
Eventual consistency guarantees that, given no new updates, all replicas converge to the same value over time.
Conflict Resolution Strategies
Last Write Wins
Last Write Wins uses wall-clock timestamps to resolve conflicts — simple but vulnerable to clock skew.
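A minimal sketch of the rule, where each register is a (timestamp, value) pair:

```python
def lww_merge(a: tuple, b: tuple) -> tuple:
    """Keep whichever (timestamp, value) pair has the later wall-clock time.
    Correctness depends entirely on clock accuracy across nodes."""
    return a if a[0] >= b[0] else b

replica_a = (1700000000.120, "x=1")
replica_b = (1700000000.125, "x=2")
print(lww_merge(replica_a, replica_b))  # later timestamp wins: (…, "x=2")
```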
Clock Skew Risks
In a distributed system, clocks drift. Two events written 1ms apart on different nodes can appear in the wrong order when clocks differ by more than 1ms.
Merge Functions
CRDTs (Conflict-free Replicated Data Types) define merge functions that are commutative, associative, and idempotent — so order of merging doesn't matter.
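The simplest example is a grow-only counter (G-Counter), whose merge is an element-wise max; a minimal sketch:

```python
def merge(a: dict, b: dict) -> dict:
    """Element-wise max over per-node counters: commutative, associative,
    and idempotent, so replicas converge regardless of merge order."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def value(counter: dict) -> int:
    """The counter's value is the sum over all nodes' slots."""
    return sum(counter.values())

a = {"node1": 3, "node2": 1}
b = {"node2": 4, "node3": 2}

assert merge(a, b) == merge(b, a)            # commutative
assert merge(a, merge(a, b)) == merge(a, b)  # idempotent
print(value(merge(a, b)))                    # 3 + 4 + 2 = 9
```

Each node increments only its own slot, so concurrent increments never conflict; the max simply takes the most advanced state seen for each node.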
Read Repair
Read repair detects stale replicas at read time and patches them in the background.
Read-Your-Writes Consistency
A session-level guarantee: after a client writes a value, subsequent reads by the same client will reflect that write.
Implementation via Sticky Sessions
Load Balancer Affinity
Routing the same client to the same replica ensures read-your-writes without cross-node coordination.
Session Tokens
Encode a minimum-version token in the session cookie; replicas refuse to serve reads below that version.
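A sketch of the check, with hypothetical names (`StaleReplica` and the replica dict shape are assumptions, not any particular database's API):

```python
class StaleReplica(Exception):
    """Raised when a replica has not replicated up to the client's last write."""

def read(replica: dict, session_token: int):
    # replica: {"version": replicated-up-to version, "data": current value}
    if replica["version"] < session_token:
        raise StaleReplica("retry on another replica")
    return replica["data"]

# After a write, the client stores the write's version as its session token.
session_token = 42
fresh = {"version": 45, "data": "hello"}
lagging = {"version": 40, "data": "old"}

print(read(fresh, session_token))  # "hello"
try:
    read(lagging, session_token)
except StaleReplica:
    print("replica too stale; route the read elsewhere")
```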
Fault Tolerance
client -> api: request
api -> replica_a: write
api -> replica_b: write
replica_b -> api: timeout
Failure Modes
Crash Failures
The node stops responding entirely. The simplest failure mode — dead nodes don't lie.
- Easy to detect
- Easy to recover from
- Easy to explain to leadership
Byzantine Failures
The node responds with arbitrary or malicious data. Byzantine fault tolerance requires 3f+1 nodes to tolerate f traitors.
Byzantine Generals Problem
Lamport, Shostak, and Pease (1982) formalized the problem: generals must agree on a battle plan despite traitors sending conflicting messages.
Practical Byzantine Fault Tolerance
PBFT achieves Byzantine fault tolerance in O(n²) message complexity — too expensive for large clusters, practical for small trusted federations.
Replication Strategies
Leader-Follower Replication
One leader accepts writes; followers replicate asynchronously or synchronously.
Synchronous Followers
At least one follower is synchronous — the leader waits for acknowledgment before confirming a write to the client.
Semi-Synchronous Replication
MySQL's semi-sync: the leader waits for one follower before responding, then lets the rest replicate asynchronously.
Leaderless Replication
Any replica can accept writes; conflicts are resolved by quorum reads.
Quorum Rules
With N replicas, W write quorum, and R read quorum: if W + R > N, reads and writes overlap and consistency is guaranteed.
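The overlap claim can be checked exhaustively for small N; a sketch:

```python
from itertools import combinations

def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every possible write quorum of size w intersects
    every possible read quorum of size r among n replicas."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

print(always_overlap(5, 3, 3))  # True:  W + R = 6 > 5, quorums must share a node
print(always_overlap(5, 2, 3))  # False: W + R = 5, a read can miss the write
```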
Building Resilient APIs
Rate Limiting
Token Bucket Algorithm
The token bucket allows bursting up to a maximum capacity while enforcing a sustained average rate.
Burst Capacity
Burst capacity lets clients absorb short spikes without being throttled — tokens accumulate when the client is idle.
Token Refill Rate
The refill rate defines the sustained throughput ceiling. Burst capacity is a credit; the refill rate is the income.
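A minimal in-memory sketch of the algorithm (not a production limiter; parameter names are illustrative):

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # burst ceiling (max tokens)
        self.refill_rate = refill_rate  # sustained rate (tokens per second)
        self.tokens = capacity          # start full: allow an initial burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Accumulate tokens for elapsed idle time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=1.0)  # burst of 5, 1 req/s sustained
results = [bucket.allow() for _ in range(7)]
print(results)  # first 5 pass on the initial burst, then throttled
```

Refilling lazily on each call, rather than with a background timer, keeps the limiter stateless between requests apart from two floats.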
Per-User vs Global Buckets
Per-user buckets prevent one client from starving others. Global buckets protect the server but allow noisy-neighbor problems.
Sliding Window Algorithm
The sliding window tracks request counts within a rolling time window, avoiding the boundary spikes of fixed-window counters.
Fixed Window vs Sliding Window
A fixed window resets at a hard boundary — a client can send 2× the rate limit by frontloading two windows. The sliding window eliminates this by computing the window from the current timestamp.
Redis Implementation
Store request timestamps in a sorted set; prune entries older than the window on each request; count remaining entries.
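The same recipe sketched in memory, with a deque standing in for the sorted set (in Redis this maps to ZADD, ZREMRANGEBYSCORE, and ZCARD):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()  # request times, oldest first

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Prune entries older than the window (ZREMRANGEBYSCORE in Redis).
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        # Count what remains (ZCARD); record the request if under the limit (ZADD).
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
print([limiter.allow(now=t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
print(limiter.allow(now=61.5))                       # True: t=0 and t=1 expired
```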
Circuit Breaker Pattern
States
Closed
Normal operation. Requests pass through. Failures are counted.
Open
The circuit has tripped. Requests fail immediately without hitting the downstream service.
Timeout Before Half-Open
After a cooldown period, the circuit transitions to half-open to probe whether the downstream has recovered.
Probe Request Strategy
Send one request. If it succeeds, close the circuit. If it fails, reopen and restart the cooldown.
Half-Open
A limited number of requests are allowed through as a recovery probe.
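The three states can be sketched as a small state machine; the threshold and cooldown values here are illustrative, not tuned:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 5.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown:
                self.state = "half-open"   # cooldown elapsed: allow one probe
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # trip, or re-trip after a failed probe
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"              # a successful call (or probe) closes it
        return result
```

Wrapping downstream calls in `breaker.call(lambda: client.get(url))` is the typical usage; a production version would also apply the rolling error-rate and minimum-volume rules described below rather than a raw failure count.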
Failure Thresholds
Error Rate Threshold
Trip the circuit when the error rate exceeds a percentage over a rolling window, not a raw count — raw counts penalize high-traffic services unfairly.
Minimum Request Volume
Don't trip on 1 failure out of 1 request. Require a minimum volume before evaluating the error rate.
Hystrix Defaults
Netflix Hystrix defaults: 20 requests minimum, 50% error rate, 5s rolling window, 5s sleep window before half-open.
Retry Strategies
Exponential Backoff
Exponential backoff doubles the wait time after each failure, spreading retry load across time rather than concentrating it into a storm.
Jitter
Without jitter, synchronized clients retry at the same moment after a failure — coordinated thundering herds. Add random jitter to decorrelate retries.
Full Jitter vs Equal Jitter
Full jitter: sleep = random(0, base * 2^attempt). Equal jitter: sleep = base * 2^attempt / 2 + random(0, base * 2^attempt / 2). Equal jitter keeps a minimum floor.
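Both formulas sketched directly; `base`, `cap`, and the zero-based `attempt` counter are illustrative parameters:

```python
import random

def full_jitter(base: float, attempt: int, cap: float = 60.0) -> float:
    """Sleep anywhere in [0, backoff]: maximum decorrelation, no floor."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def equal_jitter(base: float, attempt: int, cap: float = 60.0) -> float:
    """Sleep in [backoff/2, backoff]: keeps a minimum floor of half the backoff."""
    backoff = min(cap, base * 2 ** attempt)
    return backoff / 2 + random.uniform(0, backoff / 2)

# With base=1s, attempt 3 gives a backoff of 8s:
print(round(full_jitter(1.0, 3), 2))   # somewhere in [0, 8]
print(round(equal_jitter(1.0, 3), 2))  # somewhere in [4, 8]
```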
AWS Recommendations
AWS documented these jitter strategies in 2015 after observing retry storms in S3 clients. Full jitter reduces contention the most.
Idempotency Keys
Why Idempotency Matters
Retries are only safe if the operation is idempotent. Non-idempotent retries can double-charge users, send duplicate emails, or corrupt state.
Idempotency Key Design
Generate the key client-side before the first attempt. A UUID derived from the request's logical identity works; a random UUID does not — it changes on retry.
Server-Side Deduplication Window
Servers store idempotency keys for a bounded window (e.g., 24 hours). Keys older than the window are evicted — the client must not retry stale operations.
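A sketch of the derived key and server-side replay; `idempotency_key` and `PaymentServer` are hypothetical names, and eviction of old keys is omitted:

```python
import uuid

def idempotency_key(user_id: str, order_id: str) -> str:
    # uuid5 is deterministic: the same logical request yields the same key,
    # so a retry reuses it (unlike a fresh random uuid4 per attempt).
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"charge:{user_id}:{order_id}"))

class PaymentServer:
    def __init__(self):
        self.seen = {}     # key -> cached response (evicted after the window)
        self.charges = 0   # how many times money actually moved

    def charge(self, key: str, amount: int) -> dict:
        if key in self.seen:
            return self.seen[key]  # duplicate: replay the original response
        self.charges += 1
        response = {"status": "charged", "amount": amount}
        self.seen[key] = response
        return response

server = PaymentServer()
key = idempotency_key("user-1", "order-9")
server.charge(key, 500)
server.charge(key, 500)  # retry after a timeout: deduplicated
print(server.charges)    # 1: the user was charged exactly once
```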