CAP Theorem and Distributed Consensus
Understand the CAP theorem, its practical implications, and distributed consensus algorithms like Raft and Paxos. Learn how real databases handle partition tolerance.
This is Part 9 of the System Design from Zero to Hero series.
TL;DR
The CAP theorem states that during a network partition you must choose between consistency and availability, and understanding this trade-off drives every distributed system decision. This post covers CAP and its PACELC extension, consensus algorithms like Raft, and practical conflict resolution strategies that real databases use.
Why This Matters
Throughout this series, we have built up a mental model of distributed systems: scaling, load balancing, databases, caching, message queues, and sharding. Every one of these components runs across multiple machines. And whenever you have multiple machines, you have a fundamental question: what happens when they cannot talk to each other?
The CAP theorem is not an abstract academic concept. It is the reason DynamoDB offers tunable consistency. It is the reason your bank's ATM might refuse a withdrawal during a network outage rather than risk double-spending. It is the reason distributed databases need consensus algorithms. Understanding CAP is understanding the constraints that shape every distributed architecture.
Core Concepts
The CAP Theorem Explained
The CAP theorem, proven by Seth Gilbert and Nancy Lynch in 2002 (building on Eric Brewer's 2000 conjecture), states that a distributed data store can provide at most two of three guarantees simultaneously:
Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time. This is linearizability -- not to be confused with the "C" in ACID, which refers to data integrity constraints.
Availability (A): Every request receives a non-error response, without guaranteeing it contains the most recent write. The system remains operational even if some nodes are unreachable.
Partition Tolerance (P): The system continues to operate despite network partitions -- messages being dropped or delayed between nodes.
The critical insight: network partitions are not optional. In any distributed system, network failures will happen. Cables get cut. Switches fail. Cloud availability zones lose connectivity. You cannot choose to avoid partitions. So the real choice is: during a partition, do you sacrifice consistency or availability?
CP vs AP Systems
CP Systems (Consistency + Partition Tolerance): When a partition occurs, CP systems refuse to serve requests that might return stale data. They choose correctness over availability.
Examples:
- HBase: During a region server failure, affected regions are unavailable until failover completes
- MongoDB (with majority read concern): Reads block until they can confirm the data is committed to a majority of replicas
- etcd/ZooKeeper: Configuration stores that must never return stale data
AP Systems (Availability + Partition Tolerance): When a partition occurs, AP systems continue serving requests from any reachable node, even if the data might be stale or different nodes might return different values.
Examples:
- Cassandra: Every node can accept reads and writes. During a partition, different nodes may have different values for the same key.
- DynamoDB: Continues serving requests during partitions, reconciling conflicts later.
- DNS: Returns cached records even if the authoritative server is unreachable.
There is no CA system in practice. A system that is both consistent and available but not partition-tolerant is just a single-node system. The moment you distribute data across a network, partitions become possible.
The PACELC Extension
CAP only describes behavior during partitions. But what about normal operation? The PACELC theorem extends CAP:
If Partition (P), choose Availability (A) or Consistency (C); Else (E), choose Latency (L) or Consistency (C).
Even when the network is healthy, there is a trade-off between consistency and latency. Synchronously replicating a write to three nodes before acknowledging it is consistent but slow. Acknowledging after writing to one node is fast but risks data loss if that node fails.
| System | During Partition (PAC) | Normal Operation (ELC) |
|---|---|---|
| DynamoDB | PA | EL (fast reads with eventual consistency) |
| Cassandra | PA | EL (tunable per query) |
| MongoDB | PC | EC (waits for majority acknowledgment) |
| PostgreSQL (sync replication) | PC | EC (synchronous replication) |
| PostgreSQL (async replication) | PA | EL (fast writes, async replication) |
This framework is more useful than raw CAP for making real engineering decisions because it captures the latency trade-off you face every day, not just during rare partition events.
Eventual Consistency Patterns
AP systems accept that replicas will temporarily diverge. Eventual consistency guarantees that, given no new writes, all replicas will eventually converge to the same value. But "eventually" can mean milliseconds or hours.
Read-your-writes consistency: A user who just updated their profile should see the updated profile, even if other users temporarily see the old version. Achieve this by routing the user's reads to the same node that handled their write, or by including a version token in the response.
Monotonic reads: A user should never see older data after seeing newer data. If replica A returns version 5, a subsequent read should not return version 3 from replica B. Achieve this by pinning users to specific replicas (session affinity).
Causal consistency: If event B was caused by event A, then any node that sees B must also have seen A. Achieve this with vector clocks or logical timestamps.
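The read-your-writes pattern above can be sketched with a version token. This is a toy in-memory model, not any particular database's API -- `Replica` and `read_your_writes` are hypothetical names, and a real system would carry the token in a session cookie or request header:

```python
class Replica:
    """Toy replica holding a (version, value) pair per key."""
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def apply(self, key, version, value):
        # Replication applies a write only if it is newer than what we hold
        cur_version, _ = self.data.get(key, (0, None))
        if version > cur_version:
            self.data[key] = (version, value)

def read_your_writes(replicas, key, min_version):
    """Return a value at least as new as the client's last-write token;
    fall back to the freshest value seen if no replica has caught up."""
    best_version, best_value = 0, None
    for replica in replicas:
        version, value = replica.read(key)
        if version >= min_version:
            return value  # this replica has seen the client's write
        if version > best_version:
            best_version, best_value = version, value
    return best_value  # stale; the caller may retry or route elsewhere
```

Here the client remembers the version its write produced; a read that lands on a lagging replica keeps looking until it finds a replica that has caught up.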
Raft Consensus Algorithm
When you need strong consistency across distributed nodes, you need a consensus algorithm. Raft is the most widely used in modern systems (etcd, CockroachDB, TiKV) because it was designed to be understandable.
Raft works in two phases: leader election and log replication.
Leader Election:
1. All nodes start as FOLLOWERS
2. If a follower receives no heartbeat for a random timeout (150-300ms),
it becomes a CANDIDATE and starts an election
3. The candidate votes for itself and requests votes from other nodes
4. A node votes for the first candidate it hears from in each TERM
5. If a candidate receives votes from a majority, it becomes the LEADER
6. The leader sends periodic heartbeats to maintain authority
The randomized timeout makes split votes unlikely: two nodes rarely time out at the same moment, and even if an election ends in a split vote, the next round uses fresh random timeouts, so one candidate almost always starts first and wins.
Log Replication:
1. Client sends a write to the LEADER
2. Leader appends the entry to its log
3. Leader replicates the entry to all FOLLOWERS
4. When a MAJORITY of nodes acknowledge, the entry is COMMITTED
5. Leader applies the committed entry to its state machine
6. Leader responds to the client
Here is a simplified Raft node implementation showing the core state transitions:
import enum
import random
import time
from dataclasses import dataclass, field

class Role(enum.Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    term: int
    command: str

@dataclass
class RaftNode:
    node_id: str
    role: Role = Role.FOLLOWER
    current_term: int = 0
    voted_for: str | None = None
    log: list = field(default_factory=list)
    commit_index: int = -1
    election_timeout: float = 0.0

    def reset_election_timeout(self):
        """Randomized timeout prevents simultaneous elections."""
        self.election_timeout = time.time() + random.uniform(0.15, 0.30)

    def start_election(self):
        """Transition from follower to candidate."""
        self.current_term += 1
        self.role = Role.CANDIDATE
        self.voted_for = self.node_id
        self.reset_election_timeout()
        # Request votes from all other nodes.
        # If a majority responds with grant=True, become leader.
        return {
            "type": "request_vote",
            "term": self.current_term,
            "candidate_id": self.node_id,
            "last_log_index": len(self.log) - 1,
            "last_log_term": self.log[-1].term if self.log else 0,
        }

    def handle_vote_request(self, term, candidate_id, last_log_index, last_log_term):
        """Decide whether to vote for a candidate."""
        if term < self.current_term:
            return False
        if term > self.current_term:
            self.current_term = term
            self.role = Role.FOLLOWER
            self.voted_for = None
        # Vote if we haven't voted yet and the candidate's log is up to date
        if self.voted_for is None or self.voted_for == candidate_id:
            my_last_term = self.log[-1].term if self.log else 0
            my_last_index = len(self.log) - 1
            if (last_log_term > my_last_term or
                    (last_log_term == my_last_term and last_log_index >= my_last_index)):
                self.voted_for = candidate_id
                self.reset_election_timeout()
                return True
        return False

    def append_entry(self, command: str):
        """Leader appends a new entry and replicates to followers."""
        if self.role != Role.LEADER:
            raise Exception("Only the leader can append entries")
        entry = LogEntry(term=self.current_term, command=command)
        self.log.append(entry)
        # Replicate to followers; commit when a majority acknowledges
        return entry

The Split-Brain Problem
Split-brain occurs when a network partition divides a cluster and both sides believe they are the active primary. Without a consensus algorithm, both sides accept writes, leading to data divergence.
Raft prevents split-brain through its majority requirement. A leader must maintain heartbeat contact with a majority of nodes. If a partition isolates the leader from the majority, the majority side elects a new leader, and the old leader's writes (if any) cannot be committed because it cannot reach a majority. When the partition heals, the old leader discovers the higher term and steps down.
Quorum reads and writes generalize this: in a cluster of N nodes, if you write to W nodes and read from R nodes, then as long as W + R > N, reads are guaranteed to see the latest write. Common configurations:
- N=3, W=2, R=2: Strong consistency (used by most CP systems)
- N=3, W=1, R=1: Fastest but no consistency guarantee
- N=3, W=3, R=1: Very durable writes, fast reads, but writes fail if any node is down
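The overlap rule can be demonstrated with a toy in-memory model (the `Replica`, `quorum_write`, and `quorum_read` names are illustrative, not a real client API). Writing to the first W replicas and reading from the last R makes the overlap, or the lack of it, explicit:

```python
class Replica:
    """Toy replica storing (version, value) per key."""
    def __init__(self):
        self.data = {}  # key -> (version, value)

def quorum_write(replicas, w, key, version, value):
    """Write to W replicas (here: the first W of the cluster)."""
    for rep in replicas[:w]:
        rep.data[key] = (version, value)

def quorum_read(replicas, r, key):
    """Read R replicas (here: the last R) and keep the highest version.
    When W + R > N, the read set must intersect the latest write set,
    so at least one response carries the newest value."""
    responses = [rep.data.get(key, (0, None)) for rep in replicas[-r:]]
    return max(responses, key=lambda pair: pair[0])
```

With N=3, W=2, R=2 the write set {0,1} and read set {1,2} share replica 1, so the read sees the write; with W=1, R=1 the sets {0} and {2} can miss each other entirely, which is exactly the stale read the inequality rules out.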
Conflict Resolution Strategies
In AP systems that accept writes on multiple nodes during a partition, conflicts will occur. How do you resolve them?
Last-Write-Wins (LWW): Each write carries a timestamp. When two conflicting writes are discovered, the one with the later timestamp wins.
def resolve_lww(value_a, timestamp_a, value_b, timestamp_b):
    """Last write wins -- simple but can lose data."""
    if timestamp_a >= timestamp_b:
        return value_a
    return value_b

LWW is simple but lossy. If two users edit the same document during a partition, one edit is silently discarded. Cassandra uses LWW by default.
Vector Clocks: Vector clocks track causality. Each node maintains a counter, and the vector of all counters is attached to every value. Two values with incomparable vectors are concurrent (conflict), while one dominating the other indicates a causal ordering.
def compare_vector_clocks(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks to determine ordering."""
    all_nodes = set(vc_a.keys()) | set(vc_b.keys())
    a_greater = False
    b_greater = False
    for node in all_nodes:
        a_val = vc_a.get(node, 0)
        b_val = vc_b.get(node, 0)
        if a_val > b_val:
            a_greater = True
        elif b_val > a_val:
            b_greater = True
    if a_greater and not b_greater:
        return "A_AFTER_B"    # A happened after B
    elif b_greater and not a_greater:
        return "B_AFTER_A"    # B happened after A
    elif a_greater and b_greater:
        return "CONCURRENT"   # Conflict -- must merge
    else:
        return "EQUAL"        # Same version

# Example
vc_a = {"node1": 3, "node2": 1}  # Written by node1
vc_b = {"node1": 2, "node2": 2}  # Written by node2
# Result: CONCURRENT -- these writes happened independently

When vector clocks detect a conflict, the application must resolve it. DynamoDB returns all conflicting values to the client and lets the application merge them. Amazon's shopping cart famously used this approach -- conflicting carts were merged by taking the union of items, preferring to show an extra item over losing one.
CRDTs (Conflict-free Replicated Data Types): CRDTs are data structures designed so that concurrent updates always converge automatically. A G-Counter (grow-only counter) assigns each node its own counter; the total is the sum of all counters. Two nodes can increment independently and merge by taking the max of each counter -- no conflicts possible.
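A minimal G-Counter sketch (illustrative, not any specific CRDT library's API) shows why merges cannot conflict -- merge is commutative, associative, and idempotent, so replicas converge no matter how many times or in what order they exchange state:

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot;
    merge takes the per-node max, so concurrent updates always converge."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {node_id: 0}

    def increment(self, amount: int = 1):
        # A node only ever advances its own entry
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter"):
        # Element-wise max: applying the same merge twice changes nothing
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())
```

Two nodes can increment independently during a partition and reconcile by merging state when it heals; both end up with the same total.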
Trade-offs and Decision Framework
| Scenario | Choose CP | Choose AP |
|---|---|---|
| Financial transactions | Yes | No |
| Shopping cart | No | Yes |
| User profile | Depends on field | Depends on field |
| Inventory count | Yes (overselling risk) | Only for display, not purchase |
| Social media feed | No | Yes |
| Configuration/feature flags | Yes | No |
Most real systems are not purely CP or AP. They are tunable. Cassandra lets you specify consistency level per query: ONE for AP reads, QUORUM for CP reads, ALL for maximum consistency. Choose per operation based on the business impact of reading stale data.
Common Interview Questions
Q: Is the CAP theorem still relevant? Some say it is outdated. A: CAP remains relevant as a foundational constraint, but PACELC is more practical for engineering decisions because it accounts for the latency-consistency trade-off during normal operation. CAP is binary during partitions; real systems need nuance.
Q: How does a system detect a network partition? A: Nodes detect partitions through heartbeat timeouts. If a node stops receiving heartbeats from the leader (Raft) or from peers (gossip protocol), it assumes a partition. The challenge is distinguishing a slow node from a partitioned node -- aggressive timeouts cause false positives; conservative timeouts cause slow failover.
Q: Why can't you just use timestamps for ordering in distributed systems? A: Clock skew. Physical clocks across machines are never perfectly synchronized. NTP can drift by milliseconds, and a leap second can cause larger jumps. This is why logical clocks (Lamport timestamps, vector clocks) or hybrid logical clocks (used by CockroachDB and YugabyteDB) exist -- they provide causal ordering without relying on synchronized physical clocks.
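A Lamport clock, the simplest of these logical clocks, fits in a few lines (a toy sketch, not a production implementation): tick on every local event, and on receive jump past the sender's timestamp, which guarantees that if A causally precedes B then timestamp(A) < timestamp(B).

```python
class LamportClock:
    """Logical clock providing causal ordering without wall-clock time."""
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Local event or send: advance by one
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        # On receive: jump past whatever the sender had seen
        self.time = max(self.time, sender_time) + 1
        return self.time
```

Note the converse does not hold: a smaller timestamp does not prove causality, which is the gap vector clocks close.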
Q: How would you design a distributed lock service? A: Use a CP system with a consensus algorithm. etcd and ZooKeeper both provide distributed locks via Raft/ZAB consensus. The lock holder must periodically renew a lease. If the lease expires (holder crashed or was partitioned), the lock is released and another node can acquire it. Include fencing tokens to prevent a previously-partitioned lock holder from making stale writes after the lock has been reassigned.
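The fencing-token idea can be illustrated with a toy lock service and a store that rejects stale tokens (hypothetical names, not etcd's or ZooKeeper's actual API):

```python
class LockService:
    """Toy lock service that issues a monotonically increasing
    fencing token with every successful acquisition."""
    def __init__(self):
        self.token = 0

    def acquire(self) -> int:
        self.token += 1
        return self.token

class FencedStore:
    """Storage that rejects writes carrying a token older than the
    highest token it has seen, blocking a stale ex-lock-holder."""
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token: int, key, value) -> bool:
        if token < self.highest_token:
            return False  # stale holder: fenced off
        self.highest_token = token
        self.data[key] = value
        return True
```

If holder 1 is partitioned, its lease expires, and holder 2 acquires the lock with a higher token, any write holder 1 attempts after the partition heals arrives with an outdated token and is rejected.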
What's Next
We have covered the theoretical foundations that govern distributed system behavior. In Part 10: Monitoring, Observability, and Site Reliability, we bring it all together with the practices that keep these systems running in production -- metrics, tracing, SLOs, and resilience patterns.
FAQ
Is the CAP theorem still relevant in modern system design?
Yes, but it is often oversimplified. Modern systems use the PACELC extension, which adds latency vs consistency trade-offs during normal operation, giving a more complete picture of distributed system behavior.
What is the difference between Raft and Paxos consensus algorithms?
Both achieve distributed consensus, but Raft was designed to be more understandable with clear leader election and log replication phases. Paxos is theoretically elegant but notoriously difficult to implement correctly.
How do databases like DynamoDB and Cassandra handle the CAP trade-off?
DynamoDB and Cassandra choose availability over strong consistency (AP systems) and offer tunable consistency levels, allowing developers to choose per-query whether they need strong or eventual consistency.