CAP Theorem and Distributed Consensus
Understand the CAP theorem, its practical implications, and distributed consensus algorithms like Raft and Paxos. Learn how real databases handle partition tolerance.
This is Part 9 of the System Design from Zero to Hero series.
TL;DR
The CAP theorem states that during a network partition you must choose between consistency and availability, and understanding this trade-off drives every distributed system decision. This post covers CAP and its PACELC extension, consensus algorithms like Raft, and practical conflict resolution strategies that real databases use.
Why This Matters
Throughout this series, we have built up a mental model of distributed systems: scaling, load balancing, databases, caching, message queues, and sharding. Every one of these components runs across multiple machines. And whenever you have multiple machines, you have a fundamental question: what happens when they cannot talk to each other?
The CAP theorem is not an abstract academic concept. It is the reason DynamoDB offers tunable consistency. It is the reason your bank's ATM might refuse a withdrawal during a network outage rather than risk double-spending. It is the reason distributed databases need consensus algorithms. Understanding CAP is understanding the constraints that shape every distributed architecture.
Core Concepts
The CAP Theorem Explained
The CAP theorem, proven by Seth Gilbert and Nancy Lynch in 2002 (building on Eric Brewer's 2000 conjecture), states that a distributed data store can provide at most two of three guarantees simultaneously:
Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time. This is linearizability -- not to be confused with the "C" in ACID, which refers to data integrity constraints.
Availability (A): Every request receives a non-error response, without guaranteeing it contains the most recent write. The system remains operational even if some nodes are unreachable.
Partition Tolerance (P): The system continues to operate despite network partitions -- messages being dropped or delayed between nodes.
The critical insight: network partitions are not optional. In any distributed system, network failures will happen. Cables get cut. Switches fail. Cloud availability zones lose connectivity. You cannot choose to avoid partitions. So the real choice is: during a partition, do you sacrifice consistency or availability?
CP vs AP Systems
CP Systems (Consistency + Partition Tolerance): When a partition occurs, CP systems refuse to serve requests that might return stale data. They choose correctness over availability.
Examples:
- HBase: During a region server failure, affected regions are unavailable until failover completes
- MongoDB (with majority read concern): Reads block until they can confirm the data is committed to a majority of replicas
- etcd/ZooKeeper: Configuration stores that must never return stale data
AP Systems (Availability + Partition Tolerance): When a partition occurs, AP systems continue serving requests from any reachable node, even if the data might be stale or different nodes might return different values.
Examples:
- Cassandra: Every node can accept reads and writes. During a partition, different nodes may have different values for the same key.
- DynamoDB: Continues serving requests during partitions, reconciling conflicts later.
- DNS: Returns cached records even if the authoritative server is unreachable.
There is no CA system in practice. A system that is both consistent and available but not partition-tolerant is just a single-node system. The moment you distribute data across a network, partitions become possible.
The PACELC Extension
CAP only describes behavior during partitions. But what about normal operation? The PACELC theorem extends CAP:
If Partition (P), choose Availability (A) or Consistency (C); Else (E), choose Latency (L) or Consistency (C).
Even when the network is healthy, there is a trade-off between consistency and latency. Synchronously replicating a write to three nodes before acknowledging it is consistent but slow. Acknowledging after writing to one node is fast but risks data loss if that node fails.
| System | During Partition (PAC) | Normal Operation (ELC) |
|---|---|---|
| DynamoDB | PA | EL (fast reads with eventual consistency) |
| Cassandra | PA | EL (tunable per query) |
| MongoDB | PC | EC (waits for majority acknowledgment) |
| PostgreSQL (sync replication) | PC | EC (synchronous replication) |
| PostgreSQL (async replication) | PA | EL (fast writes, async replication) |
This framework is more useful than raw CAP for making real engineering decisions because it captures the latency trade-off you face every day, not just during rare partition events.
Eventual Consistency Patterns
AP systems accept that replicas will temporarily diverge. Eventual consistency guarantees that, given no new writes, all replicas will eventually converge to the same value. But "eventually" can mean milliseconds or hours.
Read-your-writes consistency: A user who just updated their profile should see the updated profile, even if other users temporarily see the old version. Achieve this by routing the user's reads to the same node that handled their write, or by including a version token in the response.
Monotonic reads: A user should never see older data after seeing newer data. If replica A returns version 5, a subsequent read should not return version 3 from replica B. Achieve this by pinning users to specific replicas (session affinity).
Causal consistency: If event B was caused by event A, then any node that sees B must also have seen A. Achieve this with vector clocks or logical timestamps.
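The read-your-writes pattern above can be sketched with a version token. This is a toy in-memory model, not any particular database's API -- `Replica` and `read_your_writes` are hypothetical names, and a real system would carry the token in a session cookie or request header:

```python
class Replica:
    """Toy replica holding a (version, value) pair per key."""
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def apply(self, key, version, value):
        # Replication applies a write only if it is newer than what we hold
        cur_version, _ = self.data.get(key, (0, None))
        if version > cur_version:
            self.data[key] = (version, value)

def read_your_writes(replicas, key, min_version):
    """Return a value at least as new as the client's last-write token;
    fall back to the freshest value seen if no replica has caught up."""
    best_version, best_value = 0, None
    for replica in replicas:
        version, value = replica.read(key)
        if version >= min_version:
            return value  # this replica has seen the client's write
        if version > best_version:
            best_version, best_value = version, value
    return best_value  # stale; the caller may retry or route elsewhere
```

Here the client remembers the version its write produced; a read that lands on a lagging replica keeps looking until it finds a replica that has caught up.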
Raft Consensus Algorithm
When you need strong consistency across distributed nodes, you need a consensus algorithm. Raft is the most widely used in modern systems (etcd, CockroachDB, TiKV) because it was designed to be understandable.
Raft works in two phases: leader election and log replication.
Leader Election:
1. All nodes start as FOLLOWERS
2. If a follower receives no heartbeat for a random timeout (150-300ms),
it becomes a CANDIDATE and starts an election
3. The candidate votes for itself and requests votes from other nodes
4. A node votes for the first candidate it hears from in each TERM
5. If a candidate receives votes from a majority, it becomes the LEADER
6. The leader sends periodic heartbeats to maintain authority
The randomized timeout makes split votes unlikely: two nodes rarely time out at the same moment, and even if an election ends in a split vote, the next round uses fresh random timeouts, so one candidate almost always starts first and wins.
Log Replication:
1. Client sends a write to the LEADER
2. Leader appends the entry to its log
3. Leader replicates the entry to all FOLLOWERS
4. When a MAJORITY of nodes acknowledge, the entry is COMMITTED
5. Leader applies the committed entry to its state machine
6. Leader responds to the client
Here is a simplified Raft node implementation showing the core state transitions:
import enum
import random
import time
from dataclasses import dataclass, field

class Role(enum.Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    term: int
    command: str

@dataclass
class RaftNode:
    node_id: str
    role: Role = Role.FOLLOWER
    current_term: int = 0
    voted_for: str | None = None
    log: list = field(default_factory=list)
    commit_index: int = -1
    election_timeout: float = 0.0

    def reset_election_timeout(self):
        """Randomized timeout prevents simultaneous elections."""
        self.election_timeout = time.time() + random.uniform(0.15, 0.30)

    def start_election(self):
        """Transition from follower to candidate."""
        self.current_term += 1
        self.role = Role.CANDIDATE
        self.voted_for = self.node_id
        self.reset_election_timeout()
        # Request votes from all other nodes.
        # If a majority responds with grant=True, become leader.
        return {
            "type": "request_vote",
            "term": self.current_term,
            "candidate_id": self.node_id,
            "last_log_index": len(self.log) - 1,
            "last_log_term": self.log[-1].term if self.log else 0,
        }

    def handle_vote_request(self, term, candidate_id, last_log_index, last_log_term):
        """Decide whether to vote for a candidate."""
        if term < self.current_term:
            return False
        if term > self.current_term:
            self.current_term = term
            self.role = Role.FOLLOWER
            self.voted_for = None
        # Vote if we haven't voted yet and the candidate's log is up to date
        if self.voted_for is None or self.voted_for == candidate_id:
            my_last_term = self.log[-1].term if self.log else 0
            my_last_index = len(self.log) - 1
            if (last_log_term > my_last_term or
                    (last_log_term == my_last_term and last_log_index >= my_last_index)):
                self.voted_for = candidate_id
                self.reset_election_timeout()
                return True
        return False

    def append_entry(self, command: str):
        """Leader appends a new entry and replicates to followers."""
        if self.role != Role.LEADER:
            raise Exception("Only the leader can append entries")
        entry = LogEntry(term=self.current_term, command=command)
        self.log.append(entry)
        # Replicate to followers; commit when a majority acknowledges
        return entry

The Split-Brain Problem
Split-brain occurs when a network partition divides a cluster and both sides believe they are the active primary. Without a consensus algorithm, both sides accept writes, leading to data divergence.
Raft prevents split-brain through its majority requirement. A leader must maintain heartbeat contact with a majority of nodes. If a partition isolates the leader from the majority, the majority side elects a new leader, and the old leader's writes (if any) cannot be committed because it cannot reach a majority. When the partition heals, the old leader discovers the higher term and steps down.
Quorum reads and writes generalize this: in a cluster of N nodes, if you write to W nodes and read from R nodes, then as long as W + R > N, reads are guaranteed to see the latest write. Common configurations:
- N=3, W=2, R=2: Strong consistency (used by most CP systems)
- N=3, W=1, R=1: Fastest but no consistency guarantee
- N=3, W=3, R=1: Very durable writes, fast reads, but writes fail if any node is down
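The overlap rule can be demonstrated with a toy in-memory model (the `Replica`, `quorum_write`, and `quorum_read` names are illustrative, not a real client API). Writing to the first W replicas and reading from the last R makes the overlap, or the lack of it, explicit:

```python
class Replica:
    """Toy replica storing (version, value) per key."""
    def __init__(self):
        self.data = {}  # key -> (version, value)

def quorum_write(replicas, w, key, version, value):
    """Write to W replicas (here: the first W of the cluster)."""
    for rep in replicas[:w]:
        rep.data[key] = (version, value)

def quorum_read(replicas, r, key):
    """Read R replicas (here: the last R) and keep the highest version.
    When W + R > N, the read set must intersect the latest write set,
    so at least one response carries the newest value."""
    responses = [rep.data.get(key, (0, None)) for rep in replicas[-r:]]
    return max(responses, key=lambda pair: pair[0])
```

With N=3, W=2, R=2 the write set {0,1} and read set {1,2} share replica 1, so the read sees the write; with W=1, R=1 the sets {0} and {2} can miss each other entirely, which is exactly the stale read the inequality rules out.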
Conflict Resolution Strategies
In AP systems that accept writes on multiple nodes during a partition, conflicts will occur. How do you resolve them?
Last-Write-Wins (LWW): Each write carries a timestamp. When two conflicting writes are discovered, the one with the later timestamp wins.
def resolve_lww(value_a, timestamp_a, value_b, timestamp_b):
    """Last write wins -- simple but can lose data."""
    if timestamp_a >= timestamp_b:
        return value_a
    return value_b

LWW is simple but lossy. If two users edit the same document during a partition, one edit is silently discarded. Cassandra uses LWW by default.
Vector Clocks: Vector clocks track causality. Each node maintains a counter, and the vector of all counters is attached to every value. Two values with incomparable vectors are concurrent (conflict), while one dominating the other indicates a causal ordering.
def compare_vector_clocks(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks to determine ordering."""
    all_nodes = set(vc_a.keys()) | set(vc_b.keys())
    a_greater = False
    b_greater = False
    for node in all_nodes:
        a_val = vc_a.get(node, 0)
        b_val = vc_b.get(node, 0)
        if a_val > b_val:
            a_greater = True
        elif b_val > a_val:
            b_greater = True
    if a_greater and not b_greater:
        return "A_AFTER_B"    # A happened after B
    elif b_greater and not a_greater:
        return "B_AFTER_A"    # B happened after A
    elif a_greater and b_greater:
        return "CONCURRENT"   # Conflict -- must merge
    else:
        return "EQUAL"        # Same version

# Example
vc_a = {"node1": 3, "node2": 1}  # Written by node1
vc_b = {"node1": 2, "node2": 2}  # Written by node2
# Result: CONCURRENT -- these writes happened independently

When vector clocks detect a conflict, the application must resolve it. DynamoDB returns all conflicting values to the client and lets the application merge them. Amazon's shopping cart famously used this approach -- conflicting carts were merged by taking the union of items, preferring to show an extra item over losing one.
CRDTs (Conflict-free Replicated Data Types): CRDTs are data structures designed so that concurrent updates always converge automatically. A G-Counter (grow-only counter) assigns each node its own counter; the total is the sum of all counters. Two nodes can increment independently and merge by taking the max of each counter -- no conflicts possible.
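A minimal G-Counter sketch (illustrative, not any specific CRDT library's API) shows why merges cannot conflict -- merge is commutative, associative, and idempotent, so replicas converge no matter how many times or in what order they exchange state:

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot;
    merge takes the per-node max, so concurrent updates always converge."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {node_id: 0}

    def increment(self, amount: int = 1):
        # A node only ever advances its own entry
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter"):
        # Element-wise max: applying the same merge twice changes nothing
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())
```

Two nodes can increment independently during a partition and reconcile by merging state when it heals; both end up with the same total.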
Trade-offs and Decision Framework
| Scenario | Choose CP | Choose AP |
|---|---|---|
| Financial transactions | Yes | No |
| Shopping cart | No | Yes |
| User profile | Depends on field | Depends on field |
| Inventory count | Yes (overselling risk) | Only for display, not purchase |
| Social media feed | No | Yes |
| Configuration/feature flags | Yes | No |
Most real systems are not purely CP or AP. They are tunable. Cassandra lets you specify consistency level per query: ONE for AP reads, QUORUM for CP reads, ALL for maximum consistency. Choose per operation based on the business impact of reading stale data.
Common Interview Questions
Q: Is the CAP theorem still relevant? Some say it is outdated. A: CAP remains relevant as a foundational constraint, but PACELC is more practical for engineering decisions because it accounts for the latency-consistency trade-off during normal operation. CAP is binary during partitions; real systems need nuance.
Q: How does a system detect a network partition? A: Nodes detect partitions through heartbeat timeouts. If a node stops receiving heartbeats from the leader (Raft) or from peers (gossip protocol), it assumes a partition. The challenge is distinguishing a slow node from a partitioned node -- aggressive timeouts cause false positives; conservative timeouts cause slow failover.
Q: Why can't you just use timestamps for ordering in distributed systems? A: Clock skew. Physical clocks across machines are never perfectly synchronized. NTP can drift by milliseconds, and a leap second can cause larger jumps. This is why logical clocks (Lamport timestamps, vector clocks) or hybrid logical clocks (used by CockroachDB and YugabyteDB) exist -- they provide causal ordering without relying on synchronized physical clocks.
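A Lamport clock, the simplest of these logical clocks, fits in a few lines (a toy sketch, not a production implementation): tick on every local event, and on receive jump past the sender's timestamp, which guarantees that if A causally precedes B then timestamp(A) < timestamp(B).

```python
class LamportClock:
    """Logical clock providing causal ordering without wall-clock time."""
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Local event or send: advance by one
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        # On receive: jump past whatever the sender had seen
        self.time = max(self.time, sender_time) + 1
        return self.time
```

Note the converse does not hold: a smaller timestamp does not prove causality, which is the gap vector clocks close.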
Q: How would you design a distributed lock service? A: Use a CP system with a consensus algorithm. etcd and ZooKeeper both provide distributed locks via Raft/ZAB consensus. The lock holder must periodically renew a lease. If the lease expires (holder crashed or was partitioned), the lock is released and another node can acquire it. Include fencing tokens to prevent a previously-partitioned lock holder from making stale writes after the lock has been reassigned.
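The fencing-token idea can be illustrated with a toy lock service and a store that rejects stale tokens (hypothetical names, not etcd's or ZooKeeper's actual API):

```python
class LockService:
    """Toy lock service that issues a monotonically increasing
    fencing token with every successful acquisition."""
    def __init__(self):
        self.token = 0

    def acquire(self) -> int:
        self.token += 1
        return self.token

class FencedStore:
    """Storage that rejects writes carrying a token older than the
    highest token it has seen, blocking a stale ex-lock-holder."""
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token: int, key, value) -> bool:
        if token < self.highest_token:
            return False  # stale holder: fenced off
        self.highest_token = token
        self.data[key] = value
        return True
```

If holder 1 is partitioned, its lease expires, and holder 2 acquires the lock with a higher token, any write holder 1 attempts after the partition heals arrives with an outdated token and is rejected.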
What's Next
We have covered the theoretical foundations that govern distributed system behavior. In Part 10: Monitoring, Observability, and Site Reliability, we bring it all together with the practices that keep these systems running in production -- metrics, tracing, SLOs, and resilience patterns.
FAQ
Is the CAP theorem still relevant in modern system design?
Yes, but it is often oversimplified. Modern systems use the PACELC extension, which adds latency vs consistency trade-offs during normal operation, giving a more complete picture of distributed system behavior.
What is the difference between Raft and Paxos consensus algorithms?
Both achieve distributed consensus, but Raft was designed to be more understandable with clear leader election and log replication phases. Paxos is theoretically elegant but notoriously difficult to implement correctly.
How do databases like DynamoDB and Cassandra handle the CAP trade-off?
DynamoDB and Cassandra choose availability over strong consistency (AP systems) and offer tunable consistency levels, allowing developers to choose per-query whether they need strong or eventual consistency.