Database Sharding and Partitioning at Scale
Deep dive into database sharding and partitioning strategies. Learn hash-based, range-based, and geographic sharding with real-world examples and trade-offs.
This is Part 7 of the System Design from Zero to Hero series.
TL;DR
Sharding splits a database across multiple machines by a shard key, enabling horizontal scaling but introducing complexity in cross-shard queries and rebalancing. Choosing the right shard key and partitioning strategy is one of the most consequential decisions you will make in a distributed system.
Why This Matters
In Part 4, we covered how to choose between SQL and NoSQL databases. In Part 2, we discussed vertical and horizontal scaling. But there comes a point where even the beefiest single database server cannot handle your data volume or query load. At that point, you need to split your data across multiple machines.
This is where sharding enters the picture. Every large-scale system -- from social media platforms handling billions of posts to e-commerce platforms processing millions of orders -- relies on some form of data partitioning. Getting it right means seamless scaling. Getting it wrong means years of painful migrations.
Core Concepts
Partitioning vs Sharding
These terms are often used interchangeably, but they are distinct:
Partitioning splits data within a single database instance. PostgreSQL table partitioning, for example, divides a large table into smaller physical tables based on a partition key. The database engine handles routing queries to the correct partition transparently.
Sharding distributes partitions across multiple database instances running on different machines. Each shard is an independent database that holds a subset of the total data. The application (or a proxy layer) must route queries to the correct shard.
Vertical Partitioning splits a table by columns. You might store frequently accessed columns (user_id, name, email) in one table and rarely accessed columns (bio, preferences, settings) in another. This reduces row size and improves cache efficiency.
Horizontal Partitioning splits a table by rows. Users with IDs 1-1,000,000 go to partition A, and users with IDs 1,000,001-2,000,000 go to partition B. This is the basis of sharding.
Range-Based Sharding
Range-based sharding assigns rows to shards based on value ranges of the shard key.
Shard 1: user_id 1 - 1,000,000
Shard 2: user_id 1,000,001 - 2,000,000
Shard 3: user_id 2,000,001 - 3,000,000
Advantages:
- Range queries are efficient (find all users with IDs between X and Y)
- Easy to understand and implement
- Supports sequential scans within a range
Disadvantages:
- Prone to hotspots. If new users are always assigned incrementing IDs, Shard 3 handles all the write traffic while Shard 1 sits idle.
- Uneven data distribution if ranges are not carefully calibrated
Range-based sharding works well for time-series data where you query by time ranges and can tolerate that recent data shards are hotter than historical ones.
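Range-based routing can be sketched as a lookup over sorted boundaries; the bounds below mirror the example ranges above and are purely illustrative, and bisect keeps the lookup logarithmic in the number of shards:

```python
from bisect import bisect_right

# Inclusive upper bound of each shard's user_id range, ascending.
# Illustrative values matching the example above.
SHARD_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def get_range_shard(user_id: int) -> int:
    """Return the index of the shard whose range contains user_id."""
    idx = bisect_right(SHARD_BOUNDS, user_id - 1)
    if idx == len(SHARD_BOUNDS):
        raise ValueError(f"user_id {user_id} exceeds all shard ranges")
    return idx
```

Adding a shard here is just appending a boundary, which is why splitting ranges is the easy part; keeping the ranges evenly loaded is the hard part.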
Hash-Based Sharding
Hash-based sharding applies a hash function to the shard key and uses modulo to determine the shard:
```python
def get_shard(user_id: int, num_shards: int) -> int:
    return hash(user_id) % num_shards
```

This distributes data uniformly across shards, eliminating the hotspot problem of range-based sharding.
The critical flaw: When you add or remove shards, num_shards changes, and almost every key maps to a different shard. This means rehashing and moving nearly all your data -- a catastrophic operation on a live system.
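A quick experiment makes the flaw concrete. This sketch counts how many of a million keys would change shards when a cluster grows from 4 to 5 shards under plain modulo hashing:

```python
def modulo_shard(key: int, num_shards: int) -> int:
    return key % num_shards

# Count keys that map to a different shard after adding a fifth shard.
moved = sum(
    1 for key in range(1_000_000)
    if modulo_shard(key, 4) != modulo_shard(key, 5)
)
print(f"{moved / 1_000_000:.0%} of keys moved")  # prints "80% of keys moved"
```

A key stays put only when key % 4 == key % 5, which happens for just 4 values in every 20, so 80% of the data has to move for a single added shard.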
Consistent Hashing Ring
Consistent hashing solves the rehashing problem. Instead of hash(key) % N, it maps both keys and nodes onto a circular hash space (a ring).
```
              Node A (position 90)
             /                    \
            /                      \
Node D (position 340)       Node B (position 180)
            \                      /
             \                    /
              Node C (position 270)
```
Each key is assigned to the first node encountered when walking clockwise from the key's position on the ring. When a node is added or removed, only the keys between the new node and its predecessor need to move -- roughly 1/N of the total data instead of nearly all of it.
```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    def __init__(self, nodes=None, virtual_nodes=150):
        self.ring = {}
        self.sorted_keys = []
        self.virtual_nodes = virtual_nodes
        if nodes:
            for node in nodes:
                self.add_node(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        """Add a node with virtual nodes for better distribution."""
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:vn{i}"
            hash_val = self._hash(virtual_key)
            self.ring[hash_val] = node
            self.sorted_keys.append(hash_val)
        self.sorted_keys.sort()

    def remove_node(self, node: str):
        """Remove a node and all its virtual nodes."""
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:vn{i}"
            hash_val = self._hash(virtual_key)
            del self.ring[hash_val]
            self.sorted_keys.remove(hash_val)

    def get_node(self, key: str) -> str:
        """Find which node a key belongs to."""
        if not self.ring:
            raise Exception("No nodes in ring")
        hash_val = self._hash(key)
        idx = bisect_right(self.sorted_keys, hash_val)
        # Wrap around to the first node if past the end
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

# Usage
ring = ConsistentHashRing(nodes=["shard-1", "shard-2", "shard-3"])
shard = ring.get_node("user:12345")  # Deterministically routes to a shard

# Adding a new shard only moves ~1/N of the keys
ring.add_node("shard-4")
```

Virtual nodes are critical. Without them, nodes may be unevenly spaced on the ring, leading to skewed data distribution. Each physical node gets multiple positions (virtual nodes) on the ring, smoothing out the distribution.
Shard Key Selection
The shard key is the single most important decision in a sharding strategy. A bad shard key can make the entire system unusable.
Good shard key properties:
- High cardinality -- many distinct values to distribute across shards evenly
- Even distribution -- no single value dominates (avoid country_code if 60% of users are in one country)
- Query alignment -- most queries include the shard key, avoiding cross-shard operations
- Immutability -- changing a shard key value means moving the row between shards
Common shard keys by domain:
- Multi-tenant SaaS: tenant_id (all tenant data on one shard, no cross-shard queries for tenant operations)
- Social media: user_id (user's posts, followers, and activity colocated)
- E-commerce: order_id or customer_id depending on query patterns
- Time-series: device_id + time range (not timestamp alone, which creates write hotspots)
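To illustrate the time-series case, here is a hypothetical routing function that combines device_id with a coarse daily bucket; the bucket granularity, hash choice, and shard count are assumptions for the sketch, not a recommendation:

```python
import hashlib
from datetime import datetime, timezone

NUM_SHARDS = 16  # illustrative

def timeseries_shard(device_id: str, ts: datetime) -> int:
    """Route by device_id plus a daily time bucket.

    Hashing on the device spreads concurrent writes across shards,
    while the coarse bucket keeps one day of a device's data together
    so old buckets can be aged out shard by shard.
    """
    bucket = ts.strftime("%Y-%m-%d")  # coarse bucket, not the raw timestamp
    key = f"{device_id}:{bucket}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_SHARDS
```

Two readings from the same device on the same day land on the same shard; a thousand devices writing at the same instant do not all land on one.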
Cross-Shard Queries and Joins
Once data is sharded, queries that span multiple shards become expensive. A query like "find all orders over $100 across all customers" must be sent to every shard, and results must be aggregated at the application layer.
```python
import asyncio

async def cross_shard_query(query: str, shards: list) -> list:
    """Execute a query across all shards and merge results."""
    # execute_on_shard is assumed to be defined elsewhere in the codebase
    tasks = []
    for shard in shards:
        tasks.append(execute_on_shard(shard, query))
    # Fan out to all shards in parallel
    results = await asyncio.gather(*tasks)
    # Merge and sort results at the application layer
    merged = []
    for result_set in results:
        merged.extend(result_set)
    return sorted(merged, key=lambda r: r['created_at'], reverse=True)
```

Strategies for minimizing cross-shard operations:
- Denormalize: Store frequently joined data together on the same shard
- Global tables: Replicate small, rarely-changing reference tables to every shard
- Application-level joins: Fetch from multiple shards and join in application code
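An application-level join can be sketched as a two-step gather-and-merge; fetch_users_by_ids is a hypothetical stand-in for a routing layer that queries each relevant shard once for its subset of ids:

```python
def application_join(orders, fetch_users_by_ids):
    """Join orders with their users in application code.

    Step 1: collect the foreign keys from the first result set.
    Step 2: batch-fetch the referenced rows (one query per shard).
    Step 3: merge the two sets in memory.
    """
    user_ids = {o["user_id"] for o in orders}
    users = fetch_users_by_ids(user_ids)  # {user_id: user_row}
    return [{**o, "user": users.get(o["user_id"])} for o in orders]

# Demo with an in-memory fetcher standing in for per-shard queries
demo_users = {10: {"name": "Ada"}, 20: {"name": "Lin"}}
joined = application_join(
    [{"id": 1, "user_id": 10}, {"id": 2, "user_id": 20}],
    lambda ids: {uid: demo_users[uid] for uid in ids},
)
```

The batching matters: fetching users one by one would turn a single join into N network round trips.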
Global Secondary Indexes
When you shard by user_id but need to look up users by email, you have two options:
Local secondary index: Each shard indexes only its own data. Looking up by email requires querying all shards (scatter-gather). Simple to maintain, expensive to query.
Global secondary index: A separate service maps email to the shard containing that user. Fast lookups, but the index itself becomes a component that must be kept in sync with all shards.
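A minimal sketch of the global-index idea, with an in-memory dict standing in for what would be a separately replicated, write-synchronized store (the class and method names here are invented for illustration):

```python
class GlobalEmailIndex:
    """Maps email -> name of the shard holding that user.

    In production this would be a separate durable store updated in the
    same write path as the sharded user table; keeping the two in sync
    is exactly the operational cost the article describes.
    """

    def __init__(self):
        self._email_to_shard = {}

    def register(self, email: str, shard: str):
        self._email_to_shard[email] = shard

    def lookup(self, email: str):
        # One index read replaces a scatter-gather across every shard
        return self._email_to_shard.get(email)

# Usage
idx = GlobalEmailIndex()
idx.register("ada@example.com", "shard-2")
```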
Rebalancing Strategies
When shards become unevenly loaded, you need to rebalance:
- Split and merge: Split a hot shard into two smaller ones, or merge cold shards. Requires application-level routing updates.
- Virtual shards (logical shards): Create many more logical shards than physical nodes. Rebalancing means reassigning logical shards to different physical nodes without moving data within a shard.
- Online resharding: Tools like Vitess perform live resharding by copying data to new shards while serving reads from old shards, then cutting over.
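The virtual-shard approach can be sketched as a fixed key-to-logical-shard hash plus a mutable logical-to-physical table; the shard count and node names below are illustrative:

```python
import hashlib

NUM_LOGICAL_SHARDS = 256  # fixed forever; keys never rehash

def logical_shard(key: str) -> int:
    """Map a key to one of many fixed logical shards."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_LOGICAL_SHARDS

# Assignment of logical shards to physical nodes; rebalancing edits
# this table instead of rehashing every key.
physical_of = {ls: f"node-{ls % 4}" for ls in range(NUM_LOGICAL_SHARDS)}

def route(key: str) -> str:
    return physical_of[logical_shard(key)]

# Draining a hot logical shard onto a new node is a one-entry change:
physical_of[42] = "node-4"
```

Only the data in logical shard 42 has to move; routing for the other 255 logical shards is untouched.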
Practical Implementation
Vitess (MySQL Sharding)
Vitess, originally built at YouTube, adds horizontal sharding to MySQL:
```json
{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "users": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    },
    "orders": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    }
  }
}
```

This VSchema defines the sharding rules. Vitess handles query routing, cross-shard queries, and online resharding transparently. Applications connect to Vitess as if it were a single MySQL instance.
Citus (PostgreSQL Sharding)
Citus extends PostgreSQL with distributed tables:
```sql
-- Create a distributed table sharded by tenant_id
SELECT create_distributed_table('orders', 'tenant_id');

-- Colocate related tables on the same shards
SELECT create_distributed_table('order_items', 'tenant_id',
       colocate_with => 'orders');

-- Queries that include tenant_id are routed to a single shard
SELECT * FROM orders WHERE tenant_id = 42 AND status = 'pending';

-- Cross-shard queries work but are slower
SELECT status, count(*) FROM orders GROUP BY status;
```

The colocate_with option is powerful -- it ensures that related tables sharing the same shard key land on the same physical node, enabling local joins.
Trade-offs and Decision Framework
Shard when:
- A single database cannot handle the write throughput
- Dataset exceeds the storage capacity of a single machine
- You need data locality (geographic sharding)
Do not shard when:
- Read replicas can handle the load (as covered in Part 2)
- Vertical scaling is still viable
- Your data model requires many cross-entity joins
- Caching can reduce database load sufficiently
Sharding is a one-way door. Once you shard, you cannot easily unshard. Exhaust all other scaling options first -- read replicas, caching, query optimization, vertical scaling. Shard only when you must.
| Strategy | Distribution | Range Queries | Hotspot Risk | Rebalancing |
|---|---|---|---|---|
| Range-based | Uneven | Efficient | High | Easy (split ranges) |
| Hash-based | Even | Scatter-gather | Low | Hard (rehash all) |
| Consistent hashing | Even | Scatter-gather | Low | Easy (move ~1/N) |
| Directory-based | Configurable | Depends | Configurable | Manual |
Common Interview Questions
Q: You have a users table with 2 billion rows. How do you shard it?
A: Shard by user_id using consistent hashing. Use virtual nodes for even distribution. Colocate user-related tables (posts, settings) on the same shard. Build a global secondary index for email-based lookups. Start with 256 logical shards mapped to a smaller number of physical nodes for easier future rebalancing.
Q: How do you handle a cross-shard JOIN between orders and products?
A: Replicate the products table (small, read-heavy, rarely changes) to every shard as a reference table. This converts the cross-shard join into a local join. For truly large cross-shard queries, use a separate analytics pipeline that aggregates data outside the transactional path.
Q: What happens if one shard receives disproportionately more traffic?
A: This is a hotspot. First, verify the shard key has sufficient cardinality. If one key value is extremely popular (a celebrity user), split that shard or use secondary sub-sharding. Monitor shard-level metrics (CPU, query latency, storage) and set up alerts for imbalance. Use virtual shards that can be redistributed without data movement.
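One common sub-sharding trick for a single hot key is salting; this sketch (function name and salt count are illustrative) appends a random suffix so the hot key's writes spread across several sub-shards, at the cost of fanning reads out across every salt:

```python
import random

NUM_SUBSHARDS = 8  # illustrative fan-out for a single hot key

def salted_shard_key(user_id: int, hot: bool) -> str:
    """Return the physical shard key for a logical user_id.

    Normal keys pass through unchanged. A key flagged as hot (by a
    config entry or traffic detector in practice) gets a random salt,
    splitting its write traffic across NUM_SUBSHARDS sub-keys.
    Reads for a hot key must query all NUM_SUBSHARDS variants.
    """
    if hot:
        return f"{user_id}#{random.randrange(NUM_SUBSHARDS)}"
    return str(user_id)
```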
Q: How do you perform schema migrations across shards?
A: Use online schema change tools (pt-online-schema-change for MySQL, pg_repack for PostgreSQL) that work on each shard independently. Roll out changes shard by shard, not all at once. Ensure the application code handles both old and new schemas during the migration window.
What's Next
With data properly distributed across shards, Part 8: API Design, Rate Limiting, and Authentication covers how to expose your distributed system through well-designed APIs while protecting it from abuse.
FAQ
How do I choose the right shard key for my database?
Pick a shard key with high cardinality that evenly distributes data and aligns with your most common query patterns. Avoid keys that create hotspots, like timestamp-based keys in write-heavy systems.
What happens when I need to add more shards?
Adding shards requires resharding, which moves data between nodes. Use consistent hashing to minimize data movement, or use virtual shards that can be reassigned to new physical nodes without full redistribution.
What is the difference between sharding and partitioning?
Partitioning splits data within a single database instance (vertical or horizontal), while sharding distributes partitions across multiple database instances on different machines for horizontal scalability.