Databases: SQL vs NoSQL and Choosing the Right One
Compare SQL and NoSQL databases for system design. Learn when to use PostgreSQL, MongoDB, Cassandra, or DynamoDB based on your data model and access patterns.
This is Part 4 of the System Design from Zero to Hero series.
TL;DR
Choose SQL databases when you need complex queries, transactions, and strong consistency. Choose NoSQL when you need flexible schemas, horizontal scaling, and high-throughput workloads with simpler access patterns. In practice, most production systems use multiple databases — a relational database for transactional core data, a key-value store for caching, and possibly a document store or search engine for specific use cases. The right answer is almost never "SQL or NoSQL" but "which database for which part of the system."
Why This Matters
In Part 3, we covered how load balancers distribute traffic across application servers. Those servers are stateless — they process requests and delegate data persistence to databases. The database you choose, and how you design your data model, has more impact on your system's performance, scalability, and maintainability than almost any other decision.
Database choice is also one of the hardest decisions to reverse. Migrating from PostgreSQL to MongoDB (or vice versa) after you have accumulated years of data and built your application logic around a specific query model is an enormous undertaking. Getting this decision right early saves months of pain later.
Core Concepts
ACID vs BASE
These acronyms describe two philosophies for how databases handle data integrity.
ACID (Atomicity, Consistency, Isolation, Durability) is the gold standard for relational databases:
- Atomicity: A transaction either completes entirely or not at all. If transferring money from Account A to Account B, both the debit and credit happen, or neither does.
- Consistency: The database moves from one valid state to another. Constraints, foreign keys, and triggers are always enforced.
- Isolation: Concurrent transactions do not interfere with each other. Two users buying the last item in stock will not both succeed.
- Durability: Once a transaction is committed, it survives crashes, power failures, and restarts.
-- ACID transaction example
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE id = 'alice';
UPDATE accounts SET balance = balance + 500 WHERE id = 'bob';
-- If Alice's balance would go negative, the constraint fails
-- and the ENTIRE transaction rolls back
COMMIT;
BASE (Basically Available, Soft-state, Eventually consistent) is the philosophy behind most NoSQL databases:
- Basically Available: The system guarantees availability — it will respond, even if the response is stale.
- Soft-state: The system's state may change over time, even without new input, as data propagates between replicas.
- Eventually consistent: Given enough time without new writes, all replicas converge to the same value.
ACID is essential when incorrect data is unacceptable (financial systems, inventory, bookings). BASE is acceptable when brief inconsistency is tolerable (social media feeds, recommendation engines, analytics).
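The BASE idea can be sketched in a few lines of Python: writes land on one replica first, and a later anti-entropy pass brings the others in line. This is a toy model for intuition only (replica names and the last-write-wins merge are illustrative choices), not how any particular database implements replication.

```python
import time

class Replica:
    """A toy replica: stores key -> (value, write_timestamp)."""
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = (value, time.time())

    def read(self, key):
        entry = self.data.get(key)
        return entry[0] if entry else None

def anti_entropy(replicas):
    """Last-write-wins merge: every replica converges to the newest value."""
    merged = {}
    for r in replicas:
        for key, (value, ts) in r.data.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    for r in replicas:
        r.data = dict(merged)

# A write hits replica_a; replica_b is stale until the sync runs.
replica_a, replica_b = Replica(), Replica()
replica_a.write("profile:42", "new bio")
print(replica_b.read("profile:42"))   # None: stale read, but the system stayed available
anti_entropy([replica_a, replica_b])
print(replica_b.read("profile:42"))   # "new bio": eventually consistent
```

Notice that between the write and the sync, the two replicas disagree, yet both keep answering reads. That window of disagreement is exactly what BASE accepts and ACID forbids.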
Database Categories
Relational Databases (SQL)
Relational databases store data in tables with predefined schemas. Rows represent records, columns represent attributes, and relationships are expressed through foreign keys and joins.
Strengths: Complex queries with JOINs, strong consistency, mature tooling, ACID transactions, well-understood optimization techniques.
Popular choices: PostgreSQL, MySQL, SQL Server, Oracle.
-- Relational schema example
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    name VARCHAR(100) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- products must exist before order_items can reference it
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL,
    price DECIMAL(10, 2) NOT NULL
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    total_amount DECIMAL(10, 2) NOT NULL,
    status VARCHAR(20) DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE order_items (
    id SERIAL PRIMARY KEY,
    order_id INTEGER REFERENCES orders(id),
    product_id INTEGER REFERENCES products(id),
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10, 2) NOT NULL
);
-- Complex query with JOINs
SELECT u.name, COUNT(o.id) as order_count, SUM(o.total_amount) as lifetime_value
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at >= NOW() - INTERVAL '1 year'
GROUP BY u.id, u.name
HAVING SUM(o.total_amount) > 1000
ORDER BY lifetime_value DESC;
Document Databases
Document databases store data as flexible JSON-like documents. Each document can have a different structure, and nested data is stored directly rather than across separate tables.
Strengths: Flexible schemas, natural fit for hierarchical data, fast reads when you fetch entire documents, good horizontal scaling.
Popular choices: MongoDB, CouchDB, Amazon DocumentDB.
// MongoDB document example
{
    "_id": ObjectId("64f1a2b3c4d5e6f7a8b9c0d1"),
    "email": "alice@example.com",
    "name": "Alice Johnson",
    "profile": {
        "avatar": "https://cdn.example.com/avatars/alice.jpg",
        "bio": "Software engineer",
        "preferences": {
            "theme": "dark",
            "notifications": true
        }
    },
    "orders": [
        {
            "order_id": "ORD-001",
            "total": 149.99,
            "items": [
                { "product": "Mechanical Keyboard", "qty": 1, "price": 149.99 }
            ],
            "status": "delivered"
        }
    ],
    "created_at": ISODate("2024-01-15T10:30:00Z")
}
The user, their profile, and their orders are stored together in a single document. Reading this data requires one query instead of three JOINs. But updating a product's name requires finding and updating every document that references that product — denormalized data trades write complexity for read simplicity.
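The trade-off is easy to see with plain Python dictionaries standing in for documents (no MongoDB required): reading a user is a single lookup, but renaming a product means scanning every document that embeds it. The sample data and the rename_product helper are illustrative, not part of any database API.

```python
# Two user documents with embedded orders (denormalized product names)
users = [
    {"name": "Alice", "orders": [{"product": "Mechanical Keyboard", "qty": 1}]},
    {"name": "Bob",   "orders": [{"product": "Mechanical Keyboard", "qty": 2},
                                 {"product": "USB Hub", "qty": 1}]},
]

# Read side: everything about Alice comes back in one lookup, no JOINs
alice = users[0]
print(alice["orders"][0]["product"])   # Mechanical Keyboard

# Write side: renaming a product means visiting every document that embeds it
def rename_product(docs, old, new):
    touched = 0
    for doc in docs:
        for item in doc["orders"]:
            if item["product"] == old:
                item["product"] = new
                touched += 1
    return touched

n = rename_product(users, "Mechanical Keyboard", "Keyboard MK2")
print(n)   # 2 embedded copies rewritten for one logical change
```

In a normalized relational schema the same rename would be a single-row UPDATE on the products table; the embedded model pays that cost at write time to make reads one fetch.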
Key-Value Stores
The simplest data model: a key maps to a value. The database does not understand or index the value's structure. Operations are limited to GET, SET, and DELETE by key.
Strengths: Extremely fast (sub-millisecond), predictable performance, excellent horizontal scaling.
Popular choices: Redis, Memcached, Amazon DynamoDB (also supports document model).
import json
import time
import redis

r = redis.Redis(host='localhost', port=6379)

# Session storage with a one-hour TTL
r.setex('session:abc123', 3600, json.dumps({
    'user_id': 42,
    'role': 'admin',
    'login_time': '2024-01-15T10:30:00Z'
}))

# Rate limiting with atomic increment (limit: 100 requests per minute)
user_id = 42
current_minute = int(time.time() // 60)
key = f"ratelimit:{user_id}:{current_minute}"
count = r.incr(key)
if count == 1:
    r.expire(key, 60)  # first request this minute starts the window
if count > 100:
    raise RateLimitExceeded()  # application-defined exception

# Leaderboard with sorted sets
r.zadd('leaderboard', {'player_a': 2500, 'player_b': 3100, 'player_c': 1800})
top_10 = r.zrevrange('leaderboard', 0, 9, withscores=True)
Column-Family Databases
Data is stored in column families (groups of related columns) rather than rows. Optimized for writes and reads over large datasets where you access specific column groups.
Strengths: Extremely high write throughput, excellent horizontal scaling, time-series and event data.
Popular choices: Apache Cassandra, HBase, ScyllaDB.
-- Cassandra schema for time-series sensor data
CREATE TABLE sensor_readings (
    sensor_id UUID,
    reading_time TIMESTAMP,
    temperature DOUBLE,
    humidity DOUBLE,
    pressure DOUBLE,
    PRIMARY KEY (sensor_id, reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- Efficient query: get latest readings for a sensor
SELECT * FROM sensor_readings
WHERE sensor_id = 550e8400-e29b-41d4-a716-446655440000
  AND reading_time >= '2024-01-01'
LIMIT 100;
Cassandra is designed to never have a single point of failure. Every node in the cluster is identical (no primary/replica distinction), and data is automatically distributed and replicated across nodes. The trade-off is limited query flexibility — you must design your tables around your query patterns.
Graph Databases
Store data as nodes (entities) and edges (relationships). Optimized for queries that traverse relationships, which would require expensive recursive JOINs in a relational database.
Strengths: Relationship-heavy queries (social networks, recommendation engines, fraud detection), path finding, network analysis.
Popular choices: Neo4j, Amazon Neptune, ArangoDB.
// Neo4j Cypher query: find friends of friends who like the same genre
MATCH (me:User {name: 'Alice'})-[:FRIENDS_WITH]->(friend)-[:FRIENDS_WITH]->(fof)
WHERE NOT (me)-[:FRIENDS_WITH]->(fof) AND me <> fof
MATCH (fof)-[:LIKES]->(genre:Genre)<-[:LIKES]-(me)
RETURN DISTINCT fof.name, COLLECT(genre.name) AS shared_genres
ORDER BY SIZE(shared_genres) DESC
LIMIT 10;
PostgreSQL: The Swiss Army Knife
PostgreSQL deserves special mention because it has evolved beyond a traditional relational database. It supports:
- JSONB columns: Store and query JSON documents with indexing, giving you document database capabilities within SQL.
- Full-text search: Built-in text search with ranking, stemming, and multiple languages.
- Geospatial queries: PostGIS extension for location-based queries.
- Array and range types: Native support for complex data types.
- CTEs and window functions: Advanced analytical queries.
-- PostgreSQL JSONB: relational + document hybrid
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    metadata JSONB  -- flexible attributes per product category
);

-- Index on JSONB field for fast lookups
CREATE INDEX idx_products_metadata ON products USING GIN (metadata);

-- Query JSONB data
SELECT name, price, metadata->>'color' AS color
FROM products
WHERE metadata @> '{"category": "electronics"}'
  AND (metadata->>'weight')::numeric < 2.0;
For many applications, PostgreSQL with JSONB columns eliminates the need for a separate document database entirely.
Indexing Strategies
Indexes are the single most impactful performance optimization for databases. Without an index, the database must scan every row (full table scan) to find matching records.
B-tree indexes (the default in most databases): Balanced tree structure, excellent for equality and range queries. This is what you get with CREATE INDEX.
B-tree for column "age":

              [30]
            /      \
      [15,22]      [45,60]
     /   |   \    /   |   \
  [10] [18] [25] [35] [50] [70]

Query: WHERE age BETWEEN 20 AND 40
Start at root, follow left branch to 22,
scan right through 25, 30, 35. Very fast.
Hash indexes: Map keys to buckets using a hash function. Extremely fast for equality lookups (WHERE id = 42) but useless for range queries (WHERE id > 42).
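The same distinction exists in plain Python: a dict is effectively a hash index (constant-time equality lookups, no ordering), while a sorted list plus bisect behaves like the leaf level of a B-tree (equality and ranges). A sketch, with made-up sample data:

```python
import bisect

ages = [10, 15, 18, 22, 25, 30, 35, 45, 50, 60, 70]    # sorted, like B-tree leaves
by_age = {age: f"user_{age}" for age in ages}           # hash "index"

# Equality lookup: both structures work; the hash is O(1)
print(by_age[30])                       # user_30

# Range query WHERE age BETWEEN 20 AND 40: only the sorted structure helps
lo = bisect.bisect_left(ages, 20)
hi = bisect.bisect_right(ages, 40)
print(ages[lo:hi])                      # [22, 25, 30, 35]
# With only the hash index you would have to examine every key: a full scan.
```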
Composite indexes: Index on multiple columns. Column order matters significantly.
-- Composite index
CREATE INDEX idx_orders_user_date ON orders (user_id, created_at DESC);
-- This query uses the index efficiently (matches prefix)
SELECT * FROM orders WHERE user_id = 42 ORDER BY created_at DESC;
-- This query CANNOT use the index (skips leading column)
SELECT * FROM orders WHERE created_at > '2024-01-01';
-- For this query, you need a separate index on created_at
CREATE INDEX idx_orders_date ON orders (created_at DESC);
The rule of thumb for composite indexes: put the most selective column first (the one that eliminates the most rows), and put columns used in WHERE clauses before columns used only in ORDER BY.
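You can watch the prefix rule in action with SQLite's EXPLAIN QUERY PLAN. SQLite is used here purely as a dependency-free stand-in; PostgreSQL's EXPLAIN shows the same behavior, and the exact plan text varies by version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "total_amount REAL, created_at TEXT)"
)
conn.execute("CREATE INDEX idx_orders_user_date ON orders (user_id, created_at DESC)")

def plan(sql):
    """Return the query plan as a single string."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Matches the index prefix (user_id), so the index is used
p1 = plan("SELECT * FROM orders WHERE user_id = 42 ORDER BY created_at DESC")
print(p1)  # e.g. SEARCH orders USING INDEX idx_orders_user_date (user_id=?)

# Skips the leading column, so the index cannot help: full table scan
p2 = plan("SELECT * FROM orders WHERE created_at > '2024-01-01'")
print(p2)  # e.g. SCAN orders
```

Running EXPLAIN on real queries against realistic data volumes is the only reliable way to confirm an index is actually being used.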
Normalization vs Denormalization
Normalization eliminates data duplication by splitting data across tables with relationships. A product's name is stored once in the products table. Orders reference the product by ID.
Denormalization intentionally duplicates data to avoid expensive JOINs at read time. The product name is stored directly in the order_items table alongside the product ID.
Normalized:                              Denormalized:
┌─────────────┐    ┌──────────────┐      ┌──────────────────────────┐
│ orders      │    │ products     │      │ order_items              │
│─────────────│    │──────────────│      │──────────────────────────│
│ id          │    │ id           │      │ id                       │
│ user_id     │    │ name         │      │ order_id                 │
│ product_id ─┼───▶│ price        │      │ product_id               │
│             │    │              │      │ product_name (duplicate!)│
└─────────────┘    └──────────────┘      │ product_price (duplicate)│
                                         └──────────────────────────┘
Requires a JOIN to get the product       No JOIN needed. But if the product
name; clean, no duplication.             name changes, must update here too.
Normalize when: Write consistency matters. Data changes frequently. Storage efficiency is important. Your queries are flexible and unpredictable.
Denormalize when: Read performance is critical. Data is read far more often than written. Access patterns are well-known and stable. You can tolerate temporary inconsistency during updates.
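The same trade-off in SQL terms, using Python's built-in sqlite3 as a stand-in (the table names mirror the diagram; the sample rows are made up): the denormalized table answers the read with no JOIN, but a product rename fans out across rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE order_items (
    id INTEGER PRIMARY KEY,
    product_id INTEGER,
    product_name TEXT,      -- denormalized copy of products.name
    quantity INTEGER
);
INSERT INTO products VALUES (1, 'Mechanical Keyboard', 149.99);
INSERT INTO order_items VALUES (1, 1, 'Mechanical Keyboard', 1),
                               (2, 1, 'Mechanical Keyboard', 2);
""")

# Read path: no JOIN needed thanks to the duplicated name
row = conn.execute(
    "SELECT product_name, quantity FROM order_items WHERE id = 1"
).fetchone()
print(row)   # ('Mechanical Keyboard', 1)

# Write path: a rename must touch BOTH tables to stay consistent
conn.execute("UPDATE products SET name = 'Keyboard MK2' WHERE id = 1")
cur = conn.execute(
    "UPDATE order_items SET product_name = 'Keyboard MK2' WHERE product_id = 1"
)
print(cur.rowcount)   # 2 rows rewritten for one logical change
```

In production, that second UPDATE is often done asynchronously (a background job or change-data-capture pipeline), which is exactly the "temporary inconsistency during updates" the list above asks you to tolerate.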
Practical Implementation
Here is a practical example of choosing the right database for different parts of an e-commerce system:
┌──────────────────────────────────────────────────────────────┐
│ E-Commerce Platform │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │
│ │ PostgreSQL │ │ Redis │ │ Elasticsearch │ │
│ │ │ │ │ │ │ │
│ │ - Users │ │ - Sessions │ │ - Product search │ │
│ │ - Orders │ │ - Cart data │ │ - Full-text queries │ │
│ │ - Inventory │ │ - Rate limits│ │ - Faceted filtering │ │
│ │ - Payments │ │ - Leaderboard│ │ - Autocomplete │ │
│ │ │ │ - Caching │ │ │ │
│ │ (ACID for │ │ (Speed for │ │ (Search for │ │
│ │ transactions│ │ ephemeral │ │ discovery) │ │
│ └─────────────┘ │ data) │ └─────────────────────┘ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────┘
# Database router for an e-commerce application
class DatabaseConfig:
    """Each database handles what it does best"""

    # PostgreSQL: source of truth for transactional data
    POSTGRES = {
        'host': 'pg-primary.internal',
        'port': 5432,
        'database': 'ecommerce',
        'tables': ['users', 'orders', 'order_items', 'inventory', 'payments']
    }

    # Redis: fast ephemeral data and caching
    REDIS = {
        'host': 'redis-cluster.internal',
        'port': 6379,
        'use_cases': ['sessions', 'cart', 'rate_limiting', 'cache']
    }

    # Elasticsearch: search and discovery
    ELASTICSEARCH = {
        'hosts': ['es-node1.internal:9200', 'es-node2.internal:9200'],
        'indices': ['products', 'search_suggestions']
    }

# Example: checking inventory (MUST be strongly consistent)
async def check_and_reserve_inventory(product_id, quantity):
    """Use PostgreSQL with row-level locking for inventory"""
    async with db.transaction():
        row = await db.fetchrow(
            "SELECT quantity FROM inventory WHERE product_id = $1 FOR UPDATE",
            product_id
        )
        if row['quantity'] < quantity:
            raise InsufficientInventoryError()
        await db.execute(
            "UPDATE inventory SET quantity = quantity - $1 WHERE product_id = $2",
            quantity, product_id
        )

# Example: product search (eventual consistency is fine)
async def search_products(query, filters):
    """Use Elasticsearch for flexible full-text search"""
    body = {
        "query": {
            "bool": {
                "must": [{"multi_match": {"query": query, "fields": ["name^3", "description"]}}],
                "filter": [{"range": {"price": {"gte": filters.get('min_price', 0)}}}]
            }
        }
    }
    return await es.search(index="products", body=body)
Trade-offs and Decision Framework
| Use Case | Recommended Database | Reason |
|---|---|---|
| User accounts, orders, payments | PostgreSQL / MySQL | ACID transactions, complex queries |
| Product catalog with varying attributes | MongoDB or PostgreSQL (JSONB) | Flexible schema per category |
| Session storage, caching | Redis | Sub-millisecond reads, TTL support |
| Time-series data (logs, metrics) | Cassandra / TimescaleDB | High write throughput, time-based queries |
| Social graph, recommendations | Neo4j | Relationship traversal |
| Full-text search | Elasticsearch | Inverted index, relevance scoring |
| File/object storage | S3 / MinIO | Not a database problem |
Decision process:
- Start with PostgreSQL. It handles most use cases well.
- Add Redis for caching and session storage when latency matters.
- Add specialized databases only when PostgreSQL cannot efficiently serve a specific access pattern.
- Avoid premature polyglot persistence — each database you add is another system to operate, monitor, back up, and keep consistent.
Common Interview Questions
Q: When would you choose MongoDB over PostgreSQL? MongoDB excels when your data is naturally hierarchical (nested documents), your schema varies significantly between records (product attributes differ by category), and you need horizontal scaling with automatic sharding. PostgreSQL is better when you need transactions across multiple entities, complex analytical queries with JOINs, or when your data has strong relational structure. Note that PostgreSQL's JSONB support blurs this line — many use cases that once required MongoDB can now be handled by PostgreSQL.
Q: How would you handle a database that is running out of capacity? Follow this order: (1) Optimize queries and add missing indexes — this is almost always the first bottleneck. (2) Add connection pooling (PgBouncer) if connections are the limit, as discussed in Part 2. (3) Add read replicas for read-heavy workloads. (4) Vertically scale the database (more CPU/RAM). (5) Add application-level caching (covered in Part 5). (6) Shard the database as a last resort. Most applications never need to reach step 6.
Q: What is the difference between a B-tree and a hash index? A B-tree index is a sorted tree structure that supports both equality (=) and range queries (>, <, BETWEEN, ORDER BY). A hash index maps keys to buckets and only supports equality lookups. B-tree is the default and correct choice for most columns. Use hash indexes only for exact-match lookups on columns that are never used in range queries or sorting — which is rare enough that most engineers never create one explicitly.
Q: Explain the CAP theorem with a practical example. The CAP theorem states that during a network partition, a distributed database must choose between Consistency (all nodes return the same data) and Availability (all nodes continue to accept requests). Consider a shopping cart stored in a distributed database across two data centers. If the network link between them breaks: a CP system refuses writes to prevent inconsistency (the cart becomes temporarily unavailable), while an AP system accepts writes on both sides and merges them later (the cart stays available but might temporarily show different items depending on which data center serves the request).
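A toy version of the AP side of that choice, merging the diverged carts after the partition heals. Set union is one deliberately simple merge policy chosen for illustration; real Dynamo-style stores resolve divergence with vector clocks or CRDTs:

```python
# During the partition, each data center accepts writes to its own copy (AP)
cart_dc1 = {"keyboard", "mouse"}
cart_dc2 = {"keyboard", "monitor"}      # a write that landed on the other side

# Both sides stayed available, but they now disagree
print(cart_dc1 == cart_dc2)             # False

# When the partition heals, merge with a deterministic policy: set union
merged = cart_dc1 | cart_dc2
cart_dc1 = cart_dc2 = merged
print(sorted(merged))                   # ['keyboard', 'monitor', 'mouse']
```

Union merge means a deleted item can "resurrect" if the delete happened on only one side, which is why production systems track causality (vector clocks) or use data types designed to merge (CRDTs) rather than raw sets.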
What's Next
Databases store your data, but reading directly from a database for every request is often too slow or too expensive at scale. Continue to Part 5: Caching Strategies where we explore how caching at multiple layers — browser, CDN, application, and database — can reduce latency by orders of magnitude.
FAQ
Can I use both SQL and NoSQL in the same system?
Yes, polyglot persistence is common in modern systems. For example, you might use PostgreSQL for transactional data, Redis for caching, and Elasticsearch for full-text search within the same application. The key is to use each database for what it does best. PostgreSQL handles your orders and payments where ACID guarantees matter. Redis handles sessions and rate limiting where sub-millisecond speed matters. Elasticsearch handles product search where full-text relevance scoring matters. The trade-off is operational complexity: each database is another system to maintain, monitor, back up, and keep in sync.
Is NoSQL always faster than SQL?
Not necessarily. NoSQL databases optimize for specific access patterns and can be faster for those use cases, but SQL databases with proper indexing and query optimization can match or exceed NoSQL performance. A PostgreSQL query hitting a well-designed index returns results in under a millisecond. A MongoDB query scanning a collection without an index can take seconds. The performance difference comes from data modeling and access pattern alignment, not from the database category itself. The real question is whether your access patterns match the database's strengths.
How do I decide between document, key-value, and columnar NoSQL databases?
Use key-value stores like Redis for caching and sessions — they are the fastest option for simple lookups by a known key. Use document databases like MongoDB for flexible nested data where each record may have different fields and you typically read or write entire documents. Use columnar databases like Cassandra for time-series and high-write-throughput workloads where you write massive amounts of data and read it back in time-ordered ranges. The deciding factor is your primary access pattern: lookup by key, read/write flexible documents, or scan large datasets by partition and time range.