May 02, 2025
Last updated: May 02, 2025

Scaling Strategies: Horizontal vs Vertical Scaling

Compare horizontal and vertical scaling strategies for distributed systems. Understand when to scale up vs scale out and the trade-offs of each approach.

Tags

System Design, Scaling, Horizontal Scaling, Infrastructure


This is Part 2 of the System Design from Zero to Hero series.

TL;DR

Horizontal scaling adds more machines to handle load while vertical scaling upgrades existing ones. Vertical scaling is simpler but hits hardware ceilings fast. Horizontal scaling unlocks near-infinite capacity but forces you to rethink how your application manages state. Most production systems use a combination of both: vertically scale your database as long as possible, horizontally scale your stateless application servers from the start.

Why This Matters

In Part 1, we covered the fundamental trade-offs of system design. Now we address the most immediate practical question: what do you do when your single server cannot handle the load?

Every successful application eventually hits a scaling wall. Your response times start creeping up, your CPU sits at 90% during peak hours, or your database connections are maxed out. At that point, you have exactly two options: make your existing machine bigger (vertical) or add more machines (horizontal). Choosing the wrong strategy, or choosing the right one at the wrong time, can cost months of engineering effort and significant infrastructure spend.

Core Concepts

Vertical Scaling (Scaling Up)

Vertical scaling means upgrading the hardware of your existing server: more CPU cores, more RAM, faster SSDs, better network cards. It is the simplest scaling strategy because your application code does not change at all.

Before:                          After:
┌──────────────┐                 ┌──────────────────────┐
│  4 CPU cores │                 │  32 CPU cores        │
│  16 GB RAM   │        ──▶     │  256 GB RAM          │
│  500 GB SSD  │                 │  2 TB NVMe SSD       │
│              │                 │                      │
│  Your App    │                 │  Your App (unchanged)│
└──────────────┘                 └──────────────────────┘

Advantages: Zero code changes. No distributed systems complexity. Simpler monitoring, debugging, and deployment. Transactions remain straightforward because everything runs on one machine.

Limitations: Hardware has ceilings. As of today, even the largest cloud instances top out at a few hundred CPU cores and a few terabytes of RAM (specialized high-memory instances go further, at steep cost). Costs also grow non-linearly — a machine with twice the CPU often costs more than twice as much. And critically, a single machine is a single point of failure. If it goes down, everything goes down.
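The non-linear cost curve is easy to see with a back-of-the-envelope comparison. The hourly prices below are made up for illustration only, not real cloud pricing:

```python
def cost_per_core(hourly_price: float, cores: int) -> float:
    """Normalize an instance price to dollars per core-hour."""
    return hourly_price / cores

# (cores, $/hour) -- hypothetical numbers for illustration only
tiers = [(4, 0.20), (16, 0.90), (64, 4.50)]

for cores, price in tiers:
    print(f"{cores:>2} cores: ${price:.2f}/hr -> ${cost_per_core(price, cores):.4f} per core-hour")

# In this made-up example the 64-core box costs roughly 40% more per
# core than the 4-core box: you pay a premium just for density.
```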

Vertical scaling is the right first move for databases. PostgreSQL on a beefy machine with 64 cores and 512 GB RAM can handle remarkable workloads. Many companies serve millions of users on a single well-tuned database instance far longer than you would expect.

Horizontal Scaling (Scaling Out)

Horizontal scaling means adding more machines and distributing the workload across them. Instead of one powerful server, you run ten (or a hundred, or a thousand) smaller servers behind a load balancer.

Before:                          After:
┌──────────────┐                 ┌──────────────┐
│              │                 │  Server 1    │
│  1 Server    │                 ├──────────────┤
│  (overloaded)│        ──▶     │  Server 2    │
│              │                 ├──────────────┤
└──────────────┘                 │  Server 3    │
                                 ├──────────────┤
                                 │  Server N    │
                                 └──────────────┘
                                      ▲
                                 Load Balancer

Advantages: No hard ceiling on capacity. Adding servers is fast (especially in the cloud). Built-in redundancy — if one server dies, the others continue serving traffic. Cost-efficient because you can use commodity hardware.

Disadvantages: Your application must be designed for distribution. Data consistency becomes harder. You need load balancing, service discovery, and health monitoring infrastructure. Debugging distributed failures is significantly more complex than debugging a single server.

Stateless vs Stateful Services

The most important architectural decision for horizontal scaling is whether your services are stateless or stateful.

A stateless service stores no client-specific data between requests. Every request contains all the information needed to process it. Any server in the pool can handle any request. This is what makes horizontal scaling straightforward.

A stateful service maintains client-specific data in memory (such as session state, WebSocket connections, or in-memory caches). If a user's session lives on Server 2, then that user's subsequent requests must go to Server 2. This constrains your load balancing and complicates failover.

python
# Stateful: session stored in server memory (BAD for horizontal scaling)
from flask import Flask, request, session, jsonify

app = Flask(__name__)
app.secret_key = 'secret'  # demo only; load real secrets from config

@app.route('/cart/add', methods=['POST'])
def add_to_cart():
    if 'cart' not in session:
        session['cart'] = []
    session['cart'].append(request.json['item_id'])
    session.modified = True  # Flask does not detect in-place list mutation
    return jsonify({'cart_size': len(session['cart'])})


# Stateless: session stored externally (GOOD for horizontal scaling)
import redis

app = Flask(__name__)
redis_client = redis.Redis(host='redis-cluster', port=6379)

@app.route('/cart/add', methods=['POST'])
def add_to_cart():
    user_id = get_user_from_token(request)  # auth helper, defined elsewhere
    cart_key = f"cart:{user_id}"
    redis_client.rpush(cart_key, request.json['item_id'])
    cart_size = redis_client.llen(cart_key)
    return jsonify({'cart_size': cart_size})

The second approach lets any server handle any request because the state lives in Redis, not in the server's memory. When you need more capacity, you add servers without worrying about session affinity.

Session Affinity (Sticky Sessions)

When you cannot make a service fully stateless, session affinity (also called sticky sessions) ensures that a client's requests consistently route to the same server. The load balancer typically uses a cookie or the client's IP address to maintain this mapping.

Client A ──▶ Load Balancer ──▶ Server 1 (always)
Client B ──▶ Load Balancer ──▶ Server 3 (always)
Client C ──▶ Load Balancer ──▶ Server 2 (always)

Session affinity is a pragmatic compromise, but it has real costs. If Server 1 goes down, Client A's session is lost. Traffic distribution becomes uneven because some servers accumulate more long-lived sessions than others. And you cannot freely scale down without disrupting users. Use it as a stepping stone toward fully stateless design, not as a permanent architecture.
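To make the affinity mechanism concrete, here is a minimal Python sketch of IP-hash routing, the same idea behind nginx's `ip_hash` directive. The server names are illustrative:

```python
# Minimal IP-hash affinity: hash the client IP and take it modulo the
# pool size, so the same client deterministically lands on the same server.
import hashlib

def pick_server(client_ip: str, servers: list) -> str:
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

servers = ["server-1", "server-2", "server-3"]

# Same client, same server -- as long as the pool is unchanged:
assert pick_server("203.0.113.7", servers) == pick_server("203.0.113.7", servers)

# Resizing the pool remaps most clients, which is exactly why scaling
# down disrupts users under sticky sessions:
resized = servers + ["server-4"]
print(pick_server("203.0.113.7", servers), "->", pick_server("203.0.113.7", resized))
```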

Connection Pooling

When scaling horizontally, database connections become a bottleneck fast. Each application server maintains its own connections to the database, and databases have hard limits on concurrent connections (PostgreSQL's max_connections defaults to 100).

With 10 application servers each holding 20 connections, you consume 200 database connections. Scale to 50 servers and you hit 1,000, which will overwhelm most databases.

Connection poolers like PgBouncer sit between your application servers and the database, multiplexing many application connections over fewer database connections:

Without pooler:                  With PgBouncer:
App Server 1 ──20 conn──┐       App Server 1 ──20 conn──┐
App Server 2 ──20 conn──┤       App Server 2 ──20 conn──┤
App Server 3 ──20 conn──┼──▶DB  App Server 3 ──20 conn──┼──▶PgBouncer──20 conn──▶DB
App Server 4 ──20 conn──┤       App Server 4 ──20 conn──┤
App Server 5 ──20 conn──┘       App Server 5 ──20 conn──┘
Total: 100 DB connections        Total: 20 DB connections

ini
# pgbouncer.ini configuration
[databases]
myapp = host=db-primary.internal port=5432 dbname=myapp
 
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
pool_mode = transaction        # release connection after each transaction
default_pool_size = 20         # max connections to database per pool
max_client_conn = 1000         # max client connections to pgbouncer
reserve_pool_size = 5          # extra connections for burst traffic

The transaction pool mode is key: it returns connections to the pool after each transaction completes, rather than holding them for the entire client session. This dramatically improves connection utilization.

Read Replicas

For read-heavy workloads (which is most web applications — reads typically outnumber writes 10:1 or more), read replicas let you scale reads horizontally while keeping writes on a single primary:

                    Writes
                      │
                      ▼
              ┌───────────────┐
              │   Primary DB  │
              │   (read/write)│
              └───┬───┬───┬───┘
        Replication│   │   │
          ┌────────┘   │   └────────┐
          ▼            ▼            ▼
   ┌────────────┐ ┌────────────┐ ┌────────────┐
   │ Replica 1  │ │ Replica 2  │ │ Replica 3  │
   │ (read-only)│ │ (read-only)│ │ (read-only)│
   └────────────┘ └────────────┘ └────────────┘
          ▲            ▲            ▲
          └────────────┼────────────┘
                  Read Queries

python
# Routing reads to replicas and writes to primary
import random

# NOTE: create_connection is a placeholder for your driver's connect
# call (e.g. psycopg2.connect); substitute your own.

class DatabaseRouter:
    def __init__(self, primary_dsn, replica_dsns):
        self.primary = create_connection(primary_dsn)
        self.replicas = [create_connection(dsn) for dsn in replica_dsns]

    def get_connection(self, read_only=False):
        if read_only and self.replicas:
            return random.choice(self.replicas)
        return self.primary
 
# Usage
router = DatabaseRouter(
    primary_dsn="postgresql://primary:5432/myapp",
    replica_dsns=[
        "postgresql://replica1:5432/myapp",
        "postgresql://replica2:5432/myapp",
    ]
)
 
# Write operations go to primary
conn = router.get_connection(read_only=False)
conn.execute("INSERT INTO orders ...")
 
# Read operations go to a random replica
conn = router.get_connection(read_only=True)
results = conn.execute("SELECT * FROM products WHERE ...")

Be aware of replication lag: there is a delay (typically milliseconds, but potentially seconds under load) between a write on the primary and when that write appears on replicas. If a user creates an order and immediately views their order list, the read might go to a replica that does not yet have the new order. The common fix is "read-your-writes" consistency: route reads to the primary for a few seconds after a user performs a write.
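A sketch of how read-your-writes routing might look, extending the router idea above. The class and parameter names are illustrative, not from a specific library:

```python
import random
import time

class ReadYourWritesRouter:
    """Pin a user's reads to the primary for a short window after a write,
    so the user never observes a replica that lags behind their own write."""

    def __init__(self, primary, replicas, pin_seconds=5.0):
        self.primary = primary
        self.replicas = replicas
        self.pin_seconds = pin_seconds
        self._last_write = {}  # user_id -> monotonic timestamp of last write

    def record_write(self, user_id):
        self._last_write[user_id] = time.monotonic()

    def connection_for_read(self, user_id):
        wrote_at = self._last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.pin_seconds:
            return self.primary              # recent writer: read from primary
        return random.choice(self.replicas)  # everyone else: any replica
```

In production the write timestamps would live in a shared store (such as Redis) so every application server sees them, rather than in per-process memory as shown here.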

Auto-Scaling

Cloud platforms let you automatically adjust the number of servers based on metrics like CPU utilization, memory usage, or request count:

yaml
# AWS Auto Scaling Group configuration (simplified)
AutoScalingGroup:
  MinSize: 2              # minimum instances (for availability)
  MaxSize: 20             # cost ceiling
  DesiredCapacity: 4      # starting point
 
  ScalingPolicies:
    ScaleOut:
      MetricName: CPUUtilization
      Threshold: 70        # add servers when CPU > 70%
      ScalingAdjustment: 2 # add 2 instances at a time
      Cooldown: 300        # wait 5 min before scaling again
 
    ScaleIn:
      MetricName: CPUUtilization
      Threshold: 30        # remove servers when CPU < 30%
      ScalingAdjustment: -1 # remove 1 instance at a time
      Cooldown: 600         # wait 10 min before scaling down

Two important tuning decisions:

  1. Scale out aggressively, scale in conservatively. Adding servers is low-risk; removing them while traffic may spike again is dangerous. Set a longer cooldown for scale-in.
  2. Use target tracking over step scaling when possible. Rather than defining thresholds and step adjustments, target tracking automatically adjusts to maintain a target metric (e.g., "keep average CPU at 60%").
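For comparison, a target-tracking policy replaces thresholds and step sizes with a single target. This sketch is simplified in the same way as the step-scaling example above, and exact key names vary by tool and CloudFormation version:

```yaml
# Simplified target-tracking policy: the platform adds or removes
# instances automatically to hold average CPU near the target.
ScalingPolicy:
  PolicyType: TargetTrackingScaling
  TargetTrackingConfiguration:
    PredefinedMetricSpecification:
      PredefinedMetricType: ASGAverageCPUUtilization
    TargetValue: 60.0        # keep average CPU around 60%
```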

Practical Implementation

Here is a complete example showing a stateless Node.js service designed for horizontal scaling:

javascript
// server.js - Stateless, horizontally scalable service
const express = require('express');
const Redis = require('ioredis');
 
const app = express();
const redis = new Redis(process.env.REDIS_URL);
 
// Health check for load balancer
app.get('/health', async (req, res) => {
  try {
    await redis.ping();
    res.json({ status: 'healthy', instance: process.env.HOSTNAME });
  } catch (err) {
    res.status(503).json({ status: 'unhealthy' });
  }
});
 
// Rate limiting using Redis (shared state across all instances)
async function rateLimit(userId, limit = 100, windowSeconds = 60) {
  const key = `ratelimit:${userId}:${Math.floor(Date.now() / (windowSeconds * 1000))}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, windowSeconds);
  return count <= limit;
}
 
app.use(async (req, res, next) => {
  const userId = req.headers['x-user-id'];
  if (userId && !(await rateLimit(userId))) {
    return res.status(429).json({ error: 'Rate limit exceeded' });
  }
  next();
});
 
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}, PID: ${process.pid}`);
});

Notice that the rate limiter uses Redis, not an in-memory counter. This means the rate limit is enforced correctly regardless of which server handles the request. This is the stateless pattern in action.

Trade-offs and Decision Framework

Factor               Vertical Scaling                                Horizontal Scaling
Complexity           Low — no code changes                           High — requires stateless design
Cost curve           Exponential (big machines cost more per unit)   Linear (commodity machines)
Failure impact       Total outage                                    Partial degradation
Scaling ceiling      Hardware limits                                 Near-infinite
Best for             Databases, legacy apps                          Stateless web/API servers
Time to implement    Minutes (resize instance)                       Days to weeks (redesign for statelessness)

Decision guidelines:

  • Start with vertical scaling. It buys you time with zero engineering effort.
  • Design application servers stateless from day one, even before you need horizontal scaling. It costs almost nothing upfront and saves enormous refactoring later.
  • Horizontally scale application servers first; they are the easiest to distribute.
  • Vertically scale your database as long as possible. Horizontally scaling databases (sharding) is one of the most complex operations in system design.
  • Use connection pooling before you horizontally scale databases.
  • Add read replicas when your read throughput exceeds what a single database can serve.

Common Interview Questions

Q: Your application is slow during peak hours. How do you diagnose whether you need vertical or horizontal scaling?
Start with metrics. If a single server's CPU or memory is consistently maxed out, vertical scaling gives immediate relief. If your servers are individually fine but you are dropping requests because there are not enough servers, you need horizontal scaling. If your database is the bottleneck, check whether it is CPU-bound (vertical scale), connection-bound (add a connection pooler), or read-bound (add replicas).

Q: How do you handle sessions in a horizontally scaled environment?
Move sessions out of server memory into an external store like Redis or a database. This makes servers stateless so any server can handle any request. If you must use sticky sessions as a short-term solution, use cookie-based affinity rather than IP-based, since users behind NAT share IP addresses.

Q: What is the difference between auto-scaling and load balancing?
Auto-scaling adjusts the number of servers based on demand. Load balancing distributes traffic across the existing servers. They work together: the load balancer distributes traffic, and auto-scaling ensures there are enough servers to distribute to. You need both for a robust horizontally scaled system. We cover load balancing in detail in Part 3.

Q: Can databases be horizontally scaled?
Yes, but it is significantly harder than scaling stateless application servers. Strategies include read replicas (for read scaling), sharding (partitioning data across multiple database instances), and using distributed databases like CockroachDB or Cassandra that are designed for horizontal scaling. Each approach has trade-offs in complexity and consistency. We explore databases further in Part 4.
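To illustrate the sharding strategy mentioned above, here is a minimal hash-based routing sketch. The shard DSNs are hypothetical, and real systems typically use consistent hashing or a directory service so that changing the shard count does not remap every key:

```python
import hashlib

NUM_SHARDS = 4
# Hypothetical shard connection strings:
SHARD_DSNS = [f"postgresql://shard{i}.internal:5432/myapp" for i in range(NUM_SHARDS)]

def shard_for(user_id: str) -> str:
    """Deterministically map a user's data to one shard."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SHARD_DSNS[int(digest, 16) % NUM_SHARDS]

# All of a user's rows live on one shard, so single-user queries stay
# single-node; cross-user queries must fan out to every shard.
assert shard_for("user-42") == shard_for("user-42")
```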

What's Next

You now understand how to scale your application by adding machines or upgrading existing ones. But how does traffic actually get distributed across those machines? Continue to Part 3: Load Balancing and Reverse Proxies where we explore the algorithms and infrastructure that route requests to the right server.

FAQ

When should I choose horizontal scaling over vertical scaling?

Choose horizontal scaling when you need fault tolerance, handle unpredictable traffic spikes, or when your workload can be parallelized across multiple nodes easily. If your application is stateless (or can be made stateless), horizontal scaling offers a near-linear cost-to-capacity ratio and eliminates single points of failure. It is also the right choice when you have already maxed out vertical scaling or when the cost of larger machines becomes prohibitive.

What are the drawbacks of horizontal scaling?

Horizontal scaling introduces complexity in data consistency, requires load balancing, makes debugging harder, and demands stateless application design or external session management. You also need to handle distributed logging, coordinated deployments, and network partitions between nodes. The operational overhead of managing a fleet of servers is substantially higher than managing one, even with modern orchestration tools like Kubernetes.

Is vertical scaling cheaper than horizontal scaling?

Vertical scaling is initially cheaper and simpler, but it hits hardware limits quickly and creates a single point of failure. Horizontal scaling has higher upfront complexity but better long-term cost efficiency. For example, two 8-core machines typically cost less than one 16-core machine, and you get redundancy as a bonus. The crossover point depends on your specific workload, but for most web applications, the cost advantage of horizontal scaling becomes clear once you need more than two or three large instances.

Article Author

Sadam Hussain

Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.
