December 10, 2025
Last updated: December 10, 2025

Monitoring, Observability, and Site Reliability

Build observable systems with structured logging, distributed tracing, and metrics dashboards. Learn SRE practices including SLOs, error budgets, and incident response.

Tags

System Design · Observability · Monitoring · SRE · DevOps
9 min read


This is Part 10, and the finale, of the System Design from Zero to Hero series.

TL;DR

Observability combines metrics, logs, and traces to answer not just what broke but why, and SRE practices like SLOs and error budgets turn reliability into a measurable engineering discipline. This final part covers the three pillars of observability, the Prometheus/Grafana stack, distributed tracing with OpenTelemetry, SLOs/SLIs/SLAs, and resilience patterns like circuit breakers, retries, and bulkheads.

Why This Matters

Over the past nine parts of this series, we have designed systems with horizontal scaling, load balancers, optimized databases, caching layers, message queues, sharded data stores, well-designed APIs, and consensus protocols. But none of this matters if you cannot tell when your system is broken, why it is broken, and how close it is to breaking.

Monitoring answers "is the system up?" Observability answers "why is this user seeing 500 errors on this specific endpoint at 3 AM?" The distinction matters because modern distributed systems fail in novel ways that you cannot predict and pre-configure dashboards for. You need the ability to ask arbitrary questions about system behavior in real time.

Site reliability engineering (SRE) wraps this in a disciplined framework: define what "reliable enough" means, measure it, and use the gap between current reliability and the target to make engineering prioritization decisions.

Core Concepts

The Three Pillars of Observability

1. Metrics

Metrics are numerical measurements collected over time. They are cheap to store, fast to query, and ideal for alerting and dashboards.

Key metric types:

  • Counters: Monotonically increasing values (total requests, total errors)
  • Gauges: Values that go up and down (current memory usage, active connections)
  • Histograms: Distribution of values (request latency percentiles)

The RED method for services:

  • Rate -- requests per second
  • Errors -- error rate as a percentage of total requests
  • Duration -- latency distribution (p50, p95, p99)
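The RED signals can be computed directly from raw request data. Here is a minimal sketch (the `red_summary` helper and its nearest-rank percentile are illustrative, not from any library):

```python
def red_summary(latencies_s, error_count, window_s):
    """Compute Rate, Errors, and Duration from raw samples (illustrative)."""
    total = len(latencies_s)
    ordered = sorted(latencies_s)

    def pct(p):  # nearest-rank percentile over the sorted samples
        return ordered[min(total - 1, int(p / 100 * total))]

    return {
        "rate_rps": total / window_s,              # Rate
        "error_pct": 100.0 * error_count / total,  # Errors
        "p50": pct(50), "p95": pct(95), "p99": pct(99),  # Duration
    }

# 100 requests over a 10-second window: mostly fast, a few slow, one outlier
samples = [0.05] * 90 + [0.2] * 9 + [1.5]
print(red_summary(samples, error_count=2, window_s=10))
```

In production you would let a metrics library maintain these as counters and histograms, as the Prometheus example below shows.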

The USE method for infrastructure:

  • Utilization -- percentage of resource capacity used
  • Saturation -- queue depth, work waiting to be processed
  • Errors -- error count
python
# Prometheus metrics instrumentation with the Python client
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
 
# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)
 
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
 
ACTIVE_REQUESTS = Gauge(
    'http_active_requests',
    'Number of active HTTP requests',
    ['endpoint']
)
 
def middleware(request, call_next):
    endpoint = request.url.path
    method = request.method
 
    ACTIVE_REQUESTS.labels(endpoint=endpoint).inc()
    start_time = time.time()
 
    try:
        response = call_next(request)
        status = response.status_code
    except Exception:
        status = 500
        raise
    finally:
        duration = time.time() - start_time
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status_code=status).inc()
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(duration)
        ACTIVE_REQUESTS.labels(endpoint=endpoint).dec()
 
    return response
 
# Expose metrics endpoint for Prometheus to scrape
start_http_server(8000)  # /metrics endpoint on port 8000

2. Logs

Logs are discrete event records. Structured logging (JSON) is essential for searchability at scale.

python
import structlog
import uuid
 
# Configure structured logging
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ]
)
 
logger = structlog.get_logger()
 
def process_order(order_id: str, user_id: str):
    # Bind context that persists across all log calls
    log = logger.bind(
        order_id=order_id,
        user_id=user_id,
        trace_id=str(uuid.uuid4())
    )
 
    log.info("order_processing_started", amount=99.99)
 
    try:
        result = charge_payment(order_id)
        log.info("payment_charged", payment_id=result.payment_id)
    except PaymentError as e:
        log.error("payment_failed",
                  error_type=type(e).__name__,
                  error_message=str(e))
        raise
 
# Output:
# {"event": "order_processing_started", "order_id": "ord-123",
#  "user_id": "usr-456", "trace_id": "abc-...", "amount": 99.99,
#  "level": "info", "timestamp": "2025-12-10T14:23:01Z"}

Log levels matter: DEBUG for development, INFO for normal operations, WARN for concerning but handled situations, ERROR for failures requiring investigation, CRITICAL for system-threatening failures.
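The level hierarchy acts as a filter: setting a logger to WARN suppresses everything below it. A minimal sketch with Python's standard logging module (the logger name "orders" and the messages are illustrative):

```python
import io
import logging

buf = io.StringIO()
handler = logging.StreamHandler(buf)

log = logging.getLogger("orders")
log.addHandler(handler)
log.propagate = False
log.setLevel(logging.WARNING)  # DEBUG and INFO are filtered out

log.info("cache_miss")          # suppressed: routine operational detail
log.warning("retry_scheduled")  # emitted: concerning but handled
log.error("payment_failed")     # emitted: failure requiring investigation

print(buf.getvalue())
```

In development you would set the level to DEBUG; in production, INFO or WARNING keeps log volume manageable.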

3. Traces

Distributed traces follow a request as it flows through multiple services. Each service creates a "span" with timing information, and spans are linked by a shared trace ID.

Trace: abc-123
├── Span: API Gateway (12ms)
│   ├── Span: Auth Service (3ms)
│   └── Span: Order Service (8ms)
│       ├── Span: Database Query (2ms)
│       ├── Span: Cache Lookup (0.5ms)
│       └── Span: Payment Service (4ms)
│           └── Span: Stripe API Call (3ms)

Without traces, debugging "why was this request slow?" in a microservices architecture is guesswork. With traces, you can see exactly which service and which operation caused the latency.

OpenTelemetry

OpenTelemetry (OTel) is the industry standard for instrumentation. It provides a single API for metrics, logs, and traces with exporters for any backend (Prometheus, Jaeger, Datadog, Grafana).

python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
 
# Configure the tracer
provider = TracerProvider()
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
 
tracer = trace.get_tracer(__name__)
 
# Auto-instrument frameworks (instrument() is an instance method)
FastAPIInstrumentor().instrument()
RequestsInstrumentor().instrument()
 
# Manual instrumentation for custom business logic
def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)
 
        with tracer.start_as_current_span("validate_payment"):
            validate(order_id, amount)
 
        with tracer.start_as_current_span("charge_card"):
            result = charge(amount)
            span.set_attribute("payment.status", result.status)
 
        return result

The Prometheus/Grafana Stack

A production observability stack typically includes:

  • Prometheus: Pull-based metrics collection and storage. Scrapes /metrics endpoints at regular intervals. Stores time-series data with a powerful query language (PromQL).
  • Grafana: Visualization and dashboarding. Connects to Prometheus, Loki, Tempo, and dozens of other data sources.
  • Loki: Log aggregation by Grafana Labs. Like Prometheus but for logs -- indexes labels, not full text.
  • Tempo: Distributed tracing backend. Stores traces and integrates with Grafana for visualization.
yaml
# docker-compose.yml for a complete observability stack
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
 
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
 
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
 
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"
      - "4317:4317"  # OTLP gRPC receiver
 
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4318:4318"  # OTLP HTTP receiver
yaml
# prometheus.yml - scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
rule_files:
  - "alerts.yml"
 
scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:8000']
 
  - job_name: 'order-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

SLOs, SLIs, and SLAs

SLI (Service Level Indicator): A quantitative measure of a service's behavior. Examples: request latency p99, error rate, availability percentage.

SLO (Service Level Objective): A target value for an SLI. Example: "99.9% of requests complete in under 200ms." SLOs are internal engineering goals.

SLA (Service Level Agreement): A contractual commitment to customers, typically with financial penalties. SLAs should be less aggressive than SLOs to provide a buffer.

SLI: p99 request latency
SLO: p99 latency < 200ms for 99.9% of requests (internal goal)
SLA: p99 latency < 500ms for 99.5% of requests (customer contract)
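The relationship between an SLI measurement and an SLO target is just a comparison. A toy sketch using an availability SLI (the function name and counts are illustrative):

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    return (total_requests - failed_requests) / total_requests

SLO_TARGET = 0.999  # the internal 99.9% objective from the example above

sli = availability_sli(total_requests=1_000_000, failed_requests=800)
print(f"SLI: {sli:.4%}, meets SLO: {sli >= SLO_TARGET}")
```

The SLA comparison works the same way, just against the looser contractual target.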

Error Budgets

An error budget is the inverse of your SLO. If your SLO is 99.9% availability, your error budget is 0.1% -- roughly 43 minutes of downtime per month.

Monthly error budget calculation:

SLO: 99.9% availability
Total minutes in 30 days: 43,200
Error budget: 43,200 * 0.001 = 43.2 minutes

If you've used 30 minutes this month:
  Remaining budget: 13.2 minutes
  Budget consumed: 69.4%
  Action: Slow down risky deployments
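The arithmetic above can be wrapped in a small helper (function names are illustrative):

```python
def error_budget_minutes(slo: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Total allowed 'bad' minutes over the period implied by the SLO."""
    return period_minutes * (1 - slo)

def budget_status(slo: float, used_minutes: float) -> dict:
    """How much of the period's error budget remains."""
    budget = error_budget_minutes(slo)
    return {
        "budget_minutes": budget,
        "remaining_minutes": budget - used_minutes,
        "consumed_pct": 100 * used_minutes / budget,
    }

print(budget_status(slo=0.999, used_minutes=30))
```

Feeding this from your availability SLI gives a live number that deployment tooling can gate on.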

Error budgets create a powerful feedback loop:

  • Budget remaining: Ship features, take risks, deploy faster
  • Budget nearly exhausted: Freeze feature deployments, focus on reliability
  • Budget exceeded: Mandatory reliability sprint -- no new features until stability is restored

This gives product and engineering teams a shared, objective framework for balancing velocity and reliability.

Circuit Breaker Pattern

When a downstream service is failing, continuing to send requests makes things worse -- you consume resources waiting for timeouts and amplify the failure cascade. The circuit breaker pattern stops this cycle.

python
import time
from enum import Enum
from dataclasses import dataclass

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected."""
    pass

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation, requests flow through
    OPEN = "open"            # Failures detected, requests are rejected immediately
    HALF_OPEN = "half_open"  # Testing if the service has recovered
 
@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    success_threshold: int = 3
 
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    success_count: int = 0
    last_failure_time: float = 0
 
    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                raise CircuitOpenError("Circuit is open, request rejected")
 
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise
 
    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0
 
    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
 
 
# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
 
try:
    result = payment_breaker.call(payment_service.charge, order_id, amount)
except CircuitOpenError:
    # Return a graceful degradation response
    return {"status": "pending", "message": "Payment processing delayed"}

Retry with Exponential Backoff

When a transient failure occurs, retrying immediately often fails again and adds load to an already struggling service. Exponential backoff spaces out retries:

python
import logging
import random
import time

logger = logging.getLogger(__name__)

class TransientError(Exception):
    """Any error that is safe to retry, e.g. a timeout or a 503."""
    pass
 
def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0):
    """
    Retry with exponential backoff and jitter.
    Jitter prevents thundering herd when many clients retry simultaneously.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError as e:
            if attempt == max_retries - 1:
                raise
 
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s...
            delay = min(base_delay * (2 ** attempt), max_delay)
 
            # Add jitter (randomness) to prevent thundering herd
            jittered_delay = delay * (0.5 + random.random() * 0.5)
 
            logger.warning("Retry %d/%d after %.1fs: %s",
                           attempt + 1, max_retries, jittered_delay, e)
            time.sleep(jittered_delay)

The jitter is critical. Without it, if 1,000 clients all fail at the same time, they all retry at 1 second, fail again, all retry at 2 seconds, and so on -- creating synchronized bursts that prevent recovery. Jitter desynchronizes the retries.
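To make the effect concrete, here is the same backoff-plus-jitter formula evaluated for five attempts (the seed is only for reproducibility):

```python
import random

random.seed(7)  # seeded only so the illustration is reproducible
base_delay, max_delay = 1.0, 60.0

delays = []
for attempt in range(5):
    delay = min(base_delay * (2 ** attempt), max_delay)
    # same jitter formula as above: uniform in [delay/2, delay]
    delays.append(delay * (0.5 + random.random() * 0.5))

# Two independent clients will land on different points in each window,
# so their retries no longer arrive in synchronized bursts.
print([round(d, 2) for d in delays])
```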

Bulkhead Pattern

The bulkhead pattern isolates components so that a failure in one does not cascade to others. Named after ship bulkheads that prevent a single hull breach from sinking the entire vessel.

python
import asyncio
from dataclasses import dataclass

class BulkheadFullError(Exception):
    """Raised when no capacity frees up within the wait window."""
    pass
 
@dataclass
class Bulkhead:
    """Limits concurrent calls to a resource to prevent cascade failures."""
    max_concurrent: int
    max_wait_time: float = 5.0
 
    def __post_init__(self):
        self._semaphore = asyncio.Semaphore(self.max_concurrent)
 
    async def call(self, func, *args, **kwargs):
        try:
            await asyncio.wait_for(
                self._semaphore.acquire(),
                timeout=self.max_wait_time
            )
        except asyncio.TimeoutError:
            raise BulkheadFullError(
                f"Bulkhead full: {self.max_concurrent} concurrent calls"
            )
 
        try:
            return await func(*args, **kwargs)
        finally:
            self._semaphore.release()
 
 
# Separate bulkheads for different downstream services
payment_bulkhead = Bulkhead(max_concurrent=20)
inventory_bulkhead = Bulkhead(max_concurrent=50)
notification_bulkhead = Bulkhead(max_concurrent=10)
 
# If the payment service is slow, it consumes only 20 threads,
# leaving inventory and notification services unaffected

Chaos Engineering

Chaos engineering proactively tests resilience by injecting failures in controlled environments. The philosophy: if you are going to have failures in production (and you will), it is better to practice handling them deliberately.

Principles:

  1. Start with a hypothesis: "If we lose one database replica, the system should failover within 30 seconds with no user-visible errors."
  2. Inject the failure: Kill the replica.
  3. Observe the system: Did it failover? Were there errors? How long did it take?
  4. Fix what you find: If the hypothesis was wrong, improve the system before the failure happens for real.

Common chaos experiments:

  • Kill a service instance (does the load balancer reroute?)
  • Inject network latency (do timeouts and circuit breakers fire correctly?)
  • Fill a disk (does the system alert before data loss?)
  • Simulate a DNS failure (do services use cached resolutions?)

Tools like Chaos Monkey (Netflix), Litmus (Kubernetes), and Gremlin provide frameworks for running these experiments safely.
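The latency-injection experiment can be sketched in a few lines. This is a toy, in-process version of what tools like Gremlin do at the network layer; the wrapper name and the hypothesis are illustrative:

```python
import random
import time

def inject_latency(func, probability=0.3, delay_s=0.05,
                   rng=random.Random(42)):
    """Wrap a callable so a fraction of calls see added latency (a toy
    stand-in for network-level fault injection)."""
    def wrapped(*args, **kwargs):
        if rng.random() < probability:
            time.sleep(delay_s)  # simulate a slow network hop
        return func(*args, **kwargs)
    return wrapped

# Hypothesis: callers with a 100ms timeout survive 50ms of injected delay
flaky_lookup = inject_latency(lambda key: f"value-for-{key}",
                              probability=1.0, delay_s=0.05)

start = time.time()
result = flaky_lookup("user-42")
elapsed = time.time() - start
print(result, f"(took {elapsed * 1000:.0f}ms)")
```

In a real experiment you would run this against a staging environment while watching the dashboards and alerts described above.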

Practical Implementation

Here is a Prometheus alerting rule that implements an error budget burn rate alert:

yaml
# alerts.yml - SLO-based alerting
groups:
  - name: slo_alerts
    rules:
      # Alert if we're burning through error budget too quickly
      # A 14.4x burn rate over 1 hour consumes 2% of monthly budget
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast"
          description: "Error rate {{ $value | humanizePercentage }} exceeds the 14.4x burn-rate threshold; at this pace the monthly error budget is exhausted in roughly two days."
 
      # Slower burn rate alert for sustained issues
      - alert: SustainedErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained error budget consumption"
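
The burn-rate multipliers in these rules are not arbitrary. A burn rate of N means you are consuming budget N times faster than the SLO allows, so the fraction of a 30-day budget consumed in a window is just N times the window's share of the period:

```python
def budget_consumed_fraction(burn_rate: float, window_hours: float,
                             period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget consumed if the error rate
    runs at `burn_rate` times the SLO allowance for `window_hours`."""
    return burn_rate * window_hours / period_hours

# The thresholds used in the rules above:
fast = budget_consumed_fraction(burn_rate=14.4, window_hours=1)  # critical
slow = budget_consumed_fraction(burn_rate=6, window_hours=6)     # warning
print(f"fast burn: {fast:.1%} of budget in 1h, slow burn: {slow:.1%} in 6h")
```

So the critical alert fires when one hour of errors eats 2% of the monthly budget, and the warning fires when six hours eats 5%.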

Trade-offs and Decision Framework

| Approach | Complexity | Coverage | Cost |
| --- | --- | --- | --- |
| Basic health checks | Low | What is down | Minimal |
| Metrics + dashboards | Medium | Performance trends | Moderate |
| Full observability (metrics + logs + traces) | High | Root cause analysis | Higher |
| SRE practices (SLOs + error budgets) | High | Business-aligned reliability | Organizational investment |

Start with metrics and structured logging. Add distributed tracing when you have more than a handful of services. Adopt SLOs when you need to make principled trade-offs between reliability and feature velocity.

Common Interview Questions

Q: Your service has a p99 latency spike. How do you investigate?

A: Check metrics dashboards for correlated changes (deployment, traffic spike, dependency latency). Use distributed traces to identify which service or operation contributes the most latency. Check logs for errors or warnings around the spike. Look at infrastructure metrics (CPU saturation, memory pressure, GC pauses). Check whether the spike correlates with a specific endpoint, user, or region.

Q: How do you decide between alerting on symptoms vs causes?

A: Alert on symptoms first (high error rate, high latency) because they directly impact users. Investigate causes as part of incident response. Cause-based alerts (high CPU, disk full) are useful as early warnings but should not page on-call engineers unless they are about to impact users. This aligns with SLO-based alerting -- alert when the error budget burn rate is too high, not when a single node's CPU hits 80%.

Q: How would you implement observability for a system with 200 microservices?

A: Use OpenTelemetry for standardized instrumentation across all services. Deploy a centralized observability stack (Prometheus + Loki + Tempo, or a managed solution like Datadog). Mandate structured logging with correlation IDs (trace IDs) propagated through all service calls. Use auto-instrumentation for common frameworks. Define SLOs for each service and aggregate them into a reliability dashboard.

Q: What is the difference between a circuit breaker and a rate limiter?

A: A rate limiter controls incoming traffic to protect your service from being overwhelmed (as we covered in Part 8). A circuit breaker controls outgoing calls to protect your service from a failing dependency. Rate limiters protect you from your callers. Circuit breakers protect you from your dependencies.

Series Conclusion

This is the final part of the System Design from Zero to Hero series. Let us recap what we covered across all ten parts:

  1. System Design Fundamentals: The building blocks -- clients, servers, networks, and the framework for thinking about distributed systems.
  2. Scaling Strategies: Vertical vs horizontal scaling, stateless services, and when to scale what.
  3. Load Balancing and Reverse Proxies: Distributing traffic, health checks, and L4 vs L7 load balancing.
  4. Databases: SQL vs NoSQL: Choosing the right database, ACID vs BASE, and data modeling.
  5. Caching Strategies: Cache invalidation, eviction policies, and multi-layer caching.
  6. Message Queues and Event-Driven Systems: Kafka, RabbitMQ, delivery guarantees, and async processing.
  7. Database Sharding and Partitioning: Shard keys, consistent hashing, and cross-shard queries.
  8. API Design, Rate Limiting, and Authentication: REST, GraphQL, rate limiting algorithms, and OAuth2.
  9. CAP Theorem and Distributed Consensus: The fundamental trade-offs, Raft consensus, and conflict resolution.
  10. Monitoring, Observability, and Reliability (this post): The three pillars, SRE practices, and resilience patterns.

These concepts are not isolated. A well-designed system uses all of them together -- sharded databases behind load balancers, with cached hot paths, connected by message queues, protected by rate limiters and circuit breakers, and observed through metrics, logs, and traces. The art of system design is knowing which trade-offs to make for your specific requirements.

FAQ

What is the difference between monitoring and observability?

Monitoring tells you when something is wrong using predefined checks. Observability lets you ask arbitrary questions about your system's internal state using metrics, logs, and traces without deploying new code.

What are the three pillars of observability?

The three pillars are metrics (quantitative measurements over time), logs (discrete event records), and traces (request flow across services). Together they provide a complete picture of system behavior.

How do I set meaningful SLOs for my service?

Start by measuring current performance baselines, then set SLOs slightly below your best performance. Focus on user-facing metrics like latency percentiles and error rates rather than internal system metrics.

Article Author

Sadam Hussain, Senior Full Stack Developer
