Monitoring, Observability, and Site Reliability
Build observable systems with structured logging, distributed tracing, and metrics dashboards. Learn SRE practices including SLOs, error budgets, and incident response.
This is Part 10 and the finale of the System Design from Zero to Hero series.
TL;DR
Observability combines metrics, logs, and traces to answer not just what broke but why, and SRE practices like SLOs and error budgets turn reliability into a measurable engineering discipline. This final part covers the three pillars of observability, the Prometheus/Grafana stack, distributed tracing with OpenTelemetry, SLOs/SLIs/SLAs, and resilience patterns like circuit breakers, retries, and bulkheads.
Why This Matters
Over the past nine parts of this series, we have designed systems with horizontal scaling, load balancers, optimized databases, caching layers, message queues, sharded data stores, well-designed APIs, and consensus protocols. But none of this matters if you cannot tell when your system is broken, why it is broken, and how close it is to breaking.
Monitoring answers "is the system up?" Observability answers "why is this user seeing 500 errors on this specific endpoint at 3 AM?" The distinction matters because modern distributed systems fail in novel ways that you cannot predict and pre-configure dashboards for. You need the ability to ask arbitrary questions about system behavior in real time.
Site reliability engineering (SRE) wraps this in a disciplined framework: define what "reliable enough" means, measure it, and use the gap between current reliability and the target to make engineering prioritization decisions.
Core Concepts
The Three Pillars of Observability
1. Metrics
Metrics are numerical measurements collected over time. They are cheap to store, fast to query, and ideal for alerting and dashboards.
Key metric types:
- Counters: Monotonically increasing values (total requests, total errors)
- Gauges: Values that go up and down (current memory usage, active connections)
- Histograms: Distribution of values (request latency percentiles)
The RED method for services:
- Rate -- requests per second
- Errors -- error rate as a percentage of total requests
- Duration -- latency distribution (p50, p95, p99)
The USE method for infrastructure:
- Utilization -- percentage of resource capacity used
- Saturation -- queue depth, work waiting to be processed
- Errors -- error count
```python
# Prometheus metrics instrumentation with the Python client
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

ACTIVE_REQUESTS = Gauge(
    'http_active_requests',
    'Number of active HTTP requests',
    ['endpoint']
)

def middleware(request, call_next):
    endpoint = request.url.path
    method = request.method
    ACTIVE_REQUESTS.labels(endpoint=endpoint).inc()
    start_time = time.time()
    try:
        response = call_next(request)
        status = response.status_code
    except Exception:
        status = 500
        raise
    finally:
        duration = time.time() - start_time
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status_code=status).inc()
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(duration)
        ACTIVE_REQUESTS.labels(endpoint=endpoint).dec()
    return response

# Expose metrics endpoint for Prometheus to scrape
start_http_server(8000)  # serves /metrics on port 8000
```

2. Logs
Logs are discrete event records. Structured logging (JSON) is essential for searchability at scale.
```python
import structlog
import uuid

# Configure structured logging
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()

def process_order(order_id: str, user_id: str):
    # Bind context that persists across all log calls
    log = logger.bind(
        order_id=order_id,
        user_id=user_id,
        trace_id=str(uuid.uuid4())
    )
    log.info("order_processing_started", amount=99.99)
    try:
        result = charge_payment(order_id)
        log.info("payment_charged", payment_id=result.payment_id)
    except PaymentError as e:
        log.error("payment_failed",
                  error_type=type(e).__name__,
                  error_message=str(e))
        raise

# Output:
# {"event": "order_processing_started", "order_id": "ord-123",
#  "user_id": "usr-456", "trace_id": "abc-...", "amount": 99.99,
#  "level": "info", "timestamp": "2025-12-10T14:23:01Z"}
```

Log levels matter: DEBUG for development, INFO for normal operations, WARN for concerning but handled situations, ERROR for failures requiring investigation, CRITICAL for system-threatening failures.
3. Traces
Distributed traces follow a request as it flows through multiple services. Each service creates a "span" with timing information, and spans are linked by a shared trace ID.
```
Trace: abc-123
└── Span: API Gateway (12ms)
    ├── Span: Auth Service (3ms)
    └── Span: Order Service (8ms)
        ├── Span: Database Query (2ms)
        ├── Span: Cache Lookup (0.5ms)
        └── Span: Payment Service (4ms)
            └── Span: Stripe API Call (3ms)
```
Without traces, debugging "why was this request slow?" in a microservices architecture is guesswork. With traces, you can see exactly which service and which operation caused the latency.
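The linkage works by propagating the trace ID across service boundaries, usually in a request header (OpenTelemetry standardizes this as the W3C `traceparent` header). A minimal sketch of the idea, using a hypothetical `x-trace-id` header:

```python
import uuid

def incoming_trace_context(headers: dict) -> dict:
    """Reuse the caller's trace ID if present; otherwise start a new trace."""
    trace_id = headers.get("x-trace-id") or str(uuid.uuid4())
    span_id = uuid.uuid4().hex[:16]  # this service's own span within the trace
    # Every downstream call carries the same trace ID so spans link up
    return {"x-trace-id": trace_id, "x-parent-span-id": span_id}

ctx = incoming_trace_context({"x-trace-id": "abc-123"})
# ctx["x-trace-id"] == "abc-123" -- the trace continues across the hop
```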
OpenTelemetry
OpenTelemetry (OTel) is the industry standard for instrumentation. It provides a single API for metrics, logs, and traces with exporters for any backend (Prometheus, Jaeger, Datadog, Grafana).
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure the tracer
provider = TracerProvider()
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Auto-instrument frameworks
FastAPIInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Manual instrumentation for custom business logic
def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)
        with tracer.start_as_current_span("validate_payment"):
            validate(order_id, amount)
        with tracer.start_as_current_span("charge_card"):
            result = charge(amount)
            span.set_attribute("payment.status", result.status)
        return result
```

The Prometheus/Grafana Stack
A production observability stack typically includes:
- Prometheus: Pull-based metrics collection and storage. Scrapes `/metrics` endpoints at regular intervals. Stores time-series data with a powerful query language (PromQL).
- Grafana: Visualization and dashboarding. Connects to Prometheus, Loki, Tempo, and dozens of other data sources.
- Loki: Log aggregation by Grafana Labs. Like Prometheus but for logs -- indexes labels, not full text.
- Tempo: Distributed tracing backend. Stores traces and integrates with Grafana for visualization.
```yaml
# docker-compose.yml for a complete observability stack
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"
      - "4317:4317"  # OTLP gRPC receiver

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4318:4318"  # OTLP HTTP receiver
```

```yaml
# prometheus.yml - scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:8000']

  - job_name: 'order-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```

SLOs, SLIs, and SLAs
SLI (Service Level Indicator): A quantitative measure of a service's behavior. Examples: request latency p99, error rate, availability percentage.
SLO (Service Level Objective): A target value for an SLI. Example: "99.9% of requests complete in under 200ms." SLOs are internal engineering goals.
SLA (Service Level Agreement): A contractual commitment to customers, typically with financial penalties. SLAs should be less aggressive than SLOs to provide a buffer.
```
SLI: request latency (e.g., p99)
SLO: 99.9% of requests complete in under 200ms (internal goal)
SLA: 99.5% of requests complete in under 500ms (customer contract)
```
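In code, an SLI is just a measurement and an SLO a threshold check against it. A minimal sketch with hypothetical numbers, using availability as the SLI:

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    return (total_requests - failed_requests) / total_requests

def meets_slo(sli: float, target: float = 0.999) -> bool:
    """SLO: the SLI must be at or above the target."""
    return sli >= target

sli = availability_sli(total_requests=1_000_000, failed_requests=800)
print(f"{sli:.4%}")    # 99.9200%
print(meets_slo(sli))  # True: 99.92% clears the 99.9% objective
```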
Error Budgets
An error budget is the inverse of your SLO. If your SLO is 99.9% availability, your error budget is 0.1% -- roughly 43 minutes of downtime per month.
Monthly error budget calculation:

```
SLO: 99.9% availability
Total minutes in 30 days: 43,200
Error budget: 43,200 * 0.001 = 43.2 minutes

If you've used 30 minutes this month:
  Remaining budget: 13.2 minutes
  Budget consumed:  69.4%
  Action: slow down risky deployments
```
Error budgets create a powerful feedback loop:
- Budget remaining: Ship features, take risks, deploy faster
- Budget nearly exhausted: Freeze feature deployments, focus on reliability
- Budget exceeded: Mandatory reliability sprint -- no new features until stability is restored
This gives product and engineering teams a shared, objective framework for balancing velocity and reliability.
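The arithmetic above is easy to encode. A small helper (a sketch, not a standard library) makes the budget check mechanical:

```python
# A minimal error-budget calculator, mirroring the arithmetic above
def error_budget(slo: float, period_minutes: int, downtime_minutes: float) -> dict:
    """Compute error budget status for an availability SLO."""
    total_budget = period_minutes * (1 - slo)  # allowed downtime in minutes
    remaining = total_budget - downtime_minutes
    consumed_pct = downtime_minutes / total_budget * 100
    return {
        "budget_minutes": round(total_budget, 1),
        "remaining_minutes": round(remaining, 1),
        "consumed_pct": round(consumed_pct, 1),
    }

status = error_budget(slo=0.999, period_minutes=43_200, downtime_minutes=30)
# {'budget_minutes': 43.2, 'remaining_minutes': 13.2, 'consumed_pct': 69.4}
```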
Circuit Breaker Pattern
When a downstream service is failing, continuing to send requests makes things worse -- you consume resources waiting for timeouts and amplify the failure cascade. The circuit breaker pattern stops this cycle.
```python
import time
from enum import Enum
from dataclasses import dataclass

class CircuitOpenError(Exception):
    """Raised when the circuit is open and a request is rejected."""

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation, requests flow through
    OPEN = "open"            # Failures detected, requests are rejected immediately
    HALF_OPEN = "half_open"  # Testing if the service has recovered

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    success_threshold: int = 3
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    success_count: int = 0
    last_failure_time: float = 0

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                raise CircuitOpenError("Circuit is open, request rejected")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

try:
    result = payment_breaker.call(payment_service.charge, order_id, amount)
except CircuitOpenError:
    # Return a graceful degradation response
    return {"status": "pending", "message": "Payment processing delayed"}
```

Retry with Exponential Backoff
When a transient failure occurs, retrying immediately often fails again and adds load to an already struggling service. Exponential backoff spaces out retries:
```python
import random
import time

class TransientError(Exception):
    """A retryable failure, e.g. a timeout or a 503 response."""

def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0):
    """
    Retry with exponential backoff and jitter.
    Jitter prevents thundering herd when many clients retry simultaneously.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError as e:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s...
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Add jitter (randomness) to prevent thundering herd
            jittered_delay = delay * (0.5 + random.random() * 0.5)
            logger.warning("retry_scheduled",
                           attempt=attempt + 1,
                           max_retries=max_retries,
                           delay_seconds=round(jittered_delay, 1),
                           error=str(e))
            time.sleep(jittered_delay)
```

The jitter is critical. Without it, if 1,000 clients all fail at the same time, they all retry at 1 second, fail again, all retry at 2 seconds, and so on -- creating synchronized bursts that prevent recovery. Jitter desynchronizes the retries.
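A quick simulation makes the effect concrete. Bucketing 1,000 simulated clients' retry times (using the same jitter formula as above) shows that without jitter everyone lands in a single time bucket, while jitter spreads them out:

```python
import random

def retry_buckets(n_clients: int, attempt: int, jitter: bool) -> int:
    """Count distinct 0.1s retry-time buckets across n_clients."""
    buckets = set()
    for _ in range(n_clients):
        delay = 1.0 * (2 ** attempt)              # exponential backoff
        if jitter:
            delay *= 0.5 + random.random() * 0.5  # jitter: 50-100% of the delay
        buckets.add(round(delay, 1))
    return len(buckets)

print(retry_buckets(1000, attempt=2, jitter=False))  # 1: a synchronized burst
print(retry_buckets(1000, attempt=2, jitter=True))   # ~21: spread over 2.0s-4.0s
```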
Bulkhead Pattern
The bulkhead pattern isolates components so that a failure in one does not cascade to others. Named after ship bulkheads that prevent a single hull breach from sinking the entire vessel.
```python
import asyncio
from dataclasses import dataclass

class BulkheadFullError(Exception):
    """Raised when the bulkhead's concurrency limit is reached."""

@dataclass
class Bulkhead:
    """Limits concurrent calls to a resource to prevent cascade failures."""
    max_concurrent: int
    max_wait_time: float = 5.0

    def __post_init__(self):
        self._semaphore = asyncio.Semaphore(self.max_concurrent)

    async def call(self, func, *args, **kwargs):
        try:
            await asyncio.wait_for(
                self._semaphore.acquire(),
                timeout=self.max_wait_time
            )
        except asyncio.TimeoutError:
            raise BulkheadFullError(
                f"Bulkhead full: {self.max_concurrent} concurrent calls"
            )
        try:
            return await func(*args, **kwargs)
        finally:
            self._semaphore.release()

# Separate bulkheads for different downstream services
payment_bulkhead = Bulkhead(max_concurrent=20)
inventory_bulkhead = Bulkhead(max_concurrent=50)
notification_bulkhead = Bulkhead(max_concurrent=10)

# If the payment service is slow, it consumes only 20 slots,
# leaving inventory and notification services unaffected
```

Chaos Engineering
Chaos engineering proactively tests resilience by injecting failures in controlled environments. The philosophy: if you are going to have failures in production (and you will), it is better to practice handling them deliberately.
Principles:
1. Start with a hypothesis: "If we lose one database replica, the system should failover within 30 seconds with no user-visible errors."
2. Inject the failure: Kill the replica.
3. Observe the system: Did it failover? Were there errors? How long did it take?
4. Fix what you find: If the hypothesis was wrong, improve the system before the failure happens for real.
Common chaos experiments:
- Kill a service instance (does the load balancer reroute?)
- Inject network latency (do timeouts and circuit breakers fire correctly?)
- Fill a disk (does the system alert before data loss?)
- Simulate a DNS failure (do services use cached resolutions?)
Tools like Chaos Monkey (Netflix), Litmus (Kubernetes), and Gremlin provide frameworks for running these experiments safely.
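The hypothesis-inject-observe loop can be expressed as a tiny harness (a sketch; `kill_replica` and `primary_healthy` are hypothetical hooks you would supply):

```python
import time

def run_experiment(hypothesis: str, inject, observe, timeout_s: float = 30.0) -> bool:
    """Inject a failure, then poll until the system recovers or the deadline passes."""
    print(f"Hypothesis: {hypothesis}")
    inject()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if observe():  # e.g. a health check that passes once failover completes
            print("Hypothesis held: system recovered within the deadline")
            return True
        time.sleep(1.0)
    print("Hypothesis failed: fix the system before this happens for real")
    return False

# run_experiment("Failover within 30s, no user-visible errors",
#                inject=kill_replica, observe=primary_healthy)
```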
Practical Implementation
Here is a Prometheus alerting rule that implements an error budget burn rate alert:
```yaml
# alerts.yml - SLO-based alerting
groups:
  - name: slo_alerts
    rules:
      # Alert if we're burning through error budget too quickly.
      # A 14.4x burn rate sustained for 1 hour consumes 2% of the monthly budget.
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast"
          description: "Error rate {{ $value | humanizePercentage }} exceeds the 14.4x burn-rate threshold; at this pace the monthly error budget is exhausted in about two days."

      # Slower burn rate alert for sustained issues
      - alert: SustainedErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained error budget consumption"
```

Trade-offs and Decision Framework
| Approach | Complexity | Coverage | Cost |
|---|---|---|---|
| Basic health checks | Low | What is down | Minimal |
| Metrics + dashboards | Medium | Performance trends | Moderate |
| Full observability (metrics + logs + traces) | High | Root cause analysis | Higher |
| SRE practices (SLOs + error budgets) | High | Business-aligned reliability | Organizational investment |
Start with metrics and structured logging. Add distributed tracing when you have more than a handful of services. Adopt SLOs when you need to make principled trade-offs between reliability and feature velocity.
Common Interview Questions
Q: Your service has a p99 latency spike. How do you investigate?

A: Check metrics dashboards for correlated changes (deployment, traffic spike, dependency latency). Use distributed traces to identify which service or operation contributes the most latency. Check logs for errors or warnings around the spike. Look at infrastructure metrics (CPU saturation, memory pressure, GC pauses). Check if the spike correlates with a specific endpoint, user, or region.

Q: How do you decide between alerting on symptoms vs causes?

A: Alert on symptoms first (high error rate, high latency) because they directly impact users. Investigate causes as part of incident response. Cause-based alerts (high CPU, disk full) are useful as early warnings but should not page on-call engineers unless they are about to impact users. This aligns with SLO-based alerting -- alert when the error budget burn rate is too high, not when a single node's CPU hits 80%.

Q: How would you implement observability for a system with 200 microservices?

A: Use OpenTelemetry for standardized instrumentation across all services. Deploy a centralized observability stack (Prometheus + Loki + Tempo, or a managed solution like Datadog). Mandate structured logging with correlation IDs (trace IDs) propagated through all service calls. Use auto-instrumentation for common frameworks. Define SLOs for each service and aggregate into a reliability dashboard.

Q: What is the difference between a circuit breaker and a rate limiter?

A: A rate limiter controls incoming traffic to protect your service from being overwhelmed (as we covered in Part 8). A circuit breaker controls outgoing calls to protect your service from a failing dependency. Rate limiters protect you from your callers. Circuit breakers protect you from your dependencies.
Series Conclusion
This is the final part of the System Design from Zero to Hero series. Let us recap what we covered across all ten parts:
1. System Design Fundamentals: The building blocks -- clients, servers, networks, and the framework for thinking about distributed systems.
2. Scaling Strategies: Vertical vs horizontal scaling, stateless services, and when to scale what.
3. Load Balancing and Reverse Proxies: Distributing traffic, health checks, and L4 vs L7 load balancing.
4. Databases: SQL vs NoSQL: Choosing the right database, ACID vs BASE, and data modeling.
5. Caching Strategies: Cache invalidation, eviction policies, and multi-layer caching.
6. Message Queues and Event-Driven Systems: Kafka, RabbitMQ, delivery guarantees, and async processing.
7. Database Sharding and Partitioning: Shard keys, consistent hashing, and cross-shard queries.
8. API Design, Rate Limiting, and Authentication: REST, GraphQL, rate limiting algorithms, and OAuth2.
9. CAP Theorem and Distributed Consensus: The fundamental trade-offs, Raft consensus, and conflict resolution.
10. Monitoring, Observability, and Reliability (this post): The three pillars, SRE practices, and resilience patterns.
These concepts are not isolated. A well-designed system uses all of them together -- sharded databases behind load balancers, with cached hot paths, connected by message queues, protected by rate limiters and circuit breakers, and observed through metrics, logs, and traces. The art of system design is knowing which trade-offs to make for your specific requirements.
FAQ
What is the difference between monitoring and observability?
Monitoring tells you when something is wrong using predefined checks. Observability lets you ask arbitrary questions about your system's internal state using metrics, logs, and traces without deploying new code.
What are the three pillars of observability?
The three pillars are metrics (quantitative measurements over time), logs (discrete event records), and traces (request flow across services). Together they provide a complete picture of system behavior.
How do I set meaningful SLOs for my service?
Start by measuring current performance baselines, then set SLOs slightly below your best performance. Focus on user-facing metrics like latency percentiles and error rates rather than internal system metrics.