What Is Observability?

Observability is the ability to understand the internal state of a system by examining its external outputs: metrics, logs, and traces. The term comes from control theory, and it has become a foundational discipline in modern software engineering because it lets teams diagnose problems, understand behaviour, and improve reliability.

Monitoring vs Observability

These terms are often used interchangeably, but they represent different approaches:

Aspect	Monitoring	Observability
Approach	Predefined checks and thresholds	Exploratory, ask arbitrary questions
Focus	Known failure modes	Unknown unknowns
Signals	Dashboards and alerts	Metrics, logs, traces — correlated
Question	"Is the system up?"	"Why is it behaving this way?"

Monitoring tells you when something is wrong. Observability helps you understand why.
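As a rough illustration of the difference, the sketch below contrasts a predefined threshold check with an exploratory group-by over raw request events. The event fields (region, version, status), the sample data, and the 5% threshold are all hypothetical, invented for this example; a real system would query a telemetry backend rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical request events; every field name and value is made up for illustration.
EVENTS = [
    {"region": "eu", "version": "1.4.0", "status": 200},
    {"region": "eu", "version": "1.4.1", "status": 500},
    {"region": "us", "version": "1.4.0", "status": 200},
    {"region": "us", "version": "1.4.1", "status": 500},
    {"region": "us", "version": "1.4.0", "status": 200},
    {"region": "eu", "version": "1.4.0", "status": 200},
]


def error_rate(events):
    """Fraction of events with a 5xx status."""
    if not events:
        return 0.0
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events)


def monitoring_check(events, threshold=0.05):
    """Monitoring: a predefined check against a fixed threshold."""
    return error_rate(events) > threshold


def error_rate_by(events, dimension):
    """Observability: slice the same events by a dimension chosen at query time."""
    groups = defaultdict(list)
    for e in events:
        groups[e[dimension]].append(e)
    return {key: error_rate(group) for key, group in groups.items()}


print(monitoring_check(EVENTS))          # the check fires: something is wrong
print(error_rate_by(EVENTS, "region"))   # both regions look equally affected
print(error_rate_by(EVENTS, "version"))  # errors concentrate in one version
```

The threshold check only says that the error rate is too high. Slicing the same events by version, a question nobody had to anticipate in advance, points at the 1.4.1 release.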

The Three Pillars

Observability is commonly described in terms of three pillars:

1. Metrics

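Numeric measurements collected over time, such as CPU usage, request latency, and error rates. Because each data point is just a number with a timestamp and a few labels, metrics are cheap to store and aggregate, which makes them well suited to dashboards and alerting. The sample below is in the Prometheus exposition format: a metric name, its labels in braces, and the current value.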

http_requests_total{method="GET", status="200"} 14523
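To make the shape of that sample concrete, here is a minimal sketch of a labelled counter that renders itself in the same text format. It uses only the Python standard library; the LabeledCounter class is a hypothetical toy written for this lesson, not the API of the official Prometheus client, and it skips details such as label-value escaping.

```python
from collections import Counter


class LabeledCounter:
    """A toy, monotonically increasing counter keyed by label values."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = Counter()

    def inc(self, amount=1, **labels):
        """Increase the counter for one combination of label values."""
        if amount < 0:
            raise ValueError("counters can only go up")
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def render(self):
        """Return one text line per label combination (no escaping handled)."""
        lines = []
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines)


requests_total = LabeledCounter("http_requests_total", ["method", "status"])
requests_total.inc(method="GET", status="200")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="POST", status="500")
print(requests_total.render())
```

Each distinct combination of label values becomes its own line, which is why the choice of labels matters so much for storage cost.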

2. Logs

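Timestamped, immutable records of discrete events, such as application errors, audit trails, and debug output. Structured logs (for example, JSON) are easier to search and aggregate than free text. In the entry below, the trace_id field links the log line to the trace of the request that produced it.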

{
  "timestamp": "2025-03-15T10:23:45Z",
  "level": "ERROR",
  "service": "payment-api",
  "message": "Failed to charge card",
  "trace_id": "abc123def456"
}
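One possible way to emit an entry like this from Python is a custom formatter for the standard logging module. This is a minimal sketch, not a recommended production setup: the JsonFormatter class is a hypothetical helper, and passing trace_id through the extra argument is just one convention among several.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # trace_id is supplied by the caller via `extra`; None if absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="payment-api"))

logger = logging.getLogger("payment-api")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output to our handler only

logger.error("Failed to charge card", extra={"trace_id": "abc123def456"})
print(stream.getvalue(), end="")
```

Writing one JSON object per line keeps the output easy for log shippers to parse without multi-line handling.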

3. Traces

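Records of a request's journey through a distributed system, showing timing and dependencies across services. Each unit of work within a trace is called a span, and spans are linked to their parents so the full call path can be reconstructed.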

[Frontend] → [API Gateway] → [Payment Service] → [Database]
   12ms          3ms              45ms               8ms
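The parent/child structure behind such a diagram can be sketched with a toy in-process tracer. Everything here (the Span and Tracer classes, the span names, the sleep standing in for work) is hypothetical and for illustration only; in a real distributed system the trace context has to be propagated between processes, which this sketch does not attempt.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float = 0.0


class Tracer:
    """A toy tracer: records spans and their parent/child links."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.finished: List[Span] = []
        self._stack: List[Span] = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name):
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, uuid.uuid4().hex[:16], parent_id, name)
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("frontend"):
    with tracer.span("api-gateway"):
        with tracer.span("payment-service"):
            with tracer.span("database"):
                time.sleep(0.005)  # stand-in for real work

for s in tracer.finished:
    print(f"{s.name:16} parent={s.parent_id} {s.duration_ms:.1f}ms")
```

Spans finish innermost first, and each outer span's duration includes the time spent in its children. All four spans share one trace_id, which is what lets a backend stitch them back into a single request.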

Beyond the Three Pillars

Modern observability extends beyond metrics, logs, and traces:

Profiling — continuous profiling of CPU, memory, and goroutines in production
Events — deployment markers, feature flag changes, incident annotations
Real User Monitoring (RUM) — client-side performance data from browsers and mobile apps
Synthetic monitoring — automated checks that simulate user journeys

Why Observability Matters

In modern distributed systems (microservices, Kubernetes, serverless), failures are:

Inevitable — hardware fails, networks partition, code has bugs
Complex — a single request may traverse dozens of services
Emergent — novel failure modes arise from service interactions

Without observability, you are flying blind. With it, you can:

Detect issues before users are impacted
Diagnose root causes quickly during incidents
Understand system behaviour under load
Optimise performance and resource usage
Validate changes after deployments

Tip: Observability is not a product you buy — it is a property of your system that you build into your architecture from the start.

Key Terminology

Term	Definition
Telemetry	Data emitted by a system about its behaviour (metrics, logs, traces)
Instrumentation	The code that generates telemetry data
Cardinality	The number of unique values for a label or tag; high cardinality multiplies the number of distinct time series and so increases storage and query costs
Dimensionality	The number of labels or tags attached to a data point
SLI	Service Level Indicator — a quantitative measure of service behaviour
SLO	Service Level Objective — a target value for an SLI
MTTD	Mean Time to Detect — how long it takes to notice a problem
MTTR	Mean Time to Resolve — how long it takes to fix a problem
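Two of these terms lend themselves to a quick calculation. The sketch below estimates how cardinality multiplies the number of time series for a metric, and then computes an availability SLI and compares it with an SLO. The label sets, request counts, and the 99.9% target are hypothetical numbers chosen for illustration, and treating a window with no traffic as fully available is an assumption of this sketch.

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric: product of per-label cardinalities."""
    return prod(len(values) for values in label_values.values())


def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(sli, slo_target):
    """SLO: the target the SLI is expected to meet or exceed."""
    return sli >= slo_target


labels = {
    "method": {"GET", "POST", "PUT"},
    "status": {"200", "404", "500"},
}
print(series_count(labels))  # 3 * 3 = 9

# Adding one high-cardinality label multiplies the series count.
labels["user_id"] = {f"user-{i}" for i in range(10_000)}
print(series_count(labels))  # 3 * 3 * 10000 = 90000

sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(sli, meets_slo(sli, slo_target=0.999))
```

The jump from 9 to 90,000 series after adding a per-user label is why identifiers such as user IDs are usually better placed in logs or traces than in metric labels.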

The Observability Landscape

The observability ecosystem includes many tools and standards:

Category	Examples
Metrics	Prometheus, Datadog, Grafana Mimir, InfluxDB
Logging	Elastic Stack (ELK), Loki, Splunk, Fluentd
Tracing	Jaeger, Zipkin, Tempo, Honeycomb
All-in-one	Datadog, New Relic, Dynatrace, Grafana Cloud
Standards	OpenTelemetry, StatsD, Prometheus exposition format

OpenTelemetry (OTel) is emerging as the industry standard for instrumentation. It provides vendor-neutral APIs and SDKs, along with the OpenTelemetry Collector for receiving, processing, and exporting telemetry, and it covers all three pillars.

Summary

Observability is the ability to understand system behaviour through telemetry data: metrics, logs, and traces. Unlike traditional monitoring, which checks for known issues, observability lets you explore unknown problems in complex distributed systems. In the following lessons, we will look at each pillar in depth, learn the tools of the trade, and build a complete observability stack.