What is Observability

Observability is the ability to understand the internal state of a system by examining its external outputs — metrics, logs, and traces. It originates from control theory and has become a foundational discipline in modern software engineering, enabling teams to diagnose problems, understand behaviour, and improve reliability.


Monitoring vs Observability

These terms are often used interchangeably, but they represent different approaches:

| Aspect | Monitoring | Observability |
|---|---|---|
| Approach | Predefined checks and thresholds | Exploratory, ask arbitrary questions |
| Focus | Known failure modes | Unknown unknowns |
| Signals | Dashboards and alerts | Correlated metrics, logs, and traces |
| Question | "Is the system up?" | "Why is it behaving this way?" |

Monitoring tells you when something is wrong. Observability helps you understand why.
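
The contrast can be sketched in a few lines of Python. This is only an illustration: the event records, field names (`region`, `app_version`), and the 5% threshold are all invented for the example, and a real system would query a telemetry backend rather than a list in memory.

```python
from collections import Counter

# Hypothetical request events; every field and value here is made up.
events = [
    {"service": "payment-api", "status": 200, "region": "eu-west", "app_version": "2.4.0"},
    {"service": "payment-api", "status": 500, "region": "eu-west", "app_version": "2.4.1"},
    {"service": "payment-api", "status": 200, "region": "us-east", "app_version": "2.4.0"},
    {"service": "payment-api", "status": 500, "region": "us-east", "app_version": "2.4.1"},
    {"service": "payment-api", "status": 500, "region": "eu-west", "app_version": "2.4.1"},
    {"service": "payment-api", "status": 200, "region": "eu-west", "app_version": "2.4.0"},
]

# Monitoring: a predefined check against a fixed threshold.
error_rate = sum(e["status"] >= 500 for e in events) / len(events)
alert = error_rate > 0.05  # assumed threshold; fires, but says nothing about the cause


# Observability: slice the same data by any field, chosen after the fact.
def failures_by(field):
    """Count failed requests grouped by an arbitrary event field."""
    return Counter(e[field] for e in events if e["status"] >= 500)


print("alert:", alert)
print("by region:", dict(failures_by("region")))
print("by app_version:", dict(failures_by("app_version")))
```

Grouping by `region` spreads the failures across regions, while grouping by `app_version` puts them all on one version. The alert alone could not have answered that question.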


The Three Pillars

Observability is commonly described in terms of three pillars:

1. Metrics

Numeric measurements collected over time — CPU usage, request latency, error rates. Metrics are cheap to store and ideal for alerting.

http_requests_total{method="GET", status="200"} 14523
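
As a rough sketch of where a line like the one above comes from, the following Python keeps a labelled counter in memory and prints one line per label combination. The `Counter` class here is a hypothetical stand-in written for this lesson, not the API of any real metrics library.

```python
from collections import defaultdict


class Counter:
    """A tiny in-memory counter keyed by label values (illustrative only)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(int)

    def inc(self, amount=1, **labels):
        # Counters only ever go up, so negative increments are rejected.
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self):
        # One text line per label combination, i.e. one time series each.
        lines = []
        for key, value in sorted(self._values.items()):
            pairs = ", ".join(
                f'{name}="{val}"' for name, val in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return lines


http_requests_total = Counter("http_requests_total", ["method", "status"])
http_requests_total.inc(method="GET", status="200")
http_requests_total.inc(method="GET", status="200")
http_requests_total.inc(method="POST", status="500")

for line in http_requests_total.render():
    print(line)
```

Each distinct combination of label values becomes its own series, which is why label choice matters for storage cost.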

2. Logs

Timestamped, immutable records of discrete events — application errors, audit trails, debug output.

{
  "timestamp": "2025-03-15T10:23:45Z",
  "level": "ERROR",
  "service": "payment-api",
  "message": "Failed to charge card",
  "trace_id": "abc123def456"
}
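
A record shaped like the one above can be produced with Python's standard `logging` module and a custom formatter. This is a minimal sketch: the `JsonFormatter` class is written for this example, and passing `service` and `trace_id` through `extra` is one possible convention, not the only one.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (illustrative)."""

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            # Fields passed via `extra=` become attributes on the record.
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Failed to charge card",
    extra={"service": "payment-api", "trace_id": "abc123def456"},
)
```

Including the `trace_id` in every log line is what lets a log entry be correlated with the trace for the same request.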

3. Traces

Records of a request's journey through a distributed system, showing timing and dependencies across services.

[Frontend] → [API Gateway] → [Payment Service] → [Database]
   12ms          3ms              45ms               8ms
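
To make the idea concrete, here is a small sketch of how spans might be recorded in one process, assuming a hypothetical `span` helper and an in-memory list in place of a real tracing backend. Production systems would normally use a tracing library and pass the trace context between services over the network.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # hypothetical in-memory stand-in for a trace exporter


@contextmanager
def span(name, trace_id, parent_id=None):
    """Time a unit of work and record it as a span of the given trace."""
    span_id = uuid.uuid4().hex[:16]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        finished_spans.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


# One trace ID ties together every span created for this request.
trace_id = uuid.uuid4().hex
with span("frontend", trace_id) as frontend_id:
    with span("api-gateway", trace_id, parent_id=frontend_id) as gateway_id:
        with span("payment-service", trace_id, parent_id=gateway_id) as payment_id:
            with span("database", trace_id, parent_id=payment_id):
                time.sleep(0.008)  # stand-in for a database query

for s in finished_spans:
    print(f'{s["name"]:<16} {s["duration_ms"]:.1f}ms parent={s["parent_id"]}')
```

Spans finish innermost first, and in this sketch each parent's duration includes the time spent in its children. The shared `trace_id` and the `parent_id` links are what allow the request's path to be reconstructed.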

Beyond the Three Pillars

Modern observability extends beyond metrics, logs, and traces:

  • Profiling — continuous profiling of CPU, memory, and goroutines in production
  • Events — deployment markers, feature flag changes, incident annotations
  • Real User Monitoring (RUM) — client-side performance data from browsers and mobile apps
  • Synthetic monitoring — automated checks that simulate user journeys

Why Observability Matters

In modern distributed systems (microservices, Kubernetes, serverless), failures are:

  • Inevitable — hardware fails, networks partition, code has bugs
  • Complex — a single request may traverse dozens of services
  • Emergent — novel failure modes arise from service interactions

Without observability, you are flying blind. With it, you can:

  • Detect issues before users are impacted
  • Diagnose root causes quickly during incidents
  • Understand system behaviour under load
  • Optimise performance and resource usage
  • Validate changes after deployments

Tip: Observability is not a product you buy — it is a property of your system that you build into your architecture from the start.


Key Terminology

| Term | Definition |
|---|---|
| Telemetry | Data emitted by a system about its behaviour (metrics, logs, traces) |
| Instrumentation | The code that generates telemetry data |
| Cardinality | The number of unique values for a label or tag; high cardinality increases storage costs |
| Dimensionality | The number of labels or tags attached to a data point |
| SLI | Service Level Indicator: a quantitative measure of service behaviour |
| SLO | Service Level Objective: a target value for an SLI |
| MTTD | Mean Time to Detect: how long it takes to notice a problem |
| MTTR | Mean Time to Resolve: how long it takes to fix a problem |
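
A short worked example may help tie several of these terms together. All numbers below are invented for illustration: the request counts, the 99.5% target, and the label values are assumptions, not measurements from a real service.

```python
from math import prod

# SLI: the measured proportion of successful requests (hypothetical counts).
total_requests = 14523
failed_requests = 29
sli = (total_requests - failed_requests) / total_requests

# SLO: the target the SLI is compared against (assumed to be 99.5% here).
slo_target = 0.995
slo_met = sli >= slo_target

# Dimensionality: how many labels are attached to each data point.
# Cardinality: how many unique values each label takes.
label_values = {
    "method": {"GET", "POST", "PUT", "DELETE"},
    "status": {"200", "400", "404", "500"},
    "region": {"eu-west", "us-east"},
}
dimensionality = len(label_values)

# Upper bound on distinct time series: the product of per-label cardinalities.
max_series = prod(len(values) for values in label_values.values())

print(f"SLI = {sli:.4f}, SLO met: {slo_met}")
print(f"{dimensionality} labels, up to {max_series} series")
```

Adding a label with many unique values, such as a user ID, multiplies the upper bound on series by the number of those values. That is why high-cardinality labels drive up storage costs.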

The Observability Landscape

The observability ecosystem includes many tools and standards:

| Category | Examples |
|---|---|
| Metrics | Prometheus, Datadog, Grafana Mimir, InfluxDB |
| Logging | Elasticsearch (ELK), Loki, Splunk, Fluentd |
| Tracing | Jaeger, Zipkin, Tempo, Honeycomb |
| All-in-one | Datadog, New Relic, Dynatrace, Grafana Cloud |
| Standards | OpenTelemetry, StatsD, Prometheus exposition format |

OpenTelemetry (OTel) is emerging as the industry standard for instrumentation, providing vendor-neutral APIs, SDKs, and a Collector that cover all three pillars: metrics, logs, and traces.


Summary

Observability is the practice of understanding system behaviour through telemetry data — metrics, logs, and traces. Unlike traditional monitoring that checks for known issues, observability lets you explore unknown problems in complex distributed systems. In the following lessons, we will dive deep into each pillar, learn the tools of the trade, and build a complete observability stack.