Monitoring & Alerting

Agenda

  1. Logs
  2. Metrics
  3. Traces
  4. Collecting and Shipping Data
  5. Putting It All Together: On-Call & Automated Deployments

The Problem: Flying Blind

  • Your app is deployed. Is it working?
  • Users report "it's slow" — where's the bottleneck?
  • Something crashed at 3am — what happened?

Without monitoring, you're guessing.

The Three Pillars of Observability

Pillar What It Captures Question It Answers
Logs Discrete events "What exactly happened?"
Metrics Numerical measurements over time "How is the system doing?"
Traces Request flow across services "What did the request do?"

Each pillar gives you a different lens on your system. You need all three.

Logs

Logs: Event Records

{"timestamp": "2024-11-25T10:32:15Z", "level": "INFO",
 "service": "auth", "message": "User login successful",
 "user_id": "12345", "duration_ms": 45}
  • A log is a timestamped record of a discrete event
  • High cardinality — unique values per event (user IDs, request IDs, payloads)
  • Great for answering "what exactly happened?"

Structured vs Unstructured Logs

Unstructured

2024-11-25 10:32:15 INFO User 12345 logged in (45ms)
  • Easy to produce
  • Hard to search and filter

Structured (JSON)

{"timestamp": "...",
 "level": "INFO",
 "user_id": "12345",
 "duration_ms": 45}
  • Machine-parseable
  • Queryable by any field

Always prefer structured logs in production systems.
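
A minimal sketch of producing structured logs from Python with only the standard library (the JSONFormatter class and the field names are our own, mirroring the example above):

import json
import logging
import time

class JSONFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "auth",
            "message": record.getMessage(),
            **getattr(record, "fields", {}),  # extra structured fields, if any
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful",
            extra={"fields": {"user_id": "12345", "duration_ms": 45}})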

Log Levels

Level When to Use
DEBUG Detailed development info (off in prod)
INFO Normal operations worth recording
WARN Something unexpected but recoverable
ERROR Something failed, needs attention
FATAL Process cannot continue

Choosing the right level matters — too many ERRORs and you get alert fatigue, too few and you miss real problems.

The Cost of Logs

  • Logs are expensive at scale
    • A busy service can produce GBs of logs per hour
    • Storage, indexing, and querying all cost money
  • Common strategies:
    • Retention policies: Delete old logs after N days
    • Log levels: Set minimum level per environment
    • Structured logging: Makes filtering cheaper than full-text search
    • Sampling: Only log a percentage of events
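
A rough sketch of two of these strategies in Python: setting the minimum level from an environment variable (the variable name LOG_LEVEL is an assumption) and sampling a noisy DEBUG event:

import logging
import os
import random

# Minimum level per environment, e.g. LOG_LEVEL=WARNING in production
logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO"))
logger = logging.getLogger("checkout")

def handle_request(request_id):
    # Sampling: keep roughly 1% of a very chatty debug event
    if random.random() < 0.01:
        logger.debug("full request payload for %s", request_id)
    logger.info("handled request %s", request_id)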

Log Collection Architecture

  • Apps write logs to a dedicated logfile or stdout/stderr
  • A collector (Fluentd, Filebeat, Vector) picks them up
  • Logs are shipped to a central store (Elasticsearch, Loki, CloudWatch)
  • Query and visualize with Kibana, Grafana, or CloudWatch Insights

Note that the collector runs as a separate process alongside your application. Why is this a good design?

[Diagram: log collection architecture]

Metrics

Metrics: Time Series Data

http_requests_total{method="GET", status="200"} 15234
http_request_duration_seconds{quantile="0.99"} 0.25
node_cpu_usage_percent{instance="web-01"} 72.3
  • A metric is a numerical measurement captured over time
  • Low cardinality — aggregated counts, gauges, histograms
  • Cheap to store, fast to query

Common Metrics Types

Type What It Tracks Example
Counter Cumulative total (only goes up) http_requests_total
Gauge Current value (goes up and down) cpu_usage_percent
Histogram Distribution of values request_duration_seconds
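
A sketch of the three types with the Python prometheus_client library (metric and label names follow the earlier examples):

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
CPU = Gauge("node_cpu_usage_percent", "Current CPU usage", ["instance"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

start_http_server(8000)                              # expose /metrics for scraping
REQUESTS.labels(method="GET", status="200").inc()    # counter: only goes up
CPU.labels(instance="web-01").set(72.3)              # gauge: goes up and down
LATENCY.observe(0.25)                                # histogram: records a distribution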

Metrics Collection: Pull vs Push

Pull Model (Prometheus)

  • Central server scrapes /metrics endpoints
  • Server controls collection rate
  • Easy to tell if a target is down (scrape fails)
  • However, the server needs service discovery to find targets

Push Model (StatsD, OTLP)

  • Services send metrics to a collector
  • Better for short-lived jobs (batch, cron)
  • Requires collector to be available
  • Cannot easily tell if a particular node/pod goes down
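
A tiny push-model sketch using the Python statsd client (the agent address and metric names are assumptions):

import statsd

# Push metrics to a StatsD agent over UDP
client = statsd.StatsClient("localhost", 8125)

client.incr("batch_job.runs")                    # counter increment
client.timing("batch_job.duration", 1834)        # timer, in milliseconds
client.gauge("batch_job.rows_processed", 52000)  # gauge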

Percentiles Matter

Average latency is 50ms. Everything looks fine, right?

p50  =   30ms   ← Half of users see this
p95  =  200ms   ← 1 in 20 users waits this long
p99  = 1500ms   ← 1 in 100 users waits 1.5 seconds
  • Averages hide outliers — a few slow requests disappear in the mean
  • Always look at p95 and p99 for latency
  • At 1M requests/day, that slowest 1% is 10,000 bad user experiences every day
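
A quick illustration in Python of how the mean hides the tail (the latency numbers are made up to roughly match the figures above):

from statistics import mean, quantiles

# 95 fast requests, a few slow ones, one very slow one
latencies_ms = [30] * 95 + [200] * 4 + [1500]

print(f"mean = {mean(latencies_ms):.0f} ms")   # ~52 ms, looks fine
p = quantiles(latencies_ms, n=100)             # 99 percentile cut points
print(f"p50 = {p[49]:.0f} ms, p95 = {p[94]:.0f} ms, p99 = {p[98]:.0f} ms")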

Common Metrics Tools

Category Tools
Collection Prometheus, Datadog, CloudWatch
Storage Prometheus TSDB, InfluxDB, Cortex
Visualization Grafana, Datadog Dashboards
Alerting Prometheus Alertmanager, PagerDuty

Traces

The Problem Traces Solve

You have a slow API response. Where is the time spent?

User → API Gateway → Auth Service → User Service (e.g. Chat) → Database
                                  → Cache
  • Logs tell you what happened in each service
  • Metrics tell you overall latency is high
  • But which service is the bottleneck? Which call is slow?

What is a Trace?

A trace follows a single request as it travels across services.

Trace ID: abc-123
├─ API Gateway          [0ms ─────── 250ms]
│  ├─ Auth Service      [10ms ── 60ms]
│  ├─ User Service      [65ms ──────── 200ms]
│  │  ├─ DB Query       [70ms ── 120ms]
│  │  └─ Cache Lookup   [125ms ─ 140ms]
│  └─ Serialize Response[205ms ─ 245ms]

Each block is a span — one unit of work with a start time, duration, and parent.

How Tracing Works

  1. First service generates a trace ID and span ID
  2. Trace ID is propagated in HTTP headers to downstream services
  3. Each service creates its own spans, linked by the trace ID
  4. All spans are sent to a collector and reassembled into a trace
# Propagated header
traceparent: 00-abc123-span456-01
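
A sketch with the OpenTelemetry Python SDK; nested spans share the parent's trace ID, and instrumented HTTP clients forward it downstream in the traceparent header (service and span names here are made up):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout instead of shipping them to a collector
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("api-gateway")

with tracer.start_as_current_span("handle_request"):     # parent span
    with tracer.start_as_current_span("auth_check"):     # child span, same trace ID
        pass
    with tracer.start_as_current_span("db_query"):
        pass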

[Diagram: how tracing works]

Tracing Tools

Category Tools
Instrumentation OpenTelemetry (OTEL), Jaeger client
Collection OTEL Collector, Jaeger, Zipkin
Storage & Visualization Jaeger UI, Grafana Tempo, Datadog APM

OpenTelemetry is the emerging standard — it covers logs, metrics, and traces with a single SDK.

Sampling: You Can't Trace Everything

  • A trace generates spans in every service a request touches
  • At high throughput, storing every trace is expensive
  • Head sampling: Decide at the start (e.g., trace 1% of requests)
  • Tail sampling: Decide after the fact (e.g., keep traces with errors or high latency)
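
Head sampling can be set in the SDK itself; a sketch with the OpenTelemetry Python SDK's ratio-based sampler (the 1% rate matches the example above). Tail sampling, by contrast, typically happens in a collector that buffers spans until the whole trace is known.

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Head sampling: decide at trace creation time, keep ~1% of traces
provider = TracerProvider(sampler=TraceIdRatioBased(0.01))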

Collecting and Shipping Data (Sidecars)

The Collection Challenge

  • You have 50 containers across 10 nodes
  • Each produces logs, metrics, and traces
  • Each pillar has its own tools: Fluentd for logs, Prometheus for metrics, Jaeger for traces
  • That's 3 collectors per service, each with its own config, format, and destination
  • Multiply by every service in your cluster...

Managing all of this separately is a maintenance nightmare.

The Sidecar Pattern

Idea: Run a collector container alongside the app in the same pod (the sidecar)

spec:
  containers:
  - name: app
    image: myapp:v1
  - name: otel-collector       # Sidecar!
    image: otel/opentelemetry-collector:latest
  • Collection and shipping run in a separate container, keeping that work out of the application process
  • With Helm charts, the same collector configuration can be rolled out to every service

Sidecar Use Cases

  • Log shipping: Collect and forward logs (Fluentd, Vector)
  • Metrics export: Convert app metrics to Prometheus format
  • Trace collection: OTEL collector sidecar
  • Service mesh: Envoy proxy for traffic management + observability

Putting It All Together

The On-Call Experience

It's 3am. Your pager goes off. What do you need?

  1. Metrics tell you something is wrong (error rate spike, latency increase)
  2. Logs tell you what is happening (error messages, stack traces)
  3. Traces tell you where it's happening (which service, which dependency)

Good observability turns a 2-hour mystery into a 10-minute diagnosis.

Building Good Alerts

Not every metric needs an alert. Bad alerts cause alert fatigue.

  • Alert on symptoms, not causes
    • Good: "Error rate > 5% for 5 minutes"
    • Bad: "CPU > 80%" (maybe that's fine under load)
  • Alert on user impact
    • Good: "p99 latency > 2s"
    • Bad: "One pod restarted" (Kubernetes will handle it)
  • Include runbooks — What should you check if this alert goes off?

Automated Rollbacks

Monitoring for Continuous Deployment

You just deployed v2.1. Is it working?

  • How do you know if the deploy succeeded?
  • How quickly can you detect problems?
  • When should you roll back?

Metrics will help you answer all three of these questions.

Deployment Health Signals

Key signals to watch after a deploy:

  • Error rate (5xx responses) — compare to pre-deploy baseline
  • Latency (p50, p95, p99) — did we introduce a regression?
  • Request throughput — is traffic being served?
  • Resource usage (CPU, memory) — any leaks?
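
As a sketch, the error-rate signal can be checked against a threshold via the Prometheus HTTP API (the Prometheus address and the 5% threshold are assumptions; rollback tooling such as Argo Rollouts automates this, as shown next):

import requests

PROMETHEUS = "http://prometheus:9090"   # assumed address
QUERY = ("sum(rate(http_errors_total[5m])) / "
         "sum(rate(http_requests_total[5m]))")

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=5)
result = resp.json()["data"]["result"]
error_rate = float(result[0]["value"][1]) if result else 0.0

if error_rate > 0.05:
    print(f"error rate {error_rate:.1%} above threshold, roll back to previous version")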

Automated Rollback

# Argo Rollouts AnalysisTemplate (simplified)
metrics:
- name: error-rate
  successCondition: result[0] < 0.05   # error rate < 5%
  failureLimit: 3
  provider:
    prometheus:
      address: http://prometheus:9090  # assumed address
      query: |
        sum(rate(http_errors_total[5m])) /
        sum(rate(http_requests_total[5m]))

Incident Timeline: A Bad Deploy

 t=0min   Deploy v2.1 goes out
 t=2min   Metrics: error rate crosses 5% threshold
 t=2min   Automated rollback triggers ← no human needed
 t=5min   Traffic back on v2.0, error rate returns to normal
 t=7min   If things still look bad, page a human
 ─────────────────────────────────────────────────
 t=next day  Engineer investigates asynchronously:
             • Logs: what errors were users hitting?
             • Traces: which service/dependency failed?
             • Fix the bug, deploy v2.2 with confidence

Metrics handle the emergency automatically.
Logs and traces enable the fix — on your schedule, not at 3am.

[Diagram: app → collector → storage → dashboard]