Monitoring & Alerting

Agenda

  1. Logs
  2. Metrics
  3. Traces
  4. Collecting and Shipping Data
  5. Putting It All Together: On-Call & Automated Deployments

The Problem: Flying Blind

  • Your app is deployed. Is it working?
  • Users report "it's slow" — where's the bottleneck?
  • Something crashed at 3am — what happened?

Without monitoring, you're guessing.

The Three Pillars of Observability

Pillar What It Captures Question It Answers
Logs Discrete events "What exactly happened?"
Metrics Numerical measurements over time "How is the system doing?"
Traces Request flow across services "What did the request do?"

Each pillar gives you a different lens on your system. You need all three.

Logs

Logs: Event Records

{"timestamp": "2024-11-25T10:32:15Z", "level": "INFO",
 "service": "auth", "message": "User login successful",
 "user_id": "12345", "duration_ms": 45}
  • A log is a timestamped record of a discrete event
  • High cardinality — unique values per event (user IDs, request IDs, payloads)
  • Great for answering "what exactly happened?"

Structured vs Unstructured Logs

Unstructured

2024-11-25 10:32:15 INFO User 12345 logged in (45ms)
  • Easy to produce
  • Hard to search and filter

Structured (JSON)

{"timestamp": "...",
 "level": "INFO",
 "user_id": "12345",
 "duration_ms": 45}
  • Machine-parseable
  • Queryable by any field

Always prefer structured logs in production systems.
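
A minimal sketch of producing structured logs from Python with only the standard library (the JSONFormatter class and the field names are our own, mirroring the example above):

import json
import logging
import time

class JSONFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "auth",
            "message": record.getMessage(),
            **getattr(record, "fields", {}),  # extra structured fields, if any
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful",
            extra={"fields": {"user_id": "12345", "duration_ms": 45}})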

Log Levels

Level When to Use
DEBUG Detailed development info (off in prod)
INFO Normal operations worth recording
WARN Something unexpected but recoverable
ERROR Something failed, needs attention
FATAL Process cannot continue

Choosing the right level matters — too many ERRORs and you get alert fatigue, too few and you miss real problems.

The Cost of Logs

  • Logs are expensive at scale
    • A busy service can produce GBs of logs per hour
    • Storage, indexing, and querying all cost money
  • Common strategies:
    • Retention policies: Delete old logs after N days
    • Log levels: Set minimum level per environment
    • Structured logging: Makes filtering cheaper than full-text search
    • Sampling: Only log a percentage of events
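
A rough sketch of two of these strategies in Python: setting the minimum level from an environment variable (the variable name LOG_LEVEL is an assumption) and sampling a noisy DEBUG event:

import logging
import os
import random

# Minimum level per environment, e.g. LOG_LEVEL=WARNING in production
logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO"))
logger = logging.getLogger("checkout")

def handle_request(request_id):
    # Sampling: keep roughly 1% of a very chatty debug event
    if random.random() < 0.01:
        logger.debug("full request payload for %s", request_id)
    logger.info("handled request %s", request_id)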

Log Collection Architecture

  • Apps write logs to a dedicated logfile or stdout/stderr
  • A collector (Fluentd, Filebeat, Vector) picks them up
  • Logs are shipped to a central store (Elasticsearch, Loki, CloudWatch)
  • Query and visualize with Kibana, Grafana, or CloudWatch Insights

Note that the collector runs as a separate process alongside your application. Why is this a good design?

[Diagram: log collection architecture]

Metrics

Metrics: Time Series Data

http_requests_total{method="GET", status="200"} 15234
http_request_duration_seconds{quantile="0.99"} 0.25
node_cpu_usage_percent{instance="web-01"} 72.3
  • A metric is a numerical measurement captured over time
  • Low cardinality — aggregated counts, gauges, histograms
  • Cheap to store, fast to query

Common Metrics Types

Type What It Tracks Example
Counter Cumulative total (only goes up) http_requests_total
Gauge Current value (goes up and down) cpu_usage_percent
Histogram Distribution of values request_duration_seconds
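
A sketch of the three types with the Python prometheus_client library (metric and label names follow the earlier examples):

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
CPU = Gauge("node_cpu_usage_percent", "Current CPU usage", ["instance"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

start_http_server(8000)                              # expose /metrics for scraping
REQUESTS.labels(method="GET", status="200").inc()    # counter: only goes up
CPU.labels(instance="web-01").set(72.3)              # gauge: goes up and down
LATENCY.observe(0.25)                                # histogram: records a distribution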

Metrics Collection: Pull vs Push

Pull Model (Prometheus)

  • Central server scrapes /metrics endpoints
  • Server controls collection rate
  • Easy to tell if a target is down (scrape fails)
  • However, the server needs service discovery to find targets

Push Model (StatsD, OTLP)

  • Services send metrics to a collector
  • Better for short-lived jobs (batch, cron)
  • Requires collector to be available
  • Cannot easily tell if a particular node/pod goes down
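
A tiny push-model sketch using the Python statsd client (the agent address and metric names are assumptions):

import statsd

# Push metrics to a StatsD agent over UDP
client = statsd.StatsClient("localhost", 8125)

client.incr("batch_job.runs")                    # counter increment
client.timing("batch_job.duration", 1834)        # timer, in milliseconds
client.gauge("batch_job.rows_processed", 52000)  # gauge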

Percentiles Matter

Average latency is 50ms. Everything looks fine, right?

p50  =   30ms   ← Half of users see this
p95  =  200ms   ← 1 in 20 users waits this long
p99  = 1500ms   ← 1 in 100 users waits 1.5 seconds
  • Averages hide outliers — a few slow requests disappear in the mean
  • Always look at p95 and p99 for latency
  • At 1M requests/day, that slowest 1% is 10,000 bad user experiences every day
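
A quick illustration in Python of how the mean hides the tail (the latency numbers are made up to roughly match the figures above):

from statistics import mean, quantiles

# 95 fast requests, a few slow ones, one very slow one
latencies_ms = [30] * 95 + [200] * 4 + [1500]

print(f"mean = {mean(latencies_ms):.0f} ms")   # ~52 ms, looks fine
p = quantiles(latencies_ms, n=100)             # 99 percentile cut points
print(f"p50 = {p[49]:.0f} ms, p95 = {p[94]:.0f} ms, p99 = {p[98]:.0f} ms")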

Common Metrics Tools

Category Tools
Collection Prometheus, Datadog, CloudWatch
Storage Prometheus TSDB, InfluxDB, Cortex
Visualization Grafana, Datadog Dashboards
Alerting Prometheus Alertmanager, PagerDuty

Traces

The Problem Traces Solve

You have a slow API response. Where is the time spent?

User → API Gateway → Auth Service → User Service (e.g. Chat) → Database
                                  → Cache
  • Logs tell you what happened in each service
  • Metrics tell you overall latency is high
  • But which service is the bottleneck? Which call is slow?

What is a Trace?

A trace follows a single request as it travels across services.

Trace ID: abc-123
├─ API Gateway          [0ms ─────── 250ms]
│  ├─ Auth Service      [10ms ── 60ms]
│  ├─ User Service      [65ms ──────── 200ms]
│  │  ├─ DB Query       [70ms ── 120ms]
│  │  └─ Cache Lookup   [125ms ─ 140ms]
│  └─ Serialize Response[205ms ─ 245ms]

Each block is a span — one unit of work with a start time, duration, and parent.

How Tracing Works

  1. First service generates a trace ID and span ID
  2. Trace ID is propagated in HTTP headers to downstream services
  3. Each service creates its own spans, linked by the trace ID
  4. All spans are sent to a collector and reassembled into a trace
# Propagated header
traceparent: 00-abc123-span456-01
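
A sketch with the OpenTelemetry Python SDK; nested spans share the parent's trace ID, and instrumented HTTP clients forward it downstream in the traceparent header (service and span names here are made up):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout instead of shipping them to a collector
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("api-gateway")

with tracer.start_as_current_span("handle_request"):     # parent span
    with tracer.start_as_current_span("auth_check"):     # child span, same trace ID
        pass
    with tracer.start_as_current_span("db_query"):
        pass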

[Diagram: how tracing works]

Tracing Tools

Category Tools
Instrumentation OpenTelemetry (OTEL), Jaeger client
Collection OTEL Collector, Jaeger, Zipkin
Storage & Visualization Jaeger UI, Grafana Tempo, Datadog APM

OpenTelemetry is the emerging standard — it covers logs, metrics, and traces with a single SDK.

Sampling: You Can't Trace Everything

  • A trace generates spans in every service a request touches
  • At high throughput, storing every trace is expensive
  • Head sampling: Decide at the start (e.g., trace 1% of requests)
  • Tail sampling: Decide after the fact (e.g., keep traces with errors or high latency)
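
Head sampling can be set in the SDK itself; a sketch with the OpenTelemetry Python SDK's ratio-based sampler (the 1% rate matches the example above). Tail sampling, by contrast, typically happens in a collector that buffers spans until the whole trace is known.

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Head sampling: decide at trace creation time, keep ~1% of traces
provider = TracerProvider(sampler=TraceIdRatioBased(0.01))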

Collecting and Shipping Data (Sidecars)

The Collection Challenge

  • You have 50 containers across 10 nodes
  • Each produces logs, metrics, and traces
  • Each pillar has its own tools: Fluentd for logs, Prometheus for metrics, Jaeger for traces
  • That's 3 collectors per service, each with its own config, format, and destination
  • Multiply by every service in your cluster...

Managing all of this separately is a maintenance nightmare.

The Sidecar Pattern

Idea: Run a collector container alongside the app in the same pod (the sidecar)

spec:
  containers:
  - name: app
    image: myapp:v1
  - name: otel-collector       # Sidecar!
    image: otel/opentelemetry-collector:latest
  • Collection and shipping run in a separate container, keeping that work out of the application process
  • With Helm charts, the same collector configuration can be rolled out to every service

Sidecar Use Cases

  • Log shipping: Collect and forward logs (Fluentd, Vector)
  • Metrics export: Convert app metrics to Prometheus format
  • Trace collection: OTEL collector sidecar
  • Service mesh: Envoy proxy for traffic management + observability

Putting It All Together

The On-Call Experience

It's 3am. Your pager goes off. What do you need?

  1. Metrics tell you something is wrong (error rate spike, latency increase)
  2. Logs tell you what is happening (error messages, stack traces)
  3. Traces tell you where it's happening (which service, which dependency)

Good observability turns a 2-hour mystery into a 10-minute diagnosis.

Building Good Alerts

Not every metric needs an alert. Bad alerts cause alert fatigue.

  • Alert on symptoms, not causes
    • Good: "Error rate > 5% for 5 minutes"
    • Bad: "CPU > 80%" (maybe that's fine under load)
  • Alert on user impact
    • Good: "p99 latency > 2s"
    • Bad: "One pod restarted" (Kubernetes will handle it)
  • Include runbooks — What should you check if this alert goes off?

Automated Rollbacks

Monitoring for Continuous Deployment

You just deployed v2.1. Is it working?

  • How do you know if the deploy succeeded?
  • How quickly can you detect problems?
  • When should you roll back?

Metrics will help you answer all three of these questions.

Deployment Health Signals

Key signals to watch after a deploy:

  • Error rate (5xx responses) — compare to pre-deploy baseline
  • Latency (p50, p95, p99) — did we introduce a regression?
  • Request throughput — is traffic being served?
  • Resource usage (CPU, memory) — any leaks?
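
As a sketch, the error-rate signal can be checked against a threshold via the Prometheus HTTP API (the Prometheus address and the 5% threshold are assumptions; rollback tooling such as Argo Rollouts automates this, as shown next):

import requests

PROMETHEUS = "http://prometheus:9090"   # assumed address
QUERY = ("sum(rate(http_errors_total[5m])) / "
         "sum(rate(http_requests_total[5m]))")

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=5)
result = resp.json()["data"]["result"]
error_rate = float(result[0]["value"][1]) if result else 0.0

if error_rate > 0.05:
    print(f"error rate {error_rate:.1%} above threshold, roll back to previous version")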

Automated Rollback

# Argo Rollouts AnalysisTemplate (simplified)
metrics:
- name: error-rate
  successCondition: result[0] < 0.05   # error rate < 5%
  failureLimit: 3
  provider:
    prometheus:
      address: http://prometheus:9090  # assumed address
      query: |
        sum(rate(http_errors_total[5m])) /
        sum(rate(http_requests_total[5m]))

Incident Timeline: A Bad Deploy

 t=0min   Deploy v2.1 goes out
 t=2min   Metrics: error rate crosses 5% threshold
 t=2min   Automated rollback triggers ← no human needed
 t=5min   Traffic back on v2.0, error rate returns to normal
 t=7min   If things still look bad, page a human
 ─────────────────────────────────────────────────
 t=next day  Engineer investigates asynchronously:
             • Logs: what errors were users hitting?
             • Traces: which service/dependency failed?
             • Fix the bug, deploy v2.2 with confidence

Metrics handle the emergency automatically.
Logs and traces enable the fix — on your schedule, not at 3am.

[Diagram: app → collector → storage → dashboard]