Incident Timeline: A Bad Deploy
t=0min Deploy v2.1 goes out
t=2min Metrics: error rate crosses 5% threshold
t=2min Automated rollback triggers ← no human needed
t=5min Traffic back on v2.0, error rate returns to normal
t=7min If things still look bad, page human
─────────────────────────────────────────────────
t=next day Engineer investigates asynchronously:
• Logs: what errors were users hitting?
• Traces: which service/dependency failed?
• Fix the bug, deploy v2.2 with confidence
Metrics handle the emergency automatically.
Logs and traces enable the fix — on your schedule, not at 3am.