Self-Healing
Our auto-remediation killed unhealthy pods and restarted them. A bad config made every pod fail just after taking traffic, so the loop killed and restarted pods thousands of times a minute, all night, and never paged a soul.
A haunted campfire for observability war stories. The pages that woke you at 3am, the dashboards that lied, and the alerts that never fired.
What is this place
o11yhorrors is a campfire for those stories. The outages that taught us something true about observability, told plainly, with the technical detail intact and the lesson left where you can find it.
Observability is the practice of understanding what a system is doing from the signals it emits: its metrics, its logs, and its traces. These are the tales of what happens when those signals go quiet, go stale, or go wrong at exactly the moment you need them most. Every one is true. Only the service names were changed to protect the on-call.
Pull up a log
A few of the stories that still keep their tellers awake. Each one ends with what it taught them.
Someone added one label to a counter so we could break traffic down per user. There were eight million users. Each became a time series, and the metrics backend fell over and took monitoring for the whole company with it.
A new hire's first on-call. A critical page at 3:47 AM. Forty-five minutes of cold-sweat investigation before learning the alert had fired every night for two years and everyone just slept through it.
We had a disk-full alert. We had tested it. The night the database died, it stayed silent, because the metric pipeline that fed it died right along with the disk.
Every graph was green while checkout was completely down. The dashboard was not showing healthy traffic. It was showing the last thing it ever saw.
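The silent disk alert and the frozen dashboard share one mechanism: when the data stopped arriving, the absence was read as "nothing wrong". A minimal sketch of the usual countermeasure, under stated assumptions, is to give missing or stale data a state of its own. Every name here (`Sample`, `evaluate`, the 0.9 threshold, the 120-second freshness window) is hypothetical and chosen for illustration; none of it comes from the tales themselves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    value: float      # e.g. fraction of disk used, 0.0 to 1.0
    timestamp: float  # Unix seconds when the sample was recorded


def evaluate(latest: Optional[Sample], now: float,
             threshold: float = 0.9, max_age_s: float = 120.0) -> str:
    """Return "ok", "firing", or "no_data" for a simple threshold alert.

    Illustrative assumption: a missing or stale sample is reported as its
    own state instead of being folded into "ok".
    """
    if latest is None or now - latest.timestamp > max_age_s:
        return "no_data"  # the pipeline went quiet; treat this as page-worthy
    if latest.value >= threshold:
        return "firing"
    return "ok"
```

With `now = 1000.0`, a fresh sample of `0.42` evaluates to `"ok"`, a fresh sample of `0.95` to `"firing"`, and any sample older than the freshness window, or no sample at all, to `"no_data"`. The design choice is that the third state gets routed to a human rather than drawn in green.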
Know your monsters
The same failures haunt every team. Learn to recognize them before they visit you.
Pages that never fire, and pages that never stop. #alerting
Green graphs over a burning house. #dashboards
One innocent label, eight million time series. #cardinality
Saving money by discarding the only request that mattered. #sampling
Self-healing loops that quietly heal nothing. #automation
The 3am page, the runbook that lied, the silence. #on-call
Add to the fire
Share the incident you cannot forget. Submissions are read by a human, and the good ones get published with whatever name you choose, including none at all.