The Alert That Never Fired

We had an alert for the disk filling up. We were proud of it. We had tested it in staging. It had paged us before, months earlier, and we had fixed the problem and gone back to sleep feeling like competent adults.

The night the primary database died, the alert stayed silent.

What we expected

Our mental model was simple. Disk usage climbs slowly over days as data grows. At eighty percent we get a warning. At ninety percent we get paged. Plenty of runway to add capacity or clean up. This is how it had always worked, so this is how we assumed it would always fail.

What actually happened

The disk did not fill gradually. A runaway analytical query spilled to temporary space and the disk went from sixty percent to one hundred percent in under a minute.

Our alert evaluated every five minutes. By the time the next evaluation cycle came around, the disk was already full, the database process had already crashed, and the metrics agent that reported disk usage had crashed right along with it because it too needed to write to that full disk. There was no metric to evaluate. There was no signal to threshold. The thing that was supposed to tell us about the death had died of the same cause, at the same moment, for the same reason.

You cannot be paged by a corpse.
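
The timing gap is easier to see with numbers. What follows is a toy model, not our monitoring stack: the function names, the linear fill, and the specific start time and duration are all assumptions made for illustration. It asks one question: if the alert evaluates on a fixed interval and the agent stops reporting once the disk is full, does any evaluation ever see a reading at or above the paging threshold?

```python
from typing import Optional


def disk_pct(t: float, fill_start: float, fill_duration: float,
             start_pct: float = 60.0) -> Optional[float]:
    """Disk usage at time t (seconds), or None once the disk is full.

    Hypothetical model: usage sits at start_pct, then climbs linearly to
    100 percent over fill_duration seconds. At 100 percent the metrics
    agent can no longer write, so no sample exists.
    """
    if t < fill_start:
        return start_pct
    progress = (t - fill_start) / fill_duration
    if progress >= 1.0:
        return None  # the agent died with the disk
    return start_pct + (100.0 - start_pct) * progress


def alert_would_page(interval: float, fill_start: float, fill_duration: float,
                     page_pct: float = 90.0, horizon: float = 900.0) -> bool:
    """True if any scheduled evaluation sees a sample at or above page_pct."""
    t = 0.0
    while t <= horizon:
        sample = disk_pct(t, fill_start, fill_duration)
        if sample is not None and sample >= page_pct:
            return True
        t += interval
    return False


# A fill that starts at t=130s and takes 50s, checked every five minutes:
# the evaluations land at 0s and 300s, before the spike and after the crash.
print(alert_would_page(interval=300, fill_start=130, fill_duration=50))
```

In this model the five-minute alert never sees a value above sixty percent: one evaluation lands before the spike, and every later one finds no sample at all. A shorter interval improves the odds of catching the climb, but whether it does still depends on where an evaluation happens to land inside the spike.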

What we changed

We added a heartbeat alert that fires when a host stops reporting metrics entirely, on the theory that silence is itself a symptom. We cut the evaluation interval to thirty seconds for fast-moving resources. And we wrote it in the runbook in plain words: the most dangerous failures are the ones that also take out your ability to see them.
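
For the curious, the heartbeat idea reduces to something like the sketch below. It is a minimal illustration under assumptions, not our actual alerting rule: the function name, the host names, and the ninety-second silence limit are all hypothetical. It also assumes the check runs somewhere other than the host being watched, since a check that lives on the dying machine dies with it.

```python
import time
from typing import Dict, List, Optional


def silent_hosts(last_seen: Dict[str, float],
                 max_silence_s: float = 90.0,
                 now: Optional[float] = None) -> List[str]:
    """Return the hosts whose most recent metric is older than max_silence_s.

    last_seen maps a host name to the Unix timestamp of its latest sample.
    The 90-second default is an illustrative assumption, chosen here as a
    few missed thirty-second reports.
    """
    current = time.time() if now is None else now
    return sorted(host for host, ts in last_seen.items()
                  if current - ts > max_silence_s)


# Hypothetical snapshot: the database last reported 100 seconds ago,
# the web host 10 seconds ago.
snapshot = {"db-primary": 1000.0, "web-1": 1090.0}
print(silent_hosts(snapshot, max_silence_s=90.0, now=1100.0))
```

The point of the design is that the alert keys on the absence of data rather than on its value, so it still has something to evaluate after the thing producing the data is gone.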