Self-Healing

“It never paged anyone. Why would it? It was handling it.”

We were so proud of the auto-remediation. The logic was clean and satisfying. A pod fails its health check, the orchestrator kills it, a fresh one comes up to take its place. No human in the loop. No 3 AM page for a problem that a restart would have fixed anyway. It had quietly saved us from countless interruptions, and we pointed to it in planning meetings as a model of operational maturity.

The bad config

Then a bad config shipped. Every pod started failing its health check on startup. Not immediately, which is the part that made this a horror story rather than a simple outage. Each pod came up, passed its health check for exactly long enough to be added to the load balancer and receive live traffic, and then failed and died a few seconds later.

The loop

The remediation did its job flawlessly. It saw an unhealthy pod, so it killed it and started a fresh one. The fresh one came up, passed its check, took traffic, failed, and died. So the remediation killed it and started another. And another. And another.

Thousands of times a minute, faster and faster, a perfect and tireless loop of execution. The system was, by its own internal definition, working exactly as designed. Pods that failed were being replaced. The replacement rate just happened to be the entire fleet, several times over, every minute, all night long.

And it never paged anyone. Why would it? Paging was for problems the automation could not handle, and this was a problem the automation was handling with tremendous enthusiasm. It handled it all night. We found out in the morning, from the customers.

What we changed

We capped the remediation rate so that beyond a threshold it stops acting and starts paging instead. We added an alert on the remediation itself, firing when it runs far more often than normal, because a spike in self-healing is a symptom, not a comfort. And we learned to be suspicious of any automation that only ever reports success. A system that is constantly, silently fixing itself is a system that is constantly, silently broken.