3:47 AM

First week on the rotation. I had read the onboarding docs twice. I had the runbook bookmarked. I told myself I was ready, the way you tell yourself you are ready for something you have never actually done.

The page came at 3:47 AM. CRITICAL. checkout-svc p99 latency above 30 seconds.

The investigation

Heart pounding, I opened the laptop in the dark, joined the incident bridge alone, and started digging. I checked the service dashboards. I checked the upstream dependencies. I checked the database, the cache, the load balancer, the recent deploys. Everything was healthy. The latency the alert was screaming about did not appear on any graph I could find.

After forty-five minutes of cold sweat and increasingly frantic queries, I did what the docs said to do when you are stuck. I escalated. I woke my lead. My lead, half asleep and clearly confused, woke their lead.

The truth

At 4:40 AM the senior engineer logged on, looked at the alert for about ten seconds, and muted it.

The alert fired every single night at 3:47 AM. A nightly batch job warmed its caches at that exact time and produced a brief, harmless latency spike that tripped the threshold. It had been firing for two years. Everyone on the team knew. Nobody had written it in the runbook. Nobody had told me. They had simply, over time, learned to sleep through their phones, and the silence I had mistaken for a healthy rotation was actually a team that no longer believed its own alerts.

What we changed

We added a suppression window around the batch job so the alert could not fire on a known, harmless spike. We audited every alert nobody could explain and deleted the ones that no longer meant anything. And we adopted a rule: if an alert is safe to ignore, it is not safe to keep. Either it is worth waking a human, or it does not exist.
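
For anyone who wants to picture the suppression window, here is a minimal sketch in Python. It is not our actual alerting config. The window bounds (3:40 AM for fifteen minutes) and the names `in_suppression_window` and `should_page` are made up for illustration, and the sketch assumes naive local timestamps. The 30-second threshold is the one from the original alert.

```python
from datetime import datetime, time, timedelta

# Hypothetical window around the nightly batch job. Real bounds would
# come from how long the cache warm-up actually takes.
WINDOW_START = time(3, 40)
WINDOW_LENGTH = timedelta(minutes=15)


def in_suppression_window(now: datetime,
                          start: time = WINDOW_START,
                          length: timedelta = WINDOW_LENGTH) -> bool:
    """Return True if `now` falls inside the daily suppression window."""
    # Check today's window, then yesterday's, so a window that spans
    # midnight is still handled correctly.
    for day_offset in (0, -1):
        window_open = datetime.combine(now.date(), start) + timedelta(days=day_offset)
        if window_open <= now < window_open + length:
            return True
    return False


def should_page(now: datetime,
                p99_seconds: float,
                threshold_seconds: float = 30.0) -> bool:
    """Page a human only for a breach outside the known, harmless spike."""
    breached = p99_seconds > threshold_seconds
    return breached and not in_suppression_window(now)
```

With these assumed bounds, a breach at 3:47 AM stays quiet and the same breach at 2:00 PM pages someone. The trade-off is that a real outage inside the window is hidden too, so the window should be as narrow as the batch job allows.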