The Clock That Drifted

Errors were happening in the future.

The graph showed a spike of failures at a timestamp that had not occurred yet, several minutes ahead of the present moment. The first thing you assume, when your monitoring shows you something physically impossible, is that the monitoring is broken. So that is what we assumed, and we very nearly closed the investigation right there.

The hunt

The future spike kept appearing and kept moving forward, always a fixed distance ahead of now. That fixed distance was the clue. If the monitoring were simply buggy, we would have expected the misplaced timestamps to land at random. A constant offset is not a bug. A constant offset is a measurement of something real.

We tracked it to a single node. Its clock had drifted eleven minutes ahead of every other machine in the fleet. The NTP daemon on that host had silently failed weeks earlier, and nobody noticed, because a wrong clock does not throw an error. It just confidently reports the wrong time to everything that asks.
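A hunt like this can be sketched in a few lines. The following is a minimal illustration, not the tooling we actually ran: it assumes each sample carries both the timestamp the host stamped on it and the time a collector received it, and it looks for hosts whose difference between the two is large and steady. The function name, the sample layout, the host names, and the 60-second threshold are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def find_skewed_hosts(samples, threshold_s=60.0):
    """Return {host: median_offset_seconds} for hosts whose clocks look skewed.

    samples: iterable of (host, reported_ts, received_ts), epoch seconds.
    reported_ts is stamped by the host; received_ts by the collector.
    A positive offset means the host's clock runs ahead of the collector's.
    """
    offsets = defaultdict(list)
    for host, reported_ts, received_ts in samples:
        offsets[host].append(reported_ts - received_ts)

    skewed = {}
    for host, values in offsets.items():
        # The median ignores the odd delayed sample; a steady offset survives it.
        typical = median(values)
        if abs(typical) > threshold_s:
            skewed[host] = typical
    return skewed


# Hypothetical samples: node-b stamps its data roughly eleven minutes ahead.
samples = [
    ("node-a", 1000.0, 1000.4),
    ("node-a", 1060.0, 1060.3),
    ("node-b", 1660.0, 1000.5),
    ("node-b", 1720.0, 1060.2),
]
print(find_skewed_hosts(samples))
```

Only node-b is reported. Normal network delay makes a host look slightly behind the collector, never minutes ahead of it.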

The damage

That one node stamped its metrics, its logs, and its traces with timestamps from eleven minutes in the future, scattered ahead of the present like footprints leading away from a body.

Every dashboard that aggregated across the fleet was quietly, subtly wrong. Averages were smeared across time windows that did not line up. Alerts evaluated windows that mixed real data with data from the future. Traces spanning that node showed child operations that appeared to finish before their parents began. None of it threw an error. All of it was wrong.
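The impossible traces are the easiest of these symptoms to test for mechanically. Here is a rough sketch, assuming a simplified span record with a parent reference and start and end times in epoch seconds. The `Span` type and its field names are invented for the example and do not come from any particular tracing library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    host: str
    start: float  # epoch seconds, stamped by the host that ran the span
    end: float


def causality_violations(spans):
    """List (child_id, parent_id, gap_seconds) where a child ends before its parent starts."""
    by_id = {s.span_id: s for s in spans}
    violations = []
    for span in spans:
        parent = by_id.get(span.parent_id) if span.parent_id else None
        if parent is not None and span.end < parent.start:
            violations.append((span.span_id, parent.span_id, parent.start - span.end))
    return violations


# Hypothetical trace: the parent ran on the host whose clock is far ahead.
trace = [
    Span("root", None, "node-b", start=1660.0, end=1661.0),
    Span("child", "root", "node-a", start=1000.2, end=1000.6),
]
for child_id, parent_id, gap in causality_violations(trace):
    print(f"{child_id} ended {gap:.1f}s before {parent_id} started")
```

A check like this cannot say which clock is wrong, only that two of them disagree. The size of the gap hints at how badly.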

What we changed

We added direct monitoring of clock drift on every host and made meaningful skew a paging condition. We treat NTP failure as a real incident, not a cosmetic one. And we say it now like a small prayer before sleep: keep the clocks in sync, so the timestamps stay where you left them.
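The paging rule itself can be tiny. What follows is a sketch of the idea rather than our actual alert configuration. It assumes some collector already reports each host's clock offset in seconds, or reports nothing at all when the sync daemon cannot be queried. The thresholds are placeholders and would need to be tuned to whatever counts as meaningful skew for a given fleet.

```python
from typing import Optional

# Placeholder thresholds, chosen only for illustration.
WARN_SKEW_S = 0.1
PAGE_SKEW_S = 1.0


def classify_skew(offset_s: Optional[float]) -> str:
    """Map a measured clock offset to an alert level: "ok", "warn", or "page".

    offset_s is the host clock minus the reference clock, in seconds.
    None means no measurement was available, which is treated as an
    incident in its own right because a failed sync daemon raises no error.
    """
    if offset_s is None:
        return "page"
    magnitude = abs(offset_s)
    if magnitude >= PAGE_SKEW_S:
        return "page"
    if magnitude >= WARN_SKEW_S:
        return "warn"
    return "ok"


# Hypothetical readings: a healthy host, a drifting one, the eleven-minute
# case, and a host whose sync daemon has gone silent.
for reading in (0.003, -0.25, 660.0, None):
    print(reading, classify_skew(reading))
```

The important branch is the first one. A host that cannot report its offset is paged exactly like a host that reports a bad one.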
