Cardinality

Someone added a label. Just one. They added user_id to the request counter so that we could break traffic down per user. It was a small, well-intentioned change in a routine pull request, and it was approved in about ninety seconds because it looked, on its face, harmless.

The explosion

There were eight million users.

In a metrics system, a counter is not a single number. It is one separate time series for every unique combination of label values. Add a label with two possible values and you double your series. Add a label with eight million possible values and you create, over time, up to eight million separate time series where there used to be one.
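The arithmetic is easy to sketch. What follows is a minimal illustration, not our backend's actual accounting: the function name max_series and the two-valued "success" label are made up for the example, and only user_id and the eight million figure come from the story. The worst-case series count for one metric is the product of the number of distinct values each label can take.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric.

    label_cardinalities maps a label name to the number of distinct
    values that label can take. With no labels the product is 1:
    a single series.
    """
    return prod(label_cardinalities.values())


# Hypothetical label sets for illustration.
no_labels = {}
two_valued = {"success": 2}
with_user_id = {"user_id": 8_000_000}

print(max_series(no_labels))     # 1
print(max_series(two_valued))    # 2
print(max_series(with_user_id))  # 8000000
```

This is an upper bound. A series only comes into existence once its label combination is actually observed, which is why the growth happened over time rather than all at once.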

The metrics backend ingested them all. It was built to be reliable, so it tried very hard to keep up, accepting series after series, consuming more and more memory to track them, right up until the moment it could not. At 2 AM it fell over. And because it was the central metrics backend for the entire company, when it fell over it took monitoring for every team and every service down with it.

The darkness

For six hours, while a roomful of engineers fought to bring the metrics system back, we were completely blind. Not blind to one service. Blind to all of them. Every dashboard, every alert, every graph across the company went dark at once, and the thing that blinded us was the very label we had added to see more clearly.

What we changed

We put a hard limit on series ingestion per metric, so a single bad label sheds load instead of toppling the whole system. We added a check in code review for high-cardinality labels. And we made it a rule that user IDs, request IDs, and full URLs belong in logs and traces, never in metric labels. We say the incident’s name in hushed tones now. We do not add labels lightly.
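For the curious, the idea behind the per-metric limit can be sketched in a few lines. This is an assumption-laden toy, not the code our backend runs: the class name SeriesLimiter, the admit method, and the cap of three are all invented for illustration. The point it shows is that series already being tracked keep flowing, while new label combinations past the cap are rejected instead of consuming more memory.

```python
class SeriesLimiter:
    """Toy per-metric cap on distinct label combinations (hypothetical)."""

    def __init__(self, max_series_per_metric):
        self.max_series_per_metric = max_series_per_metric
        self._seen = {}  # metric name -> set of label combinations seen

    def admit(self, metric, labels):
        """Return True if a sample for this series should be ingested."""
        # Sort the label pairs so the same labels in any order map to one key.
        key = tuple(sorted(labels.items()))
        series = self._seen.setdefault(metric, set())
        if key in series:
            return True   # existing series: keep accepting samples
        if len(series) >= self.max_series_per_metric:
            return False  # new series over the cap: shed it
        series.add(key)
        return True


limiter = SeriesLimiter(max_series_per_metric=3)

# Five distinct users hit a metric capped at three series.
results = [limiter.admit("requests_total", {"user_id": str(i)}) for i in range(5)]
print(results)  # [True, True, True, False, False]

# A series that was admitted earlier is still accepted.
print(limiter.admit("requests_total", {"user_id": "0"}))  # True

# Other metrics have their own budget and are unaffected.
print(limiter.admit("errors_total", {"status": "500"}))   # True
```

The design choice that matters is the last line: the cap is per metric, so one bad label exhausts only its own budget and everyone else's dashboards stay lit.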
