Know the dark

Lessons from the fire

The concepts behind the tales, explained plainly. Each one is a kind of monster you will eventually meet on call. Here is how to recognize it.

What is observability?

Observability is the ability to understand what a system is doing internally based only on the signals it emits from the outside: its metrics, logs, and traces. A system is observable when you can answer new questions about its behavior without shipping new code to ask them.

What are the three pillars of observability?

Metrics are numeric measurements aggregated over time, such as request rate or error count. Logs are timestamped records of discrete events. Traces follow a single request across every service it touches. Each answers a different question, and most real incidents need more than one.

What is stale monitoring data, and why is it dangerous?

Stale data is monitoring data that has stopped updating while still being displayed. A dashboard holding its last received value looks identical to a healthy flat line. Without a freshness check, you cannot tell the difference between a calm system and a dead pipeline.
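
A freshness check can be sketched in a few lines. This is a minimal illustration, not any particular tool's API: the `Sample` type, the `freshness` function, and the 60-second limit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    value: float
    timestamp: float  # Unix seconds when the sample was received


def freshness(sample: Sample, now: float, max_age_s: float = 60.0) -> str:
    """Return "fresh" or "stale" based on the age of the last sample."""
    age = now - sample.timestamp
    return "stale" if age > max_age_s else "fresh"


# Two dashboards showing the same flat value of 0.0.
calm = Sample(value=0.0, timestamp=1_000.0)  # updated 10 seconds ago
dead = Sample(value=0.0, timestamp=400.0)    # last updated 610 seconds ago

print(freshness(calm, now=1_010.0))  # fresh
print(freshness(dead, now=1_010.0))  # stale
```

The value alone cannot separate the two cases; only the age of the sample can.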

Read the tale: The Dashboard That Lied

What is alert fatigue?

Alert fatigue is what happens when alerts fire so often, or so often without meaning, that responders learn to ignore them. An alert everyone has been trained to dismiss is worse than no alert, because it teaches the team to sleep through the one that matters.

Read the tale: 3:47 AM

Why do some alerts fail to fire during an outage?

An alert that depends on the failing system cannot warn you about that system failing. If the metric pipeline dies with the host, there is no signal left to threshold. Reliable monitoring also watches for the absence of healthy signals, not only the presence of bad ones.
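
One way to picture this is an alert rule with three outcomes instead of two. The sketch below is hypothetical: the function names, the `NO_DATA` state, and the 5% threshold are assumptions for illustration, not the behavior of a specific alerting product.

```python
from typing import Optional


def naive_evaluate(error_rate: Optional[float], threshold: float = 0.05) -> str:
    """A rule that only looks for bad values stays quiet when data vanishes."""
    if error_rate is not None and error_rate > threshold:
        return "FIRING"
    return "OK"


def evaluate(error_rate: Optional[float], threshold: float = 0.05) -> str:
    """Classify one evaluation of an error-rate rule.

    error_rate is None when no samples arrived in the evaluation window.
    """
    if error_rate is None:
        # Silence is treated as a failure of its own, not as "all clear".
        return "NO_DATA"
    if error_rate > threshold:
        return "FIRING"
    return "OK"


# The host and its metric pipeline are both down, so no samples arrive.
print(naive_evaluate(None))  # OK
print(evaluate(None))        # NO_DATA
```

For this to help, the `NO_DATA` state has to be evaluated and routed somewhere that does not share the failing host's fate.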

Read the tale: The Alert That Never Fired

What is a cardinality explosion in metrics?

In a metrics system, every unique combination of label values becomes its own time series. Adding a high-cardinality label such as a user ID or request ID can turn one series into millions, exhausting the metrics backend. Keep unbounded identifiers in logs and traces, never in metric labels.
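
The arithmetic is simple multiplication. The sketch below computes the upper bound on series for one metric as the product of distinct values per label; the label names and counts are made up for illustration, and real series counts are lower when some combinations never occur.

```python
from math import prod
from typing import Mapping


def series_count(distinct_values: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric.

    distinct_values maps each label name to how many distinct values it takes.
    """
    return prod(distinct_values.values())


# Hypothetical labels with small, bounded value sets.
bounded = {"method": 5, "status": 6, "region": 4}
print(series_count(bounded))  # 120

# The same metric after someone adds a user ID label.
with_user_id = {**bounded, "user_id": 100_000}
print(series_count(with_user_id))  # 12000000
```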

Read the tale: Cardinality

What is the difference between head and tail sampling?

Sampling keeps only a fraction of traces to control cost. Head sampling decides at the start of a request, before you know if it will fail, so rare errors are usually discarded. Tail sampling decides after the request finishes, so it can keep every slow or failed request that matters, at the cost of holding each trace until it completes.
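
The difference shows up clearly in a toy example. This sketch assumes a deterministic sampler keyed on the trace ID; the `Trace` type, the function names, and the one-second slow threshold are illustrative assumptions, not a real tracing library's interface.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: int
    duration_ms: float
    error: bool


def head_sample(trace_id: int, percent: int) -> bool:
    """Decide at request start: only the trace ID is known."""
    return trace_id % 100 < percent


def tail_sample(trace: Trace, percent: int, slow_ms: float = 1000.0) -> bool:
    """Decide after completion: keep failed or slow traces, sample the rest."""
    if trace.error or trace.duration_ms >= slow_ms:
        return True
    return trace.trace_id % 100 < percent


# 100 fast requests, one of which (ID 57) fails.
traces = [Trace(i, 20.0, error=(i == 57)) for i in range(100)]

head_errors = sum(1 for t in traces if t.error and head_sample(t.trace_id, 10))
tail_errors = sum(1 for t in traces if t.error and tail_sample(t, 10))
print(head_errors)  # 0
print(tail_errors)  # 1
```

At a 10% rate, head sampling drops the only failure, while tail sampling keeps it along with the same ten routine traces.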

Read the tale: The Silent Sampler

How does clock skew corrupt observability data?

Clock skew is the gap between a machine's clock and real time, and it grows as the clock drifts. A skewed host stamps its metrics, logs, and traces with the wrong time, silently poisoning aggregates and breaking trace ordering. A wrong clock throws no error, so drift must be monitored directly.
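
A small sketch shows both the damage and the check. It assumes timestamps in integer milliseconds and a trusted reference clock to compare against; the function names and the 500 ms tolerance are hypothetical.

```python
def stamped_ms(real_ms: int, host_offset_ms: int) -> int:
    """Timestamp a host writes, given the real time and its clock offset."""
    return real_ms + host_offset_ms


def is_skewed(host_ms: int, reference_ms: int, tolerance_ms: int = 500) -> bool:
    """Flag a host whose clock differs from the reference by more than the tolerance."""
    return abs(host_ms - reference_ms) > tolerance_ms


# A parent span starts on host A, then calls host B 200 ms later.
# Host B's clock runs one second behind.
parent_start = stamped_ms(100_000, host_offset_ms=0)
child_start = stamped_ms(100_200, host_offset_ms=-1_000)

print(child_start < parent_start)  # True: the child appears to start first

# Monitoring the offset directly catches what no error message will.
print(is_skewed(host_ms=99_200, reference_ms=100_200))  # True
```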

Read the tale: The Clock That Drifted

What is a cascading failure?

A cascading failure is when one component's failure overloads another, which fails and overloads the next, until the whole system collapses. Observability tooling can cause its own cascade when it consumes the very resource it monitors, such as disk or memory.

Read the tale: It Was the Logs All Along

When does self-healing automation become a problem?

Auto-remediation fixes known failures without a human, such as restarting an unhealthy instance. It becomes dangerous when it suppresses a symptom in a tight loop without ever resolving the cause, and pages no one because, by its own definition, it is succeeding. Cap how often it may run, and alert when it runs more than usual.
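
A rate cap with escalation might look like the sketch below. The `RemediationLimiter` class, its return values, and the limits used are assumptions for illustration; a real system would wire "escalate" to its paging tool.

```python
from collections import deque


class RemediationLimiter:
    """Allow at most max_runs automatic fixes per sliding window."""

    def __init__(self, max_runs: int, window_s: float) -> None:
        self.max_runs = max_runs
        self.window_s = window_s
        self._runs: deque = deque()  # times of recent automatic fixes

    def request(self, now: float) -> str:
        """Return "remediate" if under the cap, else "escalate" to a human."""
        # Forget runs that have aged out of the window.
        while self._runs and now - self._runs[0] >= self.window_s:
            self._runs.popleft()
        if len(self._runs) >= self.max_runs:
            return "escalate"
        self._runs.append(now)
        return "remediate"


# At most 3 restarts per 10 minutes.
limiter = RemediationLimiter(max_runs=3, window_s=600.0)
results = [limiter.request(t) for t in (0.0, 60.0, 120.0, 180.0)]
print(results)  # ['remediate', 'remediate', 'remediate', 'escalate']
```

The fourth restart inside ten minutes is the signal that the loop is hiding something, so it goes to a person instead of running again.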

Read the tale: Self-Healing