<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>o11yhorrors: observability war stories</title><description>True observability horror stories from on-call engineers: dashboards that lied, alerts that never fired, cardinality explosions, sampling blind spots, and the lessons each one taught. Read the tales and submit your own.</description><link>https://o11yhorrors.com/</link><language>en-us</language><item><title>Self-Healing</title><link>https://o11yhorrors.com/tales/self-healing/</link><guid isPermaLink="true">https://o11yhorrors.com/tales/self-healing/</guid><description>Our auto-remediation killed unhealthy pods and restarted them. A bad config made every pod fail just after taking traffic, so it killed and restarted thousands of times a minute, all night, and never paged a soul.</description><pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate><category>automation</category><category>remediation</category><category>feedback-loops</category></item><item><title>The Clock That Drifted</title><link>https://o11yhorrors.com/tales/the-clock-that-drifted/</link><guid isPermaLink="true">https://o11yhorrors.com/tales/the-clock-that-drifted/</guid><description>Errors were happening in the future. One node&apos;s clock had drifted eleven minutes ahead, and its metrics arrived stamped from a time that had not occurred yet, quietly poisoning every aggregate.</description><pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate><category>time</category><category>metrics</category><category>clock-skew</category></item><item><title>Cardinality</title><link>https://o11yhorrors.com/tales/the-cardinality-incident/</link><guid isPermaLink="true">https://o11yhorrors.com/tales/the-cardinality-incident/</guid><description>Someone added one label to a counter so we could break traffic down per user. There were eight million users. Each became a time series, and the metrics backend fell over and took monitoring for the whole company with it.</description><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><category>metrics</category><category>cardinality</category><category>monitoring-outage</category></item><item><title>The Silent Sampler</title><link>https://o11yhorrors.com/tales/the-silent-sampler/</link><guid isPermaLink="true">https://o11yhorrors.com/tales/the-silent-sampler/</guid><description>We turned on one percent trace sampling to save money. Three months later a rare bug appeared, and the traces that could have explained it had statistically never been recorded.</description><pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate><category>tracing</category><category>sampling</category><category>cost-control</category></item><item><title>It Was the Logs All Along</title><link>https://o11yhorrors.com/tales/it-was-the-logs-all-along/</link><guid isPermaLink="true">https://o11yhorrors.com/tales/it-was-the-logs-all-along/</guid><description>A slow service logged more to help us debug. The extra logs filled the volume, the agent buffered to memory, the OOM killer struck, and the restart logged even more. We built a machine that described its own death.</description><pubDate>Sat, 21 Feb 2026 00:00:00 GMT</pubDate><category>logging</category><category>cascading-failure</category><category>resource-exhaustion</category></item><item><title>3:47 AM</title><link>https://o11yhorrors.com/tales/the-3-47-am-page/</link><guid isPermaLink="true">https://o11yhorrors.com/tales/the-3-47-am-page/</guid><description>A new hire&apos;s first on-call. A critical page at 3:47 AM. Forty-five minutes of cold-sweat investigation before learning the alert had fired every night for two years and everyone just slept through it.</description><pubDate>Sun, 08 Feb 2026 00:00:00 GMT</pubDate><category>on-call</category><category>paging</category><category>alert-fatigue</category></item><item><title>The Alert That Never Fired</title><link>https://o11yhorrors.com/tales/the-alert-that-never-fired/</link><guid isPermaLink="true">https://o11yhorrors.com/tales/the-alert-that-never-fired/</guid><description>We had a disk-full alert. We had tested it. The night the database died, it stayed silent, because the metric pipeline that fed it died right along with the disk.</description><pubDate>Thu, 29 Jan 2026 00:00:00 GMT</pubDate><category>alerting</category><category>thresholds</category><category>monitoring-gaps</category></item><item><title>The Dashboard That Lied</title><link>https://o11yhorrors.com/tales/the-dashboard-that-lied/</link><guid isPermaLink="true">https://o11yhorrors.com/tales/the-dashboard-that-lied/</guid><description>Every graph was green while checkout was completely down. The dashboard was not showing healthy traffic. It was showing the last thing it ever saw.</description><pubDate>Mon, 12 Jan 2026 00:00:00 GMT</pubDate><category>dashboards</category><category>metrics</category><category>stale-data</category></item></channel></rss>