Part 3/5: Dear Operations Engineers
By Charity Majors | Last modified on April 18, 2022
It’s time to shrug off the last vestiges of that martyr complex we’ve been trudging around with since the bad old days of the BOFH. We’ve got better things to do with our lives than being assholes to everyone.
Stop trying to predict every possible failure (you can’t, anyhow), and stop toiling away half your life creating dashboards for people, dashboards of dashboards, and ways to auto-generate dashboards and metrics (that nobody can ever seem to find when they most need them).
Honeycomb grew out of the best of the operations and data disciplines. We are grounded in a fervent belief in the power and necessity of raw events, and in the belief that it’s better to be whip-fast, interactive, exploratory, and “close enough” than to claim or aim for 100%.
Systems engineers should have nice things
Honeycomb was built by ops engineers, for ops engineers. Because we love you, and we want your lives to be better.
Business Intelligence teams have had nice things for years, because the fiscal consequences have always been clear. You could always draw a straight line from “more business intelligence” to “making more money.”
We believe this is increasingly clear for systems too, and we can start arguing for this convincingly. How much money gets lost when a site is down or shopping cart conversion rates are falling for five minutes, or when a page takes two seconds to load? A lot, actually.
Old way: Over-page yourselves on symptoms, because you don’t trust yourselves to debug complex problems without paging on a lot of symptoms, which are inherently flappy and unreliable.
New way: Create two lanes for alerts – things that are worth waking people up for at 3 am, and things that need to be dealt with eventually … like when you roll in to work at 11 am after a good night’s sleep. Have the confidence in yourselves and your debugging tools that you don’t need to wake up and judge every flap for yourself; align your pain with customer pain.
Operations teams are chronically over-paging themselves and burning themselves out. Often this is because they are paging on symptoms rather than end-to-end code paths or top-level metrics. But engineering pain should be strongly aligned with customer pain; if customers are unaffected, engineers shouldn’t be woken up either.
Honeycomb solves this by giving you the confidence to debug complex problems in a fraction of the time. Teams over-page themselves because they lack confidence in their tooling. They have to throw the dice and monitor for symptoms because they don’t or can’t trust tools to page them only when customers are affected.
We experienced this firsthand at Parse, with Facebook’s Scuba. By getting our events into datasets that let us focus on ad hoc queryability, we were able to systematically eliminate category after category of unreproducible errors that collectively added up to a huge impact on our reliability. And we were able to debug practically anything in seconds or minutes, not minutes or hours.
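The two-lane idea above can be sketched in a few lines. This is only an illustration, not how Honeycomb or any particular alerting tool works: the `Alert` type, the `customer_impact` field, and the 1% threshold are all made-up names and numbers for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    PAGE = "page"      # worth waking someone up at 3 am
    TICKET = "ticket"  # deal with it after a good night's sleep


@dataclass
class Alert:
    name: str
    # Hypothetical field: fraction of customer requests affected (0.0 to 1.0),
    # measured on an end-to-end code path or top-level metric.
    customer_impact: float


def route(alert: Alert, page_threshold: float = 0.01) -> Lane:
    """Page only when customers are actually hurting; queue everything else.

    The default threshold is an assumption for this sketch, not a recommendation.
    """
    if alert.customer_impact >= page_threshold:
        return Lane.PAGE
    return Lane.TICKET
```

A flappy symptom such as a single host running hot, with no measurable customer impact, lands in the ticket lane; a spike in failed checkouts pages someone.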
Drop MTTR With One Weird Trick
We know that most systems problems happen as a result of a human taking action upon that system. That’s why we’ve made vertical markers one of the earliest and best features of Honeycomb.
Old way: Spend a lot of time puzzling over an unexpected spike or change that has no apparent cause; hours later track it down to a human event, almost inevitably.
New way: Draw colorful, dotted vertical lines any time a person runs a command or a script gets run from cron, etc. Wrap your command lines and cron jobs so it’s easy to track down human actions.
Any time a script runs from cron, any time you canary a deploy, any time you run a one-off: Honeycomb can draw a vertical line with a URL to the code tarball, include your name, post it to a Slack room, whatever.
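Here is a rough sketch of what wrapping a command might look like. It runs the command, records who ran it and when, and hands a marker to a sender function. The endpoint path, the `X-Honeycomb-Team` header, and the marker field names reflect our understanding of Honeycomb's Markers API and should be checked against the current API documentation; `build_marker`, `post_marker`, `run_with_marker`, and the `"command"` marker type are names invented for this example.

```python
import getpass
import json
import shlex
import subprocess
import sys
import time
import urllib.request

# Assumed endpoint shape; verify against the Honeycomb API docs before use.
MARKERS_URL = "https://api.honeycomb.io/1/markers/{dataset}"


def build_marker(command, start, end, user, exit_code, url=None):
    """Describe one human (or cron) action as a marker payload."""
    marker = {
        "message": f"{user} ran `{shlex.join(command)}` (exit {exit_code})",
        "type": "command",  # hypothetical marker type for this sketch
        "start_time": int(start),
        "end_time": int(end),
    }
    if url:
        marker["url"] = url  # e.g. a link to the code tarball
    return marker


def post_marker(marker, dataset, api_key):
    """Send the marker over HTTP and return the response status."""
    request = urllib.request.Request(
        MARKERS_URL.format(dataset=dataset),
        data=json.dumps(marker).encode("utf-8"),
        headers={"X-Honeycomb-Team": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def run_with_marker(command, send, user=None, url=None):
    """Run `command` (a list of arguments), then pass a marker to `send`."""
    user = user or getpass.getuser()
    start = time.time()
    result = subprocess.run(command)
    end = time.time()
    marker = build_marker(command, start, end, user, result.returncode, url)
    try:
        send(marker)
    except Exception as exc:
        # A failed marker should never mask the command's own outcome.
        print(f"marker not sent: {exc}", file=sys.stderr)
    return result.returncode
```

In a cron job you might call `run_with_marker(["./nightly-cleanup.sh"], lambda m: post_marker(m, "my-dataset", api_key))`, where the script name, dataset name, and API key are placeholders. Passing the sender in as a function keeps the wrapper easy to test and easy to point at something else, like a Slack room.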
This is really helpful for distributed teams, too. Honeycomb contains a lot of stuff aimed at helping distributed teams collaborate on a debugging problem, share their work, hand off between on-call shifts, etc.