Why Honeycomb? Black Swans, Unknown-Unknowns, and the Glorious Future of Doom
By Charity Majors | Last modified on April 1, 2022
Hello friends! We need to talk – about Honeycomb, you, and the future.
We’ve built this thing to help ourselves and one another deal with the future of software. It isn’t Yet Another Monitoring Tool, or Another Metrics Tool, or Another Log Aggregator. Frankly, the world doesn’t need any more of those. The world does need Honeycomb, and rather badly.
We spend a fairly large percentage of our time obsessing over the future of technology and how we can help people futureproof their systems with better tools. We are somewhat opinionated (ha!), and believe quite modestly that Honeycomb is better than anything else out there for preparing your services and your teams to meet the future.
But it doesn’t really do anyone else much good to have all this locked away in our heads. So … let’s get swinging. :)
What is Honeycomb?
Honeycomb is an event-driven observability tool for debugging systems, application code, and databases. Honeycomb uses structured data and read-time query aggregation to support ultra-rich datasets, no indexes or schemas, and a fast, interactive interface.
Honeycomb is for debugging systems the way gdb or pprof are for debugging code. Only instead of stepping through lines and between modules, you’re now instrumenting services and tracing the full life-cycle of events between services and code and systems and storage layers.1
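To make that concrete, here is a minimal, hypothetical sketch in plain Python (not Honeycomb’s actual client library; the field names are invented for illustration) of what a structured event looks like: one record per unit of work, carrying as many fields of context as you can attach, serialized for shipping to a backend.

```python
import json
import time

def emit_event(fields):
    """Hypothetical sketch: build one wide, structured event and serialize
    it as JSON. Real instrumentation would ship this to a backend."""
    event = {"timestamp": time.time(), **fields}
    print(json.dumps(event, sort_keys=True))
    return event

# One event carries many fields of context, not a single number.
ev = emit_event({
    "service": "api",
    "endpoint": "/export",
    "user_id": 4212,
    "duration_ms": 382.5,
    "db_rows_examined": 14021,
    "build_id": "a1b2c3",
})
```

Because every field is stored per-event and aggregation happens at read time, any of these fields can become the axis of a question later, with no schema or index decided up front.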
“Systems: we have a problem.”
Software is becoming exponentially more complex. On the infra side, we have a convergence of patterns like microservices, polyglot persistence, and containers, all of which continue to decompose monoliths into agile, complex little systems. Great for products; hard on humans.
On the product side, we have an explosion of platforms and creative ways to empower humans to do cool new stuff. Great for users; hard to build out of stable, predictable infra components.
Our team has worked at Google, Facebook, Dropbox, and other platforms that were consistently years ahead of the pack. We’ve seen the future, and frankly it looks a lot like a freight train of complexity bearing down on you just around the next curve. And we can help.
Honeycomb is designed to help your team answer unpredictable new questions — quickly, accurately, painlessly. It provides real-time, interactive introspection for your data at a scale that would drown or bankrupt other apps.
Because the fundamental difference between predictable systems and complex systems is the number of new questions you need to craft and answer on a regular basis. 2
Old Way: With a classic LAMP stack, you might have one big database, an app tier, a web layer, and a caching layer, with software load balancing. You can predict most of the failures and craft a few expert dashboards that will answer nearly every performance root-cause question you have over the course of a year. Great! Your team isn’t going to spend a lot of time chasing unknown-unknowns, and that’s what matters.
New Way: With a platform, or a microservices architecture, or millions of unique users or apps, you may have a long, fat tail of unique questions to answer every week. They may be variations on a theme (e.g. a user writes in and incorrectly reports “the site is down”), but you need the ability to drill down and answer unpredictable questions accurately and rapidly without expending a lot of cognitive energy.
At Honeycomb, all of us have been through that rocketship growth phase into uncharted territory and managed chaos repeatedly, so we know how to bridge the gap with tooling and techniques. 3
A key factor in our success — and yours — is how you adjust your mental model from reliance on a fixed set of questions and checks (“monitoring”) to the more fluid approach of systems observability.
What is “Observability”?
“Observability” is an awesome concept, borrowed from control theory. It describes the kind of tooling we all need once systems outpace our ability to predict what’s going to break.
“In control theory, observability is a measure for how well internal states of a system can be inferred by knowledge of its external outputs. The observability and controllability of a system are mathematical duals.”
Observability is what we need in a world where most problems are caused either by humans or by black swan events: the convergence of three, five, 10+ different things failing at once. Platforms that incorporate multiple components will always produce a long, fat tail of new questions to ask of your systems on a regular basis.
Let’s compare it to some classic options such as monitoring, metrics, and log aggregation.
- “Monitoring” is an umbrella term for operational visibility. It generally means you have a set of automated checks (often centralized) that run against your systems to make sure nothing that signifies trouble is happening (in any of the ways you predicted). Monitoring and alerting are things Honeycomb can couple with in lots of ways.
- “Metrics” are usually a tick or datapoint, often a number, with optional tags. Metrics are usually bucketed by rollups over intervals, which sacrifices precious detail about individual events in exchange for cheap storage. Most companies are drowning in metrics, most of which never get looked at again. You cannot track down complex intersectional root causes without context, and metrics lack context.
- “Log aggregation” is the most like Honeycomb, because “Logs” are clumsy little linear stories about events. But log aggregation involves a lot of string processing (not getting any faster), regexps (not getting more maintainable), and the need to predictively index on anything you might want to search on (or you’re straight back to distributed grep). Structured data is what gives Honeycomb its unparalleled flexibility and performance.
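The difference is easy to see in miniature. This sketch (plain Python; the log line and field names are invented for illustration) contrasts prying a value out of an unstructured log line with a regex against asking a brand-new question of structured events, where every field is already addressable.

```python
import re

# An unstructured log line: answering a new question means writing a new regex.
raw = '2022-04-01T12:00:00Z GET /export 500 4212ms user=4212 shard=7'
match = re.search(r'(\w+) (\S+) (\d{3}) (\d+)ms user=(\d+)', raw)
status = int(match.group(3))

# The same information as structured events: fields, not strings.
events = [
    {"method": "GET", "path": "/export", "status": 500, "duration_ms": 4212, "user": 4212, "shard": 7},
    {"method": "GET", "path": "/home",   "status": 200, "duration_ms": 12,   "user": 881,  "shard": 2},
    {"method": "PUT", "path": "/export", "status": 500, "duration_ms": 3980, "user": 4212, "shard": 7},
]

# "Which shards do slow 500s cluster on?" -- a brand-new question,
# answered with a filter and an aggregation, no re-indexing required.
slow_500_shards = {e["shard"] for e in events if e["status"] == 500 and e["duration_ms"] > 1000}
print(slow_500_shards)  # {7}
```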
These can all be nice things to have, but they’re not what we mean by observability. Observability is about building systems you can reason about, even when you can’t predict exactly how they’ll fail.
Observability: not your mama’s monitoring
In the future, instrumentation will be just as important as unit tests. Running complex systems means you can’t model the whole thing in your head, and you shouldn’t even try: that mental model is a crutch, and maintaining it is becoming impossible anyway. Instead, focus on making every component consistently understandable.
Yes, of course you should have dashboards. Your dashboards must be flexible and interactive, focused on helping you tease out breadcrumbs and follow the trail. If your dashboards lock you into a rigid set of pre-defined lanes, they’ve cut off your creative problem-solving superpowers.
And to provide this kind of ad hoc questioning, you need rich, wide, event-driven data stores that incentivize and empower you to store as much context as possible for each event.
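One common shape for this is a single wide event per request that accumulates context as the request flows through your code, emitted once at the end. A hypothetical sketch (the class and field names are ours for illustration, not a real API):

```python
class WideEvent:
    """Hypothetical sketch: one wide event per request, enriched as you go."""

    def __init__(self, **fields):
        self.fields = dict(fields)

    def add(self, **fields):
        # Record any context you have in hand; you can't predict which
        # field next month's question will hinge on.
        self.fields.update(fields)

    def emit(self):
        return dict(self.fields)  # in real life: serialize and send

ev = WideEvent(service="billing", request_id="req-123")
ev.add(user_id=4212, plan="enterprise")   # after auth
ev.add(db_calls=3, db_total_ms=41.7)      # after the storage layer
ev.add(status=200, duration_ms=58.2)      # just before responding
record = ev.emit()
```

The payoff comes later: when a customer on one plan reports slowness, every event already carries `plan`, `db_calls`, and `duration_ms` side by side, so the question costs a query rather than a re-instrumentation.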
Just about anything can help you find the known-unknown problems; observability tooling helps you tease out the unknown-unknowns. The results themselves aren’t the interesting part either; what’s interesting is how you got there from the problem statement, preserving the context of the run, and carefully stashing that nugget of wisdom away in your library for future-you to learn from.
The future is awesome. Welcome.
- “But what about Zipkin?? Have you heard about opentracing.io or LightStep???” Yeah!! Big fans! We love power tools that focus on tracing unique request IDs first; they’re really neat for some scenarios. The request-ID tracing methodology is inherently depth-first, whereas ours is breadth-first.
- Something to think about: why are you still building your own metrics? It’s harder than ever to attract and retain world-class engineers. If you have gotten engineers to join your team, why would you waste them on projects that are ancillary to your core business value? Why waste your team’s time building out Yet Another Dashboard when you can outsource the job to someone who can do it better and cheaper? We gave up maintaining our own postfix and imap systems a decade ago, and the same transition is underway for metrics, albeit more slowly.
- Does it sound like we’re promising impossible magical things? No, there are tradeoffs. This is already quite meaty though, we’ll have to give you a look under the hood separately.