Why Honeycomb

Honeycomb is a tool for introspecting and interrogating your production systems. We can gather data from any source—from your clients (mobile, IoT, browsers), vendored software, or your own code. Single-node debugging tools miss crucial details in a world where infrastructure is dynamic and ephemeral. Honeycomb is a new type of tool, designed and evolved to meet the real needs of platforms, microservices, serverless apps, and complex systems.

 

“We’re all distributed systems engineers now.” –Charity Majors, CEO and co-founder

Fast queries on raw, high-cardinality data

In a traditional monitoring system, engineers are frequently caught between two undesirable alternatives: pre-aggregate data and lose precious detail, or keep lots of data and suffer very slow queries and high costs.

Honeycomb, by contrast, hits a sweet spot between these two: it is backed by a fast columnar store that can run queries over many millions of rows in seconds. It encourages a fluid workflow and rapid iteration when answering questions to solve problems, and still exposes the raw collected data to analyze in as much detail as desired. Because engineers can get both high- and low-level information, they solve problems more quickly.

Having the raw data available also means you don’t need to know ahead of time which IP addresses or user IDs to count, only that those fields might be of interest to you later.

Example: Your website is suddenly receiving a lot of traffic and it’s not clear whether it’s “good” traffic (e.g., you’re going viral somewhere) or “bad” traffic (someone is attacking you by flooding the site with requests, or a bot has gone crazy). Using Honeycomb you can:

  • BREAK DOWN by the high-cardinality fields within the requests (user ID, IP address, and more) to quickly see whether the traffic originates from a few sources or from many.
  • If it’s a bad actor, blacklist that user and protect the website.
  • If it’s desirable traffic, spin up more servers to deal with it.
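The breakdown step above can be sketched in a few lines of Python. This is only an illustration of grouping raw events by a high-cardinality field, not Honeycomb's query engine; the event fields (`user_id`, `ip`, `path`) and the helper names are assumptions made up for the example.

```python
from collections import Counter

def break_down(events, field):
    """Count events per distinct value of `field`, most frequent first."""
    counts = Counter(e[field] for e in events if field in e)
    return counts.most_common()

def top_share(events, field):
    """Fraction of all events that come from the single most common value."""
    ranked = break_down(events, field)
    return ranked[0][1] / len(events) if ranked else 0.0

# Hypothetical raw request events; no counters were defined in advance.
events = [
    {"user_id": "u1", "ip": "10.0.0.1", "path": "/"},
    {"user_id": "u2", "ip": "10.0.0.2", "path": "/"},
    {"user_id": "u1", "ip": "10.0.0.1", "path": "/login"},
    {"user_id": "u1", "ip": "10.0.0.1", "path": "/"},
]

by_ip = break_down(events, "ip")
share = top_share(events, "ip")
```

A high `share` suggests traffic concentrated in one source (a possible bad actor), while a low one suggests many independent visitors. Because the raw events are kept, the same data can be regrouped by `user_id` or `path` without any new instrumentation.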

Proactive, not just reactive

Example: A high priority customer writes in to let you know that they are very unhappy as some pages are loading so slowly that they cannot use them. Using Honeycomb you can:

  • BREAK DOWN by customer ID and URL with a latency calculation to quickly spot when, and on which page(s), the slowdown affected that particular customer.
  • Fix the issue using contextual information from the raw data in this query.
  • See if other users were affected by the issue, so you can proactively reach out to customers who were affected and might be unhappy but silent.
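As a rough sketch of the workflow above, the following Python groups raw events by customer and URL, computes a latency percentile per group, and then looks for other customers affected on the same page. The field names (`customer_id`, `url`, `duration_ms`), the nearest-rank percentile, and the 1000 ms threshold are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_by(events, keys, pct=95):
    """Group events by the given keys and return the pct-th percentile latency per group."""
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[k] for k in keys)].append(e["duration_ms"])
    return {group: percentile(durations, pct) for group, durations in groups.items()}

# Hypothetical request events.
events = [
    {"customer_id": "acme", "url": "/report", "duration_ms": 100},
    {"customer_id": "acme", "url": "/report", "duration_ms": 120},
    {"customer_id": "acme", "url": "/report", "duration_ms": 4000},
    {"customer_id": "acme", "url": "/home", "duration_ms": 80},
    {"customer_id": "acme", "url": "/home", "duration_ms": 90},
    {"customer_id": "globex", "url": "/report", "duration_ms": 3500},
    {"customer_id": "initech", "url": "/report", "duration_ms": 110},
]

p95 = latency_by(events, ["customer_id", "url"])

# Customers whose P95 on the slow page exceeds an (assumed) 1000 ms threshold.
affected = sorted(c for (c, url), ms in p95.items() if url == "/report" and ms > 1000)
```

Here `affected` includes a customer who never wrote in, which is the group you would want to reach out to proactively.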

Honeycomb can help you figure out which specific customers are affected by (or even causing) a particular issue in production. This allows you to not only detect when something is wrong, but to rapidly deduce why and take steps to proactively mitigate its impact on the business, or even spot potential issues before they happen.

Honeycomb is able to do this because it was specifically designed and architected to handle “high-cardinality” data. High-cardinality data has a lot of distinct values (such as a customer ID, of which there could be thousands or millions), and many existing monitoring systems do not handle it well because trying to do so can create explosive complexity.

Democratized debugging

Large technology companies like Facebook and Google use systems like Scuba to debug code and understand how it runs in production, but to date those tools have not been available to engineers outside those companies. Honeycomb changes all of that, empowering individual engineers, teams, and organizations to explore and understand their systems, in production, at scale, and in real time, finding the “needles in a haystack of needles” that yesterday’s toolsets routinely miss.

New School vs. Old School

There are no tools quite like Honeycomb on the market. Here is how our approach, feature set, and value compare to some traditional categories.

 

Monitoring and Metrics

Examples: Datadog, SignalFx, Graphite, InfluxDB, Kibana, statsd, Prometheus, Ganglia

 

A metric is essentially a “dot” of data—e.g., statsd.increment("api.requests") is the statsd command to increase the “api.requests” metric by one. Newer time series data stores try to approximate the context of an event with tags or dimensions. You typically are allowed a limited number of tags (because of the write amplification factor), and you can slice and dice your metrics by tags.

The primary method of interacting with metrics is by constructing dashboards. A dashboard is a view of one or several metrics displayed over time, and it may be generated manually or programmatically.

Honeycomb is different from traditional monitoring and metrics tools because it is event-driven and interactive. We accept arbitrarily wide events with no schema, so you may have hundreds or more keys in a dataset, and you may ask questions that look more like business-intelligence queries. For example:

 

“Some users are reporting elevated latency. Latency does appear to be elevated ... but only for write endpoints, and only for requests hitting replica sets with a primary in AWS availability zone 'us-east-1b' on the r3 instance family, and only for nodes using PIOPS. There seems to be network saturation between storage and instances for those nodes.”

 

You can't get that out of a dashboard unless it was handcrafted in advance for that specific, EXACT question. You could try to pre-generate dashboards for every possible combination of factors, but you shouldn't: asking questions is a much better model that helps you perform real data-driven debugging, not passive eyeball-scanning.
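To make the "asking questions" model concrete, here is a minimal Python sketch that narrows a set of wide events by several dimensions at once, mirroring the latency investigation quoted above. The field names (`endpoint_type`, `az`, `instance_family`, `storage`, `duration_ms`) and the sample values are hypothetical, and the `where` helper is only a stand-in for an interactive query.

```python
from statistics import mean

def where(events, **conditions):
    """Keep events whose fields match every given key=value condition."""
    return [e for e in events if all(e.get(k) == v for k, v in conditions.items())]

# Hypothetical wide events: any key can be filtered on later, none is special.
events = [
    {"endpoint_type": "write", "az": "us-east-1b", "instance_family": "r3", "storage": "piops", "duration_ms": 900},
    {"endpoint_type": "write", "az": "us-east-1b", "instance_family": "r3", "storage": "gp2", "duration_ms": 40},
    {"endpoint_type": "read", "az": "us-east-1b", "instance_family": "r3", "storage": "piops", "duration_ms": 35},
    {"endpoint_type": "write", "az": "us-east-1a", "instance_family": "r3", "storage": "piops", "duration_ms": 45},
    {"endpoint_type": "write", "az": "us-east-1b", "instance_family": "r3", "storage": "piops", "duration_ms": 1100},
]

# Each added condition is a follow-up question, not a pre-built dashboard.
suspect = where(events, endpoint_type="write", az="us-east-1b",
                instance_family="r3", storage="piops")
suspect_ids = {id(e) for e in suspect}
rest = [e for e in events if id(e) not in suspect_ids]

suspect_avg = mean(e["duration_ms"] for e in suspect)
rest_avg = mean(e["duration_ms"] for e in rest)
```

The point of the sketch is that the combination of conditions was not known in advance; it emerges one filter at a time as each answer prompts the next question.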

Honeycomb provides collaborative “Boards” where you can bookmark and save any interesting entry points to your data. Teams can share Boards amongst themselves to propagate useful information such as quick links for folks on call, examples for onboarding new teammates, and easy visibility into newly-released features or recent deployments.

Honeycomb places no limit on the number of attributes (tags) you can have (hundreds of millions or more), or on how many of them you combine in a query.

 

Log Aggregation

Examples: Splunk, Sumo Logic, ELK, Papertrail, Loggly, Graylog, etc.

 

Logs are closer on the evolutionary tree to us than metrics, because logs are proto-events. Logs are strings, however, and Honeycomb accepts only structured data. This means far less costly storage and processing on the server side.

Log aggregation tools typically rely on regular expressions, which are slow; transport layers like rsyslog or logstash; and some sort of schema or indexes you have to predict and choose. Indexes are expensive to maintain, and write performance degrades across the board if you maintain too many of them (not to mention the physical cost of storage).

Honeycomb accepts JSON objects. You can turn your strings into JSON however you wish. We have lots of helpers to get you started—e.g., honeytail (which understands most common log formats and can run either from cron or as a lightweight agent), SDKs for most major programming languages, and even helpers for databases that do high throughput sniffing over the wire and reconstitute your transactions.
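As one illustration of turning log strings into structured data, the sketch below converts a `key=value` style log line into a JSON object using only the Python standard library. The log format and field names are assumptions for the example; this is not how honeytail or any Honeycomb SDK is implemented.

```python
import json
import shlex

def parse_value(raw):
    """Convert numeric-looking strings to int or float; leave everything else as text."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw

def line_to_event(line):
    """Split a key=value log line (quoted values allowed) into a dict."""
    event = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            event[key] = parse_value(value)
    return event

# Hypothetical log line.
line = 'method=GET path=/api/users status=200 duration_ms=12.5 msg="cache miss"'
event = line_to_event(line)
payload = json.dumps(event)  # a JSON object, ready to send as structured data
```

Doing this conversion once, on the client side, is what lets numeric fields such as `status` and `duration_ms` be queried later without regular expressions.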

Honeycomb is built on a homegrown column store and has no schema or indexes. We aggregate at read time and can do horizontal sharding indefinitely, letting us achieve lightning fast interactive performance at "web scale" (sic).
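The idea of a schemaless column layout with read-time aggregation can be illustrated with a toy in-memory store. This is a deliberately simplified sketch and makes no claim about Honeycomb's actual storage engine; the class and method names are invented for the example.

```python
class ColumnStore:
    """Toy column-oriented store: one list of values per field, no declared schema."""

    def __init__(self):
        self.columns = {}  # field name -> list of values, one slot per row
        self.rows = 0

    def append(self, event):
        # A never-before-seen field gets a new column, backfilled with None.
        for field in event:
            if field not in self.columns:
                self.columns[field] = [None] * self.rows
        # Every column grows by one slot; fields absent from this event store None.
        for field, column in self.columns.items():
            column.append(event.get(field))
        self.rows += 1

    def count_by(self, field):
        """Aggregate at read time by scanning only the one column involved."""
        counts = {}
        for value in self.columns.get(field, []):
            if value is not None:
                counts[value] = counts.get(value, 0) + 1
        return counts

store = ColumnStore()
store.append({"status": 200})
store.append({"status": 200, "region": "eu"})
store.append({"region": "eu"})
```

Nothing is pre-aggregated or indexed on write; a query such as `store.count_by("region")` touches only the `region` column, which is the property that makes column layouts attractive for this kind of workload.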

Many log aggregation tools are very mature and rich in helpers and features and drop-in connectors to every type of software under the sun. Those are wonderful for known-unknowns. Honeycomb shines at unpredictable workloads and helping you find unknown-unknowns.

 

Application Performance Monitoring

Examples: New Relic, Dynatrace, AppDynamics

 

APM tools are typically backed by one of the other two storage backends discussed above (metrics or log aggregators) to collect and present data from the perspective of the application itself, as well as surfacing language internals. They often do clever things to sift out the most important data automatically and present it with very little work.

This is terrific! It's a great shortcut for getting started. Some of them also let you define custom triggers or questions at the application level. However, at the presentation layer they have the same shortcoming: you can't ask a new question, or you can't break it down by *just one user* (out of tens of millions of users) or *just one app* and then ask all the same questions as before. When systems break these days, they tend to do so idiosyncratically, so the kind of data that APM tools weed out is exactly the kind you need.

Honeycomb handles these high-cardinality and high-dimensionality cases flawlessly. APM tools often make it easy to find the “top 10” of something; Honeycomb makes it as trivial to find #100,001 as #10.

Honeycomb includes native SDKs, so instrumenting your code is as easy as adding a comment. You don't get as much pre-baked stuff done for you, but you can insert any data you want in the form of k/v pairs and query on it later.

 

Exception Trackers

Examples: Sentry, Airbrake, Rollbar

 

Exception trackers have some overlap with APM tools but cover an even more specific use case: your application hits an unexpected situation, and you'd like to be notified ASAP—at least, the first time it happens. The next ten, hundred, or thousand times? Maybe not so much.

These tools are absolutely necessary to any developer workflow. Exception trackers have a lot of magic built into deduplicating stack traces, and often have some fantastic product thinking around an issue-resolution workflow, but they can fail to surface subtler problems in your system's health.

If exceptions are only thrown when an error is hit, you lose the ability to understand when things get worse-but-not-broken. Increased latencies or poorly-balanced loads only surface in an exception tracker when they hit some sort of threshold. To ensure a robust system, exception trackers are (often) necessary but (more often) insufficient.

 

Tracing

Examples: Zipkin, Jaeger, Lightstep

Standalone tracing tools are useful for following a single request in-depth but do not offer guidance on which requests to trace, or which traces to look at when troubleshooting. Intelligent search and aggregation is Honeycomb's top priority—a necessary part of working well with traces, zooming out, and identifying the next set of traces to look for.

Honeycomb lets you quickly identify traces worth investigating (“breadth-first search”), then dig into all of the individual hops along a trace (“depth-first search”). Many other tracing tools have rudimentary breadth-first search and simply present you with a list of traces (or a single high-level, un-customizable aggregate view).
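The two-phase search described above can be sketched with plain Python over a list of spans. The span fields (`trace_id`, `span_id`, `parent_id`, `name`, `duration_ms`) are hypothetical names chosen for the example, and ranking traces by root-span duration is just one possible "breadth-first" criterion.

```python
def slowest_traces(spans, limit=3):
    """Breadth-first step: rank traces by the duration of their root span."""
    roots = [s for s in spans if s["parent_id"] is None]
    roots.sort(key=lambda s: s["duration_ms"], reverse=True)
    return [s["trace_id"] for s in roots[:limit]]

def walk_trace(spans, trace_id):
    """Depth-first step: list every hop of one trace as (depth, name, duration_ms)."""
    children = {}
    for s in spans:
        if s["trace_id"] == trace_id:
            children.setdefault(s["parent_id"], []).append(s)
    hops = []

    def visit(parent_id, depth):
        for s in children.get(parent_id, []):
            hops.append((depth, s["name"], s["duration_ms"]))
            visit(s["span_id"], depth + 1)

    visit(None, 0)
    return hops

# Hypothetical spans from two traces.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "GET /home", "duration_ms": 50},
    {"trace_id": "t2", "span_id": "b", "parent_id": None, "name": "POST /checkout", "duration_ms": 900},
    {"trace_id": "t2", "span_id": "c", "parent_id": "b", "name": "auth", "duration_ms": 20},
    {"trace_id": "t2", "span_id": "d", "parent_id": "b", "name": "charge_card", "duration_ms": 850},
    {"trace_id": "t2", "span_id": "e", "parent_id": "d", "name": "db.query", "duration_ms": 800},
]

worth_a_look = slowest_traces(spans, limit=1)
hops = walk_trace(spans, worth_a_look[0])
```

The aggregate step picks which trace deserves attention, and only then does the per-hop walk show that most of the time was spent in the nested `db.query` span.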