Metrics vs Events: A Conversation About Controlling Volume

If I’m used to metrics, how should I think about events in Honeycomb?
We hear variations on this question fairly often, and it cuts to the heart of how Honeycomb differs from other vendors in the APM and metrics space that claim to help teams achieve observability.

Our Honeycomb Community Slack, Pollinators, is frequently hopping with activity; people ask and answer questions, share their successes and get help with their challenges. A few days ago, one of our newer users, Steven E. Harris (aka seh) asked a series of thoughtful and incisive questions, which Ben Hartshorne (aka ben) answered as part of a longer thread. I came upon this discussion a few hours later and thought it was so great that I wanted to share it with anyone looking for answers:

seh:
How does Honeycomb’s event model compare with the popular metrics instrumentation model, where programs have a set of (usually atomic) counters in memory that they increment very frequently, and only “publish”—whether by being scraped for or by pushing a snapshot of those counters—occasionally?

The Honeycomb model seems at first blush to be too expensive to use for such metrics; a naive take on the client library is that it incurs a growing backlog of events to send, with storage cost proportional to the number of events accumulated since the last publication. Contrast that with the set-of-numbers model, where the storage cost remains fixed regardless of how frequently the program publishes or gets scraped.

ben:
I think your question is actually a little bit different from how we normally answer the question of “events! metrics! logs!” because the standard answer comes through as “metrics can’t handle cardinality!”, and what you’re talking about is volume and poll interval instead.

seh:
Well, I’m assuming metrics of low cardinality, such that the number of counters (for example) declared in a program is fairly tightly bound to the number of metrics that will arise.

ben:
In any high volume service, you’re going to have to compact the data in some way in order to effectively manage sending it to an instrumentation service of any sort. There are two common compression methods used – aggregation and sampling. With an aggregated source, you can poll at a regular interval and the volume of instrumentation data does not change as a function of the volume of data being measured – instead it varies with other features (tags on the metrics, number of hosts, etc.)
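To make the aggregation model concrete, here is a minimal sketch in Python. The class and function names are hypothetical and do not come from any particular metrics library. It shows why storage depends on the number of distinct name and tag combinations rather than on request volume, and how a rate falls out of two snapshots by subtraction and division.

```python
from collections import Counter


class AggregatedCounters:
    """Hypothetical in-memory counters, one per (name, tags) combination."""

    def __init__(self):
        self.counts = Counter()

    def incr(self, name, **tags):
        # Memory grows with the number of distinct keys, not with call volume.
        key = (name, tuple(sorted(tags.items())))
        self.counts[key] += 1

    def snapshot(self):
        # What a scrape or a periodic push would publish.
        return dict(self.counts)


def rate_per_second(earlier, later, key, interval_seconds):
    """Observed rate between two snapshots of an ever-increasing counter."""
    return (later.get(key, 0) - earlier.get(key, 0)) / interval_seconds
```

However many requests arrive between polls, each snapshot here holds one number per key.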

When using sampling to control volume, you don’t aggregate many requests into single reported numbers; you keep all the distinct attributes of the thing being measured and rely on the statistics of volume to ensure that you get a good aggregate measurement. By sending a small portion of the events to your visualization service, you can still control volume.

The algorithm you use to choose your samples is up to you – one could easily implement a sampling algorithm that says “send 10 events every second, regardless of incoming volume” and thereby get the same feature you mentioned before – constant outbound instrumentation traffic independent of incoming volume.
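One way to sketch a "send 10 events every second" sampler is reservoir sampling over a fixed window. This is an illustration under assumptions, not the Honeycomb client library's API: the class name is hypothetical, and the caller is assumed to invoke `flush()` once per window (for example, from a one-second timer).

```python
import random


class WindowedReservoirSampler:
    """Keep at most k events per window, chosen uniformly at random."""

    def __init__(self, k, rng=None):
        self.k = k
        self.rng = rng or random.Random()
        self.seen = 0          # events offered in the current window
        self.reservoir = []    # never holds more than k events

    def offer(self, event):
        self.seen += 1
        if len(self.reservoir) < self.k:
            self.reservoir.append(event)
        else:
            # Algorithm R: the new event replaces a slot with probability k / seen.
            j = self.rng.randrange(self.seen)
            if j < self.k:
                self.reservoir[j] = event

    def flush(self):
        """Close the window and return (event, sample_rate) pairs."""
        kept = self.reservoir
        # Each kept event stands in for seen / len(kept) original events.
        rate = self.seen / len(kept) if kept else 0
        self.seen, self.reservoir = 0, []
        return [(event, rate) for event in kept]
```

Memory stays bounded at `k` events per window no matter how busy the service is, and the recorded sample rate lets totals be reconstructed later.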

seh:
Yes, I suppose I’m thinking of the “normal” model as what you’re calling aggregation. What I’m wondering here is whether it’s possible to use Honeycomb for that kind of data. Will the event queue overflow in a very busy server? Does the Honeycomb client compress these events somehow so that their space in memory is less than N times the size of an individual event?

ben:
Aggregation is definitely the normal model when you’re looking at a metrics service. (statsd is my most frequent go-to for an example.) I don’t think it’s the normal model when looking at event, trace, or logging systems. There is definitely a different data model involved.

seh:
What if you’re trying to measure, say, a request rate? Keeping an increasing counter allows using subtraction and division to compute an observed rate. If you’re sampling, though, you can’t compute that rate.

What I’m trying to understand is whether Honeycomb’s model sits next to but apart from this “normal” kind of metrics, such that we might want both, or whether Honeycomb’s model subsumes this “normal” model.

ben:
Ah, that’s a much fuzzier question. Each model has its strength. In my experience (with my background being more ops than dev) I have found myself still wanting both, but using the metrics system way less than I ever did before. Most questions are better answered, faster, and with more clarity, from an event-based system than a metrics system. There are still classes of instrumentation where I continue to use metrics — they are best categorized as high throughput and low differentiation in workload. I wouldn’t suggest using an event-based system for watching a switch or a router.

> If you’re sampling, though, you can’t compute that rate.
Calculations post-sampling are possible by using the sample rate as a part of the calculation. For example, to calculate the number of requests that came through in a minute, assuming one event per request, you sum the sample rates for each event recorded during that minute.
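As a small worked example of that arithmetic, assume each stored event carries a `sample_rate` field (a hypothetical field name) meaning "this event represents that many original requests". The sketch below recovers the request count for one minute and the per-second rate.

```python
def estimated_request_count(events):
    """Sum of sample rates: each kept event stands in for sample_rate requests."""
    return sum(event["sample_rate"] for event in events)


# Events recorded during one minute. Successful requests were sampled at
# 1 in 20, while errors were all kept (sample rate 1).
one_minute_of_events = [
    {"status": 200, "sample_rate": 20},
    {"status": 200, "sample_rate": 20},
    {"status": 200, "sample_rate": 20},
    {"status": 500, "sample_rate": 1},
    {"status": 500, "sample_rate": 1},
]

total = estimated_request_count(one_minute_of_events)  # 3 * 20 + 2 * 1 = 62
per_second = total / 60
```

The result is an estimate, not an exact count; it gets closer to the true number as the volume of sampled events grows.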

If it helps, we actually actively use both metrics and events to understand how our own production systems are running. The things that best lend themselves to our metrics system: Kafka, system metrics, and some things that AWS publishes about its services. Everything else (actually looking at the behavior of each service) is almost entirely event-based.

There are a few gauge-like measurements that we take and throw on to each event as it goes out the door – amount of memory used by the current process and the number of active goroutines are two examples. Those aren’t strictly related to each event as it’s being processed, but they’re super useful to be able to display alongside the other queries we run.
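The same idea can be sketched in Python, with the active thread count standing in for the goroutine count. The helper name and the field names are assumptions made for illustration only.

```python
import threading


def with_gauges(event, gauges):
    """Return a copy of the event with current gauge readings attached."""
    enriched = dict(event)
    for name, read in gauges.items():
        # Each gauge is a zero-argument callable read at send time.
        enriched[name] = read()
    return enriched


# Hypothetical gauge set; a Python analog of "number of active goroutines".
GAUGES = {"threads_active": threading.active_count}

event = with_gauges({"endpoint": "/home", "duration_ms": 12.5}, GAUGES)
```

Because the readings ride along on every event, they can be graphed next to request-level fields without a separate metrics pipeline.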

> Will the event queue overflow in a very busy server?
That’s a super interesting question – and yes, it can. Sometimes. The normal approach to managing that is that the sampling is done in-process, so only a small fraction of events that are handled by the service get their instrumentation-events shoved into the outgoing queue for transmission to Honeycomb. With sampling tuned right, you can avoid that problem.
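A rough sketch of in-process sampling in front of a bounded outgoing queue might look like the following. The class, its parameters, and the drop counter are all hypothetical and are not taken from a Honeycomb SDK.

```python
import queue
import random


class SampledSender:
    """Sample events in-process before they reach a bounded send queue."""

    def __init__(self, sample_rate, max_pending, rng=None):
        self.sample_rate = sample_rate            # keep roughly 1 in sample_rate
        self.pending = queue.Queue(maxsize=max_pending)
        self.rng = rng or random.Random()
        self.dropped = 0                          # events lost to a full queue

    def record(self, event):
        """Return True if the event was queued for transmission."""
        if self.rng.randrange(self.sample_rate) != 0:
            return False                          # sampled out, costs no memory
        try:
            # Attach the rate so totals can be reconstructed downstream.
            self.pending.put_nowait({**event, "sample_rate": self.sample_rate})
        except queue.Full:
            self.dropped += 1                     # overflow: drop, never block
            return False
        return True
```

If `dropped` keeps climbing, that is a signal the sample rate is too low for the traffic the process is handling.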

I will admit that this is one of my favorite topics, and I’d love to continue the conversation again. Please share whichever things you read that you find are the most illuminating!

If I had to do the one-sentence version of an answer to what I think is the main thrust of this conversation:
Yes, you can use Honeycomb for metrics. You will wind up replacing 95% of your metrics use with Honeycomb use, and keep around the last 5% for the things it best suits. (But that answer loses all subtlety, and I’m bad at making blanket statements, hence the wall of text that is this thread. 😉)


At Honeycomb, we welcome the difficult questions. Sign up for a free trial and get an invite to our Community Slack so you can join the discussion!