Conference Talk

Case Study: Sampling With Refinery at GOAT



Kevan Carstensen [Backend Engineer, GOAT]:

Hi there. I’m Kevan Carstensen. I’m a backend engineer here at GOAT Group. Today I’m here to tell you about Refinery, Honeycomb’s sampling solution. 

Here’s a general outline of what I’m talking about. I’ll give some background on GOAT, telling you about our company, our systems, and our observability tooling. Then I’ll talk a little bit about sampling in general: why you would wish to sample, why you might not wish to sample, and why we chose to sample. Then I’ll talk specifically about Refinery, mainly about our process of deploying and managing it. And then I’ll finally wrap up with some concluding remarks: things that we’ve learned, and things that we’d like to improve about this going forward.

I’ll start with just a brief introduction to GOAT. GOAT Group represents the leading platforms for authentic sneakers, apparel, and accessories. We operate three distinct brands: GOAT, Flight Club, and alias. We have a global community of over 30 million members across 164 countries, and we are hiring across all technology and non-technology roles. If you’re at all interested in what we do or about what I’m talking about here, do feel free to check out the careers page.

Digging into some of our systems, I will focus on backend systems here, just because for now that’s primarily what we have hooked up to Honeycomb. We have a variety of different tech stacks, but primarily we use either Ruby on Rails or GoLang. A lot of our older code lives in Ruby on Rails, primarily in a single monolithic application. A lot of our newer backend services are written in GoLang and in more of a microservices pattern.

Taken together, that’s dozens of different services. Those services deal with a variety of different traffic patterns. Some of them are extremely high-volume consumer-facing services that are customer-facing applications to fulfill sales or to deliver content to mobile applications. Kind of on the other end of that, we have services and backend systems that are focused on our warehouses or on a handful of our internal stakeholders. They may do hundreds of requests in a day, or maybe even less than that. 

There’s also a variety of different types of requests. Some of these are REST APIs or gRPC services that are dealing with client traffic directly. Some of them are backend jobs that run in a queue or another system. There’s a good amount of variety there.


Similarly, on our team, there’s also a good amount of variety. We have engineers kind of running the gamut both in terms of experience in general and in terms of tenure specifically at GOAT. We have a number of engineers who have just recently graduated college, maybe have not had to think about observability or use very many tools around that. Then at the other end, we have very senior people, more than a decade of experience, with deep experience running systems at scale at similar companies. The tools we have to use to keep track of our system as they run need to be comprehensible and usable by all of these engineers, and they need to deliver good insights across these different tech stacks, these different service types, and these different traffic patterns.

We have a variety of tools to help us do that. We use Bugsnag, which is an exception reporting and grouping tool. It embeds in your application. It groups unhandled errors and sends notifications about those errors to PagerDuty or whichever other tooling you have. We have a number of StatsD and CloudWatch metrics, tracking CPU utilization and memory utilization for the containers running a particular service, as well as error rates across services. These generally provide very good visibility into the kinds of metrics that you can observe about the application from the outside.


Getting more into things that come from inside the application, we have the standard Elasticsearch, Grafana, ELK, Kibana pipeline, as well as some metrics, logs, and alerts that are distinct from StatsD and CloudWatch. These tend to come from within the application and can show us metrics that developers have chosen to add to the application. Finally, we have Honeycomb, which gives us tracing and the ability to visualize a single request. We have a number of alerts and SLOs that are keyed off Honeycomb data. This is really useful for visualizing bottlenecks and is key for engineers who are new to the industry, new to the company, or both. It’s a great tool for those folks to learn more about the application. They can learn what goes on when a certain API is called in a way that’s maybe a bit more approachable or comprehensible than reading through the code.

That’s kind of a very high-level and breezy introduction to GOAT and our systems, and kind of the context in which we’re thinking about sampling. Now let’s talk about sampling itself. 

Should I sample? This comes up in the Pollinators Slack every so often, and it turns out to be a tricky question to answer. Certainly, there’s a lot of good reasons why you shouldn’t sample, and many best practice documents out there will tell you don’t sample by default. I think that’s a good default for many people because there are a number of downsides to sampling. 

In general, sampling behavior — and this isn’t specific to Refinery, this applies to anything that samples metrics from your application — is often surprising and unintuitive. If you’re investigating an issue you’ve been paged for at 3:00 in the morning, your first thought isn’t going to be, “Gee, are these metrics being sampled? Am I actually seeing the true frequency of how often this problem is happening?” 


Even if you’re familiar with sampling, that may not come to mind. It’s more cognitive load that developers have to keep in their heads when understanding a system as compared to an environment with no sampling when you can assume if something happens, it’s in your dataset, and you can look for it with a query.
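
To make that surprise concrete, here is a minimal sketch of how a count over sampled data relates to the true count. It assumes each kept event records the rate it was sampled at, where a rate of N means "1 in N was kept". The field name `sample_rate`, the endpoints, and the numbers are made up for illustration and are not taken from GOAT's data.

```python
# Each kept event stands in for `sample_rate` original events ("1 in N").
# The field names and values here are illustrative assumptions.
kept_events = [
    {"endpoint": "/checkout", "status": 500, "sample_rate": 1},
    {"endpoint": "/feed", "status": 500, "sample_rate": 50},
    {"endpoint": "/feed", "status": 200, "sample_rate": 50},
    {"endpoint": "/feed", "status": 200, "sample_rate": 50},
]


def raw_count(events, status):
    """Count matching events exactly as they appear in the sampled dataset."""
    return sum(1 for e in events if e["status"] == status)


def estimated_count(events, status):
    """Estimate the original count by weighting each event by its sample rate."""
    return sum(e["sample_rate"] for e in events if e["status"] == status)


raw_errors = raw_count(kept_events, 500)
estimated_errors = estimated_count(kept_events, 500)
print(raw_errors, estimated_errors)
```

In this toy dataset, two stored error events correspond to an estimated 51 real ones, which is exactly the kind of gap that is easy to overlook in the middle of an incident.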

Sampling infrastructure. If sampling is implemented outside of your application, it’s another thing for your team to maintain. This is maybe less of an issue for very large companies that have infrastructure teams who focus on this 100% of the time, but I think it’s always an issue. Having to maintain things just requires time from people, no matter how large the organization is, and that isn’t something to commit to without thinking about it. For a small company like GOAT, that’s especially important where a lot of things like sampling infrastructure are owned by people who have a number of other things to do. That doesn’t mean you shouldn’t have sampling infrastructure, and not that you shouldn’t spin up new services, but it is something to think about before doing it. 


Then finally, sampling configuration itself: the decision of what to sample, and how aggressively to sample it. Those decisions are, I’d argue, necessarily backward-looking. You’re looking at past incidents, past traffic patterns, and you’re making a decision about what is interesting, what is not interesting, and tuning sampling rules to fit that conclusion. You may be right a lot of the time, and that may be a good decision for future issues, but there’s always a possibility that you’ll see an issue that’s just really poorly served by the sampling decisions you’ve made. 

For example, there might be some events that you didn’t even think about when deciding what is interesting and what is not interesting that turned out to be really interesting. Maybe you discover this when you get paged at 3:00 in the morning and the data that you want to be there aren’t there. That’s a risk, and it’s frankly kind of hard to work around, so it’s a pretty substantial downside for sampling. 

On the other hand, sampling can make sense in some cases. You may have a very high event volume, whether that is due to a large number of traces, very dense traces (by which I mean traces with a large number of spans), or both. Maybe you have events that are, 95% of the time, uninteresting or irrelevant. An example that comes to mind from our application is that we have a good deal of instrumentation on how Rails caches things. For the one out of a hundred incidents where this is important, it’s really cool to have it. It helps us visualize whether our caching is working or not working. But it’s also incredibly noisy. There are a lot of events that come from this instrumentation, and, in most cases, you’re not going to be looking at them. For events like those, which are very frequently uninteresting, there is an argument for sampling. Obviously, it’s a very complex topic, and we’re not going to fully cover it in one slide. But those are some pros and cons of sampling.


Given all of that, why does GOAT sample? The short answer is, we have a very high event volume. Our higher volume applications can generate 150 million or more traces per day that we send to Honeycomb. A lot of these applications are very well-instrumented: we have great visibility into database access, into cache access, and this is great. We can see so much of what’s going on in our application when we’re looking for issues, but it also means that each trace can have dozens, even hundreds of events. Sending each and every one of these to Honeycomb and having them retained in our dataset is cost-prohibitive, so we need to sample.
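
A rough back-of-the-envelope calculation shows the scale involved. The 150 million traces per day is the figure from the talk; the events-per-trace value and the sample rate below are assumed placeholders, not GOAT's real numbers.

```python
TRACES_PER_DAY = 150_000_000  # from the talk: "150 million or more traces per day"
EVENTS_PER_TRACE = 50         # assumption: somewhere in "dozens, even hundreds"
SAMPLE_RATE = 20              # assumption: keep 1 trace in 20


def events_per_day(traces, events_per_trace):
    """Total events generated per day with no sampling."""
    return traces * events_per_trace


def retained_events_per_day(traces, events_per_trace, sample_rate):
    """Events retained per day when 1 in `sample_rate` traces is kept whole."""
    return traces * events_per_trace // sample_rate


unsampled = events_per_day(TRACES_PER_DAY, EVENTS_PER_TRACE)
retained = retained_events_per_day(TRACES_PER_DAY, EVENTS_PER_TRACE, SAMPLE_RATE)
print(f"{unsampled:,} events/day unsampled, {retained:,} retained")
```

Under these assumptions, the unsampled volume is 7.5 billion events per day, and a 1-in-20 rate brings that down to 375 million.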

Given that we need to sample, how do we sample, and why should we use Refinery?

Refinery was a good fit for our sampling needs for a few reasons. It does trace-aware sampling, which means a sampling decision applies to every event within a trace and not just to individual events. If you are sampling with Honeycomb, this, in my view, is the way to do it. The traces that you see are effectively the complete traces that your application generated, and you don’t have to worry about missing spans or certain types of events that didn’t make it into the trace. 

I know the Beelines offer ways to sample events within a trace. We chose not to explore this just because it’s sort of an implication of my earlier point that sampling is unintuitive. If you are sampling at all, having to reason about not only whether a trace makes it to the dataset but whether certain events make it to the dataset, makes for more cognitive load. It’s more likely to cause misunderstandings during issues. Just in general, we prefer trace-aware sampling, and Refinery offers that.
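
One common way to get this all-or-nothing behavior is to make the keep/drop decision a pure function of the trace ID, so that every span in a trace gets the same answer. The sketch below illustrates that idea only. It is not Refinery's implementation, and the function names and the `trace_id` field are hypothetical.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: int) -> bool:
    """Deterministically keep roughly 1 in `sample_rate` traces."""
    if sample_rate <= 1:
        return True
    # Hash the trace ID so the decision depends only on the trace,
    # never on which span happens to be inspected.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % sample_rate == 0


def sample_spans(spans, sample_rate):
    """Keep or drop whole traces, never individual spans."""
    return [s for s in spans if keep_trace(s["trace_id"], sample_rate)]
```

Because the decision is derived from the trace ID alone, any trace that survives sampling arrives with all of its spans, which is the property described above.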

It’s also very easy to integrate into our existing applications. They’re already using Beelines. Beelines generally make it easy to just drop in another output: rather than having it send events to Honeycomb, send events to Refinery. It’s able to handle our event volume requirements, and the sampling options are a good fit for our needs. We know this because it powered the legacy Refinery product behind the scenes at Honeycomb, and this is something we used for a good long while before it was deprecated, and we know it worked well for us. So that’s in a nutshell why Refinery was attractive to us.


Let’s just briefly talk about deploying. I think this will be different for everyone. Everyone has a different internal infrastructure. I’m just briefly going over what this was like for us at GOAT. Very briefly, we wanted to adapt Refinery to run on our internal platform as a service that is maintained by our DevOps team and then tune it to meet the requirements. The first two bullet points are adapting Refinery to run on that platform as a service. We needed to Dockerize it, and we ported our legacy Refinery rules to our hosted Refinery because we had that past experience of using legacy Refinery. We had a good basic set of sampling rules that we could build off.

The hard part, and there’s a lot of work encapsulated in that third bullet point, was load testing and changing our replication and provisioning parameters for Refinery within our internal platforms. So just for context, one of the goals we had for this project was to have Refinery running in our internal infrastructure and having our highest volume event-generating applications going through Refinery by holiday season 2020. We are kind of in the e-commerce space, and after Thanksgiving is our highest traffic period of the year. 


Not just for Refinery but as a company, one of the things we do to prepare for that is a load testing initiative. We need to identify paths through the application that we think high volume users will take. This could be a response to promotions, or a response to push notifications from marketing — anything that would drive traffic through our application. We have teams that build tools that allow us to load-test these paths across all of our different services: to identify bottlenecks, understand what our system is capable of delivering, and help us tune and optimize and generally put ourselves in a good place for our holiday season.

We took advantage of this to help tune Refinery. We had these applications under load tests and sent the resulting event volume to Refinery. We were hoping to get Refinery provisioned in a way that it could deal with that volume during the load tests. Since these load tests tend to be representative of what our applications actually see during peak load time, if Refinery can deal with the load that they generate, then it should be fine for production use. We worked pretty closely with the team that was working on those load tests to monitor Refinery while they were going on and to deploy configuration changes, provisioning changes, and things like that. We ended up in a good spot after that.

Finally, post that process, having everything in production and running through Refinery, we monitored sample rates, event volume, and other metrics both in Honeycomb and in our own infrastructure just to evolve our sampling strategy. 

An example of this might be looking at the sample rate in usage mode and making sure that it’s reasonable. Once we found we were sampling at 100% of the requests going to Honeycomb. That’s not good: that’s essentially us not sampling at all, and that has cost considerations. That’s an ongoing process. That’s something we do even now after it’s been running for six months. Just using usage mode, using other tools within Honeycomb to keep an eye on this, to identify gaps, and to tune as necessary to plug those.


This is pretty ECS-specific, but this is just an idea of where we ended up with that. We run our entire infrastructure for most of our applications based on these. We gave them 2048 CPU allocation, 2048 memory allocation, and they run on m5.large containers.  If you’re not familiar with ECS, this is, I believe, half of the CPU resources of that m5.large and a good deal less than the total memory of it. There’s probably room to tune this. We’ve been happy enough with these settings to just leave them alone. 

In terms of replication and scaling, we have it configured to do between 15 and 45 running Refinery tasks. It would never run fewer than 15, and would never run more than 45. The former is essentially the amount that would be needed to deal with the load that we would see during a reasonably high volume period, and the latter is the highest we’ve seen in testing and we’re confident it would run well there. In practice, it’s usually around 15. It rarely gets above that. We have seen it get up to mid-30s during exceptionally high volume periods, but that’s pretty rare.
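
The scaling bounds can be pictured as a simple clamp. The limits of 15 and 45 are the ones from the talk; the per-task capacity figure and the function itself are hypothetical, and a real ECS service would express this through its auto scaling configuration rather than application code.

```python
import math

MIN_TASKS = 15  # from the talk: never fewer than 15 running tasks
MAX_TASKS = 45  # from the talk: never more than 45 running tasks
EVENTS_PER_SEC_PER_TASK = 2_000  # assumption: made-up capacity of one Refinery task


def desired_task_count(events_per_sec: float) -> int:
    """Tasks needed for the given load, clamped to the configured bounds."""
    needed = math.ceil(events_per_sec / EVENTS_PER_SEC_PER_TASK)
    return max(MIN_TASKS, min(MAX_TASKS, needed))
```

With these placeholder numbers, quiet periods sit at the floor of 15, a spike of 70,000 events per second asks for 35 tasks, and anything beyond the ceiling stays at 45.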

Finally, lessons learned. 

What went well? This first point is the one I’m personally the happiest about since I maintain the service. It’s very low maintenance once it’s running. It’s very stable. It doesn’t have issues that require investigation. It just keeps working. When writing these slides, I think it may have been the first time in months I actually had to take a look at the Refinery graphs or the Refinery service metrics. That’s a fantastic plus point for a service like this, particularly for a small company like us where folks that are responsible for these things have a lot of other things to do. 


Between usage mode and the fact that Refinery metrics have their own dataset in Honeycomb by default, those are super helpful for tuning. I can see the sample rate very easily. I can see whether Refinery is exhausting some memory buffers or something that points to more provisioning that’s needed. It’s very helpful and easy to get visibility into the system as it runs. 

As for sampling: given that we have an event quota, it’s great having rules that allow us to very precisely indicate the rate of samples that we want. It lets us get the most of that event quota, and it lets us make sure that we’re retaining all the events from those lower volume endpoints which are meant to serve internal stakeholders I mentioned earlier. They’re not getting crowded out of the dataset by high volume consumer-facing traffic.
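
The idea of rules that set a precise rate per kind of traffic can be sketched as an ordered list where the first matching rule wins. The rule format, the service names, and the rates below are invented for illustration. This is not Refinery's actual configuration syntax.

```python
# Hypothetical rule format, NOT Refinery's real configuration syntax.
# Rules are checked in order; the first match wins.
# A sample_rate of N means "keep 1 in N"; 1 means "keep everything".
RULES = [
    {"name": "keep errors", "field": "status_code", "min": 500, "sample_rate": 1},
    {"name": "internal tools", "field": "service", "equals": "warehouse-api", "sample_rate": 1},
    {"name": "consumer feed", "field": "service", "equals": "feed-api", "sample_rate": 100},
]
DEFAULT_SAMPLE_RATE = 10


def rule_matches(rule, trace):
    """Return True if the trace's field satisfies the rule's condition."""
    value = trace.get(rule["field"])
    if value is None:
        return False
    if "equals" in rule:
        return value == rule["equals"]
    if "min" in rule:
        return value >= rule["min"]
    return False


def sample_rate_for(trace, rules=RULES, default=DEFAULT_SAMPLE_RATE):
    """Pick the sample rate of the first matching rule, else the default."""
    for rule in rules:
        if rule_matches(rule, trace):
            return rule["sample_rate"]
    return default
```

In this sketch the low volume internal service is always kept at a rate of 1, so it cannot be crowded out, while the high volume consumer service is sampled heavily unless a request fails.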

What about things that were challenging? There are more sample rule types in Refinery than are documented. This feels like a bad thing to complain about. Why would I complain about more features? But we did have to read the source code to find these, and I might not have thought to look for them if I hadn’t used legacy Refinery previously and known that some of these were exposed in the UI. 

As mentioned earlier, sampling as a concept — and a separate service — can be surprising to engineers, and that’s kind of hard to work around. I think the best we can do is to document that’s a thing and help people understand what the sample rates are and where they would go to change them.


Finally, adoption and advocacy for sampling is just an ongoing process within GOAT. And we currently have most of our higher volume event generating services going through Refinery. Some of the other ones go directly to Honeycomb still and it’d be great to have those also go through Refinery.

Then briefly, since I’m almost out of time, some things we’re hoping to continue to work on. I’d love to have CI-style checks for sample rule configuration. Like, guarding against people accidentally turning off sampling and sending all of our event volume to Honeycomb. Right now the way we would detect this is by looking in usage mode after deploying the sampling change, and that’s not ideal. As mentioned previously, I’d love to use Refinery across all of our services rather than just the high-volume ones. This is more of a feature request and maybe not something we’ll see immediately, but just if there were real-time sample rate monitoring or alerting in Honeycomb, that would be awesome because we could proactively alert on that if it got out of spec. 
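
A CI-style guard along those lines could be as small as the sketch below. It assumes a rule list where each rule carries a sample rate (N meaning "keep 1 in N") and an expected share of total traffic. The rule structure, the traffic shares, and the 10% threshold are all hypothetical, and nothing here is an existing Refinery or Honeycomb feature.

```python
def estimated_keep_fraction(rules):
    """Traffic-weighted fraction of traces that the rules would keep.

    Each rule has a `traffic_share` (expected fraction of all traces that
    match it) and a `sample_rate` (keep 1 in N). Both fields are assumptions.
    """
    return sum(r["traffic_share"] / r["sample_rate"] for r in rules)


def check_rules(rules, max_keep_fraction=0.10):
    """Return a list of problems; an empty list means the config passes."""
    problems = [
        f"{r['name']}: sample_rate must be >= 1"
        for r in rules
        if r["sample_rate"] < 1
    ]
    if problems:
        return problems
    kept = estimated_keep_fraction(rules)
    if kept > max_keep_fraction:
        problems.append(
            f"rules would keep {kept:.0%} of traces, limit is {max_keep_fraction:.0%}"
        )
    return problems


GOOD = [
    {"name": "consumer feed", "traffic_share": 0.99, "sample_rate": 100},
    {"name": "internal tools", "traffic_share": 0.01, "sample_rate": 1},
]
# Same rules, but someone accidentally turned sampling off for the feed.
BAD = [
    {"name": "consumer feed", "traffic_share": 0.99, "sample_rate": 1},
    {"name": "internal tools", "traffic_share": 0.01, "sample_rate": 1},
]

print(check_rules(GOOD))
print(check_rules(BAD))
```

Run against a proposed configuration before deploy, a check like this would flag the accidental "keep everything" change instead of leaving it to be discovered afterwards in usage mode.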

And that’s it. Thank you for listening. I’m happy to answer any questions you have.


Ben Hartshorne [Engineering Manager at Honeycomb]:

Kevan, what a great tale.

Kevan Carstensen:

Thank you.

Ben Hartshorne:

Thank you for that wonderful compliment. It just runs. After it’s set up, it’s stable. That was an unexpected pleasure to hear.

Kevan Carstensen:

Well, we appreciate you giving us stable software.

Ben Hartshorne:

So there’s a question coming from Liz in Pollinators, who is — as she calls it — “the #discuss-arm64 channel in Pollinators” broken record. She asks: did you consider Graviton2 for a 40% cost savings?

Kevan Carstensen:

We did not. But that’s a really compelling sales pitch, and it’s definitely something I’m going to look into after this.

Ben Hartshorne:

Oh, great. Yeah. There’s more detail in the channel there and in blog posts. We’ve had a lot of fun migrating to Graviton2.

You know, you talked about setting up Refinery and sampling. What was your path towards winding up at a stable collection of using the rules-based sampling, and what were the tradeoffs you walked as you wandered between dynamic and rule-based sampling?


Kevan Carstensen:

We started as Refinery beta users. For anyone not familiar with it, this was essentially the current Refinery but hosted by Honeycomb and with a nice UI on Honeycomb that we could use. Just from that, we had a pretty good set of rules. We knew they were a good fit for our data. We knew they worked for us. 

That said, we ended up adopting Refinery before the rules-based sampler was in the hosted code, and so starting with the dynamic sampling was kind of forced on us. That was the sampling available to us. We ended up with a rule set that did pretty well. 

But ultimately, Refinery gained that rule-based sampler back again. Rather than spend more time trying to craft a sampler key configuration and other configuration that would give us the sampling we needed while fitting our business needs, we chose to port those existing rules that we knew worked to our configuration and kind of run with those. I feel like we probably didn’t get as much bang for the buck out of dynamic sampling as we could have, but it was a decision based on time and on the people that were available to tune that. It’s better to just go with what we know works.
