CircleCI: The Unreasonable Effectiveness of a Single Wide Event
Using only a single event for each valuable business transaction, and packing it with fields, we can reduce event volume at scale so you don’t have to sample them. In this talk, Glen covers how their team made use of this technique at CircleCI to great effect, with real examples of questions this approach has allowed them to answer.
Glen Mailer [Senior Staff Software Engineer|CircleCI]:
Hi everyone. My name is Glen Mailer, and you can find me on Twitter and on GitHub. And I’m going to be talking to you today about some work we did at CircleCI. Now to kind of set the scene and give you some background, I am a huge fan of Honeycomb. I’ve really drunk the Kool-Aid. And I’ve found it to really fill a gap I’m experiencing with my existing tools. And I can imagine if you’re watching this talk, I suspect you’re also hooked. And what you’re likely to be hooked on is seeing trace graphs like this. This sort of insight into what’s going on inside applications seems to be driving a lot of Honeycomb adoption but also a lot of application improvements. As you look at the graphs and go, my application is doing what? And then go ahead and fix it.
Another application architecture (getting increasingly Byzantine) — and people are crying out for better insight into what’s going on across their systems. As we carry out these huge microservice architectures with hundreds or maybe even thousands of services, we often create a span for each client call, each server handler, every database call to every one of our many databases, and this starts to add up. This insight comes at a cost. So Honeycomb quite rightfully charges us per event processed at the back-end. And you might end up with a usage page that looks like this one. It’s a vicious or maybe virtuous cycle. So the more value you get out of it, the more services and events you want to wire up. The more events you have, the more value you get out of it, and it keeps on snowballing. But Honeycomb does have an answer for us. The standard response is Dynamic Sampling.
To explain, I’ve got a diagram. So imagine each of the gray boxes is an event. Without sampling, we’ll keep every event that’s been produced. The idea with Dynamic Sampling is to identify events that contain some interesting properties — maybe keep every event that contains an error or lasts more than a certain threshold. And we throw the rest away. And Honeycomb has a neat feature here where it keeps track of how many were thrown away so it can work out averages, percentiles, things like that. I never understood how it does it. But there’s a paper, and I trust it, I think.
We end up with gaps in our distribution; we’ve sampled out some events but kept others, so we get partial traces. We’ve lost some of the valuable contexts of what events caused other events and how these overlay together.
But luckily, once again, Honeycomb has an answer for us. The tool here is called Refinery. Now the idea with Refinery is that instead of having all of our services talk to Honeycomb directly, we introduce Refinery in between as a buffering proxy. We deploy it on our own infrastructure and it will collect those events it’s receiving, it will apply the sampling rules to a whole trace at once, and then it will (if the rules are passed) send a bundle of events to Honeycomb. That means we don’t end up with the gaps in our traces anymore.
To see what that looks like, we take our demonstration model — and I’ve drawn boxes to represent how these are related in traces. Now, something’s interesting happened here; in the original diagram action, I started with gray boxes and colored some in at random. And when I applied the boxes around, if you look closely, you will see every single one of the boxes has an interesting event in it. This means we’re going to keep all of the gray boxes that we were going to throw away last time.
Now we’re back to keeping all of our data which is not what we wanted when we introduced the sampling. And the only real way to deal with this is to change our definition of interesting. We change the criteria, or we decrease the sample rate so that more events are filtered out, and now we have some boxes that are fully gray, and once again we can throw that away and keep some. So we’ve now gained the ability to keep traces together. We’re not getting gaps, but we had to throw more data away or more interesting data away to get there.
To think about what this looks like as a model, if we apply this and think about it carefully, let’s say we keep 1% of all traces but then we keep 50% of slow traces and 100% of errored traces. So this sort of thing is probably a decent balance between cost and benefit in terms of being able to keep what’s probably interesting and throw away what isn’t.
But what if we want to query something that we’re sampling on, for example, how many traces were initiated from bot users? The answers we get from this, after Honeycomb has multiplied the sample rates back to us, will be an estimate. And how accurate that will be will depend on whether the property, in this case, a bot user, whether that’s evenly distributed across the axes of sampling (slow and errors). So if for instance if it’s very rare that bot users produce an error, we will get an answer that doesn’t quite make sense — maybe? I think. I always find this a mind-bender, I can’t quite get it to fit in my head, okay, when I’m asking a question like this and I am using sampling, how much can I trust my answer?
We’ve taken the trace waterfalls that everyone liked and drove all that adoption, but then because of scale and cost issues — which are good problems to have — we ended up operating a new software component, we’re delaying events by holding them in a buffer, we’re throwing away a bunch of data that we hope we didn’t need. And in some cases, we’re lacking intuition on how trustworthy our query results are. And this is starting to sound like the setup we adopted Honeycomb to get away from in the first place.
This talk isn’t actually about any of the problems I’ve just described. That was really just me trying to set the scene. Hopefully, some of those things ring true with your own experiences of adopting Honeycomb. But anyway.
You’re now hopefully on the same page as me for the problems I was thinking about. One solution to dealing with the problems is instead of trying to layer more software on more complexity, take a real step back. And you might be surprised to hear that before Summer 2018, Honeycomb didn’t really support rendering traces at all, but it was still useful.
If we don’t have traces, what do we have instead? Events. The single-wide event.
We got our application and we have it send a bundle of data related to an important business transaction, and I use the term “transaction” loosely in this sense — a thing. And we send that to Honeycomb in one go. And we keep 100% of the events. If we have other services involved in servicing this transaction, we get them to pass their information on to the application that’s doing the sending. We will store that information in memory or maybe if it’s a longer running action, put it in the database and pull it back later. And over time, we keep adding more and more stuff. It has almost a gravitational pull.
And you might notice, it’s quite similar to the Refinery model earlier. The arrows along the top tend to be bespoke, we’re not just using standardized tracing protocols. And rather than sending a bundle of events over to Honeycomb, we’re sending the one big one. So this probably sounds vague and not particularly useful. So let me try to make it more concrete.
At CircleCI, we start sending a single wide event for each task we execute on the system — the Task End event. We also have distributed tracing in this system, but this is separate from that. An addition. The tracing is subject to dynamic sampling and all the problems I described earlier.
This one, we keep 100% of them and we don’t set a trace ID, so it doesn’t participate in the trace. But there’s still enough context to match them up to traces if we did keep a trace for some reason. Now executing tasks is the core part of what our product is. And what we charge people for. So any cost we incur from keeping 100% of the events is directly proportional to the revenue we’re charging to customers. And it means we don’t have to worry about “can we afford to keep all of these?” We know we can, because there’s just one, and this is part of the core revenue stream.
We keep a lot of facets about the build, how long did it take to run, how long did it queue, did it pass? And a lot of environmental factors like where did it run, what versions of things, and so on. Really anything we can come up with that we think is useful. And we kept adding to this over time. The more we use it, the more we think, “yeah, this will be handy”. And this allows us to answer a lot of questions with a single event.
For example, we deployed a canary on a particular AMI. And we said, okay, let’s have a look at any customer builds that failed on that particular version. Would they normally have failed? Did they fail because we’ve done something wrong or fail because it’s a standard test failure? And we’re able to pull that out quickly.
We had a report from another customer saying they were experiencing some surprising slowness. So we were able to use their build URL, filter by events, and see, yeah, they had an infrastructure failure, but it wasn’t a widespread issue and wasn’t affecting loads of customers, just this one person.
A few months back, there was a Docker Hub outage. In a partial state of failure for a good sort of hour or so. And we can see quite clearly on the graph that it’s really caused our overall build success rate to take a hit. And it did come back steadily afterward. And that was really easy to see on this graph very quickly.
Recently we made some changes to the checkout step. So when you do a Git checkout, and when we canaried that, we were able to do a side-by-side comparison over the course of a few days. And we can see yes, we did improve tail latency by a significant amount. This is all using the same event.
And we can answer some more business- and product-focused questions. So we launched a new ARM-based executor using the ARM CPUs and could track which customers are using that executor. Like what sorts of things are they running on. See their builds.
And another thing we did recently, how much data a customer’s storing using our built-in S3-based step. So this is when you store caches, workspace, artifacts. And like how do customers compare to each other? Are a few customers doing all of the data storage or is it even? And we were able to identify some outliers and reach out to the customers and help them optimize. And this is once again, bundling more and more fields into the same event and getting used to slicing and dicing against it.
There’s been an uptick in crypto mining across a number of CI platforms. And we can use this event to say, are there suspicious-looking builds that run for a long time or use loads of CPU that we can dig into and say are they legit or not?
Let me do a worked example of something a bit more involved.
A couple of months ago, we have this concept in CircleCI called an infrastructure failure. And this is a build failure, and it was CircleCI’s fault, not your fault. And we saw a spike in infrastructure failures. We saw a spike of errors in the category of “prepare executor error”. So in the process of preparing an executor, something went wrong. And we got this spike. Now we can pull up a Heatmap. So what I have here say Heatmap showing a zero, there was no infrastructure failure, and one for yes, there was an infrastructure failure, and then using the BubbleUp tab, draw a box around it. And Honeycomb will highlight anything interesting about that box.
So get a look at the facets here. Basically, the yellow bars are things that were inside my box and the blue bars are things outside my box. And the first couple rows here are basically various different ways of saying, yes, that did fail. That failed for infrastructure failure. However, the first interesting one is really that bottom right one there which is the executor. And it’s — this event was able to show us that 95% of all the failures were happening on the Mac OS. And after a bit more investigation, we were able to identify that it was a networking issue with our Mac OS data center provider. We got in touch with them. While the Mac team was working with them, we were able to use the same event to get the list of affected customers. Modify the query to say “give me every build URL where it was executed on Mac OS and there was an infrastructure failure.” And I have a list of affected customers that can hand off to the support team and they can do proactive outreach or be prepared when tickets come in from these customers.
Now I use the term “single” because they’re not part of a big trace, not connected to a lot of other events. But that doesn’t necessarily mean we’re only allowed one event per transaction. We can have as many events as we deem worthwhile. And if we tie them to important actions in our system, it’s easy to predict how many we will get and therefore how much it will cost. When you’re dealing with traces, it’s quite difficult to think, okay, I’ve got one request, how many sub spans is that going to create as it ripples its the way through our microservices system? It’s hard to predict ahead of time. The single event, okay, I might get 10,000 registrations. I get a hundred registrations a day, therefore, I’ll get a hundred of these events and it’s really good for when you’re trying to model how much you’re going to pay Honeycomb and how much you’re going to pay next month.
So that task end event has an associated task start event. Which is like how long has a build queued for. So we have two events happening from the same transaction at different parts of the life cycle but still, it’s 2N. You know, it’s not going to grow exponentially as our system scales.
Another event we added to our system is the Docker pull event. So each time out, our core system does a Docker pull on behalf of a customer, we store an event. Now a single task can have zero or many Docker pulls so that’s why we couldn’t just stuff them into the task-end event. If you use the machine executor, for example, then you won’t have any Docker pulls that we do for you. Whereas if you make use of the secondary containers to run databases, you can run two, three, four, five additional containers next to your tests. And the other reason not to put into the existing task-end events is that task-end only happens when a task finishes, so if a task is going to run for five, 10, 15 minutes, we want to release some of that information earlier. So if we’re seeing errors around Docker pulls, this event will show those up sooner than if we’re buffering them. And that’s a situation you can run into with Refinery. It’s buffering stuff up until the trace finishes, but if the traces are taking a while to complete, you’re not seeing the events from the start of that buffer.
Now once again, similar to the task-end event, this isn’t connected to the trace graph, but I have enough fields on here that we can make that link directly. One thing we’ve been meaning to look at for a while is using, I think they’re called event links. There’s some sort of feature in OpenTelemetry that allows you to link one event to another without making it part of a trace. Not something we’ve explored yet. But as well as which build is this from, we store information about the image, the registry, how long did it take to pull — which is generally proportional to network issues — and how long did it take to extract. So Docker images get downloaded which have to be unpacked and that is dependent on CPU and disk speeds so that can highlight a bunch of different issues. And then as before, we store a bunch of information around the environments like the resource class, the instance, the AZ and the versions of the software involved so we can hopefully give BubbleUp plenty of juicy stuff that it can facet against.
And this allows us to answer a bunch of different questions. We initially did this work to help support work around Docker start-up performance. We’ve seen a steady increase in complaints from customers about builds taking a long time to start up. This allows us to really nail down where that was coming from and why. And then as we made changes and launched canary versions, we could see the impact on this event. And then shortly after we added it, Docker announced that they were going to start rate-limiting pull on unauthenticated builds. And that affected a lot of customers. We were able to put together a partnership with Docker to handle the majority of cases, but we still had a few customers directly affected. This event highlighted those directly and we were able to reach out.
And then we add — well, we looked at that field for download speed and were trying to say how do we speed up downloads in our system. But we took a step back and said, “why are downloads slow?” And the insight we had here was to — we made a derived field for the registry which included the AWS region. We discovered that a number of customers in the Asia Pacific regions were affecting our SLI because they were pulling, and the AWS network speeds really drop off significantly. So if you’re in the same data center, 100 megabits per second. On the other side of the U.S., you drop down to 20. And out to Asia, you’re down to 5 megabits per second which is a huge drop-off. So this is affecting our SLI. And we could just exclude these events and get to numbers better but that’s not really improving the customer experience. So maybe it’s a docs problem — we should be more forceful to warn users “Hey, you’re pulling from a region that’s inefficient. Maybe you should do something else about this.”
The other interesting thing about this data which I still can’t get my head around, if you do a Docker pull from AWS, where the source image is in Google’s container registry, that is somehow faster than pulling from the AWS region you’re in. I still can’t explain that.
Anyway, and in conclusion, distributed tracing is really great. I love distributed tracing. However, a single-wide event can answer a lot of questions. A lot of questions. And when you tie them to business transactions, they can be extremely cost-effective. Knowing okay, here’s a revenue-generating event and storing one thing at the time and paying the small per event charge, keeping 100% of those events should be easy to reason about. Whereas when you’re trying to deal with dynamic call graphs of distributed tracing, that’s a lot harder to get your head around.
But the key thing here really is everything is a trade-off. In an ideal world, we simply keep every single event in every single trace. And if you’re at the scale where you can afford to keep every single scale in every single trace, I really recommend you do that. But that’s unlikely to be viable for any system with significant traffic. And the strategy today is mostly driven by Honeycomb currently works and currently charges. So maybe in the future, as Honeycomb adds more features or different approaches become viable, then we’ll be able to make a different trade-off, but this is what I’ve got today.
Well, thank you very much. I am Glen Mailer. And if you have any questions, now’s the time.
Ben Darfler [Engineering Manager|Honeycomb]:
Let’s give it up for Glen. Glen, that was great. I really love that approach and how it plays to the strengths of Honeycomb and our platform and how it works and manages the costs but still gives you the power. BubbleUp really shines in that use case. I’m curious though, like, you said there’s tracing in other places or other areas, like what do you see is the trade-off between the two? When would you pick up tracing versus this approach?
Fundamentally, it’s really just cost. And this is what leads people to tracing in the first place. Metrics are costing too much, or logs are costing way too much. You’re at web-scale and you’ve got terabytes of logs a week and you’re finding a way to handle this. And I think what I came back to at the end, there is no universally correct answer apart from like if you have infinite storage somewhere. So yeah, you have to pick the trade-off depending on what you’re trying to do at any one time. So if you can, I think the key thing, when it’s related to a business transaction. When it’s something really important. You probably don’t want a single wide event 100% sampled for every API call. But you probably do want one for every log-in or every registration certainly. So yeah, I guess to take your Honeycomb product example, you probably don’t have many because it’s a very sort of, like lots of small transactions, but in CircleCI, we run a lot of builds. Like millions a week. But that’s not actually that much when you turn it into a per-second number.
Yeah, yeah, I love that idea of tracking it back to business. We’re here to have the business value, have it connected back to actually, yeah, being here for what we’re trying to do, serving customers and making business. Makes a ton of sense.
Remind people to drop Q&A in the Slack channel if they’ve got it. I’m hoping for more Q&A. Yeah. I don’t know if I’ve got others up ahead.
For us, it was that — I had this picture in my mind when I was putting the slides together of this graph, like, a bit like the diagram, you have all these calls together. And you’re like, it’s a bit like the old Joe Armstrong quote, you want a banana but that connects to the gorilla and it connects to everything else. You’ve got to keep all of it because that’s the way the system’s wired. I think there’s a middle ground between these two things which I want to investigate more.
There’s a guy on the Honeycomb Slack called Pedro and he talked about this once. This idea of like, at the moment dynamic sample, you keep a whole trace or throw away a whole trace. And I like this idea of, what if I have like half a trace on purpose? Like what if I keep the top-level span from all of the services that were involved but none of the sub-spans but then on like 10% of traces I get the full fat, the big ones. And just what can we sample against other than just keep the whole trace or not the whole trace. And I think there’s so much to explore in that space.
Yeah, yeah, that is fascinating. That you’ve got this, right, I don’t know how to separate it, but at the edge of each service, you’ve got something higher valuing, maybe you turn that on when you have a performance issue, and you need to see the smaller-grained steps — but you can filter that out at different times too.
And each step has a different level of cost. I’m storing it in memory for the duration of the request. That’s cheap. Well, it costs memory. And now I’m sending it to my collector and maybe the collector on my node and my Refinery and cluster deployment, and it’s going to bundle together and then to Honeycomb, and maybe like Honeycomb can throw some away later (your old Refinery product was in that space). There’s — each step has an additional cost but also gives you more things to look at once. Right. Again I’m going to trace every single database call but I only care when I’ve got a slow request — maybe.
Right. Totally. To skip to a different idea, you said in one of the threads in Slack already, if you enabled JSON in packing and accidentally send something huge, you could get fun results. I would love to know about the fun results.
This wasn’t actually me, somebody came in one time and was like, I’m getting like a rejected payload from your API. And then after some digging, it turned out they had a deeply nested JSON object with maybe the unpacking unpacks arrays as well so, like a hundred web requests or something so, again, blew the 10k field limits, which doesn’t happen very often.
Yeah, that’s great. Got a question here, you drank the Kool-Aid but when did you take the first sip? How did you discover Honeycomb? What problems were you trying to solve?
Really good question. So I think the thing that got me hooked, at Circle where we have fairly infrequent but like — we do a lot of builds it’s not like hundreds a second. And they all come through the same pipe and all look completely different. So this is a perfect high-cardinality example. But I think I got interested in Honeycomb for like a couple of years before that, following the company since it was quite early. And I think part is what’s led me to stick to this wide event thinking because like I used Honeycomb or at least I watched some Honeycomb videos and tried demos before you had tracing support. When it was just all you have is events and you can look at one event at a time and that better be good enough. And it was still really powerful with only that. And I think in the excitement to jump on the tracing bandwagon, I think we’ve lost the fact that it actually worked well before tracing.
Yeah. Yeah, and then to follow on, how did you get to this, like this particular use case here? Like again, going back, you said there is tracing. So what drove you here?
Yeah, it’s sort of, like we had a structured log that did something similar. And it just wasn’t as easy to query. Using like a self-hosted gray log and it was very expensive and couldn’t index. And treated everything like a string because there was some Java interface in between that only had a map of string to string. So yeah, Honeycomb was a chance to give us a fresh start and design our logs from scratch in a way that indexes the things we care about. And we basically just wrote custom code. Okay, we’re not going to use — we use the Beeline at the end, but there’s no auto instrumentation. Just I’m going to think about my application. And I’m going to make a map of string to stuff. And I’m going to send it to Honeycomb and get it back out later.
So the question on that, this person said, might have missed it, how did you instrument your code to get the data out, like the events out the make that major one single event?
Yeah, so generally, we sort of have an implementation where there’s a per-request, like a singleton almost — it’s a map. And then throughout the code, there’s like the abstraction allows us to bundle stuff into that and flush it at the end. And in a few places where there are cross-service boundaries, we have custom implementations in each spot to say here’s some data, do what you want with it, and the service bundles it all together and we’re able to grab tags from AWS and grab results of the build and the inputs and all these things. So it’s not like, it’s like the opposite of OpenTelemetry but it’s really powerful.
Thank you, Glen. We’re going to wrap up here.
If you see any typos in this text or have any questions, reach out to email@example.com.
Ep. #29, Testing in Production with Glen Mailer of CircleCI
In episode 29 of o11ycast, Charity and Shelby are joined by Glen Mailer of CircleCI. They discuss testing in production and rethinking socio-technical systems from the ground up.