
How to Use a Single Wide Event to Its Fullest Extent


When running at scale, the typical small trickle of observability data can quickly become a flood. In those cases, sampling your trace data is recommended. But according to Glen Mailer, Senior Staff Software Engineer at CircleCI, not everything can or should be solved with tracing and sampling. As he explained in his talk at hnycon, the solution to production data volume issues can also be found using a single wide event as an alternative to how most tracing is currently done.

Initially, the benefits of using Glen’s “single wide event” approach were presented as mostly budgetary—it’s easy to wrap your head around how much an event costs (and therefore keep costs down) rather than trying to deal with dynamic call graphs of distributed tracing. But as his talk went on, a larger point became clear: using flexible tools like Honeycomb that reward this kind of out-of-the-box thinking allows you to experiment and find your own solutions.

How a single wide event works

When you use a single wide event, you gather all the data related to a specific action into one event, which you then send to Honeycomb. Glen called them “single wide events” because they’re not connected to a big, multi-part trace, which means they’re not connected to a lot of other events. It’s just one event with a lot of fields.
CircleCI bundles all data related to a business transaction into a single event, which they then send from their application to Honeycomb in one go. Glen said they keep 100% of these events.
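The idea can be sketched in a few lines: accumulate every field for one business transaction into a flat structure, then emit it once when the transaction completes. This is an illustrative sketch, not CircleCI's actual code; the field names and the `send_to_honeycomb()` helper are hypothetical stand-ins for a real Honeycomb SDK call.

```python
import json
import time


class WideEvent:
    """Collects fields for one transaction and sends them as one event."""

    def __init__(self, **initial_fields):
        self.fields = dict(initial_fields)
        self.fields["start_timestamp"] = time.time()

    def add(self, **fields):
        # Keep widening the same event instead of emitting child spans.
        self.fields.update(fields)

    def send(self):
        self.fields["duration_s"] = time.time() - self.fields["start_timestamp"]
        send_to_honeycomb(self.fields)  # one event, many fields, no sampling


def send_to_honeycomb(fields):
    # Hypothetical stand-in for the real API call (e.g. an SDK's event.send()).
    print(json.dumps(fields, default=str))


event = WideEvent(build_id="b-123", customer="acme")
event.add(resource_class="large", executor="docker")
event.add(outcome="pass")
event.send()
```

The key property is that cost is one event per transaction, no matter how many fields you add along the way.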

If there are other services involved in servicing the transaction, CircleCI will send relevant transaction data from each service to the application, which then bundles it all up into one event to send to Honeycomb. Glen said they’ll either store that information in memory or in a database to pull later if it’s a longer transaction.
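That bundling step can be sketched as a store keyed by build ID, where each service posts its slice of data and the application merges everything into one event at the end. A plain in-memory dict stands in here for the memory-or-database store Glen describes; all names are illustrative.

```python
pending = {}  # build_id -> accumulated fields from all services


def record_fields(build_id, service, fields):
    # Namespace each service's fields so they can't collide.
    bucket = pending.setdefault(build_id, {})
    bucket.update({f"{service}.{key}": value for key, value in fields.items()})


def finish_transaction(build_id, **final_fields):
    # Pull everything gathered so far, add the final fields, and return
    # the single wide event, ready to send to Honeycomb in one go.
    event = pending.pop(build_id, {})
    event.update(final_fields)
    return event


record_fields("b-42", "provisioner", {"boot_ms": 830})
record_fields("b-42", "scheduler", {"queue_ms": 120})
wide_event = finish_transaction("b-42", outcome="pass")
```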

There are two major unique aspects to CircleCI’s single wide event process when compared to sampling with tools like Refinery:

  1. Data sent from other services tends to be bespoke and typically doesn’t use standard tracing protocols. 
  2. Rather than sending a bundle of events over to Honeycomb, you’re sending one big one.

Why a single wide event is useful

Compared to dynamic tracing and sampling, the budget for single wide events is easier to forecast, and they can still answer a lot of production questions.

In an ideal world, you would store every single event for every single trace. For a system with significant traffic, this can be too expensive, so businesses typically turn to dynamic sampling.

But tracing and sampling can fail to keep costs down if you’re not careful. As your system scales, you’ll need to add tracing and refine sampling capabilities alongside that growth. With wide events, as Glen put it, “[Two wide events are] not going to grow logarithmically as our system scales.”

In other words, it’s just those two events. What grows as you scale are the fields you add to those events. Glen said that with this process, instead of managing tracing and sampling, you’re just adding more fields to an event and getting used to slicing and dicing it.

CircleCI’s task-end event

CircleCI sends a single wide event, called the “task-end event,” for each task they execute on the system. This event includes fields like:

  • Build ID
  • Build URL
  • Customer
  • Resource Class
  • Executor
  • Queue Time
  • Run Time
  • Pass/Fail
  • Infrastructure Failure

CircleCI keeps 100% of task-end events, which are kept separate from dynamic sampling and don’t participate in the trace because CircleCI doesn’t set a trace ID. If they later need to match a task-end event to its trace, they can do so with the Build ID and Build URL.

That approach aligns particularly well with their business needs. CircleCI’s core product is executing tasks—it’s what they charge people for. This direct relationship makes it easy to justify the cost of keeping 100% of task-end events. “Any cost we incur from keeping 100% of the events is directly proportional to the revenue we’re charging to customers,” Glen explained. “This means we don’t have to worry about ‘Can we afford to keep all of these?’ We know we can, because there’s just one, and this is part of the core revenue stream.”

Using task-end to fix infrastructure failures

At CircleCI, an infrastructure failure is when a build goes down and it’s not the user’s fault. Glen described a situation where there was a spike in infrastructure failures—specifically, a spike of errors in the process of preparing an executor.

By looking at task-end, CircleCI was able to see into each transaction and identify that 95% of these errors originated from macOS. Once the problem was identified, it was a simple matter of speaking with their macOS provider. An added benefit of task-end is that CircleCI was able to isolate each affected build and then manage the customer relationship to keep them happy.

Adding more single wide events

CircleCI adds single wide events to the most important actions in their system. There are two others Glen mentioned that fill in gaps task-end can’t address.

  1. Task-start event: Each task-end event has an associated task-start event with similar fields. Together, they tell how long a build has been queued for.
  2. Docker-pull event: This event is added because task-end only happens as a task finishes, which isn’t enough information to address issues around speed. If a transaction takes a while, it’ll be hard to understand what’s happened between task-start and task-end. So, CircleCI uses docker-pull to check in while a transaction is running and keep tabs on transaction speed.
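The task-start/task-end pairing above can be sketched as a small derivation: given timestamps from the two events for the same build, you can compute how long the build sat in the queue and how long it ran. The timestamps and field names here are illustrative; the real events carry many more fields.

```python
from datetime import datetime, timezone


def parse(ts):
    # Treat all illustrative timestamps as UTC.
    return datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)


def build_timings(task_start, task_end):
    queued_at = parse(task_start["queued_at"])
    started_at = parse(task_start["started_at"])
    finished_at = parse(task_end["finished_at"])
    return {
        "queue_time_s": (started_at - queued_at).total_seconds(),
        "run_time_s": (finished_at - started_at).total_seconds(),
    }


timings = build_timings(
    {"build_id": "b-7", "queued_at": "2021-06-10T12:00:00", "started_at": "2021-06-10T12:00:45"},
    {"build_id": "b-7", "finished_at": "2021-06-10T12:05:45"},
)
```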

A docker-pull event includes some of the fields task-end does, like Build ID and Build URL, and also includes unique fields like:

  • Container Image
  • Container Registry
  • Pull Size, Duration, Speed
  • Extract Size, Duration, Speed
  • Create Time
  • Errored?
  • Resource Class
  • Docker Version
  • Etc.
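The derived fields in that list can be sketched as simple arithmetic over the raw measurements: pull and extract speeds fall out of the corresponding sizes and durations. The function and field names below are hypothetical, chosen only to mirror the list above.

```python
def docker_pull_fields(build_id, image, pull_bytes, pull_s, extract_bytes, extract_s):
    # Speeds in megabits per second, derived from bytes and seconds.
    return {
        "build_id": build_id,
        "container_image": image,
        "pull_size_bytes": pull_bytes,
        "pull_duration_s": pull_s,
        "pull_speed_mbps": pull_bytes * 8 / pull_s / 1_000_000,
        "extract_size_bytes": extract_bytes,
        "extract_duration_s": extract_s,
        "extract_speed_mbps": extract_bytes * 8 / extract_s / 1_000_000,
    }


fields = docker_pull_fields("b-9", "python:3.11", 250_000_000, 20.0, 900_000_000, 30.0)
```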

Using docker-pull to fix speed issues

Using the docker-pull event, CircleCI was able to deal with an issue where clients in Asia were seeing very slow speeds. With docker-pull, they found out that many of these clients were using North American data centers, which was not a problem with CircleCI’s infrastructure. They used this information to update CircleCI’s documentation to warn users when they’re pulling from a region that’s inefficient.

Finding solutions doesn’t have to be expensive

These wide events are just one tool in CircleCI’s toolkit. Glen explained that they still use traces and this wide event solution is based on current Honeycomb pricing and processes. His ultimate point is that Honeycomb affords businesses a lot of flexibility in how they approach observability—a little out-of-the-box thinking is not only encouraged, but rewarded with a manageable budget and a unique tool for solving production issues.
