Building Observability in Your CircleCI Deploy
In this talk, you’ll learn how Honeycomb keeps its CircleCI workflow duration at about 10 minutes per build through parallelizing build steps, using native container builders per architecture, and tracing execution of the build to know where to optimize.
Liz Fong-Jones [Principal Developer Advocate|Honeycomb]:
Hi. I’m Liz Fong-Jones, and I’m a Principal Developer Advocate at Honeycomb. And I’m excited to be joined today by Ryan.
Ryan Pedersen [Senior Solutions Engineer|CircleCI]:
Hi. I am Ryan Pedersen. I am a Senior Solutions Engineer at CircleCI and very excited to be here and talk about CircleCI and Honeycomb.
So the reason that we invited Ryan and CircleCI here is because we believe very firmly in the idea that you should have a 10 to 15-minute build pipeline or bust. That is really something that delivers compounding benefits with observability. So let’s make sure that we talk about the basics first. Ryan, what is a CI pipeline? How does the CI pipeline help you as a developer?
Yeah. In the olden days, what really would happen is someone would wait. They might have a 30-minute pipeline to get validation on their code, run the tests they need to run, build the Docker images, do whatever they have to do to get something in a deployable state. And, really, we want it to always be in a deployable state, so every time there’s a change, some sort of trigger, we want to go through all those validation steps based on some of the other parameters like the branch and make sure that developer is getting feedback as quickly as possible. They don’t have to walk out, get coffee, go take a walk to get that feedback and validation. They want to get it immediately and know if something broke.
Yeah. And I’ve even seen environments where people were deploying binaries that they built from hand on their own machines, right, which scares me. It scares me a lot. So, we can do so much better than that and I think this is why we’re really excited to be a CircleCI customer. So would you like to share, maybe, about how this actually works in practice?
Yes, I would love to talk a little bit about how CircleCI works in practice and how we stop people from having to say, “that worked on my machine.” So we’ll drop into CircleCI. So at CircleCI, our mission is to empower technology-driven companies to do their best work. We do this by making engineering teams more productive through intelligent automation. When we say “idea to delivery,” we really mean anything from that ideation stage until the time it hits the hands of the customers. The bug fixes, implementations, big features, new products, we help developers and teams focus on what they’re hired to do, and we handle that automation piece.
CircleCI has been around for over a decade. Still growing really fast. We have customers in all types of industries, different sizes, different points in their trajectories, anything from single developers building a product to unicorns to Fortune 100. We work with thousands of customers. We run millions of builds per day. This means we’ve seen it all, and the platform is highly flexible in terms of the things one can do on it. In terms of the themes that we see users at CircleCI looking to solve for, these are some of the major ones that pop up: speed, reliability, flexibility, determinism, and scalability.
How we quantify some of those major themes is with the DORA metrics. We’re very big on those. We just want to focus on the types of important benchmarks that we see high-performing teams use. Now that we have set the stage on those underlying themes, let’s define some terms and talk about how CircleCI actually works. We’ll start with the configuration hierarchy from top to bottom. The highest level is pipeline. When we say pipeline at CircleCI, you can think of it as all the workflows triggered on your project with the relevant parameters.
Workflows coordinate a set of related work from start to finish. It’s the orchestration of jobs: really, which job is running and in which order. A job is a collection of steps that run in the execution environment that you specify, and a step is the smallest segment of work, just a command run during a job. I mentioned that each job is a collection of steps in an execution environment. With CircleCI, there are a lot of options for you to choose from. You can choose from one of our cloud environments, Docker, Mac, Windows, VM, or it can be your own custom host runner. All those resources that we see, those environments, we see a workflow using them all together over on the left.
On top of that, for our cloud resources, we have a large variety of resource classes available. We see our example Docker executor resources on the right, anywhere from 20 CPUs, 40 gigs of RAM, a huge amount of vertical scale, all the way down to one CPU, 2 gigs of RAM. It really lets you find your perfect Goldilocks zone. We talked about writing config. You can absolutely write out all of the commands you want in the environment that you specify, but you can also utilize orbs. Orbs are CircleCI’s package manager. They are reusable configurations as code. Think best practices, custom scripting, all of these things bundled up into a few or single lines of configuration.
A few examples from a registry of partners and CircleCI orbs to help you accomplish whatever goals you have. But, in particular, what I want to highlight is the Honeycomb orb. You can use CircleCI to send build events to Honeycomb. It includes best practices built-in. Really, just think of it as plug-and-play. Now that we’ve covered the basics, let’s get on to the demo with Liz.
All right. So this is what the CircleCI UI looks like for us at Honeycomb. So you can see here that we have one pipeline which is called Hound. And the workflow is called build hound. This is essentially what we do every time someone pushes a commit to any branch inside of our repo. And you can see over here that this is taking roughly about 14, 15 minutes to run, which is at the edge of what we consider acceptable. So in practice, we can go ahead and dive into what’s happening inside of one of these steps to give you an idea of what this is doing under the hood.
So for instance, we’re primarily a Go shop. If I go into the Go build, you can see that we attach our existing workspace, which allows us to share files between the different parts of the project that have run before. Then we run the Go build: it installs the commands that we need, and then, inside of here, it runs the go install command to produce the binaries that we expect.
And then, finally, we persist the files that we need up to the workspace. That way they’ll be available for later steps in our job such as packaging for deployment and releasing all the way out to production. So when we package it up for deployment, we are currently building tarballs that we upload to AWS S3. And, again, this is something where we simply go ahead and retrieve the files that we need, construct all of the executables, make extra sure that there actually are executables in here because this has been missed in the past when we released buggy code. And then after that, we go ahead and upload the release to S3. And then after that, we can go ahead and have something that we can promote to production.
But maybe that’s a little bit hard to understand just from looking at that list, right? After all, it’s a list. So this is where the CircleCI orb for Honeycomb comes in. The Honeycomb Buildevents orb enables you to very quickly and easily export all of your jobs and steps from a CircleCI pipeline into Honeycomb where you can visualize them in the same interface you’re used to from visualizing your traces from your application. So let’s go in and look at the Honeycomb data set for build events. So our build events data set covers all of our trace executions for the Hound repository.
I can see here immediately things like: How long are all of our builds typically taking? And you can see that a majority of our builds are taking less than 15 minutes, except for this particular cluster here, which I can run BubbleUp on, for instance, to find out some of the properties about it. But in this case, it happens to all be at about the same time, and it’s on Tuesday, December 7th, between 8:00 a.m. and 1:00 p.m. So I can probably say we’re pretty confident what’s happening here is an AWS outage that’s preventing us from being able to access some of our dependencies that we need to do our job.
But let’s go ahead and examine in more detail one of those failing events that happened during the AWS outage. So we can see here that we were stuck waiting a very, very, very long time trying to actually get this data uploaded into the build process, right? Like you can see that we actually built all of these artifacts but that it timed out waiting to even go ahead and send that into AWS S3 because S3 really wasn’t that available for us. We weren’t able to get the credentials to push to S3. But let’s look at a more typical build that’s passing, for instance, one from just this morning.
So you can see here this is a build that takes less than 15 minutes, and we can identify all the constituent components of it. So I’m going to go ahead and first collapse all spans in the step so you can see the durations like you might see in the CircleCI view, right, showing, you know, eight minutes, eight minutes, four minutes, et cetera.
But let’s go ahead and dive in a little bit into just one of these things, and let’s talk about why we’re seeing four different instances of Go test here when there’s only one instance of go test in this list. Ryan, what might be going on here?
This is one of my favorite things. So what we’ve done is, if you have a really long test suite and you need to reduce that workflow duration, you can split the tests up over parallel containers. And so it’s test splitting and parallelism at work. That’s a great way to reduce the time in that feedback loop, to get under that 10, 15 minutes that we’re talking about and really speed up those lengthy test suites.
Yeah. So the way that this works is that you basically configure something to tell CircleCI what your list of tests is, and then CircleCI will indicate that some tests should run only in container zero, container one, container two, container three. And what you’re doing is making a trade-off of spending a little bit more in CircleCI credits and getting back a lot in terms of developer productivity.
Because, as you can see, each of these go tests, right, each of these steps would have taken three minutes on its own. So, combined, it would have taken 12 minutes, right, and we wouldn’t have time to do anything else like packaging and deployment. But by parallelizing it, it enables the go test to not be this slow step.
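The trade-off described above can be sketched in a few lines of Python. This is only an illustration of timing-based test splitting under stated assumptions: the package names and the 90-second timings are hypothetical, and the greedy assignment shown here is a generic technique, not CircleCI's actual splitting implementation.

```python
import heapq


def split_by_timings(timings, containers):
    """Greedily assign tests to containers: longest test first,
    each one going to the container with the least work so far."""
    heap = [(0.0, index) for index in range(containers)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(containers)]
    # Sort by descending duration, then by name so the result is deterministic.
    for name, seconds in sorted(timings.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        buckets[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return buckets


def wall_clock(timings, buckets):
    """The parallel run takes as long as its slowest container."""
    return max(sum(timings[test] for test in bucket) for bucket in buckets)


# Hypothetical timings: 8 packages at 90 seconds each (12 minutes in total).
timings = {f"pkg{i}": 90.0 for i in range(8)}
buckets = split_by_timings(timings, containers=4)

serial_seconds = sum(timings.values())
parallel_seconds = wall_clock(timings, buckets)
print(f"serial: {serial_seconds / 60:.0f} min, parallel: {parallel_seconds / 60:.0f} min")
```

With these made-up numbers the total work is unchanged (12 minutes of compute, which is what the extra credits pay for), but the developer waits only as long as the slowest container.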
So if I zoom in, right? I can go ahead and dive all the way down into individual steps within the job, right, and individual commands within that step. So what I can see is I can see that we’re spending a lot of time, oh dear, that we are linting even before we start the test process. If we wanted to make this faster, we might want to kick off the test process in parallel with or sooner than running the lint in order to avoid having this stuff queue and wait for a while.
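To make the lint-versus-test ordering concrete, here is a small Python sketch that computes the end-to-end duration of a dependency graph. The job names and minute values are hypothetical and are not taken from the Honeycomb build; the point is only that work off the critical path stops adding to the total.

```python
def workflow_duration(durations, requires):
    """End-to-end time for an acyclic graph of work items.
    Each item starts once everything it requires has finished."""
    finish_times = {}

    def finish(item):
        if item not in finish_times:
            start = max((finish(dep) for dep in requires.get(item, [])), default=0)
            finish_times[item] = start + durations[item]
        return finish_times[item]

    return max(finish(item) for item in durations)


# Hypothetical durations in minutes.
durations = {"setup": 1, "lint": 2, "go_test": 3, "package": 4}

# Lint runs before the tests, so the tests queue behind it.
lint_first = {
    "lint": ["setup"],
    "go_test": ["lint"],
    "package": ["go_test"],
}

# Lint and tests both start right after setup; packaging waits for both.
lint_in_parallel = {
    "lint": ["setup"],
    "go_test": ["setup"],
    "package": ["lint", "go_test"],
}

serial_minutes = workflow_duration(durations, lint_first)
parallel_minutes = workflow_duration(durations, lint_in_parallel)
print(serial_minutes, parallel_minutes)
```

In this toy graph, moving lint alongside the tests removes its two minutes from the critical path without dropping the lint check itself.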
So by being able to visualize this as a trace rather than just as a list of steps or a list of jobs, this really helps us get a good understanding of where a build is slow and how to optimize it. But it’s definitely pretty well optimized already, thanks to CircleCI’s capabilities.
But there’s another really cool thing that we haven’t yet shown you, which is that there are a number of attributes that we’ve set on our trace beyond just the duration. So, for instance, let’s go back to one of these steps, and let’s look at just the tarball. Let’s suppose I wanted to find out how large the tarball is over time and whether our tarball is getting more and more gigantic as time goes on. So we’ve created a variable that’s called asset size bytes for the ARM64 architecture, because we’re huge users of Graviton2 at Honeycomb, and therefore we want to make sure that we are deploying binaries that are deployable on Graviton2 as well as binaries that are deployable on traditional x86 hardware.
So what I’m going to do is show a heatmap of this property, asset size bytes for the ARM64 architecture, where this parameter exists. And let’s look back over seven days. So, we’ve had a number of binaries that are taking roughly about 350... Let’s have a look at this. I might want to do a max of asset size as well. Right? This is the standard Honeycomb query language. So I can see here that our binaries are about 340 megs. So now I can go back and look at, let’s say, 60 days of data; right? And I can see that here, right here, on November 9th or 10th, something happened that caused our binary size to increase. Right? And we also did something on October 20th that caused our binary size to increase.
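The "something happened that caused our binary size to increase" observation is made visually in the demo. As a rough sketch of the same idea in code, the Python below flags step increases in a series of size samples. The sample labels, byte counts, and the 2% threshold are all hypothetical; they are not Honeycomb's real data or a Honeycomb feature.

```python
def find_size_jumps(samples, threshold=0.02):
    """Return (label, previous_bytes, current_bytes) for every sample that
    grew by more than `threshold` (as a fraction) over the one before it."""
    jumps = []
    for (_, previous), (label, current) in zip(samples, samples[1:]):
        if previous > 0 and (current - previous) / previous > threshold:
            jumps.append((label, previous, current))
    return jumps


# Hypothetical per-day maximum asset sizes in bytes.
samples = [
    ("day 1", 320_000_000),
    ("day 2", 321_000_000),
    ("day 3", 335_000_000),
    ("day 4", 335_500_000),
    ("day 5", 350_000_000),
]

jumps = find_size_jumps(samples)
for label, previous, current in jumps:
    print(f"{label}: {previous} -> {current} bytes")
```

Small day-to-day drift stays under the threshold, while the two larger steps are reported, which mirrors picking out the jumps by eye on the heatmap.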
So being able to plot any parameter that you can extract from your build and to be able to visualize it inside of the Honeycomb UI is super powerful, in our opinion. So let’s see how that’s actually set up in practice. So this here is our CircleCI config file, and we’ve imported a number of orbs, not just the Honeycomb buildevents orb but also other things that we commonly need. For instance, the CircleCI-supplied Slack orb that lets us notify our team on Slack if a build job fails. As well as some other things. For instance, interactions with Kubernetes and ECR, or the AWS CLI for uploading to S3, right? Like, all these things are things we have available to us as a library of components.
We also have a bunch of executors here, right, and these executors are used for performing certain tests. For example, for tests, we use the Go MySQL executor, and then for our base image, we just use the CircleCI standard base set of tools. But let’s go ahead and have a look for buildevents. So, the key thing that you need to do for buildevents is set up two specific jobs. In your setup job, you’re going to start your trace with a step called buildevents/start_trace. And then, after that, you can go ahead and create a watch job whose job is just to wait for all the other jobs in the build to finish. That way, it can report on the status of all of them. So we’ll go ahead and say, you know, watch_build_and_finish, and then it’ll return success after it’s done and send the root span of the trace off to Honeycomb.
That’s so cool.
But you also might, for instance, want to track the status of individual commands within a single step. So this is where the automatic installation of the buildevents binary by the Buildevents orb can help, because now you can run the buildevents command, buildevents cmd, pass in the environment variables that are automatically populated by the orb, and then give it the name. What do you want that span to be called?
And then, after that, you can just push two dashes and then the regular command that you might have run as part of the step anyways, and the nice thing about this is that it lets you track things more granularly all the way down to a single command within a shell script level, and that allows you to have much more granular resolution. Right?
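Conceptually, a wrapper of this kind takes a span name, splits its arguments at the two dashes, runs the real command, and records how long it took. The Python sketch below illustrates only that idea. The function names and the fields of the returned dictionary are hypothetical, it does not reproduce the arguments of the real buildevents binary, and it does not send anything to Honeycomb.

```python
import subprocess
import sys
import time


def split_on_double_dash(argv):
    """Split arguments at the first '--' into (wrapper args, wrapped command)."""
    index = argv.index("--")
    return argv[:index], argv[index + 1:]


def run_as_span(name, command):
    """Run `command`, time it, and describe the result as a span-like dict.
    The field names here are illustrative, not a real event schema."""
    started = time.monotonic()
    result = subprocess.run(command, check=False)
    duration_ms = (time.monotonic() - started) * 1000
    return {
        "name": name,
        "cmd": " ".join(command),
        "duration_ms": duration_ms,
        "status": "success" if result.returncode == 0 else "failure",
    }


# Example: wrap a trivial command under the span name "go_test".
wrapper_args, command = split_on_double_dash(
    ["go_test", "--", sys.executable, "-c", "pass"]
)
span = run_as_span(wrapper_args[0], command)
print(span["name"], span["status"])
```

Because the wrapped command runs unchanged after the separator, adding this kind of instrumentation to an existing script is a matter of prefixing each command rather than rewriting it.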
If I go and look at one of these traces, what we’re looking at, in practice, is what each step did, right, like how long each step took and how long each command took within each step. You can see each of these things, like run migrations and then run the tests, right? Each command is broken out as its own trace span inside of Honeycomb. So, in summary, that’s basically how we think about doing builds at Honeycomb.
So, why don’t we go ahead and zoom back out to the big picture? So that’s how CircleCI helps us at Honeycomb with being able to get 10 to 15 minute builds or bust and that way people never feel tempted to skip the tests when they’re checking something in because they know it’s always going to give them feedback within 10 to 15 minutes of pushing a commit up.
Additionally, we think that it’s really powerful, not just for us, but for everyone who is both a CircleCI and Honeycomb user to be able to use Buildevents to optimize their build pipeline and figure out when they should be making use of those CircleCI features like parallelism and container size. What do you think, Ryan? Do you think that we’re kind of in the more advanced set of CircleCI users?
That was incredible. I’m already going to put some of that into some of my TypeScript migrations, honestly, because I love visualizing that. So that was, like, amazing to see, and I can’t wait to see what other people are up to in terms of, like, speeding things up.
Yeah. But definitely, you can get started very easily just by adding that CircleCI orb and then a couple of lines of code to create the Buildevents jobs and to annotate each of your jobs with their build durations. And, after that, it’s a matter of taste adding in additional capabilities as you get more sophisticated. So, that’s what we have to share with you today, and we’d encourage you to check out the CircleCI Buildevents orbs on the list of CircleCI orbs. Thank you very much for your attention today.