Pipelines and Buildevents
Your build pipelines can also have observability. Learn how to debug your CI/CD workflows with Honeycomb.
Pierre Tessier [Director, Solution Architects|Honeycomb]:
Building better builds. Using observability to understand what’s going on when you’re building the software that you run in production.
I’m Pierre Tessier. You can find me as @puckpuck on Twitter. I run the Solution Architect Team for Honeycomb. Let’s get into what’s important for developers when we’re talking about builds. And before we get into that, I just wanted to start off with, what do developers do when we’re building software?
Well, we were goofing around; right? Classic one right here. I love this xkcd comic. Hey, get back to work. But we’re compiling. Oh, yeah. Carry on. Got to build that stuff.
And this is what happens. Because when we're building software, we're waiting for it to finish. Is it taking too long? Is it doing the right things? Are my tests failing? I've got to try it again, kick-start it, because I've got a flaky test.
What does a build do, really? Well, in a nutshell, it runs a lot of shell commands. This is a drastic oversimplification of a build platform, I get it; build platforms do a lot more. But let's go over it.
They set up your isolated environment. They’re going to run those shell commands for you. Whenever one of those commands fails, we’re going to want to stop on that. At the end, we’re going to want to record all these results.
But what we're going to focus on right here for this session is running those shell commands. Because that's really the meat and potatoes of your build platform. This is what really matters, and these are the things that you change and modify. I've got to run this test, pull in these dependencies, do this build, do this kind of test next. These are all the steps that you care about. And sometimes you want to know what's going on with those steps.
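The loop described above — run each shell command, time it, stop at the first failure, record the result — can be sketched in a few lines. This is a rough illustration, not any real build platform's code; the step names and commands are made up.

```shell
#!/bin/sh
# Minimal sketch of a build platform's core loop: run each step as a
# shell command, time it, record the result, and stop on failure.
run_step() {
  name="$1"; shift
  start="$(date +%s)"
  if "$@"; then
    status=success
  else
    status=failure
  fi
  end="$(date +%s)"
  echo "step=$name status=$status duration_s=$((end - start))"
  [ "$status" = success ]    # non-zero exit here fails the step
}

set -e                       # fail-fast: abort the build when a step fails
run_step fetch-deps  echo "pulling dependencies"
run_step compile     echo "compiling"
run_step unit-tests  echo "running tests"
echo "build finished"
```

Everything a real platform adds — isolated environments, caching, artifacts — wraps around this core loop.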
Well, what if we took the concept of distributed tracing that we use to observe our production applications and used it in our builds? At Honeycomb, we did just that.
This is what it looks like when you instrument your builds and gain observability on them. You can know how long the entire build took. What commands did we run, and how long did each one of them run? Did they succeed or fail? And from that, we can gain the knowledge we need to optimize these builds.
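Treating each build command as a trace span just means recording the fields a span carries: a trace ID shared by the whole build, the command's name, its duration, and whether it succeeded. A hedged sketch — the ID scheme and field names here are invented for illustration, and actually shipping the event to a tracing backend is omitted; we just print the JSON:

```shell
# Sketch: capture span-like fields for each build command as a JSON
# event. Field names and the trace-ID scheme are illustrative only.
TRACE_ID="build-$$"              # one trace per build run

trace_cmd() {
  name="$1"; shift
  start="$(date +%s)"
  if "$@"; then status=success; else status=failure; fi
  end="$(date +%s)"
  printf '{"trace_id":"%s","name":"%s","duration_ms":%d,"status":"%s"}\n' \
    "$TRACE_ID" "$name" "$(( (end - start) * 1000 ))" "$status"
}

trace_cmd go-test true           # stand-in for: go test ./...
trace_cmd lint    false          # a failing command still gets recorded
```

Collect those events for every command and you have exactly the waterfall view shown in the screenshots.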
But enough showing screenshots. Let’s actually go through this to see what it’s like.
Right here we're looking at one of Honeycomb's very first builds that we instrumented. We set out on a journey. Our builds were taking some time, and we wanted to understand what was going on with them. We wanted to know: can we optimize it? So we instrumented it and we looked at it. And this is what we got.
And we learned that Go test takes about 111 seconds and Poodle build, the next longest one, takes just under 90 seconds. And when we looked at the data, we really said to ourselves, this is fine. This is what we probably expected. Nothing out of the ordinary. And that's probably okay.
Sometimes when you go down a journey and ask yourself questions, your answer might be: it’s good. Let’s go focus on other things for now and we’ll get back to this perhaps later on in life.
And that's exactly what we did at Honeycomb. We let this go, and we let it go for a year. And when you look at a year's worth of builds, you could see we started off really low, down around four minutes, maybe a seven-minute average or so, climbing all the way up to about 13, 14, even 15 minutes to run these builds. Our build times had doubled.
Maybe it was time we actually started looking at it and doing something about it. When we looked at our builds, we were doing everything in a very serial way. And as luck would have it, we were looking at a different CI platform provider anyway, so we decided to move to a platform that gave us more parallelization and Dockerized a lot of our build steps.
And this is what happened right here. You could see our builds really changed a lot. They got much more variable in build time, but overall, did we save? Well, because we're instrumenting our builds, we could check this out really fast as well. Let's run an average line across it and see what that looks like.
Oh, yeah. We saved a little bit of time for sure. Kind of stabilized that upward trend as well. Really helped us out, took advantage of parallelization.
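The serial-to-parallel change is easy to see in miniature: independent steps become background jobs, and the build waits on all of them, so wall-clock time approaches the slowest step instead of the sum. The two sleeps below stand in for real build steps:

```shell
# Sketch of parallelizing independent build steps with background jobs.
# Each 2-second sleep is a placeholder for a real step; run serially
# they would take ~4 seconds, run in parallel ~2.
start=$(date +%s)
(sleep 2; echo "tests done")  &
(sleep 2; echo "build done")  &
wait                            # block until every background step exits
end=$(date +%s)
echo "parallel wall clock: $((end - start))s"
```

The catch, which showed up in our graphs, is variance: total time now depends on the slowest step and on how the platform schedules the containers.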
But let's look at the overall thing. You could see when we were instrumenting those builds, we were capturing a lot of data. We could take that data and learn more from it. Like maybe I want to group this by name, the type of things that we were doing.
We were able to see the savings in some steps while other ones continued to do what they were doing. And overall we gained. We won. And we were able to make these decisions because we were instrumenting and observing our builds, really gaining that same deep knowledge that you get from your production applications, but from your CI platform.
If you're like Honeycomb, you're running hundreds of builds a day. So when you add it all up, two minutes here, two minutes there, it really adds up at the end of the day.
Now, as time went on, we started doing more and more with this. We decided to start adding deployment markers as well. So now, inside of our CI platform, for the builds that actually do a deploy, when the build is done, it's going to drop a window marker and extend that window out until the deploy is finished. And we had them all right here, and we could even link back to them any time we want. So I'll go ahead and click on this; it'll pull up my CircleCI platform, and we'll see what's going on inside that build.
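Dropping a marker from CI is a single API call. This sketch uses the field names I recall from Honeycomb's public Markers API (message, type, start_time, end_time, url) — verify them against the current docs. The dataset name and build URL are made up, and rather than sending the request, we just print what would be posted:

```shell
# Hedged sketch of a deploy marker request to Honeycomb's Markers API.
# Dataset name and build URL are hypothetical; the request is printed,
# not sent.
DATASET="builds"
DEPLOY_START=$(date +%s)
DEPLOY_END=$((DEPLOY_START + 120))              # pretend a 2-minute deploy
BUILD_URL="https://circleci.example/build/123"  # link back to the CI build

payload=$(printf '{"message":"deploy build 123","type":"deploy","start_time":%d,"end_time":%d,"url":"%s"}' \
  "$DEPLOY_START" "$DEPLOY_END" "$BUILD_URL")
echo "POST https://api.honeycomb.io/1/markers/$DATASET"
echo "$payload"
```

Setting both a start and end time is what makes it a window marker rather than a point-in-time one, and the url field is what lets you click through back to the CI platform.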
Now, there was a time we made another change to our build itself, and we actually gained a lot of speed out of it. Again, because we are instrumenting our builds, we're looking at it and we're saying, wow, these disk-intensive tasks are taking a lot more time. What can we do to speed that up? And it turns out that we could do it all in RAM and use a RAM disk instead to optimize the builds one step further. And we did just that.
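The RAM-disk idea in sketch form: point disk-heavy build steps at a memory-backed directory. On most Linux CI images /dev/shm is already a tmpfs mount, so no root is needed; elsewhere we fall back to a regular temp dir. The variable name is made up, and this is not the actual change Honeycomb shipped:

```shell
# Sketch: use a RAM-backed scratch directory for disk-heavy build steps.
# /dev/shm is tmpfs on most Linux CI images; fall back to a normal
# temp dir when it is unavailable. BUILD_SCRATCH is a hypothetical name.
if [ -d /dev/shm ] && [ -w /dev/shm ]; then
  BUILD_SCRATCH="/dev/shm/build-scratch-$$"
else
  BUILD_SCRATCH="$(mktemp -d)"
fi
mkdir -p "$BUILD_SCRATCH"
export TMPDIR="$BUILD_SCRATCH"   # many tools honor TMPDIR for scratch files
echo "scratch dir: $BUILD_SCRATCH"
```

Because the builds were instrumented, the before/after effect of a change like this shows up directly in the step durations.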
We even put a marker right here because it was drastic enough for us to take note of it. It links over to the GitHub commit that took care of doing this. And Dean, one of our fine engineers on the platform team, went out and wrote some code to optimize our builds. And you could see it right here. Clearly, we've optimized those builds and we've got some gains out of it.
But this is an observability platform. It’s a lot more than just looking at a waterfall trace and grouping a couple things. This is Honeycomb. We do all kinds of great things with Honeycomb. I can even do a BubbleUp. If you’ve used Honeycomb before, you’ll love the BubbleUp tool. You just draw a yellow box where it matters on a heatmap, and we’ll go ahead and go through all those attributes and figure out what’s different. And here we can even find out that, oh, wow. We got specific… look at that. Toshok was really busy in February doing a lot of great stuff for us working on our new annotations framework. I can see he was the one doing about 17% of the builds in our platform.
And these are some of the things you learn. And it’s not just people. It could be different build IDs, different variables that you have in a platform. And they all help you kind of put that together, taking the power of observability and putting it on your CI platform for you.
Now, this is available through a tool we call buildevents. And you are certainly able to go get it. You can run it yourself. But at Honeycomb, we didn't want to just give you the tool and say, hey, go do it. It works with Jenkins, Buildkite, name the platform; it's there behind them all. But we've also gone out and built some specialized modules just for CircleCI and GitHub Actions. Or you can go check out this wonderful blog post on how to do this all on GitLab.
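The general shape of using the buildevents CLI, from memory of the honeycombio/buildevents README — check it for the exact subcommand arguments, which may differ by version. The wrapper below degrades to running the command plainly when buildevents is not installed, so the script works either way:

```shell
#!/bin/sh
# Hedged sketch of the buildevents CLI pattern (honeycombio/buildevents);
# verify exact arguments against the project README. Runs the wrapped
# command plainly if buildevents is not installed.
BUILD_ID="${CIRCLE_WORKFLOW_ID:-local-$$}"   # any stable per-build ID works
BUILD_START="$(date +%s)"
STEP_SPAN_ID="tests-$$"
STEP_START="$(date +%s)"

run_traced() {
  name="$1"; shift
  if command -v buildevents >/dev/null 2>&1; then
    buildevents cmd "$BUILD_ID" "$STEP_SPAN_ID" "$name" -- "$@"
  else
    "$@"
  fi
}

run_traced go-test echo "pretend: go test ./..."

# Close out the step span and the root build span.
if command -v buildevents >/dev/null 2>&1; then
  buildevents step "$BUILD_ID" "$STEP_SPAN_ID" "$STEP_START" tests
  buildevents build "$BUILD_ID" "$BUILD_START" success
fi
echo "traced build finished"
```

The CircleCI orb and GitHub Actions modules mentioned above wrap this same cmd/step/build pattern so you don't have to plumb the IDs and timestamps yourself.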
That’s all we wanted to chat about today. Hope you enjoyed building better builds.
How Tracing Uncovers Half-truths in Slack’s CI Infrastructure
Frank Chen shares how traces gave us a critical and compounding capability to better understand where, when, how, and why faults occur for our customers in CI. We share how shared tooling for high-dimensionality event traces (using SlackTrace and SpanEvents) could significantly increase our velocity to diagnose code in flight and to debug complex system interactions.