Building with Observability: Using CircleCI and Honeycomb to achieve production excellence
Do you know exactly what your builds are doing at every step of the way to prod and after they’ve been deployed? A key part of what lets you ship code to production often and quickly is having observability in your builds. Together, CircleCI and Honeycomb can help you get both speed and quality when shipping code to production. In this webinar, we'll not only examine how CircleCI and Honeycomb work well together, we'll also look at how Honeycomb used both products to identify changes that impacted their build times and reduce them by 25%. Successful CI/CD asks demanding questions, and with observability, you can answer them!

Attendees will learn:

- How to watch their build pipelines with traces in Honeycomb to identify performance optimizations
- How to improve build times with CircleCI orbs
- How to integrate Honeycomb and CircleCI
Ryan Pedersen [Solutions Engineer|CircleCI]:
You’re here for the Building With Observability webinar: Using CircleCI and Honeycomb to Achieve Production Excellence. I’m Ryan Pedersen. I’m a solutions engineer here at CircleCI, which means I get the fun opportunity to work with customers both pre- and post-sales, helping them get onboarded onto CI/CD and do some more advanced configuration in their CI/CD pipelines. I’ll hand it off to Pierre.
Pierre Tessier [Solutions Architect|Honeycomb]:
I’m Pierre Tessier. I’m a solutions architect with Honeycomb. Like Ryan, I also help customers with their observability journey using Honeycomb, both pre-sales and post-sales, to ensure their journey is optimal and they continue to gain a better knowledge of their systems. So with that, let’s talk a little bit about what it means to have elite-performing teams and what it means to achieve production excellence, if you will, with your engineering teams. DORA released its State of DevOps report earlier this year, and it came out with some really important findings that highlight what high-performance teams do versus the not-so-high-performance ones. Certainly one of the key factors here was deployment frequency. They do multiple deploys per day, and that’s key because it minimizes the size of each change. It allows you to be more adaptable and get where you need to go, and, ultimately, the lead time for changes is lower, and, if something does break, restoring service takes much less than an hour.
It’s at the top of the class, and their failure rates also go down, and this is because we do small, incremental changes. But the benefits aren’t just from being a high-performance team. You get better outcomes from this, better business outcomes, higher market share, greater profitability. Your employees are happier because they’re not always on call. They’re not always frustrated with their systems. They have speed and stability together. And, ultimately, you have a better culture within your organization. We’re going to show you today, through this presentation, with CircleCI and Honeycomb how we help organizations achieve this higher deploy frequency and gain better observability within their platforms. We’re going to show you examples of how we do it with each other’s platforms ourselves. So, with that, let’s look at what today’s webinar is going to be all about.
Okay. So before we get into the agenda, a couple of housekeeping items. We will have a Q&A at the end. Feel free to ask your questions at any time; if you use the Q&A button at the bottom of the screen, we’ll see them more easily, and we will also monitor the chat. We do have live captions for the broadcast. You can see them through the access button, and we’re going to send a link through the chat as well. So, in terms of agenda, here is what we have prepared. We will kick it off with CI/CD: enter CircleCI, with some optimizations for CircleCI CI/CD pipelines. Then we'll jump into Honeycomb and observability: what is observability? Then some mutual use cases, CircleCI using Honeycomb and Honeycomb using CircleCI. We have a demo of Honeycomb and then a Q&A wrap-up. So this is your last opportunity: if this is not what you signed up for, this is an exit ramp. If it is, we’re looking forward to running this for you. On to CircleCI.
At CircleCI, our mission is to empower technology-driven companies to do their best work. We do this by making engineering teams more productive through intelligent automation. When we say idea to delivery, what we mean is anything from the ideation stage all the way to the time it hits the hands of customers. So big features, bug fixes, new products: we help developers and teams focus on what they were hired to do, and we handle automating the build, test, and deploy. Some background on CircleCI: we're nine years old and still growing fast. We have customers in all types of industries, of different sizes, at different points in their trajectories, all the way from single developers building a product, to unicorns, to the Fortune 100. We work with thousands of customers and run millions of builds per day. That means we’ve seen it all, and the platform is highly flexible in terms of what one can do with it.
Back to the concept of idea to delivery. On the left here, in green, we see creation: the VCS, where all the code lives. It takes all the updates from the developers and ingests them. All the way to the right, the other green is logistics: that's where the code is deployed. Think shipping off a container, deploying to Heroku, deploying to a Kubernetes cluster, really wherever. The middle part here, in blue, is CircleCI, automating the building, testing, and delivery of that code. All right. And, as I mentioned at the top of the hour, as a solutions engineer at CircleCI, I get to help customers get onboarded, which I think of as ensuring stability and determinism, and then I get to get their builds optimized to be as fast as possible, really the speed piece. I’m going to jump right into some of the level-two pieces associated with speed and stability. These are the levers that I often look to pull to drastically shrink build times and workflow times. We'll start with a couple of tried-and-trues, then go to a newer option and a better way to adhere to best practices and make the config file as simple as possible.
Concurrency and parallelism are a really easy pair of levers to play around with. Before I get to those definitions, a quick definition of a workflow: for us, a workflow is a dependency graph of jobs, defining which jobs run and in which order. A job is usually a specific task to perform, something like a unit testing suite, building a Docker image, or running some linting steps. Now that we've got that out of the way, we’ll talk about concurrency. Look at jobs two and three here. Concurrency is used when describing jobs being run side by side. So if you have to run multiple testing suites, something like a unit testing suite and a snapshot testing suite, those can be run concurrently; you don’t have to wait for one to finish to start the other. For you, that means shrinking the amount of time your developers have to wait after pushing code before they get that first piece of feedback. Another level down is parallelism. This is set in the config at the job level. It spins up the specified number of nodes to run that job across. It’s really useful, in particular, for test splitting. We have an internal test-splitting mechanism, so we can split up lengthy test suites over whatever defined number of nodes. We optimize that with timing data so the test splits finish at about the same time.
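To make the concurrency idea concrete, here's a minimal sketch of a CircleCI config where two test jobs run side by side because neither requires the other. The job names, images, and commands are illustrative, not taken from the webinar:

```yaml
version: 2.1

jobs:
  unit-tests:
    docker:
      - image: cimg/node:lts
    steps:
      - checkout
      - run: npm ci && npm run test:unit
  snapshot-tests:
    docker:
      - image: cimg/node:lts
    steps:
      - checkout
      - run: npm ci && npm run test:snapshot

workflows:
  build-and-test:
    jobs:
      # No `requires` between these jobs, so they run concurrently
      - unit-tests
      - snapshot-tests
```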
So instead of running 1,000 unit tests sequentially, split those across 30 nodes to drastically speed up that process and get feedback to your developer much quicker. You can also combine parallelism with concurrent jobs, really speeding things up. Another lever is playing around with resource classes, and it’s a really easy one. Sometimes you need a little bit more juice. We covered that a workflow is a dependency graph of jobs that run in a specified order, each with a single thing in mind. Each job can also be highly flexible in terms of platform: Docker, Mac, VM, custom runner, Windows. But not every task fits into a one-size-fits-all approach to resource size, meaning CPU and RAM.
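As a rough sketch of the test-splitting mechanism described here, assuming a Python suite and the `circleci tests` CLI available in the job image (the file glob and node count are illustrative):

```yaml
jobs:
  unit-tests:
    docker:
      - image: cimg/python:3.11
    # Spin up 30 identical nodes for this job
    parallelism: 30
    steps:
      - checkout
      - run:
          name: Run a timing-balanced slice of the suite on each node
          command: |
            TESTFILES=$(circleci tests glob "tests/**/test_*.py" | \
              circleci tests split --split-by=timings)
            python -m pytest $TESTFILES
```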
You’re able to specify different resource classes at the job level. Think of a heavy job that needs more CPU and RAM, something like building a Java app. You can specify that, and it can live side by side with one that needs much less, like a quick linting job, for instance. One last speed optimization, and this one is pretty new: the ability to use a RAM disk in jobs. Before I get into the how, I want to touch on the why. Each job in CircleCI is isolated from the others, but they all share the local SSD on the EC2 instance. So there can be contention over that SSD if you’re running jobs with a high degree of parallelism. This was brought to light by some inconsistencies in installing node modules, in particular, which, before RAM disk, we performed on the SSD rather than in memory. But even without this in mind, RAM disk is just plain fast.
What we did to solve this piece around determinism and speed is utilize something Docker containers already have, which is a RAM disk. You’re able to specify it as your working directory, and it uses Docker and the RAM disk to speed things up. It's really useful for I/O-heavy tasks, and it can leverage as much memory as is available in the specified resource class, so it's a really great way to pair resource classes with more speed. All right. As promised, on to ways to make life easier and your config cleaner. You can think of orbs as CircleCI’s package manager: reusable configuration code. It’s good for a few main use cases, one of which we’ll cover mostly: the network of certified and partner orbs. These are orbs in our registry made by either CircleCI or one of our close partners, like Honeycomb. So if you use one of these, you know everything is up to date. They include best practices, and they're versioned, which brings stability and gets your developers back to doing what they do best, not trying to reinvent the wheel figuring out how to use a tool and implement it in your CI/CD pipeline.
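A hedged sketch of combining a larger resource class with the RAM disk: CircleCI documents `/mnt/ramdisk` as the in-memory path for Docker executors, while the job name, image, and commands here are illustrative:

```yaml
jobs:
  install-deps:
    docker:
      - image: cimg/node:lts
    resource_class: large             # extra CPU/RAM for the I/O-heavy work
    working_directory: /mnt/ramdisk   # backed by memory, not the shared SSD
    steps:
      - checkout
      - run: npm ci                   # node module installs benefit most
```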
They also include the source code, so you can see what’s happening or repurpose them if you want to make them your own. They consist of commands, executors, or even entire jobs. You can distill what might have been 30 or more lines of code down to a few, or a single line, and use parameters to do what you need and make these scalable and useful across use cases. Instead of having to call an entire bash script, you can simplify that and be off to the races. So, as promised, we covered speed and stability levers, many of the level-two pieces that customers find value in. I’m going to hand it back to Pierre.
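For example, a certified orb can collapse a hand-written install-cache-test sequence into a couple of lines. This sketch uses the `circleci/node` orb; the pinned version is illustrative, so check the registry for the current one:

```yaml
version: 2.1

orbs:
  # Certified orb from the registry, version-pinned for stability
  node: circleci/node@5.1.0

workflows:
  test:
    jobs:
      # The orb's packaged job handles checkout, dependency caching,
      # install, and running the test suite
      - node/test
```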
Awesome. Thanks, Ryan. So when we talk about Honeycomb, we often talk about observability, and I like to define that. What does observability mean? Observability means being able to ask any question and get an answer, even if you don’t know what you’re looking for. It’s not monitoring. I want to be really, really clear here: monitoring is not equal to observability. Do you know what else is not equal to observability? Three different data types. You don’t need logs and tracing and metrics to achieve observability. You need a different lens on your data to get there. You need to be able to look at that data differently, and we’re going to talk a little bit more about that as well. Ultimately, what we want to do is focus on what you need to know, even when you don’t know what you’re looking for. So going from that, let’s talk about what makes Honeycomb. What is our mission, and how does Honeycomb apply observability to it?
Our mission is to give every engineer the observability they need to eliminate toil and delight their users. That means being able to go into your platform and push that release with confidence, knowing that if it is bad, I will be able to look inside and find out what it is, and I’m not going to struggle to do so. We want to make “on-call” not be a dirty word for the engineers in your SRE organization. We want to make it so that when there is a problem, they don’t have to worry about hunting it down or figuring it out; they will be given that answer easily. So what makes Honeycomb the platform for observability? What sets us apart? First off, we have a purpose-built datastore for an event-based approach. As I said, no more thinking of data as traces and logs and metrics. They’re all types of telemetry events, and what we want to do is eliminate these three pillars, these three different tools we need to use, and put it all into one tool. So all your data comes in, and no matter how you’re trying to slice it, you can see it all with that same lens.
And you want to be able to get there with ridiculously fast queries and no cardinality limits. It’s one thing to come back with a query in a couple of seconds or milliseconds, but it’s another thing to be able to push user ID into your query and group on it, even if you’ve got over 100,000, over a million, over 10 million users. Honeycomb will perform just fine and still slice the data for you. We’re really allowing you to ask any question across any dimension in the platform. We want you to be able to take this and wrap it up with service level objectives and have those be the driver for your observability. Raise your hand if you’ve seen an engineer build a new service, and the first thing they did was create a dashboard with eight charts on it and say, This is what you need to monitor my service. Then, when something happened, the dashboard didn’t tell you what you needed to know. That’s because we'd need to create alerts for all these different symptoms. Service level objectives allow us to focus on the business value of our service. Why did we create this service? What did we want it to do when we built this product? Then we build our alerting from that. We build our observability from that. So if our service level objectives are struggling, we can use that as a launchpad to learn more. And, ultimately, it relieves alert fatigue and allows us to focus on what matters within the platform.
Now I'm going to look at some of the use cases you can benefit from by putting all of these things together with Honeycomb. The first one is the ongoing development use case. Certainly, if you have better observability, you can get insights into what’s going on, and you can continue to build new features and be the first to market with that brand-new thing, not being disrupted but being the one that does the disrupting. Paired with CircleCI and the confidence that you know what your code is going to do, with no fear about it, you get pain-free releases. I want you to push a button at 5:00 p.m. on a Friday to release your code. Do it on a Sunday, too, because you have the confidence, and you know you’re fine doing so. Systems optimization is another great one. I can't count the number of times I’ve worked with customers who said, We’ve instrumented our application with Honeycomb, and, wow, we did not know we were calling this query 13 times for a single transaction. That’s moving to a cache, right? Just like that, by gaining those simple insights, they were able to improve their applications and their customers’ experiences. And, certainly, probably the primary reason people want to use Honeycomb is incident response. When something goes wrong, you want the system to tell you why. You want to know what’s going on. I’m going to turn it back over to Ryan to talk about some of the ways we've used CircleCI and Honeycomb together.
Thanks, Pierre. All right. So it takes two to tango, and I’m really excited to talk about some of our mutual use cases. This is also my favorite slide; I love the idea of tangoing and working together. First off, some ways CircleCI uses Honeycomb. The first use case I want to touch on is quickly finding things that are out of the ordinary for us. One recent example: we noticed there were some very large spikes in log output hitting Redis in places. And, for us, Redis is where build output collates before it’s sent off to S3 for more permanent storage. At this point, we had to acknowledge something was going on with Redis, and the next step was to figure it out. It sounds easy. Well, it really is, using Honeycomb. So we dig into Honeycomb, start going through queries, and look for something that stands out, something not like the rest. We take an approach of “you’ll know it when you see it.” And when we finally find it, there’s definitely no doubt there’s something different here. The graph is what we see in Honeycomb when we finally track down the culprit. We dig into the purple, and we’re able to see what’s causing this. We narrow it down to a specific customer, the branch, and the commit that caused the first spike that told us to look into Redis.
Before I tell you what it was, I want to remind you that CircleCI has thousands of customers, and we see millions of builds per day. The fact that we were able to so quickly narrow this down to a specific customer commit with Honeycomb is incredible. So we found out the change was due to a Rails upgrade. This resulted in a huge number of deprecation warnings that were flooding the logs. With this information, we were able to go to the customer and let them know what had happened, so they could make a change in the code or pipe the logs out to artifact storage if they're needed afterward. But, overall, we were able to get to the bottom of a general issue that bubbled up and make some proactive recommendations very quickly with the help of Honeycomb, the visualizations, and BubbleUp. The second way we use Honeycomb in our day-to-day is by following traces. Some parties strongly recommend not chasing waterfalls, but, in this instance, following waterfalls is very helpful.
With microservices, there are a lot of services talking to each other, and we’re able to use traces to visualize the waterfall of API and database calls. A great example here: while examining a workflow recently, we spotted that we were running validation calls multiple times under certain conditions. Think of what Pierre just mentioned, 13 calls for a single transaction; you should do something else with that once you can visualize it. What was interesting is these are things that, in an ideal world, need to be done only once. So after spotting that, we were able to update the workflow to include a better way to pass what we need for validation between jobs. This reduces unnecessary calls, which is really helpful for workflows with big fan-outs. Think of a job with a parallelism of 100 needing to make a call hundreds of times, versus passing along what is needed and doing it once. So we covered visibility into the builds being done as well as our distributed traces, some great Honeycomb use cases from CircleCI. And with that, I’m going to hand it back to Pierre.
Thanks, Ryan. This is a fun use case I love to talk about. It’s really a theme that has been building a lot lately, getting a lot more traction. We’re seeing a lot more people wanting to do this. It’s really all about building better builds, and we’re going to talk about how we get there. So first off, I’m going to ask people to raise their hands again. How many people have been the person in the chair with the sword before? I know we’ve all been there. I’ve been an engineer before. I remember saying, Oh, it’s compiling. I will go do something else. With build systems out there, we push it off to another platform, like CircleCI, to do that compiling, but it’s still something that exists, nonetheless. If we want to think about this a little bit more, what is a build? What do you do in a build? You take a request to create something. You run it through a series of steps, or a workflow. Some of these steps happen in parallel, some kick off in single file, and eventually, you end up with an artifact or a set of artifacts at the end of your build. Now think about what a request into your application is. It comes into a service, and you get served by that service. Maybe it gets branched off and executed in parallel by other services. And eventually, all these services are done, they've produced a package, and you respond to the end user with a response. They’re really not that dissimilar when you think about it: both are a series of steps.
So what if we take the concepts of distributed tracing and apply them to builds? If we could take builds and apply distributed tracing to them, you'd get something that looks like this. I can see how long my entire build took. I can see what commands were run and how long each command took. We can even see whether individual commands succeeded or failed. We could add other attributes as well, like: did you produce an artifact, and what size were the artifacts? And we can encapsulate all of this into a trace. So this looks great. I see a waterfall chart of what looks like a build. How did we get here? Well, Ryan hinted at that earlier on. We built a little utility called buildevents. Then we wrapped it up in an orb and released it in the CircleCI orb registry. It’s free for anyone to use, and it allows you to add wrappers for every single step in your build workflow, or at least the steps that matter to you, and build up that same waterfall view. So you can understand your builds, gain observability into what’s going on in them, optimize them, and find better ways to do builds or clean them up.
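A hedged sketch of what wrapping steps with the orb can look like. The orb name (`honeycombio/buildevents`), version, and command/parameter names here are recalled from the public orb, not quoted from the webinar, so verify them against the orb registry before use:

```yaml
version: 2.1

orbs:
  buildevents: honeycombio/buildevents@0.2.7

jobs:
  build:
    docker:
      - image: cimg/go:1.21
    steps:
      # Opens a span covering the whole job, then one span per wrapped command
      - buildevents/with_job_span:
          steps:
            - checkout
            - buildevents/berun:
                bename: go_build
                becommand: go build ./...
            - buildevents/berun:
                bename: go_test
                becommand: go test ./...
```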
I would love to actually show you how this works in action within Honeycomb. Let me grab a screen and show this off a little bit. So this is Honeycomb right here. This is our environment, and we’re staring at the Poodle service. Poodle is our front-end service; we name all of our services after dog breeds at Honeycomb. It’s kind of fun. What we’ve got here are all these markers, and these markers are actually our deployments. At Honeycomb, we build all the time. If there are any new builds, we deploy every hour on the hour. It’s early in the morning and engineers don’t work overnight, but you can see this deployed late last night. This is what happens at Honeycomb. So we can go ahead and look at each one of these and click on that deploy, and we’ll actually see the CircleCI workflow and how to get there. I can see here when this deploy was committed. We can get down to the actual git commit, if we wanted to, and see what happened and what it's doing (it adjusted a lambda flag for us in LaunchDarkly), and we can see all the steps in the workflow. And this is great, this is fine, and we get some value there. But this is Honeycomb’s build events area. This is where we actually send all the data for all the builds we do, and I’ve got them plotted out right here. I’m just going to make this a little bit easier for us to see: all these blocks are all the times that we’ve done a build.
So this is about two weeks of data, I believe, that we’re staring at, maybe three weeks, which is our build times over a span of time. And right here, this is October 5th, when that commit was merged, when Dean pushed that commit. You can see the build times went from 900 seconds down to 500. That's a significant improvement: 30-plus percent is what we saved. And if you think about it, being able to save 30% on your build times is quantifiable savings, just from observing your builds, understanding what they are, and applying these techniques from CircleCI to make them better. And if I click on any of these dots here, you can see the build tree did drastically change, because we went to parallelism and optimized each of these steps to use the RAM disk so they all went faster. So with that, we’re going to hand it back over to the presentation here. We’re going to open this up if there are any questions you may have about the platform or what you’ve seen so far, anything about CircleCI, or anything about Honeycomb and what you can do with observability. Feel free to go ahead and ask them right now. We'll see them in the Q&A, or if you’re more comfortable using chat, you can go ahead and post them there as well.
I see one around CPU and RAM usage for each job. So the answer is: somewhat. For RAM, if you want to get RAM usage on a Docker executor, you can see the high-water mark. I'll put the link in the chat for how to do that; you can cat that out. CPU is not accessible, but that's definitely the kind of feedback we've heard before. We would love to find a way to get that out there.
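For reference, the high-water mark mentioned here can be read from the cgroup filesystem inside a Docker executor. This sketch assumes a cgroup v1 container, which is what that path implies; it is not an official CircleCI snippet:

```yaml
steps:
  - run:
      name: Report peak memory usage for this job
      # cgroup v1 path; prints the container's peak RAM usage in bytes
      command: cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes
      when: always   # run even if earlier steps failed
```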
Awesome. I see another one here coming in: what’s the difference between logs and traces with Honeycomb? There is no difference when you bring that data into Honeycomb. We consider all of those events, in our parlance. Certainly, a trace has additional attributes associated with those events, where we can connect them to other similar log lines, if you will. Really, I'd describe the journey from logs to tracing like this: when you have log lines, you have an application running with a lot of different entries going in there. When you go to debug it, you’re looking at all these entries flying through, and you can’t really correlate which ones make up a single transaction, because that one transaction may have dumped 25 different entries into your logs. Tracing is the glue for all of those entries. So when you pull up that one transaction, it extracts all of those log lines for you and draws it up. That’s the way I like to describe what going from logging to tracing looks like. But you can send either to Honeycomb, and our platform allows you to do everything you see us do on it. We can’t draw waterfalls with just log lines, but we can do everything else.
I see another question here: is there an orb available that will allow us to publish our CircleCI jobs into Honeycomb? You mean you might want a marker, if you will. Interestingly enough, we probably should produce an orb for that, but it is a single API call or a single CLI command. We do make a CLI available for it, and we publish an image for it. However, duly noted: it probably wouldn’t hurt for us to build an orb to do the same thing. I see another question here: I see a chart of the total time required for a build; do we have similar charts showing total compute per build? You can do that. It is something we can do in Honeycomb, because each step is called out, and I can make a chart just for specific steps, and your steps could be compute steps.
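The single API call referred to here can be folded into a CircleCI step. This sketch uses Honeycomb's public markers endpoint; the dataset slug `my-dataset` and the `HONEYCOMB_API_KEY` environment variable are placeholders you'd supply yourself:

```yaml
steps:
  - run:
      name: Create a Honeycomb deploy marker
      command: |
        curl -s -X POST "https://api.honeycomb.io/1/markers/my-dataset" \
          -H "X-Honeycomb-Team: ${HONEYCOMB_API_KEY}" \
          -d "{\"message\": \"deploy ${CIRCLE_SHA1}\", \"type\": \"deploy\"}"
```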
I think we could probably pair that with the script to systematically get the high-water mark of RAM, too. We could probably couple those together.
I see a couple more questions here. There’s maybe a follow-up to a prior question: you weren’t looking to add a marker; you meant an orb to instrument the pipeline. That is what the buildevents orb is for: instrumenting the pipeline itself. So you use it inside your CircleCI config workflow, and for each step that is important to you, you leverage the orb there. And I do see another question about datasets for Honeycomb. I would recommend a separate dataset for your build steps. We do that at Honeycomb ourselves; it’s called “build events,” fittingly. I recommend you call it the same thing. I think that’s actually the default associated with the orb. If there are no more questions, I certainly thank everybody very much. I hope you found this informative. If you would love to learn more, go ahead, Ryan.
I was going to say, for us, we would love for you to come out of this excited about speed and stability, but also just as excited about the orb. So check out our developer hub and the orb registry, and get up and running.
If you enjoyed all this and you want to learn more about Honeycomb, we offer plenty of different ways to do so. We offer observability office hours with our developer advocates, Liz Fong-Jones and Shelby Spees. They’re wonderful, and they offer great advice. Even if you’re not a Honeycomb customer, if you’re just looking for general observability advice, go ahead and give them a ring. They’re there for you. Every week we provide a fun, live demo; if you want to see Honeycomb more in depth, it’s every Thursday at the same time, I believe 10:00 a.m. Pacific, 1:00 p.m. Eastern. And if you want to try Honeycomb, you can try it for free. We do have a free tier. It’s great for anybody to try out; it’s a little limited in the number of events you can send, but it’s absolutely free for you to use. And if you’re ready to grow and scale, we’ll be there with you as well. On that note, thank you very much. Remember, test your code in production, because that is the only place where it actually lives and where it's actually real. And build often with CircleCI.
Thanks very much.