Frank Chen shares how traces gave us a critical and compounding capability to better understand where, when, how, and why faults occur for our customers in CI. We share how shared tooling for high-dimensionality event traces (using SlackTrace and SpanEvents) could significantly increase our velocity to diagnose code in flight and to debug complex system interactions.
Frank Chen [Staff Software Engineer|Slack]:
Hello. I’m Frank. My pronouns are he/him. I’ve been at Slack for two years and guided many of our internal teams on how to instrument and improve observability with traces to solve problems.
One particular problem is test flakiness. It’s one of those insidious problems that has multiple causes. For engineers, it’s extremely frustrating and for service owners, hard to concretely pinpoint. We were able to use tracing to understand flakiness and bring the flakiness percentage per PR down about 10x over the last few years.
I’m delighted to share a few stories of finding causation for performance and resiliency problems within our infrastructure with Honeycomb. We will hear about what worked, what didn’t, and what helped drive organic adoption for tracing and various use cases. Today we will start with a high-level overview of our department, our trace infrastructure using SlackTrace and SpanEvents, overview of CI infrastructure workflow, initial instrumentation, and investments in CI traces, and finally specific ways teams at Slack use traces to decrease flakes for their customers.
I would love to spend the first few minutes talking about developer productivity and business goals. Developer productivity is about building tools and programs used to get Slack software to customers quickly and with high quality. It’s been a fun journey to be on as Slack scales. I work with teams that have the opportunity to make engineering at Slack simpler, more pleasant, and more productive. Both Slack’s customer base and codebase grew quickly over the past six years. This raises the question: how might we create better internal tools for engineers to build, test, release, and observe deployed code?
Well, there are challenges. As the great philosopher Uncle Ben from Spider-Man once said, “with great power comes great responsibility.” And in the same way it is exciting, there is a shadow side to great growth in customers and codebases: complexity, fuzzy service and team boundaries, and more. Here’s a simplified diagram of Slack for desktop and mobile clients. Slack has evolved from a single web app and a Hacklang monorepo to a topology of many languages, services, and clients that serve different needs. The core business logic lives in webapp and routes to other services like a distributed cache called Flannel, Search and Discovery, and many others.
Internal tools were built quickly and scaled just enough for Slack’s main monorepo, webapp. In 2019 when I joined, our team was already growing about six-fold. Today the internal tools team consists of 50 people (and, of course, we’re hiring).
This hopefully resonates with y’all as well. Services are reasoned about individually and within team silos using logs and metrics. For example, the Vitess team might create a Prometheus metric for errors with different dimensions that are meaningful for development. Another team might create a circuit breaker for error rate on Vitess but misunderstand the nuance and misuse the metric at the coarse level without context. So how might we do better?
In my previous role, I helped build products and teams to help companies with their hardest problems by reasoning about data, pipeline, and platforms with Palantir. I found you can create massive business results when people speak the same language and reason about data using that same language. Oftentimes, one team’s definition of dimension X means something very different between teams and especially business groups. So let’s take a different approach.
How might we create a common language to increase legibility and impact complexity across teams? How might we create an integrated diagnosis and reasoning about complex system interactions with people in Slack? How might we evolve our observability culture? How can we share business impact, ease of implementation, and increases in developer velocity?
This body of work was a slow conspiracy, sprinted on between many teams and didn’t happen overnight. I want to say a special thank you to Suman and Ryan on the observability team who founded this thesis around SlackTrace and all the folk who improved the CI ecosystem with traces. During this Q&A session, I would love to hear your stories on what grew with adoption and what didn’t.
First, how did we come to this approach? “It is slow” is the hardest problem to debug in distributed systems. “It is flaky” is the problem internal tools teams hear most. A tracing system doesn’t tell me what is slow or flaky or, more importantly, why. It’s left as an exercise to the reader. Most of the utility of tracing systems today is single-use only. To get more, we have to use a vendor solution that may provide a few more answers depending on the UI and analytics. An in-depth exploration of SlackTrace was written by Suman, my mentor and lead of the observability team, and can be found in the blog post linked. For the purposes of this talk, I’ll share a sky-high view of the motivation and the infrastructure.
First, we start with the SpanEvent structure such that we can create an event once and use it in multiple places. For example, a function within webapp might create a single SpanEvent. This SpanEvent will contain context for the rest of the user request with more SpanEvents. Now we can ingest SpanEvents from multiple clients and craft views from the same data model by processing it through Kafka. Users can access spans through the data warehouse with Presto SQL queries, and through real-time stores like Honeycomb, or Elasticsearch for full-text search.
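To make the "create an event once, use it in multiple places" idea concrete, here is a minimal Python sketch of a SpanEvent-style record. It is an illustration only, not Slack's actual schema: the field names (`span_id`, `trace_id`, `parent_id`, `start_us`, `duration_us`, `tags`) and the `child` and `to_json` helpers are assumptions chosen for the example.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class SpanEvent:
    """One timed unit of work. All field names are illustrative assumptions."""

    name: str
    trace_id: str                      # shared by every span in one request
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_us: int = field(default_factory=lambda: time.time_ns() // 1000)
    duration_us: int = 0
    tags: Dict[str, str] = field(default_factory=dict)

    def child(self, name: str) -> "SpanEvent":
        """Create a span that carries this span's trace context forward."""
        return SpanEvent(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def to_json(self) -> str:
        """Serialize once; any downstream store can consume the same record."""
        return json.dumps(asdict(self), sort_keys=True)
```

Because each record is flat and self-describing, the same serialized event could be written to a queue and read by several consumers, which is the property the talk relies on.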
And now back to CI. Here’s a very simplified view of our CI workflows for users and infrastructure. I’ll use webapp and e2e tests for purposes of illustration. Webapp is where most engineers at Slack spend their development time. It ties together business logic from each client and dependent services. The CI workflow probably looks familiar. A user does development on a local branch, pushes it to GitHub, opens a PR, and is presented with test results. The screenshot shows DMs from Checkpoint, an internal CI/CD platform. It drives and bridges workflows between GitHub Enterprise, AWS services like Kubernetes, Jenkins, Consul, and QA environments that are beefy machines running Slack to execute end-to-end tests. You might ask, what might go wrong? Good question.
When I joined in 2019, a lot of existing CI logic was written by the CTO and early employees. It was mostly untouched for four years, and it mostly worked well enough. Well, why trace? To find causation. Today, there are many more downstream services from webapp. Cardinality for CI traces is very different from other use cases. CI has lower volume (so there’s no need for sampling) but higher criticality. CI requests go through critical and interconnected systems where a fault in any system means that a user is blocked. So what does that fault look like?
It means users hit retry on their tests and are frustrated. Between 2017 and 2020, Slack saw a 10% month-over-month growth in test execution count. This led to a lot of systems being stretched to their limit. Before a series of projects around GitHub load, circuit breaking on dependencies, anomaly detection, flakiness reduction, and more recent workflow changes, we were seeing a flake rate of approximately 50%. Today it’s around 5%.
With a flake rate that’s around 50%, developers no longer have trust in tests and have a very slow velocity because they’re forced to hit retry and this generally leads to frustration. Observability through tracing played a role in each of these projects.
Today we’re about 10x better when measuring flakiness rate per PR. Both velocity and confidence have increased in recent developer surveys as well. And every day we’re still learning how to understand, operate, and evolve this very complex system we call Slack. And in the past year, many business requirements, services, and teams changed from when our CI infrastructure was built. Early on in my time here, I saw an opportunity to work with the observability team to build a better set of tools and to change how we understood causation in CI. Let me share how we got there.
A few days ago, I looked at my early Slack history and found my first conversation with Suman. I had heard about his exploration and building of SlackTrace. I understood a similar problem in data platforms from my previous work and reached out. I then spent an afternoon building a cheap prototype with the hypothesis that the set of tooling would become a critical and compounding value add for developers.
An easy place to prototype was our test runner, affectionately known as CI bot. Even during this PR rollout and a couple of simulated test runs, I noticed Git checkout was slow for a portion of our fleet. It turned out a few instances in our auto scaling group were not being updated. Easy, cool.
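A cheap prototype of this kind can be sketched in a few lines. The Python below is a hypothetical illustration, not the CI bot code: the `span` context manager, the in-memory `sink` list, and the `host` field are all assumptions. It times a named step and then compares the median duration per host, which is the kind of view that would surface a slow checkout on part of a fleet.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import median


@contextmanager
def span(name, host, sink, clock=time.monotonic):
    """Time the wrapped block and append one span record to `sink`.

    `sink` is a plain list standing in for a real trace pipeline.
    """
    start = clock()
    try:
        yield
    finally:
        # Record the span even if the step raised, so failures stay visible.
        sink.append({"name": name, "host": host, "duration_s": clock() - start})


def median_duration_by_host(spans, name):
    """Group spans with the given name by host and return each host's median."""
    by_host = defaultdict(list)
    for s in spans:
        if s["name"] == name:
            by_host[s["host"]].append(s["duration_s"])
    return {host: median(durations) for host, durations in by_host.items()}
```

A test runner would wrap each step, for example `with span("git.checkout", hostname, sink): ...`, and a host whose median stands far above the rest is a candidate for closer inspection.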
And now, a juicy incident. This is a few months later. It’s day two of a multi-day, multi-team incident. Day one, our teams are scrambling with one-off hacks to try to bring a few overloaded systems under control. On the morning of day two, I added our first cross-service trace and reused the same instrumentation from our test runner. Very quickly, with Honeycomb’s BubbleUp, it became clear where problems were coming from.
Git LFS on a portion of the fleet had slowed down the entire system. Over the next month, this sort of cross-system interaction led to targeted investments on how we can add this throughout Checkpoint traces.
Here’s a sample of some of the shared dimensions we created for users and developers in CI to make queries in Honeycomb legible and more accessible. These dimensions were stubbed early in a library and instrumented with a few clients. Since then, various teams have extended and reused these dimensions for their use cases.
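One way to stub shared dimensions in a library is an enum of canonical field names, so every client tags spans with the same keys. The sketch below is illustrative only: the actual dimensions were shown on a slide and are not in this transcript, so every name here (`CIDimension`, `ci.service`, `ci.branch`, and so on) and the `tag` helper are hypothetical.

```python
from enum import Enum


class CIDimension(str, Enum):
    """Canonical span field names. These specific names are hypothetical."""

    SERVICE = "ci.service"
    BRANCH = "ci.branch"
    TEST_SUITE = "ci.test_suite"
    HOST = "ci.host"
    RESULT = "ci.result"


def tag(fields, dimension, value):
    """Set a field on a span's field dict, accepting only canonical dimensions."""
    if not isinstance(dimension, CIDimension):
        raise TypeError(f"unknown dimension: {dimension!r}")
    fields[dimension.value] = value
    return fields
```

Rejecting free-form keys is a design choice in this sketch: it trades some flexibility for queries that mean the same thing across teams.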
Back to the root challenge. Developer frustration across Slack was increasing due to flaky test runs over the last few years. Flaky test runs were one of the top reported issues for a few quarters. By mid-2020, automation teams across Slack had a daily 30-minute triage session to focus on the flakiest tests. Automation team leads hesitated to introduce any additional variance on how we use the Cypress platform, an end-to-end test framework. The belief was that flakiness came from the test code itself. Yet there wasn’t great progress by focusing on the tests.
I typically work on the infra side of the world and felt strongly we could do better with causation by instrumenting how we used the Cypress framework. Gray failures are where the system reports itself as healthy and yet the application (in this case, tests) reports a failure or flake. After some negotiating and identifying no verifiable decrease in performance or resiliency, we scoped a short experiment. We’d instrument the high-level platform runtime for a month to capture some runtime variables. What could go wrong?
Well, a lot went right. We discovered a few runtime variables that correlated very strongly with higher flake rates. In this graph, you can see compute hours spent on just flaky runs. At peak, we were spending roughly 90k hours per week of very large, very expensive machines on tests that were discarded because results were flaky. To build confidence and address concerns at every merge and hypothesis test, we queued up a revert PR at the same time. We never reverted.
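Finding runtime variables that correlate with flakiness can be approximated by grouping runs on one captured dimension and comparing flake rates per value. The Python sketch below is a simplified stand-in for that kind of analysis, not the query used at Slack and not Honeycomb's BubbleUp algorithm. The run records, the `flaky` key, and the function name are assumptions.

```python
from collections import Counter


def flake_rate_by(runs, dimension):
    """Return {dimension value: fraction of runs with that value that flaked}.

    Each run is a dict with a boolean "flaky" key plus captured runtime
    variables. Runs missing the dimension are grouped under "<missing>".
    """
    total = Counter()
    flaky = Counter()
    for run in runs:
        value = run.get(dimension, "<missing>")
        total[value] += 1
        if run["flaky"]:
            flaky[value] += 1
    return {value: flaky[value] / total[value] for value in total}
```

A value whose flake rate sits well above the others is only a lead for a hypothesis test; the talk pairs each such change with a queued revert PR for exactly that reason.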
A challenge we saw in CI is expertise in fuzzy service boundaries, especially when it came to services operated for internal use like dev or QA. A broken dependent service search or an upgrade in a vendor’s checkout library API means that the test will start to flake for users in CI. While a full discussion of anomaly detection and circuit breakers is beyond the scope of this talk, I hope to share a few screenshots of how we were able to build the glue for service and automation teams to have discussions in Slack with observability.
Test suite owners may not be distributed systems experts; service teams may not be aware of how internal customers are using the services. So building actionable observability into lightweight triage workflows, with links and starting threads for each of these issues, was the key. In conjunction, these pieces presented a working space for disjoint teams to find causes for CI flakiness and give awareness when a test suite or dependent service had issues.
Finally, I’d love to hear how you solve these types of challenges at your company. Thank you all very much for listening to a few stories and I would love to open the floor to questions.
Ben Darfler [Engineering Manager|Honeycomb]:
All right, let’s give it up for Frank. Thank you, Frank. That was a great chat. Don’t forget to, as I mentioned in the beginning, jump into the Slack channel, drop questions there, we will be pulling them out and throwing them at Frank here. But I can certainly kick one off.
One of the things that came up for me watching the talk was like, how easy it seemed. A lot of people I talk to, it feels like tracing is this dark art of some sort. And it didn’t seem like that. So I’m wondering if you could talk about your hands-on experience about adding tracing and using it for debugging. Do you need to be a tracing grand master? What was your experience there?
Frank Chen [Staff Software Engineer|Slack]:
Yeah. Like in the talk, I described whipping something up in a shell script over the course of an afternoon. And I think part of my background is in design, and so that means building really cheap, almost janky prototypes that may not be perfect but help us start to understand parts of the system that we didn’t understand before. So like an M.O. might be: let’s observe around and find out. I think each version of that, including the instrumentation during that incident, was like: yeah, this is not perfect code. But it helped to solve a very specific problem and understand part of the system that we didn’t understand before.
Ben Darfler [Engineering Manager|Honeycomb]:
No, I like that. It’s very iterative and lightweight. Getting in there and letting the instrumentation lead you into posing new questions and adding more instrumentation, but it doesn’t feel like it’s tricky or hard or difficult. I do think that people can get stuck with the up-front “I need to trace all of the things to start with.” And we see a lot of benefit in that iterative loop of asking a question, trying to figure out what information you need to get the answer to that.
Frank Chen [Staff Software Engineer|Slack]:
Yeah, with Prometheus metrics, and I’ve seen this at previous companies as well, you can have tens of thousands of Prometheus metrics with different dimensions. Because we’re somewhat starting from scratch with observability in CI, we’re able to start stubbing some of this out and building an enum for what canonical trace metrics might be. And that really helps make this easier to understand and easier to explore, and ask questions of Honeycomb and our systems.
Ben Darfler [Engineering Manager|Honeycomb]:
Yeah, you were pointing back to, with Prometheus setup, you get all of the auto instrumentation, you turn it on and all of a sudden, you’ve got all of this information. And it feels like you’ve got a lot to work with and you’re really set up for success.
And then you start trying to ask those questions, you’re like, I don’t even know where to go, how do I sift through this to get what I actually want.
And it’s interesting, I do see a lot of our customers that want to start with auto instrumentation for tracing and it makes sense: you get a lot of rich value out of that immediately. But almost maybe taking the opposite approach of incrementally adding your own customization, gives you that really high value data right off the bat and you start to see some of these insights pretty quickly.
Frank Chen [Staff Software Engineer|Slack]:
Yeah, I think, from a development perspective, one change we made in 2019 was to a codebase I typically don’t work in: webapp. And one way I was able to kind of understand how pieces fit together was with tracing, to put in like easy, small flags to understand: well, if we change how we initialize this part of our QA setup, does that affect anything in that codebase?
So it helped build out our understanding of, well, codebases that, yeah, our team typically doesn’t work in.
Ben Darfler [Engineering Manager|Honeycomb]:
Yeah. We’ve got a question here from Lee. He’s wondering how do y’all encourage your engineers to be curious about their code and ask more interesting questions of their code? How do you encourage that kind of ownership?
Frank Chen [Staff Software Engineer|Slack]:
Oh, that’s a heavy one. Well, ownership, I’m going to shelve that for a second. That can mean a lot of different things. I think curiosity, right? Like we all want to know more about our code, but oftentimes it’s not easy to reason about really, really complex systems. Right? And I’m bad with the attribution of this quote (I think this may have been Liz in a blog post or tweet…). It’s like, observability helps you understand your deployed code. Tests help you figure out what your code should be doing. Right? And in CI, there’s like some tension and a little bit of both.
But to your question about curiosity, well, start to show and share some of the stories on how you came to understand parts of your system as you built them. And potentially just iterate and build out easy interfaces for other teams to start using this way of thinking.
Ben Darfler [Engineering Manager|Honeycomb]:
Yeah, no, I think that rhymes a lot with some of the answers you’re getting back to your question in Slack, what do people do to get adoption across the organization, right? So there were a lot of people pointing out, you know, having Honeycomb charts in incident reports, bringing up Honeycomb charts and Honeycomb in engineering-wide demos, those ideas to just keep putting the idea in front of people, and then you can just, yeah, just keep it top of mind and show people, get people intrigued and interested and they can take it from there.
Frank Chen [Staff Software Engineer|Slack]:
Yeah, I think one question is always like, what are you trying to solve right now? And oftentimes, if you have a feature that’s risky enough to put behind a feature flag (well, probably everything should be behind a feature flag if you follow that way of development), but if something is really risky, how might we prototype and understand how users interact with it, and how it interacts with other parts of the system? If you’re on an infra team, how might you understand that with tracing?
You get a lot of great, like, easy telemetry thanks to, yeah, thanks to the Honeycomb ecosystem. And in very broad and brief strokes, I talked through and described SlackTrace. Like having multiple lenses to the same dataset is powerful.
Like at Slack, the observability team has built out interfaces with like ES, so I can search for stuff in Kibana, and analytics. So we can run like really big, like more computation-heavy Presto queries against it in the backend.
Ben Darfler [Engineering Manager|Honeycomb]:
Time for one quick question: do you find value in comparing tracing between runs, or is most of the focus on runs in isolation, like during the incident you’re talking about?
Frank Chen [Staff Software Engineer|Slack]:
Tracing between runs. Oh. Yes. So I have an appendix over here and maybe I can anecdotally talk through it. But one area is our automation engineering team really taking traces and running with them. One recent experiment was debugging a Docker time-out. The engineer that drove the project to understand how we minimize Docker time-outs helped us instrument test case tracing used by multiple teams. And with understanding Docker time-outs, we found a really, really high variance in how we were pulling Cypress dependencies in npm. And now we can understand and potentially build an improvement on that by prebaking some of the container and potentially, yeah, other optimizations. But that we only saw, yeah, a little bit later and, yeah, with vast, small telemetry.
Ben Darfler [Engineering Manager|Honeycomb]:
Thank you, Frank. It’s been great to chat with you and thank you for your talk.