Eyes Wide Open: o11y From Theory to Practice
Peter covers his team’s journey, including how they did the early integration, included interactions with existing tooling that was available in their organization, and went through several ‘aha moments’ they encountered when they had instrumentation in place that proved valuable in uncovering problems in both their deployed environment and test suite.
Peter Royal [Senior Software Engineer|Netflix]:
Hi. My name is Peter Royal. I am a Software Engineer at Netflix and I’m here to tell you a story about my journey in learning about observability practices and the value of integrating them deeply into my development work. How that has helped me in the past and how that has helped me in my current engagement. My first contact with distributed tracing was contemporaneous with Google releasing their paper on Dapper back around 2010ish. At the time, I was working in the financial industry at a high-frequency trading firm. We were building new systems to trade on exchanges the firm hadn’t originally done business with. We were very concerned about overall latency, where we care about the duration between when we receive information and when our computers act on it.
The way this was done is we had these capture packages on the Ingress and Egress links and those were links to GPS clocks. That gave us the end-to-end latency. And then within the application to build a causality chain inside of it with different timestamps as it went through processing steps that we could then use to segment that overall time to see how much individual time each step took. While latency analysis was the initial goal, we also quickly found that having this sequence of causalities captured from within the application was an excellent debugging tool to track and understand what the internals of the system was using.
As it happened, we went years using this and other specific probes to monitor the health of our system. It was entirely domain-specific monitoring rather than relying on external metrics such as load averages or other things that you might capture about the process. By integrating this monitoring deep into the application itself — essentially monitoring the application’s heartbeat — we had a general idea of how the overall system was performing. I’m happy to share more stories about everything that we did at this time, but this is really stage-setting for what I’m going to do.
I’m Peter Royal. I’ve been a professional software developer for over two decades now over a number of domains. My career has been a smaller number of longer engagements which I feel has presented me with an eye towards development practices with longer-term benefits. I have been present for the initial construction of several systems and have been able to see them mature. For the past several years, I’ve been employed by Netflix and have been helping build out tools to help our business partners that are doing production. This talk is stories about what we’ve done and how that has helped us, and my intention is to inspire you to consider doing the same within your domains.
In early 2020, I helped start a new project. I had many ideas about how to approach this body of work, but the one that’s relevant today is how to observe the system. I had been following Charity on Twitter for a while at this point. And there were two repeating themes in my mind. The first one was that observability comes from within, so to speak. And this one resonated with me because it was directly aligned with my prior experience, that is essentially what we had done.
The second one was the trickier one, and this isn’t exactly the framing now, but it is important for what we were doing: basically capturing everything that might seem interesting upfront, don’t think too hard about it, just if it seems interesting, capture it. And record that off and use it for later. Although this one had me scratching my head a little bit because I was having trouble figuring out how it fits into some other things that we have going on relevant to the story.
Netflix was able to build upon the shoulders of giants. We have excellent infrastructure that’s present and this amazing tooling for capturing metrics and analytics. The broad strokes of what we’re doing is building services that have GraphQL APIs. This is on top of Spring Boot. At the moment, Spring Boot exposes the Sleuth, I’m sorry, Spring Booth’s API is for tracing and that directly exposes Brave’s API. And each link goes into more detail about what we’re doing but this is stuff we’re able to build upon.
As a development process, one of the things I like to do is limit the third-party dependencies. Often, all of their functionality isn’t really needed, and the desired surface area is much better constrained to something that I’m able to control. This is a practice from experience. When we were working in the trading industry, we had very finely tuned the API that we were using to capture data; the same practice also applies. By managing the way that those dependencies intrude themselves upon your domain code, you have more control over how changes there affect yours. I can manage the changes from a dependency in one place rather than having it scattered throughout the code.
The system that I’ve been working on is event-driven, all throughout. And as part of this, we have this general envelope structure that contains our message, and this provides a commonplace for a unique ID, timestamp, who made it, the content of the envelope itself, and most importantly, this causality. You can think of it as the little stamp the Post Office puts on it saying: where has this letter been.
This was the deep hook that we added to our work. Any internal code involved in our event processing has access to this. And it’s a very, very simple API. It’s like the span customizer — or you can just kind of drop whatever you want in, but we only exposed exactly what we needed. And, also, we were able to enforce the way it was used as well as far as naming conventions or other constraints that we wanted to happen. And early on in development, this turned out to be useful.
One of the limitations of our internal tooling is we don’t have a unification between metrics and traces. And this is kind of a legacy of Netflix’s journey. Our streaming service operates at a massive scale and metrics are the primary way to observe the health of that. It’s the most feasible thing given what we’re doing: turn tracing on. There’s an open-source project called Mantis that helps us do that and could help you too. But you need to have confidence that the anomaly you saw will occur in order for this approach to work. But we’re doing business applications. And business applications are different. Rather than requests per second, we’re doing requests per minute.
If you don’t capture something by the time when it happened, it might not happen again. So in this case, capturing everything upfront is a feasible problem. But you still have the problem of linking specific metric anomalies you see to specific events that actually cause them. I’ve been thinking about this gap. And that is part of what led me to Honeycomb.
The value proposition of being able to see those trends as far as what is happening, and seeing an anomaly, and drilling down to that specific request really spoke to me. I wanted to see how I could bring that into my service. However, at the same time, I’m also squaring that second theme of Charity’s, which was a single-wide event versus the common tracing model. And this is where owning that abstraction became useful. As an experiment, I tried dual writing my traces into both our internal tracing system as well as generating that single-wide event for Honeycomb. This was all hidden behind our internal abstraction and everything worked fine. So we built this and kept going. Except, it didn’t really work out. Once we started flowing requests through, we realized this was clunky. Everything was expected but Honeycomb wasn’t ideal. We had multiple duration fields and it didn’t work right with the product.
At this time, I came across the Beeline library and how it integrated. And it was: “aha!” I can actually represent this as traces in Honeycomb as well. This is a simplification of what I was doing, but because I had owned that abstraction, it didn’t procure the rest of my code. So now we got the desired display in each system, we could use the benefit of each tool, and the trace IDs were the same. So I could see something in Honeycomb, bring to it our internal tooling or vice versa. We matched towards feature completion and acceptance testing with our users.
However, unintentionally, we built the system in such a way that it could blow itself up, metaphorically. We were able to go into recovery mode before we realized the scope of the problem. This was awesome because it was a great opportunity to invest in the resiliency of the system. We needed to automatically recover from the failures but it’s undesirable.
One of the things that we did while trying to debug this was: we enabled the Java flight reporter, a continuous profiler. It didn’t end up being terribly useful in giving us insight into what was going on, but it was low overhead so we just left it running in the background in case something interesting happened in the future, we would have been correcting that data.
We were finally able to narrow down the problem to a design decision that was interacting poorly with some of the larger inputs. So we went in to fix it. We know what to do.
Because the system was mostly working, and we were reworking a core part of the calculation, it was time to up our testing game. Broadly, we were taking something that was being computed on-demand to being computed upfront. And now we cared about all the specific places, or all the specific interactions that would need to rely on or cause the information to need to be recomputed. This was now a new set of problems that our tests hadn’t previously covered.
As part of this, we decided to introduce staple property-based tests for our API as an explanation, the test picked a random action in the system, performs that, validates the expected output and picks another random action, and continues on. This has been great because it’s found a bunch of edge cases that would have been awful to discover in production. On the flip side, this is still costly coverage. The more iteration the tests performed, the greater the confidence that we’ve covered the entire state space. But the more iterations we perform, the slower the build is. And we weren’t happy with our balance of the number of iterations versus the build time that we were having.
I was preparing to add timing information to our test to see where the time was being spent when I had an insight in the shower. You know, I’m one of those people where the water pours over your mind and your thoughts can drift. And I realized, well, we had already instrumented our application and were comfortable analyzing that for when it’s running deployed.
In big air quotes: “All we really needed to do was have our test participate in that.” Conceptually, each test iteration is a trace. So I was able to add a decorator to the test suite to rack each of the actions in the span. I turned on trace reporting and I had a trace with a span for action and then inside of that span was all the work the system was already doing. And I could use all the tooling we had already done to understand why my test suite was slow.
I was always prepared to add this custom integration but reusing that work ended up being a huge time-saver. We were able to reduce the time for each iteration and used the time saving to strike the balance that we were comfortable with between total build time and reiterations. And this is also where frequent builds help too because every time we run a build, these things compound in order to give us confidence in the work that we were doing. In the end, we were able to improve system stability and we had given it a performance boost as well.
Another round of acceptance tests was underway, and I was browsing our metrics just looking to see what was there. And I saw some weird outliers and seemingly arbitrary portioning of framework code prior to the request entering my code. So what I wanted to do was, say, could I try to isolate this duration so I could graph it irrespective of how long the request took. And thinking through the model: the Braves model is span-centric. So at any time, you have this is what the current span is. But I wanted to get a hook on to the span to understand the timestamp of when the request actually started. I realized that this is actually available due to how the integration was done. So I realized that if I could put in annotation on that span, my code that reports that event to Honeycomb could calculate the time distance between the annotation and the root of the span, and given that duration, I could then graph that to see what is the distribution of these times.
And I added this and it was awesome, it worked, except it was blowing up for a bunch of requests. Why? And now I was starting to pull my hair out because based on everything I understood about how the framework code worked, what I had done should work in all cases. Except it wasn’t.
This is where prior work came to the rescue again. When I had done the original work of generating events for the annotations, I propagated all of the span’s tags onto the annotation. This meant that the annotations that contained the duration of how long it had been until the request entered my code had another useful tag, the HTTP path. And this ended up being the clue to realize that it was a specific endpoint prior to reaching my code that was causing the behavior. I added more diagnostic tags to narrow this down and I ended up finding two bugs in the framework that we were using. One was an unintentional sharing of data across requests. And the failure was to initialize data to requests to some path.
I then had a clean data set that showed my other problem: these unexpected latencies that afforded in the framework prior to entering my code. I had some excellent help from our platform team trying to add different intermediate traces or intermediate annotations to the framework code to try to narrow this down. And at this point, somebody else that hadn’t been working closely with us stepped in and said, “have you tried disabling the flight recorder?”
Remember, we had turned that on — at this point, it had been months ago. And at the time we thought, what’s the harm? Except that was the cause of all the latencies. In this graph, you can see all the noise there. And when everything flattens out, that’s when we turned that off. We saw this dramatic result when we flipped that. And this was eye-opening: there really is no free lunch. Even though this was the low overhead mode of this continuous profiler, there was a cost to our service when we actually looked at our latencies.
At this point, the system started to move into production. And one practice we decided to start doing as a team was to meet at the start of the week as part of our on-call hand-off, as you will. And review what our different observability tools had captured over the past week. The goal of this was to build a mental baseline of what normal behavior looks like so when the inevitable incident occurs, we aren’t chasing something that may appear funny at the time. After a couple of weeks of this, we found something. A pattern we saw was in the latency distribution to the calls of another service. And there hadn’t been a dramatic change in the data volume. When we saw this chart where it spiked, we knew that there was something curious going on.
Because we had been capturing this data, we were able to zoom out and realize that the variations in here weren’t new. This was not a new phenomenon. This count should be stable or slowly increase, not any of these wild swings that we were seeing. What’s going on is there’s a call that we have that is batched. And in a single request, we can ask data for n elements at the time.
At the time when we originally had done this, we added the size of that batch as a tag to the cause data that we were capturing. We had no idea whether or not we would use that in the future, but we stuffed it into our payload for later recording. So what we did was, we were able to add a derived column that bucketed this. And all of a sudden, some patterns started to jump out. When we then added another grouping by the application version, that was enough for us to realize what the cause of this pattern was.
One of the features of our system is that it has the ability to recreate the database schema used to surface reads. When this happens, it was causing a change in the input rate to the component making the RBC calls. When the system starts up normally, the component gets all of the states very quickly and it’s doing fewer large batches. But when we rematerialize our database view, it’s a slower trickle into that component which led to more batches that were smaller. That’s the pattern that you can see in the chart. Normally it should be that dense blue at the bottom. And that indicates deploys that happened when we were in that steady-state. And in between, all that other noise is, you know, some of it is going to be users adding things into the system that will tweak the batches. But this is work that doesn’t normally happen on every deploy.
We were able to make a change to the system that changed it so the batch size will be roughly constant irrespective of how quickly the component received input. This didn’t result in an operational problem, but it was undesirable system behavior that we were able to fix. We saw it because we were essentially just looking under rocks to see what was there to understand what was going on. And because we had captured this contextual information at the time, we were able to see the longer-term pattern and address a fix. This really validated this practice of just looking at our tooling as a regular course of understanding what was going on inside the system.
This is what brings me back to a metaphor about cabling that I used in the abstract of this talk. When doing construction, when the walls are open, that’s the ideal time to run wires to where you want to network. But in addition to running the wire, before the walls are sealed up, a pull string will be left behind so in the future, even though the wall is sealed, you have a mechanism to pull new wires through that same conduit. And that’s how I think about my software.
When we started, we put this abstraction deeply into our domain code, so it was available everywhere we needed it to be. And that acts as the conduit and pulls the string as a place that we can attach information to that at the time might seem useful. Maybe it will be, maybe it won’t. But we also have these hooks in place that we can add additional. If there are new things that we want to add, we already have a place to hang them.
And while a new system might be the best time to do that, the second-best time to do that is now. You really need to figure out what is appropriate for your domain, create that destination, and start weaving it throughout your system. This is one of the practices that your future self will thank you for. I hope I have inspired you to deeply incorporate observability into some of your practices. Thank you.
Hello. I hope that was enjoyable and you’re inspired to add some of that to your own work.
Corey Quinn [Cloud Economist|The Duckbill Group]:
I enjoyed it, and I’m here to basically ask you questions about aspects. For starters, thank you for giving the talk. It’s interesting to figure out how Netflix thinks about Hipster Monitoring. But from the other side: how much of what you do at Netflix and what you talk about applies to larger — sorry, not larger, there’s not many that much larger, but in terms of other organizations where you’re not Netflix, you don’t have some of the best engineers in the world working there. Although most places have someone around who thinks they are. And let’s be clear here, that combines with the failure domain. If Netflix does something that breaks something, the failure mode is, oh, someone can’t watch a movie for 20 minutes and they might riot in the streets for that but it’s not the same as someone not taking down the ATM networks, for example.
That is true. I agree with that. But what I am helping to work at Netflix is more applicable to other companies, because we’re building business applications to help our internal studio do their work. We’re on the content production side rather than the content presentation side of the stream.
In the same way, you have that parallel between the ATM network and streaming: both are things that people expect to work that have different failure modes and result of our failure. Like if the systems I work on fail, it’s the humans who are unable to use the tools to do their jobs. And I think that a lot of other companies are doing that sort of thing. Like if you’re building some sort of internal business application that somebody is using, that’s similar to what we’re doing.
And yes, we are able to borrow a lot of the infrastructure that the company has built that’s used to support the streaming product that we have, so there are great tools. At the same time, sometimes they are not perfect for what business applications are doing. We care about every request. You know, if a user is trying to submit an invoice to pay, they care if that specific request fails. It’s not the same as I’m going to hit play again because the video didn’t start.
It seems on some level, this is so forth the problem with every environment as described in a conference talk — where what you talk about in isolation sounds awesome. The idea of even the basic stuff of tracing, requests through the environments. But take a look at a lot of shops and they look at the environment and everyone believes that their environment is a disaster burning tire fire.
One of the things I learned through consult, every environment is a burning disaster tire fire once you peel back the covers. Nothing is perfect. No one is happy. But the perception is, we have to fix other things that are higher priority than gaining visibility into something that helps us fix the fire. Given that, as you do, always been an iterative process, at what point did you step back, look around and see this vast sea of everything working perfectly. Aha, now it is time to embark upon an observability journey. I kind of think it never worked that way. What did the rollout of this approach start to look like?
Yes, how do you decide to embark upon that journey? If you don’t have anything new that you’re doing, it can be hard to do that. And that’s kind of how I use this as the initial wedge. We were starting to build a new service. And for my own benefit and for our team’s benefit, that’s when we started to add the observability hooks.
We added it early when we started to design the internals of the system. And then we’re able to reap the benefits of that development process even before we had put it into production. So at the same time, like, yes, there’s never a perfect time. Always start now. And to the extent that everything is a burning tire fire sometimes, yes, every place has its problems, but if the — it’s the type of thing the sooner you can put it in place, you can then start to reap the benefits.
I’m kind of drawn to an analogy with other coding practices that we do. You want — you always want to minimize your tech debt. Make sure your code is formatted and you’re not doing other things that are obviously wrong. And I’m trying to steer this observability information into that best practice that you want to incorporate that early into what you’re doing because, again, you continue to reap the benefits over time.
A question came in on Slack that builds on what I just asked. How do you choose where to apply o11y or observability, rebuffer rate, or something like that? My approach was I went bottom-up, which — spoiler, doesn’t work so well. Let’s make sure that we alarm every time a disk starts to get above 80% and that will drive you batty if you’re not careful.
We kind of did both. We wired information capture deep into the system. But initially, the only real health that we had was checking to make sure that the system is up. And it was once that we had, you know, had kind of gotten everything working, we added a synthetic check to the system. And that is now our primary monitoring. There’s an external service that just pings, essentially pings the system and does a write through the entire write agent and is able to verify it’s really just a counter, the counter went up by one. And that gives us not just the health of “is the system up,” but is the entire cycle that’s happening inside the system working.
We did start with more of the metrics as the health. But once we had a certain maturity in the system, we added this other check that checks more of the actual work the system is doing and that actually makes sure the database is healthy and the other parts of the system that care about. So that gives us the sense that when the users use it: will they get the experience they want. And obviously, we can’t sit there adding pennies to somebody’s invoice for a check for that, so we had to make up something. But there’s a synthetic check.
We can tell you don’t work at AWS.
That’s part of, mostly a joke. My entire world is around the economics of this when you’re like a cloud provider and charging on a consumption basis or a vendor. I think what people lose sight of is internally when you start doing things that have observability stories to it there is a cost to this. Not just in terms of the actual resources being used — by “resources,” I never use that to mean people — but also people, the folks rolling this out. How do you get a culture of focusing on understanding to the degree that you have? What’s going on in your environment makes sense? Because I found that in a lot of shops, that’s an incredibly uphill battle.
It is kind of related to economics and staffing. Because I want to get systems running to the point where it’s kind of a spinning top that kind of keeps going — perpetual motion. As a developer, I don’t want to bake in that, you know, pessimistic career security requires maintenance to keep it going. I can’t work on something new unless I know my prior work is stable and functional. And I know that what I have done will continue working and you know, it will alert me if there’s a problem. And I can turn my attention back to it, but then I can scale myself as a resource across multiple systems. Because I’ve built up stuff in a way that it doesn’t require that constant upkeep. The observability that’s present lets me dig back in and understand what’s wrong when something will eventually go wrong, but that’s an exceptional state rather than a steady-state.
Our final question comes from Jill. She says, “Curious, I am assuming Netflix is a polyglot shop. Does Netflix do observability per application level or a sidecar to standardize the observability of approach in the org?”
We are a polyglot shop. And we are also very much a Java shop. We are standardized on a Spring Boot stack. We are using a Spring Boot Sleuth component that writes traces out using Brave. That’s all standard, that’s collecting that. Some other system information is collected via sidecars. A lot of it is common, we’re trying to go towards the common off-the-shelf interfaces that a lot of other systems use because it does unlock some of that polyglot and it leverages the information that people come to the company have. What’s available in the open-source world has, you know, adapters for so many things, so if you can use that and point it at our infrastructure to capture that on the backend, you’re kind of free to use what you want in your service.
Awesome. Thank you so much, Peter. Really appreciate your time and effort in giving the talk.