Renato Todorov [Global VP of Engineering | HelloFresh]:
Hello, everyone. Thank you for joining me today. My name is Renato. I work as the global VP of engineering at HelloFresh. Over the next 20 minutes, I’m going to be offering six practical tips on how to steer your organization in the direction of observability and hopefully make life easier for you if you decide to take on this challenge.
Because time is short, let’s get going. So you might be asking yourselves, what can be so hard about it? Observability is super cool. People should be excited about it, right? That’s unfortunately not how things work in real life. True observability is a relatively new thing. It’s a new paradigm. It changes the way people have been doing things for many years. Folks take things like traditional logging, metrics, and APMs for granted. Telling them, “Hey, we do things differently around here,” is the challenge.
I was lucky enough to have an observability mentor. Someone who introduced me to Liz [Fong-Jones], Charity [Majors], and some other folks in this space. And after following them for a while, the concept of observability suddenly clicked for me, and everything they said started making a hell of a lot of sense. Problem is, there’s a big chance a large majority of the people in your organization are not following them or haven’t read the Site Reliability Engineering book from Google, which puts the burden of setting this vision on you.
So my first practical tip is: Understand your audience. What are the people in your organization like? Are they early adopters or are they laggards? Do they like reading books and articles? Do they prefer watching videos on YouTube to learn new things? Try to tailor your message to your audience. It’s important to build the narrative around the change you’re trying to implement, so spend time thinking about your communication strategy. How are you going to build excitement around observability among people who are busy building software and fixing bugs for a living?
Do you have the time to mentor people who could potentially be multipliers of this vision? That can be really, really helpful. By the way, I’m going to leave a bonus tip here: a book called Multipliers: How the Best Leaders Make Everyone Smarter, by another Liz [Wiseman]. If you’re going to lead this kind of transformation on a regular basis, I highly recommend it.
Okay. So you decided to transform your organization and make the shift from the three pillars of monitoring to actual observability. The first thing you need to do, even before you begin, is to have empathy for the folks you influence. Life is already hard enough for them. There’s pressure from the business, and senior management is preaching that “you build it, you run it” thing. But they very often forget that running stuff in production takes time. So dumping an extra burden on top of engineers will not help you with your agenda. Do not go around telling people that from now on they need to use this fancy new SDK and instrument their systems, because they will ignore you. They have OKRs to meet.
So my second and probably most important tip is: Identify a real problem that can be solved with observability. Start by asking yourself, how can I make people’s lives easier? What is the biggest pain point for engineers in my organization today? Is it debugging software in the staging or QA environment? Is it the distributed architecture? Is it not knowing what’s broken until a customer complains? Or is it finding the cause of an incident in production? Once you identify a real problem, measure it so you can show people that your observability work actually had a positive effect.
For example, mean time to detect, mean time to recover, time spent debugging—those are very straightforward metrics that can be used to compare life before and after adopting an observability philosophy. In the end, this transformation has to be about reducing the cognitive load for people. Remember, life is hard. So how can we make it easier? On this topic specifically, I highly recommend having a look at the Team Topologies: Organizing Business and Technology Teams for Fast Flow book, where the authors talk a lot about keeping cognitive load under control.
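To make the before-and-after comparison concrete, here is a minimal sketch of computing MTTR from incident timestamps. The incident data and the shape of the records are hypothetical; the point is just that the metric is an average over (resolved minus detected) durations, so it is cheap to collect and easy to compare across quarters.

```python
from datetime import datetime, timedelta
from statistics import mean

def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to recover: average of (resolved - detected) per incident."""
    return timedelta(
        seconds=mean((end - start).total_seconds() for start, end in incidents)
    )

# Hypothetical incident log: (detected, resolved) pairs from two quarters.
before = [
    (datetime(2021, 1, 4, 3, 12), datetime(2021, 1, 4, 5, 42)),
    (datetime(2021, 2, 9, 14, 0), datetime(2021, 2, 9, 15, 30)),
]
after = [
    (datetime(2021, 4, 2, 9, 5), datetime(2021, 4, 2, 9, 50)),
    (datetime(2021, 5, 20, 22, 10), datetime(2021, 5, 20, 22, 40)),
]
print(mttr(before), mttr(after))  # → 2:00:00 0:37:30
```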
So all of these tips are coming from someone who has been leading a platform team at HelloFresh, but this is also useful for anyone influencing people in any way. If you’re a staff engineer, an engineering manager, or a developer who wants to make the lives of your colleagues better, you can use the knowledge in these books, and in this talk, to help steer your organization forward.
Cool. So you’ve actually started, and you’re getting a lot less traction than you expected. What now? The theory of the diffusion of innovations was put together in 1962, but it’s still extremely relevant. I strongly recommend that you take an agile approach here. Don’t try to convert the entire organization to an observability mindset in one go.
Find the innovators. Solve a real problem for them. Make them excited so they become your allies. Then move to the next group. You’ll probably find harder problems to solve with the second group. And again, an agile approach is going to be really helpful. Solve problems as they appear.
This will also help you build momentum and, eventually, you’ll be able to cross the chasm. Referring back to a concept from the Team Topologies book: you should act as an enabler of observability for your peers. This is a really good opportunity for you to pair, or do mob programming sessions, or interview developers to understand what is really bugging them in their day-to-day. What is keeping them from being more productive and being able to focus on solving business needs rather than fiddling around with infrastructure or monitoring?
So my tip number three is: Be intentional about diffusing innovation. Map out your organization, find who’s who, make a plan, and iterate over it. The ability to diffuse innovation is critical for anyone who wants to lead cultural changes. As a bonus tip, the book Crossing the Chasm, released for the first time in 1991, can also be very helpful.
Moving on, I think another thing we should always have in mind when building stuff for developers is: Could this be any easier? One of the main advantages of observability is that you don’t need to have your logs in one tab, Grafana open in a second tab, and Jaeger in a third one. With observability, everything is in one place.
You can make it even easier by, for example, integrating the tool with Slack, or adding links to queries directly to your alerts and runbooks. All of this helps more people become aware that the tool is in place and is very helpful. If your on-call person gets a page that includes a link to the relevant dataset in Honeycomb, they will use it and love it. When they get woken up at 3:00 a.m., the last thing they want to do is search for that new tool the SREs mentioned the other day. Their brain will push them in the direction of whatever they have been doing for the past few years.
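As a sketch of what “adding links to queries directly on your alerts” can look like, the snippet below enriches an alert payload with a deep link into a query UI, so the on-call engineer lands on the relevant data instead of a blank search page. The URL shape, query fields, and base URL are illustrative assumptions, not Honeycomb’s actual link format; check your vendor’s docs for the real one.

```python
import json
from urllib.parse import quote

def build_query_link(base_url: str, dataset: str, query: dict) -> str:
    """Build a deep link to a query over a dataset.

    The URL shape here is illustrative; substitute your tool's real
    query-link format.
    """
    encoded = quote(json.dumps(query))
    return f"{base_url}/datasets/{dataset}?query={encoded}"

def enrich_alert(alert: dict, base_url: str, dataset: str) -> dict:
    """Attach a 'relevant data' link to an alert payload, scoped to the
    affected service over the last hour."""
    query = {
        "time_range": 3600,  # last hour, in seconds
        "filters": [{"column": "service.name", "op": "=", "value": alert["service"]}],
        "breakdowns": ["trace.trace_id"],
    }
    enriched = dict(alert)  # don't mutate the caller's payload
    enriched["query_link"] = build_query_link(base_url, dataset, query)
    return enriched

alert = {"service": "checkout", "summary": "p99 latency above SLO"}
enriched = enrich_alert(alert, "https://ui.example.com", "production")
print(enriched["query_link"])
```

The same link can then be posted by the alerting integration into Slack or PagerDuty, which is where the “lower the entry barrier” payoff happens.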
So this leads me to tip number four: Lower the entry barrier as much as possible. Make it easier for people to find what they’re looking for in the structured logs than anywhere else. Integrate the tool into internal processes. If you have a skeleton that people use to build new microservices, bake the instrumentation into that skeleton.
Those kinds of scaffolds can be very useful because, if in order to create a new service a developer has to figure out CI/CD, integration testing, acceptance testing, performance testing, and instrumentation, they will probably leave instrumentation behind. So if you can also include the latest version of the SDK that you want people to use in this skeleton, it will be a lot easier for them to start from a good baseline.
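A minimal sketch of such a skeleton generator: it writes a telemetry bootstrap module into every new service, so instrumentation exists from the first commit rather than as an afterthought. The file layout and the OpenTelemetry setup inside the template are assumptions about what a platform team might provide, not HelloFresh’s actual skeleton.

```python
import pathlib
import tempfile

# The bootstrap file every generated service starts with. It assumes the
# OpenTelemetry SDK is a declared dependency of the generated project; the
# exporter choice is illustrative.
TELEMETRY_BOOTSTRAP = '''\
"""Telemetry bootstrap -- generated by the service skeleton."""
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def init_tracing(service_name: str) -> None:
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)
'''

def scaffold_service(root: pathlib.Path, name: str) -> pathlib.Path:
    """Generate a new service directory with telemetry already wired in."""
    service_dir = root / name
    service_dir.mkdir(parents=True, exist_ok=True)
    (service_dir / "telemetry.py").write_text(TELEMETRY_BOOTSTRAP)
    (service_dir / "main.py").write_text(
        f'from telemetry import init_tracing\n\ninit_tracing("{name}")\n'
    )
    return service_dir

out = scaffold_service(pathlib.Path(tempfile.mkdtemp()), "checkout-service")
print(sorted(p.name for p in out.iterdir()))  # → ['main.py', 'telemetry.py']
```

In a real skeleton the same generator would also drop in CI/CD config, test harnesses, and dependency pins, so the instrumented path is the default path.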
Cool. Now that the basic work is done and you’re about to cross the chasm, if your organization is growing, take advantage of it. When new joiners have their first days in their new job, they are still very much open-minded. They’re curious. They’re more willing to learn new things. This is the perfect opportunity to sell them that vision. If you do it well, they will join their teams and start asking, “Where’s your structured logging? Why are we not using that cool thing they showed me during my onboarding?” On the other hand, if you lose this window of opportunity, they might actually start asking, “Why don’t we have a normal APM here?” and now you’ve made your life harder.
So tip number five is: Onboard new joiners to this vision regularly. If you have people joining every month, do a quick talk. Ask HR for a 5-to-10-minute slot in their onboarding session. It’s worth it. As a bonus tip, you can strongly suggest they follow Liz and Charity [on Twitter]. Trust me, it will make your life a lot easier.
Talking about making life easier, if you’re pushing for observability, pay attention to the outcome of your work. If you’re deploying collectors, monitor them properly and make sure they’re working and can scale, this kind of stuff. Try to cover all your hops with tracing. It’s frustrating for people in the middle of an incident to open a trace and see a giant 10-second span that leads them nowhere. Do your best to fix missing spans, even if it requires pairing with developers to fix their instrumentation here and there. Again, this is a great opportunity to meet with people, pair with them, and understand where the blind spots are so they can have an easier time running systems in production.
Also, keep in mind observability is not something you dump and move on. So keep helping people. Over time, they will want to do fancier stuff. They will want to have end-to-end user journey SLOs, and you will need to help them with that. That’s a sign you’ve succeeded. Your efforts towards observability have brought real tangible improvements to the organization. This leads me to my last tip.
Own it. Well, you build it, you run it, right? So my real advice here is to take ownership of problems like missing spans, edge-layer integration, and frontend data ingestion. The richer and more stable the solution is, the more people will want to use it. You will have to invest time in setting all this up: integrating your CDN solution, integrating with your different ingress layers, making sure that spans are properly propagated.
So I hope you can make use of these tips and that this talk served as encouragement and will help you lead this transformation. I might have made it sound too hard, but leading change is fun and actually very rewarding. So good luck on your journey.
Ben Hartshorne [Engineering Manager | Honeycomb]:
Renato, thank you. That was fantastic. So many things there that are just … I want to dig into and talk about. Also thank you for all of those books. My bookshelf is, I think, probably growing faster than I can read through it, but we’ve dropped links to each of them in the hnycon Practical Lessons channel in Slack in case other folks want to dig into those.
This bit about cognitive load, I really like that. It’s such a clear part of understanding your systems: having the ability to walk through an issue when it’s late, when you’re tired. Avoiding added cognitive load, and making sure your systems don’t impose it, is such a key part of making things operable. Were there specific bits of cognitive load that you were able to identify and really circumvent as you were working through this with the early adopters?
Renato Todorov:
Yeah. So I think one of the goals of observability is to actually remove cognitive load. To make it easier for people to find what they’re looking for. But observability has to be implemented. From the platform side, when we started asking people to adopt things like Honeycomb and OpenTelemetry, they were busy building their products, features, and experiments. So there was some added overhead that we, from the platform side, were asking of them.
So I found it to be a little bit tricky, and it required some strategic thinking about how we could frame this in a way where people understand that, with a little bit of investment, they will get benefits in the very short term. For example, reducing the mean time to recover from incidents, or spending less time fiddling around with logs and metrics and hundreds of different dashboards. But it is a journey. When we started working more closely with the teams, we understood that it’s not just dropping in the SDK. You have to instrument the systems and wire things together: for example, CDN metrics and logs, edge layers, all the different ingresses. A lot of context was being missed, and we ended up with a lot of root spans without context, and stuff like that.
So we kind of learned together, and I think a successful integration with a development team is one where you end this journey with something tangible. Like, for example, now they can very easily identify what is wrong in the middle of an incident. Or even in the staging environment, when something is not working the way they expected, they can jump into the staging dataset and find what’s going on without having to SSH into machines and do the things we shouldn’t be doing at this point.
Ben Hartshorne:
And I really like the way that connects to the fourth tip there, about lowering the entry barrier. I mean, that’s another form of cognitive load: all of the things you need to think about when just getting started on a new project. Adding instrumentation into the skeletons, you’re paving the right path and making it clear that if you follow the normal route, you get the fun tools along the way.
Renato Todorov:
This was actually very important as well. Instead of just sending people to the README file of the OpenTelemetry repo, we thought about how it applies to HelloFresh. Is there anything specific people will struggle with? So we have our own internal wiki, where we store guidelines on how to implement OpenTelemetry on your systems, tailored to the HelloFresh infrastructure: the way we use Kubernetes, the hosts and ports for the collectors, this kind of stuff.
And because people were really into this kind of documentation, we expanded it to also cover how to define SLIs, how to define SLOs, and how to implement them using the automation that the platform team provided. Of course, this information is mostly in the SRE book, but again, people will not have the time to read it, so we decided to digest some of the complex bits of SRE to help people take shortcuts without compromising on the quality of the implementation.
Ben Hartshorne:
You know, that’s a really interesting connection there: the relationship between SRE and platform as an organization and the rest of the development organization. You had a focus there on owning it. You build it, you run it; you take ownership of your own problems. You were speaking of the team setting up the ingestion pipeline, but that’s equally true of all dev teams. How do you build this relationship so that you enable teams to own it, while still having a team with a lot more experience around operationalization participate in that process?
Renato Todorov:
Yeah. I think that’s something I first read on Martin Fowler’s blog: applying product thinking to platforms. When you do that, you start to understand that the developers are actually internal customers, and you have to treat them as customers. You have to monitor the adoption rate. You have to take care of the interface you’re offering people, and make it the simplest possible interface.
One thing platform teams usually don’t consider is the fact that not everybody who is joining is used to running systems in production. This is not a given. Maybe a lot of people will be doing this for the first time. If they have to learn about a lot of things, if they have to read the entire SRE book to be able to run their systems in production, you’re not setting them up for success.
So I think it’s really important for any platform team, SRE team, or DevOps team, if you have such a team in your company, to really think about the interface you’re offering your developers. It must be as easy as possible, right? Don’t take things like DevOps skills for granted, because they might not be there. Some people might not be excited about it, but they still need to run their services. So make it as easy as possible. For everything. For deploying, for debugging, for runbooks. One of the things that really changed the game for us was starting to include links to runbooks in alerts; this drastically reduced the MTTR. It’s such a simple thing; it’s just a link we added to the alert.
Ben Hartshorne:
Yeah, you’re being woken up. Please, here are the instructions. You are tired. Let me hand this to you.
Renato Todorov:
Exactly. This kind of stuff. And it doesn’t require heavy engineering. It’s just a little bit of product thinking, of strategic thinking.
Ben Hartshorne:
Yeah. I almost want to take a tangent towards OpenTelemetry around the interface you’re offering your developers. That might be too far off, though. What do you think of the SDK surfaces that you implement? Do you provide your own wrappers, specific to your business, to make the telemetry consistent across applications and things like that?
Renato Todorov:
So we try not to. This was one of the pain points at the beginning of our journey. It’s a timing issue as well, because we started adopting observability while OpenTracing and OpenCensus were thinking about merging and OpenTelemetry was coming up. It was a very early stage, so we were dealing with these different SDKs, and at some point we had people using different SDKs on their own services. We wanted to be vendor-agnostic, so we were pushing for one of the open implementations, and we decided not to build our own.
So what we did instead was include these interfaces in the default Helm chart that people use. If they wanted to enable tracing, they would just set the flag to true and drop in their SDK. But we didn’t enforce anything specifically. These days, we highly recommend that people use OpenTelemetry, and we are using it in our own internal services.
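As an illustration, a values fragment like the one below is one way such a flag can look in a default Helm chart. The field names are hypothetical, not HelloFresh’s actual chart:

```yaml
# values.yaml (fragment) -- illustrative field names, not a real chart.
# Tracing ships disabled by default; a team opts in with a single flag.
tracing:
  enabled: true            # flip to true and redeploy; that's the whole ask
  collectorEndpoint: "otel-collector.observability.svc:4317"
  sampleRate: 0.25         # head sampling rate; tune per service
```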
And I think the scenario is getting better. OpenTelemetry is getting more stable. The APIs are more stable now. The main features are there. So I would try to avoid, as a platform team, creating wrappers and stuff that is too tailored to your needs. If you can use whatever the industry is using, there is less for you as a platform team to maintain. And this goes back to the cognitive load talk.
Ben Hartshorne:
It also connects to the thing you were saying about paying attention to the outcome of your work. When adding the facility for instrumentation, it’s not something you just dump and move on from. It’s a continuing process. And as you said, you started with OpenTracing and OpenCensus and now it’s OpenTelemetry. It’s not a static thing, this idea of moving towards observability as a continued process. You have to come back and check. You have to make sure that you’re doing the right bits.
Renato Todorov:
Yeah. I think it’s about treating reliability as a product as well. When you launch a new product in the market for your end customers, you don’t just launch it and leave it there. You keep monitoring it. You keep talking to your customers. You keep getting data from it. I think it’s the same for platform things in general, right?
You launch this new interface for SLOs. You launch this new product for observability. You need to continuously interview developers and understand how they are using it. It’s amazing how people use the systems you build on a platform in ways you never expected. It’s really amazing to pair with these people and understand: okay, what are you trying to do that we haven’t thought about? Then we can adjust the interface, or maybe think about a different product that is a better fit for the problem they’re trying to solve. So it’s really important to keep this close connection with the internal customers of an SRE team or a platform team.
Ben Hartshorne:
That’s the blessing of platforms and APIs, right? You are offering a capability, but you don’t know how it’s going to be used. You need to keep checking in and understanding how this use is manifesting and what its goal is, so you can make sure it works as well as it can.
Renato Todorov:
Especially because we’re not building services that serve the end users. We definitely need this information coming from the engineers. It’s about feedback loops. It’s not because you’re on a platform team that you don’t need feedback loops. It’s quite the opposite.
Ben Hartshorne:
So one last question for you: Do you have a favorite tip of these six? Which was your favorite?
Renato Todorov:
I think these days, it’s the one about cognitive load. This is the most impactful one for me. I’ve been observing that, with the whole pandemic thing, it just became worse for a lot of people. We are under pressure. Businesses that are not going well are under pressure because they’re not going well. Businesses that are going well are under pressure because they need to keep up.
It’s a great time to think about cognitive load and try to keep it under control. Not necessarily to reduce it to a minimum, because you need to keep building good stuff, but to keep it under control. I think this is the main message here. And I think that’s why I’ve been recommending the Team Topologies book to so many people.
Ben Hartshorne:
Wonderful. Thank you so much for spending your time with us this morning. Fantastic talk. Thank you for these questions. We are going to take a short break and then continue on to the next session.