When Pierre Vincent, head of SRE, and Rob Meaney, director of engineering, started working at Glofox, there was no culture of observability. Performance issues were treated with a shrug of the shoulders, and statements like “Oh, that happened yesterday, too. That’s normal.” were common.
But as they described in a Honeycomb Brunch & Learn session, Pierre and Rob were able to join forces and turn this attitude around by establishing a culture of observability and giving their developers a feeling of ownership over Glofox’s observability as a whole. This guide will share how they did it.
The preliminary step: Treat deployment as the beginning of the journey
The seed for Glofox’s growing culture of observability is the idea that deployment is only the beginning. The most valuable things you can learn about your product and your users are in the production environment, and you need to see what’s happening in production to learn those lessons. This idea isn’t new, but by preaching it every chance he got, Rob laid the groundwork for Glofox’s developers to not only embrace observability but become leaders in spreading its adoption.
In his talk, Rob regaled us with tales of the old days, when shipping code meant literally shipping CDs with code on them to customers. In those days, the process of testing code was rigorous, and a lot of effort went into predicting every little thing that might go wrong because there was no easy way to fix issues after the code shipped.
These days, shipping code is much easier, which allows you (and your competition) to move much faster. This speed leads to a more rapid (and sometimes chaotic) deployment process that creates a complex, distributed production environment that’s full of surprises. In this environment, there will always be problems you can anticipate and problems you can’t. User behavior quirks, unexpected latency spikes, and more will arise no matter how much testing you do before deployment.
Rob realized he and the team at Glofox needed a two-part approach where:
- They test code outside of production to minimize the risk of issues arising.
- They follow that code after deployment to understand how it behaves in production and minimize the impact of any unforeseen issues.
This realization is what kickstarted Glofox’s observability journey. They needed visibility into their production environment to test code in production and fix anything that arose. From there, they followed five overall phases to give developers ownership over observability.
1. Use problem statements as a “definition of ready.”
Instead of just having a “definition of done,” Rob explained they use a problem statement formula as a kind of “definition of ready.” As in, once they define a problem statement, a project is ready to start. Defining the problem early forces the Glofox team to focus on what they want to change and how they’ll track those changes before they even begin thinking about a solution.
A problem statement is a concise, compelling document that defines the problem the code deployment is trying to solve, how you’ll know you’ve solved it, and what kind of instrumentation you need to track your code along the way. Avoid coming up with solutions until the problem statement is complete. This way, you can ensure whatever solution you do come up with will be testable in production and fixable should anything go awry.
Rob recommends thinking about the problem statement as an outcomes-based document. In the case of code deployments, outcomes are changes in user behavior. Keep this in mind as you define the problem and your method for following the solution into deployment.
Many businesses already incorporate a version of a problem statement into their processes, but what’s new here is treating it as a “definition of ready.” Don’t start solving the problem until you know what it is and how you’ll verify its solution. This step will likely require a top-down directive (from someone like Rob) to add or reframe the purpose of a problem statement.
Result: Alignment on what the problem is, why it needs fixing, and (most importantly) how you’ll verify that a new deployment is a success. This plants the idea that deployment is just the beginning and that you’ll need tools and systems to track any new deployment’s behavior in production.
2. Use high-cardinality data to start finding your way back to normal
Without good observability practices in place, what’s considered “normal” production performance becomes skewed. As you begin implementing observability, use it to reset expectations for what normal performance should look like.
Pierre brought up an example where their main API latency would sometimes spike to 10x for no discernible reason. The attitude at the time was, “That just happens sometimes. It’s normal.”
But API latency spikes have a significant impact on user experience by making users wait. If you treat that as “normal,” a lot of unhappy users will go unnoticed because production is “behaving as it should.”
As your SREs begin implementing observability, have them start by using high-cardinality data to identify and fix problems that have become normalized.
In Pierre’s example, an APM tool might show you that average latency spikes, but issues like this are rarely distributed evenly across users. Many users experience no added latency, while others experience a lot. Those unhappy users are hiding in the average.
To locate them, you need high-cardinality data: a column that can have many possible values, such as USER_ID. With high-cardinality data, you can dig into the details behind the latency spike to investigate the specific types of users it impacts, the time of day it affects them, and more.
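As a rough illustration of how an average can hide unhappy users, here is a minimal Python sketch. The event data, the `latency_by_user` helper, and the "twice the overall mean" threshold are all assumptions made up for this example; they are not Glofox’s data or Honeycomb’s API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request events: (user_id, latency in milliseconds).
events = [
    ("user_a", 100), ("user_a", 110),
    ("user_b", 90), ("user_b", 100),
    ("user_c", 1000), ("user_c", 1000),
]


def latency_by_user(events):
    """Group latencies by user_id and return each user's mean latency."""
    grouped = defaultdict(list)
    for user_id, latency_ms in events:
        grouped[user_id].append(latency_ms)
    return {user_id: mean(values) for user_id, values in grouped.items()}


# The single overall average blends fast and slow users together.
overall_mean = mean(latency_ms for _, latency_ms in events)

# Breaking down by the high-cardinality field shows who is actually affected.
per_user = latency_by_user(events)

# Arbitrary illustrative threshold: users slower than twice the overall mean.
slow_users = [user for user, avg in per_user.items() if avg > 2 * overall_mean]

print(f"overall mean: {overall_mean} ms")
print(f"per-user means: {per_user}")
print(f"users hidden by the average: {slow_users}")
```

In this made-up data set, two of the three users see roughly 100 ms, while the overall mean is pulled up by a single user whose requests take a full second. Grouping by user ID is what makes that user visible.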
Your SREs won’t be able to solve all of these issues right away. In fact, it’s better to start small with problems that your team treats as “acceptable” but that have a direct, negative impact on performance. Bit by bit, a new “normal” will start to emerge. This new baseline will become the foundation for measuring the success of future deployments, prioritizing production issues, and keeping users happy.
Result: There will be a morale boost accompanying the realization across the company that you don’t have to live with these performance issues and that it’s not as difficult as you might think to fix them.
3. Start small and gain adoption organically
Don’t try to implement observability all at once. Once you get a better idea of what normal should look like, move forward on a service-by-service basis, allowing each developer to individually implement and gain ownership over observability in their own service.
When Pierre’s team responds to an incident, they create a quick video explaining what happened, how they solved it, and their thought process. In Pierre’s experience, word spreads quickly to other teams, who then request instrumentation for their service.
Pierre’s team can then connect with these interested people and show them how to instrument their services and how to use Honeycomb. How you go about instrumentation, whether through Beelines or wide structured logs, depends on the service. Either way, the developers are directly involved and, in many cases, initiated the idea of instrumenting the service themselves rather than having it forced upon them.
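For readers unfamiliar with wide structured logs, the idea is to emit one event per unit of work, carrying many fields (including high-cardinality ones such as a user ID) rather than many short text lines. The sketch below uses only the Python standard library; the function names and field names are hypothetical and do not represent the Beeline API or Glofox’s actual schema.

```python
import json
import time


def build_wide_event(service, route, user_id, status_code, duration_ms, **extra):
    """Build one wide event (a flat dict) describing a single request.

    All field names here are illustrative assumptions, not a required schema.
    """
    event = {
        "timestamp": time.time(),
        "service": service,
        "route": route,
        "user_id": user_id,          # high-cardinality field
        "status_code": status_code,
        "duration_ms": duration_ms,
    }
    # Extra context (build ID, plan type, region, ...) is added as more columns.
    event.update(extra)
    return event


def emit(event):
    """Serialize the event as a single JSON line and write it to stdout."""
    line = json.dumps(event, sort_keys=True)
    print(line)
    return line


event = build_wide_event(
    service="bookings-api",   # hypothetical service name
    route="/classes",         # hypothetical route
    user_id="user_c",
    status_code=200,
    duration_ms=1000,
    build_id="abc123",        # hypothetical extra field
)
line = emit(event)
```

Because each line is self-describing JSON, a log pipeline can forward it to an observability backend, where any field can be used to filter or group events.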
With this ownership, developers get direct visibility into the impact of observability on their code. From there, observability adoption starts to build up steam.
Result: Observability adoption is in part owned and enforced by the developers themselves. They see the benefits and share these benefits with others, slowly spreading the culture of observability without too much effort on the SREs’ part.
4. Be proactive and make time for building long-term solutions
Once you’ve established a baseline of normal production performance and begun spreading observability organically, the SRE and IT teams can turn their attention to long-term projects. In other words, they’re no longer frantically putting out fires; they’re now focusing on making improvements to fireproof the building.
With stability afforded by the new baseline of normal production performance and organic adoption comes clarity into what’s happening in your production environment.
According to Pierre, your SRE team should take time each week to sift through the data to find things that are likely disruptors, were near misses, or will be upcoming scaling issues. Then, they can use the extra time gained