Calculating Costs for Observability
Observability is the only way to proactively manage production systems. Complex systems are the top challenge facing DevOps teams. Your customers depend upon you to deliver high reliability without slowing development productivity. You must invest in shortening outage durations and eliminating wasted developer time.
DevOps practitioners and business leaders alike are beginning to understand that to scale and operate a service that drives growth and competitive edge, you must invest in the right tools and approach. Production system performance and uptime are just one aspect that directly impacts the customer experience. When you continuously integrate and deliver new features, systems become more complex and, unless tightly managed, business risk goes up. Observability is a critical requirement that enables teams to level up and manage ever-increasing complexity.
Distributed systems architectures are inherently complex, and the addition of continuous integration and continuous delivery (CI/CD) raises the stakes. Visibility and control are central to success, yet as delivery systems become automated, everything becomes more opaque and therefore harder to proactively manage. Add to this the abstraction layers of containers or a serverless infrastructure, and the team feels even further removed from being in control. As a result, the number of potential causes for any given issue increases, while pinpointing any single one as the cause becomes much harder.
Debugging in production is a requirement for modern teams, especially for teams who ship frequently. DevOps teams need the best tools to debug issues when they come up, not just hope they can catch everything in staging. Our customers tell us that before Honeycomb, they frequently experienced incidents where problem sources were never identified. Teams can no longer rely on simple metrics alone to provide the level of insight they need to diagnose and resolve issues, especially at scale. Observable production systems enable you to move beyond locating gnarly bugs or fixing a problematic incident or outage. Designing your systems to include observability from the point at which a feature is released allows teams to immediately learn how it behaves in production and adjust before a critical outage occurs.
When a new feature is shipped, can you clearly see the impact it has on your systems? As load climbs and you have to choose to add capacity or optimize code, do you know where to focus in order to make the most impact and keep your most important customers happy?
REAL WORLD BENEFIT
Intercom used Honeycomb to evaluate performance across all the dimensions required to understand how different users and types of usage affected the performance of a given endpoint. They were able to identify the portions of the code needing refactoring and to document concrete examples of how they’d improve performance.
When a user misuses your service, maliciously or otherwise, are you able to locate the vulnerability in your codebase and then address the problem before others notice? Do your tools have the power to isolate the source of an attack, or how many users it may be impacting?
REAL WORLD BENEFIT
When hackers tried to DoS their service, carwow needed the ability to query at a level of granularity that their traditional APM tools couldn’t manage, so they turned to Honeycomb.
Visibility into 3rd-party Services
If your product relies on external API calls and responses, can you identify the source of a service slowdown? Can you sift through the information coming from your database, your cache, your load balancers, and your own code quickly and reliably enough to know whether you should be looking to 3rd-party providers for a resolution?
REAL WORLD BENEFIT
Behaviour Interactive (BHVR) had been using a classic APM approach for some time to troubleshoot latency issues in their flagship multiplayer video game, but were unable to identify the source of a service slowdown—was it in the caching, the database, or somewhere in one of the numerous external calls? With Honeycomb, they found the issues in just minutes.
Addressing Technical Debt
As your organization scales and your product’s footprint grows, are you able to maintain clear sight-lines across your infrastructure as complexity increases? Can you evaluate systems performance using distributed tracing views and better understand the interactions among an increasing number of services?
REAL WORLD BENEFIT
While growing as fast as possible to meet their business demands, carwow used Honeycomb to follow a request through its entire life-cycle and understand the impact on different subsystems in the code, leveraging its cross-team collaboration features to solve issues.
User Happiness and Product Management
Do you understand how the end user experiences your product? Do you notice when they use features in unexpected ways and can you capture that data for your product team to investigate?
REAL WORLD BENEFIT
Using Honeycomb, Intercom discovered one of their users was trying so hard to use their product in ways they hadn’t anticipated that it was impacting the overall experience of many others. That finding informed future product planning for the feature.
Key capabilities to move the needle on your observability practice
If you experience any of the following, then you must adopt an observability approach. This will involve cultural and process-centric changes, but in this document we focus on the technology tooling that DevOps teams require in order to fully understand production, debug faster, and spend less time fighting technical debt.
- Increased frequency of code ships or feature releases
- Increase in volume of users/customers
- More questions for engineers from on-call teams
- Customer complaint issues on the rise
- Pressure to get new features into production faster
An observability practice requires tooling with the following capabilities. Without these, it is extremely difficult to answer the questions that matter to your production system.
Here are some key capabilities and why they matter:
Automatic instrumentation of events and traces across popular languages
Most developers don’t enjoy instrumenting their code, yet everyone on the team needs that telemetry for the context and meaning that observability depends on.
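As a rough illustration of what instrumentation can take off a developer's plate, the sketch below wraps a function so that every call emits one structured event with its name, duration, and error type. It is not how any particular vendor's instrumentation works; the names `instrumented`, `EMITTED`, and `lookup_price` are hypothetical, and the in-memory list stands in for a real telemetry backend.

```python
import functools
import time

# Hypothetical stand-in for a telemetry backend: events are appended here.
EMITTED = []

def instrumented(func):
    """Wrap a function so that each call emits one structured event."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        event = {"name": func.__name__}
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            event["error"] = None
            return result
        except Exception as exc:
            # Record the failure type, then let the exception propagate.
            event["error"] = type(exc).__name__
            raise
        finally:
            # Runs on success and on failure, so every call is recorded.
            event["duration_ms"] = (time.perf_counter() - start) * 1000
            EMITTED.append(event)
    return wrapper

@instrumented
def lookup_price(sku):
    # Hypothetical business function used only for illustration.
    if sku == "missing":
        raise KeyError(sku)
    return 42

lookup_price("sku-1")
```

The function body stays free of telemetry code, which is the point: the context is captured at the boundary rather than hand-written in every handler.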
Query performance suitable for rapid iteration and debugging in production
When your team is in a firefight, the last thing you want is to wait for query results, whether because of ETL delay, slow search performance, or worse, no access to the data set at all.
Support for flexible queries over many dimensions
More and more, the questions that need to be asked to move the business forward require the ability to drill down across many aspects of your system data, but most tools aggregate data, removing the detail required.
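To make the cost of aggregation concrete, here is a minimal sketch that assumes raw events are kept as wide records. The event fields, values, and the helper `mean_duration_by` are all invented for illustration. Averaged per endpoint, `/search` merely looks slowish; grouped by one more dimension, a single customer stands out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events: one wide record per request.
EVENTS = [
    {"endpoint": "/search", "customer": "acme", "build": "v41", "duration_ms": 40},
    {"endpoint": "/search", "customer": "acme", "build": "v42", "duration_ms": 45},
    {"endpoint": "/search", "customer": "globex", "build": "v42", "duration_ms": 900},
    {"endpoint": "/search", "customer": "initech", "build": "v41", "duration_ms": 35},
    {"endpoint": "/cart", "customer": "globex", "build": "v42", "duration_ms": 60},
]

def mean_duration_by(events, *dimensions):
    """Group raw events by any combination of fields and average the duration."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        groups[key].append(event["duration_ms"])
    return {key: mean(values) for key, values in groups.items()}

# What a pre-aggregated metric would show: one number per endpoint.
by_endpoint = mean_duration_by(EVENTS, "endpoint")

# What raw events allow: drill down by any additional dimension after the fact.
by_endpoint_customer = mean_duration_by(EVENTS, "endpoint", "customer")
```

Because the grouping happens at query time, the same events can be sliced by `build`, or by any field added later, without deciding in advance which aggregates to keep.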
Intuitive query interface
It seems like every new tool involves learning a new query language, with the slowdown that comes with the learning curve.
When something in the data looks unusual, it can take a lot of false starts to get at what might be the cause and to determine how big an impact it may have. When production is impacted, speed matters.
Distributed tracing visualizations in context
Context switching is a known productivity killer.
Smart sampling to retain key data points
Relying on metrics means having little or no control over what gets averaged out of your data.
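One way to picture sampling that keeps the key data points: retain every error and every slow request, keep only a fraction of the healthy traffic, and record a sample rate on each kept event so totals can still be estimated. The sketch below is an assumption-laden illustration, not a description of any product's sampler. The function names, thresholds, and the simple "every Nth healthy event" rule are hypothetical; real samplers typically choose randomly or by hashing a trace ID.

```python
def sample_events(events, healthy_rate=10, slow_ms=500):
    """Keep every error or slow event; keep 1 in `healthy_rate` of the rest."""
    kept = []
    healthy_seen = 0
    for event in events:
        interesting = event["status"] >= 500 or event["duration_ms"] >= slow_ms
        if interesting:
            # Interesting events are never dropped; each represents itself only.
            kept.append({**event, "sample_rate": 1})
            continue
        healthy_seen += 1
        if healthy_seen % healthy_rate == 0:
            # Each kept healthy event stands in for `healthy_rate` originals.
            kept.append({**event, "sample_rate": healthy_rate})
    return kept

def estimated_count(kept):
    """Re-weight kept events to estimate how many events there were originally."""
    return sum(event["sample_rate"] for event in kept)

# Hypothetical traffic: 1000 fast successes, 3 errors, 1 very slow request.
events = [{"status": 200, "duration_ms": 20} for _ in range(1000)]
events += [{"status": 500, "duration_ms": 30} for _ in range(3)]
events += [{"status": 200, "duration_ms": 2500}]

kept = sample_events(events)
```

In this example 104 of the 1,004 events are stored, all three errors and the slow request survive, and the re-weighted count still comes out to 1,004.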
Fine-tunable data retention policies
Multiple data sources mean multiple legal or business requirements for retention.
Support for open standards such as OpenCensus
Ease of getting data into your tools is critical, and some organizations want to adhere to open source standards where available.
The ability to fully leverage your best talent, past and present
Collaboration is more than just an idea or process approach.
Does this seem like a tall order? In many ways, observability is equal parts process, culture and tooling. Building