Escaping the Cost/Visibility Tradeoff in Observability Platforms
By Fahim Zaman | Last modified on January 5, 2024
For developers, understanding the performance of shipped code is crucial. Over the last decade, saving and tracking app metrics has been a table-stakes function of software monitoring and observability solutions. Engineers love tools that get out of the way and just work, and the appeal of today’s best-in-class application performance monitoring (APM) suites lies in a seamless day-zero experience with drop-in agent installs, button-click integrations, and immediate metrics collection. However, the success of no-hassle metrics comes with a caveat: the internet is replete with examples of premier application monitoring costs spiraling beyond expectations.
Is there a pattern behind these worries? It turns out, yes. There is a level of modern system complexity where relying on metric collection spirals into cost inefficiency. Read on for an explanation of the challenge with the traditional APM cost model for modern software teams, and how to begin solving it.
The problem: escalating costs with system complexity
With metrics-driven APMs, costs escalate as systems grow in complexity through modern practices and service compartmentalization. Each new host, pod, node, or service adds to the bill. This is because most solutions store and index detailed custom metrics for fast analysis and also store data in additional forms such as traces and logs. The result is a cost/visibility tradeoff: companies find themselves either over-investing in observability or sacrificing visibility to control costs.
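To see why stored custom metrics get expensive as systems grow, it can help to run the arithmetic. The sketch below assumes a billing model in which every unique combination of tag values is stored as its own custom-metric time series. The tag names and cardinalities are hypothetical, and real vendors count and price series in their own ways.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: estimates how many distinct time series a tagged custom
// metric can produce. Tag names and cardinalities are hypothetical.
class MetricCardinalitySketch {

    // Upper bound on distinct series: the product of per-tag cardinalities.
    static long maxSeries(Map<String, Long> tagCardinalities) {
        long series = 1;
        for (long cardinality : tagCardinalities.values()) {
            // multiplyExact throws instead of silently overflowing
            series = Math.multiplyExact(series, cardinality);
        }
        return series;
    }

    public static void main(String[] args) {
        Map<String, Long> tags = new LinkedHashMap<>();
        tags.put("service", 20L);
        tags.put("pod", 50L);
        tags.put("endpoint", 30L);
        System.out.println("series with infra tags only: " + maxSeries(tags));

        // Adding one high-cardinality tag multiplies the total.
        tags.put("customer_id", 1_000L);
        System.out.println("series after adding customer_id: " + maxSeries(tags));
    }
}
```

The point of the sketch is the multiplication: each new dimension multiplies the number of stored series rather than adding to it, which is why context such as user or customer IDs tends to be the first thing cut when metric bills grow.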
Limiting observability due to cost constraints hinders understanding user interactions and system performance. This leads to compromised user experiences, as issues in newly shipped code may go undetected. Either your developers slow down on shipping features your business needs, or you understand and solve less about production code, frustrating users with a backlog of unresolved bugs and incidents. Ultimately, your end users have a worse experience.
Solving for the cost/visibility tradeoff with modern observability
Modern observability solutions like Honeycomb don't require storing the same data multiple times in different formats. The solution takes advantage of wide, attribute-rich events and a parallelized query engine that can provide real-time data retrieval, metrics graphs, alerts, and relationship analysis without the need to pre-store custom metrics or rely on patching together separate data types.
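As a rough illustration of the idea, the sketch below models wide events as plain attribute maps and derives a count and a latency percentile from the same events at query time, with no pre-stored metric. It is a toy, in-memory stand-in and not a description of how Honeycomb's datastore or query engine works; the attribute names (`status`, `duration_ms`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch only: one list of wide events answers several questions at query time.
class WideEventQuerySketch {

    // Count events whose attribute equals the given value.
    static long countWhere(List<Map<String, Object>> events, String key, Object value) {
        return events.stream().filter(e -> value.equals(e.get(key))).count();
    }

    // Nearest-rank percentile over a numeric attribute; events without it are skipped.
    static double percentile(List<Map<String, Object>> events, String key, double p) {
        List<Double> values = new ArrayList<>();
        for (Map<String, Object> e : events) {
            Object v = e.get(key);
            if (v instanceof Number) {
                values.add(((Number) v).doubleValue());
            }
        }
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no numeric values for " + key);
        }
        Collections.sort(values);
        int rank = (int) Math.ceil(p / 100.0 * values.size());
        return values.get(Math.max(rank, 1) - 1);
    }

    public static void main(String[] args) {
        List<Map<String, Object>> events = List.of(
            Map.<String, Object>of("status", 200, "duration_ms", 12, "OrgID", "org-1"),
            Map.<String, Object>of("status", 500, "duration_ms", 950, "OrgID", "org-2"),
            Map.<String, Object>of("status", 200, "duration_ms", 40, "OrgID", "org-1"));
        System.out.println("errors: " + countWhere(events, "status", 500));
        System.out.println("p95 duration_ms: " + percentile(events, "duration_ms", 95));
    }
}
```

Because the count and the percentile are computed from the same events, adding another attribute to an event does not require defining or storing another metric.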
With efficient and attribute-rich event handling, Honeycomb focuses on making observability easy to adopt and scale. Our simple pricing model ensures that any system complexity is addressed by unlimited event attributes, avoiding additional costs that come with other vendors. In a quick comparison to a market-leading APM suite such as Datadog, you’ll see Honeycomb’s focus on pricing by event rather than system complexity (pods, nodes, services, hosts, etc.) keeps costs down:
Market-leading APM suite (priced by system complexity):
- $ per tracked service
- $ per tracked host
- $ per tracked ID (user, audit, and API call IDs)
- $ per tracked container
- $ per tracked pod
- $ per tracked node
- Separate bill for ingested vs. indexed records
- High watermark overage billing
- Sampling creates gaps in trace visibility; custom metric billing limits attributes from analysis

Honeycomb (priced by event):
- Price per event, subject to volume discount
- Burst protection and throttling until contract adjustment
- Rule-based, trace-aware sampling without dropping any system attribute, keeping all events with incident-relevant info
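To illustrate what rule-based, trace-aware sampling means in the comparison above, here is a deliberately simplified sketch. It is not Honeycomb's implementation: the rules, the threshold parameter, and the use of `String.hashCode` for the sampling decision are all assumptions made for the example. The one property it is meant to show is that the decision is made per trace, so every span in a trace is kept or dropped together, and traces with incident-relevant information are always kept.

```java
// Sketch only: a toy trace-aware sampler with two hypothetical rules.
class TraceAwareSamplerSketch {

    static boolean keepTrace(String traceId, boolean hasError, long durationMs,
                             long slowThresholdMs, int sampleRate) {
        // Rule 1: always keep traces that carry incident-relevant information.
        if (hasError || durationMs >= slowThresholdMs) {
            return true;
        }
        // Rule 2: keep roughly 1 in sampleRate of the remaining healthy traces.
        if (sampleRate <= 1) {
            return true;
        }
        // Deterministic per trace ID, so all spans of a trace get the same decision.
        return Math.floorMod(traceId.hashCode(), sampleRate) == 0;
    }

    public static void main(String[] args) {
        int kept = 0;
        for (int i = 0; i < 10_000; i++) {
            if (keepTrace("trace-" + i, false, 20, 1_000, 10)) {
                kept++;
            }
        }
        System.out.println("kept " + kept + " of 10000 healthy traces");
        System.out.println("error trace kept: " + keepTrace("trace-7", true, 20, 1_000, 10));
    }
}
```

A real sampler would also record the sample rate on each kept event so that counts can be re-weighted at query time; that bookkeeping is omitted here.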
The result: a truly cost-effective and flexible platform that takes your observability program beyond the constraints of traditional APM platforms. The approach emphasizes attribute-rich data with no limitation on complexity, using a custom datastore and parallelized query engine for efficient processing. It aligns observability costs with actual application events, so your program can scale without crushing your budget.
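The difference between the two pricing models can be sketched with a little arithmetic. Every price and quantity below is hypothetical and chosen only to make the shapes of the two models visible; none of them is a real list price from any vendor.

```java
// Sketch only: compares a per-unit bill with a per-event bill.
// All prices are made-up placeholders, not vendor pricing.
class PricingModelSketch {

    // Per-unit model: the bill grows with tracked infrastructure and stored metrics.
    static double perUnitCost(int hosts, int containers, long customMetrics,
                              double perHost, double perContainer, double perMetric) {
        return hosts * perHost + containers * perContainer + customMetrics * perMetric;
    }

    // Per-event model: the bill grows with the number of events sent.
    static double perEventCost(long events, double perMillionEvents) {
        return events / 1_000_000.0 * perMillionEvents;
    }

    public static void main(String[] args) {
        double perHost = 15.0;
        double perContainer = 1.0;
        double perMetric = 0.05;
        double perMillionEvents = 1.50;
        long monthlyEvents = 500_000_000L;

        // Same traffic, but the workload is split across twice as many containers,
        // each emitting its own tagged metrics.
        System.out.println("per-unit, before split: "
            + perUnitCost(10, 100, 2_000, perHost, perContainer, perMetric));
        System.out.println("per-unit, after split:  "
            + perUnitCost(10, 200, 4_000, perHost, perContainer, perMetric));
        System.out.println("per-event, either way:  "
            + perEventCost(monthlyEvents, perMillionEvents));
    }
}
```

Under these assumptions, re-architecting the same workload into more containers raises the per-unit bill while leaving the per-event bill unchanged, because the number of application events did not change.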
Implementing cost efficient observability: practical examples
Step 1: Finding target areas for efficiency gains with modern observability
Getting true control of your observability spend begins with evaluating your system's fit for a new approach and where a change will have the most impact. Key indicators that call for a modern approach to observability include:
- Multiple intercommunicating system components combining on a request.
- Components with heavy adoption of cloud-native technologies and containerization.
- High consumption of custom app metrics for production software insights.
For example, a new data platform customer of ours recently reviewed these factors and found multiple boxes checked, along with a substantial APM custom metric bill. In the 60 days since they started talking to us, they’ve found the right starting points for their journey towards modern observability and have already reduced their spend.
Step 2: Embedding OpenTelemetry and cultivating developer buy-in
Next is engaging your developer teams in adopting a new, more efficient instrumentation protocol. This involves transitioning from traditional APM metric-storing to more scalable solutions like OpenTelemetry with Honeycomb. This takes a structured approach.
The philosophy of throwing a proprietary agent at your services, saving what you think you’ll need, adding a bucket to grab logs, and piecing everything together in post during a production incident is out of date, time-consuming, and cost-inefficient.
What developer buy-in looks like:
- Software teams adopt an open and sustainable instrumentation package like OpenTelemetry, and build the muscle memory of adding custom attributes in-line with the code they’re shipping whether in staging or prod.
- Instead of planning to tag and store custom metrics as incidents occur, developers collect code and infra attributes as free context on trace events.
Custom instrumentation is muscle memory: Transition from tagging new custom metrics or printing additional log records for your APM suite, to embedding custom attribute tracking directly into OpenTelemetry. This practice becomes standard for both production and staging code.
What used to involve custom tagging and metric creation in Datadog now becomes a code-level practice like this:
    import io.opentelemetry.api.trace.Span;

    // Get the current span
    Span span = Span.current();
    // Add a custom attribute to the span (orgID is defined elsewhere in your code)
    span.setAttribute("OrgID", orgID);
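Building on the snippet above, teams often want to attach several pieces of request context at once. The helper below is a hypothetical sketch, not part of OpenTelemetry: it takes the attribute setter as a `BiConsumer` so the example runs without the OpenTelemetry dependency. In real code, the sink could be a lambda such as `(k, v) -> span.setAttribute(k, v)`. The attribute names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch only: copies request context onto a span-like sink.
// RequestContextSketch and annotate are hypothetical names, not OpenTelemetry APIs.
class RequestContextSketch {

    // Writes every non-null context value through the given setter
    // and returns how many attributes were written.
    static int annotate(Map<String, String> context, BiConsumer<String, String> setAttribute) {
        int written = 0;
        for (Map.Entry<String, String> entry : context.entrySet()) {
            if (entry.getValue() != null) {
                setAttribute.accept(entry.getKey(), entry.getValue());
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        Map<String, String> context = new LinkedHashMap<>();
        context.put("OrgID", "org-42");
        context.put("plan", "enterprise");
        context.put("region", "eu-west-1");

        // Stand-in for a span: print each attribute instead of recording it.
        int written = annotate(context, (k, v) -> System.out.println(k + "=" + v));
        System.out.println(written + " attributes written");
    }
}
```

Keeping the helper independent of the tracing library also makes it easy to unit test the context a service attaches, which supports the habit of adding attributes alongside the code being shipped.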
For example, a premier online commerce platform that moved from centralized APM agents to OpenTelemetry in 2023 is recognizing the need for more adoption training and engaging Honeycomb as a partner. More than 40 development teams that have switched to OpenTelemetry are training with observability experts to learn how to use attribute querying to get system insights they never could before.
Step 3: Revamping production investigation to decrease reliance on pre-stored metrics
This step marks a pivotal shift in how developer and ops teams view observability data. Instead of pulling up saved metrics and the dashboards associated with them, developer teams operating production code can view any attribute on their trace events as a metric count, heatmap, group-by, or much more complex analysis processed at query time.
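A small sketch can make "processed at query time" more tangible. The code below groups events by an arbitrary attribute and buckets a numeric attribute into the kind of histogram a heatmap column is drawn from. It is an in-memory toy under stated assumptions, not Honeycomb's query engine, and the attribute names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: group-by and bucketing over raw events, with no pre-stored metric.
class QueryTimeGroupBySketch {

    // COUNT grouped by any attribute; events without it fall under "(missing)".
    static Map<String, Long> countBy(List<Map<String, Object>> events, String key) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map<String, Object> e : events) {
            String group = String.valueOf(e.getOrDefault(key, "(missing)"));
            counts.merge(group, 1L, Long::sum);
        }
        return counts;
    }

    // Fixed-width buckets over a numeric attribute, keyed by bucket start.
    static Map<Long, Long> histogram(List<Map<String, Object>> events, String key, long bucketWidth) {
        Map<Long, Long> buckets = new TreeMap<>();
        for (Map<String, Object> e : events) {
            Object v = e.get(key);
            if (v instanceof Number) {
                long start = Math.floorDiv(((Number) v).longValue(), bucketWidth) * bucketWidth;
                buckets.merge(start, 1L, Long::sum);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> events = List.of(
            Map.<String, Object>of("OrgID", "org-1", "duration_ms", 35),
            Map.<String, Object>of("OrgID", "org-1", "duration_ms", 120),
            Map.<String, Object>of("OrgID", "org-2", "duration_ms", 130));
        System.out.println(countBy(events, "OrgID"));
        System.out.println(histogram(events, "duration_ms", 100));
    }
}
```

The attribute passed to `countBy` is chosen when the question is asked, not when the data is collected, which is the practical difference from a saved metric and its dashboard.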
The focus changes from monitoring numerous preconfigured dashboards to prioritizing a few high-value business SLAs tied to user satisfaction. If something that matters goes wrong, the data you need will be a click away, constructed at query time. With a modern observability practice, you focus on the key results, where they’re failing, and instantly see real-time anomaly detection and analysis within context-rich events.
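One way to picture a high-value, user-facing target is as a ratio of "good" events to all events, computed from the same trace events used for debugging. The sketch below assumes a hypothetical definition of good (no error and faster than a latency threshold) and hypothetical attribute names (`error`, `duration_ms`). It illustrates the idea only and is not how any particular product evaluates its targets.

```java
import java.util.List;
import java.util.Map;

// Sketch only: computes the share of events that meet a latency-and-success target.
class ServiceLevelSketch {

    // Fraction of events that did not fail and finished within maxDurationMs.
    // An empty window is treated as fully compliant.
    static double compliance(List<Map<String, Object>> events, long maxDurationMs) {
        if (events.isEmpty()) {
            return 1.0;
        }
        long good = 0;
        for (Map<String, Object> e : events) {
            boolean failed = Boolean.TRUE.equals(e.get("error"));
            Object d = e.get("duration_ms");
            boolean fast = d instanceof Number && ((Number) d).longValue() <= maxDurationMs;
            if (!failed && fast) {
                good++;
            }
        }
        return (double) good / events.size();
    }

    static boolean meetsTarget(double compliance, double target) {
        return compliance >= target;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> events = List.of(
            Map.<String, Object>of("error", false, "duration_ms", 100),
            Map.<String, Object>of("error", true, "duration_ms", 100),
            Map.<String, Object>of("error", false, "duration_ms", 900),
            Map.<String, Object>of("error", false, "duration_ms", 200));
        double c = compliance(events, 500);
        System.out.println("compliance: " + c + ", meets 99% target: " + meetsTarget(c, 0.99));
    }
}
```

When the ratio drops, the events that failed the check are the same events that carry the attributes needed to investigate, so there is no separate data source to consult.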
This shift makes developer teams faster and frees them from worrying about more overhead while shipping new code. Notable examples include CCP Games, Intercom, HelloFresh, Vanguard, and Slack, where hundreds of engineers have experienced a positive transformation in both the richness of their instrumentation and their problem-solving efficiency.
End result: leveraging financial gains for a better user experience
Finally, the savings from this shift in your APM bill can be reinvested into a virtuous cycle of improving customer experiences. Rather than allocating a significant portion of any incremental cloud budget to observability, your resources can be redirected towards product enhancements. As a result, observability expenditures align more closely with the value your application produces.
A striking real-world example that comes to mind is a leading compliance technology provider that reduced observability costs from 5% of total revenue to less than 1% by adopting modern observability practices with Honeycomb and attribute-rich events.
By cutting out billions of stored custom metrics tied to high-cardinality infra and user context, switching them to OpenTelemetry attributes, and relying on Honeycomb to surface this data at the right time and place, the customer is cutting seven figures off their observability bill, relieving budget pressure and freeing funds to put toward a better product experience.
The transition from traditional metrics-collecting APMs to modern solutions like Honeycomb represents a strategic shift towards more scalable, cost-effective observability. This shift ensures that growing investments in observability tie directly to business growth and innovation, rather than just managing escalating costs of complexity.
How can you proactively and practically overcome the cost-visibility tradeoff today? Getting started with Honeycomb and OpenTelemetry is free, of course—but you can also get an assessment of your APM bill to see if it would be worth it for your organization. Reach out through our Slack community or our website and begin your journey towards efficient and scalable observability today!