Real User Monitoring With a Splash of OpenTelemetry

You’re probably familiar with the concept of real user monitoring (RUM) and how it’s used to monitor websites or mobile applications. If not, here’s the short version: RUM requires telemetry data, which is generated by an SDK that you import into your web or mobile application. These SDKs hook into the JS runtime, the browser itself, or various system APIs in order to measure performance. They’re usually heavily optimized for both speed and size—you don’t want the dependency that tells you how fast or slow your application is to be the thing slowing it down, after all.
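To make that concrete, here’s a rough sketch of the kind of hook such an SDK installs. This isn’t any vendor’s actual SDK, just an illustration built on the standard browser PerformanceObserver API; the collector endpoint is a made-up placeholder.

```typescript
// Sketch only: not a real RUM SDK, just the standard PerformanceObserver API
// that these SDKs typically build on. The endpoint below is hypothetical.
const RUM_ENDPOINT = 'https://collector.example.com/rum';

function ship(entries: PerformanceEntry[]): void {
  // Serialize measurements and send them without blocking the main thread.
  const payload = JSON.stringify(entries.map((entry) => entry.toJSON()));
  navigator.sendBeacon(RUM_ENDPOINT, payload);
}

// Watch page navigations, resource loads, and long tasks as they occur.
const observer = new PerformanceObserver((list) => ship(list.getEntries()));
observer.observe({ type: 'navigation', buffered: true });
observer.observe({ type: 'resource', buffered: true });
observer.observe({ type: 'longtask', buffered: true });
```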

There are a lot of details that I’m eliding here, of course. Optimizing these clients, and making tradeoffs between performance and ease of use, is not easy. This, I think, is why RUM has set itself apart from other observability workflows for a while. Yeah, it has different users—frontend engineers, product and business analysts, and so forth—and the questions you might want to ask of a RUM tool are slightly different because there are more standards and patterns; think of Core Web Vitals, or conversion funnels. There are also some pretty specialized visualizations that people like to use, like session recording (where you literally watch what someone was doing on a page), or geolocation visualizations. These alternate projections, data types, and query semantics require different setups on the backend as well.

While the telemetry data needed for RUM is, in general, a commodity, RUM tools haven’t become one, and I think that’s due to this specialization. What if we could rethink how RUM works from first principles? I think we have an opportunity to do that with OpenTelemetry. Here’s how.

Real user monitoring is mostly a data analytics problem

The telemetry data that you use for RUM is, by and large, the exact same kind of telemetry data you get from backend systems. Metrics, and other structured data like logs and spans, are the majority of what you rely on. Core Web Vitals (CWV) make up most of what you care about, because those directly impact your search rank and user experience. But what’s in one of these measurements anyway? Let’s take First Contentful Paint (FCP), or ‘how long until the first bit of content appears on screen.’ This is an aggregate measurement of many other factors, such as how long it takes for the HTML to be parsed and rendered, how long images or CSS files take to load, and so on.
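As a quick illustration, here’s a sketch of how you can read FCP and the resources that likely fed into it with nothing but standard browser timing APIs. This isn’t how any particular RUM product computes it; it’s just the raw material they all build on.

```typescript
// Sketch: read First Contentful Paint and the resources that started loading
// before it (CSS, fonts, blocking scripts) using standard browser timing APIs.
new PerformanceObserver((list) => {
  const fcp = list.getEntriesByName('first-contentful-paint')[0];
  if (!fcp) return;

  // Anything that started loading before FCP is a candidate contributor to it.
  const contributors = performance
    .getEntriesByType('resource')
    .filter((resource) => resource.startTime < fcp.startTime)
    .map((resource) => ({ name: resource.name, duration: resource.duration }));

  console.log(`FCP: ${fcp.startTime.toFixed(0)}ms`, contributors);
}).observe({ type: 'paint', buffered: true });
```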

The problem is that while you might get this data, turning it into insights is actually pretty tricky. You need to analyze the structure of a page or view, see how much various asynchronous tasks contribute to the overall FCP, and so forth. The data itself isn’t the problem; the analysis is. There are a lot of great ways to analyze individual page or app performance, but performing that analysis in near-real time across the globe and correlating these measurements with other metadata is a huge task.

This problem extends beyond the web, though. The rapid feedback loops that observability preaches for developers don’t work so well when it can take weeks or months for users to get your new functionality due to the vagaries of mobile app stores. You might have hundreds, thousands, or millions of combinations of devices, OS versions, and app versions to work through—not just for finding interesting and novel problems, but simply for judging your adherence to an existing Service Level Objective (SLO).

In truth, regardless of where your users are coming from or how they’re accessing your application, making sense of what’s happening in production and being able to relate it to your overall business goals is, I believe, the most important application of observability. To that end, we should stop thinking of RUM as something that requires disconnected and specialized tooling to solve, but instead treat it like any other big data problem.

Real user monitoring needs open standards to thrive

The problem with this approach is that, by and large, the kind of telemetry that we need to understand client applications has been locked up in proprietary formats, generated by proprietary tools. This has been great for people selling frontend monitoring and RUM tools, but pretty bad for frontend engineers, imo. If you’re going to warehouse your telemetry, then you need to control how that data is generated, how it’s expressed, its metadata, and so forth. Sure, you could write parsers and converters, but what guarantee do you have that the format remains stable? 

Beyond the data itself, you need a consistent way to create telemetry—not for your own code, necessarily, but for your frameworks and dependencies. Rather than relying on log parsing and error catching, having an open standard that works across web and mobile for creating, exporting, and collecting telemetry data would help everyone.

Until recently, this was a problem that plagued backend development as well. OpenTelemetry was created to solve exactly these sorts of problems: it’s a consistent, clear, and well-documented set of APIs, SDKs, and tools that unify instrumentation and telemetry data across multiple languages, runtimes, and cloud providers.
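On the web, getting started looks roughly like the sketch below: a tracer provider, an OTLP exporter pointed at whatever backend you choose, and auto-instrumentation for document load and fetch. Exact package names and setup details shift between SDK versions, and the collector URL here is a placeholder.

```typescript
// Sketch of wiring up OpenTelemetry tracing in a web app. Setup details vary
// across SDK versions; the collector URL is a placeholder.
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { DocumentLoadInstrumentation } from '@opentelemetry/instrumentation-document-load';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';

// Batch spans and ship them to any OTLP-compatible backend.
const provider = new WebTracerProvider({
  spanProcessors: [
    new BatchSpanProcessor(
      new OTLPTraceExporter({ url: 'https://collector.example.com/v1/traces' })
    ),
  ],
});
provider.register();

// Auto-instrument document load timing and fetch() calls, so page loads and
// API requests show up as spans without hand-written instrumentation.
registerInstrumentations({
  instrumentations: [new DocumentLoadInstrumentation(), new FetchInstrumentation()],
});
```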

As a matter of fact, one of the biggest requests from OpenTelemetry users for the past several years has been bringing the project to the frontend and client space. We’re closer than ever to that happening, and I believe it’s going to be transformative to this space. OpenTelemetry’s composable design means it can easily support alternative, lightweight SDKs for different deployment and runtime scenarios. Its conventions-based approach to metadata reduces confusion and allows for consistency of attributes across different browsers or mobile devices. Finally, its open governance and vendor agnosticism means you don’t get locked in to a specific vendor or tool—you can pick and choose what’s best for you.
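For example, here’s a small sketch of what conventions-based metadata can look like in practice: a hand-created span carrying attributes whose names borrow from OpenTelemetry’s browser.* semantic conventions (normally attached as resource attributes), plus a made-up app-specific attribute. The span name and the checkout attribute are illustrative, not something the spec prescribes.

```typescript
import { trace } from '@opentelemetry/api';

// Illustrative only: browser.* names follow OpenTelemetry's semantic
// conventions; 'checkout.submit' and 'app.checkout.step' are made up.
const tracer = trace.getTracer('checkout-ui');

const span = tracer.startSpan('checkout.submit', {
  attributes: {
    'browser.language': navigator.language,
    'browser.mobile': /Mobi/i.test(navigator.userAgent),
    'app.checkout.step': 'payment', // hypothetical app-specific attribute
  },
});
// ... the work being measured happens here ...
span.end();
```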

Observability 2.0 and real user monitoring

If we learn one thing from the cost crisis in observability, it should be this: Our goal as performance engineers and observability practitioners is to help break down the silos between ‘performance data’ and ‘business data’ in our organizations. If observability is to be truly valuable to our organizations, we have to make the argument for it from the perspective of the business. 

It’s not enough to have green Lighthouse scores or great Apdex numbers. We need to be able to ask how reliability directly impacts the bottom line. We need to know where the risk lives in our systems, understand how changes to our applications impact users along dozens or hundreds of dimensions, and tie all that back into our funnels. It’s not enough to rely on synthetic measurements of performance or highly sampled real user data. We need accurate, real-time measurements that highlight critical issues so we can respond to incidents, along with rich structured data that can be analyzed over months so we can manage and understand change.

The goal of all of this isn’t just to make better dashboards or reduce the number of alerts you get (although both of those can happen!). It’s to resolve the tension between the cost and the value of observability. It’s about breaking down the silos between engineering and the rest of the organization, and being able not only to demonstrate the value of performance and reliability, but to express that value in dollars and cents.


Interested in learning more?
Read my guide to OpenTelemetry and observability.


Austin Parker

Director, Open Source
