Conference Talk

Building Observability Metrics With OpenTelemetry

June 10, 2021

 

Transcript

Alolita Sharma [Principal Technologist|Amazon Web Services]:

Hi, everyone, I’m Alolita Sharma, from Amazon Web Services. I’m super happy to be talking with you today at Honeycomb. By background, I’m focused on open-source observability at Amazon, and I will talk to you about all the cool work we are doing building observability metrics in OpenTelemetry.

For many of you who are already working in the observability space, there is a difference between monitoring, as you know of it, and observability. Monitoring, as you know, tells you whether the system is working or not. Observability is smarter. It lets you ask why your systems aren’t working. Observability also requires instrumentation and querying, and that’s one of the areas that we are going to dive into with OpenTelemetry today.

So what is OpenTelemetry? For most of you who are already working in the OpenTelemetry or the open-source observability space, OpenTelemetry is a very popular CNCF project that, you know, everybody who is involved in the APM space, the observability space and monitoring frameworks is involved in. There are more than 250 organizations actively contributing to the project, as well as thousands of engineers who have contributed with different components and code.

OpenTelemetry provides, as all of you know, an open-source observability agent, which is the collector, as well as the instrumentation libraries and a standard data protocol, called the OpenTelemetry Protocol (OTLP), that unifies monitoring, managing, and debugging applications and services.

Interestingly enough, OpenTelemetry has been so popular that it has the mandate to support all three data signals in telemetry, which are traces, metrics, and logs, and 11 languages, with some of the most popular languages like Java, Go or JavaScript being supported and, at the same time, cool languages like Erlang, Rust or Swift having communities actively working on building instrumentation libraries.

3:05

So why OpenTelemetry? Why do we care? Right, why do customers care? And that’s one of the areas that, you know, customers have been asking for as they have built out observability or monitoring solutions over time. They would like to see a single collection agent that supports all types of telemetry data for them as they instrument their applications, as well as their infrastructure, and gather traces or metrics or logs. They would like to see a single agent, which is open source, interoperable across the OpenTelemetry standard using a standard data protocol, providing a choice of different sources to ingest data from as well as different destinations for monitoring that they may be using, as well as the ability to instrument once and deploy everywhere.

It’s super important that, from a cost standpoint, customers have this choice available, in a vendor-neutral way, which is open source.

OpenTelemetry meets those value propositions very clearly, and hence customers have been really, really involved in being able to try out the technology, try out the collector, try out the instrumentation libraries, and continue to build out and instrument with it.

So a little bit about me. I lead open-source observability strategy and development at AWS. I’m also deeply involved in the open source, in the OpenTelemetry project, I’m a member of the governance committee of the project, as well as I lead some of the key initiatives of the project, such as the OpenTelemetry Prometheus metrics interoperability workgroup. I’ve contributed to several OpenTelemetry Prometheus components and made improvements across different libraries.

My team also built the Prometheus remote write exporter which is super useful to many customers who are using the collector and are interested in sending their metrics to a Prometheus service. And I also lead the open-source distribution for the AWS Distro for OpenTelemetry, which is a downstream distribution available from AWS for OpenTelemetry.

So needless to say, it is really interesting at this point in time to work on OpenTelemetry and building out metric support and building out the library support as well as the collector support for it.

5:59

Now, why does metric support need to be added to OpenTelemetry?

As many of you know, OpenTelemetry is the result of two very popular open-source projects combining together. OpenCensus, which was created and initially founded by Google and had a lot of metric support built-in, was one of the projects, and OpenTracing, which was created by Uber and other members, was the other project that was combined to become OpenTelemetry.

And as we start to build metric support, we inherit a lot of the underlying code as well as the implementation methods that OpenCensus had for metrics. So we carry that over. But yet, as we look at metric support natively, out of the box, in OpenTelemetry, we want to have feature parity, as well as swap out some of the transformations that we have had OpenCensus support for and make it native to OpenTelemetry itself.

In addition to that, using correlation context, tags, resources, those frameworks, supporting more signals means supporting shared context for customers, and providing those value props from the project to customers is a big deal.

The other aspect that I want to talk about, as we are building out metric support, this initiative, we started out in the project earlier this year. We had a big workshop, a full-day workshop, looking at what the requirements are that we needed to support, and what we needed to build out to enable full metric support in OpenTelemetry. And that meant changing four specific areas. 

One was, first of all, evaluating and understanding what data model changes needed to be made, as well as what OTLP protocol changes needed to be made. This includes instrumentation updates. This also means ensuring that the data model works for Prometheus and statsd. And it means that it can support Prometheus and statsd out of the box. Similarly, API changes, defining and building an easy-to-use API for each metric type that is supported by OpenTelemetry. And allowing API owners to easily update with stability assumptions guaranteed.

On the SDK side, defining how metric types work in the SDK, enhancing the SDK if more metric types need to be supported, or adding new metric types and ensuring that they are fully supported, is important. And enabling the SDK to leverage new metrics along with the API is also a requirement.
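The split between API and SDK described above can be sketched as a toy model. This is not the real OpenTelemetry API: the class names `Counter`, `NoOpCounter` and `SdkCounter`, and the function `handle_request`, are hypothetical and exist only to show why instrumented code depends on an interface while the SDK decides how a metric type is actually recorded.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Counter(ABC):
    """API-level contract: instrumented code only ever sees this interface."""

    @abstractmethod
    def add(self, amount, attributes=None):
        ...


class NoOpCounter(Counter):
    """Default when no SDK is installed: calls are accepted and discarded."""

    def add(self, amount, attributes=None):
        pass


class SdkCounter(Counter):
    """Hypothetical SDK implementation: a running sum per attribute set."""

    def __init__(self, name):
        self.name = name
        self.sums = defaultdict(float)

    def add(self, amount, attributes=None):
        if amount < 0:
            raise ValueError("a counter only goes up")
        # Sort the attributes so the same set always maps to the same key.
        key = tuple(sorted((attributes or {}).items()))
        self.sums[key] += amount


def handle_request(counter, route):
    """Instrumented application code: depends on the API type only."""
    counter.add(1, {"route": route})


# Without an SDK, instrumentation still runs and records nothing.
noop = NoOpCounter()
handle_request(noop, "/cart")

# With an SDK, the same application code produces aggregated sums.
sdk = SdkCounter("http.requests")
handle_request(sdk, "/cart")
handle_request(sdk, "/cart")
handle_request(sdk, "/checkout")
```

In this sketch the application function never changes when the SDK is swapped in, which is the property that lets API owners update independently of SDK work.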

In the collector itself, which, as I had shown earlier, is the out-of-process pipeline for supporting metrics, support for OTLP data ingestion and aggregation is a requirement, and that’s one of the areas that we are starting to look at. Ensuring Prometheus and statsd support is another. We are currently working on the Prometheus interoperability requirements and building that out in the collector. And also enabling push and pull-based implementation is a key requirement.
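The difference between push-based and pull-based ingestion mentioned above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the collector's actual design: `ToyCollector` and its methods are hypothetical, pushed points are assumed to be deltas that get summed, and scraped points are assumed to be cumulative readings where the latest value wins.

```python
from collections import defaultdict


class ToyCollector:
    """Hypothetical in-memory pipeline: receive, aggregate, export."""

    def __init__(self):
        self.scrape_targets = []          # pull: callables the collector polls
        self.pushed = defaultdict(float)  # push: summed deltas per metric name
        self.scraped = {}                 # pull: latest cumulative reading

    def receive_push(self, points):
        """Push model: the application sends (name, delta) points to us."""
        for name, delta in points:
            self.pushed[name] += delta

    def add_scrape_target(self, target):
        """Pull model: register a callable returning (name, cumulative) points."""
        self.scrape_targets.append(target)

    def scrape_once(self):
        """Pull model: the collector asks every target for its current values."""
        for target in self.scrape_targets:
            for name, cumulative in target():
                self.scraped[name] = cumulative

    def export(self):
        """Merge both views into one dict, standing in for a backend export."""
        merged = dict(self.scraped)
        merged.update(self.pushed)
        return merged


collector = ToyCollector()

# Push: the application decides when to send, and sends deltas.
collector.receive_push([("orders.created", 2), ("orders.created", 3)])

# Pull: the collector decides when to read, and reads a running total.
requests_seen = {"count": 10}
collector.add_scrape_target(lambda: [("http.requests", requests_seen["count"])])
collector.scrape_once()
requests_seen["count"] = 14
collector.scrape_once()

result = collector.export()
```

The point of the sketch is only that the two models differ in who initiates the transfer and in what a single reading means, so a pipeline that supports both has to track them separately before exporting.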

9:55

That said, again, how are we doing this, right? As you know, OpenTelemetry is a very large project, and there are many developers, many maintainers who are involved and are experts in having built large-scale telemetry systems before. So as we work together in the OpenTelemetry SIG meetings and the technical committees and the discussions on the repos and in the issues and the PRs themselves, we are actually working on multiple streams, because, again, there is huge anticipation that we will be rolling out metric support later this year. And for an open-source project to be organized and to collaboratively deliver standards, as well as implementations that work out of the box and are fully stable, we multitask and split out into workgroups within the metrics SIG.

For those of you who attend the metric SIGs regularly, again now instead of one metrics SIG meeting, we actually have multiple tracks ongoing at the same time. So sometimes — usually we have four to five meetings a week, you know where we discuss design considerations, review PRs which are in progress, as well as discuss issues and questions that contributors may have.

What we have done is split out into multiple streams, in workgroups within the metrics SIG. One of these workgroups is working on the metrics data model, which is now stable, with the requirement that the OTLP protocol also fully supports the metrics data types, along with some assumptions that we are making.

Secondly, as I was initially alluding to, Prometheus interoperability is a key goal of the project, and we have a Prometheus interoperability workgroup which I work on very actively as the second workgroup, where we have our own backlog, we have our own issues that we are tracking and the key compliance requirements that we have for Prometheus remote write specification that we are working on, and I will dive into that a bit more later.

Then we have the API and SDK stability discussions that are ongoing in a different workgroup, and what we are employing as a methodology is rapid iteration on discussion items, prototyping and implementing this in top languages as we build out the functionality.

So the good news is that API work is actually underway right now and actually has almost become stable now and SDK implementation towards stability is also in progress.

The fourth area, which is super important because Collector is a very, very popular component in the project, is to make sure that OTLP data ingestion and aggregation is fully supported, ensuring OTLP pull-based implementation works, as well as scaling out the Prometheus push-based implementation to ensure that both protocols are well supported.

And another area that we are, you know, diving into is ensuring that each metric instrument is evaluated, and we look at the data types and ensure that there is an owner or a maintainer who is responsible for collecting feedback, driving discussions, owning implementation completion, and coordinating and collaborating with different maintainers and contributors who are working on different parts of the project.

Needless to say in such a large project, there’s a fair bit of multitasking and also heavy amount of collaboration that is super exciting to be involved in, but also ongoing at a super-fast rate.

And if you are interested or involved in, you know, looking at what’s happened, you can easily catch up. The metric SIG meetings are all recorded. They are available online, on the OpenTelemetry YouTube channel. So you can go and check it out.

14:43

Moving on, I would like to dive in a little bit into the Prometheus interoperability work that we are doing, and this is super important because, as many of you know, Prometheus is a very popular metrics monitoring and alerting framework, especially for Kubernetes-based applications. For those of our customers who are instrumenting, monitoring and providing observability for other platforms, which are non-Kubernetes based, there is a need to consume or handle instrumentation metrics, both from applications as well as infrastructure, not only from Prometheus, but from other compute platforms too.

And, in the Prometheus workgroup specifically, what we are looking at is ensuring that the Prometheus protocol, especially for supporting all the data types, as well as the push-based protocol and pull-based protocol, that they are fully supported and the data model, especially, is fully interoperable. 

What does this interoperability mean? First of all, that the OpenTelemetry metrics data model changes ensure Prometheus compatibility. Similarly, that all metrics data types, including counter, gauge, summary and histogram, are fully supported. That Prometheus components that are written for the collector, or the instrumentation libraries in different languages, meet the remote write specification and pass the remote write compliance tests. And that the discovery and scrape configuration support in the Prometheus receiver is also fully functional and supports the configuration needs that our customers are asking for. Similarly, that StatefulSet support is available in the OpenTelemetry operator to support Kubernetes workloads and applications in that space.
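To make the metric data types above concrete, here is a minimal sketch that renders a counter, a gauge and a histogram in the Prometheus text exposition format, which is what a pull-based scrape reads. The helper names (`format_labels`, `render_simple`, `render_histogram`) and the metric names are hypothetical, and the sketch skips details such as `# HELP` lines, label escaping and the summary type.

```python
def format_labels(labels):
    """Render {'a': 'b'} as {a="b"}; no labels renders as an empty string."""
    if not labels:
        return ""
    pairs = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + pairs + "}"


def render_simple(name, kind, value, labels=None):
    """Counter or gauge: one TYPE line followed by one sample line."""
    return [f"# TYPE {name} {kind}", f"{name}{format_labels(labels)} {value}"]


def render_histogram(name, bounds, observations):
    """Histogram: cumulative buckets, then the sum and count of observations."""
    lines = [f"# TYPE {name} histogram"]
    for bound in bounds:
        # Buckets are cumulative: each counts every observation <= its bound.
        in_bucket = sum(1 for obs in observations if obs <= bound)
        lines.append(f'{name}_bucket{{le="{bound}"}} {in_bucket}')
    # The +Inf bucket always equals the total number of observations.
    lines.append(f'{name}_bucket{{le="+Inf"}} {len(observations)}')
    lines.append(f"{name}_sum {sum(observations)}")
    lines.append(f"{name}_count {len(observations)}")
    return lines


lines = []
lines += render_simple("http_requests_total", "counter", 3, {"route": "/cart"})
lines += render_simple("queue_depth", "gauge", 7)
lines += render_histogram("request_seconds", [0.1, 0.5], [0.05, 0.2, 0.7])

exposition = "\n".join(lines)
print(exposition)
```

The cumulative bucket layout is one of the places where data models have to line up: a histogram that stores per-bucket counts has to be converted to running totals before it can be exposed this way.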

So lots of activity there. We have a full backlog for the Prometheus workgroup. If you are working in this space, please come and join us. We meet regularly on a weekly basis, and we have all of our issues tracking all the activity and the work that we are doing in order to extend this support.

17:24

Moving forward, as you can see, I’m just sharing the status updates that we maintain on the project website, on opentelemetry.io. For tracing, at this point, we are stable, finally, and the last part that we are working on is the collector going stable also, with full tracing support.

In parallel, the project is also working on metrics, as you can see. And that means that the API specification has finally reached stable, which just happened this week, and we are currently working on the SDK and the collector. The protocol, of course, along with the data model, has also gone stable. So we maintain this active update on the project website. You can go and check it out any time and it is pretty current.

In the last section, what I would like to kind of dive into is a little bit more about what AWS is doing in this whole process, and in participating in OpenTelemetry, and needless to say, we have been super excited about participating in the open-source observability space, especially on the OpenTelemetry project. One of the key areas, as I highlighted, many of our engineers are working in the Prometheus interoperability workgroup, as well as I lead the workgroup. So that’s been a really exciting area we are working on.

Similarly, we have enhanced metric support, and helped enhance and add support in the collector. We have contributed and written the Prometheus remote write exporter, CloudWatch metrics receiver and exporter, and statsd receiver and we have been working on several design proposals for some of the key components of the collector in order to be able to redesign and really be able to leverage a unified architecture for metrics processor.

In the language library areas, we have been super active in the C++ language library. We built out a metrics API and SDK as an initial experimental version last year. And then, of course, with the API and SDK stability work that’s ongoing right now, we hope to take that to stable implementation.

We have been adding and building out Prometheus exporters for language support including C++, Go, JavaScript, Python, and, again, been deeply involved in all aspects of the metrics support work that has been ongoing. Participating in code reviews and design reviews, and just excited to be working on different parts all together.

20:34

I wanted to talk a little bit about two key initiatives that AWS has been involved in, especially for extending and supporting customers and users of OpenTelemetry. And we rolled out an AWS Distro for OpenTelemetry, which is a downstream distribution of OpenTelemetry. We are deeply committed and contribute all source code for the distribution, you know, any of the components that are bundled in it to the OpenTelemetry project. All the source code that is available in the distribution is actually on the OpenTelemetry project. What we are doing is we are actually running security and integration testing, which AWS runs at production quality for its services. We also actually do the due diligence from AWS’s security guidelines to be able to support the distribution.

We also offer AWS support for all components for customers who require this. We are also building out one-click deployment and configuration for the AWS container consoles, as well as the AWS Lambda console, and several receivers and exporters for the AWS monitoring solutions, such as CloudWatch or Elasticsearch or the managed services, Prometheus or X-Ray, are available in the distribution. And these are all available upstream in the OpenTelemetry project also. We just bundle it downstream and make sure that this is fully tested and secure and performant for what we are delivering in the distribution.

And, last but not least, we are providing several integrations with partner solutions and partner service endpoints, including Honeycomb’s support with the OTLP Exporter and other AWS partners. So go check out the distribution website at aws-otel.github.io. There are a lot of technical implementation and configuration details that you can find there. You can also go and check out the source code, the security testing results, the integration testing results. Everything is open source. You can go and check that out at this site.

The other aspect that I wanted to talk about, which really is also a super cool feature that CloudWatch rolled out recently, is the metric streams feature. Metric streams can be used to stream CloudWatch metrics to a destination of your choice, and this is super exciting for customers because it promotes interoperability, which is a key mantra of OpenTelemetry also.

You can use Kinesis Data Firehose to stream any CloudWatch metrics, from any accounts or any regions, and push them to an Amazon S3 data lake or to a third-party destination. And that’s pretty powerful, because you can actually stream metrics into Honeycomb, for example.

And the formats that are supported, which is something that is super exciting from an OpenTelemetry standpoint, are OTLP compliant. From metric streams, you can dump the data out with OTLP 0.7 support or in JSON, depending on what your monitoring back end consumes.
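As a rough illustration of consuming the JSON output mentioned above, the sketch below reads newline-delimited JSON records and computes an average per metric. The record layout, field names (`namespace`, `metric_name`, `value.sum`, `value.count`) and sample values are assumptions made for this example, so check the CloudWatch metric streams documentation for the actual schema before relying on them.

```python
import json

# Hypothetical sample: one JSON object per line, as a stream might deliver it.
SAMPLE_RECORDS = "\n".join(
    json.dumps(record)
    for record in [
        {"namespace": "AWS/EC2", "metric_name": "CPUUtilization",
         "value": {"sum": 90.0, "count": 3.0}},
        {"namespace": "AWS/EC2", "metric_name": "CPUUtilization",
         "value": {"sum": 30.0, "count": 1.0}},
        {"namespace": "AWS/EC2", "metric_name": "DiskWriteOps",
         "value": {"sum": 10.0, "count": 2.0}},
    ]
)


def average_per_metric(ndjson_text):
    """Accumulate sum and count per (namespace, metric), then divide."""
    totals = {}
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        key = (record["namespace"], record["metric_name"])
        running_sum, running_count = totals.get(key, (0.0, 0.0))
        totals[key] = (
            running_sum + record["value"]["sum"],
            running_count + record["value"]["count"],
        )
    # Skip metrics with a zero count to avoid dividing by zero.
    return {
        key: total / count
        for key, (total, count) in totals.items()
        if count
    }


averages = average_per_metric(SAMPLE_RECORDS)
```

Carrying sum and count rather than a precomputed average is what makes records from different batches safe to combine, which matters when a stream delivers the same metric in several chunks.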

So go check it out. The documentation can be found on AWS’s website. And I wanted to share some useful documentation and for further reading links. Please go and check out the CloudWatch metric streams documentation.

25:02

I have shared the links. Please feel free to take a look at the slides or take a snapshot to get these URLs. The AWS Distro for OpenTelemetry documentation, all the technical documentation, is available at aws-otel.github.io, along with OTLP support for ADOT, which is how Honeycomb also connects and uses the OTel collector through ADOT to connect into the Honeycomb backend services. These URLs are available right here. And, last but not least, check out the Observability Engineering book that is being written by Liz Fong-Jones and Charity Majors and other members of the Honeycomb team, who are in the process of releasing it, published by O’Reilly. I think you can go to this URL and go and check it out.

Super excited. Again, I’m looking forward to reading it. Again, you know where to find us, but with that, I would like to say thanks. Super excited to be at Honeycomb, and if you have any questions about OpenTelemetry metrics work, or about AWS Distro for OpenTelemetry, feel free to reach out to me. I’m available at Twitter @alolita and GitHub @alolita and I shared my contact at Amazon earlier.

Thanks again and I look forward to any questions you may have.

Yeesheen Yang [Product Manager|Honeycomb]: 

That was awesome! Thank you Alolita. It’s really exciting to see all of that. At Honeycomb we’re super excited about OTEL – seeing AWS as a contributor. What do you hope to see happen in the OpenTelemetry community in the next few years?

Alolita Sharma:

Thank you. I’m happy to be part of the OpenTelemetry project, working closely with so many great contributors, you know, maintainers and contributors from different companies as well as individual contributors. I’m very excited about the future of OpenTelemetry, and the future is bright, I think, and I’m looking forward to being part of the project for a long time, hopefully.

I think there are lots of exciting initiatives from a project standpoint in terms of areas we’d like to build out and accomplish.

Some of those areas are completing out and delivering a stable metrics implementation across all the language libraries as well as the collector. Also, the logging work that is in progress to support logging and have a full implementation end-to-end for supporting logs as the third pillar of observability is another exciting area that’s ongoing. 

28:25

Long-term, one of the things I do envision is auto-instrumentation, you know, making it easier for customers to be able to use OpenTelemetry components out of the box, be able to run with them and make it as easy and seamless as possible with the compute environments that OpenTelemetry runs on, whether that’s Linux or Windows or Mac or Kubernetes or the other container environments. That will continue to grow and improve. We have a lot of work to go there.

Also, we’re envisioning, and very excited to start thinking about in the next few months, building out a certification program for compatibility with OpenTelemetry, very similar to what the Kubernetes project has implemented. It’s essential to be able to provide and build trust with our customers and all users of OpenTelemetry, where they can, you know, have a very clear assessment of what they’re building with, what versions are available and stable, what they can run with in production, and really continuing to build deeper infrastructure to support that.

We have a really wonderful and awesome community and a very inclusive community. It’s one of the nicest communities, open-source communities I’ve worked with. I’ve been in open source a long time. It’s just exciting to see that energy. So I think we will continue to keep growing that and building that energy and also pulling in more, attracting more contributors to the project, building more plugins, extensions, more languages. So, needless to say, I’m looking forward to the next couple of years and more of the project.

Yeesheen Yang: 

That’s so exciting to hear. Thank you, Alolita, again.

Alolita Sharma:

Thank you. It’s been a pleasure.

If you see any typos in this text or have any questions, reach out to team@honeycomb.io.
