Conference Talk

Observability is More Fun With Friends: Stories From OpenTelemetry Collaboration

June 9, 2021

Transcript

George Miranda [Senior Director of Product Marketing|Honeycomb]: 

I love this panel. Coming from Open-Source roots, I know how hard that collaboration can be especially when you have the conflicting interests of commercial partners and end-users and maintainers. It can be rough. But observability is better with friends. And so I think we have a very delightful panel set up for you today.

Let’s go ahead and introduce our panelists. Amy Tobey is a principal engineer at Equinix Metal. Amy is currently bootstrapping an SRE team that owns observability, incident management, and SLO programs. While this panel is full of OTel contributors, Amy is here representing end-users. Andrew Hayworth is a staff engineer at GitHub where he spends his time building observability systems and contributing to the OTel project. Andrew enjoys many things, but he’s not entirely sure if computers are one of those things or not. Maybe we’ll find out in this session. Liz Fong-Jones is a Principal Developer Advocate here at Honeycomb. This is her second panel of the day, and I think that’s fitting. In addition to her advocacy work and co-authoring the Observability Engineering book, Liz is also a member of the governance committee for the OTel Project. So she is here as well. And our moderator today is Ted Young. Ted is the Director of Developer Education at Lightstep. He’s also a Co-founder of the OpenTelemetry Project. Thanks for moderating, Ted. I’ll hand it over to you now.

Ted Young [Director of Developer Education|Lightstep]: 

Hey, hey. Thanks for the intro, George. Let’s jump right into it. So collaboration in OSS in particular in the OpenTelemetry Project to me has been a really enjoyable experience. But I would love to hear from our panelists who’ve all been involved in the project. To sort of kick it off, I’d just love to know what experience, in particular, comes to mind collaborating within the OpenTelemetry community that really stands out to you. Amy, do you maybe want to start our round here?

Amy Tobey [Principal Engineer|Equinix Metal]: 

Sure, yeah. As you mentioned when you introduced me, I’m working on this observability project at work. And one of the stacks that we run is a Ruby stack. I started using OpenTelemetry Ruby early on this year and immediately ran into problems trying to do things like propagating traceparents over weird things. I’d put it in a JSON field, but I couldn’t get that going. So I went on the CNCF Slack, as somebody suggested to me, and ended up working with Andrew on it a little bit. I posted a little bit of example code and Andrew helped figure out what was going on and talked me through it. And that was just a really great experience. You know, being able to connect with an old friend and then also get some help from the community and move forward and figure some stuff out.
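
Carrying trace context inside a JSON field, as Amy describes, might look roughly like the sketch below. It is written in Python with only the standard library rather than with OpenTelemetry Ruby, and it is not the code Amy and Andrew worked on. It assumes the W3C Trace Context `traceparent` layout (version, trace ID, span ID, flags); the JSON key name and the helper names `new_traceparent`, `inject`, and `extract` are hypothetical choices made for illustration.

```python
import json
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-spanid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a version-00 traceparent with random trace and span IDs."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def inject(payload: dict, traceparent: str) -> str:
    """Producer side: serialize the payload with the trace context attached."""
    # Using "traceparent" as the JSON key is an assumption of this sketch.
    return json.dumps({**payload, "traceparent": traceparent})


def extract(message: str):
    """Consumer side: return (trace_id, parent_span_id), or None if absent or malformed."""
    value = json.loads(message).get("traceparent")
    if not isinstance(value, str):
        return None
    match = TRACEPARENT_RE.match(value)
    if match is None:
        return None
    trace_id, span_id, _flags = match.groups()
    return trace_id, span_id
```

A consumer that gets a tuple back from `extract` would start its own span as a child of that parent span ID within the same trace; a real service would hand that job to its OpenTelemetry SDK's propagator rather than parsing the string by hand.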

Ted Young: 

Awesome, awesome. Andrew, do you want to follow that up?

Andrew Hayworth [Staff Engineer – Observability|GitHub]: 

Yeah, no. It’s a natural segue, isn’t it? It’s one of my favorite experiences too. For starters, it was nice to feel helpful and actually say, oh, I know something. I can help you out with this. I’ve been here. And I’ve had that a few times with the OpenTelemetry Ruby community so far. That’s been really nice.

It’s hard to find a really specific experience that stands out to me, but I will say that the OTel community has felt very welcoming to me overall. And as a whole, it stands apart from other open-source projects that I’ve contributed to in the past. And I don’t really know what the secret sauce is. I don’t know exactly how it’s ended up this way, but I’ve never felt like I couldn’t be a part of this community. I felt like it was open and available. As long as I wanted to show up and help out, that was welcome. It’s the first open-source project that I’ve wanted to contribute to in a while and felt like I could.

Ted Young: 

Awesome. That really makes me happy as someone who put a lot of effort into the project early on to try to make sure it was structured in a way that was going to be welcoming, collaborative. I’m always happy to hear people enjoy that and they’re now sticking around as maintainers and then paying it forward to, like, other new members who are coming in. But I’m the moderator here. Liz, what about you? Do you have any specific memories here?

Liz Fong-Jones [Principal Developer Advocate|Honeycomb]: 

Yeah. I think two things. One, I wanted to riff off of what Andrew said and say that the CNCF ethos of chop wood, carry water really resonates with how we run things in OpenTelemetry where we’re all here not to be front and center. We are here to chop wood, carry water for our users. We are here to make our users’ jobs easier. It doesn’t matter that Ted and I work for companies that compete with each other. We’re here to help users.

But I think that wasn’t always the case. I think that kind of historical context of the OpenCensus and OpenTracing projects is kind of what led us to, you know, from the lessons learned that we needed to collaborate better. I remember, you know, four years ago Yana and I were at Google and we were like those people at OpenTracing have this so wrong. Like, wait a second. Why is this, right? Why are we not contributing and pitching in to help make it better? That’s where OTel came from. From let’s take the best parts of OpenCensus and OpenTracing and put them together to make them better for our users.

Ted Young: 

Yeah. I actually have a specific memory of that. Early on, it became clear it was just too big of a leap. Even though it made sense to a lot of people to merge the projects, just being able to go shazam, done, was just difficult. And we needed to find some kind of common ground. And that ended up being worked through the W3C on getting tracing headers established. This was something that was available to everyone, regardless of what project they’re working for. And it ended up being a place where I, some other leadership people from the OpenTracing side, and then some people who are running the OpenCensus project were all collaborating. And there’s this aspect of collaborating where once you get to know people as people and actually get to work with each other, it kind of lowers that barrier, just that human aspect to it. And I really feel like that collaboration experience was a key part of an enzyme that led to the projects merging. So I have fond memories of those days.

But one thing we’ve been talking about here that I think sometimes tries to get swept under the rug, but we’re talking about openly, which I think has been an important part of being welcoming, is actually recognizing that there are a bunch of companies involved. You know, there are various vendors trying to use this. There are the big dog infrastructure platforms like Google and Microsoft and Amazon. There are end-users who are running huge systems who really, really need this to work. And that can put a lot of pressure on an open-source community. And I’m wondering what people think in that realm. Like, what’s important to take into account when you’re building a community that’s obviously going to include those kinds of forces and pressures?

Liz Fong-Jones: 

It’s really interesting.   

Amy Tobey: 

The thing I see in the OTel community and a lot of the CNCF stuff is it’s the same thing we do as SREs, which is focus on the users. You can focus on your own company, but if you’re hurting your users by competing for undifferentiated heavy lifting, that’s bad for your users. The way I look at it, one of the big arguments I used to commit Equinix Metal to OpenTelemetry was we can make this decision now and we can defer the vendor decision until we have something to look at in there. And we’re not trying to guess in the dark. That was good for me as a user, and I think it was good for y’all as vendors too. We have this opportunity to now use the best tool for the job and not the one we’re stuck with because it’s littered all throughout the code.

Liz Fong-Jones:

There are so many people who have been burned with, we got stuck with one vendor’s library. It was a hindrance to the entire ecosystem we needed to fix for everyone. The other interesting thing here is Matthew S. Wilson is this amazing person who’s been in open-source communities forever. And I love his writing about the Apache way of doing it. The Apache Foundation has this attitude of, yes, your employer pays for you to be here. But you are representing yourself. You are representing yourself on behalf of users, not your company. Right? And we’ve, in fact, had people move between companies within the OTel ecosystem. And I think that’s fantastic. You know, the technical community belongs to the individual on the basis of what they’ve done, not the company.

Ted Young: 

I’d love to drill into that more. If you’re paid to work on open source, awesome. Because presumably, you’re geeking out about this stuff. That’s why you want to do it. But you do end up with what can feel like a foot in two different camps. You’re working for a company that has interests. You’re a personal member of this community and you want to promote that community. Do people have advice for how they navigate that? The boundaries you try to set? How do you make that work for you?

Andrew Hayworth: 

I don’t know that I’ve actually figured it out, frankly. It’s kind of rough. I try to go into it knowing that, while I work at GitHub now, I may not always work at GitHub. And I want my contributions to reflect who I am and the good work that I’ve done going forward. I don’t want it to be sort of only attached to GitHub’s legacy. I want it to be part of what I do and who I am. So I try to keep that in mind. I don’t think I have a really good answer here. I’m curious what y’all think.

Ted Young: 

But authenticity is really important; right? That seems to be a key element.

Amy Tobey: 

The boundaries are hard. The mythos is we work all day and on open source at night. But a lot of us don’t have those nights free to work on open source. I’m a parent. You know, my evenings are largely spoken for these days. You know, so I do it at work. And at work, I have to balance open-source contributions against the needs of the company. And when I find something where everything lines up, where I can do an open-source thing and it meets the business goals, there’s that magic spot where you get to have a lot of fun. I put together OTel CLI because it was like, this is not really differentiating for our company. It helps us kind of get more tracing down into shell code and things like that. And it can be good for the community. So it was kind of that… when those things come together, it’s always a little bit magical and fun to go forward and put something out in the world.

Ted Young: 

Yeah, for sure. Definitely, if I’m working on open source in my spare time, I do not want to build community about this. This is my trash side project. Don’t you dare put this in production.

Liz Fong-Jones: 

I definitely have to decide which side projects to use OTel and which to submit during work hours later. Like I tried this and it broke. But definitely, my core contributions to OTel are happening when I’m doing my day job. I think the way I kind of navigate this is what am I doing for the user and to recognize there are other ways to validate this too. I need to listen to them, but Honeycomb has a perspective we can bring to the project of what we think is good for users. That’s a perspective that, you know, should be represented and, yes, we do disclose this is our interest and we’re happy to listen to other interests too.

Amy Tobey: 

There’s a diversity and resilience argument in there too. We talk about this in organizational dynamics. But with the diversity of all of the different vendors contributing to this community, you get kind of the crosshatch of all those ideas and desires and things. And you get a more resilient project because of the way that people have to kind of not just say we’re going to do this because I say it should be this way. But you have people to challenge you and to make the best decisions for the community as opposed to for Honeycomb or Lightstep or whoever.

Andrew Hayworth: 

That’s always stood out to me. All of the communities have a good makeup of people representing different vendors and groups and companies. That’s fantastic. I think it does result in a better product, absolutely. It keeps it from running away and becoming somebody’s sort of, some company’s sort of free labor they get to help them build their product. I love that about this project. I love that everyone’s working together on it like this.

Ted Young: 

We put some work into the design of the project to kind of ensure that. So there are rules around the governance structure where it can’t end up getting overrepresented by a company. We have an RFC, OTEP process rather than a BDFL process. And the original scope of the project included telemetry but not analysis and back end because we kind of recognized telemetry as the area where we can all come to an agreement in a straightforward fashion. Like, what language should systems use to describe themselves? And if we got into the work of, like, here’s an analysis to OpenTelemetry trace analysis, then that would turn into a weird place.

Amy Tobey: 

It’s a useful boundary.

Liz Fong-Jones: 

And there are projects out there. We collaborate with Jaeger and Prometheus. Those are the clean boundaries. There are open-source implementations. And also, we do privilege open-source implementations with our exporters. They are directly integrated because they are open-source code and back ends.

Amy Tobey: 

I like that y’all sliced off the part that needs to be nested into my code. Right? That’s the part where, for me as a consumer, vendors and tools come and go. But the things I put in my code, three years from now nobody knows how they got there or why they’re there. I don’t have to try to go and rip all that crap out and replace it with some other library anymore. We can make that commitment and move forward for hopefully a decade before we even have to think about upgrading that stuff. That to me is, like, really valuable as an engineer.

Ted Young: 

Yeah. And having some predictableness to what that code produces that’s gone through a process of getting a lot of feedback from different groups around what kinds of observations are important.

Liz Fong-Jones: 

It took us a while to get there though. It took us two years. I think a lot of that was trying to get this right so that Amy could build it into the code and have it valid for ten years.

Ted Young: 

I feel like we now truly understand why standards take time. Yeah, it’s just a different process from, like, I am like a    with my finely crafted opinions and I’m just going to put them out there and you can like them or not. But if you try to get to a standard where you’re hoping everyone is going to run it and it’s something that’s going to interconnect things. Like the internet is built out of protocols and standards for a reason. It’s about interconnection, and OpenTelemetry is part of that.   

But actually, this is a good pivot point. I’m curious. Obviously, people do want to go out there and experiment; right? And build interesting things. I’m curious, having been a part of the OpenTelemetry community, there’s a part where we want to get an agreement. Where are the places where people today in this ecosystem if you really want to start doing something innovative or pushing the boundaries, where would you suggest people look?

Amy Tobey: 

That’s a good one.

Andrew Hayworth: 

Yeah.

Liz Fong-Jones: 

OTel CLI, Amy.

Amy Tobey: 

I wasn’t going to toot my own horn. That’s one where I started writing it, I was like I could probably read all the steps and get all the terminology lined up. But we need to get this idea out there. It’s based on the idea that came before. So I’m not taking full credit for it. Just sitting there and going, we just need to throw a solution against the wall, get it out in front of some people. And, you know, see what happens. See if it’s something that people want.

Ted Young: 

Can you tell us more about the CLI and what it does and how it works?

Amy Tobey: 

Well, yeah. At the simplest layer, you take Cobra, which is a command-line library for Go that lets you write CLIs in a pretty straightforward way, and the OpenTelemetry library, and mash them together. The use case I had was that in every modern infrastructure, most of the world is held together by shell script. We want to pretend we pushed it all down underneath Kubernetes and all that stuff. But it’s all shell script at the bottom. And I want traces to go as far down and as close to the metal as possible. So one of the challenges I had is sometimes code, you know, in our infrastructure and just about everywhere I’ve ever worked this happens. There’s some loop where some piece of code calls a shell script and then it curls out to some other service. And why do I want that to be a gap in my telemetry? I want it to carry through that shell script.
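
The gap Amy describes, where a service calls a script that then calls another service, can be closed by handing the trace context to the child process. The sketch below illustrates that idea in Python with only the standard library; it is not OTel CLI itself, which is written in Go on top of Cobra and the OpenTelemetry library. Using a `TRACEPARENT` environment variable as the carrier is an assumption of this sketch, and `new_traceparent`, `child_env`, `run_traced`, and `ECHO_CHILD` are hypothetical names.

```python
import os
import secrets
import subprocess
import sys


def new_traceparent() -> str:
    """Build a W3C-style traceparent: version-traceid-spanid-flags."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_env(traceparent: str) -> dict:
    """Copy the current environment and add the trace context to it."""
    env = dict(os.environ)
    env["TRACEPARENT"] = traceparent  # carrier name is an assumption
    return env


def run_traced(cmd: list, traceparent: str) -> str:
    """Run a child command with the trace context exported; return its stdout."""
    result = subprocess.run(
        cmd,
        env=child_env(traceparent),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Stand-in for the shell script in the middle of the request path. A real
# script would read $TRACEPARENT and forward it, for example as a
# "traceparent" header on its curl call to the next service.
ECHO_CHILD = [sys.executable, "-c", "import os; print(os.environ['TRACEPARENT'])"]
```

Calling `run_traced(ECHO_CHILD, new_traceparent())` returns the same string the parent generated, which is the property that lets the downstream service attach its spans to the caller's trace instead of starting a new one.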

Ted Young: 

That’s awesome.

Liz Fong-Jones: 

I think if we could pick up on the broader pattern, the broader pattern is where there’s no clear way of doing it, okay, let’s throw something against the wall and see if it sticks; right? I think for the surface area of gRPC services; right? That’s an area where we’ve had to work very carefully because there are a lot of gRPC services out there. We want to make sure that we get it right. Whereas that isn’t a straightforward thing for problems for CLI. And there’s this focus group on OpenTelemetry on how to log into all of this. There was no known answer to all of this.

Andrew Hayworth: 

I think another interesting area that’s come up is kspan, which highlights what Kubernetes is doing as it goes about its business and makes that visible for the first time. I think there’s a ton of areas that could be explored like that. Places we never bothered to look for telemetry before. Now that we have that, you could actually use that as the basis for a lot of your experiments and go pretty far with it. But I think there’s a whole world, like Amy said, of things close to the metal that we didn’t observe before because it was difficult or embedded. Or HAProxy just released an OpenTracing integration recently. That’s an area where normally you don’t ever do anything inside that system because you want it to be fast and performant. There are a lot of places close to the metal where I think we could work on a ton of stuff.

Liz Fong-Jones: 

Close to the metal or closed box. I think there were questions earlier about Kafka tracing. How would you go through a Kafka broker? Exciting work to do there. We know how to get from producers to consumers, but how do you trace the actual inter-broker protocol?

Amy Tobey: 

They announced something yesterday, didn’t they? Confluent did. Something on OpenTelemetry.

Ted Young: 

This is my long-term hope: that observability becomes kind of like a best practice like testing, where if you’re building a database or you’re building these tools, you’re thinking about observability and you’re thinking about how do I give people not just this data but a playbook kind of explaining what to do with the information.

Liz Fong-Jones: 

It’s interesting we don’t want OpenTelemetry to have it for everything. OpenTelemetry should be baked into every library.

Amy Tobey: 

That, and I want all my SaaS and open-source components or even closed components to emit OTLP. Right? It’s a thing that’s starting to emerge. It’s something I want to do as a product feature down the road for what I work on. That’s way future stuff. But if I could start to have these things, if I’m using a SaaS on one part but my server interacts with it and they emit OTLP to my tracing vendor, then I can stitch pieces together. That makes a huge difference, not just in incidents and troubleshooting. I get excited when I share it with engineers who haven’t seen it before. It starts snapping together in ways it didn’t before. Holy crap. We use that? That’s in that path? Those are things that are exciting and still just starting to happen.

Liz Fong-Jones: 

The mean time to WTF.

Ted Young: 

Yeah. For sure. Okay. But this actually, we’ve got maybe six minutes left, and I think the question I’d like to end on and then after this, we’ll go into Slack. There have been some questions there. We’ll follow up there asynchronously. But one thing that I think comes up in these kinds of projects is there’s a need to be extensible and also cover every use case. Because it’s the standard and in the process of doing that you can make things really complicated. But for new users and people getting started and for frankly the success of an open-source project, you need it to be simple and kind of smooth and slick. And I think that is a tricky balancing act. I personally think we have a lot more work to do in OpenTelemetry to make it smoother and easier to get started with. But I’m wondering, do people have thoughts on that subject both for OpenTelemetry and open source in general?

Amy Tobey: 

Yeah. Make your launchers generic. I love a Lightstep OTel launcher for Go. But I can’t just use it because the engineers get quibbly when I have to set LS underscore or whatever. And Honeycomb just released a Java one recently. I see the vendors starting to do this for the launchers. But I’m also at a loss like where’s the secret sauce here? What’s specific to Honeycomb or Lightstep or whoever? But, like, that’s one of those things I see happening that is cool happening organically like that. But I think filing off the rough edges everybody starts out with starting with OpenTelemetry. I started like that with what the hell is all this boilerplate and why do I have to care about it? Most of my engineers don’t want to    I just have to do this thing. I have this story I need to write code for and I’m told I need to put observability in it. Like, the understanding could come later. But we need to get people to actually put in the traces first.

Andrew Hayworth: 

Yeah. I think that’s actually a really big deal. We were talking about this in the Ruby working group yesterday. Why do we have to have all this boilerplate to set it up? And it turns out I think part of the reason is that we default to the wrong port accidentally. So we want it easier. But there was a goal expressed there to require this library in your little Ruby script and mostly get a working setup right out of the box. You shouldn’t need an OTel launcher package. You shouldn’t need a vendor’s distro. And I really appreciated that perspective on that. You shouldn’t need to have to look at a code sample for the happy path. I think that’s an area that the OpenTelemetry project needs a lot of help in, actually. Defining what precisely is the happy path and then saying if you choose to do it this way, all you have to do is include the library and it’s going to start working. That’s extremely, extremely exciting for people. I did a demo internally, one of our apps had a little bit of boilerplate, but one of the things that blew people away was when I said, you know, look at all of this tracing happening. And it’s all auto-instrumented. It’s useful and great.

That type of ease and out-of-the-box usefulness is really important. You can obviously go overboard with it and make bad decisions, but I think that’s one of the areas I want to personally focus on coming up soon. Making it that simple, making it easy for people to get started.

Ted Young: 

Yeah.

Liz Fong-Jones: 

I think we’re at this interesting point in the evolution where we had to toss a lot of scope off the boat. You can always add new convenience methods, but you can’t take away the kind of core functionality people are relying upon. So that motivated a lot of the, okay, let’s ship the bare minimum thing with you having to tweak all the config knobs. Then you can experiment with the distros. And I think that’s where we are now.

The other interesting thing I was seeing was I was actually comparing the .NET boilerplate to other languages’ boilerplate and seeing the support. Just makes it so much easier; right? Because then you can look at what is needed for the language.

Ted Young: 

Yeah. For any language developers listening in, we’re baking this stuff into your runtime. Especially context propagation. It’s freaking weird we do that in userland. End rant.

Amy Tobey: 

There’s a weird side effect in some of that, though. If it wasn’t the way that Liz described, where these rough edges were exposed, I think we might see a little bit less pickup of the collector. Because the happy path today is to set it all up, don’t worry about authentication, and forward to a local collector and let it bounce out and do the authentication to the vendor there. And that is super good for all of us for reasons I think others will go into.

Sometimes we regret these decisions. We look back and go, yeah that sharp edge probably makes adoption harder. But in the long run, establishing the pattern of using the collector everywhere is really healthy, I think, for just about everybody.

Ted Young: 

Yeah. I definitely feel like the happy path is becoming very clear. Especially for people who are doing things like service mesh and all of that kind of stuff where if we can make OpenTelemetry just have, like, the defaults be a one-liner that sends OTLP, you know, to a local collector. Or, you know, your service mesh or something that’s going to proxy it over to your collector. You do all your configuration in the collector, that is so smooth. So yeah. I’m hoping to see us promote all of this stuff now that the 1.0s are starting to land.
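
The "defaults send OTLP to a local collector" happy path Ted describes comes down to a small piece of configuration logic. The Python sketch below shows that fallback using only the standard library; it is not any SDK's actual implementation. The environment variable names and the `localhost` ports are the ones listed in the current OpenTelemetry exporter specification, while `resolve_exporter_config`, the choice of `grpc` as the fallback protocol, and the shape of the returned dictionary are assumptions made for illustration.

```python
import os

# Local-collector endpoints listed as defaults in the OpenTelemetry
# OTLP exporter specification.
DEFAULT_ENDPOINTS = {
    "grpc": "http://localhost:4317",
    "http/protobuf": "http://localhost:4318",
}


def resolve_exporter_config(environ=None) -> dict:
    """Work out where to send OTLP, falling back to a local collector."""
    env = os.environ if environ is None else environ
    # Falling back to gRPC is a choice made for this sketch; SDKs differ.
    protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "grpc")
    default_endpoint = DEFAULT_ENDPOINTS.get(protocol, DEFAULT_ENDPOINTS["grpc"])
    return {
        "service_name": env.get("OTEL_SERVICE_NAME", "unknown_service"),
        "protocol": protocol,
        "endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or default_endpoint,
    }
```

With nothing set, `resolve_exporter_config({})` points at a collector on `localhost`, so vendor endpoints and credentials can live in the collector's configuration rather than in every service, which is the pattern Amy and Ted describe.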

Amy Tobey: 

Yeah, that’s exciting that those are.

Ted Young: 

Yeah. Cool. Well, we’re coming up on the end of our panel here. I think George is going to pop on and say thanks to all of us. But I want to first of all just say thanks to you three for being such wonderful contributors and collaborators. It’s definitely people like you who make the project awesome for the new people coming in. And for the people who are listening to this talk, come say hi in the OpenTelemetry Slack over in CNCF. We’re always looking for new contributors. So thank you very much.

If you see any typos in this text or have any questions, reach out to team@honeycomb.io.
