After evaluating multiple approaches to distributed tracing, Vanguard ultimately landed on using OpenTelemetry and Honeycomb. Now, they have hundreds of teams using Honeycomb, with a different mentality toward how they run and manage production. One example is a team using SLOs for a critical service: a burn alert came through, and they were able to remediate the issue before it became customer-impacting.
Rich Anakor [Chief Solutions Architect | Vanguard]
Hi, everyone. I’m Rich Anakor from Vanguard. Today I’m going to talk about Vanguard’s journey with OpenTelemetry and Honeycomb. I will talk to you about this in three parts: how we got started, where we are today, and where we’re headed. More importantly, I hope this serves as a template for organizations our size and for organizations in highly regulated industries, like the financial services industry.
To build a context around this, several years ago, Vanguard had this idea to move all of its workload from our data centers to the public cloud. There was this transformation that happened sequentially. So we moved from the data center to the private cloud. From the private cloud, we would go to the public cloud. What happened is we ended up in a state where we were running across these three environments simultaneously.
We had services that had dependencies across the three. This layered in so many complexities for our support teams trying to understand what’s going on in our environments. I joined Vanguard about two years ago, and my job was really to help build instrumentation and build out the telemetry that would help our teams know what’s going on in the environment. We set up goals, and these goals were about how we can support our applications with these layers of complexity. How can we know what’s going on in them? We needed an approach that would help us understand this modern production environment.
We knew that our current APM solution did not scale. It was not really bringing that engagement from our teams. We knew we had to solve this problem. So how did we get started?
The one thing I want to highlight about how we got started in this journey is that we started really small. Starting small is a good technique that I think teams should learn from.
I come from financial services; I’ve been there for more than a decade. One thing that you see is that things like this require approvals and so much organizational involvement to really get an idea off the ground. In this case, that was not the case. There were only three of us: myself, an engineer on my team, and an engineer in one of our feature teams.
When we came together, we knew the current approach did not scale. What can we use? What technology solutions are available out there that can tell us the patterns that are happening as calls are going from our data center to the private cloud to the public cloud and traversing back and forth? How can we see this? How can we help our teams reduce the mean time to recovery? That’s the goal we set.
So now what did we do? One thing that was top of mind was that we needed a distributed tracing approach. So we started looking at technologies out there, and OpenTracing came to mind. But we needed a backend to send this trace information to, so we could begin interrogating our systems to see what’s going on. We looked at all the vendors. Honeycomb became a partner that wanted to work on this journey with us.
We started really small, as I mentioned. We started with one of the services that had dependencies across these environments. How did we get started? It was a small, self-organizing team. We started with the instrumentation. We were able to get early feedback. We were able to see what was happening.
But we did this initially with Beelines. One of the approaches we wanted was something that was vendor neutral. We didn’t want our engineers worrying about licensing, or what vendor agent we’re installing, and all that stuff.
We went with OpenTracing at the time. But OpenTracing didn’t give us the auto-instrumentation capability we were looking for. We knew about OpenTelemetry and the progress happening in that community. We decided to try it out. We brought in OpenTelemetry.
Honeycomb didn’t care. They said, “Whatever you use, our backend can handle it.”
We started with OpenTelemetry. The auto-instrumentation was one of the main drivers that really helped us. We were able to propagate context across application boundaries, across environments, and we were able to interrogate these systems and really understand what’s happening.
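To make "propagating context across application boundaries" concrete, here is a minimal sketch of the W3C Trace Context `traceparent` header, which OpenTelemetry's default propagator uses. This is an illustration in plain Python, not Vanguard's code and not the OpenTelemetry SDK itself; the function names `new_traceparent` and `child_traceparent` are hypothetical. The idea is that every hop keeps the same trace ID and only swaps in its own span ID.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags
# e.g. 00-<32 hex chars>-<16 hex chars>-<2 hex chars>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace (hypothetical helper, for illustration only)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Build the header a service sends downstream after receiving `incoming`.

    The trace ID is preserved so all hops land in one trace; the span ID is
    replaced with this service's own span. Invalid input starts a new trace.
    """
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return new_traceparent()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# A call that starts in the data center, then hops to the private cloud and
# the public cloud, carries the same trace ID the whole way.
data_center = new_traceparent()
private_cloud = child_traceparent(data_center)
public_cloud = child_traceparent(private_cloud)
```

In practice, auto-instrumentation injects and extracts this header for you on outgoing and incoming requests, which is why a backend can stitch spans from all three environments into a single trace.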
Let me tell you a bit about where we are now.
We have hundreds of teams now using OpenTelemetry and Honeycomb. We’re able to bring a different mentality in the way we are able to run and manage our production systems. We were able to really help our engineering teams. We’ve changed the culture.
One of the things I will highlight today is how we think about production systems. We often think about APM as something that’s only used in production. Or, we use it to fight fires and use it to respond to incidents.
With OpenTelemetry, we found that, yes, it’s actually very effective in doing that. But, also, it helps you do analysis. I’ll highlight two examples.
One main example that our teams discovered was during a migration effort. This team wanted to move some data to a new repository in the cloud, and they wanted to know all the dependencies that were involved. They wanted to know all the user actions and how they map back to the backend stored procedures.
They’d been going for months with spreadsheets, looking at code, involving really smart people, engaging, and really trying hard to solve this problem. But they could not. Because this was on-prem and considered a legacy application, we did not think we could help. This application had dependencies with other workloads in our private cloud. But, we said okay. Let’s try this out.
With OpenTelemetry and Honeycomb, they were able to answer these questions within minutes. Minutes! So that was key. It just showed our teams that this is beyond just responding to an incident. You can actually understand how your systems are behaving.
Another important thing that I want to highlight today is when it comes to measuring what’s happening in your production systems, really measuring what matters. So I will talk about Service Level Objectives and the impact this had in the way that we manage our production systems.
One example I would like to highlight today is a testimony from one of our teams. They had an SLO defined for a critical service. They got notified through a Burn Alert that they had to respond: if they didn’t respond within 30 minutes, there would be a customer-impacting issue. They were able to respond. They were able to figure out what the issue was. They were able to remediate this issue before it became customer-impacting.
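The arithmetic behind a burn alert can be sketched roughly as follows. This is a simplified illustration, not Honeycomb's actual algorithm: the function names, the linear extrapolation, and all numbers are assumptions chosen only to show how an SLO target turns into an error budget and a "time until exhausted" estimate.

```python
def hours_until_budget_exhausted(
    target: float,
    expected_events: int,
    failures_so_far: int,
    recent_failures_per_hour: float,
) -> float:
    """Estimate how long the remaining error budget lasts at the recent burn rate.

    target: SLO target as a fraction, e.g. 0.999 for 99.9%.
    expected_events: events expected over the whole SLO window (assumed known).
    """
    budget = (1.0 - target) * expected_events  # failures the SLO tolerates
    remaining = budget - failures_so_far
    if remaining <= 0:
        return 0.0  # budget already spent
    if recent_failures_per_hour <= 0:
        return float("inf")  # nothing is burning right now
    return remaining / recent_failures_per_hour


def should_alert(hours_left: float, exhaustion_threshold_hours: float) -> bool:
    """Fire when the budget is projected to run out within the threshold."""
    return hours_left < exhaustion_threshold_hours


# Illustrative numbers only: a 99.9% target over 1,000,000 expected events
# allows about 1,000 failures. With 900 already used and 400 failures per
# hour recently, roughly a quarter of an hour of budget remains.
hours_left = hours_until_budget_exhausted(0.999, 1_000_000, 900, 400.0)
alert = should_alert(hours_left, exhaustion_threshold_hours=0.5)
```

The point of alerting on projected exhaustion rather than on a raw error count is that the team hears about the problem while there is still budget left to act, which is what happened in the example above.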
These testimonials have energized our teams. We have a mandate, as we speak, that any application in our environment, any new service that’s being built, must be instrumented with OpenTelemetry and reporting traces to Honeycomb.
These have really changed the way we do business. It’s changed the way our engineers work. It’s changed the engagement. It’s made them more productive.
So now where are we headed with this? As we’re making these mandates, I think it’s important to connect the journey. So we’ve set some goals. I will talk about our year-end goals. One of our year-end goals is we want to move 100% of our applications to Honeycomb using OpenTelemetry. That’s a difficult goal but one that I’m confident, with the level of engagement that we have from our teams, we’ll reach by year end.
One of the bits of culture we also want to adopt is to really drive down our Mean Time to Resolve. How do you do so? You do so by really knowing when there’s a problem, knowing where the problem is, and knowing how to solve it. While OpenTelemetry is not something that directly resolves the issue, it gives you that power. Using Honeycomb allows you to slice and dice this data any way you need to see it. That’s the goal we have. We want to reduce our MTTR significantly.
This is a really, really powerful thing that has driven value, and every stakeholder on our teams has come on board. We no longer have to convince everybody that this is a good thing. Everyone wants to join the movement.
I want to leave you with this thought. For organizations like ours (I keep saying that because that’s where I’ve spent most of my time, with companies like Vanguard), this may seem like a difficult journey to embark on. But it’s important to start small. It’s extremely important to celebrate the small wins. And, above all, engage the right people early.
That’s what we did, these three things. The success really comes with the result.
People are going to get engaged when they see value. There are always questions that people have about their systems, but they cannot answer them. When you give them the avenue to answer these questions, it becomes extremely powerful. That movement is something I’ve seen at Vanguard, and we’re excited about where we’re going, and we’re excited about learning more and sharing with the community as we make progress.
So all I’m leaving you with today is to really, really think about your current situation. Think about the challenges you have with your systems. Think about what the questions are that you want to ask about your systems that you don’t have answers to. Maybe tracing is a good way to look at it. Maybe OpenTelemetry can help you. Maybe Honeycomb can help you.
This is the Vanguard story, and I hope to continue to tell this story as we make more progress. If you have additional questions, my contact information is available. Feel free to reach out to me directly and ask about our journey. I will be very happy to share whatever information we have.
All right. Thank you so much for listening to me, and I hope you have a great day. Thank you.