
How Vanguard Used Observability to Accelerate and De-risk Their Cloud Migration

By Eric Thompson  |   Last modified on May 2, 2023

Rich Anakor, chief solutions architect at Vanguard, is on a small team with a big goal: Give Vanguard customers a better experience by enabling internal engineering teams to better understand their massively complex production environment—and to do that quickly across the entire organization, in the notoriously slow-moving financial services industry. 

They also had a big problem: The production environment itself.  

As he explained in his keynote at hnycon 2021, Vanguard is in the process of moving its workloads in stages from its own data centers to the public cloud. During this transition, Vanguard’s production environment is split between on-prem data centers, its private cloud, and the public cloud. In this kind of environment, it was clear to Rich that Vanguard’s existing approach to application performance monitoring (APM) would not scale. So he and his team came up with a new approach powered by OpenTelemetry and Honeycomb.

Implementing OpenTelemetry and Honeycomb at Vanguard helped Rich identify a roadmap that improved tangible metrics like MTTR, lead a larger cultural shift at the company, and use that momentum to champion a mandate that all teams use OpenTelemetry and Honeycomb in every corner of the production environment. In this post, we outline the tips he shared on how to drive observability adoption in an enterprise environment.

Start small and experiment with instrumentation

Getting buy-in from all stakeholders in an enterprise-level organization is a huge, tedious task. Taking a step-by-step approach, Rich and his two teammates started small with one service that had dependencies across the production environment before iteratively moving forward to other services.

At the very beginning of this process, Rich knew Vanguard needed a distributed tracing approach, so they started with OpenTracing for instrumentation. They discovered Honeycomb when they needed a backend data store for the telemetry data. Once they decided on Honeycomb, they switched to using Honeycomb’s proprietary Beelines for instrumentation. 

Later, Rich decided that Vanguard needed a vendor-neutral approach to instrumentation so that engineering teams wouldn’t have to worry about proprietary issues and vendor lock-in. He turned to the OpenTelemetry project, which meshed well with Honeycomb, because, as he put it, “Whatever you decide to use, Honeycomb can handle it.” 

By starting small, with one service instrumented with OpenTracing, Rich was able to test what worked and what didn’t, eventually landing on a setup that fit Vanguard’s specific needs.

Rack up quick wins to drive a cultural shift

One team was working on an ongoing migration effort moving data from a legacy, on-prem system to a repository in the cloud, and needed to know all of the service and data dependencies tied to the application. They worked for months with spreadsheets, dug through code, and talked to experts, yet couldn’t identify all the dependencies. 

That’s when Rich and his team stepped in with OpenTelemetry and Honeycomb. With their help, the migration team answered all of their dependency questions in minutes, after months of manual work had left those questions open.

The results proved to the migration team that what Rich and his team are doing with Honeycomb and OpenTelemetry goes well beyond just incident response. Rather, Rich’s team is at the forefront of introducing a cultural shift at Vanguard—one where gaining a clear understanding of how production environments behave is not only possible, but also standard operating procedure. 
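To make the dependency question concrete, here is a minimal Python sketch of how a caller-to-callee service map can be derived from trace spans. The span fields (span_id, parent_id, service) and all service names are hypothetical, chosen only for illustration; this is not Vanguard's data, and it is not the query Honeycomb itself runs.

```python
from collections import defaultdict

# Hypothetical spans from one trace. Field names and service names are
# assumptions for illustration, not a Honeycomb or OpenTelemetry schema.
SPANS = [
    {"span_id": "a", "parent_id": None, "service": "web-frontend"},
    {"span_id": "b", "parent_id": "a", "service": "account-api"},
    {"span_id": "c", "parent_id": "b", "service": "legacy-db"},
    {"span_id": "d", "parent_id": "b", "service": "account-api"},
    {"span_id": "e", "parent_id": "a", "service": "pricing-api"},
]


def service_dependencies(spans):
    """Return {caller_service: sorted list of services it calls}."""
    service_by_span = {s["span_id"]: s["service"] for s in spans}
    deps = defaultdict(set)
    for span in spans:
        parent = span["parent_id"]
        if parent is None or parent not in service_by_span:
            continue  # root span, or parent missing from this sample
        caller = service_by_span[parent]
        callee = span["service"]
        if caller != callee:  # ignore child spans inside the same service
            deps[caller].add(callee)
    return {caller: sorted(callees) for caller, callees in deps.items()}


def dependents_of(spans, target):
    """List the services that call `target` directly."""
    deps = service_dependencies(spans)
    return sorted(caller for caller, callees in deps.items() if target in callees)


if __name__ == "__main__":
    print(service_dependencies(SPANS))
    print(dependents_of(SPANS, "legacy-db"))
```

Because every span records its parent, the dependency map falls out of the telemetry itself rather than out of spreadsheets and code archaeology. A real migration would aggregate over many traces, not a single one.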

Iterate toward big goals by focusing on day-to-day impact

With an eye on shifting Vanguard’s culture toward observability, Rich and his team first focused on making a tangible impact on both day-to-day workflows and overarching organizational goals.

One way they achieved this was by introducing Honeycomb’s Service Level Objectives (SLOs) for alerting. One team Rich helped set up an SLO error-budget burn alert that notified them when an issue would start affecting their customers unless it was fixed within the next 30 minutes. With that warning, the team was able to identify and fix an issue before their customers even noticed anything was wrong.

This capability to fix issues before they become bigger problems led to three tangible improvements:

  1. The team experienced less stress because they could detect issues before they were reported by customers.
  2. Customer experience improved because customers ran into fewer issues in production.
  3. Overall MTTR decreased across the organization.
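The burn alert described above can be sketched with a little arithmetic. The Python below is a simplified illustration that assumes a linear extrapolation of the recent failure rate; the function names and the example numbers are hypothetical, and Honeycomb's actual burn alert calculation may differ.

```python
def minutes_until_exhaustion(budget_remaining, errors_in_window, window_minutes):
    """Estimate minutes until the error budget is spent.

    Assumes failures continue at the rate observed over the last
    `window_minutes` (a linear extrapolation, chosen for simplicity).
    """
    if errors_in_window <= 0:
        return float("inf")  # nothing is burning the budget right now
    burn_per_minute = errors_in_window / window_minutes
    return budget_remaining / burn_per_minute


def should_alert(budget_remaining, errors_in_window, window_minutes,
                 lookahead_minutes=30):
    """Alert if the budget would run out within the lookahead period."""
    remaining = minutes_until_exhaustion(
        budget_remaining, errors_in_window, window_minutes
    )
    return remaining <= lookahead_minutes


if __name__ == "__main__":
    # Hypothetical numbers: 300 failed requests left in the budget,
    # and 150 failures seen in the last 10 minutes.
    print(minutes_until_exhaustion(300, 150, 10))
    print(should_alert(300, 150, 10))
```

With these made-up numbers, the budget burns at 15 failures per minute, so the 300 remaining failures would be used up in 20 minutes, which is inside the 30-minute lookahead. A slower burn, such as 30 failures in 10 minutes, would leave 100 minutes and would not trigger the alert.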

Get organizational buy-in

As this cultural shift toward observability took hold at Vanguard, Rich was able to work with key stakeholders in his organization to demonstrate value that every engineering team could realize: a complete understanding of what’s going on in production.

Vanguard now has a mandate to expand OpenTelemetry and Honeycomb to every application in their production environment. Currently, they are aiming to replace all of their existing APM tools by the end of 2021 with the approach pioneered by Rich’s team. To dig deeper into the details of how he was able to get organizational buy-in, watch the full recording of his talk.

Want to learn how to change the way your teams prioritize effort so they can do their best work, identify an observability roadmap that fits your business, and use Honeycomb with OpenTelemetry? Join us and Rich for this webinar, "How Vanguard Upleveled Their Org and Brand With Observability."

 
