How to Hot-Swap Active Production Services

 
 

Mode Analytics is a data tool for data scientists that supports SQL-, R-, and Python-based analysis and visualization. The team needed to replace its most critical service with an upgraded version but did not have the luxury of downtime, so they had to hot-swap a live production service. The service is critical because it underpins Mode’s visualization technology. Upgrading it is an incredibly delicate process, but one with far-reaching benefits.

Two critical path engineers in this process are Mode’s Talia Trilling, Senior Software Engineer, and Ryan Kennedy, Staff Software Engineer. They sat down with us to discuss their process for hot-swapping active production services and the lessons they’ve learned along the way.

Talia and Ryan presented the essential steps for hot-swapping an active production service as three core phases:

  • Phase 1: Fire and forget
  • Phase 2: Full evaluation
  • Phase 3: New service exclusive

Before embarking on these phases, Talia and Ryan both emphasized the preliminary steps you must complete first. Let’s dive in.

Preliminary step 1: Clarify your requirements

Before you start any of the migration steps Ryan and Talia laid out, you need to clearly define your problem domain and your success criteria. That means having clarity on:

  • The problem you’re facing 
  • The goals you need to achieve 
  • What success looks like
  • What happens before, during, and after the swap

Define the problem

First, clearly define the problem you’re trying to solve, as well as its impact on your business. This clear definition not only helps you find the right solution later on but also helps you make a case for significant changes that may not have immediate ROI.

In Mode’s case, their visualization service, nicknamed “Flamingo,” was made up of two separate service components that were too tightly coupled. A different team managed each component and, while it made sense years ago to couple them together, this setup was a roadblock for efficient iteration. 


Figure 1: The structure of Mode’s visualization service, called Flamingo, which is made up of two tightly coupled components.

The two components that made up Flamingo were the data engine and the visualization grammar. The data engine, which Ryan and Talia’s team manages, owns the API, data manager, and the management of the in-memory database. The visualization grammar, which another team manages, takes a visualization request and turns it into an execution plan, usually in the form of a few queries against the in-memory database. 

The tight coupling between the data engine and the visualization grammar worked functionally, but it prevented efficient iteration on either component. As a data science and business intelligence platform, iterating on how Mode presents data visualization is vitally important for their future as a business.

But buy-in from other affected teams wasn’t a given. Talia and Ryan’s team had to make a strong case for hot-swapping one of Mode’s most essential services. As Talia pointed out, “[Flamingo is] serving us for what we need it to, but if we want it to grow, it’s going to be next to impossible because of how tightly coupled everything is.” 

Characterizing the problem as a threat to future growth is what got the attention of other key business stakeholders. By defining the problem clearly and explaining its impact, even though it didn’t immediately affect the bottom line, Talia and Ryan’s team presented a clear objective and got buy-in from teams across the business.

Define your goal

Next, establish what you’re specifically trying to achieve. To define your goal, look to your users and ask yourself what you’d like their experience to be during the hot-swapping process.

Ryan explained that, from Mode’s perspective, “For someone running an analysis, the only thing worse than giving no answer is giving the wrong answer. It’s something we are acutely concerned with: making sure we’re not giving people bad data that they then use to make big business decisions.” 

With this perspective in mind, Mode defined its goal as needing to separate the two Flamingo components—the visualization grammar and the data engine—in a way that did not impact the end user experience or accuracy of results. 

Uncoupling these components meant Flamingo would house the visualization grammar while the data engine became its own separate service that worked closely with Flamingo. And this swap needed to happen without impacting the user experience.

Define success

Create clear markers for what success is and how you know you’ve achieved it. With your goal in mind, define what indicators to look for that show when you’ve succeeded or failed.

Talia and Ryan’s team sat down with the product management team to create two definitions of success everyone could rally around:

  1. The new system of decoupled service components needs to be at least as reliable as the existing system of tightly coupled service components.
  2. P50 and P90 latencies for Flamingo need to be no more than an agreed threshold slower than baseline performance. Talia and Ryan’s team knew they’d see a temporary spike in latency, so to arrive at this definition they asked the product team, “What change in latency can you stomach?” The tradeoff is that putting up with some added latency now leads to increased speed later.
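To make the two success definitions concrete, here is a minimal sketch of how they could be checked against collected measurements. It is an illustration only, not Mode’s actual tooling: the function names, the idea of passing raw latency samples in milliseconds, and the max_slowdown parameter (the threshold expressed as a fraction, for example 0.2 for 20 percent) are all assumptions.

```python
from statistics import quantiles


def percentile(samples, pct):
    """Return the pct-th percentile (1 to 99) of a list of latency samples."""
    # quantiles(..., n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    cuts = quantiles(samples, n=100, method="inclusive")
    return cuts[pct - 1]


def within_latency_budget(baseline_ms, candidate_ms, max_slowdown):
    """Check that candidate P50 and P90 are at most max_slowdown slower than baseline.

    max_slowdown is a hypothetical threshold expressed as a fraction (0.2 = 20%).
    """
    for pct in (50, 90):
        base = percentile(baseline_ms, pct)
        cand = percentile(candidate_ms, pct)
        if cand > base * (1 + max_slowdown):
            return False
    return True


def at_least_as_reliable(baseline_success_rate, candidate_success_rate):
    """Check that the new system succeeds at least as often as the existing one."""
    return candidate_success_rate >= baseline_success_rate
```

In practice, the baseline samples would come from the existing system before the swap, and the candidate samples from the new service path during each phase.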

Everyone at Mode is a stakeholder in Flamingo in one way or another because visualization is so critical to their product. Setting clear success definitions helps everyone understand what is happening with Flamingo during the hot-swap and why.

Define state before, during, and after

Lastly, use all of the work you’ve done up to this point and delineate what the state of the services will be before, during, and after migration. Here is how Mode laid it out:


Figure 2: Mode’s definition of before, during, and after hot-swapping active production services.

“Before” is how things stand when you have a problem. In Mode’s case, it’s where everything is too tightly coupled into one larger service.

“During” is where things stand while you go through the hot-swapping phases. This is where Mode built out the Data Engine as a new service separate from Flamingo and ran tests. It’s also where all three phases of hot-swapping described later in this guide happen.

“After” is where things should be after the hot-swap is complete. This is where Flamingo and Data Engine are distinct services that work together.

Preliminary step 2: Get visibility into your production environment

Hot-swapping active services can be extremely risky if you don’t know exactly what’s happening at each step along the way. To monitor how the hot-swap progresses in each phase, you need to set up a dimension in Honeycomb that’ll provide a baseline understanding of system changes. Mode uses a dimension they call EvaluationMode. “EvaluationMode is a dimension that we added to Honeycomb that reflects what’s happening with our LaunchDarkly flag,” Talia explained.

Here’s how that looks in Honeycomb:


Figure 3: Honeycomb view of the EvaluationMode dimension.

The orange line is the control Talia and Ryan’s team set up before they deployed any code related to the hot-swap. Each subsequent vertical line marks the start of one of the migration phases outlined below. The first vertical line is when the hot-swap begins, the second is when the second phase begins, and a third will appear when Mode is ready for the final phase.

“These vertical lines show where each one of these feature flags are being enabled at certain percentages,” explained Ryan. “The average is the success rate, so 100 is a success, 0 is a failure. It works similar to an SLI-type metric where we can run an average of it to get a percentage. We can see that in Flamingo exclusive evaluation, the success rate is a bit better than pretty much all other evaluation methods, so there’s something for us to go in and look at.”
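The averaging Ryan describes can be sketched in a few lines. This is a simplified illustration rather than Mode’s implementation or any Honeycomb API: the event dictionaries, the evaluation_mode and success field names, and the mode labels are all hypothetical stand-ins for whatever the real telemetry contains.

```python
from collections import defaultdict


def success_rate_by_mode(events):
    """Average a 100/0 success score per evaluation mode to get a percentage."""
    scores = defaultdict(list)
    for event in events:
        # Score each request 100 on success and 0 on failure, as in the quote above.
        scores[event["evaluation_mode"]].append(100 if event["success"] else 0)
    return {mode: sum(vals) / len(vals) for mode, vals in scores.items()}


# Hypothetical events; the mode labels are illustrative only.
events = [
    {"evaluation_mode": "flamingo_exclusive", "success": True},
    {"evaluation_mode": "flamingo_exclusive", "success": True},
    {"evaluation_mode": "flamingo_exclusive", "success": True},
    {"evaluation_mode": "flamingo_exclusive", "success": True},
    {"evaluation_mode": "fire_and_forget", "success": True},
    {"evaluation_mode": "fire_and_forget", "success": True},
    {"evaluation_mode": "fire_and_forget", "success": True},
    {"evaluation_mode": "fire_and_forget", "success": False},
]

rates = success_rate_by_mode(events)
```

Grouping by the evaluation mode is what makes the comparison useful: a gap between the rate for one mode and the rate for another points to where to start investigating.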

This visibility provided Talia and Ryan’s team with direction on where to start looking when they ran into issues during the hot-swap. Without this level of visibility, you won’t be able to see if you’re trending towards your goal or
