Spooky Tales of Testing In Production: A Recap and Lessons Learned
By Alaina Valenzuela | Last modified on January 11, 2019
My biggest takeaway from Paul Biggar's talk, “The Time Our Provider Screwed Us”, was how his team conceptualized their incident response. When a big incident strikes, if you haven’t been in a situation of that magnitude before, it can be easy to panic and not know what to do first. Paul described a strategy that any team can use to manage incidents.
First and foremost, the team chose a leader to coordinate the investigation into and resolution of the incident, and to manage communication with customers. Paul himself took on this role. Next, they prioritized their goals: 1) Ensure safety of customer data, 2) Communicate transparently with customers, and 3) Restore the system to working order. They used these priorities to determine their plan of attack.
Because ensuring customer safety ranked higher than restoring working order, they decided to essentially shut down their service. In their view, a temporary outage, even with the risk of losing customers, was worth it in light of the greater risk of further compromising their customers’ data. Once the system was shut down, they set out to investigate and resolve the security breach so that they could restore service to their users.
During this process, they had to keep in mind their second priority of communicating transparently with customers. Every fifteen minutes, Paul checked in with his team to get status updates, and every half hour he released an update to customers via the company's status page. He also communicated directly with customers who had asked specific questions. He stressed that this formula of when to check in, when to communicate, and what to focus on first helped guide his team to success.
When the incident was over, Paul's team wrote a detailed blog post explaining what had happened and what steps they had taken to resolve the issue. Paul stated that their transparency in their communication with customers was a key differentiator between them and other companies affected by the same security breach, and that ultimately they were able to keep most of their customers as a result.
At Honeycomb, our incident response protocol is similar in that we assign a team member to communications duty and have them update our status page and respond to customers via Intercom. I think we could do better, however, in clearly defining our priorities as a team during each incident, so that everyone is confident they are working towards a common goal. I'll be taking Paul's recommendations back to our team to improve our own incident response.
Marc Deven’s talk, “The Penny Glitch that Cost Big”, also emphasized the need for transparency with customers and for taking care of them throughout the incident. His company's glitch made all of their sellers’ products, some of which were big-ticket items like TVs and cameras, cost only one cent. Many buyers, of course, took advantage of this, resulting in what could have been large losses for the sellers. Marc’s company worked closely with the sellers to offer credit to buyers in exchange for cancelling their orders. In cases where buyers refused to cancel, his company compensated the sellers for their loss. He stressed the importance of business insurance for this purpose.
While Marc was not actually employed by this company at the time of this incident (he was hired a few months later), his thorough knowledge of it emphasizes how incidents and the success of the incident response become a part of company lore and influence company culture. It also underscores the value of sharing these stories instead of secreting them away out of fear for how others (either outsiders or employees) will perceive the team and its leadership. By achieving a successful incident response, the company was able to keep most of their customers for many years after, thereby turning what could have been a devastating situation into a source of pride for the team.
Towards the end of his talk, Marc emphasized how gating software could have helped the team roll back the error more swiftly. Marc took this lesson to his new company, where he helped minimize the risk of large deploys packed with shiny new features. At Honeycomb, we use LaunchDarkly as our gating software and love how it helps us target beta features at customers who have opted in and minimize the risk of releases, so this point was very near and dear to my heart.
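The kind of gating Marc described boils down to a per-user flag lookup deciding whether the risky new path runs. Here's a minimal sketch of the idea, using a hypothetical in-memory stand-in rather than a real flag service like LaunchDarkly (all names are made up for illustration):

```python
# Minimal feature-flag sketch: a hypothetical stand-in for a real
# flag service such as LaunchDarkly. Not production code.

class FeatureFlags:
    def __init__(self, flags=None):
        # flag name -> set of user IDs opted in ("*" means everyone)
        self._flags = flags or {}

    def is_enabled(self, flag, user_id):
        targets = self._flags.get(flag, set())
        return "*" in targets or user_id in targets


flags = FeatureFlags({"new-checkout": {"user-42"}})

def checkout(user_id):
    if flags.is_enabled("new-checkout", user_id):
        return "new checkout flow"   # risky new path, gated to opt-ins
    return "old checkout flow"       # safe default for everyone else

print(checkout("user-42"))  # new checkout flow
print(checkout("user-7"))   # old checkout flow
```

The point of the pattern is that rolling back the risky path is a flag change, not a deploy: remove the user set (or the `"*"` wildcard) and every request falls through to the old code immediately.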
We all know Google wrote the book on SRE, so hearing Eric Pollmann describe a deadly data push that took down Google Ads was an interesting insider's view into incident response at Google. Eric described how his team received a page for CPU overload, which quickly escalated into the realm of disaster when he noticed all the servers were repeatedly crashing. His first thought was that maybe this was caused by a new code push rolling out.
Google already had some measures in place that would have made reverting bad code fairly easy. For example, its software development teams used feature flags to lessen the risk of new code pushes. However, in this case, new code was not the problem.
Eventually the team found that a bad data push was the culprit. While Google engineers had put procedures in place to ensure bad code did not take down their servers, they hadn't taken the same precautions with data pushes. The data push was automated and rolling out to all the servers incrementally, with no stop button, no revert procedure, and no clear way to halt its progress. Instead, Eric and his team typed furiously for the next 60 or so minutes: logging into servers, writing scripts to flip symlinks, and other crazy shenanigans. It was refreshing to hear that even at Google, engineers sometimes have to resort to getting on hosts and hacking out fixes.
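The symlink trick Eric's team reached for is a common pattern for making a data (or code) rollout revertible: each version lives in its own directory, the server reads through a "current" symlink, and reverting is one quick repoint. A rough sketch of how that flip might look (the layout and paths are hypothetical, and `mv -T` is GNU coreutils):

```shell
# Hypothetical layout under a scratch dir: each data push lands in its
# own versioned directory; the server reads through a "current" symlink.
base=$(mktemp -d)
mkdir -p "$base/v1" "$base/v2"
ln -s "$base/v1" "$base/current"           # serving v1

# Roll forward: point a temp symlink at v2, then rename it over
# "current" (GNU mv -T renames in place, so readers never see a gap).
ln -s "$base/v2" "$base/current.tmp"
mv -T "$base/current.tmp" "$base/current"

# Revert is the same flip back to the known-good version.
ln -s "$base/v1" "$base/current.tmp"
mv -T "$base/current.tmp" "$base/current"
readlink "$base/current"                   # prints the v1 path
```

Had the data push been structured this way from the start, the revert Eric's team improvised under pressure would have been a documented one-liner.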
Eric’s takeaways are lessons we have also learned at Honeycomb. First, a check that comes back successful does not always ensure success. In Eric's case, the servers had a check to make sure they still functioned after a new data push, but they did not have another check after they had served a lot of queries. At Honeycomb, we often have multiple related triggers that test failure modes of the same process from slightly different angles. We also have end-to-end checks that test the continued health of our system. This helps ensure that we're not missing any potential issues. Second, always have a plan to revert, and make sure this plan is well-documented and easy to find!
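The "a passing check doesn't guarantee continued success" lesson can be made concrete with a toy sketch: a health probe that passes right after the push, but fails once real traffic has flowed and a latent bug surfaces. Everything here is hypothetical and purely illustrative:

```python
# Toy model of a latent data-push failure: the server looks healthy
# on the first probe and only crashes after serving real traffic.

def serve_query(server_state):
    server_state["queries_served"] += 1
    # Simulated latent bug: bad data only crashes the server after
    # it has handled some traffic, not on the very first query.
    if server_state["bad_data"] and server_state["queries_served"] > 100:
        raise RuntimeError("server crashed on bad data")
    return "ok"

def health_check(server_state):
    try:
        return serve_query(server_state) == "ok"
    except RuntimeError:
        return False

state = {"bad_data": True, "queries_served": 0}

# Check immediately after the data push: looks healthy.
print(health_check(state))        # True

# Serve real traffic, then probe again: the latent failure surfaces.
for _ in range(200):
    health_check(state)
print(health_check(state))        # False
```

This is why a single post-push check isn't enough: the same probe has to keep running after the system has been under load, which is what continuous end-to-end checks buy you.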
Ready for more?
We're super grateful to our pals at LaunchDarkly for cohosting this fun and educational event with us. If you're interested in being able to Test In Production, check out the meetup--and try Honeycomb out for free.