Incident Report: The Missing Trigger Notification Emails
By Steve Lewis | Last modified on February 7, 2022
On November 18, between 00:50 and 00:56 UTC, we deployed an update that improved the business intelligence (BI) telemetry available from Honeycomb’s production operations environment. That update contained a defect that escaped notice until around 14:56 UTC on November 22. During that window, we failed to send approximately 94.1% of the email notifications for triggers. We only learned of the problem when a customer brought it to our attention.
In this post, I’ll walk through what happened, the steps we took to fix the issue, and the lessons learned.
What caused the incident?
We don’t operate our own mail server. Delivering email is not a key differentiator for our business, so we work with partners to ensure that it happens quickly and reliably. We use an SDK to simplify working with one of these partners. That SDK behaved in a way that we did not anticipate when we rolled out this small update, and the resulting error went undetected.
I made the change to improve our BI telemetry by adding a new bit of metadata to our API request. I expected that any errors during the processing or delivery of the request would be handled by the existing error handling we had in place, which would then be picked up by the existing instrumentation and be visible within our kibble and dogfood environments.
In the Go programming language, the convention is for a function to return multiple values: its result and any error that occurred. With this SDK, the returned response must also be inspected to detect some types of errors. This surfaced some gaps in our prior implementation:
- We did not instrument the request
- We did not instrument the returned response
- Auto-instrumentation didn’t reach into this SDK to provide transparency into the API calls
- We did not have logic to inspect the return value for hidden errors
Our automated tests did not detect a failure because the third-party API was mocked in those tests. I manually tested that email was being delivered for several user account maintenance scenarios, but I did not manually test triggers. We even had a service-level objective (SLO) on the email deliveries—but because the failures went undetected by our implementation, that SLO was not reflecting the actual failures that were happening.
How do we prevent incidents like this in the future?
The fix was not hard to work out. There was one new line of code that could have resulted in a change to the behavior for triggers. The way we used the SDK here was subtly different from other places where emails were being sent (such as in user account management), and I inferred that the metadata values passed in these commands were probably not being escaped properly.
The harder part was confirming that the change would resolve the notification failures. Without telemetry in Honeycomb showing me the failures before the fix, it was more challenging to verify that there were no regressions or failures afterward. So it took a little more time to build confidence in the fix, which primarily entailed manual testing throughout the product in our dogfood environment.
Since then, we have completed a few rounds of clean-up in the way we use this SDK, including adding many more instrumentation points. We have identified an integration testing interface provided by our email partner that we can incorporate into our automated testing. We have also verified that the SLO now sees the hidden errors that it did not detect before.
What did we learn from this incident?
A few facts stand out for me from this experience: I did not understand the initial behavior of the code I was working on, but I thought I did. I did not take time to observe it before I started making changes. And I didn’t include observability in my development process (horrifying, I know). If I had added the metadata value or the response status code to our instrumentation, the fault would have been revealed as soon as I looked at an event trace.
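As a rough sketch of what that instrumentation could look like in Go: the Event type below is a hypothetical stand-in for a span or structured event in whatever tracing library is in use, and the field names (email.metadata, email.status_code, and so on) are invented for illustration. It records the metadata value sent with the request and the status code that came back, so a rejected request is visible in a trace.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical stand-in for a span or structured event.
type Event map[string]interface{}

// Response is a hypothetical stand-in for the SDK's response type.
type Response struct {
	StatusCode int
}

// recordSend annotates the event with the request metadata and the outcome
// of the send, so that failures hidden in the response are visible.
func recordSend(ev Event, metadata string, resp *Response, err error) {
	ev["email.metadata"] = metadata
	if err != nil {
		ev["email.error"] = err.Error()
	}
	if resp != nil {
		ev["email.status_code"] = resp.StatusCode
	}
	// Success requires both a nil error and a 2xx response.
	ev["email.success"] = err == nil && resp != nil &&
		resp.StatusCode >= 200 && resp.StatusCode < 300
}

func main() {
	ev := Event{}
	// Simulate a request that the API rejected without a Go error.
	recordSend(ev, "trigger-notification", &Response{StatusCode: 400}, nil)

	// Print the fields in a stable order.
	keys := make([]string, 0, len(ev))
	for k := range ev {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%v\n", k, ev[k])
	}
}
```

A success field computed this way is also the kind of signal an SLO on email delivery could be built on, since it reflects rejected requests as well as returned errors.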
In short, with proper instrumentation in place, tracing can help us spot bugs. Making observability a part of the developer workflow can help us prevent bugs, particularly at the boundary with a third-party API where there are more unknowns and more potential for unpredicted failure modes.
We apologize for the outage to the customers who were affected. We know you rely on us, and we take that very seriously. Thanks for being patient as we resolved this issue.