Dear honeycomb users,
On Saturday, Aug 19th, we experienced a service outage for all customers. This was our first-ever outage, even though we’ve had users in production for almost exactly one year, and paying customers for about 6 months. We’re pretty proud of that, but we were also overdue for an outage.
We take production reliability very seriously for our customers. We know you rely on us to be available so you can debug your own systems, so we’ve always invested effort into defensive engineering and following best practices for a massive, multitenant system. We learned a lot from this outage, so we’d like to tell you what happened and what we’re doing to prevent the next one.
(Naturally, we use honeycomb to debug honeycomb, in a fully isolated environment called “dogfood”. We love you, so we’ll include some graphs from dogfood showing how we debugged the problem.)
At 19:35 PDT, mysql disk usage began growing and the connection count skyrocketed. The on-call engineer was paged a few minutes later and saw that traffic was dropping across the board.
The on-call engineer then looked at latencies, which were climbing, and started checking possible sources. The climb could have come from kafka, mysql, a cache, or the data nodes, but he could tell that mysql was getting slower because get_schema_dur_ms was climbing.
Mysql continued to serve traffic to any open threads, but it was over its connection limit, so new connections could not be made. This prevented our normal mechanisms from kicking in. Those mechanisms automatically impose a temporary ban when a particular dataset or team is spiraling out of control and using too many system resources.
We knew the problem was mysql, though, so we kept digging. Query times were spiking from mysql’s perspective, although the degradation was gradual since our API caches values for varying lengths of time.
Mysql was out of disk space, we discovered — it had filled up in 2 minutes and was stuck thrashing, trying to write a temp table to disk. This, we realized, was due to a customer having inserted a few rows with column counts of 52k to 104k. Mysql was trying to write out the schema to disk.
We shut down honeycomb to reclaim mysql connections so we could kill off the lagging queries, increased the capacity of our mysql primary, and deleted the problematic rows. We temporarily blocked the problematic dataset, emailed the customer, and opened traffic back up again.
The service was totally down for everyone from 21:00 PDT to 21:18 PDT, when everything got back to normal. All users will likely see a disruption in their data from 19:47 to 21:18.
As we said, we take this very seriously. After our postmortem, we have the following actions on our todo list:
- Set a hard limit on the columns per dataset. Yes, we can handle incredibly rich rows … but 240k columns are unusable by any mere mortals. Cap them at about 1000. Provide a more graceful experience when the cap is exceeded, and create a trigger for ourselves so we know about it and can let the user know.
- Raise mysql connection limits and peg them to some multiple of our application hosts. Let the golang mysql driver handle queueing of mysql requests on a per-host basis. Think more carefully about the thundering herd problem.
- Improve our ability to recover from mysql issues by using privileged connections, changing throttle utility behaviors, writing a utility to use the RDS superuser, etc.
- Fix the slow SELECT so that we aren’t causing a filesort. (Already diagnosed!)
- Quarantine bad events instead of dropping them, so that customers can debug for themselves.
- Consider changing our caching technique so that only one expensive query per node can be in flight at a time.
- Make a mysql postmortem playlist, so we can learn from these investigations and find the root cause faster next time.
In conclusion: we apologize for the outage, but we learned a lot from it. And we’re actually really grateful that it happened on a Saturday night, when it was minimally disruptive to our users. Thank you!