Incident Review: Caches are Good, Except When They Are Bad

Between Wednesday, April 17th and Friday, April 26th, Honeycomb had four separate periods of downtime affecting the Honeycomb API, resulting in approximately 38 minutes of total downtime.

At Honeycomb, we believe that visibility into production services is important, especially when service outages are making your users unhappy. We take the impact of outages on our customers seriously, and believe that transparency is key to you trusting in and using our service. We would like to take some time to talk about the technical and non-technical issues that resulted in this outage, and discuss how we are intending to avoid similar outages in the future.

Technical Analysis

If you’ve been following our blog, you know that last October, we had an outage due to interactions between caches on our API servers and our RDS database. The downtime incidents in this case were caused by similar interactions between our API servers and databases, but due to slightly different underlying causes.

Normal Operations

Our main database server runs on RDS. By default, RDS runs FLUSH LOGS every 5 minutes in order to rotate the various MySQL logs (in particular the binlog), guaranteeing the ability to recover in the case of an outage. While the FLUSH LOGS operation is running, queries on our RDS database are blocked. Most of the time, this operation takes somewhere between 0.25 and 0.5 seconds.

Because of the high volume of incoming requests to our API servers, we implement in-memory caches for frequently requested data, such as looking up the internal ID of a dataset from the team name and dataset name. These caches have a TTL (1 minute in this case) that allows them to pick up changes to the underlying data.

For some of the functions that use caches with TTLs, a cache miss invokes a Find function that queries the database to fill the cache. This means that between a cache entry expiring and being refilled, we fall through to the database to fetch the desired value. Concurrent requests can therefore issue multiple database calls to fill the same cache entry, which is normally fine when the database is responsive.

The Problem

Four things combined to make refill load from cache expiration a problem.

  • As long as FLUSH LOGS took only 0.25 to 0.5 seconds, the additional load induced wasn’t a problem. However, during these incidents FLUSH LOGS took anywhere from 1.5 to 7 seconds to execute, probably due to higher IO utilization during those flushes.
  • A bug/misdesign in our cache: in order to avoid more sophisticated lock management in our caching code, we performed the actual delete of an expired cache entry in a goroutine, which meant there was a potential race between deleting the cache key and filling the cache with a new entry. Depending on goroutine scheduling, under load we would sometimes fill the cache and then have the fresh, perfectly valid entry deleted out from under us.
  • Database queueing. The database library that our API server uses has a connection pool with a limit on the number of connections, to limit the amount of load that a single server can induce on our central database. When all connections are in use, the database library simply blocks (presumably in a FIFO) until a connection is available. This means that not only can queries block for an arbitrary amount of time waiting for database connections to be freed, but all of the resources and memory associated with a request are also held for that arbitrary amount of time.
  • Bad luck. The amount of additional database load induced during these stalls depends greatly on which cache entries happened to expire during the stall. Unfortunately, during these windows, several high-volume cache keys happened to expire at once, generating a substantially higher than usual number of queries.

The net result of the combination of these four issues is that during the incidents, the cache fill load triggered by a database stall would induce more database load, causing longer stalls and still more cache expirations. This feedback loop eventually drove RDS to 100% CPU utilization and pushed execution times on database requests into the hundreds of seconds. It ended when all of our API servers crashed due to memory pressure, which allowed the database and our API servers to recover.

This blog post is already too long, so we won’t go into further detail here – however, if you’d like to dive in and see the details of the code, and how we were able to generate the hypothesis of how this happened, we will be following up this blog post with a more detailed technical analysis shortly.

Mea Culpa

Due to the previous incident, we had implemented new SetTTL functions for our caches that accept a cache refresh function, guaranteeing that at most one cache fill call runs at a time while expired data is returned during the fill. However, not all of our code that issued database calls had been converted to use this callback at the time. In particular, two database queries that did not appear to have caused issues in the October incident were never converted to the new callback-based cache fill mechanism. Over time, our API server volume grew, and eventually these two queries became the source of the traffic that resulted in cascading failures.

What We Learned

  • With databases, utilization is not the be-all and end-all of measuring health. Database stalls, even minor ones, are important to track and measure, as database stalls can often induce additional load on the database that can be difficult to recover from.
  • We need to invest more effort into hardening our systems against database performance issues and outages. In particular, we need to make sure that our API servers can handle brief, or even extended outages of our main database, ideally with only minor degradation of functionality.
  • Services (and code) are dynamic things – just because performance characteristics are fine now does not mean that they will be in the future. We need to be more rigorous in taking action to reduce the likelihood of future incidents of a similar nature by addressing patterns, not just specific code.

Action Items and Follow-ups

NOT FINALIZED – PENDING INTERNAL DISCUSSIONS OF INCIDENT FOLLOWUPS

Technical Followups

  • We are planning to eliminate all usage of our in-memory cache that does not go through a refresh function. We have already identified and addressed the top cache refresh calls that were using our legacy API – this should prevent any immediate recurrence of the outages that we had over the last few days. To prevent similar issues in the future, we are planning to eliminate all remaining usage of the legacy caching API, avoiding the induced load caused by increases in database response time.
  • We are investigating changing the behaviors of our database libraries to control the number and amount of database queries that will queue. In many cases, it is better for us to fail fast on database queries rather than tie up resources waiting for database requests that may never complete.
  • In the past, we have not explicitly discussed how we should expect our API servers to behave when the database is down – we are now actively designing and documenting behaviors that we expect of our servers during a database outage, and developing roadmaps to get us to the point where their behavior is acceptable. These roadmaps will include the creation of automated testing that examines the behavior of our API calls when our databases are unavailable, to make sure that we do not inadvertently reintroduce similar dependencies on the central database.

Non-Technical Followups

  • We did not adequately communicate the occurrence and duration of these incidents as they happened – we are going to be working on improving our oncall playbooks to emphasize the importance of faster, public updates to let you know what the status of our incident response is.