This post is by Christine Spang, co-founder and CTO at Nylas, where they are building a better way to do email. They have a hybrid system of existing graphite and new Honeycomb analytics, and have been experimenting with using Honeycomb for debugging. Enjoy the story and screenshots … thanks Spang!!.
A little while ago one of our engineers, Jackie, was on call at Nylas when she got an alert that said:
Average API latency excluding draft_send, multi_send, & sync_deltas is over 1500ms.
This is one of our core API alerts. It’s a really important alert, because it means the API is running too slowly – either latency is elevated across the board, or latency is elevated a LOT for some customers. Either way, it’s really important because it’s impacting core reliability of the heart of our product.
The alert linked her to this graphite dashboard, which provided some clues about the cause:
It was clear to Jackie from this graph (and the alert history) that the high latency was caused by a spike in latency on the
message_search_api endpoint. She knew from asking her co-workers on Slack that
message_search_api proxies requests to an external email provider and then streams results to the client, so it might be slow for reasons outside our control.
We wondered if it might have been caused by slow search queries. But how to verify? When you have lots of customers, it can be hard to generate dashboards for every single one of them multiplied by everything that might possibly go wrong.
Enter Honeycomb. I pointed Jackie at the API logs in Honeycomb and a couple minutes later, unaided, she came back with this graph, which is a count of requests broken by api key:
“Could it be the yellow account??”
Jackie shared the URL for the search with me and I added on the P99(request_time), which confirmed the theory:
From there, I added a filter to select only that account’s rows, and headed over to data mode to get the raw HTTP requests.
Thanks to our earlier configuration while getting logs in to Honeycomb, we store stack traces directly on the log lines for all API requests, aggregating them in the request context before logging at the end of the request. So data mode tells us that something extremely bizarre is going on here:
It turned out that every single one of these extremely slow API requests was eventually throwing a SearchBackendException and returning to the client with “-502” (a negative 502) after hundreds to thousands of seconds, which is very strange and really not supposed to be happening.
We haven’t closed out the case yet on this bug, but Honeycomb led us straight to the money. We might have been able to get to this conclusion in maybe 10x the amount of time by messing around with Kibana … but the grouping & calculation features and general speed of Honeycomb saved us the hassle. Without Honeycomb, we probably would have been like
¯_(ツ)_/¯, excluded that search endpoint from the latency alert and moved on.
Postscript – As this blog was being reviewed and posted, Spang and her team deployed the fixes to production. Here is a graph of the negative HTTP results going down and vanishing from prod. Yay!!! ~ charity
except socket.error as ex: self.status = 'socket error: %s' % ex if self.code > 0: self.code = -self.code raise self.response_length += len(data)
Sign up to try Honeycomb for free!