SEDNA delivers a transaction management system to help teams collaborate with others, organize information, and manage a job from start to finish.
- REST API
- backend served in 2 places: Kotlin and Node
The Product team at SEDNA recently started an effort to consolidate and bring consistency to their monitoring and alerting. While reviewing records of past incidents for context, they realized that some of the most critical institutional knowledge for troubleshooting issues was bound up in a few individuals. This made it difficult for newer team members to find the information they needed during incidents.
For a lot of our recent incident reviews, we were talking about the same things, over and over, and not capturing the info needed to resolve them.
Without the full context of high-cardinality events, the product team relied heavily on the deep institutional knowledge of one or two people whenever they had to dig into the data around an incident. These individuals were not always available to provide the context needed to identify and resolve a given issue. As the team and business grew, this gap caused problems more and more often.
What They Needed
- A platform that allowed them to consolidate the experience and query history of the entire team into one easily searchable interface
- An observability tool that supports fast querying of full-context, high-cardinality events, so any team member can get to the causes of a given incident
Honeycomb @ SEDNA
As soon as the team at SEDNA began to deploy Honeycomb, they made major headway. They began by sending in their ALB logs.
We actually got a lot to start with, without putting any new instrumentation in the app.
Right away, they noticed useful application information:
We discovered that a bunch of 403s come from this one customer; we couldn’t tell you that before.
They immediately moved on to adding further context (tenant ID, user ID, message ID, client version) to their HTTP requests, focusing on their biggest endpoint, Search.
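As a rough illustration of what adding this context can look like, here is a minimal sketch that merges the fields named above into one flat event per HTTP request. The type names, the field names (such as `app.tenant_id`), the `/search` path, and the sample values are all assumptions made for the example. They are not taken from SEDNA's code or from any Honeycomb SDK.

```typescript
// Hypothetical shape of the per-request context described above.
interface RequestContext {
  tenantId: string;
  userId: string;
  clientVersion: string;
  messageIds: string[];
}

// Minimal view of an HTTP request/response pair; real frameworks expose more.
interface HttpExchange {
  method: string;
  path: string;
  status: number;
  durationMs: number;
}

// One wide, flat record per request, so every field can be filtered or grouped on.
type WideEvent = Record<string, string | number>;

function buildRequestEvent(http: HttpExchange, ctx: RequestContext): WideEvent {
  return {
    "http.method": http.method,
    "http.path": http.path,
    "http.status": http.status,
    "duration_ms": http.durationMs,
    "app.tenant_id": ctx.tenantId,
    "app.user_id": ctx.userId,
    "app.client_version": ctx.clientVersion,
    // Join the returned IDs into one string so the event stays flat.
    "app.message_ids": ctx.messageIds.join(","),
    "app.message_count": ctx.messageIds.length,
  };
}

// Sample values are made up for illustration.
const searchEvent = buildRequestEvent(
  { method: "GET", path: "/search", status: 200, durationMs: 42 },
  { tenantId: "tenant-1", userId: "user-7", clientVersion: "1.2.3", messageIds: ["m1", "m2"] },
);
console.log(JSON.stringify(searchEvent));
```

Keeping every field on a single event per request is what allows later questions, such as "which tenant" or "which client version", to be answered without adding new instrumentation first.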
While Grace, one of the aforementioned single sources of institutional knowledge, was on vacation, Ammar solved a major issue that had been plaguing the Dev team and their users:
SEDNA offers a list view of emails received. It populates the list based on the customer’s search through their emails. With batched infinite scrolling, every once in a while, the list would appear to suddenly miss a chunk of messages. Lots of users were experiencing it, but no one was able to reproduce it internally.
I started playing around with the Honeycomb interface and realized I knew enough to be able to narrow it down based on request path, user ID, and some other fields. We had originally thought the back-end was returning bad data, but we figured it out immediately with Honeycomb: unexpected web sockets were the cause. This was the first time we had records for a search request and what message IDs were returned, and that instrumentation/traceability is what solved the problem. I didn’t have the institutional knowledge to tie the request parts together, but tracing got me there!
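The narrowing described here amounts to filtering events on a few fields and then reading the matching requests in order. The sketch below shows that idea over an in-memory list. It does not show how Honeycomb runs queries, and the `SearchEvent` shape, the `/search` path, and the sample data are assumptions for the example.

```typescript
// Hypothetical record of one search request and the message IDs it returned.
interface SearchEvent {
  timestamp: number; // epoch milliseconds
  path: string;
  userId: string;
  messageIds: string[];
}

// Keep only events for one path and one user, oldest first.
// filter() returns a new array, so the caller's list is not reordered.
function narrow(events: SearchEvent[], path: string, userId: string): SearchEvent[] {
  return events
    .filter((e) => e.path === path && e.userId === userId)
    .sort((a, b) => a.timestamp - b.timestamp);
}

// Flatten the message IDs returned across the given requests, in order.
function returnedIds(events: SearchEvent[]): string[] {
  let ids: string[] = [];
  for (const e of events) {
    ids = ids.concat(e.messageIds);
  }
  return ids;
}

// Sample data is made up for illustration.
const events: SearchEvent[] = [
  { timestamp: 3, path: "/search", userId: "user-7", messageIds: ["m5", "m6"] },
  { timestamp: 1, path: "/search", userId: "user-7", messageIds: ["m1", "m2"] },
  { timestamp: 2, path: "/search", userId: "user-9", messageIds: ["m3", "m4"] },
  { timestamp: 2, path: "/inbox", userId: "user-7", messageIds: ["m9"] },
];

const matching = narrow(events, "/search", "user-7");
const ids = returnedIds(matching);
console.log(ids.join(","));
```

With the returned IDs laid out per user and per request, it becomes possible to check whether the back-end actually sent the messages that appeared to be missing from the list.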
From Grace’s point of view:
I was on vacation last week, but I got a text from Ammar one day because he was over the moon about a really difficult bug for which Honeycomb helped identify the cause. It had been reported for ~2 months, there were ~15 customer tickets, and multiple red herrings in our investigation. We have a weekly demo meeting, and this morning Ammar was able to show everyone how he could finally understand the issue via Honeycomb and then demonstrate the fix. There have been other smaller wins, too, but I suspect this one in particular was a big relief for many.
I keep thinking back to older problems. Many took days or weeks to understand, and we could have solved them in moments with Honeycomb.
We had so much reliance on institutional knowledge, and now we feel more powerful.