Reducing Mean Time to Diagnosis: How Salary Finance Uses Honeycomb to Ask the Right Questions
By Rebecca Carter | Last modified on August 9, 2023
Salary Finance is a UK-based financial well-being employee benefit program. Over the last seven years, the company grew from a startup to a scaleup, earning rave reviews along the way from its more than 4,000 customers. However, with fast growth also comes natural growing pains. As their customer base expanded, so did the number of incidents they experienced, which also became harder to diagnose due to lack of visibility into their increasingly complex environment.
The team knew this was not sustainable. The risk of customer churn drove urgency for the engineering team to fix this issue of visibility. So when Aruna Koya joined the team just over a year ago as the Head of Site Reliability Engineering, she knew that incorporating the practice of observability was a must. She introduced Honeycomb.
An internal culture shift can be a challenge, especially for a busy engineering team—and adding yet another tool to their preexisting chain wasn’t going to help ease that transition. Koya sought out a flexible solution that would cause the least amount of friction and be adaptable to any new technologies their organization may implement in the future. Additionally, she wanted a solution that harnessed the maximum amount of granularity from their telemetry data—i.e., worked well with OpenTelemetry (OTel).
Over the course of a year, Koya’s team was able to get nearly 100% of its platforms (the original monolithic system and a second one that is microservices-based) instrumented and actively sending data to Honeycomb. “Now, for the first time, we have the visibility to say ‘Hey, this is wrong’ before customers actually call us,” she explained.
Getting buy-in for observability and OpenTelemetry
To get the Salary Finance team as excited about observability as she was, Koya knew it was critical to demonstrate the power of OpenTelemetry. She enabled the team by providing training around OTel and its core features like the Collector and exposure to the libraries. But what really sold the team was how Honeycomb could make that rich data actionable through interactive visualizations. Once fully instrumented, the team was able to see across all sixty services in a matter of seconds.
Before Honeycomb, the team worked completely in the dark and struggled to figure out where to start. Now, with visible API endpoints, the developers know exactly what to optimize.
The SRE team has set up Service Level Objectives (SLOs) around availability and latency, but the developer team isn’t fully convinced. “We have implemented SLOs for the service and for the platform. So we monitor them as SREs, but the developers aren't on board with the SLOs yet,” she explained. “What we need to do is to let them know when a customer event, such as login journeys or login failures [are occurring] ten times a minute. Is that because somebody presented a wrong email? Is that because there's something wrong with the application? We don't know. It's invisible to them right now. So they need to add that custom event to say, ‘Okay, now let's get alerted.’ It’s a work in progress, and I’m confident we'll get there,” Koya said.
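To make the "login failures ten times a minute" idea concrete, here is a minimal sketch of a sliding-window threshold check. It is an illustration only, not Salary Finance's implementation or a Honeycomb API: the class name `FailureRateAlert`, its parameters, and the choice to evaluate the window in application code are all assumptions. In practice the custom event would be sent as telemetry and the alert condition defined in the observability tool.

```python
from collections import deque


class FailureRateAlert:
    """Hypothetical helper: flags when `threshold` or more events fall inside a sliding window."""

    def __init__(self, threshold: int = 10, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()  # timestamps (in seconds) of recent failures

    def record(self, timestamp: float) -> bool:
        """Record one failure; return True if the window now holds `threshold` or more."""
        self._timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        while timestamp - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

Calling `record()` with the time of each login failure returns `True` once ten failures land within the same sixty seconds. The check alone does not say why logins are failing, so the event would still need to carry fields, such as a failure reason, that let a developer separate a mistyped email from an application fault.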
“Before, they would have five services talking to each other, and they’d have to go look in CloudWatch for each service to see what might be going wrong,” she continued. “Now they can see those errors in Honeycomb in two seconds or less. That’s the power that’s in their hands right now.”
Reducing mean time to resolution
This newfound power also translates to the team spending less time finding the source of issues. Prior to using Honeycomb, Salary Finance’s mean time to resolution (MTTR) averaged four to eight hours. Now, their MTTR is down to two hours, which is still longer than Koya would like it to be. However, their hands are tied for the moment.
Koya explained that the application is powered by a single database and is split between its monolithic and microservices-based architectures. So Koya reframes the MTTR metric as mean time to diagnosis. “They’re not searching anymore,” she said. “They can look and right away see exactly where the problem is, and bam!, they go to work.”
Streamlining AWS complexity with Honeycomb
The Salary Finance team had issues with visibility in AWS, where the app’s database, microservices, and containers run. Koya dove into the technical details.
“We use Amazon ECS extensively across our estate,” Koya explained. “Now with ECS, let's say we have NGINX on top of the containers and we then have our .NET core services for APIs. We have NGINX serving the web pages and the UIs. What's happening, for example, is we are receiving NGINX errors that are running in a container. We can't actually log on to the container to see what's going on, we only have the CloudWatch ECS logs. So what we have to do is to go to ECS, go to CloudWatch, find the container, find the task definition, look for the logs, look for the last hour or look for the last time the event happened that we knew about and then basically search for the logs there.” A time-consuming, multi-layered process, to say the least!
“Then we say, okay, that top-level service is one service, but it calls another service,” she continued. “So now we have to go into the AWS container logs for that service and then the next service and so on… and the container logs aren't easy to read.” The challenge was clear.
Tracing with Honeycomb became essential to detangling bugs hidden beneath multiple layers of logs. “Now, the team can immediately see exceptions and stack traces in Honeycomb. They see the value. They can debug immediately now.”
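The difference between hopping through per-service container logs and following a trace can be sketched in a few lines. The example below is a simplified illustration, not Honeycomb's data model or query engine: the `Span` fields, the `error_paths` function, and the service names in the usage note are hypothetical, and it assumes the spans of a trace form a tree through their parent IDs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Simplified, hypothetical span record; real spans carry many more fields."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    error: Optional[str] = None


def error_paths(spans: list[Span], trace_id: str) -> list[list[str]]:
    """Return the service call chain (root first) leading to each errored span."""
    by_id = {s.span_id: s for s in spans if s.trace_id == trace_id}
    paths = []
    for span in by_id.values():
        if span.error is None:
            continue
        chain = []
        current = span
        # Walk parent links back to the root of the trace.
        while current is not None:
            chain.append(current.service)
            current = by_id.get(current.parent_id) if current.parent_id else None
        paths.append(list(reversed(chain)))
    return paths
```

Given spans for a request that passes through a web tier, an API, and a downstream service, a single lookup by trace ID returns a chain such as `["web", "api", "downstream"]` for the span that raised the exception, which is the information the team previously had to piece together one log group at a time.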
Leveraging observability to support compliance
Because Salary Finance is in such a heavily regulated industry, Koya needed to ensure their observability process didn’t “over-observe” into protected personal information (PI). The team needed to keep both personal details and loan data private, but also keep UK customer data within the UK, even while sending data to Honeycomb.
Koya made sure the developers understood that no personal information could ever be included in URLs. They also used AWS PrivateLink to keep other sensitive information off of the internet. Although some companies are wary about telemetry data because of concerns about data privacy, Koya said that was never an issue at Salary Finance as the application was designed to be completely secure.
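The article describes keeping personal information out of URLs by design. As a complementary, purely illustrative safeguard, a service could also scrub URLs before attaching them to telemetry. The sketch below is an assumption, not something the source says Salary Finance does: the `scrub_url` function, the `[redacted]` placeholder, and the deliberately simple email pattern are all hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Deliberately simple pattern for email-like path segments (an assumption, not a full validator).
EMAIL_PATTERN = re.compile(r"[^/@\s]+@[^/@\s]+\.[^/@\s]+")


def scrub_url(url: str) -> str:
    """Drop the query string and fragment, and redact email-like text in the path."""
    parts = urlsplit(url)
    path = EMAIL_PATTERN.sub("[redacted]", parts.path)
    # Scheme and host are kept as-is; query and fragment are discarded entirely.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

For example, `scrub_url("https://app.example.com/users/jane@example.com/loans?ref=123")` yields `https://app.example.com/users/[redacted]/loans`. Discarding the whole query string is a blunt choice that trades some debugging detail for a lower risk of leaking identifiers; an allowlist of known-safe parameters would be a reasonable alternative.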
The company didn’t need a third-party tool to make it compliant, but Koya was pleasantly surprised to learn that, with Honeycomb, the ability to see all the transactions in one place helped with compliance, governance, and ultimately audits. “Honeycomb is not our reporting tool, but it is our forensic tool,” she explained. “We can use this to investigate governance and compliance issues in a way we couldn’t see before. It’s been super useful.”
The observability payoff and the road ahead
The team at Salary Finance put in the work to implement a culture of observability. This shift has paid off in many ways: increased visibility to get ahead of incidents, as well as the added confidence to make changes to their environment. They successfully reduced their mean time to resolution and diagnosis through tracing, enabling faster debugging of their AWS instances. Lastly, streamlined governance and compliance for handling sensitive PI and audits, made possible by seamless investigation, was the cherry on top.
Interested in seeing how other customers harnessed the power of observability? Check out some of our other customer stories. We also encourage you to see the power of Honeycomb (commitment free) by exploring our sandbox.