Founded in 2007, ecobee is a Canadian company that makes smart thermostats, temperature sensors, light switches, cameras, and contact sensors that keep your home comfortable when you’re there and save you money when you’re not.
- Node, Java, and Go
- Instrumented with Beelines & OpenTelemetry
- Prometheus, Alertmanager, and Grafana
- AWS, Google Cloud, and on-premises
For ecobee squads, using Prometheus and Grafana for system and application monitoring helped identify broad, aggregate performance issues such as spikes in latency. However, in-depth performance tuning proved elusive, especially because performance optimizations are often measured in milliseconds. The squads needed the ability to drill down along any arbitrary path to understand specific sources of system latency. They also needed better ways to focus on user experience. The team knew they needed observability tooling, but a key consideration was whether they should build or buy those tools.
When something happens on one platform or a distinct part of our service, we want more visibility into what’s happening in the other parts of the platform.
What They Needed
- Greater visibility so they could drill deep into performance issues and identify key optimizations.
- An onboarding path for highly distributed and autonomous teams organized around the squad model, since many squads use the same engineering tools.
- A way to focus engineering decisions around what actually matters to customers.
- A clear way to decide whether it’s better to buy or build.
Use Case: System Optimization
The “last mile” of performance tuning had proven to be one of the hardest problems for ecobee to tackle until they started using Honeycomb. The ecobee Beehive team manages API services for consumer-facing mobile apps. When the team started using Honeycomb, they quickly discovered that the ability to drill down arbitrarily let them understand the source of any system latency. The team now uses Honeycomb to continuously optimize performance, often measuring improvements in milliseconds. Those seemingly small optimizations have a big collective impact when serving millions of customers.
We were observing regular spikes in our API latency that pushed us far beyond our SLO. We were spending lots of time digging through our metrics, trying to correlate the data and come up with an explanation. It quickly became clear that we were wasting time and ultimately we were just guessing and getting nowhere. We needed something that would gather more detailed data and present it in an intuitive way for us to dig into.
Observability: Buy vs Build?
For the team, a key factor in this decision was engineering bandwidth. Building would have required assigning engineers to create a custom in-house toolchain, burning many cycles that engineering couldn’t afford to waste. Alan Hietala, one of ecobee’s Tech Directors, and his team decided that buying observability tooling was the smarter business decision. Honeycomb’s subscription costs were more attractive than the team’s estimated cost for designing, building, and operating their own bespoke solution over time.
We previously spent a lot of time fumbling in the dark. Some teams say they would not be able to do their job without Honeycomb. It’s absolutely critical.
Adoption Starts with Honeycomb’s Free plan
Many engineers at ecobee first learned how to use Honeycomb via the Free plan. Giving teams a chance to experience Honeycomb’s value in a low-risk setting enabled adoption to grow quickly across different squads.
The teams started with Honeycomb’s Free plan which provides ample monthly events to really try out some of the unique features including BubbleUp. We don’t have a top-down edict when it comes to tool adoption at ecobee and it’s so much better when teams collaboratively learn from each other. If I didn’t get budget approval to purchase Honeycomb, I feel like my team would have chased me out of the building because they started to depend on it daily to do their jobs.
ecobee’s Favorite 3 Letters – SLO
Developers, SREs, and team managers at ecobee all use Honeycomb today. All engineers are responsible for checking in their code and deploying new releases to production using CircleCI. Some teams are starting to use Honeycomb’s Service Level Objectives (SLO) feature, available in the Enterprise plan.
Honeycomb has made implementing SLOs easy once you agree on the criteria. Previously, you’d have to go to Grafana or Prometheus and start building backwards: you start by building the correct SLI (indicator) to inform on the stated SLO, which of course is time-consuming and error-prone.
ecobee found Honeycomb’s SLO feature intuitive and easy to use, with many examples that helped them get started. They could easily set up a new SLO, let it run for a few weeks, and iterate over time. Erol’s SRE team also works closely with business needs. Honeycomb’s SLO feature helps keep everyone informed about how production is performing at any point in time.
It’s not just pretty graphs! SLOs really tell you where to focus, based on what matters to customers. It informs the team to make sound engineering decisions. We now can decide: do we work on availability of our services or do we release a new feature? It’s really that simple and so important. I love that SLOs are baked into the core product and I don’t have to think about this as a separate tool or budget item.
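The relationship between an SLI, an SLO, and the resulting "availability or features" decision can be sketched in a few lines. This is a minimal illustration, not how Honeycomb computes SLOs: the function name `evaluate_slo`, the 300 ms latency threshold, and the 99% target are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # fraction of events that met the latency threshold
    budget_total: float      # number of bad events the SLO tolerates in this window
    budget_remaining: float  # tolerated bad events left (negative if overspent)


def evaluate_slo(latencies_ms, threshold_ms=300.0, target=0.99):
    """Evaluate a latency SLO over one window of request durations.

    threshold_ms and target are illustrative assumptions, not ecobee's values.
    """
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("need at least one event to evaluate an SLO")

    # SLI: the share of requests that were "good" (fast enough).
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    bad = total - good

    # Error budget: how many bad events the target allows in this window.
    budget_total = (1.0 - target) * total
    return SloReport(
        sli=good / total,
        budget_total=budget_total,
        budget_remaining=budget_total - bad,
    )


if __name__ == "__main__":
    # 97 fast requests and 3 slow ones in the window.
    sample = [120.0] * 97 + [450.0] * 3
    report = evaluate_slo(sample)
    if report.budget_remaining < 0:
        print(f"SLI {report.sli:.2%}: budget overspent, prioritize reliability work")
    else:
        print(f"SLI {report.sli:.2%}: budget available, safe to ship features")
```

A negative `budget_remaining` is the signal to prioritize reliability work over new features; a positive one suggests there is room to keep shipping.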
What is Honeycomb's BubbleUp?
When there's a problem, which field has the clue? BubbleUp analyzes behind-the-scenes data, identifying the most likely fields with the keys to what's causing outlying behaviors.