At Travis CI, we run over 3 million builds per week across three cloud providers with vastly different operating systems and execution environments. Debugging a customer’s build, investigating a service degradation or outage, or prioritizing our engineering work was difficult without explorable data.
I’d like to share how Honeycomb has changed the way we operate Travis CI.
Metrics and logs
In the past, we mostly monitored and debugged our production system through metrics and logs. These allowed us to see what was happening (or so we thought), but not why.
On the one hand, we have low-cardinality metrics. These tell us about our quality of service via the Four Golden Signals, which is great for alerting. However, correlation is only possible across a narrow set of dimensions, making it impossible to use metrics to answer questions such as:
- How many users are affected by this issue? Which ones?
- Which IPs are sending us the most traffic?
Another important source of data is our logs. Logs contain all of the high-cardinality information, and if you know what you’re looking for, this can be extremely useful. In a sense, logs contain the raw information that metrics are derived from. But the interface to them is usually scrolling through a wall of text.
As a result, even though the information needed to answer the above questions is present in the logs, the tooling makes it very difficult to extract.
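To make the contrast concrete, here is a toy sketch (in Python, with entirely hypothetical field names and values) of the event-per-request approach: if every request emits one wide, structured event carrying its high-cardinality fields, the questions above each become a single aggregation rather than a log-grepping session.

```python
from collections import Counter

# One wide, structured event per request, carrying every high-cardinality
# field we might later want to group or filter by. Field names and values
# here are illustrative, not Travis CI's actual schema.
events = [
    {"ip": "203.0.113.7",  "user": "alice", "endpoint": "/builds", "duration_ms": 42},
    {"ip": "203.0.113.7",  "user": "alice", "endpoint": "/builds", "duration_ms": 55},
    {"ip": "198.51.100.2", "user": "bob",   "endpoint": "/jobs",   "duration_ms": 510},
]

# "Which IPs are sending us the most traffic?" is one aggregation away:
top_ips = Counter(e["ip"] for e in events).most_common()
print(top_ips)  # [('203.0.113.7', 2), ('198.51.100.2', 1)]

# "How many users are affected by this issue? Which ones?" is another:
affected_users = {e["user"] for e in events if e["duration_ms"] > 500}
print(sorted(affected_users))  # ['bob']
```

The point is not the three-element list, of course, but that the same two lines of aggregation work unchanged at millions of events, which is the query model an event store provides.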
It became clear that we needed something better.
Another big part of wanting to improve visibility into system behaviour was to more effectively debug performance issues.
What this meant for us:
- We need to be able to see the full latency distribution, not just averages or arbitrary percentiles
- We need to be able to look at specific samples to see what outliers have in common
- We need to be able to control cost; per-request or per-host pricing can quickly become prohibitively expensive
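The first two requirements can be illustrated with a small simulation (plain Python with made-up numbers): an average, and even a median, can look perfectly healthy while the raw samples reveal a multi-second tail.

```python
import random
import statistics

random.seed(1)
# Simulated request latencies in ms: 99% fast, 1% pathological outliers.
latencies = [random.gauss(100, 10) for _ in range(990)] + [6000.0] * 10

cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]
mean = statistics.mean(latencies)

print(f"mean={mean:.0f}ms p50={p50:.0f}ms p99={p99:.0f}ms")
# The mean (~159ms) and median (~100ms) barely hint at the six-second
# tail that p99 exposes -- and only the raw samples let us inspect what
# the outliers have in common:
outliers = [l for l in latencies if l > 1000]
print(len(outliers))
```

This is why seeing the full distribution, and being able to drill into the specific outlier events behind it, mattered so much to us.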
We evaluated various products in the APM / monitoring space. Honeycomb blew us away.
Honeycomb, bringer of Events
Honeycomb met our needs perfectly.
It is difficult to appreciate this level of observability when you have not experienced it yourself. There is definitely a bit of a learning curve, but it is so worth it.
So many questions that we were unable to explore in the past are now 3 clicks away. Queries that would take half an hour to run against our Postgres read-replica can now be run in seconds. Better yet, we can get more insight into how our customers are using our product.
- Which customers are the most heavy users of our service? Are they experiencing wait times? Can we improve their experience? Would they have a better experience on a higher-tier plan?
- Which open-source users are using our product the most? Are there any bitcoin miners in there that we should shut down?
- How do job boot times correlate with the image being used? Are some shards or regions slower than others?
- Is the increase in erroring jobs correlated with a particular programming language? What is a sample job id that we can use to dig deeper?
Many of these questions were unthinkable before we had the tooling and the mental framework to ask them.
While our original goal was to gain more insight in order to better operate our service, Honeycomb ended up being used extensively by our Sales and Product departments as well, which was a positive surprise.
A surprising discovery
In order to give you a sense of how invaluable Honeycomb is to Travis CI, I’d like to share one of the earlier investigations we performed with Honeycomb, about a year ago. We perform similar investigations on a regular basis, but this one really stuck.
Our public-facing API is one of our higher-traffic services, serving 400-500 req/s at peak. That volume of requests makes sifting through plaintext logs a non-starter in most cases.
For this API service, we were seeing some strangely high p99 latency, about 6 seconds:
We were able to confirm that result in Honeycomb:
We wanted to see if this latency could be attributed to a particular endpoint. So we grouped by endpoint, and indeed,
Travis::API::V3::Services::Builds::Find was consistently higher.
We focused in on that endpoint:
Next, we wanted to see if the latency could be attributed to a particular user. So we grouped by user, and indeed, one of our users was experiencing a consistently higher latency:
It turns out that a single (bot) user was sending us a lot of traffic to a particularly slow endpoint. So while this was impacting the p99 latency, it was in fact not impacting any other users.
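The drill-down above amounts to two group-by queries over raw request events. Here is a toy reconstruction in Python (endpoint names, users, and latencies are illustrative, not our real data) of how grouping by endpoint and then by user isolates a single slow consumer:

```python
from collections import defaultdict

def p99(values):
    """Crude p99: the value at the 99th-percentile rank."""
    s = sorted(values)
    return s[min(len(s) - 1, int(len(s) * 0.99))]

# Illustrative request events: one slow endpoint, dominated by one bot user.
events = (
    [{"endpoint": "builds_find", "user": "bot-user", "duration_ms": 6000}] * 20
    + [{"endpoint": "builds_find", "user": "alice", "duration_ms": 120}] * 80
    + [{"endpoint": "jobs_find", "user": "carol", "duration_ms": 80}] * 900
)

def group_p99(events, key):
    """Group events by `key` and compute p99 latency per group."""
    groups = defaultdict(list)
    for e in events:
        groups[e[key]].append(e["duration_ms"])
    return {k: p99(v) for k, v in groups.items()}

# Step 1: group by endpoint -- one endpoint stands out.
by_endpoint = group_p99(events, "endpoint")
print(by_endpoint)  # {'builds_find': 6000, 'jobs_find': 80}

# Step 2: zoom into that endpoint and group by user -- one user stands out.
slow = [e for e in events if e["endpoint"] == "builds_find"]
by_user = group_p99(slow, "user")
print(by_user)  # {'bot-user': 6000, 'alice': 120}
```

Each step is a few clicks in a tool like Honeycomb; the sketch just makes explicit what those clicks compute.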
This was a huge relief and allowed us to re-classify the issue, downgrading it from a major outage.
Prior to Honeycomb, we would have had to spend far more time and effort assessing impact, which in turn affects MTTR (mean time to recovery), incident-response escalation, and communication.
We’re really excited about getting more business intelligence capabilities in the monitoring space.
Making the switch to a data-driven process for analyzing and presenting actionable information helps executives, managers, and other stakeholders make informed business decisions. It improves the overall customer experience and product prioritization and development, and it helps engineers quantify the progress they make when improving existing features and services.
Honeycomb has really changed the way we work by enabling us to ask new questions and get answers quickly. If you recognize some of the issues we were having, consider giving them a try!