This is a guest post by Ryan Ordway, DevOps Engineer at Oregon State University.
At Oregon State University Libraries & Press (OSULP) we have been using Honeycomb for about 18 months. We were in the early stages of automating our infrastructure and needed an APM solution that could scale with us. New Relic was becoming too expensive, and we couldn’t afford to monitor our whole infrastructure and trace all of our applications anymore. Thus began our Observability journey.
Honestly, when we first started using Honeycomb I didn’t really know what Observability meant yet. I wanted something like New Relic, but something we could afford. Honeycomb looked pretty cool and had a usage-based pricing model, so we gave it a try.
We had also started evaluating another vendor at about the same time, and chose it for its combination of infrastructure monitoring and APM at a price we could afford. We didn’t have the development cycles to take full advantage of Honeycomb yet, mainly because it lacked PHP support at the time and we didn’t have developer time for integration.
I still wasn’t done with Honeycomb, though. I saw what it could do; we just couldn’t take full advantage of it in our applications yet. I started testing it on other things, which, as an Operations person, meant integrating infrastructure components and back-end services.
Then full support for tracing was added and suddenly everything changed. With the Honeycomb Beelines I could easily hook up our Rails apps and start sending events without needing to take development time away from other projects to manually instrument our code. I started integrating Honeycomb alongside our other vendor into our major projects.
What I found (or didn’t!) was interesting. Whenever we had an incident, it was often difficult to find useful data in our other vendor’s tool. Where there should have been traces to help us see what was going wrong, there often wasn’t much of anything.
In contrast, Honeycomb always had the traces we were looking for, just not always with enough context to diagnose the problem. I don’t know about you, Dear Reader, but I would much rather manually add the right context to a trace or event I can always find! These incidents helped fuel my decision to drop our other vendor and put those licensing costs toward increased Honeycomb capacity. We could have spent even more with the other vendor to try for a more usable experience, maybe, but Honeycomb was the sure thing.
It took us a while to get here, and we’re by no means done with our journey. If anything, we are just getting started.
What Honeycomb has helped us do
The most important thing for us is saving money. I’ve finished migrating all of our tracing from our other vendor to Honeycomb, using the same little Honeycomb sandbox I started this adventure with. Even better, I did it without having to pay more. That’s huge!
Without host count limits, I have been able to integrate services that previously had only basic monitoring. Suddenly we are learning how things ACTUALLY work, which is sometimes completely different from the behavior we previously had to infer or outright guess.
The combination of automated event ingest and Honeycomb’s Triggers is a powerful callback mechanism. I’m very early into a project building a webhook receiver that will close the feedback loop between an issue being spotted in our event data and triggers firing off actions. Eventually it will be able to file GitHub issues, restart workloads, or anything else we can dream up.
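The core of a receiver like that is just routing an incoming trigger notification to an action. A minimal sketch of that dispatch step is below; the payload field names (`name`, `status`), the status value, and the action names are assumptions for illustration, not Honeycomb’s documented webhook schema, and the real receiver would sit behind an HTTP endpoint and actually call the GitHub or Kubernetes APIs.

```python
import json

def handle_trigger(payload: dict) -> str:
    """Route a trigger notification to an action based on the trigger name.

    Field names and action labels here are hypothetical placeholders.
    """
    name = payload.get("name", "")
    status = payload.get("status", "")
    if status != "TRIGGERED":
        return "ignored"            # resolved or unknown states need no action
    if name.startswith("drupal-security"):
        return "open-github-issue"  # e.g. file an issue for the web team
    if name.startswith("k8s-"):
        return "restart-workload"   # e.g. kick the affected deployment
    return "notify-oncall"          # default: page a human

# Example: a notification that a (hypothetical) Drupal security trigger fired
event = json.loads('{"name": "drupal-security-scan", "status": "TRIGGERED"}')
print(handle_trigger(event))
```

Keeping the dispatch logic in a plain function like this makes it easy to unit-test without standing up a web server.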
Another simple win was automating incident response when we detect potential security issues with our Drupal websites. Before, our web developers might not have known a site was having issues until it started impacting the service. Now we spot issues before a potential attacker can gain a firm foothold.
One of our biggest wins with Honeycomb continues to be the integration of OSULP’s Institutional Repository, ScholarsArchive. The repository is built on the Samvera stack, a Rails-based framework for building information repositories from somewhat pluggable components like the Fedora database engine, search indexes like Solr, and so on. It’s a complex enough system that it really needs observability, especially because some of our colleges require a student’s thesis or dissertation to be deposited into ScholarsArchive before they can graduate. Student success is very important to us, and the last thing we want is to cause a student’s graduation to be delayed. Honeycomb makes it easier to quickly locate submission problems so they can be fixed.
Now, with the coronavirus outbreak, we’re experiencing sudden changes. Staff, faculty, students, and researchers are all working from home, which has meant an increased reliance on remote access to digital collections and electronic resources. Honeycomb has been helping us spot potential problems with our proxies and other services that our community relies on for work and education. Being able to point our e-resources staff to graphs and charts showing the health of a service has done a lot to alleviate their anxiety and build confidence in this strange new educational landscape.
We’ve recently expanded our storage, so now I have a few Observability-related improvements planned. Our infra team is in the midst of migrating most of our services to Kubernetes. We’re working to push those infrastructure events up to Honeycomb for analysis and to feed our automations.
We’re also planning to instrument our CI/CD pipelines, instrument our backups and preservation systems, and pour more of that sweet Honeycomb deeper into our existing integrations.
We’re planning more staff training to spread Honeycomb debugging skills around our team, and hopefully to extend that to staff and faculty outside our department as we integrate our Inter-Library Loan and other core Library systems into Honeycomb, something we’ve NEVER been able to do with ANY of our other tools before.
As the team responsible for supporting the library’s systems and keeping everyone connected, we’ve found that Honeycomb lets us answer more questions than we thought possible. Even if you don’t realize it yet, you need observability in your systems, just like we needed it in ours.
If your team is overwhelmed during the COVID-19 crisis because you’re working to keep people safe and/or at home, Honeycomb wants to help. Reach out to our support staff via Intercom (the chat bubble in the lower right-hand corner of this blog), or schedule time with Liz during her observability office hours.