Honeycomb Blog

Testing In Production

Testing in production has gotten a bad rap — despite the fact that we all do it, all the time. There’s a lot of value in testing, up to a point. But if you can catch 80-90% of the bugs with 10-20% of the effort — and you can! — the rest of that effort is more usefully poured into making your systems resilient, not into preventing failure. You should actually be practicing failure regularly. Ideally, everyone who has access to production knows how to do a deploy and rollback, or how to get to a known-good state fast. Everyone should know what a normally-operating…

Read More...

Observability: What’s in a Name?

“Is observability just monitoring with another name?” “Observability: we changed the word because developers don’t like monitoring.” There’s been a lot of hilarious snark about this lately. Which is great, who doesn’t love A+ snark? Figured I’d take the time to answer, at least once. Yes, in practice, the tools and practices for monitoring vs observability will overlap a whole lot … for now. But philosophically there are some subtle distinctions, and these are only going to grow over time.* “Monitoring”, to anyone who’s been in the game a while, carries certain connotations that observability repudiates. It suggests that you…

Read More...

Our First Outage

Dear Honeycomb users, On Saturday, Aug 19th, we experienced a service outage for all customers. This was our first-ever outage, even though we’ve had users in production for almost exactly one year, and paying customers for about 6 months. We’re pretty proud of that, but also overdue for an outage. We take production reliability very seriously for our customers. We know you rely on us to be available so you can debug your own systems, so we’ve always invested effort into defensive engineering and following best practices for a massive, multitenant system. We learned a lot from this outage, so…

Read More...

Lies My Parents Told Me (About Logs)

Lots of us still believe some pretty silly things about logs. Most of these things used to be true! Some of them never really were. Sometimes they are “true enough” to get you a long way, until you run into a wall and suddenly they no longer are. Any time there are changes in your scale or maturity or environment, you may need to reconsider your assumptions about logs, and these are a good place to start. “Logs are cheap.” Not if you’re doing anything decently interesting with them, they’re not. Lots of people get intense sticker shock when they…
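To see why “logs are cheap” breaks down, it helps to run the numbers. Here is a back-of-envelope sketch — every figure in it (requests per second, lines per request, bytes per line, retention) is made up for illustration, not taken from the post:

```python
# Hypothetical log volume for a mid-sized service.
req_per_s = 2_000        # sustained request rate (assumed)
lines_per_req = 5        # log lines emitted per request (assumed)
bytes_per_line = 200     # average line size in bytes (assumed)
seconds_per_day = 86_400
retention_days = 30

# Raw daily volume, before any indexing or replication overhead.
gb_per_day = req_per_s * lines_per_req * bytes_per_line * seconds_per_day / 1e9

# Total stored under the retention window.
total_tb = gb_per_day * retention_days / 1_000

# Roughly 172.8 GB/day, ~5.2 TB retained -- and hosted log
# pipelines typically charge per GB ingested *and* stored.
```

Indexing, replication, and hot/cold storage tiers usually multiply these numbers further, which is where the sticker shock comes from.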

Read More...

Instrumenting High Volume Services: Part 3

This is the last of three posts focusing on sampling as a part of your toolbox for handling services that generate large amounts of instrumentation data. The first one was an introduction to sampling and the second described simple methods to explore dynamic sampling. In part 2, we explored partitioning events based on HTTP response codes, and assigning sample rates to each response code. That worked because of the small key space of HTTP status codes and because it’s known that errors are less frequent than successes. What do you do when the key space is too large to easily…
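For intuition, here is one simple way to derive per-key sample rates from observed counts, in the spirit of the part 2 example. The formula (rate proportional to a key’s traffic share, floored at 1) and the `target_rate` parameter are illustrative assumptions, not the method the series actually describes:

```python
import random

def per_key_rates(counts, target_rate=20):
    """Assign each key a sample rate proportional to its traffic share,
    floored at 1 so rare keys are always kept.

    With this formula, every un-floored key contributes roughly the same
    number of *kept* events: count / rate is constant across keys.
    """
    total = sum(counts.values())
    n_keys = len(counts)
    return {
        key: max(1, round(target_rate * count * n_keys / total))
        for key, count in counts.items()
    }

def should_sample(key, rates):
    """Keep 1 in rates[key] events; keys we have no rate for are kept."""
    return random.randint(1, rates.get(key, 1)) == 1

# Errors are rare, so they keep rate 1 (every event survives),
# while the flood of 200s is sampled aggressively.
rates = per_key_rates({"200": 980, "500": 15, "404": 5})
```

In this toy run the rates come out to roughly `{"200": 59, "500": 1, "404": 1}` — which is the property dynamic sampling is after: rare, interesting events survive intact.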

Read More...

Is Honeycomb a monitoring tool?

You may notice that we don’t talk about “monitoring” much, and that’s because we don’t really think of monitoring as what we do, even though it kind of is. Traditional monitoring relies heavily on predicting how a system may fail and checking for those failures. Traditional graphing involves generating big grids of dashboards that sit on your desktop or your wall, and give you a sense of the health of your system. That’s not what we do. Honeycomb is what you do when your monitoring ends. You still need some simple end-to-end checks for your KPIs, and monitoring for key…

Read More...

Instrumentation: system calls: an amazing interface for instrumentation

When you’re debugging, there are two basic ways you can poke at something. You can: create new instrumentation (like “adding print statements”) use existing instrumentation (“look at print statements you already added”, “use Wireshark”) When your program is already running and already doing some TERRIBLE THING YOU DO NOT WANT, it is very nice to be able to ask questions of it (“dear sir, what ARE you doing”) without having to recompile it or restart or anything. I think about asking questions of a program in terms of “what interfaces does it have that I can observe?”. Can I tell…
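The post is about system calls as that observable interface, but the same “ask a running process questions without restarting it” trick exists inside a single Python process too. A minimal sketch using only the standard library (Unix-only, since it relies on signals; the function name is mine, not from the post):

```python
import faulthandler
import signal

def enable_stack_dumps(signum=signal.SIGUSR1, out=None):
    """Arrange for `kill -USR1 <pid>` to dump every thread's stack
    (to stderr by default) -- existing instrumentation you can query
    while the TERRIBLE THING is still happening, no restart needed."""
    faulthandler.register(signum, file=out, all_threads=True)
```

Call `enable_stack_dumps()` at startup; later, `kill -USR1 $(pidof yourservice)` answers “dear sir, what ARE you doing” directly from the process itself.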

Read More...

Instrumentation: What does ‘uptime’ mean?

This is the second post in our second week on instrumentation. Want more? Check out the other posts in this series. Ping Julia or Charity with feedback! Everybody talks about uptime, and any SLA you have probably guarantees some degree of availability. But what does it really mean, and how do you measure it? If your service returns 200/OK does that mean it’s up? If your request takes over 10s to return a 200/OK, is it up? If your service works for customer A but not customer B, is it up? If customer C didn’t try to reach your service…
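One way to make those questions concrete is to stop treating “up” as a boolean and instead score every request. The sketch below counts a request as “up” only if it succeeded *and* returned fast enough — a hypothetical definition for illustration (the threshold and field names are mine), not a standard the post prescribes:

```python
from dataclasses import dataclass

@dataclass
class Request:
    customer: str
    status: int
    latency_s: float

def availability(requests, slow_threshold_s=10.0):
    """Fraction of requests that were 'up': a 2xx response, returned
    within the latency threshold. Answers the post's 10s/200-OK question
    with one (debatable) policy: slow successes count as down."""
    if not requests:
        # No traffic observed: treating that as "up" is itself a judgment
        # call, exactly the kind the post is poking at.
        return 1.0
    ok = sum(
        1 for r in requests
        if 200 <= r.status < 300 and r.latency_s <= slow_threshold_s
    )
    return ok / len(requests)
```

Filtering `requests` by `customer` before calling `availability` answers the “up for A but not B” question the same way: availability becomes a per-slice number, not a global light on a dashboard.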

Read More...

Instrumentation: Instrumenting HTTP Services

Welcome to the second week of our blog post series on instrumentation, curated by Julia and Charity. This week will focus more on operational and practical examples; check out previous entries for awesome posts on Finite State Machines, The First Four Things You Measure, and more! Instrumenting HTTP Services I spend most of my time at VividCortex working with and building tools for database instrumentation. We’re a SaaS platform with lots of HTTP services, so I spend time thinking about HTTP instrumentation too. Here’s my approach. Services have three sections that we need to instrument. There’s the point where requests…
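A common shape for this kind of HTTP instrumentation is a wrapper that emits one structured event per request. The sketch below is a generic illustration — the field names and the `emit` callback are assumptions of mine, not VividCortex’s schema or the approach the post goes on to describe:

```python
import time

def instrument(handler, emit):
    """Wrap a request handler so each call emits one event with the
    method, path, status, and wall-clock duration -- even when the
    handler raises (status then defaults to 500)."""
    def wrapped(method, path):
        start = time.monotonic()
        status = 500  # assume failure until the handler says otherwise
        try:
            status = handler(method, path)
            return status
        finally:
            emit({
                "method": method,
                "path": path,
                "status": status,
                "duration_ms": (time.monotonic() - start) * 1000,
            })
    return wrapped
```

The `try/finally` is the important part: instrumentation that only fires on the happy path misses exactly the requests you most want to see.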

Read More...

Instrumentation: Worst case performance matters

This is the fifth in a series of guest posts about instrumentation. Like it? Check out the other posts in this series. Ping Julia or Charity with feedback! BrightRoll’s Realtime team is responsible for a service, Vulcan, which provides a set of developer-friendly APIs for interacting with the data involved in deciding whether to serve an ad. It is a threaded, event-driven application written in C. As Vulcan is in the critical path of ad serving, it has a very tight SLA: <5 ms response time and 100% availability.[1] One otherwise fine day, the ad serving team noticed that they…
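The core point — that averages hide exactly the requests that violate a tight SLA — is easy to show with numbers. These latencies are hypothetical, not Vulcan’s:

```python
# 99 fast requests and one slow outlier (made-up sample).
latencies_ms = [1.0] * 99 + [50.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
worst_ms = max(latencies_ms)

# mean_ms is ~1.49 ms, comfortably inside a 5 ms SLA,
# while worst_ms is 50 ms: a 10x breach the average completely hides.
```

This is why tight SLAs are stated over the worst case (or a high percentile), never the mean.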

Read More...