
Part 2/5: Dear Software Engineers

By Charity Majors  |   Last modified on April 18, 2022

Observability is not a thing for operations or some other team to care about. Software engineers, you are increasingly the primary owners of your own services … and this is a great thing.

Developers, meet my friend devops

The devops movement is years old now. Since the very beginning, devops has spent a lot of energy wagging its finger at ops and telling ops to get better at writing software. We’ve spent far less energy helping software engineers internalize the need to own, instrument, and understand the consequences of the software they write and the services they own.
It’s only recently that we’re really starting to bring the dev to devops … finally.

  • “Isn’t exception tracking enough?” Nope. Exception tracking is awesome, but it still keeps you thinking about syntax errors and lines of code, not systems that interact with each other in complicated and unpredictable ways. It’s good to catch your exceptions, but what they can tell you is limited in scope.
  • “Aren’t metrics enough?” Nope. With the statsd-type model, you only get to work with counters and gauges — which is fast and scalable, but achieves speed by sacrificing context and detail. A pretty heartbreaking choice to make. With Honeycomb you don’t have to.
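To make the counter-versus-context tradeoff concrete, here is a minimal sketch in plain Python. It does not use statsd or any Honeycomb API. The names `record_metric` and `build_event`, and the field names, are hypothetical and chosen only for illustration.

```python
import json
import time
from collections import Counter

# Metrics-style: a counter keeps a name and a number, nothing else.
counters = Counter()

def record_metric(name):
    """Increment a named counter (hypothetical stand-in for a statsd-type client)."""
    counters[name] += 1

def build_event(request_id, endpoint, status, duration_ms, user_id):
    """Event-style: one structured record per request, with its context attached."""
    return {
        "timestamp": time.time(),
        "request_id": request_id,
        "endpoint": endpoint,
        "status": status,
        "duration_ms": duration_ms,
        "user_id": user_id,
    }

# The same failed request, recorded both ways.
record_metric("api.errors")
event = build_event("req-123", "/export", 500, 812.4, "user-42")

print(counters["api.errors"])             # you learn that one error happened
print(json.dumps(event, sort_keys=True))  # you learn which request, where, for whom
```

The counter can tell you that errors went up. Only the event can tell you which user, endpoint, and request were involved.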

In the glorious future, observability is just as important as unit tests, and operational skills (debugging between services and components, degrading gracefully, writing maintainable code, and valuing simplicity) are non-negotiable for senior software engineers. Even mobile SWEs or front-end SWEs.

Who owns your availability?

YOU own your availability.

This means that software engineers need to feel comfortable and confident breaking systems, understanding them, experimenting, and fixing them. Instead of being scared to break things, be hungry to risk breaking them … 1% at a time, under controlled circumstances, where you can watch, or add a trigger that rolls back or disables the feature.
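One way to break things 1% at a time is a percentage rollout with an automatic kill switch. The sketch below is an assumption about how that could look, not a description of any particular feature-flag product. The `Rollout` class, its thresholds, and the feature name are all hypothetical.

```python
import hashlib

class Rollout:
    """Hypothetical percentage rollout with an error-rate kill switch."""

    def __init__(self, feature, percent):
        self.feature = feature
        self.percent = percent  # share of users exposed, 0 to 100
        self.enabled = True
        self.requests = 0
        self.errors = 0

    def is_on(self, user_id):
        """Place each user in a stable bucket so they always get the same answer."""
        if not self.enabled:
            return False
        digest = hashlib.sha256(f"{self.feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) % 100
        return bucket < self.percent

    def record(self, ok, max_error_rate=0.05, min_requests=20):
        """Track outcomes and disable the feature once the error rate is too high."""
        self.requests += 1
        if not ok:
            self.errors += 1
        if self.requests >= min_requests and self.errors / self.requests > max_error_rate:
            self.enabled = False  # the trigger: everyone falls back to the old path

rollout = Rollout("new-export-path", percent=1)
exposed = sum(rollout.is_on(f"user-{i}") for i in range(10_000))
print(f"{exposed} of 10000 users see the feature")
```

Hashing the user id, rather than rolling a die per request, keeps each user's experience consistent while you watch what happens.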

Because you know what sucks? Having all the responsibility and none of the tools to do your job, or having to maintain two technical and cultural stacks.
We make it way too hard for software engineers to own their services when they have to learn one code base, environment, and toolset for infra and a separate one for core services. If you have to learn how to use Chef, AWS, and Graphite just to add a metric, it may not be reasonable to expect all your SWEs to provision their own metrics, particularly if a mistake is prone to spilling over and causing outages.

Old Way: Pick one: context or speed. For example, you could pick raw logs, which are horrifyingly slow and don’t scale but let you track things like latency, lock percentages, raw queries, and query families … or you could pick counters, which tell you only the count of errors per time window, or other ticks of data.

New Way: Don’t pick. <3 Have the best of both worlds. The reason context used to be slow is that you were dealing with unstructured logs and string processing, and not wielding vertical sampling to compensate for the increased data of horizontally wide events.
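The post doesn't spell out how the sampling works, so the following is a sketch under a stated assumption: you keep a fraction of your wide events, stamp each kept event with its sample rate, and use that rate to rescale counts at read time. The function names and the `sample_rate` field are hypothetical.

```python
def sample(events, sample_rate):
    """Keep every sample_rate-th event and stamp the rate on each one kept."""
    kept = []
    for index, event in enumerate(events):
        if index % sample_rate == 0:
            kept.append({**event, "sample_rate": sample_rate})
    return kept

def estimated_count(sampled_events):
    """Each kept event stands in for sample_rate original events."""
    return sum(event["sample_rate"] for event in sampled_events)

# 1000 wide events, stored at one tenth of the volume.
events = [{"endpoint": "/home", "duration_ms": 10 + i % 7} for i in range(1000)]
kept = sample(events, sample_rate=10)

print(len(kept), estimated_count(kept))  # 100 1000
```

Every kept event still carries all of its fields, so you store fewer rows without giving up the context inside each one. A real sampler would more likely choose events at random or per key instead of every Nth one.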

Adding a new detail about an event should be just as trivial as adding a new comment to your code — and just as risk-free. Get rid of the friction that prevents understanding by instrumenting your own systems. Take ownership for yourselves, and become way more badass engineers in the process. Honeycomb empowers you to take control over your own availability.

High or low cardinality fields, sparse or rich datasets

Forget your hangups when it comes to cardinality and just log everything, because we can take it and you can use it. Filter on high-cardinality fields (like millions of unique UUIDs), run aggregates, and dip down to the original raw events when you need to. Generate a unique request id and trace it up and down the stack, even if it loops back in multiple times.
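Here is a rough sketch of that workflow on an in-memory list of events: generate a request id, filter on it, aggregate, and drop back down to the raw events. It is plain Python, not Honeycomb's query API. The helpers `where` and `average_by` and the field names are hypothetical.

```python
import uuid
from collections import defaultdict
from statistics import mean

def new_request_id():
    """Generate a unique id to attach to every event a request produces."""
    return str(uuid.uuid4())

def where(events, **conditions):
    """Return the raw events whose fields match every condition exactly."""
    return [e for e in events if all(e.get(k) == v for k, v in conditions.items())]

def average_by(events, group_field, value_field):
    """Group events by one field and average another field within each group."""
    groups = defaultdict(list)
    for e in events:
        groups[e[group_field]].append(e[value_field])
    return {key: mean(values) for key, values in groups.items()}

request_id = new_request_id()
events = [
    {"request_id": request_id, "service": "api", "duration_ms": 120},
    {"request_id": request_id, "service": "db", "duration_ms": 95},
    {"request_id": request_id, "service": "api", "duration_ms": 80},  # loops back in
    {"request_id": new_request_id(), "service": "api", "duration_ms": 40},
]

# Aggregate across everything, then dip down to one request's raw events.
print(average_by(events, "service", "duration_ms"))
one_request = where(events, request_id=request_id)
print(len(one_request))
```

The request id is about as high-cardinality as a field gets, and it is exactly the field you want to filter on when you are chasing one weird request.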

Distributed grep is no longer your sole tool for tracking down edge cases. Honeycomb is much faster than traditional log aggregation because of our column-oriented datastore and read-time aggregation: we read only what we need to get you answers asap.

If you’re undertaking a project with a high degree of difficulty and subtle problems, like an API rewrite or a massive migration, surface health checks simply aren’t enough. Having deep confidence in your power to debug and instrument your own software is transformational. Relying on another team to detect and inform you when your own code has problems is not good enough.

Make data-driven decisions and know the ripple effects of the code that you write, by diving straight in and adding instrumentation without fear.

