The nuts and bolts of metrics, events, and logs are really interesting to me. So interesting, perhaps, that I get mired in these technical bits. I keep thinking of ways to process more data, allow for more fields, or get finer precision. I think about this so much that I drift into worrying more about the work than the outcome.
I bet you do this too.
Instrumentation of your systems — making them observable — is more about the people than the work. Computers don’t really need instrumentation on their own; we build it only for humans (or control systems). Once you’ve ticked the basic monitoring boxes and have the alerts set up, are you done?
Making People Awesome Is The Product
I recently read Baron Schwartz’s “Product Market Fit” post and thought about how I spend most of my day thinking about time series databases, metrics protocols, sampling, and approximate percentiles. In reality, though, my actual product is empowering the engineers of Stripe to be more awesome. You see, it doesn’t matter how fast my queries run, how accurate my percentiles are, or how well that GIF fits into my email if, at the end of the day, an engineer can’t determine whether their recent deploy is working the way they expect.
Observability is about inferring the state of systems, right? Then as a team and an industry, are we measuring and improving how well our users can gaze into the crystal ball of our tools and find the clue that unlocks a performance problem, outage, or efficiency gain? That is what we’re selling, after all, whether as an internal team or a startup. Making people awesome at that work is our goal. It’s about leveling the playing field for veterans, juniors, and specialists. It’s about bringing that spark of insight.
How To Make Awesome
I’m sorry to break it to you, but I don’t know the answer to “how do we make people awesome at observing?” I have lots of ideas, though, and I bet you do too! I’d love to hear them, but since this is my post, I’m just talking at you. I’m going to throw out some ideas to get us going:
- Can we capture how long it takes for our users to find the “clue” as to an outage’s source? A “Mean Time To Clue” seems like a great way to gauge our success and weigh improvements.
- How can we get quick clues? Can our tools be more interlinked? Does our metrics dashboard link to our log indexer? Does that alert have a link to a runbook? Is there a way to fire up the bat-signal for help?
- Can we make things easier? Can we measure and reduce toil in our common interactions? Could steps be automated or removed? Could we draw a picture instead of making so much text?
- How foolproof are we? Could we take away some work? Can common operations be pre-instrumented in libraries so users don’t even have to think about them? How can we make users more aware that they exist?
- Can we empower more? Are upstream and downstream services clearly known and can their health be surfaced?
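To make the “pre-instrumented in libraries” idea concrete, here is a minimal sketch of a timing decorator a shared library might apply to its common operations so callers get metrics for free. The `emit_metric` sink and `fetch_user` operation are hypothetical stand-ins, not anything from Stripe’s tooling; a real library would ship the measurement to a statsd or Prometheus client instead of a list.

```python
import time
from functools import wraps

# Hypothetical sink: a real library would forward this to your
# metrics pipeline (statsd, a Prometheus client, etc.).
RECORDED = []

def emit_metric(name, value_ms):
    RECORDED.append((name, value_ms))

def instrumented(func):
    """Wrap a library function so every call is timed automatically."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            emit_metric(f"{func.__name__}.duration_ms", elapsed_ms)
    return wrapper

@instrumented
def fetch_user(user_id):
    # Stand-in for a common operation a shared library might expose.
    return {"id": user_id}
```

The point of the design is that the user writes `fetch_user(42)` and never thinks about instrumentation at all; the library author paid that cost once, for everyone.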
We’re spitballing, but what I find so invigorating is that every one of these questions, thought up in 20 minutes, can be mined deeply. If we keep focusing on our users’ needs, then we’ll be less likely to lose ourselves in implementation and more likely to focus on delivering value.
Reflecting On Your Awesome Widget
Whatever widget you make, it’s really valuable to step back and look at your users’ goals rather than your technology. Metric versus log, or tags versus dotted names, matters less than your users being confident in their ability to develop, deploy, and debug. Every one of those verbs has a whole host of clever improvements we can make. All we’ve gotta do is focus on making those users awesome.
Thanks again to Cory Watson for their contribution to this instrumentation series!