Ask Miss O11y: How Can I Add o11y to Databases?By Charity Majors | January 6, 2022
How do we bring observability to the DB world? In the SQL Server world, you can marry up perfmon and extended event traces but is that enough? What about the other DBs?
Oh goody, I’m so tickled to get this one. *rubs hands gleefully*
Funny story, back in 2016–2017 we thought we were building Honeycomb primarily for DB use cases. The use cases are that killer. I’ve never seen another tool do the kinds of things you can do on the fly with Honeycomb and databases.
You can sum up the write lock time, break down by normalized query to see what’s actually making you slow, or sum up all the execution time, then maybe group by raw query and normalized query to see what’s slow. Or sum up rows_scanned and break down by query to see where you’re definitely missing indexes. Or just do a substring search for whatever error you’re seeing in the logs. Or check to see if your indexes are getting hit on the slow queries or if the query planner is inconsistent. That list keeps going!
Seriously … anybody can be a DBA with the right tooling.
But first, you have to feed Honeycomb the right data to make that possible. Let’s talk about that for a minute. I know you asked about databases in particular, but your question could just as well be about ANY software you run or administer in production when you don’t have the ability to add instrumentation.
There are, roughly speaking, two kinds of software in the world: the code you ship and the code you manage.
If it’s software you ship, that means:
- You (collectively) wrote it and maintain it
- This is your secret sauce—who you are as a business
- You deploy it regularly
- You deploy it swiftly
- You instrument it as you update it, just like you comment the code
- The feedback loop between writing the code and seeing your changes via instrumentation is brief
- It’s a good fit for observability
Conversely, if it’s software you manage, that means:
- Someone else writes and maintains it
- Hopefully you have the source code, tho :)
- This is infrastructure. The stuff you have to do to get to the stuff you want to do
- You (should) manage it with package managers
- You rarely need to upgrade or change the software
- You can’t make changes to the code (well, you shouldn’t)
- It’s a black box to you, except for the logs and/or metrics they chose to emit
- E.g., ssh, haproxy, nginx, mysqld, memcached, etc.
- It’s a good fit for monitoring tools
In my opinion, you should never try to solve both problems with the same solution. Don’t ship all the code you have to manage every time you merge a diff, ffs. And don’t try deploying the software you ship with the same tools you use to provision infrastructure. There might be some exceptions out there, but it’s a good general rule.
Back to your original question: Unless your business is writing database software, databases clearly belong in group two—the software you manage, aka the black boxes. That means you aren’t going to be able to instrument the source code and get elegant arbitrarily wide structured data blobs out, one per request, are you? Which leaves only one option: it’s time to jerry-rig it.
Jerry-rigging a black box
No two jerry-rigs are the same. There’s no universal way to make a janky solution. But when it comes to databases, there is a general pattern to follow.
The first thing to check is for log files—the more verbose the better. Turn that verbosity all the way up. If there are extra modules that add more debugging information, consider adding those, too. If there are user-defined logs, even better!
In a non-database example, with nginx, a whole bunch of stuff isn’t logged by default. Yet there’s some great stuff in your log_format spec ($request_time, $server_name, etc); you can add values to the RequestQueryKeys option that creates queryable columns for any of the query strings in your URL—there’s even a field where you can print out arbitrary headers that you defined in the service and passed back on the way out. With nginx, populating the status.log is pretty much all you care about.
With databases, it gets a little tougher. First, you should absolutely start with the logs. With mysql, capture the slow query log (print out all queries, if your I/O can take it!). In particular, Percona adds a ton of diagnostic information you can configure to print out. Of course, you’ll need to run a tailer that converts that data into a structured log line with lots of k/v pairs.
That’s a start. But not everything can be output in a server.log or slow_query.log. For example, you can’t get the current queue length or how long queries in flight have been running from a log file. You can, however, get that information by connecting to a shell via the command line, dumping out all internal statistics, and sending those metrics to Honeycomb. You could set that up via cron, maybe once every 10 seconds?
A third thing you can do is open a tcpdump over ssh from another host, and parse the output of that into a nicely formatted wide event. THIS actually gives you the ability to reconstruct transactions. (Well, this works with mysql, anyway. YMMV)
You get the drill. Enrich the logs by wringing as much interesting data out as you can, every possible place you can. If there are internal stats, dump those out. If there’s something else you need but you just can’t get at it, take out your trusty rusty set of command line tools. When there’s a CLI, there’s a way.
And, lastly of course, from the application perspective, you should be tracking things like time elapsed, raw query, normalized query, etc. (You get this for free using one of our OpenTelemetry distros.)
Wait, is that really observability?
Well—no, not really.
The process I’ve described is not about crawling deep into the heart of your code and reporting state via telemetry generated by first-hand instrumentation. This process is most definitely not about the rapid feedback loop of observability-driven development, where you’re shining a light on your code as you write it. And it certainly isn’t about having the ability to “ask ANY question,” which is what you need in order to handle unknown-unknowns. So definitionally speaking, what I’ve described doesn’t really get you database observability.
This process is shaped much more like monitoring. You’re generating a lot of second-hand data from a variety of sources. If you squint and munge the data a whole bunch, and if the software in question allows you to expose tons of inner detail, then you can see a lot more detail than you were seeing before. You can shove that data into an observability tool. And when you do that, you can correlate what’s happening inside that black box of a database to how it’s affecting your application’s performance. Which is what really matters.
Observability is really meant for the code you write and ship every day: your crown jewels. But you can get some pretty powerful insights out of black boxes nevertheless, with a bit of elbow grease. :)
Have a question for Miss O11y? Send us an email!
Dear Miss O11y, I’ve been told I need to use the OpenTelemetry Collector, but I have no idea what it is, or why I need...
More often than not, as developers, when we get a report that a large customer is hitting 500 errors, there's a flurry of activity. What's...