Lots of us still believe some pretty silly things about logs. Most of these things used to be true! Some of them never really were. Sometimes they are “true enough” to get you a long ways, until you run into a wall and suddenly they no longer are. Any time there are changes in your scale or maturity or environment, you may need to reconsider your assumptions about logs, and these are good enough place to start.
- “Logs are cheap.” Not if you’re doing anything decently interesting with them, they’re not. Lots of people get intense sticker shock when they look at SaaS pricing, because they think what they’re doing now is free. Lots of these same people have tens of thousands of dollars per month in submerged AWS hosting costs, just to run the instances to power their own bare-bones logging infra. Once you factor the human costs in of keeping that pipeline limping along, it’s usually more expensive.
- “I can run it better myself.” As engineers, we tend to chronically underestimate the cost of our own time. Not just the time we devote to designing, building, maintaining, and fixing support services like logging infra, but the cost of the interrupts and context switching. And then there’s the compounding cost of ripple effects: when your logging breaks just at the time you need it most due to cascading outages and ripple effects. It pays to decouple this from your core infra and outsource the cost to a vendor. You should devote your core engineering resources to your core engineering problems: the things that make your business unique. Engineer cycles will always be the scarcest resource.
- “Leveled logging is a great way to separate information.” Take Eben’s handy guide to leveled logging:
ERROR: "your code is really broken"
WARNING: "your code is broken, but only in a subtle way that will bite you later"
INFO: "irrelevant trivia"
DEBUG: "information that would help you understand what's going on if you had enabled debug logging in production which you didn't"
Separating logs into err, warn, info can be helpful, but the upwards of a dozen different error log levels we frequently see out in the wild? Nope, this is not a useful thing.
- “Environment controlled logging is a terrific way to selectively filter logs.” Or as Toshok says, “oh … you didn’t include
export DEBUG_MEMLEAKS=1, so it didn’t log those important subsystem messages, oops!” What gets even worse? When it grows to include both log level AND component/subsystem masks, e.g.
export LOGS=subsystem1:level1, subsystem2:level2.
- “A standard logging format is good enough.”” Nope. It’s 2017, it’s time to stop parsing strings with regular expressions. Use structured logging.
- “Logs are basically the same as events.” No, logs are like shitty events. Nerfed events. You want structured data generated at the client side. It isn’t any harder to generate, and it’s far more efficient to pass around. It forces you to be more clear about what information you’re emitting. It lets you be more flexible about what you’re doing with the output, and it often saves you oodles of resources and/or money by bypassing a ton of processing work. Just use structured events. If you need more convincing, go read everything @kartar has to say about it.
- “They’re just logs, they can’t impact production reliability.” See also, “Logging to disk is awesome!” You can take down nearly any production database or I/O intensive workload just by sharing a filesystem with its logging.
- “They’re just logs, they can’t possibly impact my code efficiency.” More bad news: oh yes they can, particularly in the critical execution path. Lots of people seem to think that there is no cost to dropping DEBUG statements around their code willy-nilly. This is usually true on your laptop, which is one more reason you don’t know your own code til you watch your code in production. Strings are expensivel Pulling out all the args you need to construct the log line is not zero work, esp if the log library is building a map or allocating things underneath. A terrific description with benchmarks here.
- “They’re just logs, they can’t possibly cause/fix/change race conditions.” Hahahahahaha. Oh grasshopper. Expensive logging, especially combined with levels or env vars, can totally change the context and surface or mask (or fix or cause) all kinds of racy behaviors.
- “Lossless logs are a reasonable goal.” Uh … sure, if by goal you mean “something we all super agree with in theory but will trade off for nearly anything else in reality”. When it comes to logging, good enough is good enough. Your billing system probably should not rest on the assumption that every single log line ever emitted will be captured and never dropped. The freedom to drop logs is a glorious thing, and logging systems are one of the only places where you can consistently make that tradeoff. Use it, if only for the jaunty, devil-may-care attitude.
- “You can predict what information you will need in the future to debug a new problem.” Probably not, if your problems are halfway interesting. That’s why Honeycomb takes the approach of capturing as much detail as possible, letting you slice and dice any arbitrary combination of events in real time. Logs tend to be aggregated using things like indexes and schemas, both of which are good at returning predictable results and terrible at helping you tease out unpredictable results, or unpredictable questions.
Logs are definitely better than nothing. But only a teeny tiny bit of extra effort or changes to the way you think about logging can have a massive impact on your applications’ ability to surface new problems, point your attention at the right things and prevent the wrong things from distracting you. And this means your systems will be friendlier and more resilient and receptive to operating.
Yes, we’re a vendor and we’re trying to sell you something. But we’re building this and trying to sell it because we’ve been caremad about this for years and we think it will make the world a better place, not the other way around.
Honeycomb handles structured events and is focused on answering questions like “um, what’s happening?” and “what just happened?”