Structured Logging and Your Team

This guest blog post from Anton Drukh of snyk.io is part of our series on structured logging.

From 1 service to over 50 today

We run a distributed engineering team at Snyk. As we scale our team and infrastructure, we want to make sure that we don’t make a mess of the daily operation of our system. We don’t want to fall into the trap of committing code that makes sense when it’s written, but not when it runs in production.

We started with a single service in our stack. Two years into adopting a microservices architecture, our stack spans over 50 services. One aspect of this growth has been a constant replication of best practices throughout our team and codebase. We want to make sure that what helps us succeed as a team is ingrained in our code.

In this post, I’d like to show how we embed trust and transparency into our services, reflecting these qualities in the way we work as a team.

For example, say you just finished working on a feature and pushed it to production. How likely is it that another team member can effectively troubleshoot this feature the following day? Does the effort involved in troubleshooting grow as the system grows, requiring ever more context to reason about? What are you doing to fight this, and is your system designed in a way that helps you?

A big part of the answer to this is in your logs. Teach your code to communicate with your team, and you will reap the benefits for as long as the code lives!


But we already do logging!

To support the examples in this post, I published a simple demo server that is very close to a boilerplate for a microservice in our stack. It is based on koa and bunyan, two popular JS packages that we use extensively on the team. Please try it out yourself!

Our logs conform to a set of rules, supporting a strong standard for our team that helps us understand what is going on with any service in our production environment:

  • Each request hitting a server is logged at least twice (start and end).
  • Each request being handled by the server is assigned a request ID that accompanies all its logs.
  • Each response sent by the server is logged with:
    • a constant string message ('Reply sent').
    • a log level identifying the response status (e.g. info for 2xx, warn for 4xx, error for 5xx).
    • a duration for which the request was being processed by the server.

The server has a single API endpoint, and most of the logic around it is about logging. The logging scheme produces the following output for each request hitting the server:

{"name":"demo-service","hostname":"foo-bar-host","pid":58465,"requestId":"5251fd1a-a41c-4bd7-be1d-c121ef38ebea","level":30,"query":{},"originalUrl":"/","method":"GET","msg":"Handling request","time":"2018-01-27T16:07:51.505Z","v":0}
{"name":"demo-service","hostname":"foo-bar-host","pid":58465,"requestId":"5251fd1a-a41c-4bd7-be1d-c121ef38ebea","level":30,"sleepDurationInMs":100,"msg":"Going to sleep for a while","time":"2018-01-27T16:07:51.505Z","v":0}
{"name":"demo-service","hostname":"foo-bar-host","pid":58465,"requestId":"5251fd1a-a41c-4bd7-be1d-c121ef38ebea","level":30,"sleepDurationInMs":100,"msg":"Woke up!","time":"2018-01-27T16:07:51.611Z","v":0}
{"name":"demo-service","hostname":"foo-bar-host","pid":58465,"requestId":"5251fd1a-a41c-4bd7-be1d-c121ef38ebea","level":30,"query":{},"originalUrl":"/","method":"GET","duration":110,"status":200,"msg":"Reply sent","time":"2018-01-27T16:07:51.613Z","v":0}

That’s a bit hard to read for us humans, but knowing our logging standard, you can see the following:

  • Each message is a single-line JSON… Yummy for machine parsing!
  • Each message has the same requestId value of 5251fd1a-a41c-4bd7-be1d-c121ef38ebea.
  • Each message has a log level of 30 (info in bunyan-speak).
  • The first message has msg: 'Handling request', and the last has msg: 'Reply sent' – these are the basic two logs per request I mentioned.
  • The last message has duration: 110, indicating the processing time of this request in milliseconds.
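Because every line is flat JSON carrying the same requestId, reassembling one request’s full story out of an interleaved log stream is trivial. A hypothetical helper (not part of the demo service) might look like this:

```javascript
'use strict';

// Hypothetical helper: group raw single-line-JSON log lines by requestId,
// reconstructing each request's timeline out of an interleaved log stream.
function groupByRequestId(rawLines) {
  const byRequest = new Map();
  for (const line of rawLines) {
    const entry = JSON.parse(line);
    if (!entry.requestId) continue; // skip lines logged outside a request context
    if (!byRequest.has(entry.requestId)) byRequest.set(entry.requestId, []);
    byRequest.get(entry.requestId).push(entry);
  }
  return byRequest;
}
```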

What seems like a lot of logging for a few lines of code is actually observability heaven. The following important questions are easy to answer once these logs are parsed by an ELK stack:

  • How many requests have been served in the last 24 hours?
  • What are the 10 most frequent errors occurring for a specific request type?
  • What are the 99th, 95th and median percentiles for duration of request handling by this service?
  • What percentage of requests are failing with 4xx status codes? 5xx status codes?
  • Does the service’s error rate correlate with time-of-day? Count of requests served?
  • Let’s say a support request came in from a user stating that they experienced a service outage at approximately 2pm today. What errors occurred during that time? Can we identify the initial request made by that user, and trace it through all of our services to understand what went wrong?
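An ELK stack answers these questions at scale, but it’s the flat, consistent shape of the logs that makes them cheap to ask. As a toy illustration (hypothetical code, not part of the demo service), a handful of lines can compute latency percentiles and error rates over a batch of parsed 'Reply sent' entries:

```javascript
'use strict';

// Nearest-rank percentile over a pre-sorted array of numbers.
function percentile(sortedValues, p) {
  const idx = Math.ceil((p / 100) * sortedValues.length) - 1;
  return sortedValues[Math.min(sortedValues.length - 1, Math.max(0, idx))];
}

// Summarize a batch of parsed 'Reply sent' log entries: because every entry
// has the same flat fields (status, duration), the questions above reduce
// to simple filters and sorts.
function summarize(replyLogs) {
  const durations = replyLogs.map((l) => l.duration).sort((a, b) => a - b);
  const count = replyLogs.length;
  const clientErrors = replyLogs.filter((l) => l.status >= 400 && l.status < 500).length;
  const serverErrors = replyLogs.filter((l) => l.status >= 500).length;
  return {
    count,
    median: percentile(durations, 50),
    p95: percentile(durations, 95),
    clientErrorRate: clientErrors / count,
    serverErrorRate: serverErrors / count,
  };
}
```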

Ok, that’s a deeper-than-usual level of logging, but why, really?

I’ve been told that not everyone is as excited about dashboards and automated alerts as I am, so allow me to try to convince you too: if all of your code conformed to this logging standard, how easy would it be to troubleshoot in production?

Let’s see an error case, included in the demo server example. Making an error-inducing request results in the following log (expanded for readability):

{
  "name": "demo-service",
  "hostname": "foo-bar-host",
  "pid": 58465,
  "requestId": "48eb72b0-088b-4470-b31b-ae19bfd561f0",
  "level": 50,
  "originalUrl": "/?error=yesPlease",
  "method": "GET",
  "error": {
    "name": "NotImplementedError",
    "message": "BOOM!",
  "stack": "NotImplementedError: BOOM!\n    at module.exports (.../koa2-bunyan-server/src/controllers/sleep.js:15:17)\n    at <anonymous>",
    "sleptInMs": 100
  },
  "duration": 104,
  "status": 501,
  "msg": "Reply sent",
  "time": "2018-01-27T16:25:58.787Z",
  "v": 0
}

What do we have here? A quick glance reveals the plentiful appearance of \n in that last log message. It makes me feel somewhat fuzzy inside, knowing the logs are practically screaming at me: “Here! Look here! The issue is here!” with that clean stacktrace. But stacktraces are not enough. They can tell you where you ended up, but not necessarily how you got there. This is where the rest of the logging structure and standards come in. Seeing the request ID on the error response log allows us to backtrack through the problematic flow all the way to its beginning. This, in turn, brings the root cause analysis within reach.
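The error object in that log entry can be produced by a serializer along these lines (a hypothetical sketch: bunyan ships a standard err serializer that emits a similar { name, message, stack } shape). Extra fields attached to the error, like sleptInMs, ride along as context:

```javascript
'use strict';

// Hypothetical error serializer: flatten an Error into the loggable shape
// seen above, keeping any extra context fields attached to the error.
function serializeError(err) {
  const out = { name: err.name, message: err.message, stack: err.stack };
  for (const key of Object.keys(err)) out[key] = err[key]; // e.g. sleptInMs
  return out;
}

class NotImplementedError extends Error {
  constructor(message) {
    super(message);
    this.name = 'NotImplementedError';
  }
}

const boom = new NotImplementedError('BOOM!');
boom.sleptInMs = 100;
// Logging { error: serializeError(boom) } alongside the 'Reply sent' fields
// yields an entry shaped like the one above.
```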

When was the last time you said ‘let’s deploy this debug logging and see what happens’? With this type of structure in your logs, you’ll say that much less frequently.

Count me in, I’m sold!

High-level visibility into your services, combined with a laser-focused ability to get to the bottom of every issue that may occur? You should not expect anything less from your logs.

What makes this really worthwhile for me is the ROI. Set this up once for your service boilerplate, and reap the benefits with every new feature you’re shipping. What’s not to like?

The team aspect

All of the above shows the immediate benefit of structured logging. But we’re in this for the long run! Working in this way over the past 2 years has positively affected how we collaborate and share responsibilities as an engineering team.


Onboarding

New hires joining our team spend a sprint getting to know the ins and outs of our system and codebase. Knowing that all services ‘speak the same language’ adds a very effective angle to this onboarding. New team members learn our system not just from the code and documentation, but from observing the production environment. Our preset dashboards and alerts tell a story worth listening to, helping new team members get up to speed with the real deal – our impact on our users.

Oncall

There are 12 developers on our team right now, each of us taking part in the oncall rotation for 7 days, approximately every 3 months. The week consists of both keeping our system healthy and working on customer support cases.
We do not have an ‘ops team’, and are not planning to have one in the near future. Our shared responsibility is to build and operate our service for the benefit of our users.

Expecting every developer on the team to be able to efficiently troubleshoot any type of issue seems intimidating at first. Prior to their first oncall week, every developer pairs with a more experienced team member to learn the oncall tools. Having structured logging as a standard makes this training experience worthwhile. Instead of focusing on the individual intricacies of each part of our system, we train on the common tools to be used for almost every kind of problem. Learning what to expect of our structured logging, as well as how to use it, is a big part of this training.

Impact

My main responsibility on the team is to make sure we are making an efficient and positive impact. I’m always on the lookout for black holes that consume a lot of energy, but produce little result. One of those could be troubleshooting! Not knowing what to do to answer ‘why?’ given a symptom of the system can be very disempowering. When a customer support issue comes in, our logs are usually the first thing we look at. Their structure makes it easy to get answers, and also to ask the right questions.

I am really proud to say that our problem-to-root-cause identification cycle takes minutes in the vast majority of the cases. This motivates me to invest in this infrastructure component of our system (this post included!) so I can amplify the impact our team makes on a daily basis.

Structured logging helps build structured teams

Standardisation through structured logging gives teams a shared terminology that glues them together. It’s a critical part of joint ownership and of breaking down siloed knowledge.
Consider troubleshooting a vague error or symptom. Knowing what to look for is the difference between a confident resolution plan and a wild goose chase.

Covering operational aspects is a natural part of how we run our engineering team at Snyk. Structured logging is a key factor enabling us to share this responsibility across the whole team. I hope sharing this practice can help your team as well, and if you like the sound of how we are doing things, check our jobs repo to see if anything looks interesting to you 🙂
