At Honeycomb, we love Chef! We wanted to know more about what was going on with our Chef runs, and what better tool to use to find out than Honeycomb?
There’s some server that’s taking way longer than all the others to run Chef! That’s terrible. I wonder who it is and what’s taking so long! A quick filter for runs over 100 seconds yields one host:
doodle-0eaabcabc14343c1e is the culprit!
Upon closer inspection, ~65% of that time is spent converging the node. That’s super slow! Looking at the node shows a process has spiraled out of control and is consuming 100% of available CPU, so the Chef run was just starved.
OK, nothing too terrible this time. Let’s set a quick trigger to let us know when some other server takes too long or fails a Chef run so we can go make sure that all our infrastructure stays up to date.
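To make the trigger condition concrete, here is a minimal sketch in plain Ruby of the check it expresses: flag any run that failed or ran longer than a threshold. This is not Honeycomb's trigger configuration or the cookbook's code; the `ChefRun` struct, its field names, and the sample values are all hypothetical, and the 100-second threshold is simply borrowed from the filter used above.

```ruby
# Hypothetical record for one Chef run; field names are illustrative only
# and are not the fields the cookbook actually sends.
ChefRun = Struct.new(:hostname, :elapsed_seconds, :success)

# Threshold borrowed from the "runs over 100 seconds" filter in this post.
SLOW_RUN_SECONDS = 100

# Returns the runs worth alerting on: any run that failed,
# or any run that took longer than the threshold.
def runs_needing_attention(runs, threshold = SLOW_RUN_SECONDS)
  runs.select { |run| !run.success || run.elapsed_seconds > threshold }
end

# Made-up sample values for illustration, not real measurements.
sample_runs = [
  ChefRun.new("host-a", 42.0, true),
  ChefRun.new("host-b", 180.5, true),
  ChefRun.new("host-c", 12.3, false)
]

flagged = runs_needing_attention(sample_runs).map(&:hostname)
puts flagged.inspect
```

In practice the trigger would be defined in Honeycomb itself rather than in code like this; the sketch only spells out the "too slow or failed" rule.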
Find out what your Chef runs are doing! We have put the cookbook used to send Chef run data to Honeycomb up on GitHub: https://github.com/honeycombio/chef-handler-honeycomb
Load it into your Chef config and try it out!
P.S. Huge thanks to coderanger for all the help sorting out Chef internals!