The CoPE and Other Teams, Part 1: Introduction & Auto-Instrumentation
The CoPE is made to affect, meaning change, how things work. The disruption it produces is a feature, not a bug. That disruption pushes things...
Destroy on Friday: The Big Day 🧨 A Chaos Engineering Experiment - Part 2
In my last blog post, I explained why we decided to destroy one third of our infrastructure in production just to see what would happen....
What Makes for a 'Good' Pair Programming Session?
Software changes so rapidly that developing on the cutting edge of it cannot fall to a single person. When it comes to asynchronously disseminating information...
Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment - Part 1
We recently took a daring step to test and improve the reliability of the Honeycomb service: we abruptly destroyed one third of the infrastructure in...
Staffing Up Your CoPE
Getting the right people working in the CoPE is crucial to success because these change agents must limber up the organization and promote the flexibility...
Navigating Software Engineering Complexity With Observability
In the not-too-distant past, building software was relatively straightforward. The simplicity of LAMP stacks, Rails, and other well-defined web frameworks provided a stable foundation. Issues...
Investigating Mysterious Kafka Broker I/O When Using Confluent Tiered Storage
Earlier this year, we upgraded from Confluent Platform 7.0.10 to 7.6.0. While the upgrade went smoothly, there was one thing that was different from previous...
Independent, Involved, Informed, and Informative: The Characteristics of a CoPE
In part one of our CoPE series, we analogized the CoPE with safety departments. David Woods says that those safety departments must be: independent, involved,...
Establishing and Enabling a Center of Production Excellence
Software is in a crisis. This is nothing new. Complex distributed systems are perpetually in a state far from equilibrium, operating in what Richard Cook...
Simulation Theory, Observability, and Modern Software Practices
The 1981 book Simulacra and Simulation by Jean Baudrillard is widely read and cited within academic circles but also permeates popular culture, influencing films, literature,...
What Is Application Performance Monitoring?
Application performance monitoring, also known as APM, represents the difference between code and running software. You need the measurements in order to manage performance....
Where Does Honeycomb Fit in the Software Development Lifecycle?
The software development lifecycle (SDLC) is always drawn as a circle. In many places I’ve worked, there’s no discernable connection between “5. Operate” and “1....
Product Managing to Prevent Burnout
I’ve been thinking about a risk that—if I'm not careful—could severely hinder my team's ability to ship on time, celebrate success, and continue work after...
What Do Developers Need to Know About Kubernetes, Anyway?
Stop me if you’ve heard this one before: you just pushed and deployed your latest change to production, and it’s rolling out to your Kubernetes...
What Happens to DevOps when the Kubernetes Adrenaline Rush Ends?
Kubernetes has been around for nearly 10 years now. In the past five years, we’ve seen a drastic increase in adoption by engineering teams of...