
Establishing and Enabling a Center of Production Excellence

By Nick Travaglini  |   Last modified on May 14, 2024

Software is in a crisis. This is nothing new. Complex distributed systems are perpetually in a state far from equilibrium, operating in what Richard Cook has called a “degraded mode.” It’s through a combination of technical artifacts, organizational practices and policies, and pure gumption that they manage to maintain themselves through time.

However, there are some organizations that seem to have an easier time of it than others. Resilience is an activity they perform, rather than a property attributable to the organization. They don’t take that achievement for granted. 

Software is both a product of and part of a sociotechnical system. Unfortunately, the “socio” bit tends to be underappreciated. In this post, we’ll talk about the concept of a “Center of Excellence” or “Observability Guild” or just a plain ol’ bunch of people who have questions for their system, and how such a group can bolster the greater organization and help it achieve production excellence.

Change from within

There are at least two more or less formal institutions that an organization can create to coordinate its observability efforts. The more formal of these is a “Center of Excellence,” while the less institutionalized version is sometimes called a “Community of Practice” or “Guild.” For our purposes, we’ll define either one as a functional subsystem within an organization meant to adjust that organization’s behavior, typically to improve some designated dimensions. In other words, the goal is to understand an organization from the inside so well that the group can engage in constructive criticism. For the sake of concision, I’ll refer to this as a Center of Production Excellence (CoPE).

In this sense, they need to have a certain degree of authority and autonomy. This is analogous to safety departments in organizations that attempt to do resilience engineering. According to David Woods, those departments must be:

  • Independent
  • Involved
  • Informed
  • Informative

The people who participate must have direct operating experience and come from different parts of the organization so that they can cross-check each other when evaluating the current work processes and preparing interventions and recommendations.

Yet just because this group has that experience and operates for the sake of improving things doesn’t mean that rocking the boat won’t lead to trouble. Some parts of the organization may see the group as a blessing and others as a curse. As such, the group needs access to independent funding and other guarantees that grant it sufficient power to do its work and sustain itself, in case certain factions try to impede it.

To build upon this analogy to safety departments, Provan et al. propose that the CoPE should work to induce resilience by “creating foresight about the changing shape of risk, and facilitating action” proactively. They call this “guided adaptability.” What does this look like? We’ll get into that below, but the outcome should be fewer customer-facing incidents for a similar amount of work because they’re preempted. Since those incidents are theoretically not occurring, we can’t count them except maybe as near misses. Besides those near misses, other telltale signs may include qualitative changes in what counts as an incident and a better response when we do have one. We’ll want to track these things to evaluate the efficacy of the institution, since they’ll provide a better signal about how the organization is doing than a lossy metric like the number of incidents.
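As a rough illustration of tracking more than a raw incident count, here is a minimal Python sketch. The `Event` fields, the quarterly grouping, and the sample data are all assumptions made up for this example; they are not a Honeycomb feature or a prescribed scheme, and a real CoPE would pick signals that fit its own organization.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Event:
    quarter: str             # hypothetical label, e.g. "2024-Q1"
    customer_facing: bool    # False means we treat it as a near miss
    minutes_to_respond: int  # hypothetical: detection to first coordinated response


def summarize(events):
    """Group events by quarter and report signals beyond the raw incident count."""
    by_quarter = {}
    for e in events:
        by_quarter.setdefault(e.quarter, []).append(e)

    summary = {}
    for quarter, items in sorted(by_quarter.items()):
        near_misses = [e for e in items if not e.customer_facing]
        summary[quarter] = {
            "incidents": len(items) - len(near_misses),
            "near_misses": len(near_misses),
            "near_miss_share": round(len(near_misses) / len(items), 2),
            "median_minutes_to_respond": median(e.minutes_to_respond for e in items),
        }
    return summary


# Invented sample data, for illustration only.
events = [
    Event("2024-Q1", True, 40),
    Event("2024-Q1", True, 30),
    Event("2024-Q1", False, 20),
    Event("2024-Q2", True, 15),
    Event("2024-Q2", False, 10),
    Event("2024-Q2", False, 5),
]

for quarter, stats in summarize(events).items():
    print(quarter, stats)
```

In this invented data, the total number of events is the same in both quarters, but more of them are caught as near misses in the second quarter and the response is faster. The raw count alone would hide that shift.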

Differences that make a difference

When it comes to making changes in an organization, Hébert-Dufresne et al. found that it takes both bottom-up and top-down practices. Bottom-up practices are those which are taken up and transmitted horizontally through a network, while top-down practices are interventions which introduce an accelerant or a dampener. The bottom-up spread is the main driver of the process, and the top-down inflection makes it easier or harder for the spread to occur; both aspects must be understood and evaluated reciprocally. Furthermore, the researchers note that success will result in a marked qualitative change in the organization: it should be clear once the phase shift has happened.
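To make the interplay concrete, here is a toy Python sketch. It is not the model from the cited research; it is a simple logistic-style spread in which `spread_rate` stands in for bottom-up transmission and `top_down_factor` is an assumed multiplier (above 1 for an accelerant, below 1 for a dampener). All names and numbers are hypothetical.

```python
def simulate_adoption(initial, spread_rate, top_down_factor, steps):
    """Return the adoption fraction after each step of a logistic-style spread.

    spread_rate stands in for bottom-up, peer-to-peer transmission.
    top_down_factor > 1 acts as an accelerant, < 1 as a dampener.
    """
    adopted = initial
    history = [adopted]
    for _ in range(steps):
        # New adopters come from contact between adopters and non-adopters.
        adopted += spread_rate * top_down_factor * adopted * (1 - adopted)
        adopted = min(adopted, 1.0)
        history.append(adopted)
    return history


def steps_to_majority(history):
    """Index of the first step where more than half the organization has adopted."""
    for step, fraction in enumerate(history):
        if fraction > 0.5:
            return step
    return None


# Same bottom-up spread, three different top-down stances (made-up values).
baseline = simulate_adoption(0.05, 0.3, 1.0, 40)
accelerated = simulate_adoption(0.05, 0.3, 2.0, 40)
dampened = simulate_adoption(0.05, 0.3, 0.5, 40)

for label, run in [("accelerated", accelerated), ("baseline", baseline), ("dampened", dampened)]:
    print(label, "reaches a majority at step", steps_to_majority(run))
```

Even in this simplified picture, the bottom-up term does the spreading in every run. The top-down factor only changes how quickly the practice reaches a majority, which mirrors the division of labor described above.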

Let’s assume that an organization has already decided to start using Honeycomb, the Center of Production Excellence has authority to act on its mandate, and someone wants to make a change. Using the model from above, here are some activities that a Center or Guild can do to start and spread good observability practices with Honeycomb and what they may need from the greater organization.

Start from the bottom

The first thing to do is find as many intrinsically motivated individuals as possible and bring them together in a regular meeting to talk about the CoPE’s mandate, what they want to achieve, and why. This initial group forms the basis of a sociogram, which can be enriched with information about things like periodic sprint cycles, regular post-incident meetings, widely adopted standards and tools, and other things that don’t get the grease because they’re not squeaking.

Collecting this information up front is crucial because it will inform the development of a twinned strategy involving both passive and active tactics.
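One lightweight way to hold that information is as a small graph. The sketch below is only an illustration: the people, rituals, and the `rank_rituals` heuristic are all hypothetical, and a real sociogram would carry far more context than who attends which meeting.

```python
from collections import defaultdict

# Hypothetical data: which people take part in which existing rituals.
participation = {
    "sprint planning": {"ana", "ben", "chen"},
    "post-incident review": {"ben", "dee"},
    "on-call handoff": {"chen", "dee", "eli"},
}

# Hypothetical: the intrinsically motivated people already in the group.
cope_members = {"ben", "chen"}


def build_sociogram(participation):
    """Map each person to the set of people they share at least one ritual with."""
    graph = defaultdict(set)
    for people in participation.values():
        for person in people:
            graph[person] |= people - {person}
    return graph


def rank_rituals(participation, members):
    """Order rituals by how many non-members a present CoPE member could reach."""
    reach = {}
    for ritual, people in participation.items():
        if people & members:  # only rituals a member already attends
            reach[ritual] = len(people - members)
    return sorted(reach.items(), key=lambda item: (-item[1], item[0]))


sociogram = build_sociogram(participation)
print(sorted(sociogram["ben"]))
print(rank_rituals(participation, cope_members))
```

Ranking by reach is just one assumed heuristic for choosing where to start. The point is that writing down who already meets whom, and in which settings, gives the group something concrete to plan against.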

Passive tactics

Passive tactics hitch onto already-existing habits and motions. They inflect and latch on, letting another motive force propel them while slowly transforming it with each cycle of repetition. Ideally this is symbiotic, and the best chance of achieving that is to gather sufficient information about how the existing motion is detrimental to the organization’s production excellence goals and then introduce Honeycomb to address that deficiency.

I’ve noted several ideas that might be characterized as “passive” in this blog post. All of those tactics are attempts to proactively establish what Laura Maguire calls common ground and to lower the cost of coordination between people. That happens when people study what’s happening in production and communicate regularly about it, and have the time in a low-pressure, psychologically safe situation to ask questions and grow familiar with each other’s knowledge, proclivities, and working styles.

The members of a Center of Production Excellence should therefore learn about these organizational patterns and use their influence to change them ever so slightly, then reinforce that change until it has become part of the routine. Once those passive maneuvers are up and running, the CoPE members merely need to perform regular check-ins, or play their own part if they are participants in the relevant social institution. That light touch frees them to sprinkle in more targeted interventions.

Active tactics

The set of active tactics which a CoPE could perform is much more generic and targeted. These include running regular training and enablement sessions like Lunch & Learns, publishing a newsletter, advising on or collaborating on custom instrumentation libraries, and supporting the general observability toolset.

More broadly, since the role of a CoPE is to support the performance of resilience by the organization, this is a good space to consider designs for incident response. This can span from:

The increasing number of irreducible dimensions within software systems and the changing relations between them requires contributors to adopt an attitude of humility about what they know or may expect to happen. It therefore makes sense to build in mechanisms to foster continual learning and to create an environment where people can and will stretch to fill the gaps that inevitably open.

Part of how the organization can act as a support in this case is to grant the Center of Production Excellence power to make these changes.

Top to bottom

As for the top-down approach, Hébert-Dufresne et al. are not specific about which behaviors would promote or hinder the spread of the changes that the CoPE makes. In the case of rolling out a product like Honeycomb, we can venture a few ideas backed by research.

One thing that an organization’s management can do is to build the CoPE to be as autonomous as possible. This includes independent funding sources and protection from capricious actions by those whose interests may conflict with the changes that the CoPE is pushing. Related to this is a deference to frontline expertise. As Woods says, this expertise can be expressed as initiative and observed as people deviating from established plans or routines. In a case where deviation does occur, it’s important for management to validate it as appropriate and to have safeguards in place that prevent blame and punishment.

Returning once again to the safety analogy from above, a common shorthand for this is letting anyone halt production if something ‘on the ground’ seems dangerous. In software, this can manifest in ways like empowering anyone to declare an incident or deferring/dropping planned work in favor of things with longer-term benefits, like instrumentation or conducting incident retrospectives. Management must create buffers which support the Center of Production Excellence if it deems these activities necessary.

Another way that management can contribute is to look at unusual backgrounds or skills as assets, not distractions. Scott Page has famously demonstrated in books like The Diversity Bonus that teams with diverse experiences and “lower” individual aptitude perform better than homogenous ones where each individual is “better.” As such, organizations should seek internal and external candidates for roles who don’t pattern-match well with the group as currently constituted.

Finally, people in management should consider voluntarily leaving their current roles. Copious research, like studies conducted by Julian E. Orr and Ruthanne Huising, has found that knowledge silos form between hierarchical levels in organizations, not just between departments or teams. This is a major problem for organizations because ossified power dynamics inhibit good communication, those who wield power for too long may use it to protect themselves instead of for the good of the organization, and those people may personally suffer debilitating effects like falling behind in their own technical skills. To counter this, one might consider adopting the model of the Engineer/Manager Pendulum or other techniques of rotating leaders like sortition.

Conclusion

A Center of Production Excellence can be a powerful means for an organization to initiate transformations which foster resilience as it matures and its environment changes. In order to do this, its design, activities, and supporting structures require careful consideration. An organization’s agents must have a true desire to change in order to make appropriate decisions in those regards, and they must empower it to do the work of guiding adaptability.

The precondition for guiding adaptability is adaptability. This is the capacity or potential for agents to modify their behaviors, mental models, and priorities as the complex, dynamic system that they participate in changes. It occurs when agents re-evaluate their circumstances and, determining that the current state of affairs is insufficient to achieve their goals, draw upon heretofore novel resources which change what they can affect. A Center of Production Excellence helps to increase that capacity by making those resources available and preparing agents to use them when needed.

Organizations can aid in this process by adopting policies which amplify that work and avoid dampening it. In doing so, they demonstrate and enact their commitment to production excellence and should expect to reap the benefits.

If you enjoyed this blog post, keep your eyes peeled for the next in the series—it should be up next week. In the meantime, feel free to try Honeycomb today—it’s free.

 
