How We Define SRE Work, as a Team

By Fred Hebert  |   Last modified on June 1, 2023

Last year, I wrote How We Define SRE Work. This article described how I came up with the charter for the SRE team, which we bootstrapped right around then.

It’s been a while. The SRE team is now four engineers and a manager. We are involved in all sorts of things across the organization, across all sorts of spheres. We are embedded in teams and we handle training, vendor management, capacity planning, cluster updates, tooling, and so on. After growing the team to a point where we could get a better grasp on our mission and identity, we decided to revisit our charter. It is a living document after all! It was exciting for me to let other folks get their hands in it.

What changed?

Some changes revolved around wording: do we "own" reliability, or are we only a broader type of advocate for it? Others had to do with whether the charter should reflect work we did de facto own, whether it should be aspirational, or a bit of both. Do we make it our charter to cover platform concerns that the platform team is sometimes unable to prioritize, or is it part of our role to cover these issues? Could we come up with examples to anchor that work?

Discussions around that don’t always feel productive. However, they turn out to be useful for making sense of how team members feel about their work. For example, we used to have a full third of the charter dedicated to “provide tools and assistance,” a category I felt represented typical SRE work around automation and eliminating toil. But looking at the responsibilities we were taking on around vendor negotiations, voluntarily cross-pollinating silos, and impacting engineering teams’ roadmaps in anticipation of scaling needs, we decided to rework that whole category into a broader “system-level” perspective.

After some back and forth, we came up with a shared understanding that represents the work we currently do and the work we want to do.

The new SRE charter

We believe our work at Honeycomb should fall within the following categories:

  • Champion reliability and scalability
    Take a long-term, holistic view of the system. Lead and influence practices in the organization that lead to greater operational experiences. Be in charge of the continuous improvement feedback loop around uptime and reliability, including scaling.
    Think of: Public health officials impacting broad policies to improve the overall population’s health.
  • Lead on-call and incident practices
    Take charge of how we respond and adapt to incidents. Adjust work so that people feel comfortable and confident running our systems, propagate good practices, and ensure we do these things sustainably. Influence work (both upstream and downstream of incidents) at all times, not just during incidents.
    Think of: Fire marshals’ roles of investigation, inspection, and coordination.
  • Provide a system-level perspective to the organization
    As the organization grows and teams specialize, the SRE team has the opportunity to cross-pollinate silos, find patterns worth spreading across the organization, leverage economies of scale in engineering, and provide operational feedback to influence work practices and roadmaps toward sustainability. This also includes platform-level work around shared infrastructure, tools, and vendors that support SREs’ field of concerns.
    Think of: Urban planners’ roles around public consultation, transportation management, sanitation infrastructure, and sustainable growth.

As a secondary objective, SREs at Honeycomb may write incident reviews and blog posts, and participate in webinars and conference talks. The patterns we come up with internally are often useful to our own customers. As a result, to properly align our practice with what provides the most value to the organization, we should look for opportunities to generate content and guidelines that can be used internally by Honeycomb employees and also externally by our users.

What the charter now lets us surface

There’s a lot of stuff we do that was somewhat hard to track, or was seen as just an ‘aside’ that was important, but not measured. There’s also a bunch of stuff we took on because we knew it would be beneficial for the organization and we were in a great position to do so, but it either didn’t fit the previous charter, didn’t categorize well under a roadmap, or was strategic and somewhat experimental.

Examples include:

  • Embedding with product-engineering teams to create strong relationships and asking the right questions at the right time
  • Facilitating conversations about reliability and scale within teams, and making these a part of the organizational culture
  • Having a silo-crossing perspective to engineering within the organization, and taking a learning-oriented approach to interventions
  • Being accountable for customer quota changes, early signals, and creaking to anticipate inflection points around the system and its use
  • Maintaining shared platform and infrastructure components, such as those surrounding Terraform, Chef, and EKS, as well as cross-team concerns and practices like feature flag hygiene and instrumentation norms

We decided that tooling and support could still fit within the existing categories as a means of scaling and gaining leverage in a growing system, but the new charter has a more obvious place to surface any of the responsibilities above. It therefore feels like a better fit for our function.

Much like last year, we still expect our charter to be a living document. As our organization changes, and as our team changes, the priorities and high-leverage activities are also likely to change. Keeping track of the important work we do and aligning it with the definition of the team takes a lot of discussion, but it creates decent alignment down the road.

Get another sneak peek into how we do things

Now that you’ve read about what we define as SRE work, do you want to learn more about how we do things internally at Honeycomb? I recently wrote an article about how we manage incident response at Honeycomb, and I’d love for you to read that too.

Happy SREing,

Fred

 
