How We Define SRE Work
By Fred Hebert | February 15, 2022
At the time of writing this post, I have officially been at Honeycomb for one year as a site reliability engineer (SRE). I had shared my initial experiences and impressions in this post and thought it would make sense to check back in now that I’ve had the opportunity to spend time learning about the team, the culture, and the code base more in depth.
When I asked Honeycomb folks “What sort of SRE role is this going to be?” during the interview process, the answer included this specific line: “This is our first SRE hire and the role will take shape with whoever fills it.” I probably didn’t give this sentence all the weight it deserved at the time, but eventually my task literally became defining what we think SREs should do. This was a surprising opportunity: Honeycomb is filled with talented people who have been SREs, have managed SREs, or are public faces of what SRE stands for, and I was still given the autonomy to put it into my own words according to what I thought made sense.
After a few drafts and some back and forth for feedback, I had a short charter that gave a general orientation of SRE work here at Honeycomb. It is intended to be a living document that we can modify as we see fit, but so far it hasn’t changed too much. I’ll share it here.
The charter (aka what site reliability engineers should care about)
In most companies, SREs seem to fill a role that more or less fits these categories:
- Own the reliability roadmap
Take a long-term, holistic view of the system. Lead and influence practices in the organization that lead to greater operational experiences. Be in charge of the continuous improvement feedback loop around uptime and reliability.
Think of: Public health officials impacting broad policies to improve overall populations’ health.
- Lead incident practices
Take charge of how we respond to and adapt to incidents. Adapt work so that people feel comfortable and confident running our systems, propagate good practices, and ensure we do these things sustainably. Influence work both upstream and downstream of incidents, not just during incidents.
Think of: Fire marshals’ roles of investigation, inspection, and coordination.
- Provide tools and assistance
Develop software and write fixes to help improve the reliability of our systems in order to let engineers focus on their primary tasks. This provides a focus on eliminating toil and supporting/optimizing daily work that makes our people successful. Participate in design so that operational excellence is built-in.
Think of: Well, this one is pretty natural to the software world.
It is expected that someone filling the SRE role will have some responsibility in all these categories, although which will be their main focus will vary from organization to organization and, within each org, from person to person. In fact, as part of joining the company, an SRE should be able to figure out how they believe their strengths and motivations are spread across the categories, and team growth could be guided or aided from there.
How site reliability engineers shape their work
This charter represents a set of unrelated, sometimes conflicting priorities. As stated above, the objective is not necessarily to cover all of them fully, but to make sure that we stay aligned and give at least a decent amount of care to each of them. At Honeycomb, that work is interwoven with actual project assignments (with the platform team) to make sure I stay aware of all the challenges that may exist for our engineers.
An additional purpose of the charter is to act as a communication tool. I send it as a reference for peer reviews, and as we grow the team and hire more SREs, we can better think about their aspirations, the demands of work, and how it would all fit together. I don’t think we’d ever want to fully specialize, but it sure can help to be able to have stronger areas of focus and ownership.
To help keep track of that balance, I loosely and voluntarily maintain a grid, on a month-by-month basis, of how I believe my time has been allocated. Here’s a look at my 2021 grid:
| Period | Owning Reliability | Leading Incident Practices | Tools & Assistance |
There’s a hidden column where I keep examples of work done, but you can see that there’s a lot of variation from month to month. Some of that is my onboarding, some is reactive work (we had large spates of incidents in some months), and some of it is a deeper focus on projects when the pace allows (or demands) it. The idea is that as I fill this grid, I can figure out when I need to keep things in check and re-balance work, regardless of review cycles.
There’s a very obvious portion of the work that is impacted by organizational roadmaps, challenges of growth and scale, and whatever problems we feel at any point in time as an engineering team. However, this charter and grid have been useful signposts to choose which SRE initiatives to prioritize or focus on, and now that I feel more grounded here at Honeycomb, I’m hoping I can share more of the work we do around operations and system health in the upcoming months.
While I use the charter to align my work, I also expect that the charter will evolve and be adjusted to match new work demands. This is because our team will grow and change, our organizational dynamics will shift, and the challenges we face are going to keep surprising us.
The underlying attitude that ties everything together is to be ready and willing to adapt. In that spirit, we'd love to hear about your thoughts and/or experiences as an SRE in your org—share with us on Twitter or send us a message in our Pollinators community Slack.