On Building a Platform Team
By Jess Mink | Last modified on October 31, 2022
It may surprise you to hear this, but Honeycomb doesn’t currently have a platform team. We have a platform org, and my title is Director of Platform Engineering. We have engineers doing platform work. We even have an SRE team and a core services team. But a platform team? Nope.
I’ve been thinking about what it might mean to build a platform team up from scratch, a situation some of you may also be in, and it led me to ask some crucial questions.
What should such a team own? What are the important metrics to track? How can I justify hiring such a team? How will I know if they’re successful?
A platform team shouldn’t own product features. They’re supposed to write as little code as possible. So what do they do? Turns out, it’s a lot:
- Handle the relationship with your cloud provider
- Provide guidance on the best cloud provider usage
- Test out new cloud features
- Look for hosting cost opportunities
- Manage Kubernetes infrastructure
- Manage development environments
- Maintain testing frameworks
- Maintain the deployment framework (maybe?)
- Consult with other teams around load impacts
How’s that different from a DevOps team? This is all inside the DevOps model of abstracting infrastructure concerns so that developers can own their own infrastructure. It’s just that more of those abstraction layers are provided by vendors these days.
There’s a whole class of things you just don’t have to think about if, say, your system is running on lambdas. Suddenly, setting up CPU and disk monitoring isn’t your problem.
Vendors vs. internal platform teams
What’s the difference between a good vendor and a good internal team building you those abstractions? A vendor should be the best in the world at doing the thing you’ve hired them to do. Hopefully you’re getting more features faster, in a more robust way, with well-defined APIs. Best of all, you don’t have to support that system besides occasionally upgrading your integration.
The downside is that a vendor’s product may not always fit your needs exactly. If you aren’t using the product the way they expect, they may even accidentally break your use case. There’s also likely to be a gap between what they provide and what you need. That’s where a platform team comes in: to fill the gaps and make it seem like you have a vendor that fits your needs perfectly.
The best vendors feel like internal teams. The best internal teams feel like vendors.
We had to have ops, because someone had to rack the computers, run the wires, and install the operating systems. We still need people doing those things, but it’s a bit more centralized now. We needed DevOps because someone needed to provision machines, manage the AMIs, and set up basic monitoring. As the abstractions get better, we can get closer to a world where a company only has to focus on what distinguishes it in the market. Platform teams fill the gap that’s left, and I don’t think they’ll go away.
Let’s imagine a wonderful world of puppies and unicorns where every AWS, Azure, and Google Cloud feature you can think of has been built, and where any engineer with a credit card can set up complicated infrastructure without needing to know anything about sockets. I’d argue that even then, there would still be a place for platform teams in this world.
What would a platform team do then?
They might work with product teams to understand how architecture choices impact COGS. They might run A/B tests to see if a different instance type will give better performance. Or, they could evaluate vendors, choose observability tools, and make development easier. They might not even be writing code.
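To make the instance-type comparison a little more concrete, here is a minimal sketch of how such a trial might be scored. Everything in it is an assumption for illustration: the `TrialResult` fields, the instance labels, the prices, and the idea of judging by median latency against a budget. A real experiment would use your own traffic, your provider’s actual pricing, and probably tail latencies too.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TrialResult:
    """Hypothetical summary of one arm of an instance-type trial."""
    instance_type: str       # illustrative label, not a real SKU
    hourly_cost_usd: float   # assumed price per instance-hour
    requests_per_hour: int   # requests one instance served during the trial
    latencies_ms: list       # sampled request latencies in milliseconds

    def cost_per_million_requests(self) -> float:
        # Rough COGS proxy: what it costs to serve one million requests.
        return self.hourly_cost_usd / self.requests_per_hour * 1_000_000

    def median_latency_ms(self) -> float:
        return median(self.latencies_ms)


def pick_cheaper_within_budget(a, b, budget_ms):
    """Return the cheaper arm whose median latency meets the budget, or None."""
    candidates = [t for t in (a, b) if t.median_latency_ms() <= budget_ms]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.cost_per_million_requests())


# Made-up numbers, purely for illustration.
current = TrialResult("type-a", 0.10, 100_000, [10, 12, 14])
candidate = TrialResult("type-b", 0.15, 200_000, [8, 9, 30])
winner = pick_cheaper_within_budget(current, candidate, budget_ms=15)
```

In this made-up case the pricier instance wins because it serves twice the traffic per hour, which is the kind of result a platform team can bring back to product teams when talking about COGS.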
This space keeps shifting as we build better tools, but it isn’t going anywhere. Instead, as the primitives we work with become more powerful, this work gains more leverage. Doing platform right is something that can get you on the main stage of AWS re:Invent. It’s something that can profoundly impact your availability, latency, and developer speed. Doing platform right is something that can make or break your company.
This is profoundly important work that requires deep context and experience. Far from disappearing, this newest evolution in the operations space is more important than ever.
PS: If you missed Charity’s post on platform engineering, it’s definitely worth a read.