Modernizing Availability Thinking at Babylon
Babylon has a vision to bring affordable and accessible healthcare to the world. A key part of that vision is having a highly available global platform for healthcare delivery that is a delight to use for both customers and clinicians. In this talk, Tom will cover their ongoing journey to adopt an alternative (or complementary) SLI & SLO-based approach, the advantages of that approach, what they’ve learned along the way, and what’s left to do.
Tom Ford [Senior Engineering Director|Babylon]:
Hello, I’m Tom Ford. I’m a Senior Engineering Director at Babylon, and I’ll be talking a little bit about my work modernizing how we think about availability here at Babylon. A little bit about us first. We believe it’s possible to put an accessible and affordable health service in the hands of every person on earth. We’re a health services company: we write our own software, but we deliver healthcare too. We were born in the UK five or six years ago, founded by Ali Parsa. We’ve since expanded to other countries, including the US and South East Asia, and to markets we’re extremely proud of, like Rwanda, where we do thousands of consultations in that great African country.
For much of our history, we thought about availability in a fairly straightforward way: systems are up or down, and if they’re down, what is our percentage availability? We just counted the hours someone was on a P1 incident call, divided by the hours in the month, and there was your availability.
But of course, we wanted to challenge ourselves to be a little bit better than that. We’ve built a robust service delivery capability that we use every day. But to get to the next level, we wanted a better definition of availability. And really, the driver there was that we wanted to better reflect how customers experience our platform. There’s a chance that somebody gets a funny error in their web browser, or a clinician has to keep clicking reload until the platform works. We wanted to capture that, put that data in the hands of our teams, and work out where we could be doing better: for our members, for our staff, for everyone.
At this point, I want to give you a very brief and simplified overview of our architecture here at Babylon. At the top, you’ll see our applications. We have a mobile app, based on React Native, and that’s what our users, our patients, might be using when they engage with our services. And we have a number of web portals. One big service we deliver here at Babylon is a live face-to-face consultation with a doctor, plus any follow-up you might have, and the doctor or clinician will be using a web portal. In addition, if you call us to ask where your prescription is or something, we have support staff who use a specialized web portal as well. So that makes up our applications.
That’s all powered by a GraphQL API gateway and a series of microservices that power the platform. We have a number of what we might call first-tier services that provide user-facing functionality. And then, on the platform side, we have some slightly deeper platform microservices that do things like tenancy, identity, et cetera, mainly called by those first-tier microservices. So just bear that in mind as I go through the next slides.
We wanted a new definition of availability, and we wanted to learn from best practices in the industry: SLIs, SLOs, and error budgets. But we also wanted something we could trial quickly, to try a new definition and see how it works without a whole load of build-up. We wanted a simple definition. And we wanted to test it, not on the whole of Babylon, but on a part of the system that was complex enough and already in place, so we could see how it really works with the systems we use every day.
We wanted to use our existing tooling. And, as I say, we wanted to get something out there, test it, iterate, and learn with our teams.
So what did we choose? Well, we decided that 99.9% of well-formed requests coming into our system should have a well-formed response within three seconds. That’s our service-level objective. And where did we apply it? We applied it at the GraphQL API level and to the REST APIs of our microservices. We looked at every API transaction. We did that to simulate, as closely as we could, what our users are seeing. We liked this definition because it’s a simple, aspirational one-liner. Again, because this was something new to Babylon, we wanted something easy to understand and easy to communicate. We liked it because, compared to our previous definition, it really captures data about each one of our customers’ interactions. But it’s also straightforward: each customer interaction is either a pass or a fail, and we add them up at the end of the day. And finally, it was measurable with our current tooling, so we could get going quickly.
In a little more detail, here’s the calculation we carried out. For our service-level indicator, we took the number of successful requests over a time period and divided it by all the requests. Our successful requests were all the requests we captured, minus the client errors (as I say, it has to be a well-formed request to be counted), and minus our failed requests, which were the server-side errors and any requests slower than the three seconds where we set our objective.
And for all of our requests, similarly, that was all the requests we counted minus the client errors. We didn’t include the client errors in the SLI definition, but because we write most of the clients here at Babylon ourselves, we did look at that number, because it was actionable for us: if the percentage was too high, there was probably a mistake in one of our clients.
And, as I say, we set the objective that greater than 99.9% of our requests would be successful.
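The calculation Tom describes can be sketched in code. This is a minimal illustration under the definitions above, not Babylon’s actual implementation; the function and variable names are made up for the example.

```python
# Sketch of the SLI calculation described in the talk (all names are
# illustrative, not Babylon's tooling). A request counts toward the SLI
# only if it is well-formed (not a client 4xx error); it passes if it is
# neither a server error nor slower than the 3-second objective.

SLOW_THRESHOLD_SECONDS = 3.0
SLO_TARGET = 0.999  # "greater than 99.9% of requests successful"

def compute_sli(requests):
    """requests: iterable of (status_code, duration_seconds) pairs."""
    eligible = 0    # all requests minus client errors
    successful = 0  # eligible requests that were fast and not server errors
    for status, duration in requests:
        if 400 <= status < 500:
            continue  # client errors are excluded from the SLI entirely
        eligible += 1
        if status < 500 and duration <= SLOW_THRESHOLD_SECONDS:
            successful += 1
    return successful / eligible if eligible else 1.0

sample = [
    (200, 0.4), (200, 1.2), (404, 0.1),  # the 404 is excluded from the SLI
    (500, 0.3),                          # server error: a failed request
    (200, 4.5),                          # too slow:     a failed request
]
sli = compute_sli(sample)
print(f"SLI = {sli:.2%}, meets SLO: {sli >= SLO_TARGET}")
```

Each interaction is a pass or a fail, as in the talk, so the per-team dashboards described later only need these three counters (server errors, slow transactions, excluded client errors) to reconstruct the SLI.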
So how does the new system improve on the previous definition? Well, I put together a few simple scenarios and tried to describe how the old and new definitions would differ. Here’s some bad news: a one-hour outage taking place in one of our regions during a high-traffic period, during the day. Let’s assume we have 100% successful requests going along, and then, oh no, nothing is successful for an hour.
In our old system, we’ve gone one hour out of the month unavailable, which is effectively minus 0.13%. But in the new system, it looks worse. Why is that? In our simple model here, we’re saying day traffic is twice the night traffic. Because the outage is at a bad time, more of our customers’ requests than average are affected.
On the other hand, let’s look at a one-hour platform outage that occurs at night, where let’s assume we have 50% of our daytime traffic. Here, our old system wouldn’t have distinguished the two: an outage is an outage, always 0.13%. In the new system, because the traffic is lower, the number of requests that failed is lower, so it comes in at 0.09%. We think this is an improvement that fairly represents that an outage is less bad when the traffic is lower.
And let’s look at degraded service. Here’s a scenario where, during the day, we have one hour of degraded service. In our old system, we would probably have had to argue about whether this counted as an outage or not. If it did, it would be 0.13%. If it didn’t, maybe it’s a P2, and the platform is still “up”, even though some customers are being affected. The new system takes away the argument by looking at the number of requests that weren’t successful. We recognize this as a loss of availability for a certain subset of users, and we can quantify it as a 0.09% loss of availability. Which, again, we think is a big improvement: it’s a lot less arbitrary than the previous system.
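The arithmetic behind these scenarios can be reproduced directly. This sketch assumes the model stated in the talk (day traffic at twice the night rate) plus a 30-day month; the exact figures depend on month length, which is why the old-definition number lands around 0.13–0.14%.

```python
# Worked version of the three scenarios above, under the stated model:
# a 30-day month, 12 hours of "day" traffic at twice the "night" rate.

HOURS_IN_MONTH = 30 * 24                  # 720
NIGHT_RATE = 1.0                          # requests/hour at night (arbitrary unit)
DAY_RATE = 2.0 * NIGHT_RATE               # day traffic is twice night traffic
MONTHLY_REQUESTS = 30 * (12 * DAY_RATE + 12 * NIGHT_RATE)  # 1080 units

def old_definition(outage_hours):
    """Old view: hours unavailable over hours in the month."""
    return outage_hours / HOURS_IN_MONTH

def new_definition(failed_requests):
    """New view: failed requests over requests in the month."""
    return failed_requests / MONTHLY_REQUESTS

# Any 1h outage under the old definition: ~0.13-0.14% regardless of timing.
print(f"old, any 1h outage:       -{old_definition(1):.2%}")
# 1h full outage in the day loses a full hour of day-rate traffic: ~0.19%.
print(f"new, 1h day outage:       -{new_definition(DAY_RATE):.2%}")
# 1h full outage at night loses half as many requests: ~0.09%.
print(f"new, 1h night outage:     -{new_definition(NIGHT_RATE):.2%}")
# 1h of 50%-degraded daytime service fails half the day-rate traffic: ~0.09%.
print(f"new, 1h day 50% degraded: -{new_definition(0.5 * DAY_RATE):.2%}")
```

The point of the model survives the rounding: the old definition gives every one-hour incident the same score, while the new one weights it by how many customer requests were actually hurt.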
How did we actually go about implementing this? Well, we used our current, in-place application performance monitoring suite, which recorded the error codes, client 4xx and server 5xx, and the request duration. It was super important for us to get our teams engaged, so we spent time making sure all our teams had out-of-the-box dashboards showing this SLI and its breakdown. They got all the data, and they got it in real time. And we did this for each squad, so they were ready to go.
One thing we also did was hand-curate a list of really key transactions that would affect our customers significantly if they ran slow or didn’t work. We produced that subset as a leadership report, which we publish monthly, containing the key data and the changes month to month. So let’s go through these.
Here’s an example of a team-level dashboard for one of our squads here at Babylon. Each team receives a dashboard for the area they’re interested in and accountable for, and you can see, down here on the bottom right, the SLI, in this case counted over seven days. We break that down into the number of server errors, the number of transactions that were slow, and the number of client errors, which don’t form part of the SLI, but if that number is too high, something is probably wrong and the team would investigate it.
To encourage and aid the teams in getting to grips with this, we set up a monthly 30-minute ops review meeting with each team, where we introduce the dashboard, talk them through it, set objectives, and keep a good feedback loop going on the improvement actions taken each month.
This is an example of the key transactions report. Again, this is the subset report where we hand-curated the really key transactions that matter to our users, and it’s a report for leadership. What we tried to do here is make it super clear what each transaction was (here, for example, it might be “fetch patient details”), which product is affected, and which team owns it. Because in a microservice architecture there are often a number of different teams involved in any transaction, we make sure the team that primarily owns it is clear. This is product-focused: it’s meant to give accountability and show clearly where you can put some time and some action. We found this super helpful in making the project feel real to product and engineering management, and in giving a really clear picture of where we were dedicating resources to improve the SLIs for the key transactions.
So, overall, what did we learn? First of all, I’m pleased to say we had great engagement and feedback from the engineering community. They identified with this, and they liked the tools. The monthly meetings were great for keeping up the momentum and were very well attended, and people went above and beyond to fix the things we were highlighting. We certainly found hidden pain points within some of the customer journeys, which allowed us to identify and prioritize them for fixes and to deliver customer value.
We also identified some unexpected interactions. An example: one of our back-office tasks was to upload a significant amount of data to our system, and we noticed this broke a subset of what we thought were unrelated user journeys, just for a short period of time. It turned out to be due to database contention.
And we liked our monthly report because it drove engagement with our senior stakeholders and product leadership by aligning what could be an esoteric topic to customer value and the product we’re delivering, and, of course, to our members.
So where do we go from here? Well, we’re very happy with the outcome of this project, so we’re pushing for adoption across our engineering community at Babylon. We’re engaged in upgrading our GraphQL API layer at the moment to use federation, and we’re keen to make SLOs a first-class citizen in that adoption. Longer term, we would like to move from an API-request-based system to looking at journeys using distributed tracing. We have correlation IDs, so we can trace a user’s request across a number of microservices, and by looking at journeys we should get something that represents our user experience better than just looking at requests at the edge. And we want to tie these SLOs, and a move to the error-budget way of thinking, into our alerting framework. We have good alerting at the moment on external checks and transactions, but we would like alerts that say: okay, this endpoint is running a bit slow, “you’re going to miss your SLO if you don’t do something in the next few days.”
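The kind of alert described here amounts to projecting error-budget burn forward. A minimal sketch of that idea, assuming a simple linear projection; the function name, window sizes, and thresholds are illustrative, not Babylon’s tooling:

```python
# Minimal error-budget burn projection, as a sketch of the alerting idea
# described above (all numbers and names are illustrative).

def days_until_budget_exhausted(slo_target, window_days, elapsed_days,
                                failed, total):
    """Linearly project when the error budget for the window runs out.

    slo_target:    e.g. 0.999
    window_days:   length of the SLO window, e.g. 30
    elapsed_days:  how far into the window we are (must be > 0)
    failed, total: request counts observed so far in the window
    Returns days from now until exhaustion, or None if on track.
    """
    # Budget for the whole window, scaled from the traffic rate seen so far.
    budget = (1.0 - slo_target) * (total / elapsed_days) * window_days
    burn_per_day = failed / elapsed_days
    if burn_per_day <= budget / window_days:
        return None  # burning at or below the sustainable rate
    remaining = budget - failed
    if remaining <= 0:
        return 0.0   # budget already spent
    return remaining / burn_per_day

# 10 days into a 30-day window, 1M requests so far, 1500 failures:
# the window's budget is ~3000 failures at 99.9%, but we're on pace for 4500.
eta = days_until_budget_exhausted(0.999, 30, 10, failed=1500, total=1_000_000)
print("on track" if eta is None else f"budget gone in ~{eta:.1f} days")
```

An alerting rule would fire when the returned horizon drops below some threshold, which is exactly the “you’re going to miss your SLO in the next few days” warning Tom describes wanting.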
So that’s the end of my talk. As always, we’re hiring: Babylon is looking for exceptional talent globally for our technology organization, with key locations in London, Austin, Bangalore, and San Francisco. We’d be super happy to hear from you if you’re interested in our mission or anything you heard today. I’m here for any questions you might have. Thanks very much for listening to my talk.
Corey Quinn [Chief Cloud Economist|The Duckbill Group]:
Thank you very much for taking the time to go through that. One thing I appreciate first is that you’re Babylon Health, which, it turns out, does not mean “babbling on”, so egg on my face there. But it’s nice to see folks from regulated industries coming in and having this conversation. Harkening back to my snark earlier today: you’re not Netflix. If you mess things up, the consequences are, I’m not saying worse than if a video stream fails, just different. It’s going to be a different experience. Also, what I find neat is that you’re coming at this from a Senior Director position. The people I see talking the most about SLOs, some of them selling SLOs as a service, are effectively speaking SRE language, and SRE language only, for a lot of it. So I’m curious how you wind up articulating that up the stack to folks who are not already bought in on the concept with a clear understanding of what it is. It feels like a hard sell to the upper levels of management, of which you are teetering on the brink, if not already into.
Yeah, thanks for that. I think we’re lucky at Babylon to have very enlightened management at the CTO level, so it hasn’t been too hard a sell. But what I would say is that the organization, as it’s grown, has moved on. We’ve gone through a number of different leaders and a number of different philosophies. It almost feels like at the scale we’re at, which is reasonable in terms of engineers, several hundred across five different regions, looking at the US, our global footprint, you need more sophistication in some of these concepts. And of course, we don’t want to start from nothing; we want to stand on the shoulders of giants here. So as a technical organization, right up to CTO level, we went out looking at best practice, really solid engineering thinking, and at how we could apply it.
Yeah, and with the challenges — sorry, go ahead. No, please.
That brings me back to your first point about regulated industries. Software engineering is software engineering, fundamentally. Yes, there are challenges in regulated industries, and I spend my time thinking about this and making sure we have the right paperwork and processes and all of those things. But fundamentally, what we’ve learned about delivering great, highly available systems is applicable to Netflix, is applicable to Babylon, is applicable to Google, right? We’ve learned a lot in the last 20 years about how to run these things, and those are the best practices. The trick is to take those best practices and apply them in a regulated industry in a way that ticks the boxes. You’ve got to do what you’ve got to do from a regulation point of view, but you’re still best off taking the best practices and making them work.
We have a follow-up question here. Several questions, in fact; let’s dive into them. Was the monthly review you did with all teams together, as in, was there social pressure to improve the indicators, or was it more on a one-on-one, team-by-team basis?
Interesting question. We did it on a one-on-one basis. This was a project we carried out with part of the organization, so the teams knew me pretty well. We had about 10 or 12 teams involved, and we spent half an hour with each team every month. We weren’t looking to set teams against teams in any way, even implicitly. What we were looking to do was engage our teams in the metrics and encourage them to want to improve and do better independently. And I would say that worked really well; a number of teams really impressed me by going above and beyond what I expected.
Ethan asks: do you have strategies for when an outage doesn’t generate error traces to subtract from the SLO? Networking issues come to mind. Do you try to extrapolate missed traffic, or get better tooling? And I’m going to interject now, for you but especially for the members of the audience who might work at Honeycomb: no, buying a product is not an acceptable answer here. Go.
It’s a good point, and I glossed over some of the technical detail. When we do our measurements at a service level, that’s great for capturing degraded traffic, but it doesn’t capture full outages. So in addition to checking things at a request level, the error codes, we use a combination of synthetics that run externally, from GCP or AWS, and health-check endpoints in our services. By combining those, we can effectively lay the synthetic blackout windows on top of the service-level requests to get an overall SLI which counts full outages. I would say our tooling is still in the early stages here, and that’s a process we carry out fairly manually, so that’s definitely an area of improvement for us. We basically have the data; we just need to manage it correctly to get the right answer.
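The overlay Tom describes can be sketched as follows. The key subtlety is that during a full outage the service records no requests at all, so the synthetic-check windows have to supply an estimate of the traffic that never arrived. All names and the traffic-estimation approach here are hypothetical illustrations, not Babylon’s process.

```python
# Sketch of overlaying synthetic-check blackout windows on top of
# request-level SLI data (names and structure are hypothetical).
# During a full outage no requests are recorded, so we estimate the
# requests that *would* have arrived and count them as failed.

def overall_sli(successful, eligible, outage_minutes, est_rate_per_minute):
    """successful, eligible: request counts actually recorded by the APM.
    outage_minutes:      total minutes synthetics saw the service fully down.
    est_rate_per_minute: estimated normal traffic during those windows."""
    missed = outage_minutes * est_rate_per_minute
    # Missed requests are all treated as failed: they inflate the
    # denominator but contribute nothing to the numerator.
    return successful / (eligible + missed)

# Recorded traffic alone says 99.95%, but synthetics saw a 30-minute
# blackout at an estimated 200 requests/minute that never reached us.
sli = overall_sli(successful=999_500, eligible=1_000_000,
                  outage_minutes=30, est_rate_per_minute=200)
print(f"overall SLI including full outages: {sli:.3%}")
```

This matches the manual process described: the request-level numbers and the synthetic windows both exist as data, and the remaining work is merging them consistently.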
So, in my day job I fix the horrifying AWS bill. Small problem, but it affects someone eventually. I’ve looked at companies’ experiences attempting to manage an actual budget, and the SLO approach takes the thing you’re bad at when it’s actual money and applies it to errors. Do people find that managing error budgets is something they actually get to? Or is it one of those more aspirational things that mostly just causes drama on Twitter, especially over the last 24 hours?
Interesting question. I would say that making teams accountable for their SLIs and their error budgets, and in particular the management bit, you know, “okay, I’ve got error budget left, I can do this extra A/B test, or I can push ahead with an experiment”, we’re not really there yet. And I’m not sure how far we would want to go, given where we are as an industry. We’re really aspiring here; obviously, 100% isn’t achievable anywhere in the industry, and 100% is not where we need to be.
It’s one of those aspirational things you never achieve in engineering. Like perfect uptime: 100%, everything up all the time. Or perfect execution. Or happiness.
Yeah, absolutely. We want to be better, and we think this is a great way of getting metrics that help us continue to improve. But the concept of spending the error budget is not necessarily somewhere we want to go.
A question here: how do you convince customers to think about the number of transactions versus the good-minute/bad-minute SLO definition? As in, the customer doesn’t really care how many requests were dropped in the outage versus how long their team wasn’t able to use the product.
Yeah, that’s definitely an interesting question. Where we are at the moment is using our SLOs as internal tools. We will have a different SLA with our customers, in terms of corporate partners, which might not necessarily resemble our SLO.
It does feel like it’s an internally aligned tool. Because if the position is, well, things should go down a certain amount and you should have plans for that, communicating that to the general public when they can’t get to your website doesn’t go very well.
I think that’s true, right? An SLA is an agreement between you and your partner, with a certain degree of comeback should things go wrong. My philosophy here is that by having detailed information internally, and really aspiring to keep getting better with SLOs and SLIs, you get to a place where you’re not breaching your SLA very often. If you don’t have that detailed picture internally, you may be in blissful ignorance, but you’ll see more outages than you would otherwise. So even if, and I don’t necessarily think every company should try to convince its customers to accept error budgets and SLOs, even with a conventional SLA, I think the approach still has value internally and makes your organization better at meeting the SLA.
It does seem that anything in a B2C-style scenario gives that up on some level. I agree with what you’re saying. The last question: as you look back across the process of adopting this, is there anything you could identify that would dramatically shorten the path to adoption, that would make it easier? Things at the beginning that you wish you had discovered that now, in hindsight, make more sense?
Probably a good question to ask me in a year’s time.
That’s right, it’s a journey, not a destination. How many people are implementing ERPs? Everyone. Who has completed one? No one.
I was fortunate to have brilliant technical assistance here from my team. One of the things that surprised me was that we were able to do a hell of a lot with the tooling we had. There’s a lot of extra scope in all the great stuff we’ve built up over a number of years; we just hadn’t brought it together into one easily understandable metric. That was a key insight. The other thing, which I mentioned in the talk, is that there’s a huge amount of data, and a lot of it is only really relevant at the individual team level. But by putting together, in the overall reports, a subset of the really key bits, conversations between engineering and product leadership became easier: look, here’s something perhaps we didn’t know about before, or didn’t understand in this much detail. That made those conversations, and being able to dedicate time to this, an easier process. So that’s something I would advocate: choosing the right subset of data for communication is key.
Thank you very much. No one really knows their own external reputation, but Babylon very much has a reputation for technical excellence, and now that we’re seeing how you’re getting there, a lot of what we see in the outside world, in certain circles, begins to make more sense. Thank you so much, Tom, we appreciate it.
Thanks, Corey. Appreciate it.