Graph Observability: Honeycomb and Apollo GraphQL With OpenTelemetry
Learn how to use Honeycomb, Apollo Studio, and Open Telemetry to optimize GraphQL performance for a federated graph. See how to debug a federated GraphQL query across subgraphs and down to the database layer using Honeycomb.
David Pickavance [Senior Sales Engineer|Apollo GraphQL]:
Hi. I’m David Pickavance. I’m a Senior Sales Engineer at Apollo GraphQL, and I’m here today to talk to you about graph observability, understanding the performance of your graph using Honeycomb with OpenTelemetry and Apollo Studio Product. I’m going to be using the following setup to talk through this. It’s a very simple setup, but it reasonably mirrors what people would actually do in production. So I’m going to imagine that I’m running a very simple, online bookseller and we’re going to be imagining, for the purpose of this demo, just how I’m going to populate the home page of my site.
And I will just be giving visitors to my site a list of books that they might want to buy. So in order to do this, I have built a graph based on Apollo’s open-source libraries, and I have two subgraphs, split into domains, books, and authors. The idea is, in a real company, each one would be owned by a separate team. And then, the schema for each of these or both schemas are composed by the gateway so that the client, in this case my website, sees the composed schema and doesn’t have to worry about how it’s going to pull data from two different backends.
Each subgraph is then backed by its own data source or data sources so that here, in this demo, the request flow for populating the page on my website would look something like this. I’ve then implemented OpenTelemetry tooling on gateway, subgraphs, and Postgres databases and I’m pushing those spans into a collector and then into Honeycomb which I will show you in a bit. And then we’ve also got the native GraphQL tooling and observability from Apollo Studio. So the gateway is pushing information about operation shapes. There was a request that came through. What was the request called? What fields were asked for? And metrics about how long it took to serve that query.
So let’s see what that actually looks like. So I have two, here we go, here’s my graph. I have two variants in my graph. I have this complete variance which contains all possible types and fields on my schema, but not every part of my schema is something that I’m ready to use to power my actual production site. So I’ve also split off this contract there into my graph, and that’s the one I use to power the production version of my site.
So to populate the home page on my site, I have this GraphQL query called home book recs. It has a P95 service time of 78 milliseconds. That puts us well within SLOs. And we can see that, this query. Each time someone comes to my site, the web page is asking for a list of all books, and for each book, we’re pulling the author, also the book title, and then the author’s name. So that’s the standard production page.
Now, my product team has come to me, and they’re asking for, they want to try a change to this home page. What if somebody is visiting our site and they’ve already read one of these books that we’re suggesting to them? They might be interested in other books by the same author. So what if each book that we’re showing, where we have the author, perhaps under that, we could also have a little image of other books by that same author.
Okay. I think I can build this query. So I extend my schema. And on my books’ schema, I extend the context of an author to include a list of books. I’m tagging this as experimental so that that does not get included in the schema on my production graph. I don’t want any front-end engineers accidentally starting to use this experimental feature on my production graph. And so then I can also use Explorer to build out what I think the query is going to look like. So this, without this, all books with each book title, author name, that’s what the query I’m running for my production site. And now I’m adding for that author, please give me a list of all books and their titles. Great, so I can go in, I know what the query looks like, so I can build that into some test clients and start running some traffic.
Let’s see what that does to the performance of this query. All right. So, my home book recs query, using, adding this new data to the home page, it now has a P95 service time of 133 milliseconds. Let’s go back and compare that to my production version which had a P95 of 78. All right. We’ve definitely slowed things down here, I’m in this highly simplistic example. The slowdown is not enormous, but we’re still looking at, you know, almost a doubling of the time it takes to provide this information for the home page. So, my next step would be to try and figure out why it’s taking so much longer and see if there is anything I can do to fix this.
Let’s have a look at the trace for this request. And we’re going to go and look at. Alright, let’s look at one that’s actually taking the sort of P95 time of 123.4 milliseconds. So here is Studio’s graph native view. Okay, it’s looking at it in terms of graph objects and resolvers of what it’s doing to finish the results of that query. So it is pulling some information from the book’s graph, and it’s pulling some information from the authors’ graph. And I can also, you know, sort of break that down for more information if I want. Let’s go and compare to what that looks like for the production version. See if we can spot any immediate differences. Again, through traces lets.
There’s not an immediately obvious difference in the shape of the request that I’m seeing here in the studio. What I’m mostly seeing is on the experimental home page, we’re definitely spending more time pulling a list of books, 50 odd milliseconds versus 4 milliseconds. But otherwise, the shape of how I’m hitting the service has not changed. This isn’t necessarily what I would have expected since I now have a query of depth 3 rather than depth 2. So I want to dig into some other system that can give me more information about what is happening here and why. So this is where I’m going to hop over to Honeycomb. So I’m going to go here and I’m going to find the idea for this trace, and I want to look at it in Honeycomb because that contains additional information about what’s happening with this request end-to-end.
All right, so. Let’s search for this particular trace. I don’t think I need to heatmap it. Okay. So here we go. Here is this one trace that matches to this trace from Studio and let’s take a look. Okay. Now, I can see that, yep, it’s querying a bunch of stuff from books. And if I keep scrolling down, it’s then pulling a bunch of stuff from authors. Okay. So that matches. So can I try and understand what is happening with this book service that’s making it take so much longer? Let’s start to step through this.
Okay. We’re doing a GraphQL execute. We’re pulling all books. Yup. What does that mean in terms of actually pulling data from a Postgres database? We are selecting all books. Nice. But what about here? Okay, well now we’re pulling all books where the author ID is one. Okay, so we’ve pulled all books. Now we’re pulling books for a specific author. And, interesting. We’re running exactly the same query again and again and again. And now we’re pulling all books where the author ID is two and again and again. I think I’m starting to understand what is happening here. The way I have built my resolvers means that my graph is pulling a list of books for each author multiple times, which is obviously pretty inefficient. So let’s go back to Studio.
And before I jumped over to Honeycomb, one of the things I mentioned, I’m a little surprised that I’m running this depth 3 query, but we only seem to be pulling from the books subgraph and the author’s subgraph whereas I would have expected us to then be going to the books subgraph. But if I run this query, we’ll take a look. Yeah. We’re just fetching books and then authors. So, obviously, the query planner is intelligent enough to realize, yeah, I don’t need to go book subgraph and author subgraph and then book subgraph. I just need to hit the book subgraph once and then the author subgraph, but my resolvers are not yet intelligent enough to say, well, actually, I only need to get a list of all books once and then, in memory, filter them out by author ID. So that would be my next step, is to think about how to make my resolvers more efficient.
Perhaps I could use a data layer. Perhaps I could add some caching to my subgraph. Perhaps I could try a mix of both. Books are pretty cacheable since who wrote a book doesn’t change very often. So definitely some work to go away and do on my graph, but I wouldn’t have been able to figure it out without this additional information from Honeycomb. So, as you can see, Apollo Studio provides the schema registry and insights to help you manage your graph, but sometimes you need the additional data that, you’ll get from using OpenTelemetry and the powerful tool of Honeycomb to understand just that little bit more about your graph performance and the underlying data layer.
This really gives you that deeper observability and the additional information you need to really hone in on how to improve the performance of your graph. So, here are some links if you want to learn more. I’ll leave that up for a second.
If you see any typos in this text or have any questions, reach out to firstname.lastname@example.org