Michael Simo [Platform Engineer|Honeycomb]:
Hi, everybody. I’m Michael Simo, I’m a Platform Engineer at Honeycomb. And today I’m going to be giving a talk going over instrumenting Honeycomb into dedicated game servers, and then gaining insights and seeing what problems you can find, identify, and resolve within the infrastructure.
All right. Let’s get started. The talk is titled Insight Into Instrumentation for Dedicated Game Servers.
And a bit about me. I’ve been in the system administration, cloud, and DevOps space for the past eight years. For the past five years, I’ve been using more modern, cutting-edge technologies like infrastructure as code, so Terraform, with Kubernetes as my container orchestration platform of choice. And I run a bunch of dedicated game servers on the platform itself.
Some of my experience with game servers: I started around 2008 doing a third-party self-hosted website. I purchased a game server through there and maintained it by having an FTP connection, uploading files manually. More archaic. Later on, I started self-hosting on Google Cloud using Google Compute Engine and some of the container resources they have there. I built the Docker image and hosted it myself. And now to the next step and as of 2019, I’ve been doing all of my workloads on Kubernetes, everything is containerized. I have a CI/CD pipeline where I can deploy servers and get the newest version up pretty quickly versus doing everything related to files and everything shipped off. We will go more into that during this talk.
The roadmap for the presentation — we’ve got four sections here: The first one is going over server architecture. Secondly, we’re going to go over how to add Honeycomb so you can send your metrics about your Kubernetes cluster to Honeycomb. The third is going to be my CI/CD process for deploying servers — once I have a change that I want to deploy to the cloud, how does it get there, what are the steps it takes. And lastly, observing some of the data, we’re going to do a rudimentary, brief, basic view of the metrics that we’re collecting.
So, server architecture: how the game servers are deployed, and what components are involved. Two prerequisites. First, I’m using Kubernetes, as mentioned earlier. I wanted to learn more about container platforms and how the components interconnect: Kubernetes stateful sets, services, and some other networking components. I wanted to see how compute, network, and storage can all be accessed from one location, and since Kubernetes has resources to interact with all of these components, it was a good all-in-one solution for me. I also wanted to go with the “cattle, not pets” approach. Instead of having a specific VM or server that you feed and very carefully curate, I wanted a more dynamic workflow where I have something prebuilt. I can restart, delete, recreate, spin up, and spin down resources, and I’ll get the same output. Instead of spinning up a server and configuring it manually, I wanted to automate that process.
So I went with Docker. I had a Dockerfile that bootstrapped the image in place of me taking the manual approach. “Cattle, not pets” is the approach I wanted to take there.
And the second use case is for Honeycomb. Previously, I didn’t have good insight into how I could figure out what’s going wrong. I had logs from the server and that was about it. I used Kubernetes events as well to see what was going on in the container platform. But those can be basic. They’re not super in-depth. I wanted to extend that. So I use Honeycomb’s Kubernetes agent and then this allows me to collect metadata from the Kubernetes cluster, provides me with searching capabilities for all my labels and then I get other information about Kubernetes resource status and some other interesting data that I’ll delve into later. Cool.
This is the architecture for my game servers. I’m going to go briefly over the diagram on the left, breaking it down, and then the dataflow on the right side. Let’s start with users. A user connects to one of two things: a website, or a dedicated game server, which can be one of several types. The game servers are constantly talking with the Kubernetes collection agent, which is a DaemonSet, so there’s a pod on each node. Whatever node a dedicated game server happens to be on, it sends its logs to the agent on that node. That’s automatically taken care of for me, which is great.
I don’t have to necessarily do anything about logging. I get it for free. Each game server also has a persistent volume claim and an external IP associated with it, and I can put DNS in front of that. So you have a port, like 25565 for Minecraft, as an example. There are several components involved in creating a game server resource type. Those are all codified as Kubernetes resources, and together they form one isolated unit.
Users connect to the frontend, they just receive various information from a cache that I have here. And that’s mostly like player statistics, latency, if there are leader boards for how well you’re doing on the server, that’s loaded from a cache. That is managed by the backend API.
And then the back-end API itself is a server that players don’t necessarily connect to, but game servers are constantly talking to it. The API sends data to the database. If you want to record a player score count, I can send that and then persist that in the database. If I want something that’s going to be cached or something that I want to pull up frequently, like current player count, I want to cache that value for a certain period of time and refresh so it’s near real-time. Every five minutes it refreshes, better to cache that instead of rewriting a value in a database every five minutes.
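A minimal sketch of the caching behavior described here, assuming an in-process cache keyed by name with a five-minute time-to-live. The class name `TTLCache`, the key, and the stand-in loader are hypothetical; the talk does not say what cache technology the backend actually uses.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Keep a value for ttl_seconds, then reload it on the next read."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str, loader: Callable[[], object]) -> object:
        """Return the cached value; call loader() only if it is missing or expired."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]
        value = loader()
        self._entries[key] = (now, value)
        return value


# Five-minute refresh, matching the interval mentioned in the talk.
player_counts = TTLCache(ttl_seconds=300)
# The lambda is a stand-in for a real game server query.
current = player_counts.get("minecraft", lambda: 12)
```

Reads within the five-minute window return the stored value without touching the loader, which is the point of caching the player count rather than rewriting a database row on every refresh.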
Then for the dataflow, I went over that a little bit. Stats and player profiles are retrieved from the cache here. And game servers send data to the API. Then API does periodic queries on servers so whatever game server query that the server is using, like Minecraft has its own, all the Source dedicated game servers have Source query. There are all different various ways of getting information from the servers but it’s abstracted to be the same kind of structure and format, data format. And then all mentioned is just stored in the backend.
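The idea of abstracting different query protocols into one data format can be sketched as follows. This is only an illustration: the `ServerStatus` fields and the raw field names in each adapter are hypothetical, and they are not the real Minecraft or Source query wire formats.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ServerStatus:
    """One common shape for every game, whatever protocol produced it."""
    game: str
    players: int
    max_players: int
    map_name: str


def from_minecraft(raw: dict) -> ServerStatus:
    # Hypothetical field names, not the real Minecraft status reply.
    return ServerStatus("minecraft", raw["online"], raw["max"], raw.get("world", ""))


def from_source(raw: dict) -> ServerStatus:
    # Hypothetical field names for a Source-engine style query reply.
    return ServerStatus(raw["game"], raw["player_count"], raw["max_players"], raw["map"])


# One adapter per protocol; supporting a new game means adding one entry here.
NORMALIZERS: Dict[str, Callable[[dict], ServerStatus]] = {
    "minecraft": from_minecraft,
    "source": from_source,
}


def normalize(protocol: str, raw: dict) -> ServerStatus:
    """Convert a protocol-specific reply into the shared ServerStatus shape."""
    return NORMALIZERS[protocol](raw)
```

Everything downstream of `normalize` (caching, the database, the frontend) only has to understand `ServerStatus`.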
Various games are supported: Counter-Strike, Minecraft, Valheim, Rust, and so on. The list can be expanded, but these are the games that I’ve enjoyed over the past few years and have been able to port to Docker and host on this workflow. As of right now, Minecraft, Valheim, and Rust are playable. Cool.
Instrumenting with Honeycomb: how do we get more information about what’s going on in the infrastructure? How do I monitor or gain insights on what’s going wrong, what’s happening on the servers I just mentioned?
For v1, it was very easy. I had a Kubernetes cluster, and I wanted to get everything from the cluster, send it to Honeycomb, and see what I could do with the data. We have an open-source Honeycomb Kubernetes agent that you can download. It’s just a bunch of Kubernetes manifests, and we have a quick start example that you can go ahead and run on your Kubernetes cluster. That will create a service account, a role binding, a config map, and a DaemonSet, which will create one pod per node. That pod is the Honeycomb collector: most of the logs from all the pods on a specific node are sent to the Honeycomb agent and then shipped off to Honeycomb. You also have to have the secret that contains your Honeycomb API key, and that’s how you confirm the data is going to your account. This chart here is lifted directly from the Honeycomb Kubernetes Agent GitHub. It goes over how the log files are stored and shipped out to the agent. You can see a pod, a collection of containers, just recording stdout. That’s logged on the file system under /var/log/pods, followed by your pod name. The files there are numbered one, two, and so on if you have rolling logs, and those are shipped off to the agent. It’s essentially aggregating logs and sending them off to Honeycomb, while at the same time still recording Kubernetes metadata about the pod, container statuses, and things like that. We’ll get into that later.
What are some of the types of data that we’re collecting using the Honeycomb Kubernetes agent? I mentioned there were Kubernetes labels: for all the resources that are labeled, we can aggregate or filter through those using the selectors. There’s Kubernetes metadata. There’s Kubernetes resource status. So if I want to check how often a pod has been restarting over a certain period of time, that status is extracted from Kubernetes and uploaded to Honeycomb, and I can figure that out. We’ll see a graph of that later.
And yeah, so there’s — a bunch of data that we can delve into. And some of the features that I like about the Kubernetes agent, myself. There’s event sampling, so if I only want a percentage of the events that I’m sending from Kubernetes, maybe I want every hour, I only want 20% of the data, you can do, you can create a sampling specification and then add that to your configuration and you can have that amount of data flowing into Honeycomb instead of overwhelming it with a bunch of traffic. You can omit containers — say I have some game server and it has a sidecar that does initial bootstrapping or pre-world creation, and I don’t want to collect any information about that pod specifically, I can omit it so there’s a config I can write to specify, this is a deny list. These are the pods in my infrastructure that I do not want metrics from or information about. Please omit them.
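To make the two features concrete, here is a rough sketch of what sampling plus a deny list amounts to. It is not the agent's configuration syntax or implementation; `filter_and_sample`, the `container` field, and the `world-init` sidecar name are hypothetical. It follows the usual Honeycomb convention that a sample rate of N means keeping about 1 in N events, so 20% corresponds to a rate of 5.

```python
import random
from typing import Dict, Iterable, Iterator, Optional, Set


def filter_and_sample(events: Iterable[Dict], sample_rate: int,
                      denied_containers: Set[str],
                      rng: Optional[random.Random] = None) -> Iterator[Dict]:
    """Drop denied containers, then keep roughly 1 in sample_rate of the rest."""
    rng = rng or random.Random()
    for event in events:
        if event.get("container") in denied_containers:
            continue  # deny list: never forward events from these containers
        if rng.random() < 1.0 / sample_rate:
            # Record the rate so counts can be re-weighted after sampling.
            yield {**event, "sample_rate": sample_rate}


# Keep about 20% of events and skip a hypothetical bootstrapping sidecar.
events = [{"container": "rust", "line": i} for i in range(100)]
events.append({"container": "world-init", "line": -1})
kept = list(filter_and_sample(events, sample_rate=5, denied_containers={"world-init"}))
```

Attaching the sample rate to each kept event is what lets a backend scale counts back up, so a 1-in-5 sample still reports approximately the true totals.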
I think that’s pretty interesting. As far as using the Honeycomb Kubernetes agent yourself, you just need to sign up with Honeycomb, create a Honeycomb team, get a Honeycomb API key, and then deploy the agent to your Kubernetes environment of choice. Personally, I use a managed Kubernetes environment, GKE. There’s EKS, there’s self-hosted, Azure has its own offering, and there are several third-party providers that you can use this on. So yeah, the quick start example is a great way to get started. From there, you can productionize it and create a version that suits your use case.
Now we’re going on to deploying servers. So the architecture I showed earlier, how do we actually get the servers into Kubernetes so all the components can interact properly?
I have a simple three-step process that I take for my personal infrastructure. Code: whatever changes I need to make, I’ll make sure I define them as code, and there’s a PR in GitHub staged and ready for commit. Once that’s ready, that’s when my automation starts. I’ll do a build based off that PR. And once that is merged, I’ll get additional artifacts: create a container image and then kick off another process that stages changes to my Terraform code, which allows me to deploy to Kubernetes. I’ll break that down into the specific tools in the next slide here.
This is like a little graphic of what exactly I use. As I mentioned, let’s give another example. I’ve been using Minecraft, so let’s continue with that. If I want to change something simple — let’s say I want to double the player size on my Minecraft server — all I have to do is get a PR, commit my changes and all this would be a simple Kubernetes manifest config map change really. I would change the player limit from, say, 30 to 60. And then I would create a PR. I would submit that PR with the YAML changes and merge it.
I have another process: since my game server container binaries and my configs are hosted in the same repository, it then does a new build for that specific PR. There’s not actually much changing in the container itself, because I’m only changing a config setting, but the build does update all of the Java prerequisites and pulls the latest Minecraft server. So there is some use in creating a build without any changes to the Dockerfile or anything on the container side. It’s just an automated step, part of my process. If I wanted to, I could override that and just use an older build, or skip that step.
After the container is built, I deploy everything through Terraform. What I would have to do here, since I host everything as a Kubernetes YAML manifest, I have a tool I use (k2tf, it’s open-source on GitHub) that converts Kubernetes YAML into HCL that can be read by Terraform. I manage everything through Terraform. Those manifests are then rendered into Terraform. And that contains all of the information that I need to deploy. There are some Terraform variables that have to be manually added to the specs. But I do have some automation — using sed and stuff — to automatically insert those variables, like the container tag and the hosts, the service IP addresses I use within Kubernetes. Those are inserted through additional automation. And once I do a Terraform plan, I see all the resources.
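The variable-insertion step could look roughly like this. The talk uses sed; this is a Python equivalent written for illustration, and the `image = "..."` line shape, the function name, and the `container_tag` variable name are assumptions rather than the speaker's actual setup.

```python
import re

# Match image = "<repository>:<tag>". The tag is the text after the last colon,
# so a registry port such as localhost:5000/minecraft:1.0 is left intact.
_IMAGE_TAG = re.compile(r'(image\s*=\s*"[^"]+):[^"/:]+"')


def parameterize_image_tag(hcl: str, variable: str = "container_tag") -> str:
    """Replace a hard-coded image tag with a Terraform variable reference."""
    return _IMAGE_TAG.sub(lambda m: f'{m.group(1)}:${{var.{variable}}}"', hcl)


rendered = 'container {\n  image = "registry.example/minecraft:1.16.5"\n}\n'
print(parameterize_image_tag(rendered))
```

After this pass the tag comes from a Terraform variable, so the same rendered HCL can be planned against whatever image the build step just produced.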
In this case, my Minecraft server would see a change to the config map. And then really nothing else since the service, the stateful set, all the other components are already in place since the server is running right now. I would do that one change, update the config map so, Terraform applies, changes the config map. Then that would allow me to be ready, deployed on Kubernetes. And we could just start playing as soon as that Terraform apply is done. Which is great. It would be like one to two minutes, maybe up to five minutes for the world to reload from its PVC because the container would shut down and recreate itself. So the spin-up time is how long it takes for the Minecraft server to load its world. And potentially I would need to pull the latest Minecraft image. There’s a little delay but within 10 minutes it’s online. Compared to old school, if I had to take down a server, rebuild and do everything manually, that would be four to five hours to get one server online which was much rougher.
To the next slide here. Here’s an example. I’ve been talking about Minecraft a lot, but in the Minecraft example, if I wanted to do a small config change, it’s relatively simple. It’s a persistent world so all the resources are in my cluster running there forever. Until I want to do something with it. So that’s kind of just no hands-on — ever since I created that world, it’s been running.
In Valheim, there’s a victory condition: you can defeat all the bosses in the world. After that, you can still explore the world of Valheim, but there are fewer things to do. So every now and then I will want to do a conditional reset. Once I’ve identified that the server is completed, or the game has been finished, I’ll manually go in, create a new seed, and generate a new world, deleting most of the stuff that held the world and player state.
Thirdly, I have a different type of example here for Rust. Every week I do reset this because that’s kind of generally how Rust servers go. You want to do a frequent reset so you can reset the world and all the prefabs, and all the players can go from none to great throughout the process of the server life cycle. So like during Rust, every Thursday for me, I will generate a new map. Generate a new seed. Update the leader boards. I have scripts to parse information about player stats and send that forward.
And these are three different real-life use cases that I have. I wanted to get more insight into all these steps. Instrumenting with Honeycomb allows me to get a little bit more information; at least now I have the ability to start tracking my deployment times, see all these various deployment use cases, and ensure that they’re running smoothly.
With Minecraft for example, it could be up within less than 10 minutes and now I can start adding markers the next time I do a full deploy, start time, stop time, see it easily in Honeycomb. It’s great.
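As a rough sketch of recording a deploy window, the snippet below builds (but does not send) a request to Honeycomb's Markers API. The endpoint path, header, and body fields reflect my understanding of that API and should be checked against the current documentation; the dataset name, API key, and timestamps are placeholders.

```python
import json
import urllib.request


def build_marker_request(api_key: str, dataset: str, start_time: float,
                         end_time: float, message: str) -> urllib.request.Request:
    """Describe a deploy as a marker spanning start_time..end_time (Unix seconds)."""
    body = {
        "start_time": int(start_time),
        "end_time": int(end_time),
        "message": message,
        "type": "deploy",
    }
    return urllib.request.Request(
        url=f"https://api.honeycomb.io/1/markers/{dataset}",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Honeycomb-Team": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: a deploy that took seven minutes from apply to playable.
request = build_marker_request("YOUR_API_KEY", "kubernetes-logs",
                               1_600_000_000, 1_600_000_420, "minecraft config change")
# urllib.request.urlopen(request) would actually send it.
```

With the start set when the Terraform apply begins and the end set when the world finishes loading, the marker's width on a graph is the time from deployment to playable.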
Lastly, we’re going to go through the observe section. Finding information about our servers and what questions should we ask about how to improve game server performance here.
I have a few questions that I wanted to find for myself. Like what are my resource usages based on player activity? When a server is near full, what does that look like over the time the server is near full? Versus when the server is empty? Versus half capacity? Doing a lot of different comparisons based on resource availability and resource usage. And seeing how I can fine-tune and allocate the correct resources for my specific type of server.
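One way to frame the empty versus half versus near-full comparison is to bucket each resource sample by how full the server was when it was taken. This is a sketch only: the bucket thresholds are arbitrary, and the sample tuples are made-up numbers rather than measurements from the talk.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def load_bucket(players: int, max_players: int) -> str:
    """Label a sample by how full the server was (thresholds are arbitrary)."""
    if players == 0:
        return "empty"
    ratio = players / max_players
    if ratio >= 0.9:
        return "near_full"
    if ratio >= 0.4:
        return "half"
    return "light"


def mean_cpu_by_load(samples: Iterable[Tuple[int, int, float]]) -> Dict[str, float]:
    """Average CPU per load bucket; samples are (players, max_players, cpu_cores)."""
    buckets = defaultdict(list)
    for players, max_players, cpu in samples:
        buckets[load_bucket(players, max_players)].append(cpu)
    return {name: mean(values) for name, values in buckets.items()}


# Made-up samples for a 60-slot server.
samples = [(0, 60, 0.25), (0, 60, 0.75), (30, 60, 1.0), (58, 60, 2.5)]
print(mean_cpu_by_load(samples))
```

Comparing the per-bucket averages gives a starting point for setting resource requests and limits per server type.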
Identifying latency: Since I host everything in the U.S. central region, I wanted to see whether, if I initiate a connection from U.S. West or elsewhere in the world, I can find player latency issues, and also latency in case the server hangs or something like that. This is a tricky question because each type of game has a different way to identify latency. But I have some rudimentary ways of identifying this in my case. There’s more to expand on there.
And then as I mentioned earlier, metrics in relation to player load. So server empty versus server full, what does it look like? And this is going back to installing the Kubernetes agent for Honeycomb. You get a bunch of information just out-of-the-box by having the Honeycomb collectors there. Metadata and metrics, you get resource statuses about the various Kubernetes resources, and then you have label selectors in place, and you can use those, especially for doing matching or WHERE clauses in our query builder.
I can go over this real quick. So you have the Kubernetes indicator, and then we have a cluster name and a bunch of names for all the resources. Then we have statuses. These are Kubernetes statuses like restarting, which would be a pod restart, and the restart count over the lifetime, which is nice to have. We get Kubernetes event messages here too, and the active phase that the resource is in, which is cool too. I’m glad it adds that.
And then all the labels, most of these, I use “app”. Most of these are kind of Kubernetes metadata and other ways to aggregate information. They’re all useful, but I mostly stick with the app because I want to query by my specific game server type, and all are labeled that way under app. Metrics.
And here are some of the metrics that you can get. So you can see CPU, file system, memory, network, so these are some good ways you can identify things going on within your Kubernetes platform and traffic between your interconnected services.
Lastly, this is just like a super basic query that I have used a few times. But it’s kind of like, not necessarily brute force but rich information at a high-level, player statistics based on player count, I should say.
Here I have like my query is just a min-max on CPU usage and memory usage, and I want to count distinct restart times. If something is going on in Kubernetes, I want to keep track of that. Those were things that, when I was doing deploys, this was useful for me. The game servers are hosted in the default namespace in Kubernetes. And then I GROUP BY the pod name, you can see here, like Minecraft and some of the game servers and other services that support that are there.
Then I ORDER BY minimum metric CPU usage. I see this column. Then we can see a few charts here. I guess the chart’s small in this case and we can see where I was looking as far as deployment goes. But at least in COUNT DISTINCT, the fact that I see the chart here, we can see a bunch of restarts going on for one server, so this would be something that I need to diagnose. So just mouse over this. Actually, since it’s green, it correlates with the green here. This means that the Rust server is having issues — at a glance. Because it shouldn’t be restarting, it should be just running persistently. And up here, you will see the status restart is just one referring to it is actively restarting, and this is the count of restarts over a period of time.
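To spell out what that query computes, here is a local re-creation over a handful of in-memory events: filter to the default namespace, group by pod, take min and max of CPU and memory, count distinct restart values, and order by minimum CPU. The field names (`namespace`, `pod`, `cpu`, `memory`, `restarts`) and the sample data are hypothetical, not the agent's real column names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def summarize(events: Iterable[Dict], namespace: str = "default") -> List[Dict]:
    """MIN/MAX of cpu and memory plus COUNT_DISTINCT(restarts), grouped by pod."""
    groups = defaultdict(list)
    for event in events:
        if event["namespace"] == namespace:  # WHERE namespace = "default"
            groups[event["pod"]].append(event)  # GROUP BY pod name
    rows = []
    for pod, pod_events in groups.items():
        cpu = [e["cpu"] for e in pod_events]
        memory = [e["memory"] for e in pod_events]
        rows.append({
            "pod": pod,
            "min_cpu": min(cpu), "max_cpu": max(cpu),
            "min_memory": min(memory), "max_memory": max(memory),
            # Many distinct restart counts in one window means the pod keeps restarting.
            "distinct_restarts": len({e["restarts"] for e in pod_events}),
        })
    return sorted(rows, key=lambda row: row["min_cpu"])  # ORDER BY MIN(cpu)


events = [
    {"namespace": "default", "pod": "minecraft-0", "cpu": 0.8, "memory": 2048, "restarts": 0},
    {"namespace": "default", "pod": "minecraft-0", "cpu": 1.2, "memory": 2300, "restarts": 0},
    {"namespace": "default", "pod": "rust-0", "cpu": 0.1, "memory": 512, "restarts": 3},
    {"namespace": "default", "pod": "rust-0", "cpu": 0.4, "memory": 900, "restarts": 4},
    {"namespace": "kube-system", "pod": "agent", "cpu": 0.05, "memory": 64, "restarts": 0},
]
for row in summarize(events):
    print(row)
```

In this made-up data, `rust-0` shows two distinct restart counts while `minecraft-0` shows one, which is the same at-a-glance signal described for the Rust server.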
These are some ways that I’m able to figure out deployment issues and if I need to do server tuning or increased resource usage. This “at a glance” helps me do this. And this is where I want to go next with instrumenting Honeycomb.
I have two other services that are mentioned in the architecture. We have the API and the frontend.
On the API side, if an endpoint is being called, how long does it take for the endpoint to run? How many times is it being run, and by whom? And then I can do some cool things with cache stat tracking to ensure that my cache is working properly: do some counts, or have a trigger saying that if the cache was not refreshed in the last X minutes or hours, raise an issue, because refreshing is the behavior that I want. There are things I can do later on with these. That would mostly be through libhoney-js and the Golang SDK. Then I would have full coverage of all of my services and all the game servers that are running on Kubernetes. I could have all that data either in one dataset or multiple datasets and use those pieces of information collectively to get multiple signals that, okay, this server is having an issue, let’s delve into it even further. That would be awesome for me.
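A minimal sketch of the two ideas here: timing each endpoint call along with who called it, and checking whether the cache has gone too long without a refresh. Events are appended to a local list as a stand-in for sending them to Honeycomb; the decorator, field names, and example endpoint are all hypothetical and do not use any Honeycomb SDK.

```python
import time
from functools import wraps
from typing import Callable, Dict, List

EVENTS: List[Dict] = []  # stand-in for an event sender


def timed_endpoint(name: str) -> Callable:
    """Record one event per call: endpoint name, caller, and duration in ms."""
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(caller: str, *args, **kwargs):
            start = time.perf_counter()
            try:
                return func(caller, *args, **kwargs)
            finally:
                # Runs on success and on error, so failed calls are timed too.
                EVENTS.append({
                    "endpoint": name,
                    "caller": caller,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapper
    return decorator


def cache_is_stale(last_refresh: float, now: float, max_age_seconds: float) -> bool:
    """True when the cache has gone longer than max_age_seconds without a refresh."""
    return now - last_refresh > max_age_seconds


@timed_endpoint("player_score")
def record_score(caller: str, score: int) -> int:
    return score  # placeholder for the real handler body


record_score("rust-server", 42)
```

Counting events per `endpoint` and `caller` answers how many times and by whom, while `cache_is_stale` is the kind of condition a trigger could evaluate on a schedule.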
The next step after that is CI/CD. For the deployment pipeline that I mentioned earlier, I’d integrate Honeycomb metrics or deployment markers, using something like honeymarker within the deployment process, to figure out how long it takes from deployment to actually playable. That would be really cool. And we can do some more things as well; the API and frontend can also be integrated. These things are not present in my infrastructure at the moment, but that’s the next step for sure.
Yeah, that’s pretty much all I have to go over instrumentation with dedicated game servers using Honeycomb. But hopefully, a good overview of how you can implement this into your Kubernetes infrastructure, and maybe you have thoughts, like what questions can I start asking Honeycomb to get more information about how my system works. For more information, check me out. All my stuff is there. Check out Honeycomb. Thank you for Honeycomb and everybody at hnycon, appreciate the event. It’s been great. Yeah, thanks to everybody.
Ben Darfler [Engineering Manager|Honeycomb]:
Thank you, Michael. That was great. Yeah. We’ll just jump into the questions because we’re going to have to move to the main stage pretty quick here. But we have one question in Slack, are you still using things like metrics server or Prometheus, or is just all of this going to Honeycomb from Kubernetes?
Yeah, I’m using the Kubernetes collector. I have that installed; there’s a DaemonSet that runs on each node within the Kubernetes cluster, and that manages shipping all the information about the Kubernetes pods and the cluster itself. So besides the Kubernetes collector, I don’t use anything else with the infrastructure.
Wow, that’s great. Cool. And I’m curious, Honeycomb is known for our event support more than metric support, but we clearly do have metric support, even beyond what we announced earlier today. What was your experience using Honeycomb for metrics? Anything much different than the use case?
In my case, I wanted the additional events and traces as well. It’s like what I expected to use with standard metrics. I had the same kind of tools and I guess some nicer ways to visualize it. And using the query builder, I was able to access familiar queries that I would in other systems.
A good experience for all.
Yeah, so you’ve got all the tools you need, but now it’s all in one place instead of jumping back and forth.
Right, that’s a big thing for me. Having one tool.
Yeah. Right. That’s awesome. Cool. Well, thank you again. And thank you to all of the folks who presented here on the Mystery Solved track at hnycon: Frank Chen, Michael Ericksen, Glen Mailer, and Michael Simo. And drop in Slack, give them some love. And we’re going to move to the main stage. So join us on the main stage and this stage is now going to be closing. Thank you.