Derived Columns: Adding additional "virtual" fields to your data
Learn how to use Honeycomb Derived Columns to create “virtual fields” that are calculations based on the value of other fields.
Nathan LeClaire [Sales Engineer|Honeycomb]:
Welcome to another episode of Honeycomb training. I’m Nathan LeClaire, an SE at Honeycomb, and today we’re going to talk about using derived columns to add additional virtual fields to your data.
A little bit about the motivation for derived columns and what they’re used for. You might find yourself in situations where the instrumentation that you added just doesn’t provide enough data, or data in the format that you expect. You might have, for instance, a field that’s just an unstructured text field like this that says, “Received an error from goroutine 1337,” and so on. And you might want to transform that into a much more groupable, queryable field value, like just “context deadline exceeded,” which you can see here. And really, there are all kinds of use cases where you might want to take the values of existing columns and turn them into new ones; you might want to do math on them, and that kind of thing. So, derived columns are a way to create virtual columns that are composed from the contents of other columns.
To access the derived column creation menu in the UI, you have a couple of options. The first is that right from the query builder, in the group-by dropdown, you can click on the little Create derived column link that you see over on the left, and that will pop you right into the menu for creating them. You can also go to the dataset settings from the overview. So, on the right-hand side of the query builder, in the side panel, there’s a little go-to-dataset-settings link that you can see on the right of the slide here. And within those dataset settings, on the schema tab, there’s a derived columns section that you can expand to manage your derived columns.
When you get to that creation menu, let’s take a second to look at the properties that it has. The very first is the definition of the derived column. And we’re going to cover some of the syntax for that in a second. It’s really important to note that in this definition, you can only reference literal fields that exist. You cannot reference other derived columns, so that is something to keep in mind: you cannot nest derived columns. The second input that you see there is what we call the alias, or what you’re going to call this derived column as you’re querying it. I often like to have a prefix on field names, so you might call it DC.field. Or if you’re using this for service level indicators, I always like to say, SLI.blah, SLI.foo, and so on and so forth. You can also give it an optional description, which is just a human-readable version, so everyone can know what this column is all about. And you see a little preview area, where you can workshop your derived column on some example data from your dataset.
The syntax for derived columns looks like this. It kind of reminds me of Excel formulas, so we have a bunch of functions that you can call, of which you can see a couple of examples here: the function name with an open paren, and the closing parens at the end. And when you reference columns, they always have a dollar sign in front. So, it’s sort of this PHP-style notation. You can still have dots in column names and that kind of thing. And when you reference literals, if it’s an integer or float, you just put the number in. And if it’s a string, you can use either backticks, single quotes, or double quotes to delineate that string.
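As a small sketch of that syntax, here is a hypothetical derived column that combines a function call, a dollar-sign column reference, and both numeric and string literals (the field name duration_ms is just an example, not from the slides):

```
IF(GT($duration_ms, 1000), "slow", "fast")
```

Here GT is the greater-than comparison, and IF picks one of the two string literals based on its result.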
And once you’ve created a derived column, you can then query this, and look at it in the UI, just like you can with any other field. So, here we can see a derived column that splits the dataset into two values. We have that orange line and the purple line there, and those both reflect groupings where we have different values of that derived column because we have that group by.
Before we get into some examples of what derived columns can do in practice, just a quick note on the supported operations. We do have functions for conditional logic, like if. We also have comparison operators, like less than, greater than, and so on, and so forth. Boolean operators, so that you can have conditional logic. Math operators for doing math operations. And, of course, a couple of other utility functions for casting and doing stuff with timestamps.
Now, we’re going to look at five examples of derived columns in practice. Example number one is reducing the number of groups. While it’s great that Honeycomb supports very, very high-cardinality groupings, a lot of the time you might have kind of a tangled mess like this that you just want to reduce down to a much more limited number of groups without losing any of that data.
Here’s an example of what we’re going to look at, which is the user agent field. You can see that these user agents are kind of similar, in that sometimes they use the same integration, but they also have other things appended, and they have versions and that kind of thing, so there are just too many groups. We want to reduce this down into just one field for the type of integration that these user agents represent.
You can start off by thinking about using the IF and CONTAINS operators. IF accepts either two or three arguments, and it allows you to do conditional logic to determine the value of a derived column. This all starts off with a CONTAINS, where we’re saying: if the request.user_agent field contains Kubernetes, then we want the value to be Kubernetes agent; otherwise, the value will be other. And, of course, we can expand that basic idea into a giant monster like this bad boy. Once we plug that derived column in, we’ve successfully reduced the number of groups down to the high-level way we want to think about it, which is just the underlying integration.
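Expanding the basic IF/CONTAINS idea by nesting IF calls might look something like this sketch (the integration names in the string literals are illustrative, not the actual values from the slide):

```
IF(CONTAINS($request.user_agent, "Kubernetes"), "kubernetes",
  IF(CONTAINS($request.user_agent, "libhoney-go"), "libhoney-go",
    IF(CONTAINS($request.user_agent, "beeline"), "beeline",
      "other")))
```

Each CONTAINS check collapses a whole family of similar user agent strings into a single group value, with "other" as the fallback.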
And another way we can reduce the number of groups is with the BUCKET operator. I would be remiss if I didn’t mention this excellent derived column operator. It takes in a column and buckets it into a predetermined number of groups based on numeric ranges. So, maybe you have a field for file size or something like that in your data, and you want to bucket it out into small, medium, large, and extra-large. Using the BUCKET operator, you can chunk out this numeric field, which previously would have created way too many groupings, and limit it down to a pre-configured set of buckets.
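A sketch of BUCKET on a hypothetical file size field; per Honeycomb’s documented signature, it takes the column, a bucket size, and minimum and maximum bounds (the field name and the byte values here are assumptions for illustration):

```
BUCKET($file.size_bytes, 1000000, 0, 10000000)
```

This would chunk the field into 1 MB-wide buckets between 0 and 10 MB, so grouping by the derived column yields a small, fixed set of range labels instead of one group per distinct size.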
The second example we will look at is log-scale heat maps. A common problem as users are analyzing heat maps in Honeycomb is that the outliers in their data will kind of blow out the chart, which you can see here. There are all these scattershot points that had a lot of latency that we want to squish together. Well, one way we can do that is with the LOG10 derived column operator. If we had a column that was the log10 of this field, we could look at that heat map and review it in more detail.
It looks a little something like this when you add that in. And you can see there are still some outliers at the top there, but they’re not blowing out the scale nearly as badly. And we have a much better feel on this heat map for where the actual median, or sort of the hump of the data, is. That dark line indicates where most of the events fall in terms of latency. It just smoothed out this chart and did that squishing-down operation we were talking about.
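The derived column behind that chart is just a single function call, assuming a latency field named duration_ms (the field name is an assumption):

```
LOG10($duration_ms)
```

Heat-mapping this column compresses the outliers toward the rest of the data, which is what brings the hump of the distribution back into view.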
Example number three: if you have any aspirations at all to do SLOs in Honeycomb, you will need to first define your service level indicators using a derived column. We can, a, make sure that we’re dealing with only the subset of spans that are relevant for that SLI, and, b, define what success actually means. So, we have infinite flexibility to define our service level indicators based on the data in Honeycomb. It’s really only limited by our imagination and the data that we’re sending in.
Here, what we’re doing is saying that if the service name field is set to shopping cart, and the trace.parent_id field does not exist (that is to say, this is a root span in that trace), and the status code is not greater than or equal to 500, we will return true. This ensures that for the shopping cart service, we can create an SLI that will be true when there is not an error, and false when there is an error. And when we’re working with a span that’s not for the shopping cart service at all, it will just be ignored.
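A sketch of that SLI as a derived column, with illustrative field names; the two-argument form of IF returns no value when the condition is false, which is what causes spans from other services to be ignored rather than counted as failures:

```
IF(
  AND(
    EQUALS($service.name, "shopping-cart"),
    NOT(EXISTS($trace.parent_id))
  ),
  LT($http.status_code, 500)
)
```

For root spans of the shopping cart service, LT yields true when the status code is below 500 (success) and false otherwise (failure); everything else gets no value at all.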
Example number four: derived columns can be used to do a fancy little trick to calculate the error rate, that is, how often errors are happening relative to the rest of your traffic in your services. We define a derived column, sort of like what we see at the top, where we say, “Well, if app.error exists, then the value of this should be one; otherwise, the value of this should be zero.” Then, the average of all of those will be an error rate. So, if it’s 0.05, that means 5% of all of the different spans that we have in our infrastructure are errors.
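A minimal sketch of that trick, assuming the error field is named app.error as in the talk:

```
IF(EXISTS($app.error), 1, 0)
```

Running an AVG over this column in the query builder then gives the error rate directly; an average of 0.05 corresponds to 5% of spans carrying an error.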
Here’s an example chart showing this in Honeycomb, where you can see, using markers, that the error rate spikes up when we deploy. We can see that mostly our error rate is pretty low, though there were a few incidents where it spiked up, during and around deploys. So, using derived columns made it possible to visualize not only the raw count of how many errors there were, but the rate of errors for everything in the system.
And last, but not least, you can do regular expressions. As the saying goes: I used to have a problem, then I used regular expressions; now I have two problems. You can have as many problems as you want using derived columns. You might have a field that’s along these lines; maybe it’s just a raw text field. We might want to turn that into something a little more structured. We can use the REG_VALUE operator, which will match the first capturing group in a regular expression that you pass to it, and use that as the value for the derived column. We can take a long unstructured string and start extracting structured data out of it, and we can do that at query time. You can apply these operations post hoc using derived columns, and that allows a lot of flexibility.
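A sketch of REG_VALUE on a hypothetical raw log field, pulling out the goroutine number from messages like the one from the start of the talk (the field name log.message and the pattern are assumptions; backticks delimit the regex per the literal syntax described earlier):

```
REG_VALUE($log.message, `goroutine (\d+)`)
```

The value of the derived column becomes the first capture group, so an event containing “Received an error from goroutine 1337” would yield 1337, a much more groupable value than the full message.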
That was a tutorial on using derived columns. I hope you enjoyed it. Post all of your wacky use cases in the Honeycomb Pollinators channel, go forth, instrument, and enjoy observability.
If you see any typos in this text or have any questions, reach out to firstname.lastname@example.org.