Building Better Go Systems with Logs, Context, and Profiling

[00:00:00]
**Mat Ryer:** Hello, and welcome to Grafana's Big Tent, the podcast all about the people, community, tools, and tech around observability. People and community are the same thing. Tools and tech, the same thing. And what are we doing?

Okay, I'm here today. I'm joined by Donia, Charles, and Bryan. Donia is a software engineer and works in community at Isovalent, at Cisco now, and also co-author of the Learn Go with Pocket-Sized Projects book with some other people. Hi, Donia. How are you doing?

**Donia Chaiehloudj:** Hey. Yeah, I'm happy to be on the podcast. Thank you for the invitation. I'm so happy to see you again here.

**Mat Ryer:** Yeah, it's going to be great. How's the book? I love your book, by the way. I learn by kind of building things and actually doing things. And yeah, I think learning Go like that, with these little tiny projects that you can pick up and finish quickly while learning lessons, I think it's a great format for that. So I'm a big fan myself.

**Donia Chaiehloudj:** Thank you so much. Yeah, I guess that's the way I like to learn too, just to explore and get my hands dirty or have tiny projects. So that's how it came up on the table. And yeah, people are quite happy. They're saying that they're not so pocket sized sometimes, but they're learning things anyway. So that's cool.

**Mat Ryer:** Well, big pockets, especially because your book's so successful, so that's fun.

We're also joined by Charles Korn. Charles, you're principal engineer at Grafana Labs, and you're working on the Mimir query engine at the moment, aren't you?

**Charles Korn:** Yeah, that's right. Thanks for having me on, Mat.

**Mat Ryer:** Yeah, pleasure. Where in the world are you, Charles?

**Charles Korn:** I'm in Melbourne, Australia, so the other side of the world.

**Mat Ryer:** Yes, I went to Melbourne last year. I loved it.

**Charles Korn:** It's all right.

**Mat Ryer:** Yeah, that background hiss that just joined the conversation comes from Bryan Boreham. Bryan, hello. You're a distinguished engineer at Grafana Labs. That's pretty good, isn't it?

**Bryan Boreham:** Well, I like to think so.

[00:02:01]
**Bryan Boreham:** Well, maybe it just means I'm old.

**Mat Ryer:** No, it's good. It's because you're good.

**Bryan Boreham:** Well, thank you very much.

**Mat Ryer:** Yeah, so in this episode, then, we're going to talk specifically about Go. But obviously, it's the Big Tent podcast. At Grafana Labs, we care a lot about observability and want to see if we can give people a sense of things that you should be paying attention to when it comes to this in Go, mistakes maybe we've made in the past, and hopefully a lot of other useful bits of information along the way.

Maybe we could start with this: imagine we're starting a brand new Go project today, all four of us here. We all get fired because of this podcast. It's just imaginary. But this podcast, let me tell you, it doesn't go well, okay? We all get fired, for whatever reason, probably Donia's fault. And what we're going to then create is a new Go project. What do we need to think about when it comes to observability? What should people care about if you're starting a project from scratch, in Go?

**Donia Chaiehloudj:** I would say that I would go simple to start with. We know that we are always refactoring along the way, that priorities change, like real life. But I would try to go for Go standard library as much as possible, because we know that they're stable and not going to be archived tomorrow, basically.

I would also go for well-known libraries that are not standard ones, but are used by most people and well maintained, even though we know there are fewer and fewer contributors in the open source world.

So yeah, I would also think about some standardization from the beginning, the way you want to put your data, your context, the way you want to trace, that kind of thing. So yeah, that's the basic, I would say, to start with.

[00:04:04]
**Mat Ryer:** Yeah, I think that makes sense. I like your point that you're going to refactor. Things are going to change. So that kind of takes a bit of pressure off. It doesn't have to be perfect straight away.

But Charles, would you start with metrics? Would you just have logs? Would you start with something and have profiles and things, or add that later?

**Charles Korn:** For me, the thing I use probably most often, at least in the stuff that I'm working on at the moment, is logs. From logs, you can derive metrics if you really need to. So that's probably where I'd start. They're really easy to get started with. You can dump them into a file, you can dump them to the console, and you can start shipping them off to a system like Loki or something like that.

So yeah, if you're going to start with something, start with logs. And then, as you get more and more sophisticated, add things like metrics and profiles as well.

**Mat Ryer:** Yeah, I think that's quite natural, isn't it, that you would start with logs? I heard someone say, might have even been Bryan in one of his talks, that the first thing you do in a program is log with the Hello World. Bryan, what did you put?

**Bryan Boreham:** So I've got, here, I put "MONKEY." There is software running somewhere that just has "MONKEY" printed out occasionally in some edge case still. Just "MONKEY" in all caps.

**Mat Ryer:** I had a friend at a workplace, and he chose a swear word just so it could grab his attention. And that did, unfortunately, not get spotted in review and did get into production. And I think he did get a little bit in trouble for that.

So Bryan, do you agree? Start with logs? It's all good tips.

**Bryan Boreham:** Yeah, I think so, because it's very natural, just kind of print out. I mean, you start typing and you probably think you know what you're doing, and when it gets above 20 or 30 lines or something like that, you lose track. So you print something out, you know?

[00:06:05]
**Bryan Boreham:** It's done this bit, yeah. But then when you turn that into a so-called real program, you need to make some decisions. If you're printing out two or three lines every operation that you do, then you can't do thousands or millions of operations. You've just got way too many lines.

So segregating them into debug logs, info logs, error logs, that's a pretty common pattern, although one which the Go standard library did not agree with.

**Mat Ryer:** Hmm. Yeah, you just get kind of a flat log line, don't you? There's nothing really special. Of course, there's the new slog structured logging. Would you reach for something like that straight away?

**Bryan Boreham:** Yeah, I think slog's very good and gives you that little bit of extra flexibility in log levels. And it also lets you have key-value pairs, but it's still in the log line.

For anyone that doesn't know, yeah, there's structured logs. That's the kind of thing that divides people, like choice of editor or something like that, whether you want an extremely structured log line, maybe put the whole thing in JSON, for instance, which I personally don't like because it just seems to get in the way of understanding when you have to kind of fight your way past all the curly brackets and the escaping things and so on.

So I tend to just have a line of text that's meant for humans to read in my logs, but it's certainly good. And so I would kind of structure the conversation like: you start out by just printing some lines that are meant for humans, and then you realize you've got way too many lines, and some of those go to metrics because the interesting thing is just to count it.

And then some can go to traces if you have the infrastructure set up, or maybe use a SaaS tracing system. Then that's kind of like logs, except they all live together.

[00:08:07]
**Mat Ryer:** Right. Yeah, traces having a trace ID in a log line, is that enough to say that these log lines essentially belong together?

**Bryan Boreham:** Yeah, it can be, but tracing adds that explicit parent-child relationship, and everything's always got a beginning and an end. So I think tracing is kind of the superpower of figuring out, with any complicated program, what happened.

It's not very good for an audio medium to start describing a Gantt chart visualization.

**Mat Ryer:** Yeah, you can try. You could describe a flame graph, though, couldn't you?

**Bryan Boreham:** Well, flame graphs are different. Flame graphs are not ordered by when things happened, for instance. So a Gantt chart is ordered left to right by the time that things happened, and then top to bottom in a sort of waterfall of what caused what, what triggered what.

And you can see patterns in that visualization. You can see when there's a very linear sequence of things. I'm waving my hand for the benefit of the tape. I'm waving my hand in a diagonal motion. You can see when a lot of things kick off in parallel. You can see when a lot of things end in parallel, which might mean that they're all bottlenecked on a mutex or something like that.

So yeah, that visualization, a time-based visualization, is why I call it the superpower to really see what's going on.

And a flame graph, I mean, I love flame graphs. A flame graph is an amazing idea, but part of what it does is it dispenses with the time axis and says, we're going to try and show you what's important, what there's a lot of. We're going to use the x-axis to mean quantity rather than time.

[00:10:10]
**Mat Ryer:** So the idea being it's spending more resources in that bit, and therefore, if you're going to optimize something, go for the big hanging fruit, would you say?

**Bryan Boreham:** I suppose so.

**Mat Ryer:** So you mentioned that you would do that only in advanced projects or complicated projects. How do you know when it's time to reach for tracing?

**Bryan Boreham:** Well, for myself, I said 20 or 30 lines, but it gets complicated. So my bar is very low. My ability to concentrate on things is quite poor.

But it also depends, because tracing is quite a complicated thing, or people find it complicated to set up. And whereas logs, just stick it in a file and then read it, it's orders of magnitude different.

But tracing really comes into its own when you have multiple bits in what you call a distributed system, multiple front end and back end, or multiple bits of back end, or something like that. You pass the same ID around to everything because it's related, and then they're all logging or they're all reporting using that same ID, and that allows you to then tie the whole process together across all these multiple systems.

**Mat Ryer:** Yeah, it's very cool, and of course makes sense at scale.

So we talked about, sometimes you just want to count something that's happening in logs and turn it into a metric. I just want to dig in on that so that people really understand that, because that turns out to be really quite powerful. Can we think of a specific example of where a log is turned into a metric? And what would you use that for?

**Charles Korn:** I've got one example from last week, actually.

**Mat Ryer:** You've been around for a lot of things recently, Charles. Well, last week's fine. We'll settle for that. Go for it.

**Charles Korn:** So yeah, we've got a bunch of Go services at Grafana Labs, and unfortunately, occasionally they panic, and they dump the stack trace to standard error, and those lines get picked up by our logging system.

And it's really useful to be able to show that on a graph, how often a thing's panicking. So yeah, we actually have a system where it'll look at the logs, count the number of things that look like a panic, and turn that into a metric. And then we can alert on that metric just like any other metric. That's really helpful. And yeah, I mean, so we can get paged when stuff breaks.
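
In Charles' setup the metric is derived from the logs themselves; as a complementary in-process sketch, you can also count recovered panics directly. Here a plain atomic counter stands in for a real metrics client, and the wrapper shape is made up for illustration:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var panicsTotal atomic.Int64 // stand-in for a real counter metric

// withRecover runs f, converting any panic into a count and a log line
// instead of crashing the whole process.
func withRecover(f func()) {
	defer func() {
		if r := recover(); r != nil {
			panicsTotal.Add(1)
			fmt.Println("panic recovered:", r) // would go to the logging system
		}
	}()
	f()
}

func main() {
	withRecover(func() { panic("boom") })
	withRecover(func() {}) // healthy path, no increment
	fmt.Println("panics so far:", panicsTotal.Load())
}
```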

[00:12:10]
**Mat Ryer:** Yeah, so you literally then get a graph that shows you how many panics you're having.

**Charles Korn:** Exactly. And you can have alerts on that.

**Mat Ryer:** So that is quite interesting. So you sort of like, we'll tolerate some panics. Will you?

**Charles Korn:** No, we get paged for every panic now. Right, because yeah, we had a couple of issues where we weren't doing that, and things broke. So now we pay attention to that stuff a little bit more.

**Mat Ryer:** Yeah, I see. That's not the same as errors, is it? Like, you might have a ton of errors, right?

**Charles Korn:** Yeah, exactly.

**Mat Ryer:** Yeah, that's very cool. I did work with somebody who had so many alerts firing that they basically then had a metric to count how many alerts were firing. And that is what they would then alert on.

So they were happy with thousands of alerts firing, but once it reached a certain point, it's like, now come on, that's too many now. And then they would... so that's kind of amazing.

But I do like the idea, and it's one for people who are sort of new, or who haven't done much observability themselves: you can create this numerical data from what's happening inside your log data, especially if it's structured logs. We obviously have technology that does this very well, and there are others available, and of course I think Loki does some good work of extracting meaning from even unstructured logs.

Yeah, so that's very cool. So does Go help here? Does Go make this easy? Or is it tricky in Go? When they created the language, testing was a first-class concern. Was observability a first-class concern? And did they do anything good or anything not good?

**Bryan Boreham:** So they didn't do a lot, I would say. The Go standard library had a very, very simple log interface, arguably too simple, so everybody replaced that with some sort of third-party thing.

But really nothing for tracing, and it didn't even have context in the first few releases of Go. And then when they did add context, you have to remember to put it in yourself. So that's what lets you thread a trace from one call to the next and therefore lets you do that visualization to see what called what.

[00:16:14]
**Bryan Boreham:** And again, really nothing for metrics in the standard library. There are some internal metrics inside the Go runtime, which are really, really good if you need to know that kind of thing about, say, how the garbage collector is running.

But yeah, I think that certainly originally was kind of a TBD. And then they added structured logging in slog, which I think has pretty much taken out most of the reasons why you'd use a third party. So Go is pretty good now for logs.

And then it's got context, which lets you do tracing, but you still have to bring something in, like OpenTelemetry is the sort of standard. It used to be a thing called OpenTracing, but they sort of folded it into the new standard, OpenTelemetry. So that's what people should be using for distributed tracing.

And metrics, you're still left with needing some kind of third-party provision for that. And we could get into that.

I just remembered, I mean, there's another kind of tracing built into Go, which is called execution tracing. Or I usually talk about `go tool trace`, which is what you type in if you want to use it. And that's a very, very detailed trace of what's going on.

So the distributed tracing that I was talking about earlier, that's intended for production systems. You can capture millions of requests over a day, say, and get a picture of all of them. The `go tool trace` execution tracing is more intended for capturing like five seconds of what happened in your program, or maybe a bit more.

And they're certainly working on reducing the overhead, but it's a vastly different level of detail. So that is built into Go. The `go tool trace`, the execution tracing, I would describe that as extremely cool. And if cool things attract you, then go find a video where someone's demoing that, because it's quite remarkable what you can get out of it. It is hard to get into. It's very hard to understand at a glance what's going on in that tool.

**Mat Ryer:** No, no, no, yeah. So the tracing with the context is an interesting point, and this could be our first gotcha warning. Because it's possible, if you...

So the way it works, I guess, is when the request comes in, maybe there's a trace ID in a header or something, and you'll have some kind of middleware or something that will grab that and put it into the context, into the Go context as a value, I suppose.

And then that gets passed through all the execution everywhere, and then anything that's logging or contributing data, we need to tie it back to that kind of transaction or that particular operation. The trace ID should do that.

So if you kind of just use `context.Background()` suddenly in your package somewhere, or you just don't respect the passing of that context, I suppose that breaks the tracing, doesn't it?

[00:18:16]
**Bryan Boreham:** It absolutely does. So there's your first gotcha warning. Yeah, try to pass context. I wouldn't say absolutely everywhere, because there's a little bit of cost in terms of extra code and runtime and so on; at the very, very lowest level, in the very inner loop of your program, you might want to think twice about it.

But pretty much everywhere else, you will benefit from passing the context because you can trace things, you can cancel things.

Do I wish it was built in, like beyond just having it as a value that you pass around? I do. I mean, there are caveats to that, but life would be so much simpler if you could just ask the Go runtime, how did we get here?

**Donia Chaiehloudj:** I just wanted to add that, depending on the architecture of your project, for example, if you're using domain layers and domain architecture, it's very easy to identify where you should put your context, depending on where you are in the different layers.

And I feel like at some point you know that you're so domain and low level that you might not need the context anymore. You're just playing with data, let's say, basically.

So I would say that above, you have to use the context because you're playing with the request and all these things that you want to trace, definitely.

There is actually one rule that we define in the book about context. It's like, pass it almost everywhere, yes. And if you don't need it, just use the blank identifier, and maybe one day you're going to use it. So you don't have to change the signature of your functions all the time.

**Mat Ryer:** Yeah, I find that often, that I want to do more logging while I'm still building the thing. And I tend to build in the wild sometimes. I'll have something that's alongside the code. I'll have new code that's not yet being touched, but we can test it in the sort of real world. And then later maybe I'd tidy up some of that logging and things where I don't need it.

So that is a nice trick. And the underscore, the little underscore trick, tells you, yes, it's an argument, but we're not going to use it in this particular take on it. So that's a nice tip. Top tip.

Charles, what were you going to say?

**Charles Korn:** I was just going to go back to Bryan's point about, it could be really nice if it sort of happened automatically, if you didn't have to pass the context around like that. That's what happens in a lot of other languages, right? So the JVM ecosystem, you can do things like that, and then it magically does it based on the thread.

[00:22:19]
**Charles Korn:** So yeah, I wonder if something like that could be done in Go, because that'd be really cool.

**Bryan Boreham:** Yeah, but I suppose the Go philosophy really is kind of against the magic and against hidden stuff.

**Charles Korn:** It's true. It's true.

**Bryan Boreham:** But yeah, I don't know. I have seen some proposals where you could have some kind of new functions where you just say, get the context.

There is some complexity around it. Sometimes you do want to change the context or add to it, and sometimes you do just want a completely different context for some reason, don't you?

**Charles Korn:** Yeah, or you don't know what the heritage of that thing is, right? Like you've got a pool of goroutines servicing requests. It's not clear what the path is into that one and what the link is for that trace.

**Mat Ryer:** Yeah, yeah, we just have to sort it out. Just sort it out.

Yeah, it's one of these things where the designer of the system has to bear the burden of understanding what Charles was talking about. If you have a pool of threads, well, it'd be goroutines in Go, the point is to try and run a bunch of things in parallel. And tracking where those came from, what the context is for each one, is something that ultimately the designer of the system has to understand, what it means and where it goes.

And so the choice that Go has right now is you just always have to do that. You always have to track the context. Yeah, it's a little disappointing.

Well, yeah, but the benefit is you really can see what's happening, which is nice.

And errors have the same thing, right? Some languages have exceptions, or they have an alternative kind of chain for managing and handling errors. In Go, they're just normal values that you return. Does that get in the way as well, or is that useful? How do we feel that design impacts things when it comes to observability?

**Charles Korn:** One thing I do miss sometimes coming from other languages, you've got an exception type, and each of those exceptions is a particular type. It's a file-not-found error or a network error or whatever it is.

Whereas with Go, most of those things are just strings. So if you're going to do any kind of analysis, like how many file-not-found errors did I get, that could be quite tricky in Go because they're just a whole bunch of strings. Each of the strings might be slightly different. Maybe some of them prefix something, some of them prefix something else. Some of them have the file name in them.

You could have some way of categorizing them and merging those into one group, and that can be quite tricky with Go's focus on strings, I guess, sort of stringly typed errors.

[00:24:20]
**Charles Korn:** But at the same time, it makes it really simple to create these really rich errors. They're really easy, as an engineer trying to solve a problem, to get that context of what's going on. It's a bit of good and bad.

**Mat Ryer:** Yeah, and whatever you come up with as well, it has to be easy to maintain. As people come in, new people come and add things, if you've got these kind of complicated rules that you're following, or even special error types can be kind of tricky, you have to then do the work of making sure that everyone kind of follows this as well, don't you?

**Donia Chaiehloudj:** Yeah. So I started my career with Go. And for me, it was very natural to have error types. It was like, no, I want to create a new type of error if I have something specific. And it was a reflex to check whether the type of error I wanted already existed in the library I was using.

And I did one year of Java in a company, and I was playing with exceptions, and I was confused, actually. I was like, I want to define my own error type in that case because it's something very specific, you know? And I want to type it for that type of library that I'm dealing with.

So that's very interesting, what Charles is talking about, the way exceptions can in general be easier, let's say. But I find that error types, and being more granular, are easier to read in the code and to understand when you're debugging too.

**Mat Ryer:** Yeah. And then what happens to errors? Do they just get logged out, or do we count them? Does it depend?

**Donia Chaiehloudj:** Yeah, I guess that in our case, it was more for debugging purposes, and it was not for counting. At the time we were dealing with processing of satellite images, so we wanted to trace that massive pipeline, basically each step of the processing, what was going on. So it was not about counting.

**Mat Ryer:** Right. So in Go at the moment, because an error is just another value, you have to write the code somehow to, if you want that in your logs, or your traces, or you want to count it as a metric, you're going to need some code somewhere that does that.

I mean, certainly at Grafana Labs, we put that in a framework, typically called the server framework, which is part of our dskit, the distributed systems kit library. But yeah, again, it's one of these things that is not built in, is not kind of batteries included in Go: you have to have some code somewhere that figures out an error happened.

Yeah. So then profiles. I know Go has pprof. What is pprof? Charles, do you know what pprof is?

[00:28:23]
**Charles Korn:** Sure. Yeah, so it's a tool that allows you to measure the performance of your Go application. And it can show you a bunch of different profiles.

So the one that I use most often is CPU time. It's literally just how much time is spent in different functions. That's usually presented in a flame graph, like the flame graphs Bryan was talking about before.

And the other one I spend a lot of time looking at is memory consumption, like peak in-use memory consumption. And again, it's exactly what it sounds like. It shows you which functions are holding the most memory alive at any one time.

And yeah, that's really useful for working out how to cut those things back, make things faster, make things less resource intensive. And yeah, this is really cool.

So you can run this as part of the Go testing command. You can run `go test` and ask it to spit out a profile as part of that invocation, and then you can load it up into pprof, explore it either in the web browser or in your terminal, and have a look at that data and use that to make optimizations to your code.

Or you can send it off to a continuous profiling system, and then get that from production, which is really cool, because then you get to see how your application behaves with real users throwing real traffic at it and all the weird and wonderful things that customers do.

**Mat Ryer:** Yeah, okay. So do we get that for free then? You can just run the test and ask for the profile, or do you have to go in and do something special in the code?

**Charles Korn:** There's nothing special in the code. You have to pass a command-line flag to `go test`, and then it spits out the profile file.

**Mat Ryer:** Yeah, that's very cool. That's helpful.

[00:30:01]
**Mat Ryer:** Usually with things that are sort of out of the box like this, they're good for the basic case, but then if things get more advanced, they become kind of a pain sometimes. Does that happen in this case with profiles?

**Charles Korn:** At least not in my experience. I find profiles really useful. I look at profiles every couple of days, at least with what I'm working on at the moment, which is really cool.

Something that's quite nice is that you can actually set tags on profiles. So just like you add labels to metrics, or structured metadata to your logs, it's possible to add tags like a trace ID, for example, to profiles. Then it will be propagated with that profile.

You can then do things like, "Show me the profile for this particular request." So if you've got a really painful, really long-running request and you've captured the profile for that, you can go and look at that profile and understand where it was spending its time, which is great if you haven't got tracing, for example, that shows you where the time is.

**Donia Chaiehloudj:** I have a question. I've always been very intimidated at the beginning of my career by `pprof`, actually. Do you have any advice for someone who would start with it?

**Charles Korn:** I was very intimidated as well, actually. When I first joined Grafana Labs, I went to a session that Bryan ran and introduced me to `pprof`.

**Mat Ryer:** That's why you're intimidated. So Bryan is formidable. Where's the scary man?

**Bryan Boreham:** You may be able to find a version of that on YouTube, actually. I've given one or two talks over the years. I basically search for my name on YouTube, because it's fairly unique. Let's see if I'm talking about profiling. There's a lot of information there.

I was just thinking to myself, actually, there were one or two gotchas. The big one, I think, that catches some people is that they dive into CPU profiling when they don't actually have a CPU problem.

[00:32:06]
**Bryan Boreham:** They're not running out of CPU. They've got a program that's slow, and so they think, "Oh, profiling." Then it turns out that this program is slow because it's waiting on some other program, like maybe a database, and a profile will not show you that.

A CPU profile in Go will show you the time it's spent on CPU actually doing things, maybe reading the results back from the database. So that's one to watch out for.

The simplest way to watch out for it is to watch your CPU meter. If it's ticking along at sort of 0.1 CPU usage or something like that, then it's very unlikely that profiling is going to get you anywhere. Whereas if the fans are all running, 18 CPUs going in parallel, then that's probably a good one to point the CPU profiler at.

So I guess the next thing is: it's almost always memory allocation in Go that is causing it. If you do have a CPU problem, look at the memory profile, is my next top tip.

First of all, make sure you have a CPU problem. Secondly, don't look at the CPU profile. Look at the memory profile.

There are four different profiles that come out of the Go memory profiler, and the one to look at is called `alloc_space`. `inuse_space` is the one to look at if your program is crashing because it's out of memory.

[00:34:08]
**Bryan Boreham:** The other two are `inuse_objects` and `alloc_objects`. People think that the number of objects they've allocated is going to be the thing, right? "I've allocated millions, billions of objects. Surely that must have a cost."

But it turns out that the Go garbage collector is really very, very efficient at allocating any number of individual objects. What's actually slow is tracing all the pointers in those objects.

So actually, the cost of memory allocation is much more driven by the space that you allocate, because that space generally has pointers in it, pointers to it. `alloc_space` is the one to look at if you have a performance problem.

Only once you're utterly convinced that your memory allocation is really well optimized, go back and look at the CPU profile and see what else is going on.
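
A tiny illustration of why Bryan's `alloc_space` advice tends to pay off: growing a slice with `append` reallocates repeatedly, while preallocating does it once. `testing.AllocsPerRun` gives a rough per-call count of the same signal `alloc_space` aggregates across a whole program:

```go
package main

import (
	"fmt"
	"testing"
)

// grow builds the slice with no capacity hint; append reallocates as it grows.
func grow(n int) []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// prealloc asks for all the capacity up front: one allocation.
func prealloc(n int) []int {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

func main() {
	fmt.Println("grow allocs:", testing.AllocsPerRun(100, func() { grow(10000) }))
	fmt.Println("prealloc allocs:", testing.AllocsPerRun(100, func() { prealloc(10000) }))
}
```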

**Mat Ryer:** Yeah, those are good tips.

The other one, I think, is for tracing as well. Sometimes it's possible that you sort of trace the wrong bits. There's another gotcha. So you mentioned thinking about where the actual problem is. I saw a case once where we basically just had the tracing set up, but it wasn't capturing the expensive bits. So everything looked fast, even though in practice it wasn't.

This was before Go, though; I think the way that you do that in Go probably handles that all right.

**Bryan Boreham:** Because you can get kind of dark space in a trace. Hopefully a trace will have a big long line if something took five minutes to run. You'll have a bar in your tracing view that's five minutes long.

Then the problem you might have is you don't know what happened in those five minutes, right? Probably it was waiting on something else, but if you don't have a trace for that, if you don't have some indication of what was going on, what do you do?

[00:36:10]
**Bryan Boreham:** In Grafana, as part of the Grafana Cloud product, if you have both traces and profiles, it will actually bring up the profile under a trace. If you're looking at a trace, it'll just automatically do that. You don't have to ask for it.

It surprised me the first time I saw it, but suddenly I've got a profile in the middle of this trace. That can be really cool.

But again, you have to look to see if it was actually burning CPU at that time. If something took five minutes, and the GUI has presented a beautifully formatted view of 100 milliseconds, that's not where the problem is. You've got to have a sense of proportion. You've got to look for the big... the biggest fruit also hangs the lowest, if I can invent a quote.

**Mat Ryer:** I might just speak to HR about this.

**Bryan Boreham:** It's not a Charles issue.

**Mat Ryer:** This is why we all get fired.

**Bryan Boreham:** It would just rip off this now, and it gets worse.

**Mat Ryer:** I didn't say it for dad.

Okay, so yeah, this is very cool. And eBPF, so eBPF is very interesting. Donia, you and the work you do, and the company you work for, there's a lot in this space. Is Go a good eBPF candidate?

**Donia Chaiehloudj:** Yeah, I would say that, okay, it depends what, again, you want to observe. But let's say that eBPF and Go pair really well, I feel.

[00:38:12]
**Donia Chaiehloudj:** eBPF, for people who maybe don't know what it is, is a way to write small C programs, BPF programs, that run in the Linux kernel to dynamically observe or secure your kernel. So you don't have to recompile your kernel. You can just write a program, run it, and it will be dynamically loaded into the kernel.

That's very powerful. But it can be very daunting and costly to write BPF programs. So having Go wrappers on top of that is very interesting. That's one aspect where Go and eBPF work very well.

eBPF in general is interesting for observing parts of your system that you can't observe otherwise, that you don't have access to between your user space and your kernel. That's interesting.

Something I personally like about eBPF is that you can access what I call the dark sides of your kernel, parts that you can't access from user space. That's what I find pretty interesting.

**Mat Ryer:** Like the dark side of the moon.

**Donia Chaiehloudj:** Yeah, yeah. I mean, for example, XDP. You can see packets coming into your machine before they even reach the kernel's networking stack, before anything else happens. That's actually very powerful, especially for security, or for performance too.

So yeah, your question was, is eBPF interesting to observe a Go system? I would say that today we have many tools in the open source ecosystem using eBPF and Go, deploying easily and running as operators on Kubernetes clusters running Go applications. So that's definitely pretty standard now.

[00:40:16]
**Donia Chaiehloudj:** Cilium, for example, is based on eBPF, and it is used as the default CNI on most of the cloud providers now, and you get all the primitive metrics coming with that. Then you can just run your Grafana, add those metrics, basically, and observe your packets, your policies, what you're running.

So I think that today it pairs pretty well, and it comes as a default if you're running a very large distributed system on any cloud provider, basically. I mean, you can run on Docker too, but not a lot of people do that.

**Mat Ryer:** Yeah. So that's very cool. Cilium and Grafana dashboards go together well, do they?

**Donia Chaiehloudj:** Yeah, definitely. You enable one option and you can just explore your metrics and run Grafana. Just do your port-forward, and that's very easy to do.

So you can observe policies, networks, what's happening on your system from a network point of view, but also in terms of requests since you can observe on layer seven. You have this granularity, so that's very powerful.

What is interesting, and maybe I didn't mention it, is that eBPF runs in the Linux kernel, so it's very performant. You're not paying the user-space and CPU costs that you would with most of the operators or agents you have today. That's why it's very interesting. And you bypass the network layers.

[00:42:20]
**Mat Ryer:** So what does it look like? Are you essentially listening to events down in that layer?

**Donia Chaiehloudj:** Yeah. Basically how it works is you have hook points on events, and you have available events on the Linux kernel. For example, if you're opening a file, you can observe any file opens on the system. Or you can observe any packet incoming or outgoing, ingress and egress.

There's a very famous diagram of all the things that you have on the system and all the hook points that you have on memory or CPU or whatever, all the things that are happening on the system. You can hook basically anywhere. You have different hook points.

I actually gave a talk like two weeks ago about hook points and their gotchas. That can be a bit tricky sometimes. You can actually miss events: CPU load and context switching can make your system miss some probes, so you have to be careful about what you're observing and the way you're counting.

There is a basic command where you can see all the missed probes on your Linux kernel. I think even when you're not doing much, there's a bunch of events being missed. So that's very, very interesting.

Anyway, if you're doing that in production, there are some tools to see that, observe that, read that, show programs, and stuff. But yeah, that can be a little bit tricky, definitely.

**Mat Ryer:** Yeah, I didn't realize that. But there's a lot of things like that, I find, where you don't fully get the full story, but it's just good enough, so it's okay. But yeah, it's kind of crazy that you think, "Oh yeah, you can just miss this stuff."

Anything else on eBPF that we want to chat about?

[00:44:24]
**Bryan Boreham:** I think it's got its problems with Go related to the stuff we were talking about earlier, like context. Because you start in the kernel with eBPF, and then you've got to try and figure out what was going on in this program, which first of all might not be a Go program. You've got to realize it's a Go program and then kind of hunt around for what's going on, because it's not something that's exposed very well to the eBPF running inside the kernel.

There might be a context, there might not be a context. The way that things happen in a Go program is not exposed very well to eBPF running inside the kernel, so people struggle with that.

Grafana actually worked on a thing that is now part of OpenTelemetry called OBI, or eBPF Instrumentation, I don't know if it's a Star Wars reference. Probably. Am I allowed to say that?

**Mat Ryer:** Other space shows are available.

Yep. We should list some, just to be safe. Star Trek, Space: 1999. Yeah, there are others.

**Bryan Boreham:** What's it stand for? OpenTelemetry eBPF Instrumentation. Okay, so it should be OEI, I guess, that's why they went for OBI. It was a good name.

So yeah, I guess there were a bunch of different companies working on this kind of tooling. It's not just for Go, but Go is one of the things that it does, and one of the things that Grafana Labs people were working on.

Putting it in OpenTelemetry, which is a project owned in common by the CNCF, the Cloud Native Computing Foundation, means it's not owned by any one company, and all the companies can kind of collaborate on that one thing and try and improve that.

[00:46:33]
**Bryan Boreham:** It currently requires some very bright people to wrangle that code and get eBPF to figure out what's going on inside your Go program.

**Mat Ryer:** So does it add metadata to it so that eBPF can access that, or is it new functions that you use in conjunction with the data you get from eBPF? What does it actually do?

**Bryan Boreham:** Yeah, I don't think it's changing anything in your code. It's looking at it from the outside and then trying to figure out what's going on.

**Mat Ryer:** Hmm. Yeah, but it's funny, because eBPF is just so low level, and computers do get weird down at that level, don't they? Which is strange, given that they're supposed to be pure logic machines. You know what I mean? It's meant to be a pure logic machine, but sometimes when I open a window, it just doesn't work.

But that's what you said earlier, Bryan, about it not taking long before you have a few pieces interacting in ways where the number of possible combinations are so great that it kind of becomes almost impossible to guess what they're going to do, and it's why we need observability. But that is just interesting, I think, at the very least.

**Bryan Boreham:** No, I think we're in Ant-Man in the quantum realm.

**Mat Ryer:** Right. Yeah, I mean, when you think of the scale down at that level, I read a book recently and learned about a particular combination of logic gates, NAND gates specifically, that make a bit. It shows how you make a bit, where you can store information, set information, retrieve it, and these kinds of things, just in a very mechanical way.

[00:48:36]
**Mat Ryer:** And it all builds from that. But we're going to run out of time. Yeah, we can't do all of it, all the way up to keyboards and mice.

So if someone's writing a library or a package in Go, do you want them to do all the instrumentation for you and just provide that? Or should they leave that to the user?

I always say, if you've got concurrency, maybe leave that out of your package. Let the user do the concurrency so they can manage it, and just give them the pieces they need, because I feel like that's easier and they get more control. What about instrumentation? Do you think packages should provide their own instrumentation?

**Donia Chaiehloudj:** I would say that it depends on the size of the project. Again, if you're running a big project with a lot of resources, people, I would say that it can be interesting to go with your tools and have control of what you're doing.

If it's not the case, if you're running a small project or if you're just starting somewhere and you want to have something out, I would go with building libraries and have everything baked in if possible, just to go straight to the point of having an app running.

So yeah, in the end it depends where you are in the project, I would say, but it comes with a cost all the time. It has to be a trade-off.

[00:50:37]
**Bryan Boreham:** It is one of these things that historically was not well baked into the system, like logging, for instance, because the base Go log library was too naive for most people. Then people picked a third-party one, and different people picked a different one.

It basically got to the point where libraries would just not log at all. I think that was the most common reaction to that.

So I think because so many people have congregated around OpenTelemetry for tracing, certainly, that we're in a better place there. The previous standard, OpenTracing, you can put in a bridge to turn that into OpenTelemetry.

So that's a situation where it's definitely better if the library does its own tracing. But there are still questions. Sometimes it does too much, sometimes it does too little. That's what I find, that I'm sometimes frustrated that I'm getting way too many low-level traces.

Just a simple example: in OpenTelemetry itself, they give you a way to trace HTTP calls that your program might be making. So that's nice. The standard behavior is you get tracing spans, so each one of these comes out as a different line on your GUI. There's one when it opens the connection, one when it makes the DNS request and gets the address it's going to talk to, then one where it starts talking on the connection, and then one where it gets the first bit of data back from the other end. You end up with like five lines for one call.

[00:52:46]
**Bryan Boreham:** Hopefully there's an option, but you have to know about this. In the OpenTelemetry HTTP tracing library, you can ask it just to put those points in time, I don't know what they're called, events, within the one span. So that's another tip.

But the basic idea applies to every library. You're never going to satisfy everyone. Some people are going to think you've got too much detail. Some people are going to think you've got not enough detail. So you effectively need a kind of verbosity: how much tracing do you want out of this library?

I think that's going to be the next level of convenience that people want.

**Donia Chaiehloudj:** Do you know why the Go team waited so long before putting `slog` out?

My hypothesis is that they were waiting for third parties and big companies to write libraries, so they could take all their ideas and end up with something simple at the end, and just steal everything. Either that, or it was just not a priority for the Go team.

**Bryan Boreham:** I was going to say there was a talk at GopherCon UK that I went to two years ago about structured logging and the `slog` library. I know the name of the person who gave the talk is escaping me.

**Mat Ryer:** Was it Jonathan Amsterdam?

**Bryan Boreham:** Yes. There you go. I knew it was Jonathan.

[00:54:49]
**Mat Ryer:** That name does not escape me, because it's one of the coolest names I've ever heard. It sounds like the kind of name I'd come up with to get out of trouble, as a fake name. They're like, "What's your name?" and I'm like, "I don't know, Jonathan Amsterdam?" Something like that.

**Bryan Boreham:** But yeah, I can recommend that talk, because Jonathan worked on the `slog` library, and he talks quite a lot about the backstory.

I think the basic reason why it took so long is priorities. The Go team funded by Google is quite small. They have essentially an infinite list of things that people might want, whether they want faster compilers or better observability or templates or so many things that people want.

At the end of the day, it comes down to priorities and a little bit, do they have a good idea? I think your point about being able to look across the landscape of third-party solutions that people would come up with and cherry-pick what seemed like good ideas, there's definitely a benefit to waiting for that to happen.

So I think that's what goes into why `slog` took so long. Why was it such a slog?

**Donia Chaiehloudj:** Well, generics were more important than `slog` at the end.

**Bryan Boreham:** Yeah, I'd have to agree with that.

**Mat Ryer:** Well, I suppose you could always be doing the `slog` stuff yourself, and there were third parties that let you do structured logging. But yeah, I like that they wait and see what emerges and then pick from the best.

**Bryan Boreham:** It's harder for them because once they've made decisions, they're sort of baked in to some extent. So they do have to be a bit more careful, don't they?

Although there's always a lot of exciting things to look at in either `GODEBUG` or `GOEXPERIMENT` settings that I always think are worth a look. Each time a new version of Go comes out, take a look at what's there as an experiment, because that's probably something that they're going to put into the next version or the one after, and you turn it on through the `GOEXPERIMENT` setting and you get to play with it and give feedback.

[00:56:54]
**Mat Ryer:** Yeah, okay, cool. Well, talking of feedback, let's finally end on what we want Go to change so that it makes our lives easier when it comes to observability. Other things that we wish were different in Go.

We talked a bit about some of them. Context being automatic is one option, potentially. Any others? Bryan, you had an idea about actually printing out a struct properly, didn't you?

**Bryan Boreham:** Well, that was in the debugger, which I suppose is another slant on observability. I use the debugger in VS Code, which is a thing called Delve, really, the debugging tool under the covers, and it makes certain decisions about how things are going to be drawn. In a really big program, those get frustrating.

So I would love more flexibility, and it's probably more in the debugging tool than in Go itself, although it might be something that goes into, for instance, a struct tag.

Just a simple example: you might have something that's a slice of bytes, and when you're looking at the debugger, should it be printed out as a string, as a hex number, as a series of decimal numbers? Does it mean something else? Is it an IPv6 address? Donia knows all about those.

So that kind of being able to add a little bit more semantic information somewhere to tell the debugger, and then have the debugger show it to you in a much more convenient way, that's on my wish list.
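Bryan's point is easy to demonstrate: the same bytes admit several legitimate renderings, and nothing in the value itself says which one a debugger should pick. A small stdlib sketch:

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net"
)

// renderings shows three equally valid views of the same 16 bytes. A
// debugger has no way to know which one you want without some extra
// semantic information, like the struct tag Bryan suggests.
func renderings(b []byte) (asString, asHex, asIP string) {
	return string(b), hex.EncodeToString(b), net.IP(b).String()
}

func main() {
	b := []byte{0x20, 0x01, 0x0d, 0xb8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1}
	s, h, ip := renderings(b)
	fmt.Printf("string: %q\nhex: %s\nip: %s\n", s, h, ip)
}
```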

**Mat Ryer:** Could it not just be a custom type with its own string method?

[00:58:58]
**Bryan Boreham:** Well, the debugger doesn't really want to call into a function like `String()`. Technically it could call the string function.

Right now it will really try to avoid that because there's no way to just call one function and come back out. What you have to do is fire up the whole Go runtime, including background goroutines, including the garbage collector. So you can't just do a simple thing.

What seems like a simple thing, just call the `String` method, very simple, nice, neat idea, the amount of work required for the debugger to do that is absolutely unthinkable.

Now, the Go runtime could conceivably add a concept of "just call a simple method" that doesn't have anything running in the background, no garbage collector running. Conceivably the Go runtime slash compiler could add that concept and then make it easier for the debugger to do that thing.

But yeah, the one thing that seems really easy and obvious...

[01:00:04]
**Charles Korn:** It's utterly impossible where we are today.

**Mat Ryer:** Well, there we go. That sounds like a place to improve. Any other ideas for things where Go could get better?

**Charles Korn:** Two things. One of them builds on what Bryan was just talking about, and on panics, which we talked about before. One thing that would be really helpful is if, when Go prints a stack trace, it could follow a pointer and print out a value from that struct. For example, I spend a lot of time looking at queries users send us. It'd be really useful to know both what the panic was and what the query was, what expression they sent us, instead of getting a random memory address that means nothing to me. That'd make it much easier to debug the problem.

The other thing that's kind of related is errors. I'd love to be able to get a stack trace reliably for an error. So when I'm looking at, like, something went wrong, I mean, that's great. But it'd be greater to know exactly where they came from. What was the thing they called? What was the path they took to get there? Because that's often really helpful context to work out why it broke or why it did what it did. And yet, that can be quite tricky to do at the moment in Go.

**Mat Ryer:** Yeah, they feel surmountable. Are they?

**Bryan Boreham:** It builds on some of what I was talking about, but possibly yours is more surmountable.

**Mat Ryer:** Yeah. I look forward to seeing that next Go release then.

**Bryan Boreham:** Of course, yeah. Go is open source. People can contribute. So yeah, and they do.

**Mat Ryer:** Okay, well, there we go. Unfortunately, that is all the time we have today. Thank you so much for listening, and thanks to our guests Bryan, Charles, Donia. And we'll see you next time on Grafana's Big Tent.

© 2022 Grafana Labs