Monitoring Kubernetes: Why traditional techniques aren't enough

Mat Ryer: Hello. I'm Mat Ryer, and welcome to Grafana's Big Tent. It's a podcast all about the people, community, tools and tech around observability. Today we're talking about monitoring Kubernetes. Why do we need to do that? Why do we need an episode on that? Well, we're gonna find out. And joining me today, it's my co-host, Tom Wilkie. Hi, Tom.

Tom Wilkie: Hello, Mat. How are you?

Mat Ryer: Pretty good. Where in the world are you doing this from today?

Tom Wilkie: Well, I guess for all the podcast listeners, you can't see the video... But out the window you can see the Space Needle in Seattle.

Mat Ryer: Okay, that's a clue. So from that, can we narrow it down? Yeah, we'll see if we can narrow that down with our guests. We're also joined by Vasil Kaftandzhiev. Hello, Vasil.

Vasil Kaftandzhiev: Hey, Mat. How are you doing today?

Mat Ryer: Oh, no bad. Thank you. And we're also joined by [unintelligible 00:01:05.12] Hello, Dio.

Dio: Hey, Mat. It's nice to be here.

Mat Ryer: It's an absolute pleasure to have you. Do you have any ideas where Tom could be then, if he's got the Seattle Needle out his window? Any ideas?

Vasil Kaftandzhiev: He's definitely not in Bulgaria, where I'm from.

Tom Wilkie: I am not, no.

Mat Ryer: Yeah. Okay, is that where you're dialing in from?

Vasil Kaftandzhiev: I'm dialing from Sofia, Bulgaria. This is a nice Eastern European city.

Mat Ryer: Oh, there you go. Advert. Dio, do you want to do a tourist advert for your place?

Dio: Yeah, I'm based in Athens. [unintelligible 00:01:31.28] almost 40 degrees here. It's a bit different.

Mat Ryer: Athens sells itself really, doesn't it? Alright.

Tom Wilkie: Well, I can assure you it's not 40 degrees C in Seattle. It's a bit chilly here. It's very welcoming to a Brit.

Mat Ryer: I don't know how the ancient Greeks got all that work done, honestly, with it being that hot there. It's that boiling... How do you invent democracy? It's way too hot.

Tom Wilkie: Is that a global warming joke, is it, Mat? I don't think thousands of years ago it was quite that warm.

Mat Ryer: Oh, really? No... It must have still been. Actually, that's a great point. I don't know. Okay, well... Right. Tell me. Why do we need to do a podcast episode on monitoring Kubernetes? Aren't traditional techniques enough? What's different? Dio, why do we need to have this particular chat?

Dio: Alright, that's interesting. First of all, I'm leading a DevOps team, so I have a DevOps background; I come at it from both sides - it can be an engineering background, or a sysadmin one. Now, if we're talking about the old way of doing monitoring, I come from engineering. In my past positions I was writing code, and I still am... But the question "Why do we need monitoring?" - if we are engineers, and we deploy services, and we own those services, monitoring is part of our job. And it should come out of the box. It should be something that - however we did it in the past, however we do it now - is part of what we're doing. So it's part of owning your day-to-day stuff. It's part of our job.

Tom Wilkie: I mean, that's a really interesting kind of point, where like, who's responsible for the observability nowadays? I definitely agree with you, kind of in the initial cloud generation, the responsibility for understanding the behavior of your applications almost fell to the developers, and I think that explains a lot of why kind of APM exists. But do you think -- I don't know, maybe leading question... Do you think in the world of Kubernetes that responsibility is shifting more to the platform, more to kind of out of the box capabilities?

Dio: It should be that, 100%. Engineers who deploy code, who push code - they shouldn't have to care about where it lives, how it's working, what it does, beyond having basic knowledge... Everything else should come out of the box. They should have enough knowledge to know where the dashboards are, how to set up alerts... But in our case, most of the time we just deploy something, and then you have a ton of very good observability goodies out of the box. Maybe it wasn't that easy in the past; it's very easy to do now. The ecosystem is in a very, very good position, with a very small DevOps team, to be able to support a big engineering team out of the box.

Tom Wilkie: I guess what is it about Kubernetes in particular that's made that possible? What is it about the infrastructure and the runtime and all of the goodies that come with Kubernetes that mean observability can be more of a service that a platform team offers, and not something every individual engineer has to care about?

Dio: Alright, so if we talk about ownership, for me it shouldn't be different. It should be owned by the team who writes this kind of stuff. Now, why Kubernetes? Is it special? Maybe it's not. Maybe the technology is just going in other directions. But I think now we're in a state where the open source community is very passionate about this, people know that you should do proactive monitoring, you should care... And what Kubernetes did that was very nice, and maybe sped this up, maybe made this easier - healing. Auto-healing is now a possibility. So as an engineer, maybe you don't need to care that much about what's going on. You should know, though, how to fix it, how to fix it in the future... And if you own it, at the end of the day things will be easier tomorrow.

So what we have - we have many dashboards, many alerts, and it's easy for someone to pick this up. At the end of the day, there are like a million different services and a whole stack underneath. But all this complexity has somehow been hidden away. So engineers now are supposed to know a few more things, but not terribly much [unintelligible 00:05:59.03] Maybe that wasn't the case in the past. But it is possible now. And partially, it's because of the community out there - how passionate the community has been lately when it comes to infra and observability and monitoring.

Vasil Kaftandzhiev: [06:15] It's really interesting how, on top of the passion and how Kubernetes has evolved in the last 10 years, something else has evolved, and that is the cloud provider bills for resources - which is another topic that comes to mind when we're talking about monitoring Kubernetes. It is such a robust infrastructure phenomenon that it touches absolutely every part of every company, startup, or whatever. So on top of everything else, developers now usually have the responsibility to think about their cloud bill as well, which is a big shift from the past till now.

Dio: You're right, Vasil. However, it's a bit tricky. One of the things we'll probably talk about is that it's very easy to have monitoring and observability out of the box. But then cost can be a difficult pill to swallow in the long run. Many companies, they just -- I think there are many players now in the observability field. The pie is very, very big. And many companies try to do many things at the same time, which makes sense... But at the end of the day, I've seen many cases where it gets very, very expensive, and it scales a lot. So cost allocation and cost effectiveness is one of the topics that loads of companies are getting very worried about.

Tom Wilkie: Yeah, I think understanding the cost of running a Kubernetes system is an art unto itself. I will say though, there are certain bits, there's certain aspects of Kubernetes, certain characteristics that actually make this job significantly easier. I'm trying to think about the days when I deployed jobs just directly into EC2 VMs, and attributing the cost of those VMs back to the owner of that service was down to making sure you tagged the VM correctly, and then custom APIs and reporting that AWS provided. And let's face it, half the teams didn't tag their VMs properly, there was always a large bucket of other costs that we couldn't attribute correctly... And it was a nightmare.

And one of the things I definitely think has got better in the Kubernetes world is everything lives in a namespace. You can't have pods outside of namespaces. And therefore, almost everything, every cost can be attributed to a namespace. And it's relatively easy, be it via convention, or naming, or extra kind of labeling and extra metadata, to attribute the cost of a namespace back to a service, or a team. And I think that for me was the huge unlock for Kubernetes cost observability, was just the fact that this kind of attribution is easier. I guess, what tools and techniques have you used to do that yourselves?

Dio: Right. So I don't want to sound too pessimistic, but unfortunately it doesn't work that nicely in reality. So first of all, cloud providers, they just - I think they enable this functionality to be supported out of the box (maybe it's been a year) [unintelligible 00:09:22.25] And GCP just last year enabled cost allocation out of the box. So it means you have your deployment in a namespace, and then you're wondering, "Okay, how much does this deployment cost? My team owns five microservices. How much do we pay for it?" And in the past, you had to track it down by yourself. Now it's just only lately that cloud providers enable this out of the box.

So if you have these nice dashboards there, and then you see "My service costs only five pounds per month", which is very cheap, there is an asterisk that says "Unfortunately, this only counts your pod requests." Now, for our engineers, like everyone else, it takes a lot more effort to have your workloads with the correct requests versus limits, so it's very easy at the end of the day to end up with a cost that's completely a false positive. Unfortunately, for me at least, it has to do with ownership. And this is something that comes up again and again and again in our company.

[10:25] Engineers need to own their services, which means they need to care about requests and limits. And if those numbers are correct - and it's very difficult to get them right - then the cost will be right as well. It's very easy just to have dashboards for everyone, and then these dashboards will be false positives, and then everyone will wonder why the dashboards [unintelligible 00:10:44.11] amount of money, they will pay 10x... It's very difficult to get it right, and you need a ton of iterations. And you need to push a lot, then you need to talk about it, be very vocal, be a champion when it comes to observability... And again, it's something that comes up again and again.

If you are a champion of observability, sometimes it's going to be cost allocation, sometimes it's going to be requests and resources, sometimes it's going to be dashboards, sometimes it's going to be alerts, and they keep coming. Because when you set something up right, there is always a next step you can take - how can we have those cost allocations proactively? How can we have alerts based on that? How can we measure them, and get alerted, and then teach people, and engage people? These are very difficult questions, and it's really difficult to answer them correctly. And I don't think cloud providers are there yet. We're still leading.

Tom Wilkie: I'm not disagreeing at all, but a lot of this doesn't rely on the cloud provider to provide it for you, right? A lot of these tools you can deploy yourself. You can go and take something like OpenCost, run that in your Kubernetes cluster, link it up to a Prometheus server and build some Grafana dashboards. You don't have to wait for GCP or AWS to provide this report for you. That's one of the beauties of Kubernetes, in my opinion. This ability to build on and extend the platform, and not have to wait for your service provider to do it for you, is one of the reasons why I'm so passionate about Kubernetes.
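For listeners who want to try what Tom describes, a minimal sketch of the Prometheus side might look like the snippet below. It assumes OpenCost is already installed into an `opencost` namespace and exposed via a Service named `opencost`; the job name and namespace are illustrative, not a prescribed setup.

```yaml
# Prometheus scrape config fragment: pull OpenCost's cost metrics so they can be
# graphed in Grafana and joined with kube-state-metrics by namespace.
scrape_configs:
  - job_name: opencost
    honor_labels: true
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [opencost]      # assumes OpenCost was installed into this namespace
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: opencost          # keep only the OpenCost service's endpoints
        action: keep
```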

Dio: Completely agree. And again, I don't want to sound pessimistic. OpenCost is an amazing piece of software. The problem starts when not everything works with OpenCost. So for example, buckets - you don't get [unintelligible 00:12:27.22] Auto-scaling is a very good example. So if you say "I have this PR on Terraform, and it will just increase auto-scaling from three nodes to 25. OpenCost, how much will this cost?", OpenCost will say "You know what? You are not introducing any new costs, so it's zero." Fair enough. [unintelligible 00:12:48.12] is going to be deployed. But then if your deployment auto-scales to 23 nodes, it's going to be very expensive. So while the technology is there, it's still -- at least in my experience, you need to be very vocal about how you champion these kinds of things. And it's a very good first step, don't get me wrong. It's an amazing first step, okay? We didn't have these things in the past, and they come from the open source community, and they're amazing. And when you link all of these things together, they make perfect sense. And they really allow you to scale, and deploy stuff in a very good way, very easily. But still - it takes a lot of effort to make them correct.

Vasil Kaftandzhiev: I really love the effort reference. And at this point, if we're talking about any observability solution - regardless of whether it is OpenCost for cost, or general availability, or health, etc. - we start to blend knowledge of the technology quite deep into the observability stack and the observability products that are there. And this is the only way around it. And with developers and SREs wearing so many superhero capes, the only way forward is to provide them with some kind of robust solutions for what they're doing. I'm really amazed by the complexity and freedom and responsibilities that you people have. It's amazing. As Peter Parker's uncle said, "With a lot of power, there is a lot of responsibility." So Dio, you're Spiderman.

Dio: [14:25] I completely agree. One of the things that really works out, and I've seen it -- because you're introducing all these tools, and engineers can get a bit crazy. So it's very nice when you hide this complexity. So again, they don't need to know about OpenCost, for example. They don't need to know about dashboards with cost allocation. They don't need to do these kinds of things. The only thing they need to know is that if they open a PR and they add something that will escalate cost, it will just fail. That's the only thing they need to know - that you have measures in there, policies in there, that will not allow you to add a load of infra cost.

Or - something else we're doing is once per month we just have some very nice Slack messages about "This is a team that spent less money this quarter, or had this very big saving", and that can champion people... Because they don't need to know what this dashboard is. By the way, it's a Grafana dashboard. They don't need to know about these kinds of things. They only need to know "This sprint I did something very good, and someone noticed. Okay. And I'm very proud of it." So if people are feeling proud about their job, then the next thing, without you doing anything, is they try to become better at it. And then they champion it to the rest of the engineers.
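A concrete shape for the PR cost gate Dio describes might be a CI job like the hedged sketch below. Everything here is hypothetical - the `scripts/cost-diff.sh` helper, the JSON field, and the 500 EUR threshold are stand-ins for whatever estimator a team actually uses (Infracost for Terraform, or OpenCost data for Kubernetes manifests).

```yaml
# Hypothetical GitHub Actions workflow: fail the PR when the estimated monthly
# cost delta exceeds a threshold, forcing an explicit approval instead.
name: cost-guardrail
on: [pull_request]
jobs:
  cost-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Estimate the monthly cost delta of this PR
        run: ./scripts/cost-diff.sh --base origin/main > cost-delta.json   # hypothetical helper
      - name: Fail if the delta exceeds the approval threshold
        run: |
          delta=$(jq -r '.monthly_delta_eur' cost-delta.json)
          if [ "$(echo "$delta > 500" | bc -l)" -eq 1 ]; then
            echo "This PR adds ~${delta} EUR/month of infrastructure; it needs an explicit approval."
            exit 1
          fi
```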

Vasil Kaftandzhiev: There is an additional trend that I'm observing, tied to what you're saying, and this is that engineers start to be so focused on cost that it damages the reliability and high availability - or any availability - of their products... Which is a strange shift, and a real reminder that sometimes we neglect the right thing, which is producing good software.

Tom Wilkie: Yeah, you mentioned a policy where you're not allowed to increase costs. We have the exact opposite policy. The policy in Grafana Labs is like in the middle of an incident, if scaling up will get you out of this problem, then do it. We'll figure out how to save costs in the future. But in the middle of like a customer impacting problem, spend as much money as you can to get out of this problem. It's something we have to actively encourage our engineering team to do. But 100%, the policy not to increase costs is like a surefire way to over-optimize, I think.

Dio: We have a reason, we have a reason. So you have a very good point, and it makes perfect sense. In our case though, engineers, they have free rein. They completely own their infrastructure. So this means that if there's a bug, or something, or technical debt, it's very easy for them to go and scale up. If you're an engineer and you have to spend like two days fixing a bug or add a couple of nodes, what do you do? Most of the times people will not notice. So having a policy over there saying "You know what? You're adding 500 Euros of infrastructure in this PR. You need someone to give an approval." It's not like we're completely blocking them.

And by the way, we caught some very good infrastructure bugs out of this. Engineers wanted to do something completely different, or they said "You know what? You're right. I'm going to fix it my way." Fix the memory leak, instead of adding twice the memory to the node [unintelligible 00:17:35.22] Stuff like that. But if we didn't have this setup, if engineers were not completely responsible for it, then what you say makes perfect sense.

Mat Ryer: Yeah, this is really interesting. So just going back, what are the challenges specifically? What makes it different monitoring Kubernetes? Why is it a thing that deserves its own attention?

Tom Wilkie: [18:01] I would divide the problem in two, just for simplicity. There's monitoring the Kubernetes cluster itself, the infrastructure behind Kubernetes that's providing you all these fabulous abstractions. And then there's monitoring the applications running on Kubernetes. So just dividing it in two, to simplify... When you're looking at the Kubernetes cluster itself, this is often an active part of your application's availability. Especially if you're doing things like auto-scaling, and scheduling new jobs in response to customer load and customer demand. The availability of things like the Kubernetes scheduler, things like the API servers and the controller managers and so on. This matters. You need to build robust monitoring around that, you need to actively enforce SLOs around that to make sure that you can meet your wider SLO.

We've had outages at Grafana Labs that have been caused by the Kubernetes control plane and our aggressive use of auto-scaling. So that's one aspect that I think maybe doesn't exist as much if you're just deploying in some VMs, and using Amazon's auto-scalers, and so on.

I think the second aspect though is where the fun begins. The second aspect is using all of that rich metadata that the Kubernetes system has - everything it knows about your jobs and about your applications - to make it easier for users to understand what's going on with those applications. That's where the fun begins, in my opinion.
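As a concrete illustration of treating the control plane as part of your SLO, here's a minimal sketch of an alert rule, assuming the Prometheus Operator is installed and the API server is being scraped; the threshold and names are illustrative.

```yaml
# Alert when the Kubernetes API server error ratio stays above 1% for 10 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: control-plane-slo
  namespace: monitoring
spec:
  groups:
    - name: kube-apiserver-availability
      rules:
        - alert: KubeAPIServerHighErrorRate
          expr: |
            sum(rate(apiserver_request_total{code=~"5.."}[5m]))
              /
            sum(rate(apiserver_request_total[5m])) > 0.01
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: More than 1% of Kubernetes API requests are failing.
```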

Dio: Completely agree. If you say to engineers "You know what? Now you can have very nice dashboards about CPU, and nodes, and throughput, and stuff like that", they don't care. And you can't talk about these kinds of things without Prometheus. So if you tell them "You know what? All of these things are Prometheus metrics. We just expose [unintelligible 00:19:49.05] metrics, and everything is working", they will look there. If you tell them though "You know what? If you expose your own metrics, you can scale based on memory, or scale based on traffic", either way they become very intrigued, because they know the bottleneck of their own services; maybe it's how many people visit the service, or how well it can scale under certain circumstances, based on queues... There are a ton of different services. So if you tell them "You know what? You can scale based on those metrics. And by the way, you can have very nice dashboards that show CPU and memory, and here is your metric as well", this is where things become very interesting as well.

And then you start implementing new things like a pod autoscaler, or a vertical pod autoscaler. Or "You know what? This is the service mesh, what it looks like, and then you can scale, and you can have other metrics out of the box." And we'll talk about golden metrics.

So again, to take a step back... Most engineers don't have golden metrics out of the box. And that is a very big minus for most teams. Some teams don't care. But golden metrics means things like throughput, error rate, success rate, stuff like that... Which, in the bigger Kubernetes ecosystem, you can have for free. And if you scale based on those metrics, it's an amazing, powerful superpower. You can do whatever you want as an engineer if you have them, and you don't even need to care where those things are allocated, how they're being stored, how they're being served, stuff like that. You don't need to care. You only need some nice dashboards, some basic high-level knowledge about how you can expose them, or how you can use them, and then just to be a bit intrigued, so you can take the next step and scale your service.
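Scaling on a golden signal rather than raw CPU usually ends up as an autoscaling/v2 HPA pointed at a custom metric, roughly like the sketch below. It assumes a metrics adapter (prometheus-adapter, KEDA, or a cloud provider's equivalent) is exposing a per-pod `http_requests_per_second` metric; the workload name and target value are made up.

```yaml
# HPA that scales a Deployment on observed request throughput instead of CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api              # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 25
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # served by the metrics adapter
        target:
          type: AverageValue
          averageValue: "100"              # aim for ~100 req/s per pod
```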

Tom Wilkie: You said something there, like, you can get these golden signals for free within Kubernetes. How are you getting them for free?

Dio: Okay, so if you are on GCP, or if you are in Amazon, most of those platforms have managed service mesh solutions.

Tom Wilkie: I see. So you're leveraging like a service mesh to get those signals.

Dio: Yes, yes. But now with GCP it's just one click, in Amazon it's one click, Linkerd is just a small Helm deploy... It's not that different from the Prometheus operator, and stuff like that.

Tom Wilkie: [22:09] Yeah. I don't think it's a badly kept secret, but I am not a big fan of service meshes.

Dio: I didn't know that... Why?

Tom Wilkie: Yeah, I just don't think the cost and the benefits work out, if I'm brutally honest. There's definitely -- and I think what's unique, especially about my background and the kind of applications we run at Grafana Labs, is that honestly the API surface area for Mimir, or Loki, or Tempo is tiny. We've got two endpoints: write some data, run a query. So the benefit you get from deploying a service mesh - this auto-instrumentation kind of benefit that you describe - is small, because it's trivial to instrument those things ourselves in Mimir and Loki. And the downside of running a service mesh is the overhead, the added complexity, the increased unreliability... There have been some pretty famous outages triggered by outages in the service mesh... For applications like Mimir and Loki, and a company like Grafana Labs, I don't think service meshes are worth the cost. So we've tended to resort to things like baking the instrumentation into the shared frameworks that we use in all our applications. But I want to be really clear - I don't think Kubernetes gives you golden signals out of the box, but I do agree with you, service meshes do, a hundred percent.

Dio: It is an interesting approach. So it's not the first time I'm hearing these kinds of things, and one of the reasons - we were talking internally about service mesh for at least six months... So one of the things I did in order to make the team feel more comfortable, we spent half of our Fridays, I think for like three or four months, reading through incident reports around service meshes. It was very interesting. So just in general, you could see how something would happen, and how we would react, how we would solve it... And it was a very interesting case. And then we've found out that most of the times we could handle it very nicely.

Then the other thing that justified a service mesh for us is that -- we have an engineering team of 100 people. And still, people could not scale up. They could not use the Prometheus stuff. They could not use HPA properly, because they didn't have these metrics. So this is the more complex part... Anyway, we're using Linkerd, which is just -- it's not complex. We are a very small team. It's not about complexity. It's not more complex than running a Thanos operator, or handling everything else. Again, it has an overhead, but it's not that much more complex. However, the impact it had on the engineering team, having all those things out of the box - it was enormous.

And one last thing - the newest Kubernetes version, the newest APIs will support service mesh out of the box. So eventually, the community will be there. Maybe it's going to be six months, maybe it's going to be one year. I think that engineering teams that are familiar with using those things, that embrace these kinds of services, they will be one step ahead when Kubernetes supports them out of the box... Which is going to be very, very soon. Maybe next year, maybe sooner than that.

Tom Wilkie: Yeah. I mean, I don't want to *bleep* on service meshes from a great distance. There are teams in Grafana Labs that do use service meshes, for sure. Our global kind of load balancing layer for our hosted Grafana service uses Istio, I believe. And that's because we've got some pretty complicated kind of routing and requirements there that need a sophisticated system. So no, I do kind of generally take the point, but I also worry that the blanket recommendation to put everything on a service mesh - which wasn't what you were saying, for sure... But I've definitely seen that, and I think it's more nuanced than that.

Mat Ryer: [25:59] But that is a good point. Dio, if you have like a small side project, are the benefits of Kubernetes and service mesh and stuff so good that you would use that tech even for smaller projects? Or do you wait for there to be a point at which you think "Right, now it's worth it"?

Dio: Obviously, we'll wait. We don't apply anything just for the sake of applying stuff. We just take what the engineering teams need. In our case, we needed [unintelligible 00:26:21.07] we really needed these kinds of things. We needed [unintelligible 00:26:26.26] metrics. We needed people to scale based on throughput. We needed people to be aware of the error rates. And we needed to have everything in dashboards, without people having to worry about these kinds of things. But the goodies that we got out of the box - they're amazing. So for example, now we can talk about dev environments, because we have moved all this complexity away into the service mesh. We're using traffic split, which, again, is a Kubernetes-native API now.

So probably this is where the community will be very, very soon, but I think [unintelligible 00:27:03.13] DevOps on our team, it's in a state where -- we're in a very good state. So we need to work for the engineering needs one year in advance. And people now struggle with dev environments, releasing stuff sooner. Observability - we have solved it in the high level, but in the lower level still people struggle to understand when they should scale, how they can auto-heal, stuff like that. And service meshes give you a very good out of the box thing. But again, we don't implement things unless we really need them... Because every bit of technology that you add, it doesn't matter how big or small your team is, it adds complexity. And you need to maintain it, and to have people passionate about it; you have to own it.

One other thing that I have found is maybe not working, at least in the engineering department, is that people often change positions. And Grafana is a very big player, so it has some very powerful, passionate people... But for the rest of the engineering teams it's not the same. So you may have engineers jump every year, or year and a half... So sometimes it's not easy to find very good engineers, who are very passionate, [unintelligible 00:28:14.11] own it, and then help scale it further. So it is challenging, I completely agree.
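On the traffic-split point Dio raised a little earlier: Linkerd's SMI TrafficSplit is one way to do it, and the Kubernetes-native direction he alludes to is the Gateway API, where weighted backendRefs on an HTTPRoute achieve the same effect. A hedged sketch, with illustrative names and weights:

```yaml
# Send 90% of traffic to the stable Service and 10% to a dev/canary Service.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders-route
spec:
  parentRefs:
    - name: public-gateway          # hypothetical Gateway
  rules:
    - backendRefs:
        - name: orders              # stable version
          port: 8080
          weight: 90
        - name: orders-canary       # dev environment / canary
          port: 8080
          weight: 10
```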

Tom Wilkie: Yeah, I 100% agree. We encourage our engineers to move around teams a lot as well. And I think all really strong engineering teams have that kind of mobility internally. I think it's very important. I just want to -- you talked a lot about auto-scaling, and I do think auto-scaling is a great way, especially with the earlier discussion about costs... It's a great way to achieve this kind of infrastructure efficiency. But two things I want to kind of pick up on here. One is auto-scaling existed before Kubernetes. Right? I think everyone who's kind of an expert in EC2 and load balancers and auto-scaling groups is sitting there, shouting at the podcast, going "We used to do this before Kubernetes!" So what is it about Kubernetes that makes us so passionate about auto-scaling? Or is it just the standard engineering thing that everything old is new again, and this is all cyclical?

Dio: Could you in the past auto-scale easily based on the throughput, and stuff? I'm not sure.

Tom Wilkie: Yeah. Auto-scaling groups on Amazon were fantastic at that.

Dio: Alright. And what about the rest of the cloud providers?

Tom Wilkie: Yeah, I mean... Are there other cloud providers? That's a bad joke... Yeah, no, you know, Google has equivalent functionality in their VM platform, for sure. I do think -- you do kind of make a good point... I think it's kind of similar to the OpenCost point we made earlier, of like Kubernetes has made it so that a lot of these capabilities are no longer cloud provider-specific. You don't have to learn Google's version of auto-scaling group, and Azure's version of auto-scaling group, and Amazon's auto-scaling groups. There is one way -- the auto-scaling in Kubernetes, there's basically one way to configure it, and it's the same across all of the cloud providers. I think that's one of the reasons why auto-scaling is potentially more popular, for sure.

Dio: Very good point.

Tom Wilkie: [30:15] But I would also say, you've talked a lot about using custom metrics for auto-scaling, and using like request and latency and error rates to influence your auto-scaling decisions... There's a bit of like accepted wisdom, I guess, that actually, I think CPU is the best signal to use for auto-scaling in 99% of use cases. And honestly, the number of times -- even internally at Grafana Labs, the number of times people have tried to be too clever, and tried to second-guess and model out what the perfect auto-scaling signals are... And at the end of the day, the really boring just CPU consumption gives you a much better behavior.

And I'll finish my point - again, not really a question, I guess... But I think the reason why CPU is such a good signal to use for auto-scaling is because, if you think about it, what auto-scaling actually achieves is adding more CPU. You're basically scaling up the number of CPUs you're giving a job by scaling up the replicas of that job. And so by linking those two directly, like CPU consumption to auto-scaling, and just giving the auto-scaler a target CPU utilization, you can actually end up with a pretty good system, without all of this complexity of custom metrics, and all of the adapters needed to achieve that. But again, I'm just kind of -- I'm kind of causing an argument for the fun of the listeners, I guess... But what do you think, Dio?
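For contrast with the custom-metrics sketch earlier, the "boring" approach Tom argues for is just the stock CPU target; nothing beyond the HPA itself is required, since CPU comes from the standard resource metrics pipeline (metrics-server). Names and numbers are again illustrative.

```yaml
# HPA driven purely by CPU utilization relative to the pods' CPU requests.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 25
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # keep pods around 70% of their CPU request
```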

Dio: I think it's a very good point. I want to touch a bit on vertical auto-scaling. Are you aware about vertical auto-scaling?

Tom Wilkie: Yeah, I think it's evil.

Mat Ryer: It's where the servers stack up on top of each other, instead of --

Dio: So vertical pod auto-scaling has to do with memory. CPU is the easy stuff. If you have a pod and it reaches 100% CPU, it just means that your calls will be a bit slower. With memory though, it's different. Pods will die out of memory, and then you'll get many nice alerts, and your on-call engineers will add more memory. And then most engineering teams do the same. So a pod dies out of memory, you add more memory. The pod dies out of memory, you add more memory. And then sometimes, like after years, you say "You know what? My costs on GCP are enormous. Who is using all those nodes?" And then people look at the dashboards, they see that they don't actually use a lot of memory, so they [unintelligible 00:32:34.19] and then you get out-of-memory alerts.

So vertical pod auto-scaling does something magical. So you have your pod, with a specific amount of memory; if it dies, then the auto-scaler will put it back in the workload, with a bit more memory. And then if it dies again, it will just increase memory. So if your service can survive pods dying, so it goes back to the queue or whatever, it means that overall you have a better cost allocation, because your pods go up and down constantly with memory consumption. So this is a good thing to discuss when it comes to an auto-scaler.
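The behaviour Dio describes maps onto the VerticalPodAutoscaler object roughly as follows, assuming the VPA components (recommender, updater, admission controller) are installed in the cluster; the workload name and bounds are illustrative.

```yaml
# In "Auto" mode the VPA evicts pods and recreates them with updated resource requests.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: orders-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders                   # hypothetical workload
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 128Mi
        maxAllowed:
          memory: 4Gi              # cap so a leak can't ratchet requests up forever
```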

Tom Wilkie: No, a hundred percent. And to add to that, I actually think it's poorly named. I think calling it the vertical pod auto-scaler -- what it's actually doing is just configuring your memory requests for you. And classing that as auto-scaling is kind of -- you know, in my own head at least, I think that's more about just automatically configured requests. So we do that, we use that; of course we do.

I do generally think, to your earlier point, teams that really, really care about their requests, and about the shape and size of their jobs, we've seen internally they shy away from using the vertical pod auto-scaler, because they -- and it is a huge factor in dictating the cost of their service. And then there's the teams that just want to put a service up, and just want it to run, and the service is tiny and doesn't consume a lot of resources... 100% pod vertical auto-scaling is the perfect technology for them.

Dio: [34:07] You're right. It's fine. For most of the engineers, having something there that auto-scales up and down is fine. But for the few services that will really benefit from spikes in traffic, or from any Prometheus metric that you expose, it makes a very big difference. Sometimes it makes a very big difference. But this is something that most people don't care about. They'll have hundreds of microservices in there, and they will work out of the box with everything you have, in a nice way, so you don't have to worry about it. But the most difficult part is to have people know that this is possible if they want to implement it, to know where they can go and have the monitoring tools to see what's going on, their dashboards, and where they can see stuff... And then if you are cheeky enough, you can have alerts there as well, to say "You know what? You haven't scaled enough", or "Your services have a lot of CPU utilization. Maybe you should adjust it", and stuff like that, in order for them to be aware that this is possible. And again, you need people that are very passionate. If people are not passionate about what they're doing, or they don't own the services, it's very difficult. It doesn't scale very well with engineering teams.

Mat Ryer: Alright, let's get into the hows, then. How do we do this? Tell me a bit about doing this in Kubernetes. Has this changed a lot, or was good observability baked in from the beginning, and it's evolving and getting better?

Tom Wilkie: I think one of the things I'm incredibly happy about with Kubernetes is like all of the Kubernetes components effectively from the very early days were natively instrumented with really high-quality Prometheus metrics. And that relationship between Kubernetes and Prometheus dates all the way back to kind of their inception from kind of that heavy inspiration from the internal Google technology. So they're both inspired by -- you know, Kubernetes by Borg, and Prometheus by Borgmon... They both heavily make use of this kind of concept of a label... And things like Prometheus was built for these dynamically-scheduled orchestration systems, because it heavily relies on this kind of pull-based model and service discovery. So the fact that these two go hand in hand -- I one hundred percent credit the popularity of Prometheus with the popularity of Kubernetes. It's definitely a wave we've been riding.

But yeah, that's like understanding the Kubernetes system itself, the stuff running Kubernetes, the services behind the scenes... But coming back to this kind of thought, this concept of Kubernetes having this rich metadata about your application -- you know, your engineers have spent time and effort describing the application to Kubernetes in the form of like YAML manifests for deployments, and stateful sets, and namespaces, and services, and all of this stuff gets described to Kubernetes... One of the things that I think makes monitoring Kubernetes quite unique is that description of the service can then be effectively read back to your observability system using things like kube-state-metrics. So this is an exporter for Kubernetes API that will tell Prometheus "Oh, this deployment is supposed to have 30 replicas. This deployment is running on this machine, and is part of this namespace..." It'll give you all of this metadata about the application. And it gives you it in metrics itself, as Prometheus metrics. This is quite unique. This is incredibly powerful. And it means the subsequent dashboards and experiences that you build on top of those metrics can actually just natively enrich, and -- you know, things like CPU usage; really boring. But you can actually take that CPU usage and break it down by service, really easily. And that's what I think gets me excited about monitoring Kubernetes.
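A small example of "reading the description back": kube-state-metrics carries the declared state (replicas, owners, labels), cAdvisor carries the usage, and because both are labelled with namespace and pod you can join them. The rule names below are illustrative, and this assumes the Prometheus Operator's PrometheusRule CRD is available.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-usage
  namespace: monitoring
spec:
  groups:
    - name: namespace-usage
      rules:
        # CPU usage per namespace, straight from cAdvisor metrics
        - record: namespace:container_cpu_usage_seconds:rate5m
          expr: sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
        # How healthy each deployment is versus its declared spec, from kube-state-metrics
        - record: deployment:replicas_available:ratio
          expr: |
            kube_deployment_status_replicas_available
              / kube_deployment_spec_replicas
```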

Dio: [38:11] I agree one hundred percent. Yeah. And the community has stepped up a lot. An ex-colleague of mine used to say DevOps work is so easy these days, because loads of people in the past have made such a big effort to give the community all those nice dashboards and alerts that they want out of the box.

Now, I just want to add to what Tom said that even though kube-state-metrics and Prometheus are doing such a good job with native integration with Kubernetes, it's not enough in most cases. I have a very good example to showcase this. Let's say one of the nodes goes down, and you get an alert, and then you know that a few services are being affected... And then you ask engineers to drop into a call, and [unintelligible 00:39:00.07] and then start seeing what's wrong... Unless you easily give them the tools to figure out what is wrong, it's not enough. And in our case -- actually, I think in most cases - you need a single place to have dashboards: kube-state-metrics, Prometheus metrics, but also logs. You need logs. And then you need performance metrics, you need your APM metrics...

So I think the Grafana ecosystem is doing a very, very good job. And I'm not doing an advertisement, I'm just saying what we're using. But in our case, we have very good dashboards that have all the Prometheus metrics, and then they have Loki logs, and then traces. You have your traces in there, and then you can jump from one to another... And then we have Pyroscope now as well... So dashboards that people are aware of, where they can jump in and right out of the box find out what is wrong - it's very powerful. And they don't need to know what Pyroscope is, or what profiling is. They don't need to know these kinds of things. You just need to give them the ability to explore the observability bottlenecks in their applications.

Tom Wilkie: Oh, I one hundred percent agree. I would add that this extra structure to your application is metadata that can be exposed into your metrics. This makes it possible to develop a layer of abstraction, and therefore common solutions on top of that abstraction. And I'm talking in very general terms, but I specifically mean there's a really rich set of dashboards in the Kubernetes mixin that work with pretty much any Kubernetes cluster, and give you the structure of your application running in Kubernetes. And you can see how much CPU each service uses, what hosts they're running on. You can drill down into this really, really straightforwardly and easily. And there's a project internally at Grafana Labs to try and effectively do the same thing, but for EC2, and the metadata is just not there. Even if we use something like YACE, Yet Another CloudWatch Exporter, to get as many metrics out of the APIs as we can, you're not actually teaching EC2 about the structure of your application, because it's all just a massive long list of VMs. And it means the application that we've developed to help people understand the behavior of the EC2 environment is nowhere near as intuitive and easy to use as the Kubernetes one, because the metadata is not there.

So I just really want to -- I think this is like the fifth time I've said it, that that metadata that Kubernetes has about your application, if you use that in your observability system, it makes it easier for those developers to know what the right logs are, to know where the traces are coming from, and it gives them that mental model to help them navigate all of the different telemetry signals... And if there's one thing you take away from this podcast, I think that's the thing that makes monitoring and observing Kubernetes and applications running in Kubernetes easier, and special, and different, and exciting.

Mat Ryer: [42:05] These things also paid dividends, because for example we have the Sift technology - something I worked on in Grafana - which essentially is only really possible... You know, the first version of it was built for Kubernetes, because of all that metadata. So essentially, when you have an alert fire, it's a bot, really, that goes and just checks a load of common things, like noisy neighbors, or looks in the logs for any interesting changes in logs, and things, and tries to surface them. And the reason that we chose Kubernetes first is just because of that metadata that you get. And we're making it work -- we want to make it work for other things, but it's much more difficult. So yeah, I echo that.

Vasil Kaftandzhiev: It's really exciting how Kubernetes is making things complex, but precise. And on top of everything, it gives you the -- not just the opportunity, but the tools and possibilities to actually manage it precisely. If you have a good dashboard, a good team, someone to own it, and so on, you can be precise with Kubernetes - and you really should be.

Tom Wilkie: Yeah. A hundred percent. And just to build on what Mat said - no podcast in this day and age would be complete without a mention of Gen AI and LLMs. We've also found in our internal experiments with these kinds of technologies that that metadata is key to helping the AI understand what's going on, and make reasonable inferences and next steps. So giving the metadata to ChatGPT before you ask a question about what's going on in your Kubernetes cluster has been an unlock, right? There's [unintelligible 00:43:45.20] a whole project built on this as well, that's actually seen some pretty impressive use cases.

So yeah, I think this metadata is more than just about observability. Like, it's actually -- the abstraction and that unlock is one of the reasons why Kubernetes is so popular, I think. And you said something, Dio, which I thought was really interesting... You started to talk about logs and traces. How are you using this metadata in Kubernetes to make correlating between those signals easier for your engineers?

Dio: [unintelligible 00:44:15.18] So one of the bottlenecks is not having any labels, not having anything. The other can be having too many of them. So you have many clusters, you have hundreds of applications... It's very often the case where it's very busy, and people cannot find quickly what's going on. So we cannot have this conversation without talking about exemplars. So exemplars is something that we -- it unlocked our engineering department for really figuring out what was wrong and what they really needed. So exemplars, they work with traces. And --

Mat Ryer: Dio, before you carry on, can you just explain for anyone not familiar, what is an exemplar?

Dio: Yeah, sure. So an exemplar is basically a link to a trace, attached to a high-cardinality metric. So when something is wrong - when you have, let's say, a microservice that has thousands of requests per second - how can you find which request is the problematic one? So you have latency. Clients are complaining that your application is slow. But then you see in your dashboards that most requests are fine. How can you find the trace, the code that was really problematic? This is where exemplars come in. And out of the box, it means you can find the traces with the biggest latency, and then you have a nice dashboard, and then you have [unintelligible 00:45:34.25] and then when you click that point, you can go right away to the trace.

And then after the trace, everything is easy. With the trace, you can go to Pyroscope and see the profiling, or you can go to the pod logs, or you can go to the node with Prometheus metrics... So everything is linked. So as long as you have this problematic trace, everything else is easy.

And this is what really unblocked us, because it means that when something goes wrong, people don't have to spend time figuring out where the problematic case might be, especially if you have a ton of microservices, a chain of microservices.

[46:11] So yeah, exemplars were something that really, really unblocked us. Because it's really easy to have many dashboards; people get lost in there. You don't need many dashboards. You just give some specific ones that people should know about, and you give them enough information to be able to do their job when they need to, very, very easily. And exemplars were really, extremely helpful for us.
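The wiring that makes the click-through Dio describes work is fairly small: Prometheus has to run with `--enable-feature=exemplar-storage`, your client libraries attach a trace ID to histogram observations, and Grafana is told where those traces live. A sketch of the Grafana data source provisioning, where the Tempo data source UID is an assumption:

```yaml
# Grafana data source provisioning: link exemplars on Prometheus queries to traces in Tempo.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id          # the exemplar label carrying the trace ID
          datasourceUid: tempo    # assumed UID of the Tempo data source
```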

Tom Wilkie: Yeah, I'm a huge fan of exemplars. I think it's a big unlock, especially in those kinds of debugging use cases that you describe. I will kind of - again, just to pick you up there... There's nothing about exemplars that's Kubernetes-specific. You can 100% make that work in any environment. Because the linkage between the trace and the metrics is just a trace ID. You're not actually leveraging the Kubernetes environment. I mean, there are things about Kubernetes that make distributed tracing easier, especially if you've got like a service mesh, again. I definitely get that. But yeah, exemplars work elsewhere.

The places that are kind of Kubernetes-enhanced, if you like, in observability are making sure that your logs and your metrics and your traces all contain consistent metadata identifying the job they came from. So this was actually the whole concept for Loki. Five years ago, when I wrote Loki, it was a replacement for Kubernetes logs, for kubectl logs. That was our inspiration. So having support for labels and [unintelligible 00:47:40.07] and having that be consistent -- I mean, not just consistent, but literally identical to how Prometheus does its labeling and metadata, was the whole idea. And having that consistency, having the same metadata, allows you to systematically guarantee that for any PromQL query that you give me, any graph, any dashboard, I can show you the logs for the service that generated that graph. And that only works on Kubernetes, if I'm honest. Trying to make that work outside of Kubernetes, where you don't have that metadata, is incredibly challenging, and ad hoc.
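In practice that consistency comes from the log collector applying the same Kubernetes service-discovery labels Prometheus uses. A minimal sketch of a Promtail (or Grafana Alloy equivalent) scrape config fragment, showing only the relabelling that matters here:

```yaml
# Give Loki streams the same namespace/pod/container labels Prometheus has,
# so a PromQL selector maps one-to-one onto a LogQL selector.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
```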

Dio: Exactly. When everything fits together, it's amazing. When it works, it's amazing. Being an engineer and being able to find out what is wrong, how you can fix it, find the pod... And by the way, auto-scaling can fix it; it's a superpower in your engineer. And you don't need to own all these technologies. You just need to know what your service is doing and how you can benefit out of it.

One other thing as well is that those things are cheap. You may have seen, there are a ton of similar solutions out there. Some of them may be very expensive [unintelligible 00:48:52.10] The thing with Loki, and stuff is they are very cheap as well, so they can scale along with your needs... Which is critical. Because lately -- I hear this all the time; "efficiency", it's the biggest word everyone is using. You need to be efficient. So all these things are very nice to have, but if you are not efficient with your costs, eventually -- if they're not used enough, or they're very expensive, people eventually will not use them. So efficiency is a key word here. How cheap it can be, and how very well [unintelligible 00:49:22.03]

Tom Wilkie: Yeah. And I don't want to be that guy that's always saying "Well, we always used to be able to do this." If you look at like the traditional APM vendors and solutions, they achieved a lot of the experience that they built through consistently tagging their various forms of telemetry. The thing, again, I think Kubernetes has unlocked is it's not proprietary, right? This is done in the open, this is consistent between different cloud providers and different tools, and has raised the level of abstraction for a whole industry, so that this can be done even between disparate tools. It's really exciting to see that happen and not just be some proprietary kind of vendor-specific thing. That's what's got me excited.

Dio: [50:09] Okay. Now, Tom, you got me curious - what's your opinion about multicloud?

Tom Wilkie: Grafana Labs runs on all three major cloud providers. We don't ever have a Kubernetes cluster span multiple regions or providers. Our philosophy for how we deploy our software is all driven by minimizing the blast radius of any change. So we run our regions completely isolated, and effectively therefore the two different cloud providers, or the three different cloud providers in all the different regions, don't really talk to each other... So I'm not sure whether that counts as multicloud proper, but we 100% run on all three cloud providers. We don't use any cloud provider-specific features. So that's why I like Kubernetes, because it's that abstraction layer, which means -- honestly, I don't think our engineers in Grafana Labs know which cloud provider a given region is running on. I don't actually know how to find out. I'm sure it's baked into one of our metrics somewhere... But they just have like 50-60 Kubernetes clusters, and they just target them and deploy to them. And again, when we do use cloud provider services beyond Kubernetes, like S3, GCS, these kinds of things, we make sure we're using ones that have commonality and similar services in all cloud providers. So pretty much we use hosted SQL databases, we use S3, we use load balancers... But that's about it. We don't use anything more cloud provider-specific than that, because we want that portability between clouds.

Dio: And have you tried running and having dashboards for multiple cloud providers, for example for cost stuff?

Tom Wilkie: Yeah, it's hard to show you a dashboard on a podcast... But yeah, 100%.

Mat Ryer: You can just describe it, Tom.

Tom Wilkie: Our dashboard for costs is cloud-agnostic, is cloud provider-agnostic. So we effectively take all the bills from all our cloud providers, load them into a BigQuery instance, render the dashboard off of that, and then we use Prometheus and OpenCost to attribute those costs back to individual namespaces, jobs, pods... And then aggregate that up to the team level. And if you go and look on this dashboard, it will tell you how much the Mimir team or how much the Loki team is spending. And that is an aggregate across all three cloud providers.

The trickier bit there, as we've kind of talked about earlier, OpenCost doesn't really do anything with S3 buckets. But we use -- I forgot what it's called... We use Crossplane to provision all of our cloud provider resources... And that gives us the association between, for instance, S3 bucket and namespace... And then we've just built some custom exporters to get the cost of those buckets, and do the join against that metadata so we can aggregate that into the service cost. But no, 100% multicloud at Grafana Labs.

Vasil Kaftandzhiev: Talking about costs and multicloud, there are so many dimensions about cost in Kubernetes. This is the cloud resources, this is the observability cost, and there is an environmental cost that no one talks about... Or at least there is not such a broad conversation about it. Having in mind how quickly Kubernetes can scale on its own, what do you think about all of these resources going to waste, and producing not only a waste for your bill, but a waste for the planet as well, in terms of CO2 emissions, energy going to waste, and stuff like that?

Dio: That's a very good question. I'm not sure I have thought about this that much, unfortunately. As a team, we try not to use a ton of resources, so we'll scale down a lot. We don't over-provision stuff... We try to reuse whatever is possible, and using [unintelligible 00:53:52.04] and stuff... But mostly this is for cost effectiveness, not about anything else. But this is a very good point. I wish more people were vocal about this. As with everything, if people are passionate, things can change, one step at a time... But yeah, that's an interesting point.

Vasil Kaftandzhiev: [54:10] For me it's really interesting how almost all of us take Kubernetes for granted... And as much as we are used to VMs, as much as we're used to bare metal, as much as we can imagine in our heads that this is something that runs into a datacenter, with a guard with a gun on his belt, we think of Kubernetes as solely an abstraction. And we think about all of the different resources that are going to waste as just digits into the Excel table, or into the Grafana Cloud dashboard.

At the end of the day - I hope I have this right - approximately 30% of all of the resources that go into powering Kubernetes go to waste, according to the CNCF... Which is maybe a good conversation to have down the road, and I'm sure that it's going to come to us quicker than we expect.

Tom Wilkie: I think the good news here is -- and I agree, one of the things that happens with these dynamically-scheduled environments is like a lot of the VMs that we ask for from our cloud provider have a bit of unallocated space sitting at the top of them. We stack up all of our pods, and they never fit perfectly. So there's always a little bit of wastage. And in aggregate, 100% agree, that wastage adds up.

I think the 30% number from the CNCF survey - I think internally at Grafana Labs that's less than 20%. We've put a lot of time and effort into optimizing how we do our scheduling to reduce that wastage... But the good news here is that incentives align really nicely. We don't want to pay for unallocated resources. We don't want to waste our money. We don't want to waste those resources. And that aligns with not wasting the energy going into running those resources, and therefore not producing wasted CO2.

So I think the good news is incentives align, and it's in users' and organizations' interest not to waste this, because at the end of the day if I'm paying for resources from a cloud provider, I want to use them. I don't want to waste them. But that's all well and good, saying incentives align... I will say, this has been a project at Grafana Labs to drive down unallocated resources to a lower percentage. It has been a project for the last couple of years, and it's hard. And it's taken a lot of experimentation, and it's taken a lot of work to just get it down to 20%... And ideally, it would be even lower than that.
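One way to put a number on that unallocated headroom, using only kube-state-metrics data: compare what pods request against what the nodes can allocate. The rule name is illustrative; the same pattern works for memory.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-headroom
  namespace: monitoring
spec:
  groups:
    - name: unallocated-resources
      rules:
        # Fraction of allocatable CPU that no pod has requested (the "wastage").
        - record: cluster:cpu_unallocated:ratio
          expr: |
            1 - (
              sum(kube_pod_container_resource_requests{resource="cpu"})
                /
              sum(kube_node_status_allocatable{resource="cpu"})
            )
```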

Mat Ryer: And I suppose it keeps changing, doesn't it?

Tom Wilkie: Yeah, the interesting one we've got right now is - I think we've said this publicly before... The majority of Grafana Cloud used to be deployed on Google, and over the past couple of years we've been progressively deploying more and more on AWS. And we've noticed a very large difference in the behavior of the schedulers between the two platforms. So one of the benefits I think of GCP is Google's put a lot of time and effort into the scheduler, and we were able to hit sub-20% unallocated resources on GCP. Amazon has got a brilliant scheduler as well, and they've put a lot of time and effort into the Karpenter project... But we're just not as mature there, and our unallocated resources on EKS are worse. It's up in the 30% range at the moment. But there's a team in Grafana Labs who literally, their day-to-day job is to optimize this... Because it really moves the needle. We spend millions of dollars on AWS, and 10% of millions of dollars is more than most of our engineers' salaries. So it really does make a difference.

Vasil Kaftandzhiev: This is a really good touch on salaries, and things... I really see monitoring Kubernetes costs as the ROI of an engineering team towards their CTO. Effectively, teams can now say "Hey, we've cut 10% or 15% off our Kubernetes costs, and now we're super-performers and stars."

[57:59] One question again to you, Dio. We have talked a lot about namespaces... But can you tell me your stance on resource limits, and automated recommendations at the container level, for example? I know that everyone is talking namespaces, but the little [unintelligible 00:58:13.26] of ours, the containers, don't get so much love. What's your stance on that? How do you do container observability?

Dio: Alright, so Containerd? Like, compare it to both? Or what do you mean?

Vasil Kaftandzhiev: Yeah, exactly.

Dio: So we're in a state where, other than the service mesh, everything else is basically one container equals one pod... But it's difficult to get it right. So what we advise people is to just set some average CPU and memory, [unintelligible 00:58:44.09] and then keep it there for a week. And then by the end of the first week, change the requests and limits based on usage. We just need to be a bit pushy, and ping them again and again, because people tend to forget... And they always tend to over-provision; they're always afraid that something will break, or something will happen... And as I've said before, most of the time people just abuse infrastructure - they add more memory, add more CPU to fix memory leaks, and stuff... So you need to be a bit strict, and educate people again and again about what that means in terms of cost, in terms of billing, and stuff like that. But yeah, what we say most of the time is: start with some average values, whatever you think is reasonable, and then adjust at the end of the first week.

Now, we don't have a lot of containers in our pods, so this makes our life a bit easier. If that wasn't the case, I'm not sure. I think this is something that may change in the future, though - I don't remember where I was reading about this, or if it's just from [unintelligible 00:59:54.02] mind, but I think in the newest version of Kubernetes, requests and limits will be able to support containers as well. But again, I'm not sure if I just read about it or just [unintelligible 01:00:05.26] I'm not sure.
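
As an aside for readers: a rough sketch of the right-sizing pass Dio describes - start with a guess, watch a week of usage, then tighten - might look like the following. The crude percentile, the 20% headroom, and all the sample numbers are assumptions for illustration only:

    # Illustrative right-sizing pass in the spirit of "set an average, then
    # adjust after a week of observed usage". Thresholds are assumptions.
    def suggest_requests(cpu_samples_mcores, mem_samples_mib, headroom=1.2):
        """Suggest new CPU/memory requests from a week of usage samples."""
        cpu_sorted = sorted(cpu_samples_mcores)
        cpu_p95 = cpu_sorted[int(0.95 * (len(cpu_sorted) - 1))]  # crude p95
        mem_peak = max(mem_samples_mib)
        return {
            "cpu_request_mcores": int(cpu_p95 * headroom),
            "memory_request_mib": int(mem_peak * headroom),
        }

    # Made-up samples, e.g. daily peaks scraped from cAdvisor over a week:
    print(suggest_requests(
        cpu_samples_mcores=[120, 180, 240, 200, 150, 400, 220],
        mem_samples_mib=[300, 320, 310, 290, 330, 360, 340],
    ))
    # -> {'cpu_request_mcores': 288, 'memory_request_mib': 432}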

Vasil Kaftandzhiev: I think that's already available.

Tom Wilkie: Yeah, I'd add a couple of things there, sorry. Firstly, it's worth getting technical for a minute. The piece of software you need to get that telemetry out of Kubernetes is called cAdvisor. Most systems just have this baked in. But it's worth - especially if you want to look up the references on what all of these different metrics mean - going and looking at cAdvisor. That's going to give you per-pod CPU usage, memory usage, these kinds of things. It's actually got nothing to do with pods or containers; it's based on cgroups. But effectively, cAdvisor is the thing you need.
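
As an aside for readers: if a Prometheus server is already scraping the kubelet's cAdvisor endpoint, the per-pod numbers Tom mentions can be pulled with the standard cAdvisor metrics. A minimal sketch (the Prometheus URL and namespace are placeholders) might be:

    # Sketch: pulling per-pod usage from cAdvisor metrics via a Prometheus
    # server that already scrapes the kubelet/cAdvisor endpoint.
    import requests

    PROM_URL = "http://prometheus.example:9090/api/v1/query"  # placeholder

    queries = {
        # CPU cores used per pod, averaged over the last 5 minutes
        "cpu": 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="my-app"}[5m]))',
        # Working-set memory per pod, in bytes
        "memory": 'sum by (pod) (container_memory_working_set_bytes{namespace="my-app"})',
    }

    for name, promql in queries.items():
        resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
        for result in resp.json()["data"]["result"]:
            print(name, result["metric"].get("pod"), result["value"][1])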

At Grafana Labs we're moving towards a world where we actually mandate that limits equal requests, and everything's basically thickly provisioned. Part of this is to avoid a lot of the problems Dio talked about at the beginning of the podcast - there's no incentive in a traditional Kubernetes cluster to actually set your requests reasonably. Because if you set your requests low and get billed for a tiny little amount, and then set your limit really high and optimistically use a load of resources, you've really misaligned incentives. So we're moving to a world where they have to be the same, and we enforce that with a pre-submission hook.
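
As an aside for readers: the "limits equal requests" rule Tom describes is easy to express as a policy check. The sketch below is not Grafana's actual hook - just an illustration of the rule applied to a plain pod-spec dictionary, for example in CI or in an admission webhook:

    # Illustrative policy check in the spirit of "limits must equal requests".
    def requests_equal_limits(pod_spec: dict) -> list[str]:
        """Return a list of violations; an empty list means the pod passes."""
        violations = []
        for container in pod_spec.get("containers", []):
            res = container.get("resources", {})
            requests_, limits = res.get("requests", {}), res.get("limits", {})
            for resource_name in ("cpu", "memory"):
                # A real check would normalise quantities ("1" vs "1000m");
                # a plain string comparison keeps the sketch short.
                if requests_.get(resource_name) != limits.get(resource_name):
                    violations.append(
                        f"{container['name']}: {resource_name} request "
                        f"{requests_.get(resource_name)} != limit {limits.get(resource_name)}"
                    )
        return violations

    pod = {"containers": [{"name": "app", "resources": {
        "requests": {"cpu": "500m", "memory": "1Gi"},
        "limits": {"cpu": "1", "memory": "1Gi"}}}]}
    print(requests_equal_limits(pod))  # -> ['app: cpu request 500m != limit 1']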

And then the final thing I'll say here is, I'm actually not sure how much this matters. Again, controversial topic, but we measure how much you use versus how much you ask for. So we measure utilization, not just allocation. And we bill you for the higher of the two - either how much you ask for, or how much you use, whichever is greater. When I say bill, obviously, I mean internally, back to the teams.

[01:01:49.13] And because of that approach, the teams inside Grafana Labs all have KPIs around unit costs for their service. So they're not penalized, I guess, if their service costs more, as long as there's also more traffic and therefore more revenue to the service as well. But we monitor these unit costs like a hawk. And if a team wants to optimize their unit costs by tweaking and tuning the requests and limits on their Kubernetes jobs and bringing unit costs down like that, or by using Pyroscope to do CPU profiling and bringing down the usage, or by rearchitecting, or any number of other ways - I actually don't mind how they do it. All I mind is that they keep an eye on this unit cost and make sure it doesn't change significantly over time. So I think this is down in the details of "How do I actually maintain a good unit cost and maintain that kind of cost economics over the long term?" And these are all just various techniques.
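
As an aside for readers: a back-of-the-envelope version of the chargeback and unit-cost ideas Tom describes - bill a team for the higher of what they requested and what they used, and track cost per unit of traffic - might look like this, with entirely made-up numbers:

    # Sketch of an internal chargeback: charge for max(requested, used),
    # so under-requesting doesn't pay off, and track cost per unit of traffic.
    def internal_bill(requested_cores, used_cores, price_per_core_hour, hours):
        """Charge for the higher of requested and used capacity."""
        return max(requested_cores, used_cores) * price_per_core_hour * hours

    monthly_cost = internal_bill(requested_cores=40, used_cores=55,
                                 price_per_core_hour=0.04, hours=730)
    requests_served = 1_200_000_000  # e.g. queries handled that month

    unit_cost = monthly_cost / (requests_served / 1_000_000)
    print(f"${monthly_cost:,.0f}/month, ${unit_cost:.3f} per million requests")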

Dio: So Tom, is this always the case? Are teams always supposed to have the same requests and limits? Is this always a best practice internally, at Grafana?

Tom Wilkie: It's not at the moment. It's something I think we're moving towards, at least as a default. And again, there's a big difference between teams - our Mimir team literally spends millions of dollars a year on cloud resources. They do have different limits and requests on their pods, and that team is very sophisticated; they know how to do this, and have been doing it for a while. But that new service we just spun up with a new team, that hasn't spent the time and effort to learn about this - those kinds of teams are [unintelligible 01:03:30.16] to have limits and requests the same, and that's a massive simplification in how you reason about all of this. And again, these new teams barely use any resources, so we're not really losing or gaining anything.

And I will say, there's that long-standing issue in Kubernetes, in the Linux kernel, in the scheduler, where if you don't set them the same, you can actually find your job is frequently paused, and your tail latencies go up... And that's just an artifact of how the Linux scheduler works.

Dio: This is very interesting. It actually solves many of the things we talked about earlier. So you don't have to worry about cost allocation, because - GCP at least - they will tell you how much it costs based on requests. But if your requests and limits are the same, you have an actual number.

Tom Wilkie: Exactly. Yeah.

Dio: I think Grafana is a bit of a different company, because everyone [unintelligible 01:04:21.24] so they know their stuff. For most engineering teams, at least in our case, having requests and limits the same - even though it would be amazing - would escalate cost... Because people, they always --

Tom Wilkie: [01:04:40.13] Yeah, so the downside. The downside of this approach, 100%, is that you lose the ability to burst, and you're basically setting your costs to be your peak cost. Right? But I'd also argue - it wouldn't be a podcast with me if I didn't slip in the term statistical multiplexing - a lot of random signals, when they're multiplexed together, become actually very, very predictable. And that's a philosophy we take to heart in how we architect all of our systems. At small scale, this stuff really matters. At very large scale, what's really interesting is that it matters less, because statistical multiplexing makes things like resource usage, unit costs, scaling - it makes all of these things much more predictable. And it's kind of interesting - some things actually get easier at scale.

Dio: Yeah, it's very interesting. So are teams internally responsible? Do they own their cost as well? Or no?

Tom Wilkie: Yeah, 100%. And you mentioned earlier you have Slack bots, and alerts, and various things... We've moved away from doing this kind of [unintelligible 01:05:45.16] We don't like to wake someone up in the middle of the night because their costs went up by a cent. We think that's generally a bad pattern. So we've moved to... There's a feature in Grafana Enterprise where you can schedule a PDF report, rendered from a dashboard, to be sent to whomever you want. And so we send a PDF report of our cost dashboard - broken down by team, with unit costs, and growth rates, and everything - to everyone in the company, every Monday morning. And that really promotes transparency. Everyone can see how everyone else is doing. It's a very simple dashboard, it's very easy to understand. We regularly review it at the senior leadership level. Once a month we will pull up that dashboard and talk about trends, what different projects we've done to improve, what's gone wrong, and what's caused increases.

And this is, again, the benefit of Grafana and observability in our Big Tent strategy - everyone's using the same data. No one's going "Well, my dashboard says the service costs this", "Well, my dashboard says it costs that." Everyone is singing off the same hymn sheet. It gets rid of a lot of arguments, it drives accountability... Yeah, having that one place to look, and proactively rendering that dashboard and emailing it to people... I get that dashboard every morning at 9am, and it is almost without fail the first thing I look at every day.

Mat Ryer: There's also that thing where we have a bot that replies to your PR, and says like [unintelligible 01:07:15.06]

Tom Wilkie: Oh yeah. Cost. Yeah.

Mat Ryer: Yeah. So that's amazing. It's like, "Yeah, this is a great feature, but it's gonna cost you. You're gonna add this much to your bill." And yeah, it's that transparency, so you can make those decisions. It's good, because when you're abstracted from it, you're kind of blind to it. You're just off in your own world, doing it, and you build up a problem for later... But it's nice to have it as you go, for sure.

Thank you so much. I think that's all the time we have today. I liked it, by the way, Tom, earlier when you said "It's worth getting technical for a minute", like the rest of the episode was about Marvel movies, or something... [laughter] But you are my Marvel heroes, because I learned a lot. I hope you, our listener, did as well. Thank you so much. Tom, Vasil, Dio, this was great. I learned loads. If you did too, please tell your friends. I'll be telling all of mine. Gap for banter from Tom...

Tom Wilkie: Oh, you know, insert banter here... Like, no, thank you, Vasil, thank you, Dio. Great to chat with you. I really enjoyed that.

Mat Ryer: Yup. Thank you so much. Join us next time, won't you, on Grafana's Big Tent. Thank you!

© 2022 Grafana Labs