Cache Rules Everything Around Me
Mat Ryer: Hello and welcome to Grafana's Big Tent, the podcast all about the people, community, tools and tech around observability. I'm Mat Ryer, and today we're digging into caching. We're going to start very high level; we'll look at what a cache is. I'm sure we've all at least encountered some form of cache in our computering lives so far, but what's really going on there? We'll dig a bit deeper and look at some technical considerations around that, and maybe even explore some common gotchas. And we'll look at this in the context of a real case study from within the Loki project here at Grafana Labs. In fact, I'm joined by two of the Loki team now, Ed Welch and Danny Kopping here. Welcome, Danny, and welcome back, Ed. Ed, you joined us in season one to talk about logs.
Ed Welch: I did, and you had me back, so... I'll take that as a good sign.
Mat Ryer: Yeah, absolutely. And Danny, is this your first Big Tent? Is this your first podcast?
Danny Kopping: It's my first Big Tent one, but not my first podcast.
Mat Ryer: Okay. So you've been vetted.
Danny Kopping: Well, I wouldn't say that much, but...
Mat Ryer: Okay, but has someone talked to you before this?
Danny Kopping: Yeah. But I guess I was only on that once, so... It doesn't bode well for this, does it?
Mat Ryer: Well, welcome to Grafana's Big Tent. We're also joined by Alan Kasindorf. Alan, you're Dormando online, and you're the full-time maintainer of memcached. Welcome to the show.
Alan Kasindorf: Thank you.
Ed Welch: I have a quick question... Is it pronounced "memcache-dee" or "memcached"?
Alan Kasindorf: It doesn't matter.
Ed Welch: I think I've said it both.
Alan Kasindorf: I usually say memcached.
Mat Ryer: There you go. And he's a maintainer, so I feel like...
Ed Welch: That's it.
Mat Ryer: Imagine if the GIF guy says it's "jif", not "gif". And it's definitely "gif", isn't it?
Ed Welch: I saw somebody spell GIF the other day with a ph, and that was just a whole world that I'd never seen uncovered.
Mat Ryer: Ed Welch, engineer on the Loki team, taller in real life... You do have a kind of idea when you hang out with people online of their height. And then when I met you in real life, I did think "Ed is slightly taller than I expected."
Ed Welch: Slightly. That's generally the perception most people have.
Danny Kopping: I guess it's the opposite for me then, Mat.
Mat Ryer: No. Why do you say that? [laughter] You're also an engineer on the Loki team, and you're calling in from South Africa. Right, Danny?
Danny Kopping: That's right.
Mat Ryer: Nice. How is it? Is it winter now? What month is it?
Danny Kopping: Apparently it's winter, but where I am it's sort of permanently warm all year round, which is glorious... So it's like 28 degrees Celsius today. What's that in American -- 90 Fahrenheit? Something like that.
Mat Ryer: That sounds hot. As the famous joke goes, the three hardest things in tech are naming things, invalidating the cache, and invalidating the cache. Let's start by actually looking at what caching is. Ed, if you had to define caching - and you sort of do, because of this - how would you do it? How would you define caching?
Ed Welch: So I guess the way I think about how I use cache, which is now being how I'm going to define it, is I want to save myself from doing some work twice. So I do something that takes some amount of time, it's computationally expensive, it's IO expensive, or any of those things, and I don't want to do it again... And then caching is a sort of a faster path to getting the same result. I did not look up any actual definitions beforehand so I'm curious to see if others have some variations.
Mat Ryer: Yeah, I mean, that's certainly how I've used caches before. Danny?
Danny Kopping: Yeah, roughly along the same lines. I think where you see cache in other places is sort of preemptive caching. So instead of saving the results of a previous computation, sort of preempting what you'll need in the future. And this is done a lot like in CPUs, and in web applications... In the past I worked on an eCommerce application, and when we would boot up our memcached instance, we would pre-warm the cache with all of the data about the store. So that would sort of be doing it ahead of time, rather than saving what you've done in the past... Which is kind of another awesome feature of cache. It doesn't really care about where the data came from. It just helps you speed things up and lower latency.
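As a rough illustration of that pre-warming idea, here's a minimal Go sketch using the common gomemcache client; the Product type and loadProducts function are hypothetical stand-ins for "read the whole catalog from the database at boot", not anything from the application Danny describes.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/bradfitz/gomemcache/memcache"
)

type Product struct {
	ID    string
	Name  string
	Price int
}

// loadProducts stands in for reading the full catalog from the database.
func loadProducts() ([]Product, error) { return nil, nil }

func main() {
	mc := memcache.New("127.0.0.1:11211")

	products, err := loadProducts()
	if err != nil {
		log.Fatal(err)
	}
	// Pre-warm: push every product into the cache before any traffic arrives,
	// so the first readers get hits instead of falling back to the database.
	for _, p := range products {
		val, _ := json.Marshal(p)
		if err := mc.Set(&memcache.Item{Key: "product:" + p.ID, Value: val, Expiration: 3600}); err != nil {
			log.Printf("warm %s: %v", p.ID, err)
		}
	}
}
```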
Mat Ryer: So caches are quicker to read from and write to, are they, compared to other things?
Danny Kopping: Yeah, typically caching is done in memory, and that has a very, very low latency, compared to for example looking it up in the database, or retrieving it from disk... Not all caching is done in main memory though, as we'll chat about later in this session, but typically it's done that way so that you have the fastest access possible to that information.
Mat Ryer: Yeah. Okay. So that makes sense. So we're gonna do some work, and we don't want to have to keep doing it again. We expect that -- this may have been read the first time, but we expect other people, or the same people, are going to read the same thing many times. So I guess, do we prefer this to be kind of like data-driven, these kinds of decisions about where we should cache things? How do we know what would be a good thing to cache? Surely just cache everything, and then it'd be really quick, won't it?
Alan Kasindorf: So the definition of not doing something twice is pretty good for cache. I like that, I hear it a lot. It's probably the best way to really think about it if people are just starting to think about it. But the other really main thing with caching is just getting data closer to you. Usually physically. So with CPU caches, L1 is really in there next to your execution units, which are actually doing the math. And then you have L2, L3, and then main memory, and it's a matter of distance. And the closer something is, you start to refer to it as being cached. And this is true for CDNs as well, for speeding up the internet, and all sorts of stuff. So it's usually the second layer of it anyway.
Mat Ryer: [06:21] Yeah, so I wonder if you could just explain a bit about these different layers of cache, because you hear this a lot. Like, you hear L1 cache, L2 cache, these levels of cache. What do they mean, and what's different about them?
Alan Kasindorf: So when we talk about caching - and I think while we were initially talking about caching - it's mostly application-level caching. So very high-level, L7 in the parlance, the applications talking to each other over network, over the computers. But when you talk about caching in the general sense, related to computers, the first thing people usually come up with is the CPU cache, which is holding data and code a lot closer to where the execution units are, so the CPU doesn't have to wait as long. So you have these CPUs that are running at four gigahertz, five gigahertz, they're doing things like 5 billion times per second, and having to wait for data completely kills them.
So accessing something from a memory, or God forbid a disk, has your CPU just sitting and doing nothing for a very long period of time. So if you really want to actually get the usage out of your computer, you need to have very good usage of caches at various layers. And so you kind of pull that up to a higher level, where maybe you have a whole server sitting there in the cloud somewhere, and it's waiting on a database response, and you can't do anything else while you're waiting for that database. You might have other processes on the box doing something, but in general, you're paying for this thing to sit there and you want it to actually be active and doing things, instead of just waiting for data.
Mat Ryer: And what does a typical caching API look like? I know that I've used sort of application-level caching, I've used things like just simple get, put... And sometimes you can set an expiry, and things. Is there a common pattern, are there different flavors of this?
Ed Welch: Well, I was sitting here, thinking about how little I know about L1, L2 and CPU cache layers to know what kind of APIs they define... But I know for example in our application-level caching we tend to use pretty simple interfaces. Really, one question would be whether there's some mutability in your cache or not, and whether or not it's just a get and put. I think we actually might do both with Loki. We might mutate, and if I'm wrong about it, Alan is gonna tell me whether memcached supports mutating of keys within cache or not... Or if we just append multiple entries and it sort of appears like we're mutating it based on how we access it. But for the most part, our APIs are pretty much CRUD... Although I suppose we don't really delete cache stuff, we just rely on expiration for that. So we use Least Recently Used type access to expire things when we're out of space in our cache.
Alan Kasindorf: The common APIs are really just gets and sets. It's kind of the joke, especially at larger companies: "What's so hard about cache? It's just gets and sets." It's really gets, sets, deletes, sometimes you want to update the TTL, the time to live, sometimes you want to fetch something, but not have it be bumped in the other [unintelligible 00:09:18.07] You want to just look at it, you want to scan it. So there's lots of little things that come off, but really, it's just gets and sets. And then you have more advanced stuff that maybe we can get into later, but... In general, the application cache doesn't usually operate the same way as like a CPU cache, which is usually readthrough and writethrough, where you say "I want something from main memory", and then the computer goes and says "I've got these --" they're called set associative algorithms, usually, "to figure out if this is already in the CPU cache, which layer", and if not, whether or not to pull it all the way in, pull it all the way out. When you write data back to memory, it'll probably write to the cache first, and then eventually flush back to memory... So it's got all these things, but it's transparent to the user; you don't really have to think about it. When you're designing applications, sometimes you think "I'm doing this or that, I'm gonna be getting cache misses, and it's gonna be a little bit slow", but most people don't care, especially when you're writing like PHP, or Ruby, or anything.
[10:13] So that's really the main difference. There are a lot of products out there that do writethrough and readthrough caching, especially at some larger companies, like [unintelligible 00:10:21.01] Facebook is kind of famous for that. They have papers out around that. And there's a couple companies trying to bootstrap around that all the time. But it is a very difficult topic to do writethrough and readthrough caching in front of like actual databases, at an application level.
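To make the shape of that API concrete, here's a minimal sketch in Go of the kind of application-level cache interface being described - gets, sets, deletes, plus a TTL update ("touch"). It's an illustration, not any particular client library's actual interface.

```go
package cache

import "time"

// Cache captures the core verbs of an application-level cache.
type Cache interface {
	// Get returns the cached value and whether it was a hit.
	Get(key string) (value []byte, ok bool)
	// Set stores a value with a time-to-live, after which it may expire.
	Set(key string, value []byte, ttl time.Duration) error
	// Delete explicitly invalidates a key.
	Delete(key string) error
	// Touch updates the TTL of an existing item without rewriting its value.
	Touch(key string, ttl time.Duration) error
}
```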
Danny Kopping: Yeah. If folks want to hear more about CPU caches, there's an amazing talk by a chap called Scott Meyers. He does a talk called "CPU caches, and why you care." And it's pretty interesting. It's definitely, for me, I've found, working in very high-level languages, I've never really considered how things work at a lower level like this... But when you start working on very performance-sensitive code, writing software in a way that is mechanically sympathetic to how the CPUs work can really improve the performance of your code in quite surprising ways.
Mat Ryer: Yeah. And again, this comes back to that point of when does this become a problem for you? Because for a lot of people -- depending on what you're building. Now, Loki of course has such high traffic, you'd probably really notice if you've got any sort of -- any kind of bottlenecks, you're really going to feel them at that scale.
Danny Kopping: Yeah, absolutely.
Ed Welch: I've been thinking about Alan's extension of my original definition, and it will tie into actually a big part of what we want to talk about, some caching changes that we made... But yeah, certainly data locality... Although for a system like Loki, a distributed system, we're largely just trying to get stuff within the same rack, or set of racks within an availability zone, or two, or three. And there's sort of two ways that we use cache.
And maybe one of the things we can talk about is like when you shouldn't cache, or you know that you need to cache. It is going to be a bit subjective on what you're doing. But like a lot of things, these problems will present themselves; I wouldn't necessarily suggest you prematurely cache your results for whatever your app does... But one of the things we find, the access patterns for a time series database that is backed, or is backing a dashboarding tool, means you get a lot of repetitive access of the same data. The dashboard will be refreshed, and you will be essentially drawing the same data repeatedly. Most of that data hasn't changed, only the recent -- since last refresh interval. So that was a use case where caching results becomes incredibly valuable for fast reload times. We don't want to have to fetch the same data again and again.
But we also cache the data that we fetch from object storage. Loki is a database built on object storage, like S3, GCS, Azure Blob... These object stores are very fast; I'm sure that they have a tremendous amount of caching inside of them based on the behaviors that we see... But it's still the case that when we're using stuff multiple times, it makes sense for us to cache it closer to where we're accessing it. So we just cache the data chunks themselves.
So I've lost track of what was perhaps the original question here, but I think in terms of when you should use cache, or how, these problems tend to present themselves, especially if you have access to good profiling tooling, and you can see where your application is spending a lot of time, and that you can look to optimize by adding cache. But there's always a price to cache... So [unintelligible 00:13:52.27]
Alan Kasindorf: [13:57] Keeping it simple is easier said than done, I think. I think a lot of people kind of lean towards it a bit too soon, even... Every database has a cache, a buffer pool cache or something... And usually, the common refrain with like DBAs is "Oh, somebody added a little cache when they really just needed to add an index." So you kind of usually probably want to cache database-level stuff after you've kind of run out of other options. You've done all the good things, you actually understand where your database is... Your database itself has more than like 10 megs of cache configured... The machine is sized properly... You really want to try to not do that, because it is adding complexity. And you only really want to use cache if it factors into your cost metrics. You want time to be better, you want it to cost less overall... So removing repeated accesses because your database is expensive is good, removing repeated accesses because it's slow and you really want to get your shopping cart up in 30 milliseconds instead of 10 seconds and losing your patrons...
Mat Ryer: Yeah. I think that's really sound advice. But I've definitely in the past reached too soon for the cache. And actually, sometimes -- you say it's a really simple "tends to be gets and sets", but actually as soon as there's a few of these interacting systems, the number of possible combinations goes through the roof, and you end up in this -- it can be really quite difficult to debug as well, caching issues. Any web developer knows this just from the fact that the JavaScript in the page is cached, and sometimes refreshing that doesn't actually get the latest version of the code, because it's cached. And even things like that still happen, which is kind of outrageous, really...
Danny Kopping: And you opened up with one of the most pernicious problems, which is cache invalidation. Like, how do you actually know when the data that's in your cache is actually the right data? And if you reach for cache too soon, and you don't realize that you can start running into this class of problem, you can start developing a whole new problem for your application, when you were just trying to plug another one. So I guess the first law of caching would be "Don't reach for it too soon", and then the second one would be "Know your workload." Understand what's going in there, and when it can go stale, how to invalidate it. And with tools like memcached we have expiration, which we should probably touch on, since we mentioned it earlier... And how that can feed into cache invalidation and what it means to invalidate items in the cache.
Mat Ryer: Yeah. So I guess you can either explicitly delete it based on some logic somewhere, or you can let it just go out, go stale; just expire on its own. For cases when you do the push, when you're sort of saying "Now, I know this data now is invalid. I'm going to go and explicitly give that signal." Is that just a case of kind of calling delete on something like a memcached?
Alan Kasindorf: In the best case, you're calling set instead. So if you want to update the cache, it's best to kind of put the new cache in there. But the simplest is definitely to delete it or let it expire, and then recache from your source of truth.
Ed Welch: Could you dig into that a little bit, Alan? Why would it be faster to set -- it sort of came to my mind that maybe the delete operations can be handled asynchronously or something in a different way, but that may not have been where you were...
Alan Kasindorf: Yeah, I'm trying to not overcomplicate the answer. Sorry.
Mat Ryer: No, but if it's complicated...
Alan Kasindorf: We can touch it a little bit later, it's complicated. The introduction to this -- so usually, when you have an application, and some user goes in, they click their profile and they say "I'm going to change my age on this profile, because I forgot how old I was when I registered." And you click that, and your application goes "Okay, well, I've updated the database..." Say their age is different, or now they're taller... And I've got to tell the cache that is displaying their profile page to their millions of fans that "This needs to be updated."
[18:09] The simplest obviously is just issue a delete. Like, "This is gone. This data is gone. Their profile is gone. Go get it from the database, go bring it back in." But as I said, sometimes this data might be very hot. You're already a celebrity, and you've got a million followers, and they are all looking at your profile obsessively to see if you're updating how tall you are. So in those cases, you really don't want the cache to ever be empty. This is, I guess, pretty rare overall. This is kind of the next level of get and sets, and now like "Okay, well what happens if you do a delete and the cache is missing, and you've got 10,000 people who are trying to hit at that thing at the same time?" You're gonna hit your database 10,000 times, you're gonna fall over. So at that point, you start thinking "Well, maybe I've updated their data in the database. Now we can set off an asynchronous process, which goes and fetches it from the database again, remelds the whole thing, and updates the cache directly, instead of just expiring it."
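A hedged sketch of the two invalidation styles Alan describes, using the Go gomemcache client; the Profile type and loadProfileFromDB function are hypothetical stand-ins for the application's source of truth.

```go
package profilecache

import (
	"encoding/json"

	"github.com/bradfitz/gomemcache/memcache"
)

type Profile struct {
	ID       string
	Name     string
	HeightCM int
}

// loadProfileFromDB stands in for reading the row(s) from the database.
func loadProfileFromDB(id string) (Profile, error) { return Profile{ID: id}, nil }

// InvalidateByDelete drops the key and lets the next reader repopulate it.
// Risky for very hot keys: thousands of readers can miss at once and stampede
// the database.
func InvalidateByDelete(mc *memcache.Client, id string) error {
	return mc.Delete("profile:" + id)
}

// InvalidateBySet rebuilds the value from the source of truth and overwrites
// the cache entry, so the key is never missing for hot readers.
func InvalidateBySet(mc *memcache.Client, id string) error {
	p, err := loadProfileFromDB(id)
	if err != nil {
		return err
	}
	val, err := json.Marshal(p)
	if err != nil {
		return err
	}
	return mc.Set(&memcache.Item{Key: "profile:" + p.ID, Value: val, Expiration: 300})
}
```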
Ed Welch: That kind of ties in with an interesting set of trade-offs that we're going to dig into with some of the cache work we've changed with Loki, where if we cache too much, we set ourselves up for failure if we lose our cache, or we lose enough of our cache... So we're trying to strike a balance between sort of offsetting some downstream load, but not removing it completely, because the downstream system uses our traffic to sort of scale their infrastructure as well. So if we completely stop requesting downstream of an object store, their internal rate limits and capacity is reduced, because we're not requesting it... And then if we entirely lose our cache, then we hit them with way too much for what's provisioned, and run into rate limits and have to wait for that to be sort of reprovisioned. So similar, but different, I suppose, in the sense that getting the right size cache and the right amount of stuff cached is not always as simple as it might seem.
Mat Ryer: Yeah, but that's such an interesting case, because you're dealing there not only with just your own application design, but also like realities of the environments that you're running in the real world. You're running these in cloud environments, and then that's going to have an impact. So that's a really interesting lesson again, I guess, about trying to preempt it too much, and almost wait until you have problems... Because that would be a hard one to predict, wouldn't it?
Ed Welch: Yeah, I'm kind of itching to sort of dig into the sort of specific use case with Loki that we ended up getting turned on with Alan about, and the [unintelligible 00:20:42.24] capabilities of memcache to store to local disk... So do we want to --
Mat Ryer: Yeah, I've just got a couple more questions first... Because I'm just also interested in something Alan mentioned. When you read data from the database, you might then change it or do something to it before putting it in the cache. And I think that's also quite an interesting thing to talk about, is you probably aren't -- well, sometimes you will just be literally taking the same data and having a copy of it. But depending on how that's going to be displayed - it's very common to actually prepare the view almost, in advance, so that it's very easy and very quick to read straight from the cache. Are there good practices around this? Does this just depend on each case? Or are there guidelines we should follow for what data we actually put in, and what to change?
Alan Kasindorf: [21:35] I think starting simple... so you start with "I'm going to cache some data from my database", or a picture from somewhere else, or something. And then the next time you look at your profiling data, or you look at, say, your web server load, and you say, "Well, we're spending 80% of our CPU applying templates to HTML, and then serving that." That's how it used to be. We go back and forth between single-page services, and backend rendering, but sometimes that backend rendering is a huge part of your cost, and your time. And so you say "Well, actually, if I am not re-rendering these templates all the time, I can cut my web server load in half, and save half my money, and serve these pages faster." But again, that's just more complicated. Now you have to worry about "Am I accidentally caching something from the wrong language that's been templated?", or I have this combinatorial explosion of different ways you have to cache it, so you're kind of like "Well, maybe I've got this sub-piece of it that takes a while to pull in from three different sources", and you put it all together, and then you cache that... And you kind of just have to look at what your highest pain point is, and only really focus on that. If you think "I'm gonna go holistically from the bottom up and look at every single access we do", and everything, you're gonna get lost and just give yourself lots of places where you might end up caching the wrong data, or serving the wrong data to your users, or something they didn't ask for anyway.
Danny Kopping: Yeah, the way that I've used cache in the past is almost like a materialized view. So you've got a very complex database structure, and for example taking the eCommerce store again, you want to find out how many orders a particular user has placed. And that becomes a bit complicated, because you have the users table, and the orders table, and some orders failed, and they didn't go through, or there was another problem... And so to actually count the number of orders that were successfully placed, you have to run a select statement that selects from multiple different tables... And so in the SQL space, you could have something like a materialized view, which is a query that gets run on a particular cadence, and denormalizes that data... So that instead of storing it in this perfect form, where everything is related to each other neatly, and there are these very atomic fields that just represent one thing, now you're combining them into multiple things to represent some more abstract idea, or a more combined idea. And so with cache, you would do that, too. So you would change the shape of the data, so it wouldn't be exactly what is in your database, you would do a computation on that data and then store the result of it in cache. So it wouldn't be exactly what is represented in your database.
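Something like the following Go sketch captures that pattern: cache the denormalized result (an order count) rather than the raw rows, and fall back to the database on a miss. The SQL, key format, and function names are illustrative, not taken from any real schema.

```go
package ordercache

import (
	"database/sql"
	"fmt"
	"strconv"

	"github.com/bradfitz/gomemcache/memcache"
)

// OrderCount returns the number of completed orders for a user, preferring
// the cached, denormalized value and recomputing it from the database on a miss.
func OrderCount(db *sql.DB, mc *memcache.Client, userID int64) (int, error) {
	key := fmt.Sprintf("order_count:%d", userID)

	// Cache hit: skip the multi-table aggregation entirely.
	if it, err := mc.Get(key); err == nil {
		return strconv.Atoi(string(it.Value))
	}

	// Miss (or cache unavailable): compute from the source of truth.
	var count int
	err := db.QueryRow(
		`SELECT COUNT(*) FROM orders WHERE user_id = ? AND status = 'completed'`,
		userID,
	).Scan(&count)
	if err != nil {
		return 0, err
	}

	// Best-effort write-back with a short TTL; the database stays authoritative.
	_ = mc.Set(&memcache.Item{Key: key, Value: []byte(strconv.Itoa(count)), Expiration: 60})
	return count, nil
}
```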
Mat Ryer: Yeah, so in that world, that is then a bit more complicated. And what happens if the cache -- because you should always assume that you're going to get a miss from the cache, right? You should never rely on it being available. And so can you just fall back to a plan B if it happens to not be in the cache? Just go to the original place and get it?
Alan Kasindorf: Yes. So I would absolutely say don't cache your ephemeral login sessions in memcached. That's kind of the classic counter example, is "Oh, [unintelligible 00:24:52.02] my cache, and now all my users are logged out and they're mad." In general, really just try to make sure that there is an authoritative point for your data that is not the cache.
That said, in these days, in these cloudy days, databases are sometimes getting a little bit slower than they used to be, and have a little bit less control over their capacity. I think things mentioned earlier, like rate limiting, and if you bust your cache accidentally, the cloud provider might say "Oh, you're gonna cost us a million dollars in traffic, so we're gonna just tell you no a lot, or we're gonna tell you no very quickly", and you're gonna send a lot of no back to your users, and they're gonna be very upset.
So I see a lot more now, like a replicated cache, and things that are a little bit, I guess, sketchy in the grand scheme of things, but kind of required for the way that they're being built now, where if I'm losing my cache in availability zone A, then we're going to be down for -- we'll come back, but we'll be down for half an hour, and we'd like to avoid that, so it's worth it to spend the cost on having a second copy of your cache that's warmed in another availability zone.
[26:01] I'm seeing that a lot now... And if not that, then people are just straight up -- if it's a small operation, and they only have like three cache servers, they'll have three copies of their cache. And that also makes like invalidation harder, as we keep coming back to... But I see that a lot more now than I used to.
Mat Ryer: That's really interesting, to see how this stuff changes. There's another interesting subject I'd love to just touch on a little bit and learn a bit more about... And really, it stems from this question. When you put and get something from a cache, presumably you have to have some kind of ID. What's special about that ID? Can it be anything? Is there something we should consider when picking these IDs?
Alan Kasindorf: Well, in memcached they're just alphanumeric strings. 250 bytes or less. The shorter your key, the better. Long story short, the key takes memory as well... So if you have a limited amount of cache, and your keys are very, very long, because you're putting like customer ID, like the word customer ID, and then your customer ID, that's wasting space in your cache, and it's gonna cost you money.
There is support for kind of arbitrary binary keys in newer versions, but it's not very common. Usually in the wild they're like hashes, or seeded hashes, like an MD5 sum, or a SHA, a very large hash, and some other data. And then that's the actual key that's stored in the cache. The CDN case is what you'll usually see, is like your customer ID, your website ID, and then your cache object is some huge amount of data, like these various HTTP headers that all meld into what this cache object was, and that all gets hashed together... And then that gets put into the cache.
For memcached it's just a string, so it's kind of abstract from that point... You might say "Oh, this cache object I'm putting in is based off of these rows here, and then this JSON file from somewhere else", and then I hash those together, and then I put that in the key, and then I put that in the cache.
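A small Go sketch of that key-building idea - hash together everything that defines the object and use the digest as the cache key; the field names here are made up for illustration.

```go
package cachekey

import (
	"crypto/sha256"
	"encoding/hex"
)

// Key derives a fixed-length cache key from the inputs that define the object.
func Key(customerID, websiteID, path, acceptLanguage string) string {
	h := sha256.New()
	for _, part := range []string{customerID, websiteID, path, acceptLanguage} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator, so "ab"+"c" hashes differently from "a"+"bc"
	}
	// 64 hex characters: well under memcached's key length limit.
	return hex.EncodeToString(h.Sum(nil))
}
```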
Mat Ryer: So it's worth thinking about though, because if you MD5 the content of something, and then the content's changing, of course, you're gonna have a different ID. So you're going to be thinking about the use of your cache. So that's interesting. Are there advantages to kind of having things consistently distributed in a hash? [unintelligible 00:28:20.08] something you don't have to worry about?
Alan Kasindorf: That's something you usually have to worry about. These are all getting hashed multiple times. The shower epiphany of memcached was that we're going to hash things twice. So a long time ago, I actually worked with Brad Fitzpatrick, the guy who invented memcached a very long time ago, and we were struggling with this idea of caching before there were standard platforms. And then a year or two later -- we had some prototypes that didn't work out, and then a year or two later he said he was just sitting there and he was like "Wait a minute, what if I hash the thing twice?" So you take your key, and you do a computational hash on it, and then your first hash table is your list of servers. Your list of different cache nodes. And then you send that key to the cache node, and then the cache node itself hashes the key again into this numeric value, and then spreads that out internally.
So then there were a lot of schemes and evolutions since then, and a little bit before, depending on who you ask, for how to distribute that data across all these things... So we have pretty good standards for that now, both in hashing algorithms and key distribution algorithms for making sure that whatever you put in there as an arbitrary string is gonna get spread out pretty evenly.
Ed Welch: Is that something that memcached -- so within Loki, we built our own Jump Hash implementation for... So we run our memcached servers as a stateful set, so that they get a consistent name that sorts well... And then for our key objects we use a jump hash to figure out which instance they go to. Is that something that you can get kind of out of the box with memcache now?
Alan Kasindorf: [30:03] So memcached itself is just a server. I will say though, I do like Jump Hash. I'm glad we're sort of starting to standardize on Jump Hash. It has some other pretty good properties for managing caches, and more advanced cases. But it's also just really simple. It's like, there's one really tiny function, which means it's easy to port to different languages. So as you said, you've just kind of spun it up and did it. The older, consistent hashing algorithms were more complex.
So out of the box, we're talking about clients, usually. So memcached in the past had a standard set of clients; like in any language, it's available, the protocol is very simple, so it's very easy for people to write these clients... But some of them implemented things like consistent hashing a little bit differently, or they're not as maintained anymore... So in the last few years I've been working on a built-in proxy in memcached, and one of the things it can do is kind of take this role of a bit of a client for you. So you can write like a very simple client, that's just gets and sets, and then you talk to the proxy, however you set it up, and then the proxy itself is doing things like key distribution, your Jump Hash, or your Ketama hash, or whatever you need to do. And that kind of abstracts you one more layer away from your Java or PHP or whatnot. At least that's the idea.
Ed Welch: The reason that that was important for -- and probably for anybody... I didn't write this, I only roughly know how it works, but the jumphash minimizes the amount of key-space thrashing that happens when we add nodes to our memcache pool... Because what we don't want to happen is to add one node and have the entire key space change, so that everything is invalidated effectively in the cache, because you go looking for it and it's not where it was. So I don't know -- I don't know how much the jumphash minimizes that, but I believe that was why it was picked initially, was to minimize the amount of key space juggling that happens when we change the size of the node pool.
Alan Kasindorf: Yeah, I did the thing again where I'm too familiar with the subject and I forgot to define it... So let's talk for a minute about the hashing strategies for key distribution. So this is taking your data, which is keyed by some value, and then distributing it across your caches. So memcached is fundamentally a distributed system where the systems, the individual servers don't talk to each other at all. So this simplifies a lot of the cases, and it allows you to scale by just adding more servers. If you say like "I need more cache", you add one more server, you get both read/write capacity, and you get memory capacity.
And so when you add a server, what happens? Obviously, if you change the size of a hash table, you need to move your keys around inside that. The strategy for this is called consistent hashing. If you Google it or look it up on Wikipedia, you'll see probably the ring hash explanation of it, where you'll hash the object using like MD5, and then you take the MD5 and you do a little bit of magic and you kind of spin around a wheel looking for a location and you drop there. That's not a very good explanation of it, but it's actually -- it's much easier to see, unfortunately, so I don't think it's gonna be really great to explain consistent hashing on a podcast... But the overall idea for the consistent hash is to take -- say if you have three servers, and you add a fourth, then some large percentage, like 20% plus of your keys are going to have to move to realign to these new servers. But what if your three servers really look like 1,000, or 10,000, which is what these algorithms are doing? So it takes your three servers and it places them multiple times all along this wheel, and then when you hash it, you spin around the wheel and you land at one point, and that point is going to be server A, server B, server C, but one of 1,000 chances.
[34:10] So if you go from three servers to four, you have to now divide that across 1,000 points instead of just three to four points. You move much less of the cache, because you disrupt fewer of those points by adding server D. So jumphash is a simplification of this, where we used to actually have to create 1,000 buckets in our consistent hashing algorithms, and manage it... And they were sometimes keyed by the host ID or something from the server... And jumphash, you just have buckets, and then it internally does its trials, a little bit of math magic, and kind of internally expands and then contracts the amount of key space here that your key is being hashed onto.
So then you have two good properties from jumphash. One is you kind of minimize the churn, because it's creating this virtually large hash table to hash against. And the second is it's abstracted to buckets instead of using your host ID, your host information. So if you have three servers, and your third one goes kaputt, you can put a new third one back and your keys don't move. You've only just lost the ones that were lost on the server that went away.
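For reference, the jump consistent hash function they're describing (from the Lamport & Veach paper "A Fast, Minimal Memory, Consistent Hash Algorithm") really is tiny; here it is transcribed into Go. The caller hashes the key to a uint64 first, and the returned bucket index is then mapped to a sorted list of server names, roughly as Ed describes next.

```go
package jumphash

// Jump maps a pre-hashed key to a bucket in [0, numBuckets). Growing the
// bucket count by one only moves about 1/numBuckets of the keys.
func Jump(key uint64, numBuckets int) int {
	var b int64 = -1
	var j int64
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int(b)
}
```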
Ed Welch: Yeah, that's both a better explanation and consistent with how we're using it, where we use the buckets, and then the buckets map to a list of servers which are lexicographically sorted, so that if we have to replace one, it comes back with the same name.
Alan Kasindorf: Sorry it took three tries to explain that...
Ed Welch: No, that's -- I have this sort of running joke, because I've been working on Loki for about four years and I didn't know anything about distributed systems before I started... And it's like every problem is just a key-space problem, and it's just figuring out how you can shard that key-space up. And all of the problems that we solve in Loki are with hash rings, right? So your explanation is good in that I don't know that I've ever explained before why we insert multiple tokens for a server on a ring, and it's for the effect you've just described, where you don't want -- if you had one token per instance, your key space changes dramatically when you're at a smaller scale. Interesting, though... And maybe getting a bit off topic for caching, but it's still interesting.
Alan Kasindorf: That's the algorithm that made Akamai back in the day, is the original ring hash.
Mat Ryer: That's cool. So Ed, the technical problem that kind of led you to this collaboration - maybe you could tell us a bit about that.
Ed Welch: Yeah, we've been kind of hinting at this a little bit, but I'll give a little bit of background, because it does tie into what Alan was saying about some of the interesting problems with building a database on an object storage that's basically relying on kind of a black box that is an HTTP API for storage... Which comes with a tremendous amount of advantages, right? I'll go on to the parts that I love about this, because operationally, over my career, every time I've managed systems that have disks, that comes with a sort of set of nightmares, right? Disks run out of space, disks are not super-reliable... Object storage abstracts that all the way through this nice API. It supports a lot of parallelization, high durability... However, within Loki, we've built this system in a way that we generate a lot of small files. Interestingly, that would have been, I thought, one of the bigger problems we would have had to have addressed years ago, and surprisingly, a lot of my conventional understanding around small files and performance isn't what I've seen in practice... But where it does matter is that all object storage has rate limits on IO operations. So there's a limit to the number of get and put requests that you can make per second; depending on the provider, they may be movable, they may scale dynamically as well, too...
[38:11] Back to the key-space problem: when you're using object storage, the key space you use -- which is the names of the files you put in -- becomes important, because that essentially has to spread the load over what they run on the backend in similar ways... But you can find that you're gonna go make a bunch of requests, and suddenly you're met with a slowdown, and that's often because the infrastructure behind that is scaling, or you hit a limit that the provider won't move.
So within Loki we do run into this, and we actually found ourselves in a situation where we were pretty consistently running into this with a very particular customer workload... And the real solution for us that we're moving towards is sort of changing the file size to kind of request larger chunks of data, and not so many small files... But then in the meantime we have this interesting problem, which is "Okay, what do we do now, while we're working on that?" We're already using memcached for caching these objects. So we fetch them, and then we store them in a cache, and that already serves as a write deamplification against -- or read and write deamplification, mostly read deamplification, because mostly reads are what we're working with here... But log data and log queries can be terabytes of data, and terabytes of cache, and memory gets rather expensive.
So that was the sort of origin into "How do we solve this problem? We're probably going to need to use disks." I mentioned before I don't like disks. I'd rather not have a lot of the problems associated with them. And that was when Danny made us aware of the ability for memcached to back the memory cache with local disks. And this is particularly appealing for us, because it just presents itself as the interface we're familiar with today, and gives us like infinitely more capacity. And from a latency perspective, it's going to operate very similarly to what we see out of object storage anyway... So it was a really nice way for us to cache way more, in ways that were consistently performing, and probably more consistent lower latency because of that proximity thing, without having to care about a lot of the factors related to this. Now, that's not entirely true. Danny has spent quite a bit of time in looking at the ways we can build up machines to get the right set of disks and the right set of cost trade-offs. But in terms of like running out of space, or if a node has a disk failure, it's just the same as losing a cache node. It has the same modes that we're already familiar with operationally. So that kind of led us down this path of "Oh, interesting. This --" I'm gonna say Extstore, external store, I'm sure it's a short version of that... Really, really neat feature of memcached, that I don't know how many folks know about, and I think that's what we wanted to kind of dig into.
Mat Ryer: Yeah, that's very cool. So the interesting thing there then, you said you're putting it on disk. But isn't the disk much slower than just having it in memory? How come that's still preferred?
Ed Welch: Yeah, the latency question... Because in your presentation that I watched recently, Danny talked about the trade-offs on memory latency versus disk latency. But mind you, we're essentially using this as a form of a more localized object storage. So that's the latency that we're working with. So what kind of were the different latency values we saw, or have seen after moving to an Extstore-based approach?
Danny Kopping: Yeah, so the fundamental unit of work inside of Loki is a chunk. A chunk is a compressed file full of log data. And so when a query gets run, we have to go and enumerate all of the chunks that we could scan to respond to that query. So if we wanted to find all of the Nginx logs, for example, and then find all of the Nginx logs that have a particular IP address in it, Loki has a small amount of indexing, which will point us to the right number of chunks, the right set of chunks to download in order to satisfy that query.
[42:20] In the past, what we would do is store all of this in memory using memcached. And it's super-fast, because it's all in primary memory, and that's got a latency in the nanosecond scale... Like hundreds of nanoseconds, maybe a microsecond or two. And memcached being this abstraction on top of main memory that we can use to distribute keys around adds a little bit of latency there, but it's extremely tolerable for us, so we would be accessing these chunks at a millisecond per request, usually less. But our sort of standard p50/50th percentile read latency would be around one millisecond.
Now, we had a fundamental trade-off to make here, because we needed a much larger cache, because we couldn't fit all of our items into cache. And we can come back to why that was a problem and how we discovered that. But fundamentally, the only real choice that we had that was cost-effective to be able to store a large amount of data in cache was to use disks. And so intuitively, you think "Well, we use caches to avoid the need to go down to disks... So why are we using disks for our cache?" The really nice thing about SSDs is they have this property nowadays where they're able to do extremely high numbers of read/write operations at a very low latency, that is approaching the speed of DRAM. It's still a couple of orders of magnitude off, but it's fast enough now that we can use this in a caching application.
So when we started using Extstore from memcached we noticed that our 50th percentile latency didn't go up all that much. It went up from one to three milliseconds, to between 8 and 10 milliseconds. And that sounds like a big jump, but in comparison to the latency that we would experience fetching those items from the backing store from object storage, which would be in the tens of milliseconds, hundreds of milliseconds sometimes, depending on the load, this is a really, really nice trade-off for us. And then there's a huge amount of optimization inside of memcached to make sure that we don't do too many operations that an SSD doesn't like. Because SSDs are very, very fickle machines, as anyone is well aware. So yeah, maybe you can tell us more about what are the challenges of building a disk-based cache.
Alan Kasindorf: So Extstore - it sounds like it's a separate thing, but it's actually just a set of start options for memcached. So if you take any memcached that's been built in the last couple of years and you give it some special options, to start with you could say "Use this disk file, and if objects are over this size, put them in the disk file." That's kind of the high-level oversimplification. But it is just like a set of start options.
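For a sense of what those start options look like, here's an illustrative invocation; the path, sizes and thread count are made up, and the exact option set varies by version, so check the extstore documentation for your build.

```
# Illustrative only: 4 GB of RAM for hot items, items up to 4 MB,
# and cold values spilled to a 256 GB file on a local SSD.
memcached -m 4096 -I 4m \
  -o ext_path=/mnt/localssd/extstore:256G,ext_item_size=1024,ext_wbuf_size=32,ext_threads=4
```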
And what this does is it takes -- once memory is actually full, it starts looking at the oldest items in the cache, the coldest items of the cache, and says "Well, are you over this size? Are you something that we feel like we want to keep around?" And then it will take that data and take the value and put the value on disk. It actually puts the whole thing on disk, but it mainly is putting the value on disk, and then it leaves the key and a little bit of data around in memory. And so this means if you have -- Loki is pretty much a perfect use case for it, because these data chunks are fairly large. So if you start taking these data chunks and you cut the value off, and you keep the key [unintelligible 00:45:55.25] in memory, and you throw that on a disk, you can still store an awful lot of data. You can fill your disks. And this trade-off is -- as Danny said, the SSDs are very fast, and in most cases people don't really need to care about how they're using them... But if you're doing it with a cache pattern, we kind of do, unfortunately.
[46:17] So let me back this up a bit and say that RAM is very cost-effective for cache. And this is because RAM doesn't really care if you are modifying it a lot. Hard drives used to not really care about if you're modifying it; reading or writing, it didn't really matter. But SSDs, one, they have a limited amount of write cycles. You can write to it a certain amount, and then it's dead. And so SSDs go through great lengths to kind of moderate this. They do background garbage collection, and they have all these fancy algorithms, extra hardware, RAM caches and things to try to alleviate and kind of moderate around this issue. So if you are taking data in memory that's cached, and you're putting that on disk, and then you're changing that, you're evicting the data, you're doing something with the data, it's putting an awful lot of unwanted load on these SSDs.
And I observed at a couple points that a lot of what people have in cache is this larger data. And we can get away with -- first of all, if it's the larger data, you're probably going to be not reading and writing it as much as smaller data. But it's also just a better fit for like if we put the disk here, we're gonna get a larger benefit by dropping this data onto the disk, and making more space for smaller keys, or whatnot. So it was kind of the original "Oh, maybe this will work." So it was just a matter of finding the right trade-offs for making this actually work.
And what we have with Extstore is like basically a write buffer. And so when things are being pulled from the tail, it puts them in a little write buffer, and then when it's ready, it drops that whole thing on disk, sequentially. So what this does is it completely bypasses the SSD's garbage collection and all their fancy algorithms; it says "Here's a big contiguous chunk of new cache data. This is the only way we'll ever operate with you", and they really like that. It will be fast forever if you do that, more or less. A lot of these drives have systems for taking big writes right away, and then dealing with them in the background... So it's very, very optimal.
So then the second thing they had to do was cut down the number of operations that cause writes. So a lot of naive disk caches will say "Okay, you add something to cache, you write to disk; you delete something from cache, and I need to go to disk and say "This is deleted." In Extstore we don't do that. In Extstore if you write an item to memory, it goes to disk, and then you go and you rewrite this data, you update the cache, or you delete it, it just gets rid of that memory pointer. It doesn't put anything else to disk. And so there's algorithms I won't go into for when it decides to reuse space on disk, but it allows me to use a kind of very cache-optimized pattern for if you're overwriting data in memory, or you're filling your disk and you want to evict old data like a normal cache, I'm evicting stuff from disk without ever actually touching it again. And so this really cuts the amount of writing down to just when things are being written, and not anything else.
So you can change the TTL, the remaining time to expiry, for an item that's on disk without touching disk. You can delete it, you can fetch it as many times as you want, you can overwrite it; if you overwrite it, it still doesn't touch it on disk. It just eventually goes to the tail of the -- it'll go back to memory, and eventually, if it goes back to the tail, the cold tail of the memory, it will go back to disk again.
So that was kind of the fundamental setup. It really minimized the amount of times we've talked to the disk, and we really optimized the way we look at it, and then we kind of tried to abstract this away from the user. This is hopefully obviously at this point not a great idea for very small data... So if your values are like 100 bytes, a couple 100 bytes, not going to be the best thing.
[49:57] Some people do this though; there's some fairly large use cases which are using Extstore with items that are 200-byte values. I was expecting like maybe eight kilobytes to half a megabyte or something, but even with very small values - if your keys are 20 bytes, and your value is 250 bytes, you throw an Extstore at it and you can still cut your RAM budget, cut your costs down by three fourths or more, even with these relatively small values. So it worked a bit better than I expected, which is always great when you're writing new features.
Ed Welch: I have a question, Alan, on file size, because that's kind of a relative -- I was saying how Loki has small files, and you're saying Loki has large files... And from the perspective of something like an object storage, we generally target making files that are a couple of megabytes in size... Which are kind of smaller for large-scale data storage. But for cache, even sort of historically, we were very familiar with memcache giving us out of memory errors, because we were largely allocating all the slabs in the 512 KB and up, because all of our objects that were going in were in this kind of one megabyte size... And it sort of made memcache a bit grumpy with us years ago on how that worked... And so this is my question about like -- I guess one is: is there like an ideal size, and/or is there a limit to the size object you would want to try to store this way?
Alan Kasindorf: So Extstore - there's some technical trade-offs. The maximum size of the item has to be at least half the size or less of your write buffer or something, just based on how it handles the data. As far as large objects go, it doesn't really matter otherwise. Years ago, if you were storing a lot of very large items, you did tend to get out of memory errors, if you were writing very aggressively... That algorithm in recent versions of memcached has actually been improved.
So memcached used to get very inefficient when you stored large data. It was kind of designed for much smaller data, in the realm of a couple dozen kilobytes, or down to a kilobyte or less. It was very good at very small data. But once you started putting in half a megabyte or larger, you'd start losing like 10%, 20%, 30% of your memory to this overhead based on its memory management. And a long time ago now I changed that, so when you store a large item, it actually splits that up internally into smaller items. And so you probably saw this as your 512-kilobyte thing, where if you put a two-megabyte object into memcached, it'll actually split that up into four 512-kilobyte objects. And it'll try to be even a little bit more efficient than that. It'll do like two megs and a bit; it'll try to put a little cap on the end from a different slab class, and try to make these larger items a lot more efficient.
And there were some edge cases in this algorithm that were fixed in the last two years, I think. So if you try that now, it'll probably work a little bit better... Extstore should be more or less fine at handling these larger items. I wouldn't put like 20, 30, 40-megabyte items in here, just because it's used to doing stuff on the smaller size. You might run into some edge cases where like "Oh, well the whole thing was waiting for 40 megabytes to pull in from disk all at once", and you got a little bit of a lag bump, or something. But it shouldn't break; it just might be a little bit unhappy.
Ed Welch: I should say that because memcached has been so reliable for us, we didn't upgrade it for a while, and that class of errors represented a small percentage of our write traffic, and then we just kind of tolerated it... And then we have since updated and those have all gone away. So it all works very well now, thank you very much. [laughs]
Alan Kasindorf: Yeah, the not upgrading thing is pretty common. It's pretty hard to convince people that despite doing -- I do usually monthly releases of the software, and it's usually a surprise to people to find out that it's still in development. "It just works, so I didn't know that it was still being worked on." Or...
Ed Welch: [54:04] Yeah, right? I took a different strategy with Loki, and did infrequent updates and major changes, and we fixed a bunch of stuff... But memcached is a good -- it's a good [unintelligible 00:54:13.22] to the product, and that was an interesting experience for us... So we were super-excited to see those changes. And like I said, that class of errors has gone away, and now we're finding that Extstore works really, really well for what we're doing here. There's a part of it I didn't quite talk about earlier, which is one of our goals here wasn't to take all of the traffic away from object storage. Loki is a highly parallelizable or parallel machine, or whatever... And what we were looking to do was take a fraction or some portion of that from object storage. The amount of total data will always be more than we'll probably want to put on disk, but we can take sort of a controllable amount now by changing the size and the number of nodes we run, to be able to say "20%, 25% of our data will come from local disk, the rest from object store, 50%", and that gave us a really nice knob to turn there. And interesting on the TCO side too, in terms of like IOPS costs versus disk costs, and things like that... But it's been super-fantastic for the situation we're in, and having that really nice interface that we're very familiar with, and the cache behavior, and then just being able to put sort of unlimited space behind it compared to what you would think of with memory. So... I'm very, very excited. If anybody has not seen/checked out memcached in the last few years, definitely update; definitely do it.
Mat Ryer: And is that SSD capability available for everybody? Was this something that --
Alan Kasindorf: Yeah, you can. If you go to pretty much any distribution of any Unix-based operating system and install from the standard package, it should be new enough to have Extstore now. If you're kind of living on the edge with large objects, you might want to make sure that you're on a more recent version, something from the last 12 months... But it's been in there for quite a while. Netflix was an early adopter; they talked about it somewhere. They put a lot of data in Extstore.
Danny Kopping: It's kind of like the industry's best kept secret, in my opinion... And Ed touched on this as well. The ability for us to completely change the way that we back our cache with a much cheaper medium, but keep the same interface into how we access that cache is such a game-changer for us. We didn't have to change anything in the application code; all we had to do was just tell memcached where to put items that were too large to fit into memory, and that was it. It was really, really easy. I think Ed and I took - what is it, two weeks? ...between testing in our pre-prod environment and taking it to full production... And that kind of iteration speed is really unheard of when making such a large architectural change to your application.
So Extstore - it really surprised us with how effective it was, even with disks behind the scenes. We really don't feel that when using it. And I should also throw some real numbers around here, so folks can understand the scale that we're talking about... So in the before times, in this large environment that we're running, we had about 200 memcached instances all running in a cluster. Each one was scaled to, I think, six gigs of RAM, and each instance had one CPU associated. And so that gave us about 1.2 terabytes of RAM cumulatively, which could store about three to four hours' worth of data in our time-series database... Because we've got data coming in at a very, very high frequency, and observability data is heavily skewed towards more recent data.
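Danny's pre-Extstore numbers are easy to check with some quick arithmetic; the sketch below also derives a rough fill rate, which is an assumption-laden estimate rather than a quoted figure.

```go
package main

import "fmt"

func main() {
	// Pre-Extstore setup as described: ~200 memcached instances, ~6 GB RAM each.
	const instances = 200
	const ramPerInstanceGB = 6.0

	totalGB := instances * ramPerInstanceGB
	fmt.Printf("cumulative cache: %.1f TB\n", totalGB/1000) // ~1.2 TB

	// That reportedly held roughly 3-4 hours of data; the implied churn rate
	// (an assumed calculation, not a quoted figure) is:
	const hoursHeld = 3.5
	fmt.Printf("implied churn: ~%.0f MB/s of new cache entries\n",
		totalGB*1000/(hoursHeld*3600))
}
```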
[57:56] So this was working out quite well for us for a while, but then as our workloads started changing, and we started seeing customers frequently accessing data over large periods of time, no longer just a few hours, but now a few days, to a week or two, we started churning through those items in the cache very, very quickly. So we couldn't actually read the items fast enough out of cache before we were putting new items in and replacing them.
And we mentioned Least Recently Used previously; that's the algorithm that memcached uses to replace items. So the items that were least recently accessed would get evicted, and the hotter items, the ones accessed most recently, would get retained.
In our case, there isn't any real sort of hotness to the data. It's sort of a sliding window over time. So if you imagine like a news website, a news article will become available, and a lot of people will read that news article, and then over time old news is irrelevant. Compared to a streaming service, where the same item is going to get accessed again and again.
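For anyone who hasn't bumped into LRU before, here's a minimal sketch in Go of the behavior Danny describes: touching an item keeps it alive, and the least recently touched item is the one that gets evicted. This is a toy illustration, not code from memcached (whose real LRU is segmented and considerably more sophisticated) or Loki.

```go
package main

import (
	"container/list"
	"fmt"
)

// lru is a toy least-recently-used cache: every Get or Set moves the key to
// the front of a list, and when the cache is full the item at the back
// (the least recently used one) is evicted.
type lru struct {
	cap   int
	order *list.List               // front = most recent, back = least recent
	items map[string]*list.Element // key -> element in order
}

type entry struct{ key, value string }

func newLRU(capacity int) *lru {
	return &lru{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
}

func (c *lru) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el) // touching an item makes it "hot" again
	return el.Value.(*entry).value, true
}

func (c *lru) Set(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.cap {
		oldest := c.order.Back() // least recently used
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key, value})
}

func main() {
	c := newLRU(2)
	c.Set("a", "1")
	c.Set("b", "2")
	c.Get("a")      // "a" is now the most recently used
	c.Set("c", "3") // evicts "b", the least recently used
	_, ok := c.Get("b")
	fmt.Println("b still cached?", ok) // false
}
```

With a sliding-window workload like the one Danny describes, almost nothing gets re-touched, so eviction order ends up being roughly insertion order - which is why capacity, not hotness, was the limiting factor.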
So our cache of 1.2 terabytes was not exactly big enough for us, and we were able to scale from that to 50 terabytes of cache with relative ease, using Extstore. And we used another little-known feature of public clouds, local SSDs, which we should chat about a little bit as well. So if folks are familiar with cloud-based storage options, they're usually thinking about things like EBS or GCS, where you can store a large amount of data somewhere that's network-attached. But the problem with network-attached storage is that there's this inherent latency involved. You're writing to an actual disk somewhere out there in the universe, there's replication happening, there are all of these durability concerns being taken into account... And we didn't want any of that for our caches. We just wanted something that was as close as possible for our cache use case. And Alan mentioned it earlier - caching is all about bringing the data as close as possible to where it's used; that locality is really, really important.
And so in the public clouds, GCP, AWS, Azure, they all offer machines, VMs that have SSDs physically attached. And those are the ones that you should be using with Extstore, rather than just using a normal network-attached device, because you'll probably have a bad time with that.
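As a rough sketch of what "just tell memcached where to put items" looks like, here's a hypothetical launch wrapper in Go. The `ext_path` option is memcached's documented Extstore flag; the memory size, item size limit, mount point and disk size below are made-up examples, not Grafana's production settings.

```go
package main

import (
	"log"
	"os/exec"
)

// Launch memcached with Extstore backed by a file on a *local* SSD.
// The flag values here are illustrative assumptions.
func main() {
	cmd := exec.Command("memcached",
		"-m", "6144", // keep ~6 GB of RAM for hot items and Extstore's in-memory index
		"-I", "8m", // allow larger individual items
		"-o", "ext_path=/mnt/localssd/extstore:1500G", // Extstore data file on the local SSD, not a network-attached disk
	)
	cmd.Stdout, cmd.Stderr = log.Writer(), log.Writer()
	if err := cmd.Run(); err != nil {
		log.Fatalf("memcached exited: %v", err)
	}
}
```

The application side keeps speaking plain memcached protocol; only the server's startup flags change, which is why no application code had to be touched.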
Ed Welch: To put a pin in the experiments that Danny summarized a little there - it is worth noting that we went through several rounds of trying to find the right machine size to match with the right amount of storage, because there's a trade-off: you could put 10 disks on one machine, but you'll likely then have CPU or network constraints. Ultimately, I don't remember which size we ended up on, and it may be a little bit different per provider, but it comes down to balancing machine size and horizontally distributing your load. Those local SSDs are both cheaper and faster, and you can find yourself a nice configuration to get orders of magnitude more storage, at speeds that are fast enough when you're trying to keep pace with what object storage is doing, not what your L1/L2/L3 cache levels are doing... And it worked out extremely well for us. I'm super-excited with where we ended up here... So thank you, Alan, for all your hard work. Memcached is amazing.
Mat Ryer: Great. So it was a lot of trial and error. Trying it on -- just running experiments, and stuff, on the different sizes, to see what worked...
Ed Welch: That worked out well for us, because we have an infrastructure that lets us do that, so we would just go experiment on our bigger ops clusters to see what would be the right trade-off on machines and disks. Actually, that's my view of it. I know Danny put a lot more work into picking those machine types, but...
Danny Kopping: [01:02:05.26] Yeah, we also had a lot of help from Alan. I reached out to him through the -- what is it? Through the Discord... And he was extremely gracious with his time in helping us with this particular problem. And a number of very interesting constraints started to emerge when we started implementing this solution. The one thing that I didn't immediately consider when we started was network throughput. We actually had a cache that was so fast that we were maxing out the network on the nodes that we were running the cache on. And so the configuration that we've settled on now is about one and a half terabytes per node, and that's got four SSDs attached to it, I think, in GCP. So we can dive into the details if we want, but basically, when you're dealing with a cache of this size, you need to take into account: is there enough CPU to physically go and fetch the data off these disks and handle all of the incoming traffic? And can I push that data onto the network fast enough in order to consume it without introducing any sort of contention?
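Here's a quick back-of-the-envelope version of that network constraint in Go. The per-node cache size comes from Danny's description; the NIC bandwidth is an assumed figure that will differ by instance type.

```go
package main

import "fmt"

func main() {
	// Rough, partly assumed numbers: a big Extstore node can often serve data
	// faster than its NIC can carry it, so the network becomes the bottleneck.
	const cachePerNodeTB = 1.5 // roughly what was settled on per node
	const nicGbps = 16.0       // assumed NIC bandwidth; varies by instance type

	nicBytesPerSec := nicGbps / 8 * 1e9 // bits/s -> bytes/s
	secondsToDrain := cachePerNodeTB * 1e12 / nicBytesPerSec

	fmt.Printf("streaming the whole %.1f TB cache at %.0f Gbit/s takes ~%.0f minutes\n",
		cachePerNodeTB, nicGbps, secondsToDrain/60)
	fmt.Printf("so sustained reads above ~%.1f GB/s saturate the NIC before the SSDs do\n",
		nicBytesPerSec/1e9)
}
```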
All of the tools that we have here at Grafana Labs, the observability that we have internally, plus memcached's own stats information, is super, super-useful here, because we can understand the dynamics of what's happening inside the system. I'd actually love to ask Alan how that stats -- what would you call it, the stats API, I guess? ...how that came about. Because this was exposing application telemetry before it was cool.
Alan Kasindorf: I can't take credit for the memcached stats; those have always been there. I have added a lot of stats over the years. If you compare older versions of memcached, especially down to the 1.2 series, when it was first getting really popular, there were only a handful of stats, and now there's kind of a lot. And they are very thoroughly documented in doc/protocol.txt in the repository, or any tarball... So if you look at any stat from the stats output and you wonder what it means, it's all documented there and ready to go.
A lot of it probably doesn't mean much to end users; a lot of it is for me, kind of unfortunately -- if people run into issues and come in, it's "Well, give me the stats output." That's usually my only window into people's environments: I can ask you for maybe two copies of your stats output 30 seconds apart, and I need to understand from that exactly what your workload looks like. And the stats are kind of designed in such a way that if I look at your stats from maybe a couple of instances, I can pretty much tell you what kind of company you are, what you're doing, how you're using this, in a lot of detail. And then I can educate the user: "Here are the things that you really need to care about; take care of this, and take care of that." And that's usually what we need to do - just look at the stats. I mean, when we were first talking, you said "I have this problem", and it was "Okay, well, stats output, stats output, stats output", and eventually you lead in with the stats output. But it's been really good to have users only need to run this one command, really, and kind of understand things.
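As a small illustration of working with that stats output, here's a sketch in Go that connects to a local memcached, issues the plain-text `stats` command, and derives a hit ratio from `get_hits` and `get_misses` (both documented in doc/protocol.txt). The host and port are assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"strconv"
	"strings"
)

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:11211") // assumed address of a local memcached
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	fmt.Fprintf(conn, "stats\r\n")

	stats := map[string]string{}
	sc := bufio.NewScanner(conn)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "END" { // stats output is terminated by END
			break
		}
		// Lines look like: "STAT get_hits 12345"
		fields := strings.Fields(line)
		if len(fields) == 3 && fields[0] == "STAT" {
			stats[fields[1]] = fields[2]
		}
	}

	hits, _ := strconv.ParseFloat(stats["get_hits"], 64)
	misses, _ := strconv.ParseFloat(stats["get_misses"], 64)
	if hits+misses > 0 {
		fmt.Printf("hit ratio: %.2f%% (evictions so far: %s)\n",
			hits/(hits+misses)*100, stats["evictions"])
	}
}
```

Taking two snapshots some seconds apart and diffing the counters, as Alan describes, gives rates rather than lifetime totals, which is usually what you actually want to watch.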
There are also log streams that people don't usually know about. This is the watch command; you can telnet to a memcached instance and type "watch evictions", and suddenly you get a long stream of all your objects that are being evicted, and you can do analysis on that and say "Well, these are being evicted, and they were never touched - what's this and that?" And there are a couple of streams like that which are kind of interesting... But most people only need to look at the stats from a high level.
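And a similar sketch for the log stream Alan mentions: the `watch evictions` command is part of the same plain-text protocol, though the exact fields in each streamed line vary by version, so treat the output format as illustrative. Host and port are again assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
)

// Tail memcached's eviction log stream: after "watch evictions" the server
// replies and then streams one line per evicted item (including the key and
// whether it was ever fetched), which you can analyze offline.
func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:11211") // assumed address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	fmt.Fprintf(conn, "watch evictions\r\n")

	sc := bufio.NewScanner(conn)
	for sc.Scan() {
		fmt.Println(sc.Text())
	}
}
```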
Mat Ryer: But that's interesting - there's a theme there of measuring: firstly, to know whether you're going to need a cache before you make that decision in the first place... And then once you have a cache, to sort of keep an eye on it, see what's happening, see what it's doing, to make sure it's behaving.
Well, unfortunately, that is all the time we have today. We could talk about caching forever, I think. We talked about -- we've figured out what it was, which was very nice to learn... We talked about why we do it; you know, it makes things fast, and it makes things cheaper... And for something that has such a simple API, gets, sets, maybe deletes, expiry, the complexity that you can end up building with caching, or the complexity that's inevitable is quite a surprise, and it probably sneaks up on you. Great insights from the panel there on "Start with nothing." It's interesting when the experts say "No, don't use this. You probably don't need to use it yet." And so it was great to also dig into that specific use case a little bit. Thanks so much to my guests, Ed, Danny and Alan. Thank you very much. We'll see you next time on Grafana's Big Tent.