The Evolution of OpenTelemetry with Co-founder Ted Young
**Mat Ryer:** \[00:00\] Ted, did you write the book with someone else? Who is it?
**Ted Young:** Austin, Austin Parker.
**Mat Ryer:** How do you write a book with another person?
**Ted Young:** So the way we wrote that one together was we just came up with the outline, and then we were basically like, evens are mine, odds are yours. Then you trade off the rough drafts.
**Mat Ryer:** And then you break it. Yeah, yeah, pages, words, yeah, every other word.
**Ted Young:** Yeah, it's like pair programming. Yeah, that's how we did it. So we each wrote a rough draft of like every other chapter and then handed them to each other and edited each other's chapters. And then we're like pretty much done.
**Mat Ryer:** Nice, who published that?
**Ted Young:** O'Reilly.
**Mat Ryer:** It's very nice that you've got an official Chinese one because I have a book. And it was translated unofficially into Chinese. It's just kind of stolen. But I sort of don't mind it. Really, it's sort of open source, I guess.
**Ted Young:** Sure, as long as people are reading it.
**Mat Ryer:** Yeah. But it has been officially translated into a number of languages. Two I know of are Chinese and Russian, where I think, maybe because it's going through a third-party distributor, they pinged me about getting a copy of it.
**Ted Young:** Is there like a British English version where you spell the color with the U and things?
**Mat Ryer:** Yeah, yes, definitely, definitely. A Canadian one, eh?
**Ted Young:** Yeah, Canadian one's more polite.
**Mat Ryer:** So that's very cool, the book. How did you get involved with O'Reilly?
**Ted Young:** I'm one of the co-founders of the project. So actually OpenTelemetry was originally formed by merging two projects, OpenTracing and OpenCensus. And I'm from the OpenTracing side of the family.
But even OpenTracing and OpenCensus are two versions of the same set of technologies coming out of Google. Namely Dapper, a tracing system designed by Ben Sigelman. And there was a metric system called Monarch, and another metric system called Borgmon. Prometheus is based off of Borgmon, and a lot of the metric stuff in OpenTelemetry is based off of Monarch.
And internally within Google, there was this total Borgmon versus Monarch slap fight. And then unfortunately, people who were really invested in that slap fight have continued to perpetuate it, I would argue. This is the one conspiracy theory you'll get out of me. We said no conspiracy theories, but: the Prometheus and OpenTelemetry shade is mostly powered by an internal Google slap fight over Borgmon versus Monarch. That's my conspiracy theory for the day. It's actually mostly internal Google nonsense that's filtered out into reality. And there's plenty of people who would disagree with me and say no, the problem is you're evil or something. But not really.
But really, that is actually two different approaches within Google that turned into two different technologies. And it's like the same bikeshedding nonsense that we're kind of stuck with. But it all came out of Google. So everyone should just get along at this point, in my opinion.
**Mat Ryer:** So that's very cool. So for anyone that hasn't seen this before then, what is OTel? We did do an episode on this, didn't we, Matt? Last season.
**Matt Toback:** We did. We did. Yeah.
**Mat Ryer:** Who was in that?
**Matt Toback:** That was Daniel Gomez Blanco and Dressy.
**Mat Ryer:** So go listen to that.
But for anyone that doesn't know, just briefly Ted. What is OTel?
**Ted Young:** Yeah. So OpenTelemetry is a telemetry system. So you could think about this as a new way of dividing up the observability stack. We used to divide the stack up by signal. So you would take one style of observability, like tracing or logging or metrics or profiling. And then you'd be like, I'm going to make a logging system. So I will make a logging format. And then I will make a logging API and a logging client to generate those logs and something to process them. And then I'll make a database to store those logs and then like a UI for looking at the logs in that database. And that's my logging system. And then someone comes along and it's like, now I want a metric system. And they redo all of that work, but just for that metric system.
The new kind of way of dividing it up is to say there's telemetry, there's applications and systems and computer resources. And they're all self-describing. They're all just describing what they are doing. And when they're describing what they're doing, they are all trying to speak the same language. Because when you're trying to understand what they're doing, you're never looking at any of these pieces in isolation. You're looking at them together. So if they can all speak the same language to describe what they're doing, then it'll be a lot easier to form a comprehensive picture.
And since we're talking about describing what computers are doing, and most of what computers are doing is standardized, right? Like talking over network protocols and things like that, it stands to reason that it should be possible to standardize the language that systems are using to describe themselves. If you wanted to do that, you would want to decouple it from things like analysis and storage and user experience stuff. All of that is greenfield.
You want to keep on developing that stuff. But the actual data, you maybe would want to look at standardizing. And so that's kind of the new way of slicing up observability is to say we have telemetry, which we want to standardize, and we all want to share. And then we have analysis. And analysis, we all want to compete.
Now more than ever, with machine learning and AI really taking off, it feels like open season. No one has the perfect paradigm for taking all of this data together into one system and doing some really interesting correlation analysis and stuff on it. There's a lot of greenfield there. And the fact that people don't have to change telemetry systems, they don't have to dump a huge pile of instrumentation and then re-add a whole new pile of instrumentation. They don't have to do any work to switch analysis tools. I think it's actually accelerating the rate of invention in the industry, because the cost of moving or adding on a second or a third tool is so much lower now, people are more adventurous looking for new tools.
**Matt Toback:** So how would you say that shows up? So what are you seeing in teams that are the more adventurous teams?
**Ted Young:** Yeah. So one thing I'm seeing showing up more than anything is people actually adopting distributed tracing, and in a way replacing logging with distributed tracing. Traditionally, logging and distributed tracing get talked about like they're these two totally separate things, but that's really a kind of human historical accident. It's why I really don't like the three pillars metaphor. We're like, there's the three pillars of observability: tracing, metrics and logs. And I'm like, well, one, no one ever used tracing, because it's too hard to install. So outside of a couple of big companies, no one used it. But we made all this merch with pillars; you can't just throw the merch away, right? It also sounds like we designed it this way. Like this was a great design, to have these data streams completely siloed from each other, with no way to cross-correlate between looking at the events, the individual events happening in my system, versus looking at aggregates of events, aka metrics. You'd think you'd want to integrate those two experiences.
And so now we're doing that as an industry, where the data is integrated. And we're discovering that, wow, tracing is just logging with the context that you always wanted for your logs. Like, what transaction are these logs a part of? It's crazy to me how many decades went by without us being able to just look up the other logs in the same transaction. It's kind of wild that you can't do that.
**Mat Ryer:** So you're like the guy at the party who comes in with the more existential questions, like: what if logs and events were really just traces? And everyone's like, go home.
**Ted Young:** It's kind of the opposite. I hate being existential about this stuff. I feel like I struggle uphill against ideology, because people have already pre-bucketed what these tools are useful for. And I think that ends up kind of blinding people. In a lot of people's heads, tracing is a latency analysis tool. And it's really expensive, which means you have to do a lot of sampling. And because you have to do a lot of sampling with tracing up front, it's really not useful for firefighting or root cause analysis, because you just don't have the data, unlike in your logging system where you keep everything.
And we're like, the only reason tracing is heavily sampled is because people were trying to add it on top of a really expensive logging system that stored everything, but had no trace context. But they couldn't touch that system and add trace context to it. So they built like a second logging system called tracing. And then since that thing had only a tiny amount of resources left over from the logging system, it was like heavily sampled to do some latency analysis. So it's like, that's just like a historical accident that we built it that way. That's not because tracing and logging are like really like different from each other.
**Matt Toback:** Do you think that if the fundamental architecture was different that distributed tracing would have gotten traction sooner?
**Ted Young:** I think the hardest thing with distributed tracing is installing it because the context that you're getting from it, the really valuable context is the execution flow that you're following. But language runtimes don't really give you a way to follow that execution flow effectively from an observability standpoint. Like if you just rely on thread-locals in languages that have threads, you might get some of it. But work often switches railroad tracks, right? Like work will move from one thread to another or a set of threads will get organized into some kind of like scatter gather thing or something like that. Or an entire like user land, co-routine asynchronous system will get laid on top of that.
And so actually tracking that flow of execution, that's the trickiest part of OpenTelemetry in every language is this context propagation mechanism that we have to come up with. And then every single piece of instrumentation has to use the same cross cutting mechanism. That's harder to roll out and get value out of than logging or metrics where I can just roll out metrics in this corner just for me just to monitor this one thing and I get value.
If we're saying the value of tracing is this distributed context, that means you have to roll tracing out across all of those services before you get that value. So that was kind of the problem. When you combine the amount of work with people thinking, I'm going to have to rip all this out and replace it if I ever switch vendors, it's just kind of dead on arrival outside of an organization like Google or Microsoft or Xerox PARC, where there's this enormous internal engineering culture that can make it worthwhile to go ahead and do that. So that was the real blocker.
What I see as the future is OpenTelemetry today, but eventually observability gets so into people's minds that future languages bake it in. I'm hoping they bake in OpenTelemetry, or at least something really compatible with OpenTelemetry. I'm nervous, because language designers are contrarians. The whole reason you make a new language is to make everything different from all the other languages. So I'm a little worried that when it comes to observability, they're going to be like, yeah, but we're going to do it different, and then we're going to be like, but now your stuff doesn't compose with the rest of the distributed system.
But anyways, it's a problem I would like to have. I would like language designers to be thinking about observability to the point that they're actually providing these like context propagation mechanisms.
**Mat Ryer:** Do we want to do a quick side where we rate the most contrarian language folks? Or we'll leave that off. OK.
Matt, we were curious about something. You were about to ask a really good question. Do you want to go ahead?
**Matt Toback:** Was I?
**Mat Ryer:** It was the one that we practiced earlier.
**Matt Toback:** OK. What was the words that I could say? What's this question?
**Mat Ryer:** All right, when I do, I do. It's just like a surprise birthday party. What's going on? It feels like it.
**Matt Toback:** \[14:13\] So last year we talked about OTel in 2024. How quickly is everything moving? Open source projects, particularly ones driven by foundations, which we were talking about earlier... the clip is fast. So what's different? If you haven't been keeping track, what wouldn't you recognize this year from last year?
**Ted Young:** Yeah. OpenTelemetry is this curious creature. I consider OpenTelemetry's unofficial mascot to be the racing snail from The NeverEnding Story, if you remember that thing. Because OpenTelemetry is high latency, but high throughput, which means when you go and engage with it on any one particular thing, you're like, oh my god, this is so slow, designed by committee. It can really feel that way, because unlike a lot of other open source projects, OpenTelemetry doesn't really get any takebacks.
Generally speaking, if we put it out there and it gets adopted and we break it, people hate us forever. Much more strongly than with most projects, where if it's new and it breaks, people are just like, well, it's new open source.
**Mat Ryer:** Wait, no way. Doesn't everyone feel like that? If someone loves it and you take it away?
**Ted Young:** No, no, no. I feel like, more so than other projects, OpenTelemetry gets punished hard for breaking backwards compatibility. At certain layers, we have to be very strict about it: the API layer in particular, and the data layer. If we create any kind of dependency conflict, where this library depends on OpenTelemetry 1.0 and this other library depends on OpenTelemetry 2.0, now these libraries won't compose because of OpenTelemetry. We are so dead if we ever create a situation like that for ourselves. So we really have to care about it.
**Matt Toback:** Which probably speaks to the way that people are using it in production, or it's the relying on it.
**Ted Young:** It's a crosscutting concern, right? There's certain aspects of OpenTelemetry around propagating context, where it only works if everyone uses literally the same thing. And so that part is very, very temperamental and very sensitive to compatibility issues. So we just have to care a lot.
Anyways, because we have to care a lot about the first major version, we do a lot of work to make sure we've gathered all the requirements and everything like that. And that can feel slow. So people will be like, OTel is so slow. But then we're very, very high throughput. There's so many people working on it. There's so many open initiatives at any given moment that we then also hear from the vendors that we all work at, people being like, oh my god, another change? Like, where did this come from? Like, you'll have these like kind of incoherent conversations where people are like, we can't rely on OTel to deliver stuff because it's too slow. Also, you're delivering like too many things and we can't keep up. Like, because it's both. It's like high throughput, high latency. And once you get that about OTel, then like a lot of it makes sense.
**Matt Toback:** So how much have the goals shifted for you? From the beginning, when you put those projects together, where does it feel like you're at now, as far as how you expected the project to evolve?
**Ted Young:** Yeah, that's a great question. I would say it's less that the goals have shifted so much as we got through the initial set of goals. Like the initial set of goals were tracing metrics and logs integrated together into a single system and completely ruthlessly normalized and organized so that the data is very, very uniform. That was the initial goal. And finally, with logs this year we're complete.
The kind of last little cherry on top was this sort of epic logs-versus-events bikeshed nonsense, which just seems like a rite of passage you have to go through when you're building a logging system. Or an event system.
**Mat Ryer:** Or an event system. Or a system that logs events, you know, or the system that creates events every time there's a log. Or a shed that houses bikes or a bike.
**Ted Young:** The amount of feelings people had about whether their data structure was called a log or an event... I wish we could harness that energy and put it towards something productive. That's all I can say.
**Matt Toback:** So if you distill it down though, what ended up being valuable from the decisions that were made? Because like I got the bikeshedding, I understand all the lead up, but eventually that yields something. So what did it yield?
**Ted Young:** Yeah, well, I think there's something really interesting with OpenTelemetry: every time we tackle one of these signals, as we call them, we're thinking of it as just another data signal. But this is a whole industry, a whole ecosystem, a whole set of practitioners who fixate on this one data source, right? So you're actually bringing in a whole community when you do that. And that community has a lot of opinions, and they're not used to leveraging the other data sources available. So there's this aspect of everyone having to learn about each other's observability practices, so that we can integrate and create an integrated data experience.
So for example, something in tandem with this is browser and client observability. That's something we care a lot about, but we had to delay going after it because we realized we were missing some fundamental pieces. One of them was that client code is way more event-focused than it is transaction-focused.
Servers are very transaction-focused. It's all of these totally independent transactions happening at the same time. Each transaction is a completely separate user; you don't want these transactions interacting with each other, but they do, because they're competing for resources. And that's the source of all trouble when it comes to server-side observability: independent transactions competing for resources is mostly what we're dealing with.
Then you go to the client side and it's all one user. Everything going on on this machine is related to one user and one flow of consciousness. And even the way that's constructed in terms of code architecture is more of an event-reactor pattern than the transactional context-switching pattern you see in servers. And we just were not ready for that, because the last thing we were tackling was logs, right? But this is a very event-forward world.
It's also a world where we assume the resources available to a computer program don't change, because on a server they don't. If you're going to change the resources available to a program on a server, you usually turn the machine off before adding more RAM or something. But on the client side, especially mobile devices, the resources are changing all the time, right? The app might be backgrounded and now it's in background mode, a totally different set of resources available. The network stack might switch out on it, right? Wi-Fi versus cell. And you need to know about that, or your observability data is gonna get really muddy, right?
**Matt Toback:** Right, you wanna do apples-to-apples comparisons, which means you have to actually track these moving targets. Being like, it's fast here but slow here, and this is Wi-Fi and this is a bad cellular connection. You need to be able to differentiate all of that if you want clean data.
**Ted Young:** And we just weren't ready for that. So that was a place where we had to go back to the beginning and get everything done. And now we're finally, finally done with that, with logging. But it was a bigger push than I thought it was gonna be, because on the surface, logging seems so simple. It's actually the simplest of all the data structures we've tackled. But weirdly, that made it the most difficult to deal with, because it's just bags of dictionaries. It's dictionaries of dictionaries of dictionaries, and why put anything anywhere when you could put everything everywhere, because it's just dictionaries. So that weirdly made it harder to figure out what our model was going to be, if we want our model to be highly structured and predictable.
**Matt Toback:** Did it start to resemble something that had been like created a million times over or is it something new?
**Ted Young:** It feels new in the sense of like logging traditionally has been very, very, very unstructured, right? Like logging traditionally was just like a string that you splatted out somewhere. And now we're trying to.
**Mat Ryer:** That's how I'm doing it still.
**Ted Young:** Yeah, it's like got here. This code should never execute. Like whatever the log message is.
**Mat Ryer:** Usually it's just monkey, it's just the word monkey. It's a sneaky way to see where it shows up in the other logs.
**Ted Young:** I used rabbit, follow the rabbit.
**Mat Ryer:** It was always good. Yeah, yeah, I'll use that now, that's better.
**Ted Young:** Right, but now we want something very, very structured because we don't want to be making random logs. What we want to be doing is like having a model for how we describe this library or this data service or whatever the heck it is. And we have a model and we wrote it down as what we call semantic conventions, right? So we're like, these are the events that this system will emit. Here's how you should be using these events to create like alerts and dashboards and things.
Like we want to be very structured. And then we want these events to be organized, right? Like we want them all to have like a type, you know, which we call the event name. And we want it to be like, if you see one event with this name, you know 100% of the other events that have this name will have the exact same structure. They have the same attributes and the values of those attributes will be the same type of data. It will always be a number, it will always be a string.
Right, so getting that level of rigor into your logs is what arguably makes them more like events. So that's where we landed: we have a logging system, and if you name these things, then they're events. If they have a name and a classification, and you're doing all the work to be super organized, then you can call it an event. And if you're not being super organized, if you don't have a rigorous system, then it's a log, then it's just some..., and you know it's one of those because it doesn't have an event name. It's just like, oh, this is just some... from, you know, the person's application logs from 10 years ago.
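As a sketch of that log-versus-event distinction, again with only Go's standard library: the `event.name` key mirrors OpenTelemetry's convention, but the event name, attributes, and constructor here are invented for illustration.

```go
package main

import (
	"log/slog"
	"os"
)

// pageLoadEvent builds the attributes for a hypothetical
// "browser.page_load" event. Funneling every emission through one
// constructor is what gives you the rigor described above: every
// record with this name has the same attributes, and each attribute
// always has the same type.
func pageLoadEvent(loadMS int64, url string) []any {
	return []any{
		"event.name", "browser.page_load",
		"load_time_ms", loadMS, // always a number
		"page.url", url, // always a string
	}
}

func main() {
	log := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// An event: named, classified, rigorously structured.
	log.Info("page loaded", pageLoadEvent(412, "/checkout")...)

	// A log: no event name, no schema promise. Just a string
	// splatted out somewhere.
	log.Info("got here, this code should never execute")
}
```

Both records travel through the same pipeline; the event name is the only thing that marks the first one as the organized kind.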
**Mat Ryer:** Well, now kids can't listen to this things, Ted. You just love to cut that one.
**Ted Young:** Well, I just think you've probably held back a generation from adopting OTel.
**Mat Ryer:** Well, you've just done that, Ted, with that one word.
But hold on. Before you answer, is there a third? So there's logs, events... is there a third pillar?
**Ted Young:** Well, these would both be the same pillar.
**Mat Ryer:** No, okay, fine. I'm just going back to the whole creating pillars that were unnecessary. So now we need pillars.
**Ted Young:** But it's... so, you know what people cared about? What people cared about was how important their thing was. People really didn't like the idea of calling it logs, because their events were so very, very, very important to them. If they were coming from a domain where this is the primary observability tool, like browsers or whatever, then they're events, and they're events because we care. And then in domains where your average log has negative value, because it's useless and it costs money, then we call them logs, to indicate that the value is super low. And it's not that they're even structured differently. It's just: did we care about them or not? If we cared about them, that community wants to call them events. And they don't want to mix their stuff in with the logs, because that sounds like you mixed up my precious data with stuff that doesn't have a lot of value.
**Mat Ryer:** With something that you flushed down the toilet.
**Ted Young:** Right? Whereas an event is a big thing you go to, yeah, this works.
**Mat Ryer:** Right? And we're just for, nobody gets dressed up for a log.
**Ted Young:** And we're just like, but could you look at the data structure? Like, our event names: okay, is everyone cool with names? And everyone's like, yeah, we're cool with names. We just want to argue about: what if you have a log line that has no event name, it's just a log line, but it just says the word event. What's that?
**Mat Ryer:** That you logged the word event, it's okay.
**Matt Toback:** Does that... Ted, this is not a question. So then, all right, this is how I'm following along as the dummy in the room. This idea of the three pillars, it was sort of a false setup, in a way. Almost like you started shipping your org chart: you tried to put things into these distinct teams, these distinct systems, and make it work. And OTel is kind of pushing past that a little bit. And it does sound like the log/event conversation then created this concept of... it's almost like a value conversation.
So then does that value conversation get applied to things beyond logs and events, meaning would you have value discussions around metrics? Would you have it around tracing? Is that a new way of thinking about all these things?
**Ted Young:** You know, in a certain way, it is. The part that we haven't thought about too much is how people use this data. In a sense, we do think about it, right? We do care about the cardinality of some of these attributes. We know that spans and logs are going to get turned into metrics in some way. But because we don't have a backend, and we don't have an OTel UI, I think there's been less of a focus, when defining these semantic conventions, on: are we also defining a dashboard, a set of alerts, a set of playbooks?
I think it's like very feasible to be like, here's like the default observability stack for HTTP stuff, right? That's the OTel approach. It's not logs or metrics. It's like HTTP and networking information. How should you set up all of that? Database queries. How should you set up all of that? Infrastructure stuff, you know, Kubernetes. How should you have all of that set up?
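To make that concrete, here is a sketch of what "the OTel approach" to one HTTP request looks like at the data layer. The attribute keys are OpenTelemetry's stable HTTP semantic-convention names; the surrounding helper and the alert check are illustrative, not the OTel SDK.

```go
package main

import "fmt"

// httpRequestAttrs returns attributes for one server-side HTTP
// request, named per OpenTelemetry's HTTP semantic conventions.
// Because every instrumented service uses these exact keys, a single
// standard dashboard or alert set can be defined on top of them.
func httpRequestAttrs(method, route string, status int) map[string]any {
	return map[string]any{
		"http.request.method":       method,
		"http.route":                route,
		"http.response.status_code": status,
	}
}

func main() {
	attrs := httpRequestAttrs("GET", "/checkout", 500)

	// A shared convention implies a shared playbook, e.g. an alert
	// on "5xx rate by route" that works for any service.
	if code, ok := attrs["http.response.status_code"].(int); ok && code >= 500 {
		fmt.Println("would increment error counter for", attrs["http.route"])
	}
}
```

The point of the sketch is the direction Ted describes: once the attribute names are standardized, the dashboards and alerts built on them become candidates for standardization too.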
I would like OpenTelemetry to kind of go back in and start thinking more about the dashboarding kind of playbook aspect of what to do with this data. I think there's an opportunity to not totally standardize it, but at least make sure there's at least one standard use case.
And again, like the browser is a good example of this. Like we're on the browser right now. We're trying to define browser observability. So the first thing is like, well, what is the browser doing, right? The browser itself has a bunch of events that it fires when it's loading a page
**Ted Young:** \[30:00\] And you want to record all of that so you can replay that page load. But then it was like, we also care about network payload size, for example, right? Because on mobile networks, we want the data to be cheap. So that instantly raised the question: which of these attributes are we using for what? Because in order for something to go into this precious payload, where we want to be very careful about the payload size, every attribute needs to buy its passage by explaining how it's going to be used on the backend. And that felt like a good step forward. I'm like, this is a great practice I would like to export to the other SIGs: there being at least one dashboard or one kind of experience that we write down as a justification for adding this attribute the way that we're adding it.
**Mat Ryer:** Yeah, I love that. It reminds me a lot of how the Go language is also managed. They also have the backwards compatibility promise of V1. And I don't know if we want to get into languages, which one's the best one... Go people are contentious. I don't know if they all know.
**Ted Young:** I wish you hadn't said that boy.
**Matt Toback:** Something like that.
**Ted Young:** Imagine I did a good thing. Go's terrible, Go's our absolute problem child. And I say that as someone who has managed a lot of teams writing Go and has written a lot of Go. There's some aspects of Go that I really like.
**Mat Ryer:** Is this genuine?
**Ted Young:** This is genuine.
**Mat Ryer:** Oh, okay, all right, all right. I don't know if you're just giving them a run.
**Ted Young:** Yeah, there's some aspects of Go that I really like on a personal level. I really like that it's focused on readability, right? Like they're very restrained about the kind of, quote, magic that you're allowed because they really care about readability. You should be able to dive into some codebase you don't know anything about and read the Go code and understand what the system's doing.
And I think that's fabulous, A+. But there's a bunch of aspects where Go feels very slapped together, from the perspective of people trying to manage cross-cutting concerns like observability, where context propagation in Go is the worst, by far, compared to other languages.
**Mat Ryer:** Because it's explicit.
**Ted Young:** Because it's explicit, and it's explicit in the language runtime, at the level of abstraction of the language runtime. Which means if we try to auto-instrument, if we try to hook in from a low level using tools like eBPF or LD_PRELOAD to automatically install all this stuff, it doesn't work in Go. It's dangerous to do it in Go, because there's just so much machinery sitting between where those tools hook in and where the actual user-land stuff is happening. It's a dangerous proposition. So we have these tools that we're trying to leverage right now to really improve the installation experience in a bunch of languages, and then in Go, it doesn't work.

Go is also very brittle from a backwards compatibility standpoint. So a lot of our design concerns involve putting stuff into the spec in a way that makes sure we're not going to break Go. I really, really love our Go maintainers. They're fabulous. But they have a harder job, in my opinion. When it comes to spec and design, they have to pay attention harder than the other maintainers, and they have to be more involved, because they run into more tricky situations in Go than in any other language. And I feel like I can say that definitively at this point, because it's been years of working with all these languages.
**Mat Ryer:** And it's the internals, isn't it? It's like the, when you build it, there's bits going on in there and that's the stuff that's more complex.
**Ted Young:** \[34:03\] I often describe Go as a box where the sides don't meet. It has all of these different features, but when you really try to flex those features for their intended purpose, you run into some problem where one interacts with another feature. So in Go, you want to use interfaces as a way to have loose coupling between separate systems. Classic problem, classic solution: loose coupling between separate systems. But then you want compatibility evolution, where you have this interface and you want to safely evolve it. What can you do safely in Go that won't break backwards compatibility? In every language except Go, adding a thing to an interface is generally considered a decent practice. In Go, if you even add a new method to an interface, that should be a v2, per Go best practice. You should major-version bump if you even add one, because someone might be doing something dynamic with that interface. So they added interfaces to safely manage loosely coupled systems, and then they added dynamic programming, and these two things don't think about each other. And now your interfaces are so locked down that touching them at all means you have to do a major version bump. But as we talked about earlier, if we ever do a major version bump, even if we didn't change anything, we would be creating a compatibility disaster for ourselves, where some packages depend on the 1.0 and some depend on the 2.0, and now you can't import those packages into the same place. So we have to violate Go's principles of versioning, and add methods to interfaces, and tell people: don't do dynamic stuff with this interface unless you're willing to keep up to date with minor version bumps. And then everyone in the language is irritated with us, and we're like, yeah, but the other problem is worse.
It's not our fault that all these things don't line up the way they do in other languages. Go just didn't think about this stuff when it was put together.
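One workaround the Go community does bless for this bind is the extension-interface pattern: instead of adding a method to a stable interface, ship the new capability as a separate, optional interface and upgrade via type assertion. A hedged sketch with hypothetical names (`Exporter`, `Flusher`), not actual OpenTelemetry types:

```go
package main

import "fmt"

// Exporter is a stable "v1" interface. In Go, adding a method here is a
// breaking change: every existing implementation stops compiling.
type Exporter interface {
	Export(data string) string
}

// Flusher is the extension interface: the new capability ships as a
// separate, optional interface instead of a new method on Exporter.
type Flusher interface {
	Flush() string
}

type basicExporter struct{}

func (basicExporter) Export(data string) string { return "exported " + data }

// flushingExporter opts in to the new capability.
type flushingExporter struct{ basicExporter }

func (flushingExporter) Flush() string { return "flushed" }

// shutdown upgrades via type assertion: only exporters that implement
// Flusher get flushed; everyone else keeps working untouched.
func shutdown(e Exporter) string {
	if f, ok := e.(Flusher); ok {
		return f.Flush()
	}
	return "nothing to flush"
}

func main() {
	fmt.Println(shutdown(basicExporter{}))    // prints "nothing to flush"
	fmt.Println(shutdown(flushingExporter{})) // prints "flushed"
}
```

The trade-off Ted describes is exactly this: the pattern keeps old implementers compiling, but the capability becomes invisible in the type system, so callers have to remember to probe for it.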
**Mat Ryer:** Yeah, well, even without doing dynamic stuff, if you add something to an interface, you've changed the contract, and any code implementing it wouldn't necessarily build, right?
**Ted Young:** So that's a thing we say: as an implementer, if you implement these interfaces, you're required to keep up with the API, right? OpenTelemetry has this very nuanced approach to backwards compatibility, and one of the pressure-release valves we give ourselves, so that we have room to actually deal with stuff, is we say the SDK has to keep up with the API. We're going to ship new features that go into the API. Instrumentation is going to depend on those features. People are going to depend on that instrumentation. The SDK has to be able to handle it. We don't want to play 5D chess with ourselves, where we're like, what if we add this new feature, but someone runs it with an old SDK that's just going to ignore those API calls? How do we deal with that? We're just like, we don't. The SDK has to keep up. So if someone wants to build their own SDK, they can totally do that, but they have to do what we're doing and keep up with the API. You have to be willing to do that for your users. Which again is slightly weird, but we need it in order to actually deliver the compatibility guarantees that our users need.
**Mat Ryer:** \[38:07\] Yeah, that's very interesting. I didn't realize that.
**Ted Young:** It's so much more straightforward. There's a path forward in every language. Each community has a path forward for dealing with this, because these are normal things; these aren't OTel observability things. And then in Go, it's like, ah, it's just this minefield. It really feels more like that in Go than in other languages. And I say that as someone who likes writing Go code.
**Matt Toback:** So what's the easiest language then?
**Ted Young:** Java is the easiest language, because Java is the language that thought about this the most, and has the most overt facilities available for doing exactly what you'd want to do, which is to just dynamically attach all of this stuff in a safe way that doesn't get in anybody's way or require any effort. And that's an explicit concept in Java, right? They have a whole agent concept, and the whole dynamic programming universe, and these are all very normal things to do in Java. So Java is the easiest, because there's the most intentionality available in that world.
**Matt Toback:** So should we tell everyone to switch back to Java then, you think? Because there's a lot of Go around these parts.
**Ted Young:** Hey, I thought Rust was cool, but now I just heard about Zig, which is like a new way to write low-level C stuff, and that sounds way hipper than Rust.
**Matt Toback:** Yeah, I heard it's easier to learn.
**Ted Young:** Yeah, it sounds like it pairs with my record collection. So I think I'm going to become a Zig programmer with my next.
**Matt Toback:** Got a lot of David Bowie.
**Ted Young:** My next bit. Yeah.
**Mat Ryer:** Yeah. I wonder if David Bowie used OTel because he was quite into tech.
**Ted Young:** That's true. Yeah.
**Mat Ryer:** \[40:08\] Starman.
**Matt Toback:** At the estate of.
**Ted Young:** Yeah. Yeah, the Starman was just because one of his projects did really well on GitHub one time and took off. That was the inspiration behind that.
**Mat Ryer:** It's not apparently.
**Ted Young:** That could be me right now. I'm the GitHub Starman.
**Mat Ryer:** And hopefully that doesn't make it into the podcast. If it does, dear listener, you'll really come to realize how little say I have in what makes it in or not.
**Ted Young:** Do you have kids?
**Mat Ryer:** No. I don't have any. Do you?
**Ted Young:** So that's why we're getting this, right? It's like, there's no other outlet for this particular.
**Mat Ryer:** That's it. Okay. Great. Good. So have you read my performance review?
**Ted Young:** That's very true though. I sometimes find myself being really silly and then I think, I'm not meant to be like this at this age. I just don't worry about it.
**Mat Ryer:** How tempted were you to call it OhTed instead of OTel?
**Ted Young:** Oh, man, if you thought the logs versus events naming bikeshedding was epic, naming OpenTelemetry, that was epic.
**Mat Ryer:** This is such an obvious name now, it seems. Isn't it? Isn't it obvious now?
**Ted Young:** At the time, it was not obvious. Everyone hated the name at the time. Absolutely hated it. It was like the first of OpenTelemetry's designed-by-committee disasters, where everything's a disaster and then six months later it's totally fine and everyone's like, why did anyone have a problem with this? Yeah, OpenTelemetry was that name.
**Mat Ryer:** What were the other ones, though?
**Ted Young:** I can't find the spreadsheet anymore. There was this epic spreadsheet. My favorite one, and this was said with a straight face by another leading contributor, was: you know, we have standard in and standard out, right? So why not have standard monitoring? We should call this new system STDmon.
**Mat Ryer:** Yeah. I think now it's called STImon.
**Ted Young:** We're not calling it STDmon. Let's take a look.
**Mat Ryer:** This is why the question didn't get answered last year, because we talked about this. You know, there's a spreadsheet, and you can't find the spreadsheet.
**Ted Young:** We voted on a name, see. Everyone probably swore secrecy, but this is like all conspiracies, right? There's always a leak eventually, and now we know there were a lot of really elegant and interesting names. But to be fair, you've ruined standard in and standard out as well with that one.
**Mat Ryer:** Yeah. It's like, oh my freaking god, we've already got these two and no one's complaining about them. You're going to be drawing attention to something that we want to forget about.
**Ted Young:** But at the end of the day, boring won out, because we're like, we want to be a standard, actually. Even though there were some interesting, fun names, we're like, it actually works against our best interests. The more this thing sounds like a Pokemon, the less it sounds like a boring standard, and we actually want people to treat it like a boring standard. And so: OpenTelemetry. And then people were like, that's so many letters, and then we came up with OTel like three days later, and everyone's like, that's fine.
**Mat Ryer:** If anyone gets anything from listening to this, I want them to join the community calls as much as I do, because it's like joining a reality show. You get to listen to everyone kind of flip out, and you're like, that's fine.
And then, yeah, we just make a bunch of good decisions and life moves on.
**Ted Young:** Big ideas loosely held. Yeah. Like we have these big bikesheds and then we do the obvious thing and move on. And yeah, that's just how it works.
**Matt Toback:** So I do want to make sure we talk about this zero-touch installer, since we're talking about becoming a standard, and teams adopting it, and the ease of adopting it. What's going on there? What do you want to happen over the next few months?
**Ted Young:** Yeah. I mean, long term, I would love for observability to be a first-class system. I think we've done a lot of work on OpenTelemetry to build the instrumentation APIs in a safe way, so they can get embedded directly in libraries and data services. The future I would like to see is that no one has to install any OpenTelemetry instrumentation anywhere, because when people write software, they're thinking about how their software is going to be run, and they're instrumenting it, and they're shipping a playbook for the people running their software, letting them know how to actually tune the configuration parameters they've been given, and all of that. So that's the world I want to see. People right now don't think about observability, and I think in part it's because they don't have tools to instrument their own stuff. Without OpenTelemetry, it's actually hard to do native instrumentation. So long term, I would like to see that, and I would like to see everyone become more of an observability expert as a side effect.
**Matt Toback:** Do you think AI is going to help us get there with that?
**Ted Young:** Do I think AI is going to help us get there?
**Matt Toback:** What's AI?
**Ted Young:** Artificial intelligence.
**Matt Toback:** Yeah.
**Ted Young:** In the meantime, we're sticking with AI artificial, artificial intelligence, which is just people who know how to program. But eventually they're all going to be dead one day eventually we'll get the real thing. The real artificial intelligence will come in.
**Matt Toback:** Okay, fine. So on this question then, do you see AI, two or three iterations along, just completely upending anything that you're working on here? Do you think that, because the data is a little more structured, because there's more thought in how the data is used, it becomes an accelerant somehow? Or is this all still too early to even talk about?
**Ted Young:** I mean, I feel like with AI, it's like 90% chance, nothing changes except our tools are more complicated with a 10% chance that the world ends up in a Terminator disaster and like very little in the middle. I'm very bimodal in my AI predictions.
**Mat Ryer:** The Terminator moment in time, when we go back, is when you all decided on events. And Arnold comes in, and he's like, no, get out. As if Java's better. No.
**Ted Young:** Yeah. I'll be fine once the SEnC compiled.
**Matt Toback:** I guess taking it back from the AI for a moment, to the install. There's something I don't want to lose here. What would you say to, let's say, someone who's rolled out OTel, whose teams are starting to change or shift focus? How would you tell an early adopter team within a company to help it spread wider? What would you say to someone?
**Ted Young:** Yeah. So the thing about installing OpenTelemetry that makes it hard is there's a catch-22, which is: to know whether or not you've done it correctly, you would have to know what OpenTelemetry does, right? And how can you know what OpenTelemetry does if you've never interacted with it before? I mean, this is like any complicated piece of software: that first installation experience is a difficult one to manage. You really want to have a product experience around installing this thing, where it's less about you doing work for it and more about it teaching you how it works. Like when you start a new video game and the very first level just happens to teach you what the different buttons do, right? You want that kind of productized experience when it comes to installing OpenTelemetry, and we have put zero effort into that, right? OpenTelemetry takes the opposite approach; it's a box of Legos, because we want to make sure we're not locking you in to our implementation. If you want to use OpenTelemetry tracing, but you still want to use Prometheus for metrics and your old application logging system, it's very straightforward to mix and match. But the side effect is that it's a box of Legos, and you have to know how to mix and match all these things. So we're always interested in improving the docs, improving the installation guides, providing things like the OpenTelemetry demo application so you can go look at something that's been instrumented and get insights there, all of this stuff. But really, what we'd like is for you to do nothing at all, just like installing any software: step one, install; step two, it works; step three, you use it. Step four or five or six, maybe you start learning the details as you want to start messing around with the knobs and stuff. All right, so can we do that?
The answer is absolutely yes, we can do that, and it involves really getting into the weeds with a bunch of packaging tools that we haven't focused on. We've mostly been focused on language library level packaging. So you have a bunch of libraries that we want to install instrumentation for, so we have a bunch of instrumentation packages in Python or Go or Java. And then there's something in each language, specific to that language, that knows how to scan what you have in your app, find the instrumentation packages that match what you have, and then download and install them. We do have some kind of automated monkey-patching thing in many languages, but it's different for every language. In some languages like Java, it's normal to do that. In other languages like Python, it's kind of weird, right? You can do it in Python and Ruby and JavaScript, but it's kind of weird; it's not a normal thing. So even explaining to people how to use the weird OTel bootloader in Python is still some weird new thing they have to learn. So we would like to use a low-level tool called LD_PRELOAD, which is a Linux bootstrapping mechanism, to actually get in there and automatically install all this stuff for you. That takes it from a language-specific installation process and elevates it to more of a traditional sysadmin Linux package management experience, where you're just like, you know, brew tap OTel, and all the OTel stuff just shows up, scans your system, sees everything running, binds to all the things, and then you start managing that whole set of stuff it installed using OpAMP, our control plane protocol. So that's actually a new set of software, something called the OTel injector. We're also looking at OBI, which is our eBPF solution, as another thing for getting RED metrics and network profiling out of everything. And then we're looking at the operator for Kubernetes and host metric stuff. And we're just trying to bundle all of that up into a kind of OTel... and I would love a name for this thing, you know, but I don't want to call it the OTel agent, because agent is just an utterly, horribly overused word, right? But it's like the OTel installer, the package installer that would let you, as the IT operator at a big organization, go in at a low level and just run this thing everywhere.
And then OTel would just get booted everywhere, and no one would have to lift a finger.
**Mat Ryer:** Yeah, I think that's what you'd call a virus, right?
**Ted Young:** Yeah, yeah, you want it to have this viral quality.
**Mat Ryer:** We use crypto miners, currently. OTel has a dark web contract to get in as a ride-along.
**Ted Young:** A Trojan OTel.
**Mat Ryer:** Yeah, so mostly phishing attacks is how we expand our reach. It's when OTel's like, I need you to buy gift cards for me.
**Ted Young:** Yeah, so many OTel gift cards. But that's one touch, probably not zero. If it's zero touch, that's like Apple put it in for you.
**Mat Ryer:** See, that's where the AI comes in. It knows it knows before you know.
**Mat Ryer:** Speaking of knowing stuff, where can people go to learn more? Because you did an ObsCon talk, and that's on YouTube, so people can check that out. How do they find that? Google?
**Ted Young:** I mean, search for it.
**Ted Young:** Yes. Yeah, well, I did a talk at GrafanaCon. ObsCon is coming up, and I'll have a talk at ObsCon that I believe is called Deploying OpenTelemetry with Grafana. That will actually be kind of a deep dive into this installer, and this world of, here's what we were calling a Maslow's hierarchy of needs, but it's Ted and Ed's hierarchy of observability needs, and how OpenTelemetry is going to try to improve on the value there, in terms of automatically installing the most important, widely available stuff first. And kind of what a roadmap looks like for tackling this installation experience.
**Mat Ryer:** That's great. And I was speaking in the past tense, as if the event had already happened, because when this goes out, it will have already happened.
**Ted Young:** Oh, right.
**Mat Ryer:** I don't want to sound like Doc Brown, you know, but you're not thinking fourth-dimensionally.
**Ted Young:** Yeah, ObsCon, yes, because we're in the future now. And that's already happened.
**Mat Ryer:** Yes. Yeah. And you're going to be positive too, right?
**Ted Young:** Yes. So this will also be out by the time this is released, but we will be throwing an unconference called OpenTelemetry Unplugged or OTel Unplugged as part of FOSDEM Fringe. So that'll be in Brussels in February.
**Mat Ryer:** What are you going to do? Just not have the screen. Just not have anything plugged into the projectors.
**Ted Young:** Nothing plugged in. Yeah. We just go around. We unplug. We unscrew all the lights. And then we just sit in the dark and the silence and think about what you've done.
**Mat Ryer:** Breathe.
**Ted Young:** Yeah. Meaningfully.
**Mat Ryer:** Well, that's what our listeners can now do because actually that is all the time we have. I assume once the Big Tent podcast is finished, that's what they do anyway.
**Ted Young:** Well, usually they just need a moment of silence after... after, yeah, having to hang out with you. No.
**Mat Ryer:** Oh.
**Ted Young:** Oh, it's okay. When you're mean to me, for some reason it all comes across as jokes and fine. But when do I do it? When am I ever mean to you?
**Mat Ryer:** Do they say something mean?
**Ted Young:** No. We never do.
**Mat Ryer:** Too nice. I think. Anyway, we haven't got time for that. So thank you very much. Ted Young. Thanks so much for joining us.
**Ted Young:** This was great.
**Mat Ryer:** Absolutely.
**Ted Young:** Yeah. Good times. Yeah.
**Mat Ryer:** \[56:33\] And we'll see you next time on Grafana's Big Tent.