
Ep. #6, From Big Data to Curiosity-Driven Insight with Roger Magoulas
On episode 6 of Data Renegades, CL Kao and Dori Wilson speak with Roger Magoulas about the real bottlenecks holding data organizations back. From the origins of “big data” to today’s explosion of tools and pipelines, the conversation focuses on why understanding, semantics, and communication matter more than ever. The episode is a call to shift from constant firefighting toward curiosity-driven insight.
Roger Magoulas is a longtime analyst and thought leader in the data and analytics space. He played a key role in popularizing the concept of big data through his work at O’Reilly Media and the Strata Conference. Roger is known for bridging technical systems, business context, and storytelling to help organizations make better decisions with data.
Transcript
Dori Wilson: Hi, I'm Dori Wilson. Welcome to another episode of Data Renegades. I'm the head of Data and Growth at Recce.
CL Kao: And I'm CL, CEO of Recce, your co-host. Today our guest is Roger Magoulas, a true pioneer who's been shaping how the world thinks about data for over two decades. Roger helped popularize the term "big data" back in the 2000s while co-chairing the O'Reilly Strata Conference, which brought together data professionals to tackle the biggest challenges in data and AI.
He later volunteered to do COVID analytics for California as well. So we'll definitely dive into that. Hello, Roger. Welcome to the podcast.
Roger Magoulas: Hi. Good to see you.
CL: Yeah. Can you take us back to the beginning? What problem first pulled you into the data space?
Roger: Sure. I'll just start with: I really love knowing trivia-adjacent, obscure, telling facts. So I've always been curious about that. And I started off as a C/SQL programmer, then I got an MBA. I wasn't the best programmer, I wasn't the best MBA, but the combination of the two I found really compelling.
And I was lucky enough to get on a data warehouse project in the 90s that opened my eyes to just enough coding, just enough thinking about business, that it was a great combination for my interests. And as I got into those projects, I really liked the work and seemed to have an affinity for it.
The nice thing is, when I joined Sybase in the mid-90s, I was brought in to build a data team. And I was given free rein. So I was able to learn a lot from doing that. So thanks to Mike Laudato, who was my boss at the time, because I was able to learn on the job and learn a lot.
Dori: Yeah. How did the MBA you think help you? Because I will tell you a lot of my friends in our late 20s, early 30s, there's always the question of should I go get an MBA?
Roger: Yeah, I think it depends on your personal preferences on how to do things. I like to know everything, like a generalist thing.
And what I found is the MBA gave me this kind of broad perspective. I know a little about finance, a little about accounting, a little about marketing, a little about innovation, a little about entrepreneurship. And by getting a little more background on all those things, it gave me, not really a toolkit per se, but a way to think about things from a bigger picture.
And I have a feeling that's going to come up more during this conversation that having that broad perspective is good. You know, in some circles, particularly the very intensely math people, having an MBA might be a bit of an embarrassment. But I'm really glad I had it because it did broaden my perspective. And I did have Janet Yellen for Econ.
Dori: That's incredible. For the non-economics nerds out there, she was the Head of the Federal Reserve.
Roger: And Head of Treasury.
Dori: And Head of Treasury, yeah. Just an absolute badass. I have a Master's in economics, so that's why I was like, oh yeah. Okay, so when you were building out the data team and you said you were given free rein, what were you most excited about doing?
Roger: Well, what was interesting is when I got started there, there's a whole long story behind it, but I had casually said, I think I could build a data warehouse in 90 days. At the time, these were like year-long projects.
And my boss said, okay, go to it. And at the time Perl was what we used. So Perl and, of course, Sybase, because we were working at Sybase. And we did it in way less than that. And it was so much fun. I just had one other person who was in on the project and Kathleen Edholm Glassy, who worked with Ralph Kimball, happened to stop by Sybase one day.
I caught her in the parking lot and she explained dimensional modeling to me. So we started doing star schema right from the start. And like I said, we were able to accomplish a lot in a short time. And we had to have that kind of big picture as well as the little picture of how to code and stuff.
And when it all came together and we started learning things about the company that no one had learned before, I found that pretty exciting to do. And I think I'm a fairly good communicator. And then I was given the opportunity to communicate that throughout the company. And so I think we became a valuable resource as well.
CL: And it's so interesting, doing data inside a database company. At that time it was a general-purpose technology, and even the modeling sat on top of that. Right?
Roger: Well, we also ran into a particular kind of problem. Sybase was mostly a transactional system, and we wanted to tune it for data warehouse workloads. And sometimes we'd ask for things. "Well, we want to allocate a lot of memory here."
And they say, oh, you want like 100 megabytes instead of 10? And we're like, no, no, we want gigabytes. And keep in mind that that was a lot back then.
CL: Yeah, yeah. So what was the spark that led you to launch ideas and conferences around big data? And I know you had conversations with DJ Patil about data science, which was popularized around the same time. Right?
Roger: Right. So what happened is a company that did job postings asked O'Reilly, where I was working, if I could basically be their analytics team and build their infrastructure, and then I would have access to this job post data, which would be pretty interesting for a publisher of computer books, because we would then know what people were hiring for.
And it was 2 terabytes of data, which at the time was pretty big. It doesn't sound like it now, but back then it was. And I was trying to do it on MySQL and it didn't work. And someone put us together with Greenplum, which was a distributed database, and it worked. We had some queries that couldn't finish and suddenly they were getting done in six minutes.
And that made a huge difference. And part of this was job posts are pretty unstructured. And so at the time, databases weren't really designed for indexing on a job post that had 2,000 characters in it. So it was a challenge generally. But a distributed database made a huge difference.
So I started doing talks and calling it Big Data. You know, the other thing that was going on was Hadoop had started making the rounds. And we were lucky enough at O'Reilly that we knew about this. We started going to meetups and stuff. And Ben Lorica, who a lot of you probably know about, he and I were the data team.
And we started saying there's something going on here that's a little different than how data warehouse felt in the kind of 90s, Ralph Kimball-oriented days, in that it was really large behavioral data sets, data that's a little more unstructured, so more natural language work that needed to be done. We packaged it together and called it Big Data.
So as far as I know, the first talk I gave was in Belgium in 2006 and we published a paper in early 2009. And that's where we met DJ, because we thought what LinkedIn was doing was emblematic of what we were seeing elsewhere. He also had visualization people, he had natural language people, he had the people who did Voldemort, which was their fast-write data store.
And we thought that they had the complete picture, so we used them as a bit of a template. And of course, soon afterwards I suggested we have a conference on data, and that turned into Strata. And that, I think, snowballed; the tribe finally got together and it became a big subject.
And it also helped that Hal Varian, who was a good friend of O'Reilly, said data science was the sexiest job of the new millennium and brought a lot of attention to what was going on there.
Dori: Absolutely.
CL: Yeah. So almost 20 years have passed, right? It's just so interesting that it feels like a long time ago, but also not long ago. I think we ran into each other at the Small Data conference. Right? So the Big Data of that time is probably small data now. So I'm curious what your take is on what's still the hardest part of data engineering that nobody's really talking about?
Roger: Yeah.
I think the hardest part is the overwhelming amount of things going on. It's gotten, in some ways, too complex, and there's not enough room to think about what you're doing.
What I mean by that is a lot of data engineers, if they're doing DAGs or whatever, they know what they're doing, but they don't know why they're doing it. They're so focused and so stressed by the amount of complexity, the number of feeds, the number of pipelines, keeping everything going, that they're not able to spend time on the insight and knowledge side of what's going on.
So I think part of it is just there's a lot of tools. I mean, I'm sure we'll talk about this more too. I mean I'm looking at a bunch now that do different little pieces and I think that it ends up feeling pretty overwhelming.
When I built that first data warehouse at Sybase, I commissioned two of my programmers to build basically orchestration for what we were doing. It was this big Perl program and it worked great. Now, we had like 12 feeds. So I think the difference is that when you started getting into like when Airbnb did Airflow, they probably had hundreds or thousands of feeds going on.
It's that level of complexity that meant you couldn't internalize or imagine the whole thing at once. So you needed abstractions, and you needed to maybe give up on knowing everything and become more reactive and less proactive.
And you know, I think generally it'd be great to address that some, because I think it diminishes the value of data teams when all they're doing is reacting and they're not providing the real value, which is the insight in the data.
Dori: Yeah, it's the "so what" problem. My background came from economics and then I got into data science at Uber. So I was always coming from that qualitative side, the how-do-we-make-something-qualitative-quantitative mindset, and then I kind of self-taught analytics engineering. Because at startups, you don't have time to wait for somebody to build a pipeline. You just need to build it.
So to solve this thing where you're just building, but you're so behind, so underwater, that you kind of lose sight of the forest for the trees: do you think that's a problem of team scale? I mean, you mentioned just the complexity of data. How are you thinking about solving that problem? What do you think would work?
Roger: So I think the main way to resolve that problem is one, it's going to take a little more resource which no one wants to--
Dori: They never want to give it. We're cost centers.
Roger: Right. But having a dual focus, one is there needs to be time and exploration. So EDA, exploratory data analysis, should be part of everyone's remit. Right? It's that knowing the data matters.
And two, instead of becoming the only place that's doing the data and this gets into maybe more democratizing data mesh kind of things, the data team should be more enabling. In other words, what the data team is there to do is to help other people become more adept at looking at their own topic area.
And I think it's unreasonable to expect that every team in a large organization is going to have the kind of data people who can build a sophisticated pipeline.
But if they can get close because they're enabled, and people are able to do the kind of exploratory work that I actually think some of the AI tools are making a little easier to do, then you can get to a place with a better balance, where you're getting more out of the reason you built it all in the first place.
And I'll tell you one of the things, and I think this is just human nature, and I'm going to use the word lazy and I don't mean it in a pejorative sense. Generally, for humans, the biggest thing we spend energy on is cognitive thinking, so you want to limit that resource consumption of thinking about things.
So it's very hard to measure the real impact of a data team. So how do they measure it? How many dashboards did we build? How many pipelines do we have? Instead of a more qualitative: what have we figured out? How are we bringing provocative questions that lead to further thinking about the business? How are we impacting things?
Not in so much a strict ROI way, but in having a better context of what is going on in the business. This is one of the reasons why I think, and this is self-serving, that generalists are not a bad thing to have on a data team.
So yes, it's good to have math, it's good to have people who are really strong on the data engineering stuff and you'll need some of that. But I also think it's good to have people with a broad perspective that maybe don't know all the stats and so forth, but can contextualize things into the words, the semantics of the business so that they're able to process what they're getting from the data team.
Dori: I want to double-click here on data teams proving value. As we're thinking about these qualitative questions and showing, okay, I found this spicy take, or look at how our product's being used in maybe a different way. How do you think about structuring those arguments? Or, the economics person in me asks, how do you make that quantitative for people? Are there certain orgs you should prioritize or people you should look for in a company to work with?
Roger: Yeah. One, it does take some skill to turn things quantitative. So I think what you're going to end up having to do is storytelling; it's really the story that's going to resonate about what you're doing.
And if you can get enough stories out, your value builds around those stories, the stories the company uses to figure out what's going on. One of the things that happened to me, and this is at O'Reilly, I had a feed break and I didn't notice that it had happened.
And Tim O'Reilly says, Roger, what's going on? This data doesn't seem up to date. And I'm like, oh, shit. You know, I thought something broke. I went in and fixed it. It was a weekly feed. The next week I checked that everything was okay. I sent him an email I called "Synced." I said, everything is synced up and it's all up to date.
Then I started doing that every week and started adding insights. It was a real newsletter. Soon it was going out to the whole company. Every Sunday morning, I did a new version of it. I had insights about the book market in general, about the book market for O'Reilly, things that were going on that would be hard to even get at in a dashboard without a narrative structure.
And I think one of the ways you add value is something that looks like Substack. In other words, you're on a regular basis saying, here we are, we're looking at data. Here's what we found. Here's the questions you need to ask yourself about what we found. Why is this happening?
So have a little bit of what I used to call supply-side analytics, where people have a little time to look at their own thing, not just what clients or whoever are asking for. You're in this data, you probably have questions about it. And hopefully you've hired curious people. Right? That's a really important part of doing this work.
And what are they finding? What's really weird that's going on? And I think when you do that, and it doesn't have to be weekly, it could probably be monthly. And I've done it monthly at another company. But that's a way to establish a relationship.
And I know it sounds like old hat, but it's kind of like the reason people used to buy business magazines. You got this kind of contextualized statement of what's going on that provided advice and got you thinking. The way the data team can establish some value is that that's what they're doing.
And yes, you're going to have a lot of charts, and there are going to be numbers and so forth in there. So it is quantitative. What it isn't is measuring only the exact impact on the business, because quite often that's hard to figure out.
Like, you know when you do an A/B test, well you pick the best one and now you've got better results. But unless you ran a huge kind of trial, you're never really going to know what the exact impact was.
Dori: Yeah, it's our value being shown in the absence. I've thought about a lot of the work as: we're amplifiers, we make everyone better and their choices better. But then it's, okay, this team did better, how much of that is from us?
CL: Wow, there's a lot to unpack here. I think you're describing a utopian state for a data team, where they're proactively able to deliver insight in a newsletter. But a lot of the time we hear about teams being very reactive, supporting demands from other parts of the org and all that.
So how does one team shift from being reactive, responding to demands, to more of, as you say, supply-side analytics, if you will?
Roger: Ask nicely and say please? Try to at least get something going; the newsletter idea can be a cudgel for getting you into that position. So if you can peel off a little bit of resource to regularly communicate to the broader organization what's going on, I think you might then be able to get more resource to make that more widespread.
Look at what we're doing with this marginal effort and let's make it bigger. I think the other thing that's going to go on, and I mentioned the complexity, is that I think some of the busy work that data engineers do can be nicely augmented with very focused use of LLMs. And in particular, and I know this might be getting a little off the subject, but this is a particular thing that I think is interesting.
The problem with LLMs and data is they're non-deterministic, unless the LLM is generating code, which is deterministic.
And I think for data teams, if they can prompt in a way that says, write me a program or write me some SQL that monitors my pipelines, tells me when a feed is more than one standard deviation away from its mean and gives me a warning, and if it's two deviations, stops the feed and gives me an alert, that can make a big difference.
And because it's deterministic, you're now in a position where you can rely on it, where what you really don't want is, you know, person A asks the same question as person B, they hit a different flank of whatever LLM they're buying and they get different answers. It's like, okay, now we've got to spend time rationalizing which answer is right rather than focusing on what the answer is and what that implies or tells you about what's going on.
Dori: So kind of taking a step back: instead of outputting the answer, output the way to get the answer, and then use that.
Roger: Yes. And particularly in the context of data engineering with that complexity, this is a way to help manage some of it: you've got things that are programmatically monitoring, but with the extra productivity of having a generative tool help you write the program, anticipate things, maybe write test scripts and stuff like that.
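A minimal sketch of the deterministic monitor Roger describes, in plain SQL. The feed_runs table (one row per feed per day, with a row_count) is a hypothetical stand-in for whatever run log a pipeline keeps; the thresholds mirror his one- and two-standard-deviation rule:

```sql
-- Hypothetical run log: feed_runs(feed_name, run_date, row_count)
WITH stats AS (
    -- Historical mean and standard deviation of volume, per feed
    SELECT
        feed_name,
        AVG(row_count)         AS mean_rows,
        STDDEV_SAMP(row_count) AS stddev_rows
    FROM feed_runs
    WHERE run_date < CURRENT_DATE
    GROUP BY feed_name
),
latest AS (
    -- Today's run for each feed
    SELECT feed_name, row_count
    FROM feed_runs
    WHERE run_date = CURRENT_DATE
)
SELECT
    l.feed_name,
    l.row_count,
    s.mean_rows,
    CASE
        WHEN ABS(l.row_count - s.mean_rows) > 2 * s.stddev_rows
            THEN 'ALERT: stop the feed'   -- more than two deviations out
        WHEN ABS(l.row_count - s.mean_rows) > s.stddev_rows
            THEN 'WARN'                   -- more than one deviation out
        ELSE 'OK'
    END AS status
FROM latest l
JOIN stats s USING (feed_name);
```

Because the check itself is ordinary SQL, it returns the same verdict every time it runs; the LLM's non-determinism is confined to drafting the query, which a human can review once.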
And I've had people tell me there's gonna be no engineers next year, no data engineers. I also had someone tell me there's gonna be no programmers in the 80s because of that first wave of AI. Prolog and Smalltalk were gonna replace everything. Clearly that didn't happen, and I don't think it's gonna happen now.
But do I think it's a productivity tool? Absolutely. Do I use it all the time? Absolutely. Sometimes I'm in Python land, doing something, and it's like, I can look this up or Cursor can tell me and, you know, I maybe don't need to know this internally, I just want it to work. And luckily I know enough Python that I can check whether things are going okay or not.
Dori: Yeah. Making sure the code's efficient. I've seen it write some weird SQL.
Roger: Yeah. And with the SQL stuff, you know, this is a little bit of a digression, but my SQL is super formatted. It looks like Python. Everything is columnized. It's really regular looking. And why do I do that? So that the next person coming in will have a much better chance of figuring out what's going on.
Otherwise SQL is, not that it's sloppily put together, but it can be kind of all over the place. And there are no subqueries in my code; everything is a CTE. Because I think that kind of modular programming is much easier to maintain than having to be like a Lisp compiler, going inside the inner parentheses and working it out from there.
Dori: Yeah. And no select stars, I bet anywhere in your code.
Roger: No, there are no select stars. The one thing, though, and I love DuckDB since we're at the Small Data conference, is "GROUP BY ALL." Now BigQuery has it, and I think some others have it too. Because in a data analytics scenario, you're almost always grouping by everything that isn't inside an aggregation function.
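A small example of the style he's describing: columnized, CTEs instead of subqueries, no select stars, and DuckDB's GROUP BY ALL, which groups by every selected column that isn't inside an aggregate. The table and column names are made up for illustration:

```sql
WITH orders_enriched AS (
    -- One CTE per logical step, rather than nested subqueries
    SELECT
        o.order_id,
        o.order_date,
        c.region,
        o.amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
)
SELECT
    region,
    DATE_TRUNC('month', order_date) AS order_month,
    SUM(amount)                     AS total_amount,
    COUNT(order_id)                 AS order_count
FROM orders_enriched
GROUP BY ALL    -- shorthand for GROUP BY region, order_month
ORDER BY order_month, region;
```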
CL: Yeah, it's just so fascinating to see that the SQL language is still evolving, with all these extensions and add-ons almost becoming standardized.
Dori: Yeah.
CL: So Roger, early on you touched on delivering some broken reports for O'Reilly. So besides that, what's the most painful bug or failure you've seen in production caused by data?
Roger: Yeah, you know, I was thinking about that question before we came on, and there were so many times I failed at something, got a great lesson from it, and found that really valuable. When I think about the specifics, that particular case was one of the good ones.
I also think there's been times when I haven't been as expansive as I could have been in thinking. And once I figure that out, I immediately pivot to, oh, yeah, we need to bring more in. So it's almost like I wish I had brought in some outside data or another feed or whatever.
So we were starting to put Strata in Europe, and we were trying to figure out where in Europe to put it. And I was like, what are we going to do? How are we going to figure this out? And I think it was Ben who came up with: let's look at Meetup. And we did, and we found out that, maybe not a surprise, Tel Aviv is a very big place for analytics people.
And per population, Stockholm was also very big. So we found there were some cities where there was clearly more going on, at least as reflected in Meetup. And I'd say the failure was, "Why didn't I think of that sooner?" What were the sensory organs we could have used to get a grip on what might be worthwhile?
And is Meetup the best source? No. I mean, some cultures don't like to meet in person. Just in the US, when you look at Meetups, a city like Kansas City is very tech heavy, right? There's Mastercard, there's Cerner, there's Google Fiber, there's the Kauffman Foundation for entrepreneurship, there's Sprint. Right? And no one goes to Meetups.
But part of that is it's just not the cultural norm there. It's more established companies, more kind of enterprise companies that are, I know it sounds like a cliche. They're not getting that recent MIT or CMU or Berkeley grad who's working on the latest reinforcement learning algorithm. They're getting people who are good at keeping things going in a more risk averse manner because these are big established companies that don't want to screw up things.
And that means it's an older programming crowd, and probably more family oriented. You know, free pizza isn't enough to leave the house. And this is what I mean about getting real insight. If you just deliver a number and say "hardly anyone goes to meetups in Kansas City" and you're trying to figure something out from that, you've only gotten a third or a quarter of the job done.
You've also got to figure out why, and then what you can do to work through it. Maybe do something during the day that provides value in a way that attracts people. So it's like a meetup that's not in the traditional after-work slot, but during the day, and because you're providing valuable content, there will be a draw.
So it's one of the things I say to all the data teams I work with: the best analytics asks more questions than it answers. And the George Box thing, all models are wrong, some are useful, is about moving toward that usefulness, because you're continuing to drill in and figure something else out. Why is this happening? And being fine with throwing out an assumption you had before, because you've discovered something new, that's really valuable.
I thought of another failure. We had a conference for JavaScript at O'Reilly, and I used to go to these conferences because we did analytics on them. We had one in the spring and one in the fall, and at the one in the fall we hardly got any big retailers. And I asked, why? I'd go to the conferences and sit at tables with people, and I sat with someone from Target, and I said, maybe you can explain to me why so few of your people come. He said, in anticipation of Black Friday, every big retailer locks down all their employees.
No one can go to an event because they're preparing for their biggest traffic month of the year. What analytics would have figured that out? I don't think anything. You would have to know about internal policies at the companies building the front ends for retailers. It's just not something that's very obvious. So we got there through qualitative means: we had a quantitative signal, which is what was going on, a qualitative source, and then we figured out some ways to work around it.
CL: Yeah, I love it when you say that the best analytics asks more questions than it answers. This is actually how I use LLMs these days: ask me a lot of questions about why I'm asking you to do this, so that you have better context to actually solve the thing I want you to solve.
Roger: Yeah.
Dori: Yeah. A super curiosity-first mindset.
Roger: Yeah. One of the things that I think about with that curiosity thing is there's just some people that are just more naturally curious. And, you know, I've put myself in that bucket. But there's a book called Superforecasting, and it's by a guy named Philip Tetlock, and he's really a political scientist, but in the realm he was working in, he found that there were some traits of these people who are really good at forecasting political economy kind of things.
And one of the things they were good at is changing their mind, not having what's called anchoring bias, which is where you've got something, you think it's the thing, and you keep staying on it too long, with an inability to go, "Wait, this new evidence is really strong. I really need to change."
And you see this culturally, even in places like the sciences, where it shouldn't happen if you're really doing the scientific method. And it ends up that the best forecasters, at least according to Tetlock, have the lowest confidence in their answers. And the worst forecasters are ideologues, who had an answer from the start that they're just using data to support. Now, clearly, in a company context, that's a little different than in doing political economy stuff.
But I think that there's a lot to draw from that in terms of curiosity. And what that means is it's got to be coupled with a lot of humility. Saying, "I'm wrong. I got this wrong. I've learned something by going through this process." And it's that kind of loop, that iteration, that I think turns you into overall a better analyst and better at making sense of what's going on.
You know, I mentioned the Black Friday thing. Well, that's not as true as it was. Black Friday has diminished a little bit in terms of being a big pull. And it's just things change. Right? And so you've got an insight, you can make some sense of it, but you got to keep testing it and seeing, is this still true? And then working from there.
Dori: Yeah. Celebrating the learning in the iteration, not just the answer.
Roger: Yeah. Right.
CL: So I feel we touched a lot on what a good data team looks like, right? But what's the best advice you would give to someone just building out their first data team?
Roger: Yeah, so lots of EDA. Right? Lots of exploratory data analysis, and document those findings. The little weird things you find are just as important as the SQL you write. Focus on semantics, data models and metrics. And what I mean by metrics is, what are the agreed meanings in the company? What is profit? What is ARR?
Getting a definition of that stuff early and having it understood across the company means you get to spend more time on insight generation than on arguing about whose variant of a metric is the right one.
Communication matters as much as math when you're building a data team.
So knowing how to communicate, how to visualize, how to create narrative so that things will stick is really important. And what I think that means is you want a diverse crew.
Do you need some heavy duty data science people who know how to build very sophisticated models? Yes. But you also need people who write well, you need people who know about the business. You really want to get this quilt of specialties so that you're getting the biggest and most wide perspective on how to make sense of what you're doing.
I was just reading today an article about Giorgia Lupi, a pretty famous data visualization designer, who has done this application for the subway, giving you all these random facts about it. What's the fastest subway line, what's the longest subway line, what's the shortest subway line, all this kind of stuff.
So here's a person whose real focus is on visualization, yet there's so much that you can get from that. And now you can look at that and you say, I lived in New York all this time, never knew all this stuff, and now you've got all this information that I think helps you make sense of your world.
Do you need a sense of the subway? Maybe not. But imagine that brought into a company, where you're able to visualize and tell a story about some of the things you're doing that helps align people around strategy, or what's working and what's not working. And I think it takes that kind of generalist. You know, the book Range by David Epstein is a good book for people to look at in terms of why you might want to go broader.
So I talked about the team. I've already mentioned humility, being open to change, learning from mistakes, all that is particularly important. I think architecting for a little of that supply-side analytics is also something I would say and I know this is the hardest thing of all--
Most companies, when they deal with software, think about projects. Analysis is not a project; it never ends. And the more you can convey that that's the case, the more you're going to operate in a way that both the client organizations and your own organization understand better.
So you're not at loggerheads around things. Yes, there are project things, like building new pipelines or adding a new dashboard. But generally, the work of the data team is to be the sensory organ. There's this German word, umwelt, which is spelled with a W but pronounced with a V, that describes the sensory world an animal or a creature perceives. The data team is an umwelt for the company.
They're the ones processing the sensors, whether it's sales data, funnel data, operational data, whatever, and trying to make sense of it. And that never ends. So you want, and I don't even know if cadence is the right word, but some cadence, so that people know you're there. There's a heartbeat they can detect.
But you're always exploring, always trying to figure out more, reacting, learning new stuff, and having that mindset really helps a data team. The last thing I'll say, and I said this earlier, is think of them as enablers and not an appendix to the organization; the data team is there to help other people do their work. And that means, and this is hard sometimes for people who are really into the data science-y stuff: don't tell them how you did it, tell them the result.
Dori: Mhm, yep.
Roger: Right? I mean sometimes it's hard. You did something really cool and clever and you apply this really great thing and you're really happy and your other data team members will be like wow. But no one else cares. What they do care about is did you find something out that was really valuable.
CL: I love the framing of the data team as the sensory organ, because if you follow that, naturally you will want to architect for what we're calling a supply-side capability, and then everything follows from there.
Dori: Yeah. I was just thinking when you're talking about, yeah, they don't want to know the "how" of--
I was really lucky in my career; I had some managers that were like, "You're no longer in economics. They don't care about the model, they don't care about the standard deviation. Cut this. What are the three numbers you're showing them? Maybe one chart."
Roger: That is a rule of thumb that I used to tell all my analysts and stuff. When you're presenting, you can present four things plus or minus three. There's a reason phone numbers are seven digits. Right? And without a narrative even four things are hard to remember.
Overwhelming people is not the way to provide value that will stick. The way to have value stick is to not have too many things and have a narrative, because human culture develops with storytelling and storytelling is really critical somehow to the way our brains are wired around remembering stuff.
Dori: Mhm. I think opinionated as well; that's part of storytelling. Sometimes you can build a narrative, but it doesn't have the "and now what" at the end. And you've got to have that "now what," I think.
Roger: Yeah. And as far as being opinionated, I mentioned the superforecasters, which is that you stay open. And you know the joke, since you were an economist: one president, I think, said, I wish I had a one-armed economist, because they're always telling me, "Well, this might be true, but on the other hand..."
Dori: Yep.
Roger: But being able to say something like, "This is what I think you need to think about," that's the kind of opinion that works. Given what I found, this is what works. And give me feedback on whether that's helpful, right? So that the data team is getting reinforced in what worked and what didn't. Because you can think something is really valuable, but if no one's processing it, then it isn't.
Dori: Yep. A number is only as useful as it gets used.
Roger: Yep.
CL: Yeah. Oh wow. Roger, I feel we went down memory lane for how we got here today with the whole data industry, and I think you distilled a lot of wisdom about what a good data team should look like. But fast forward five years: what's going to feel laughably outdated about how we handle data today?
Roger: Okay. There is a subset of people who say SQL's dead.
CL: Every 10 years people say that.
Roger: Yeah, yeah. I think it's not going away. It will someday, but not any time soon. I also think overly complex pipeline management is going to look outdated, because something an LLM is pretty good at is that kind of monitoring, as I was describing, and building the kind of deterministic tools that help you work.
I also think natural language analysis, in the way that C-levels would like to see it, is a long way off. It's not going to happen soon. Right now a lot of people are like, why can't I just ask, what's going on with sales? I think that's going to require a lot of work.
And the analogy I would use is self-driving cars. Ten years ago they were doing pretty well, but it really took until last year to get to where they are. I mean, right now it's clearly safer to be in a Waymo than with a lot of human drivers, and I think we'll see a similar pattern.
Clearly driving doesn't relate exactly to what data people do, but it's just going to take longer to get to natural language analysis that's not infallible but reliable. And trust is going to take a lot more data quality work, a lot more guardrailing, a lot more context supplying. That's going to take almost a cultural learning, kind of the way when Hadoop first started up, there were ways people were using it that had to evolve, and it took the whole community to figure out how things should work.
And I think it's going to be like a community build. So I'd say generally there's an expectation right now that things will happen quicker than they will; it'll just be slower. So I don't know if that's really laughable, but I think someone, not just Gary Marcus, will say something like, remember when you said AGI was coming in 2026? Right? And he also said that in 2025 and in 2024.
CL: So we've talked a lot about how AI is going to reshape the role of data engineers, now that AI is part of everything. Do you think there's going to be less demand for that role, and more generalist, full-stack data people working on understanding the data itself, as opposed to moving data around? Where else do you see AI taking us?
Roger: Yeah, no, I think what you're describing is my expectation. And there's the whole Jevons paradox thing, where a new technology ends up getting used more, not less. So I don't know if there'll be more or fewer data engineers per se, but when you think about building a data pipeline, right, there's a source, you've got to transform it, and you need to load it. That's actually pretty easy to describe.
Why is the plumbing so difficult? Do I need to know Kubernetes as a person who works with data? I don't think so. Do I need to do Helm charts, because of the complexity and the scaling and all that? The more that gets abstracted away, the more data teams can focus on the real work. And this is where I think AI can help, but in an augmentation way. I don't think there's going to be wholesale replacement, not in this area. It's going to be more about augmenting how people explore their data.
Just think about something like a data catalog. An LLM is pretty good at that. You give it a sample of data and say, what is this? And then when something changes, you just run the LLM again and it generates a new data catalog. Right? That saves a lot of time and is valuable.
And while every data person knows a data catalog is valuable and wishes they had one, they hardly ever do, and the ones they have are out of date and stuff like that. I also think, given what I said about having a good semantic layer and metrics, that having those things defined and given as context to the LLM gets you that much closer to valuable augmentation of the work data engineers are doing.
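One hedged way to gather the raw material for that kind of LLM-drafted catalog, using DuckDB (which comes up again below as a handy mid-tier). The events table is hypothetical; SUMMARIZE and USING SAMPLE are real DuckDB features:

```sql
-- Profile every column of a (hypothetical) events table:
-- type, min/max, approximate distinct count, null percentage.
SUMMARIZE events;

-- Pull a small random sample to paste into the same LLM prompt,
-- e.g. "Given this profile and sample, describe each column."
SELECT * FROM events USING SAMPLE 20 ROWS;
```

When the schema changes, rerunning the same two statements and the same prompt regenerates the draft catalog, which is the refresh loop Roger describes.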
Dori: Do you have a favorite semantic layer?
Roger: You know, I tend to DIY a lot of stuff like that. I've found the data I work with tends to be, not obscure or arcane, but specialized. If I'm getting data from Salesforce, well, it's not that sales data isn't interesting, it's just not interesting from the standpoint of getting more out of it than what's already there.
But you know, I'm working on an IoT thing right now that, you know, it's fascinating, what could be there. And there isn't like a standard thing you can do. You've got to like think about it you know, because it's sensors and stuff like that.
I do think that things like, well, I happen to be very friendly with the people at Rill Data, and I think the way they handle their metrics and such is really good. I think dltHub is another one, where you're able to define things in a way that's pretty straightforward, and that makes it easier for the analytics side of a data team to get things done than it would be otherwise.
You know, I think DuckDB, as kind of a mid-tier between your big distributed data store and how you're accessing the data, is a really valuable thing. And then you can do a lot of semantics in there and make that worthwhile. The reason I don't have a particular one is, take something like Malloy: basically what that language does is a CTE ahead of time that defines your metrics and how they relate to everything else. And I don't think you need Malloy to do that.
In other words, you can have your feeds and your data model support that kind of work. On the other hand, maybe you should do Malloy, if that's how you think you should go. I've found that knowing Python and knowing SQL gives you a lot of power.
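A minimal sketch of the DIY approach Roger is gesturing at: the metric definitions live in one CTE (or a view) that every downstream query builds on, so the company argues about a definition once. The invoices table and the ARR formula are illustrative assumptions, not anyone's official definitions:

```sql
WITH revenue_metrics AS (
    -- The one agreed place where metric definitions live
    SELECT
        DATE_TRUNC('month', invoice_date)       AS month,
        SUM(amount)                             AS gross_revenue,
        SUM(amount) FILTER (WHERE is_recurring) AS monthly_recurring_revenue
    FROM invoices
    GROUP BY ALL
)
SELECT
    month,
    gross_revenue,
    monthly_recurring_revenue * 12 AS arr   -- one common ARR convention
FROM revenue_metrics
ORDER BY month;
```

Promoting the CTE to a view gives the rest of the company the shared definitions with no extra tooling, at the cost of the governance features a dedicated semantic layer adds.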
Dori: Yep.
Roger: Knowing something very specific gives you directed power, but not general power. And a lot of things have gone from the specific to the generalized. I know this isn't very fair to R, but when R came out, it was really something. Right? And data frames are clearly the way a lot of people think about data. I have a feeling a lot more people are using Pandas than R these days.
Dori: Yeah.
Roger: Now, talk to a real statistician, and Pandas is not nearly as good as what you can do in R. And they're absolutely right. But for most cases, it's good enough and it's got enough functionality to get things going. And so I think when the general tool is the wrapper or the framing for what's going on, it ends up being the best way to get things done.
And that's why something like Julia hasn't gotten the kind of traction you might have expected, given how much analytics is going on. It hasn't happened, because it's nice to have a more general-purpose thing that handles many workloads, not just that one workload.
Dori: Yeah.
CL: Well, this has been such a great conversation, Roger. Thank you. Before we dive into our lightning round, is there anything we should have asked you but didn't?
Roger: I thought you would have known to ask this.
One of the things I think is missing from people talking about the data space is philosophy.
I think there's strategy, but philosophy was a subtext of a lot of what I was saying: exploration. There was this French philosopher named Gilles Deleuze who had these lessons for living that were around running a lot of experiments with your life. Some will succeed and some won't. But learning from them, not being too provincial, getting out in the world and trying new things. Having a philosophy helps provide that kind of guidance and alignment for how to do things when they're new or unanticipated.
Okay, how do I handle this? Humility, learning from iteration, those kinds of things, to me, are more philosophical than strategic. And I think it makes sense to have a philosophy because the philosophy works like a silent manager. I apply the philosophy to help me make the next decision, and I don't need to run everything up and down the flagpole to figure things out, because I understand the philosophy of our organization and how to handle data.
So, like I mentioned, George Box and "all models are wrong, but some are useful." There's a painter, Edgar Degas, who had this great quote: "Painting looks easy when you don't know how, but it's difficult when you do."
I think a lot of data stuff seems kind of simple to the people who are going to use it, but when you get into it, you realize there's so much subtlety, so many things you need to figure out and make sense of, and so much that conflicts, that it's difficult to get right.
So those kind of things, I think help.
CL: Right. Yeah. We keep coming back to humility and curiosity. I think you were referring to Deleuze, who coined the rhizome. Right? It's like expanding underground toward the horizon, which is super important for data. All right, well, let's get into our data debug round. A couple of rapid-fire questions. Are you ready?
Roger: Yeah, I'm ready.
CL: All right, first programming language you loved or hated.
Roger: Okay. I've had many crushes. SQL, C, Perl and Python, sequentially. Fell in love with all of them and still like them. I mean, I don't like C anymore. Not that I don't like it, I just don't use it. But yes, SQL and C were the first ones where I was like, wow, this is really powerful.
CL: Cool. And then, your go-to data set for testing.
Roger: Census data. People underuse it. Unfortunately, it's gotten a little screwed up recently. But on the 11th, the newest release is coming out. I'm looking forward to grabbing the newest release. You know, you get great demographics. It's wonderful.
CL: Cool. One lesson from outside of tech that influenced how you build.
Roger: The book Unreasonable Hospitality. It's about a restaurant. Eleven Madison Park is a very upscale restaurant, and the author has all these lessons about how to treat people, in that case the patrons, really, really well, and yet manage that to a budget. And I think there's a lot to learn from treating people in ways--
Everyone says, oh, we're going to delight our customers, but really taking that extra step to delight in a way where it seems unreasonable how much you're doing. I think you create a legend that way. There are a couple of products like that, Rivendell Bikes or Blackwing pencils.
And so I know this is a lightning round. I'm going into all this detail, but I think there's some places that do like great communications and are just great in the community and stuff and it ends up that that makes them special.
Dori: Yeah.
CL: Amazing. All right, what's one hot take about data you're willing to defend on the podcast?
Roger: It's one I've already said. The AI stuff's going to take a lot longer than we expect right now and it's going to take a lot of work.
CL: Okay. Favorite podcast or book that's not about data.
Roger: A History of Rock Music in 500 Songs. It's amazing. It's this British guy; he starts in the 50s, some of the episodes are four hours long, and he goes through the history of everything. Just as an example, there's a Beach Boys song called "Good Vibrations." He starts off in pre-Homeric Greece, makes a stop in Russian electronics research, then "Rhapsody in Blue," and then finally gets to the Beach Boys' "Good Vibrations."
And he's super thorough. And it's about rock and roll, so almost every episode has a trigger warning: "If you're concerned about misogyny and drug abuse and all these terrible things, you might not want to listen to this episode." That's because the people in rock and roll were such terrible people.
Dori: Yep.
Roger: But it's a wonderful podcast. And, like, he's up to the 70s now, and it's been like three or four years. It's going to take him another bunch of years. But I actually heard Terry Gross on Fresh Air reference this podcast, so I think it's wonderful.
Dori: Yeah. This is an extra question just because we talked about philosophy. Do you have a favorite philosopher?
Roger: Yeah. I would say Deleuze, in terms of me applying what I know. I'd say the other one is David Hume, and the whole Scottish explosion of great thinking in the mid-18th century. It's just phenomenal that it happened. And I think you can trace a lot of things back to then, that kind of rational thinking, respect for humans and so on.
But I think with Deleuze, there's, like, good lessons for living that go beyond stuff. So if you know the book Tiny Experiments, which came out recently, it was basically Deleuze put into a book about how to not overreach too often and then be disappointed in what you can accomplish and whatnot.
I don't read philosophy all the time. I've got a good friend who is very deep in philosophy, so I'll just do a shout-out for Steve Swoyer; he's an amazing resource for that kind of stuff. I am not like that at all; I'm very much a dilettante about it. But I did like Deleuze.
Dori: Awesome. Well, I think that's it for our data debug round.
CL: Final question. Where can listeners find you?
Roger: Best thing to do is message me on LinkedIn, and then I'll give you my contact information and we can go from there. I'm still working out what I should do next, and I haven't made a decision yet on the best way to implement it.
Dori: Let me suggest back to you, from your own advice, a Substack.
Roger: Yeah, no, I've definitely thought about that. One of the things I thought about, inspired by A History of Rock Music in 500 Songs: I started listing them, and my life is in 17 databases.
CL: Wow.
Roger: I don't think I could do four hours on any one of them. But I have gone, you know, at different times, I was using different ones of these. So I've thought about that. We'll see.
Dori: Yeah. Thank you so much for coming on, Roger. This has been a fantastic conversation. Really appreciate it.
Roger: Thank you both. These were great questions.
CL: Thank you so much, Roger.
Roger: Take care.