Data Renegades
52 MIN

Ep. #3, Building Tools That Shape Data with Maxime Beauchemin

about the episode

On episode 3 of Data Renegades, CL Kao and Dori Wilson sit down with Maxime Beauchemin. They explore the origins of Airflow and Superset, the evolution of open source in the data ecosystem, and how today’s tooling reshapes the role of the data practitioner. Max also shares a forward-looking perspective on agentic workflows and how AI is accelerating everything from BI to pipeline development.

Maxime Beauchemin is the creator of Apache Airflow and Apache Superset, two of the most influential open-source tools in modern data engineering and analytics. He is the founder and CEO of Preset, the commercial company behind Superset. With a career spanning Facebook, Airbnb, Lyft, and Ubisoft, Max is known for shaping the data engineering discipline and pushing forward the future of open source BI.

transcript

Dori Wilson: Hi, I'm Dori Wilson. Welcome to another episode of Data Renegades. I'm Head of Data and Growth at Recce.

CL Kao: I'm CL, CEO of Recce. Today our guest is Maxime Beauchemin, or simply Max. I'm super excited about our conversation. He's the creator of two very popular open source projects, Airflow and Superset, both turned into huge commercial successes.

His work has actually defined the data engineer role, which has now evolved into a broader spectrum of different roles, and brought functional programming concepts to traditionally very stateful and ad hoc data workloads. He truly is a pioneer in the data space.

I met Max again this year at the 2025 dbt Coalesce conference in Vegas and we chatted about the future of the industry, agentic workflows, and a bunch of other things, and I knew I wanted to invite him on the show. I'm excited to speak more. Hello, Max.

Maxime "Max" Beauchemin: Hello, it's a pleasure to be here and to be officially now a data renegade. That's an interesting title, you know?

CL: Yeah, totally.

Dori: I think you always have been one. We're just kind of, you know, we're the official tag now.

CL: Great. Super excited. So can you take us back to the beginning? What problem first pulled you into the data space?

Max: Oh my god, that goes way back. I'm not even sure if I should start at the beginning there, but I think it's interesting, so I started my career at a company called Ubisoft, that's kind of in an interesting position right now, but they were really big in the early 2000s and I started at Ubisoft Montreal, super fun company, young, fun, dynamic, and then within six months or a year they're like, "Do you want to go work in Paris?"

So I'm from Quebec originally, so I grew up in Quebec City, I speak French, so I was like, "Heck yeah, I want to go." I was like 20 years old, I kind of skipped college, I didn't really go to college or anything like that, so the internship turned into a, "Let's go to Paris for six months," which turned into three years.

And then at the time when I joined, they were starting to look into building a data warehouse, or just like analytics solution, so that was early on, but there was a fair amount of like tooling and precedents. So they had at the time, SQL Server analysis services or the SQL Server stack, and then I picked up the Ralph Kimball book on data warehousing, read through it, and it kind of made sense to me or I knew about entity relationship diagrams and how to do some data modeling and kind of learn that, you know, analytics, data modeling was different from building apps data modeling at the time.

And then built my first data warehouse with a very small team, basically me and some other good dude, Mustafa, so I'm going to say hello if you're listening, but Mustafa and I, I took on more of the OLAP cube side of the house, so SQL Server Analysis Services, but contributed to the warehouse. He built a lot of the warehouse there and then put some tools on top of it, some interesting tools at the time.

Something people don't realize is software goes through phases and every five, 10 years we kind of rewrite the same stack on a new premise.

And you know, there's been probably two premises, so that was the desktop era and there was some really cool stuff like in the desktop era, and OLAP was actually a pretty cool, functional thing. Like language is like MDX, that's like multidimensional SQL, was pretty evolved at the time, and modeling with cubes was kind of a thing.

I also picked up, for the financial reporting, they had bought Hyperion Essbase, which is an antique at this point. I think Oracle acquired Hyperion, but Hyperion was a multidimensional OLAP cube product with write-back and some things we haven't seen in OLAP, you know, since then pretty much.

CL: Oh wow, you were really pretty early into all this.

Dori: Yeah, I was going to say to date the period, what years would you put this in, what timeframe?

Max: 2000 to 2004, you know, but I won't tell the whole story, but I was at Ubisoft Paris for three years and at some point they're like, "Hey, we need to kind of clone the work that we've done here for EMEA, or for the European Zone for the North American Zone, so like do you want to go to San Francisco for a little while?"

That's what brought me to San Francisco and rest is kind of history, but I went on to Yahoo before it fully kind of sucked, you know? Or it started sucking at the time where I was at Yahoo since 2007 and then went on to back to Ubisoft, Facebook, Airbnb, Lyft, and then started a company.

CL: Wow, so I remember working with Ubisoft actually, so in the early 2000s I was working on a version control system prior to Git, and Ubisoft and Apple were the two major companies using that distributed version control system I created, so we might have some overlapping friends there. And then, what was the spark that led you to create Airflow and Superset?

Max: So I was at Facebook for a while, so 2012 to 2014 is when I was at Facebook and it was like a super cool like Cambrian explosion of data tools internally, so some of which were externalized through open source, which was like super inspiring to me, right?

So Apache Hive is, Apache Thrift, I forgot all the open source data projects from that era at Facebook, but I was in the middle of that and it was like thousands of engineers building stuff like heavy hackathon culture, "move fast and break things," so it definitely bloomed from there, like going from like a corporate world, jaded engineer, to like, "Oh my god, I can build all sorts of stuff," and kind of seeing what can be built.

So I started talking with folks at Airbnb and I was like, "If I join, could I work on, you know, do you have anything like Dataswarm?" Dataswarm was one of the data pipeline tools internally at Facebook, so not open source, just a really cool tool that ultimately became the inspiration, or part of the inspiration, for Airflow, right?

Like I'd used all sorts of different ETL tools at the time from Informatica, SQL Server, report, what's it called? Not reporting services, but SSIS, Integration Services, had used Ab Initio a little bit, and so like some of these different tools.

And then I took kind of everything I had in my head, and from Dataswarm some of the ideas of like pipeline as code, Python-heavy stuff, and then I was like, "If I join Airbnb, will you let me work on what became Airflow?" And they're like, "Sure, you can come and open source it."

So that was the prompt for it. And then for Superset itself, it came within a year of Airflow. The issue is that we were playing with Druid, Apache Druid for real time, like OLAP type use cases, and we had a large Presto cluster, and then we had not enough Tableau licenses for everyone and Tableau didn't speak to these things so we had to load this data into Tableau extracts and Tableau extracts don't scale very well, or at least they didn't at the time.

So I was like, hey, maybe we can just, you know, I think there was a three day hackathon and I was like I'm going to build a thing that can query Druid, and ultimately Presto, and then that turned into Superset.

Dori: So I worked at Uber, and we also had an incredible built-out data stack. When you were talking to Airbnb, what was the pitch to let them open source Airflow?

Max: Well, I think it was like, "Okay, what do you have? I'm going to do some data engineering work here. What kind of tools do you have?" There's this thing called Chronos that was based on Mesos. It sounded like everyone hated it. And I was like, "If I'm going to come here, I'm going to need something like Dataswarm. I want to write dynamic pipelines as code."

And you know, I was like, "Well, I'm not going to work there under these conditions." They really wanted to hire me at the time, like the ex-Facebook folks were like, "Hey, why don't you come?" So I was like, "Well, if you want to have me, that's like a part of the, but not necessarily part of the conditions, but it's just a line, right? Like if I come in, can I build this?"

I think the trade was, "Okay, but you're going to have to own these two like big subject areas in the data warehouse, or you're still a data engineer building data pipelines, but if you want to build a tool to support your own use cases, but you're going to have like internal customers you got to serve."

So I started building Airflow. I think within like a couple of months, I was running the pipeline for my two subject areas and refactoring them inside Airflow, and Airflow was like productionized internally at Airbnb within like two or three months.
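To make the "dynamic pipelines as code" idea concrete, here is a minimal sketch in plain Python. It is not Airflow's or Dataswarm's API: the function names, task names, and subject-area names below are all hypothetical, chosen only to show how a loop can generate a dependency graph that is then run in order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_pipeline(subject_areas):
    """Generate a task graph in code: {task_name: set of upstream task names}.

    The extract/transform/load naming is a hypothetical convention.
    """
    dag = {}
    for area in subject_areas:
        extract = f"extract_{area}"
        transform = f"transform_{area}"
        load = f"load_{area}"
        dag[extract] = set()          # no upstream dependencies
        dag[transform] = {extract}    # runs after its extract
        dag[load] = {transform}       # runs after its transform
    # One final task waits for every subject area to finish loading.
    dag["publish_report"] = {f"load_{area}" for area in subject_areas}
    return dag


def run(dag):
    """Walk the graph in dependency order and return the order used."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        print(f"running {task}")  # a real scheduler would execute work here
    return order


# Adding a subject area is a one-word change to this list; the graph follows.
order = run(build_pipeline(["customer_success", "core"]))
```

The point of the sketch is only that the pipeline's shape is ordinary program output, so it can be generated, reviewed, and versioned like any other code.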

Dori: Yeah, did you have a team helping you out with that? Because that's a lot of coverage.

Max: Yeah, yeah, yeah. Yes. I just had like two subject areas. There were folks at the time, Aaron Keys, Johnson Parks, and the team grew, but like these guys were working on the core data sets for Airbnb and I was working on like customer success and forgot what else, you know, and some on core, but I found a super cool team. I think they were called like the Etliens, as E-T-L-iens, you know, like aliens.

Dori: Yeah, yeah, yeah.

Max: But they were like super great. I mean, they are super great data engineers. They were like the first users of Airflow who were sitting next to each other and they're like, "Oh my god, this is cool."

Yeah, I think it really helps to have like people building actively and it's like, "Here's the tool."

And it's like, "Hey, Max, like we need the feature to do that, you know, can you make backfill do this? Or can you add the button to like clear a sub dag or whatever?"

And I'd be like, "Okay, well," you know, like it'd be, and checked in by tomorrow.

CL: So both projects kind of started as an internal tool, and then you have like internal customer help like perfecting that, turning it into something popular and then open source, and then, so when you were building that, did you know that it will kind of change the industry, or just like scratch an itch?

Max: I was going for it, I think, it was my first like real open source project, but I was pretty inspired by the work I had seen done internally at Facebook, and I was certainly going for it 'cause I used to maintain like a list of all the companies I visited and kind of shilled Airflow at, and there's like more than 50 companies.

I went on a mission around the Valley to kind of preach the value of Airflow early on and that probably really helped. I mean, there's all sorts of things that contribute to making something successful, or like building a community and building a great product. There's timing and fit; they call it project-community fit, as a product-market fit kind of equivalent. So the timing was good, the tool was good, it solved a real problem at the time, and it was the right level of abstraction. I think if you come with something with more constraints and guarantees, like, you know, where data engineering is going now with potentially like Airflow 3 and Dagster and Prefect, you know, like these things are more advanced, but I really feel like Airflow was the right abstraction at the time.

Like I want like a smart cron type of thing and, you know, it was the right time for DAGs, but maybe it was not the right time for a lot more structure than that. But yeah, so I went through some evangelism to anyone who wanted to hear, and people were reaching in, people were reaching out to the CTO at Airbnb.

It's like, "Hey, can you intro us to this guy who's building this thing?"

So I was like, "Okay, I'll jump in my car and go to, you know, go to LinkedIn, go to Facebook, go to small and big companies and, you know, show what Airflow can do."

Dori: So you mentioned you started having like inbound equivalent for your open source for Airflow, and you pitched it to 50 companies. What do you think was the first thing that built the equivalent of traction of really getting people invested in the community?

Max: I mean, the core product. Look, it's a combination of things, but adding a GitHub. I know like getting to Apache, at some point, was probably a good validation. At first, it was like github.com/airbnb/airflow. Actually at first, it was like my org, you know, mistercrunch/airflow, then airbnb/airflow, and then ultimately apache/airflow.

So that helped, but I would say, I think, you know, it's like having a good repo being like, you know, I often say that building open source is like one interaction at a time on GitHub, but like good docs and being present generally, and giving talks.

I remember just being, I'm not like a public speaker type of person very much. I've made some progress in that area, but I used to be terrified, but I was like, "I'm terrified, but I want to go tell the world about it 'cause I'm excited, I want for this thing to get popular, get big, get relevant, right?"

I think the big thing with open source is when you get hired by a company to do work, your scope of impact is limited to whatever you can do within that company. So as big as some of these companies are, you're not going to be able to have like an impact on the world, you're just like scoped in.

Open source, I saw as an opportunity to like find more relevance, you know, just as a mission to be like, I want to build something that matters kind of deal.

CL: Right, and then at some point, both got commercialized, offering commercial solutions to customers, but it seems to be quite different 'cause you weren't really involved in Astronomer and other providers, right? At Preset, you are the CEO and founder. Can you tell us a bit more about how that happened?

Max: Yeah, the question everyone asks is like, "Why didn't you start an Airflow company?"

I think my passion had moved on and I was done with data pipelines. I was like, "If I have to write another line of SQL in my career, I'm going to have to murder somebody."

But like I was excited to work on the meta problem, like writing pipelines, I was like, I was over it. Writing Airflow was a much more interesting problem. But then, ultimately, like even that, I was like, "Oh god, I just got that yucky feeling thinking about data engineering."

I was like, "I'm just done with that. I'll go do something else." And then data visualization is like super interactive visual. Like my hobbies at the time, I was doing a bunch of digital art for things like Burning Man or just for fun, like writing code that produce like flashy, colorful, interactive things.

So Data Viz seemed like a much more fun space, and then I really wanted to disrupt BI, like BI is just like this old kind of shitty proprietary thing, so I was like, "Let's just take a big like hammer." How do you call the big hammer with the big, you know, in French it's "une masse." I was like, I just want to go smash at the BI vendors with open source and-

Dori: Still want to do that. I don't think, unfortunately, much has changed.

Max: Still swinging, you know, but I mean it turns out it's a lot harder to disrupt a market like that. There's a lot of product surface in BI, so it's taking a little bit more time to really fully take over, but we're certainly chipping at it and taking some bigger and bigger chips as we swing, so, you know, it was a bigger challenge and maybe more fun.

I've always been really interested in like UX and visual, colorful, interactive things on screen as opposed to mountains of SQL tech debt, you know.

Dori: What do you mean? You just don't like looking at just pure table outputs and turning that into business insights? Your stakeholders don't like that? I'm shocked.

Max: I like the analysis side. I think I'm a good data modeler, so that's kind of a lost art, so I still do a lot of our data pipelines internally at Preset 'cause I'd rather just do it than hire someone to do it. So I'm still a practitioner in all sorts of ways, the modern data stack, you know, startup style, but maybe that's like, what, like five or 10% of my time, so that's good.

And it's all directed to like, because I have the business questions that I want to answer myself, like, you know, so it's actually fun if we start logging some new like product behaviors and stuff like that. I might be the first person to plot it, say like, "How's our AI feature rolling out?" The discovery aspect of data is still super fun to me and then I don't mind creating a few dbt models or an Airflow pipeline to get there.

CL: And no murders involved, right?

Dori: So you mentioned data modeling is a lost art. Can you tell us a little bit more about what you mean by that?

Max: I don't know if it's a lost art, or maybe it's just not an art.

So there used to be books on the topic, like people would read the Ralph Kimball books. Maybe people still do, arguably, but it feels dated. There hasn't been a renaissance in data modeling.

I still don't think like star schema is the whole story. I wrote some articles on data modeling, so for a while I was passionate about functional data engineering, so I wrote some articles, gave some talk on, you know, applying the concepts of functional programming to data engineering and somehow it clicks. I think it just worked.

It was like things we were doing implicitly, you know, at Facebook and Airbnb, and then I started working on Redux. Working on the front-end, actually, I learned more about functional programming 'cause that was en vogue, you know, with front-end engineering at the time, learning all about Redux and the concepts.

I was like, "Oh, this is what we're doing. We just need to kind of put it together and establish a bit of a narrative." The parallels kind of clicked in.
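One common reading of "functional data engineering" is that each task is a pure function of its input partition and writes its output by overwriting a whole partition, so reruns and backfills are safe. The sketch below illustrates that reading only; the dictionary-backed "warehouse", the function names, and the sample rows are hypothetical and not taken from the conversation.

```python
def transform(rows):
    """Pure function: the result depends only on the input rows."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return [{"user": user, "total": total} for user, total in sorted(totals.items())]


def run_partition(source, target, ds):
    """Rebuild one date partition by overwriting it, never appending to it."""
    target[ds] = transform(source.get(ds, []))


# Hypothetical source data, keyed by date partition.
source = {
    "2024-01-01": [
        {"user": "a", "amount": 5},
        {"user": "a", "amount": 7},
        {"user": "b", "amount": 1},
    ],
}
warehouse = {}

run_partition(source, warehouse, "2024-01-01")
first_run = list(warehouse["2024-01-01"])

# A retry or backfill of the same partition produces the same result,
# because the task overwrites its partition instead of mutating it in place.
run_partition(source, warehouse, "2024-01-01")
```

Had `run_partition` appended rows instead of replacing the partition, the second run would have doubled the data, which is the kind of statefulness this style tries to avoid.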

So I talked about that. I think maybe two or three years ago, I wrote something about entity-centric data modeling, which is this idea of deviating or augmenting dimensional modeling, and then if you go entity-centric to say like, "It's okay to put metrics in your dimensions. It's okay to add BLOBs to your dimensions. We don't have to be as strict about fact and dimensions being segmented."

So there's an interesting article there. I don't know if it's been like well-received or if people put it in practice, or some people are like, "Oh, we've been doing that for years. Like why are you even writing about that, you know?" I was like, "Well, no one had written about it, so why not?"

Dori: Yeah. Do you think that's scalable?

Max: I don't know. It's like anything can be or cannot be. It's like anything pushed to an extreme is not good. So you could say, putting all your facts inside your fact table strictly and no facts in dimensions, is that scalable? Or putting all... I think if you push any idea to an extreme, it becomes kind of stupid.

But yeah, I think it is pretty scalable. I think the blog post on the topic tries to cover on that. There's something really interesting with like how it relates to feature engineering too, so some of the ideas came from feature engineering, which is used for ML, but like feature engineering by nature is extremely entity-centric, right?

They don't really know what a fact is outside the concept of an entity, so kind of learning about that and the rise of like BLOB support in data warehouses and databases made it easier to do so. So, you know, started thinking about how do you represent time? All of a sudden, if you have one row per entity, how do you represent time? Been talking about snapshotting dimensions for a long time.

So that's one thing, and then, you know, pivoting time inside a column potentially, or saying like DAU/MAU, you know, in a customer table, it makes sense to have, you know, DAU/MAU, like weekly active users. Those are metrics, but they're super useful.

Just like give me all my customers that have more than 50 MAU without having to do some sort of like subquery, you know, on a fact table to do, so it is like super, super great and there's a lot of stuff that if you do entity-centric data modeling, there's a lot of queries and things that come very naturally that you probably would not bother to do if you have to write a bunch of like much more complex SQL.
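The "customers with more than 50 MAU" example can be sketched with SQLite to show the difference in query shape. The table names, column names, dates, and sample data below are hypothetical; the only claim is that storing the metric on the customer row turns a join against an aggregated fact table into a simple filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, mau INTEGER);
    CREATE TABLE fact_activity (customer_id INTEGER, user_id INTEGER, activity_date TEXT);
    """
)
conn.executemany(
    "INSERT INTO dim_customer (customer_id, name) VALUES (?, ?)",
    [(1, "acme"), (2, "globex")],
)
# Hypothetical activity: 60 distinct users for acme, 10 for globex.
activity = [(1, u, "2024-01-15") for u in range(60)] + [(2, u, "2024-01-15") for u in range(10)]
conn.executemany("INSERT INTO fact_activity VALUES (?, ?, ?)", activity)

# Strict dimensional style: every consumer aggregates the fact table and joins.
strict = conn.execute(
    """
    SELECT c.name
    FROM dim_customer c
    JOIN (
        SELECT customer_id, COUNT(DISTINCT user_id) AS mau
        FROM fact_activity
        WHERE activity_date BETWEEN '2024-01-01' AND '2024-01-31'
        GROUP BY customer_id
    ) a ON a.customer_id = c.customer_id
    WHERE a.mau > 50
    """
).fetchall()

# Entity-centric style: the pipeline materializes the metric on the entity once...
conn.execute(
    """
    UPDATE dim_customer
    SET mau = (
        SELECT COUNT(DISTINCT user_id)
        FROM fact_activity f
        WHERE f.customer_id = dim_customer.customer_id
          AND f.activity_date BETWEEN '2024-01-01' AND '2024-01-31'
    )
    """
)
# ...and consumers just filter on it.
entity_centric = conn.execute("SELECT name FROM dim_customer WHERE mau > 50").fetchall()

print(strict, entity_centric)
```

The aggregation still has to happen somewhere; the design choice is to do it once in the pipeline rather than in every downstream query.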

Dori: A lot of joins. Kind of gate keeps the data a bit more because you have to understand multiple tables versus a couple.

All right, a question on this topic, and then go onto the next one. You mentioned kind of your inspiration for visualizations, including Burning Man. Was there any one project or something that was inspirational to you?

Max: Yeah, well, so what I was talking about is like at the time, I was going to Burning Man and I would-

Dori: Which is the most Silicon Valley thing, also, by the way.

Max: Oh yeah, so if people don't know, look it up and you're probably going to be like, "What the heck is this thing?" But it's a big festival in the desert and it's bring your own contribution, so it's like made by the people for the people, it's non-commercial, but you know, people bring art and costumes and try to make it fun, so it's a big week-long party.

And then I was doing some electronic type art, so making some projects of like flashy glowing things. I have some nice stuff on my old blog, but was playing with like digital arts in general, things like a camera or kind of psychedelic mirrors with a camera where when you move, it leaves like trails so you can kind of look at yourself and as you move it, you know, generates colors or particles and things like that.

So I've got a bit of a portfolio of like digital arts either in screen or, you know, with LEDs and things like that. So I built a bunch of stuff there, which is all like visual and interactive and real world in that case, but if you join data engineering and this hobby, maybe somewhere in the middle sits Data Viz.

Dori: The interactive nature, certainly I see that, how that plays into data modeling when you get BI.

Max: The colors, animation, in some cases, you know?

Dori: Yeah.

Max: So yeah, color, animation, interaction, you know, so it's like probably, in the conceptual space, maybe the multidimensional LLM space, like Data Viz and data engineering, or if you have like digital arts somewhere in that multidimensional space and data engineering and then other, somewhere in the middle you might find Data Viz.

Dori: Yeah, we talked a lot about Airflow, but we haven't talked as much about, and you talked about the data visualizations with Superset, but how did you get started with the open source community there and building it out?

Max: So I think I started at Airbnb similarly, and then there was an internal project and I was like, "Should we Apache it?" Or I think people from, I forgot which company came over and they're like, "We want to contribute, but we'd like for it to be Apache before we do," which turned into no contribution, but I think it's a good thing.

I've got mixed feelings about software foundation, but Apache is great. It provides good guarantees. I think people still care about governance, and adding some governance guarantees around open source.

It's like, "Oh, let's put it in Apache too. It doesn't hurt." And then maybe kind of thinking in the future if I want to start a company around it in the future, not a bad thing for the IP to be in neutral, in the Switzerland of software, at Apache.

So I was like, "Okay, let's do that." And then I think we announced it, you know, try to do things right in terms of announcing it, giving talks, writing blog posts, you know, all the normal shilling.

Dori: Yeah. True.

CL: But from there, like you actually started a company to support like Superset users and customers, right? And then how do your perspective kind of shift from open source community building to company building?

Max: Yeah.

My life's mission for the past 10 years has been to take over BI with open source and I thought that building a company was the best way to support and accelerate that.

You know, at Airbnb, if I had stayed, maybe we'd have a team of like two, three, four people working on Superset. On the other side, if I raise money and started a company believing that, you know, I think commercial interest can really coexist positively with open source, right?

But I was like this foster child of an open source project needs a parent type organization around it to structure it and to justify doing some of the things that need doing for this project and community to be successful, and, you know, raising capital from VCs, and the VCs at that point were kind of begging for me to start a company, so I think I was turning 40 at the time and I was like, "Oh man, I'm in Silicon Valley. People are just like, you know, giving me term sheets to go straight into an A round, raise a bunch of money. Like if I don't do this, I'm going to hate myself forever."

I mean, I was on a path to like chase IPOs, right, so I probably would've been Airbnb to whatever the hot next IPO was at the time, so I probably would've been at OpenAI, Anthropic, or someplace like that ultimately. But I think I would've regretted not starting a company if I had not done it.

CL: Right.

Dori: For anyone listening, considering being a founder.

Max: Oh, I try to put people to the test when people are like, "I'm thinking about starting a company." I think the best service you can do to them is to challenge them on it, to make sure that they know what they're getting into. So I'll be like, "No, you don't want to do that."

And if it stands, I'm not saying that I don't want them to start a company, I just want to only do it if they have strong convictions to do it. I think it's kind of a healthy thing for people to like, they need to survive someone like Max telling them not to make the jump.

'Cause everyone is like, "Yeah, go start a company, you know, this is this Silicon Valley Dream, the American Dream."

I'm like, "Ah, well make sure you know what you're getting into."

CL: Exactly. All right, coming back to like data engineering and how the landscape has changed in the past 10 years, what do you think that's still the hardest part of data engineering today that nobody really talks about?

Max: Yeah, I gave a talk at the Airflow Summit around like the fact that we don't share data pipeline code, right? That's the struggle for code reuse and the data transformation layer, I think is what it was called.

But you know, you go on the front-end, I just built like a front-end app in less than a month and that's like incredible, and it's because on the front-end, there's all these frameworks and toolkits and component libraries and you know, backend and front-end, like all sorts of packages, like you need a date picker, like, you know, plug it in.

In data engineering, there's nothing like that so everyone's reinventing the wheel and every single company is computing, like, their growth accounting pipeline and their DAU/MAU, their experimentation framework and it's, you know, a lot of that stuff feels like tech debt before it's even committed to a repo.

So I think SQL is like super convenient, but like templated SQL really, like I don't know, it's like not good enough for sharing. There's some issues around dialects. Even if I gave you my dbt project, you couldn't use it 'cause you don't use BigQuery and it's just a big mess.

That's a bit of an issue. Maybe that's fundamental, like we're not able to share transformations.

CL: But do you see people trying to solve that problem? Or has this got to be solved from a very different perspective? For example, from the BI layer, to figure out what questions are being asked, or like how do we actually solve that modularity and code reuse problem?

Max: Yeah, well--

So we need to have unified data models. If everyone's data models are different on the input and, potentially, on the output, and their combination of data models is also conceptually different, it doesn't work. So we need to have a unified data model of what I call parametric data pipelines.

But like there's no real way, if you wanted to write a good parametric reusable data pipeline today, until recently, there was kind of nothing to do that. Like you would try to share a dbt project with like templated SQL, maybe Spark, but like, I don't know if you had, if you're like, "Here's a Spark pipeline you can reuse," be like, "I don't run Spark, I don't want to run Spark."

So what's the right medium for people to do that? Like there's no npm package for data engineering. One thing that really helps is like we now, with like Fivetran and Airbyte and other things, like the foundational data integration, like the "EL" in ELT has been standardized.

That could be a good foundation. Like now if you use Fivetran, I use Fivetran to sync, I don't know, my Salesforce or HubSpot, we have the same data model to start with, so we've seen some dbt staging area type projects get shared, which seems like a good place to start, but like, again, dbt was not the right language because it's not multi-dialect, right?

So if I write something in a dialect and it won't work in a different dialect, maybe SQLMesh looked promising too, right? SQLMesh is like you write in one language and it's like dialect agnostic, so kind of promising, but like the building blocks aren't there and then people just don't think that way, you know?

CL: Right, and they're now the same company.

Max: Right, that's it. Now they're all the same. Yeah, I don't know what's going to happen with SQLMesh, unfortunately, I don't have insider information, or like what the future of Tobiko is. We depend on SQLGlot, you know, in Superset we use SQLGlot, so we're hoping the project is going to remain like well-maintained.

CL: Okay, so throughout your career, and then working with data and the data tooling, what is the most painful bug or failure you've seen in production? What have you like learned from that?

Max: Yeah, I think like there's probably two things I can talk about, but one was called Data Apocalypse at Airbnb. I think it was a Friday, and we needed to make some room on HDFS on the cluster, and then someone in infrastructure, whose name shall remain undisclosed, ran some like Hadoop rm -rf command. I think they were doing like /tmp and maybe they wrote "/ tmp", and the NameNode started crashing and then, you know, we had like Cloudera engineers on site coming to try to restore the NameNode and fix things, so that's a pretty bad one.

When I first started writing Airflow, I thought it would be useful to have a command that would be like "airflow resetdb," 'cause you know, you work with like dev environments and sometimes you need to reset your database. Well, someone at Lyft thought they were on their dev environment. They were in production and they nuked their database.

So then there was an Airflow outage and, but Airflow outage set to like restore backup, and then I guess, you know, if all the jobs are input and everything, you know, like the restore the backup and Airflow goes back to work and so it was not that much of a disaster, but we did remove the, I think we did remove the "Airflow reset DB" command from the CLI.

CL: Or a long flag, like hyphen-hyphen "I'm really sure what I'm doing" or something.

Max: Are you certain? You know, here's your database connection. It says production in it. And I think there was a confirmation on the CLI too, but you know, sometimes you feel like you know what you're doing, you're in a hurry, or you know, I use this command every day, I forget to change my environment variable.
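The failure mode Max describes, a yes/no prompt answered on autopilot, suggests a stricter guard. The following is a minimal sketch, not Airflow's actual behavior: the function names are hypothetical, and the "prod" substring check is an assumed heuristic for spotting a production connection string.

```python
def looks_like_production(conn_url: str) -> bool:
    """Assumed heuristic: flag any connection URL that mentions 'prod'."""
    return "prod" in conn_url.lower()


def confirm_reset(conn_url: str, typed_confirmation: str) -> bool:
    """Return True only when the operator's confirmation is strong enough.

    For non-production URLs a plain 'y' or 'yes' is accepted. For
    production-looking URLs the operator must retype the database name,
    which is harder to do on autopilot than pressing 'y'.
    """
    # Last path segment of the URL is treated as the database name.
    db_name = conn_url.rstrip("/").rsplit("/", 1)[-1]
    if looks_like_production(conn_url):
        return typed_confirmation == db_name
    return typed_confirmation.strip().lower() in {"y", "yes"}


# Hypothetical connection strings, for illustration only.
prod_url = "postgresql://airflow@prod-db.internal/airflow_prod"
dev_url = "postgresql://airflow@localhost/airflow_dev"

habit_on_prod = confirm_reset(prod_url, "yes")
deliberate_on_prod = confirm_reset(prod_url, "airflow_prod")
habit_on_dev = confirm_reset(dev_url, "y")
```

A habitual "yes" is rejected against the production URL, while the same habit still works on a dev database, so the extra friction lands only where a mistake is expensive.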

CL: Right, and then I'm going to switch gears to data teams and data culture, as you work with a lot of different data teams and probably have your own data teams within the company. What do you think is open source's role in shaping a healthy data team?

Max: Well, I think there's a theme, you know, and I think I sit in a place in the data universe that is just my neighborhood in the data universe, and people care a lot about open source where I sit, but it's probably kind of self-selected, and if I was to move to the Midwest and go sit at a Microsoft shop, they would be like, "What? What are you talking about?"

But a theme that I hear in my bubble is data sovereignty. I think people don't want to be stuck with proprietary vendors that lock you in.

So I think if you use proprietary software, you're just at the mercy of your proprietary vendor, you know, you have to influence them to build the stuff that you need, where, in open source, you can always take your stuff and run it or contribute the fix you need or fork things.

For me, there's no going back in terms of using open source everywhere we can, wherever it's competitive. We're even ready to take a product that's not as complete or great just for that guarantee to be true, and I think that's becoming more and more true. I think the age of AI accelerates the win rate of open source too.

So software's eating the world, open source is eating software, then AI's eating everything. Maybe.

But I think because the models are trained on open source, there's a more natural bias towards working with and using open source, from the users' and their models' perspective.

CL: Right, totally, and I think you mentioned that you're working on a new project, and I know there's Claudette in Superset development, and we just touched on AI, so I'd love to hear your thoughts about what the agentic workflow for data looks like in the future.

Max: Yeah, there's so much thought on that to unpack 'cause I've been kind of in a tunnel, just living in Claude Code for the past nine months. I started coding like 10 hours a day again, just 'cause I woke up one day and I felt like I had superpowers. If you're operating at 10x, I mean, it's not a great analogy, but if a surgeon was to wake up and they could perform 10x the surgeries, they'd be like, "All right, line them up. You know, I've got superpowers, let's save some lives."

We're not saving lives in data, but I definitely feel I have this sense of duty, like, "Oh my god, if we're flying, let's go fly."

So that's interesting. Now, Claudette is kind of an interesting thing. It came from a pain point of like, "Oh, we're getting so fast that I can work on multiple things at the same time." I was going to have multiple AI windows, and I was cloning repos multiple times, and then I discovered Git worktrees, which allow you to have multiple branches checked out at once.

You know, I've been using Tmux forever, so I organize my Tmux to have one tab per project, and then I have multiple Superset environments running at once, so I can have, you know, this project on port 9001, this one on 9002, and I can have multiple Dockers.

I got a Mac Studio so I could run dozens of instances of Superset without my laptop getting hot or overheating, and then Claudette was this idea of, you know, just a little CLI that allows you to stay sane while having like multiple instances of Superset running. You can look it up, there's a blog post. It's an open source project.

You might want to consider doing that for your own repos or projects. It just makes it really easy for people to create a work tree, give it a name, assign it a port, start and stop an environment, nuke a Docker database, detach a volume, like all these commands instead of being, you know, like, "Oh, what's the command again to do this?"

You're just like, "Claudette do this, Claudette do that." And then everyone at Preset and I think in the Superset community started using this workflow, and then, you know, we're mostly staying sane while working on like two, four, six things in parallel, and it seems to work pretty well.
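The bookkeeping Max describes (one named worktree per task, each with its own port) can be sketched in a few lines. This is an illustration only and not Claudette's implementation: the class name, the `../worktrees` directory layout, and the starting port of 9001 (borrowed from the example above) are all assumptions. The only real command referenced is `git worktree add -b <branch> <path>`, and the sketch builds the argument list without running it.

```python
from dataclasses import dataclass, field


@dataclass
class WorktreeRegistry:
    """Hypothetical registry mapping worktree names to local ports."""

    base_port: int = 9001
    ports: dict[str, int] = field(default_factory=dict)

    def assign(self, name: str) -> int:
        """Return the port for a worktree, allocating the lowest free one."""
        if name in self.ports:
            return self.ports[name]
        used = set(self.ports.values())
        port = self.base_port
        while port in used:
            port += 1
        self.ports[name] = port
        return port

    def release(self, name: str) -> None:
        """Forget a worktree so its port can be reused."""
        self.ports.pop(name, None)

    def worktree_command(self, name: str, root: str = "../worktrees") -> list[str]:
        """Build (but do not run) the git command that creates the worktree."""
        return ["git", "worktree", "add", "-b", name, f"{root}/{name}"]


registry = WorktreeRegistry()
first = registry.assign("fix-legend")    # hypothetical task names
second = registry.assign("new-chart")
command = registry.worktree_command("fix-legend")
```

Keeping the name-to-port mapping in one place is what lets a small CLI answer "which environment is this?" without the developer having to remember it.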

CL: Yeah, this is so inspiring. In the traditional developer experience, people would optimize for their own iteration, right? You've got a command to run every test and all that as soon as you make changes. And now this seems to be really 10x, where we just spin up an agentic environment for the agent to work on stuff.

Max: Yeah, you know, I think that's the Holy Grail. Something I'm working on, I'm going to launch something soon in the agentic coding space, kind of collaborative workspaces, you know, call it "Figma for agentic coding," if that makes any sense, but check it out maybe in a few weeks, or depending on when you listen to this, maybe it's sometime in the past.

But one of the Holy Grails for me is shared dev environments. There's so much pushing and pulling where you're like, "Okay, I'm going to work on this feature, open a PR," and someone reviewing the PR is either working blind, or, you know, Superset is super visual, so they have to pull the branch, fire up a Docker, test the thing, and if you're lucky, it's the same day, maybe it's the day after, and they're like, "Oh, this button should be like six pixels to the left."

And you know, there's so much friction and async built up, and it made sense when we were switching branches, working on one thing at a time and coding for entire days before the PR was open, but when the PR takes 10 minutes to write, you don't want the code review, or someone pulling it and firing up an environment, to take a day, right? That's just slowing us down right now.

So I'm thinking of a place where people can collaborate on worktrees in the same Docker environments and prompt the same AIs, and maybe use different AIs, to be able to say, "Oh well, Claude Code failed at that. I'm going to try Claudette, or I'm going to try Gemini, Claude Code, and Codex in parallel on different worktrees and see which one builds it faster."

I'm just trying to augment my carbon footprint and GPU time. No, but there is something, like you literally feel like if you're not taking max advantage of your Claude Pro Max plan, it's like, what are you doing? You know, you should be reaching those GPU limits every four to six hours, period, you know?

CL: Yeah, I feel this is becoming very common for software development, but for data development, we have not had these building blocks for faster iteration or more agentic, parallelized development. What do you think is causing that?

Max: Yeah, so I am the data team for Preset, so I don't spend a whole lot of time there, but I recently rewrote the sessionization pipeline for our marketing website. I fired up Claude Code, and the dbt CLI, or at least the old one, maybe the new one does, doesn't have access to SQL.

It can create a model, but it cannot read sample rows or run queries directly, so it doesn't have access to the data, which is a blocker. So I wrote this CLI for Superset called Sup, as in "what's up," S-U-P, and you can say sup sql, you know, select whatever, and it will run the SQL, and you can say --json or --csv and it will return things.

So I gave Claude Code Sup and the dbt CLI and I'm like, all right, let's go iterate, let's write the sessionization pipeline, and it did very well. Claude Code is good at, I mean we knew that from GPT-3.5 onwards, it's good at writing SQL, it understands dbt, the docs are great, it's fully trained on it. So I haven't spent a lot of time there, but I'm thinking analytics engineering, like software engineering, is getting disrupted just as fast, if not faster, and yeah, fire up Claude Code or Codex or Gemini.

I mean, I can only really vouch for Claude Code, but tell it where your dbt project is, or it should know that you have dbt, and that you have something to run SQL with, not to visualize, but to get some rows, you know, like what are the distinct values in this column so I can write a CASE WHEN statement. Or what's the distribution of this column? What's the percentage of nulls?
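The profiling questions Max lists (distinct values, distribution, percentage of nulls) are plain SQL that an agent can run once it has query access. A minimal sketch follows, using Python's built-in sqlite3 as a stand-in for the warehouse; it does not use Sup or dbt, and the `page_views` table, its columns, and the `profile_column` helper are hypothetical.

```python
import sqlite3


def profile_column(conn: sqlite3.Connection, table: str, column: str) -> dict:
    """Row count, null percentage, and value distribution for one column."""
    # Identifiers cannot be bound as query parameters, so quote them.
    t = '"' + table.replace('"', '""') + '"'
    c = '"' + column.replace('"', '""') + '"'
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) FROM {t}"
    ).fetchone()
    distribution = conn.execute(
        f"SELECT {c}, COUNT(*) FROM {t} WHERE {c} IS NOT NULL "
        f"GROUP BY {c} ORDER BY COUNT(*) DESC, {c}"
    ).fetchall()
    return {
        "rows": total,
        # SUM over an empty table yields NULL, hence the `or 0`.
        "null_pct": round(100.0 * (nulls or 0) / total, 1) if total else 0.0,
        "distinct_values": distribution,
    }


# Tiny in-memory sample standing in for real web traffic data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (session_id TEXT, channel TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("s1", "organic"), ("s2", "paid"), ("s3", "organic"), ("s4", None)],
)
profile = profile_column(conn, "page_views", "channel")
```

With the distinct values in hand, writing the CASE WHEN branches for a model becomes a matter of enumerating what the data actually contains rather than guessing.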

Like it can run these things and you'll see the agent run a bunch of SQL, write SQL, sessionize your stuff. It's great. I think it's getting disrupted as fast as other things, though we see a lot of SWEs that are parallelizing workflows, writing MCP servers, and just all-in, like I am all-in on agentic coding.

I think it's going to take a little longer for data engineers and analytics engineers to just be like, "All right, I'm AI coding all the time. I want a max pro plan. I'm going to burn some GPU cycles today, let's get cooking." But I think it's going to come quick.

It has to come quick, or someone's going to do the job of like five or 10 people, and it better be you than one of the five or 10 other people. I mean, I don't want to make it like a "Hunger Games" thing, but it's like, yeah, get on agentic coding all day, every day, or you're behind.

Dori: Yeah, okay, there's a question I want to get in that kind of returns to something you brought up earlier. We're seeing more consolidation in the data space, with Fivetran acquiring dbt and Tobiko, which of course makes SQLMesh.

How do you see the data space evolving in the next couple years, and do you see this consolidation, I think you kind of touched on this, like helpful or hurtful long-term for data teams?

Max: There's maybe one way to look at it. I forgot who I was talking to at Coalesce, but you know, this idea stuck with me and they were saying like, "There's two ways that the VCs make money, right? Like one is, or it's always by disrupting a market, like there's two plays."

One is to deconsolidate a market. So there are some established vendors, call it, you know, Microsoft, Tableau, whatever, like what the data stack looked like in 2015 or so. There's deconsolidation, where you go and invest and create a bunch of disruptive startups and innovate. And then there's consolidation, so you see these cycles.

I think the modern data stack is just the marketing name for deconsolidation in the data space. It needed it, it needed to be disrupted, right? And I'm not talking smack against the VCs. Disruption is good, investment drives innovation, so we've seen a huge cycle of hundreds of data companies getting funded.

Now, you look at those logo slides for the modern data stack, and you're like, okay, what am I supposed to buy, what works with, wait, what? So I think we're officially in a consolidation cycle: let's figure out which of these companies did well, worked well together, and, you know, verticalize some of them.

I mean I would rather have more open source and more defederated stuff. Personally, I think more choice, but too much choice is like no choice, right? You go to a supermarket and there's like 50 kinds of ketchups, you're like, you just don't know what kind of ketchup you want and you might just not buy ketchup because you're like, okay, decision paralysis.

Dori: Yep, yep, yep. One of our very last questions, we'll get into our wild card kind of data debug questions. What is something we should have asked you but didn't?

Max: Oh, goodness. I mean, the open-ended question is, what's a data practitioner's job in five years? And I don't know, I really think the world is changing fast. Get on the train and get moving. Maybe the most immediate thing is, if you're not using AI all day, every day, with an agent, if you're writing any code that's not written by an AI, I don't know what you're doing.

I don't know where that all takes us, you know? But one thing that's clear is the whole "software is eating the world," the digital transformation from like 2010, I think that becomes even more true. Tech eats other professions, but then, you know, tech eats itself, and then AI eats everything.

But at least being on the tech side, I think, if you look at most professions, they're going to be more technologized, you know, and at least as a software engineer you can contribute to that. I don't know, otherwise we can all go and be plumbers and electricians or artists.

Dori: We're all going to go to Burning Man, have our own projects.

Max: Yeah, I'm hoping for like universal income, you know, potentially as one solution. I mean that stuff is true, right? Like the cab drivers, truck drivers, you know, that's a lot of people to figure out what to do with.

And the tech sector, the art sector, like all the information workers are-- I think AI is already getting good, real, real fast. The one big brake on that I see is just human adoption.

So everyone could be, should be using AI all day, every day, arguably, and that's not the case, right? It's going to take a moment for some companies. I spoke to people at the Airflow Summit and I was like, "So what do you use for agentic coding?" And they're like, "Oh, we're not allowed."

CL: Oh.

Dori: Interesting.

Max: They're like, "We don't, we can't, at my company it's not approved yet." So I was like, "Oh, it's time to flip the switch." Or maybe you have really good job security, or maybe quite the opposite, I don't know.

CL: Cool, so before we wrap, we're going to put you in the Data Renegades Round, quick fire questions, short answers, are you ready?

Max: Hopefully. I don't know, let's see.

CL: All right, first programming language you love or hate or have strong feelings about.

Max: Python, love, SQL, don't love, but it's okay.

CL: Okay, tabs or spaces?

Max: Oh, definitely spaces.

CL: Biggest bug you've ever shipped into production?

Max: Oh, yeah, I'm going to pass on this one. Too many, all day, every day. But it's not just about quality and failure rate, it's about recovery rate, so-

CL: Mm-hmm, MTTR.

Max: Yeah, fast at fixing.

CL: What's your go-to dataset when you're like testing like a concept or visualization or whichever data tools?

Max: I mean, we have a really good revenue dashboard, like a SaaS revenue dashboard, and, you know, as a CEO I should probably obsess about revenue, but that's my main dashboard, so a lot of the bugs I report on Superset are from this revenue dashboard. I mean, it's product usage, you know, basically the user usage information with revenue in it and behavioral stuff, so it's basically Preset's own data.

Dori: So it's on your production data.

Max: Yeah.

CL: Okay, cool. We only test in production.

Max: Yeah, yeah, who's got time for a staging or development dataset? Wait, do people really have, you know, non-production data in data? Maybe.

CL: Okay. What's the one lesson from outside of data or software like sports or arts, that influences how you build?

Max: I mean, there's the digital arts maybe, I don't know. I mean, I live in Taos, I'm in nature, and maybe it's that I do a lot of hiking and walking in the woods, and people talk about shower thoughts.

So I don't really think in the shower, I think while hiking and biking, you know, so I'll be like thinking about the project and the data models and design and things like that of like what I'm going to build next while I'm on hikes or mountain bike rides.

CL: Wow, amazing. What's your one hot take about data that you're willing to defend on the podcast?

Max: I really don't know where data engineering and analytics engineering goes with AI, but I think it gets extremely accelerated.

And then tech debt, like AI can manage the mountains of tech debt we have there, so I think there might be more contraction there than in software engineering in general. So watch out, I don't know.

CL: Oh, you think the contraction for data professionals is going to be more than for software engineers.

Max: I think so. The brake on that is speed of adoption, I think, but, you know, the people who figure out agentic coding on their large dbt projects and Airflow DAG repos are maybe going to have a higher multiple over their peers than SWEs, but it's unclear. It might be a similar ratio. Adoption's going to be laggy, but then the ratio of replacement might be higher.

CL: Okay, well, time will tell. What's the latest thing you've automated in your life?

Max: I don't know, but I'll tell a story. What was it? Someone got Airflow to automate a coffee machine at the Airflow Summit, I think.

And then someone had done home automation with MCP, so they wrote an MCP server for turning on their lights and opening their garage and things like that. I have not done that. I'm too busy here solving real problems.

CL: But you build software enabling people to do weird automations.

Max: That's it, potentially, you know, but yeah, writing MCP server for everything seems like a really interesting idea right now.

I don't know, home automation is an obvious one. I'm not sure what else, you know, but yeah, I was thinking if I was still at Airbnb, you know, I'd be writing an MCP server for just about every service right now. And maybe writing an MCP-gateway-type thing. If you think of the concept of an API gateway and you apply it to MCP infrastructure, I think there are some really good ideas there.

CL: Cool, and one last one, what's your favorite podcast or book that's not about data?

Max: I mean, I've been obsessing over agentic coding, so I could send some for the show notes. I don't know the names of the creators, but I see their faces in my head. There are people talking about agentic coding, Claude Code specifically, and every week they do what's new in Claude Code this week and what are all the tools in that ecosystem.

So I'll share some, but if it's not data-related, it's definitely tech-related.

CL: Okay, so we reached the end of the lightning questions and then how can listeners find you and be useful to you?

Max: GitHub, I mean, I'm mistercrunch on GitHub, on the Superset repo. That's probably the best place to find me. The Superset community. By the way, if you have not tried Superset in a while, you should try it.

Apache Superset is just a great BI tool. It's fully open source, Apache 2 license, ASF-sponsored. You can try it for free for up to five seats on Preset, so that's a great way to try it, and then you can decide whether you want a commercial offering or you want to just run the open source project yourself, but I recommend people try that.

Otherwise where to find me. You know, Twitter. I try to stay away from Twitter. I just get like a bad feeling in my stomach when I spend time there, but if you DM me on Twitter, I'll eventually see it.

Dori: Awesome.

CL: Amazing.

Dori: Well, thank you so much again for joining us, Max. This has been just an incredible conversation, and thank you to our listeners for joining. We will see you next time.