Data Renegades
45 MIN

Ep. #4, Streaming Made Practical with Micah Wylde

about the episode

In episode 4 of Data Renegades, CL Kao and Dori Wilson sit down with Micah Wylde. They trace his journey from building fraud detection at Sift Science to architecting massive real-time systems at Lyft and ultimately founding Arroyo. In this conversation, Micah breaks down the real complexity of streaming systems, why schema evolution is still the hardest challenge in data, and where the industry might move over the next five years.

Micah Wylde is a systems engineer and entrepreneur who has built large-scale data infrastructure at companies like Sift and Lyft. He founded Arroyo, a SQL-first real-time streaming engine designed to make streaming accessible to non-experts. He now works at Cloudflare, where he is helping develop an open, object-storage-based data architecture.

transcript

Dori Wilson: Hey, welcome to another episode of Data Renegades. I'm Dori Wilson, Head of Data and Growth at Recce.

CL Kao: And I'm CL, CEO and Founder of Recce. Today, our guest is Micah Wylde. Micah created Arroyo, a cloud native streaming processing system, and the company behind it. Arroyo was acquired by Cloudflare, where he now works as principal engineer. Welcome, Micah.

Micah Wylde: Thanks, CL.

CL: So awesome to have you.

Micah: Yeah, thanks for having me on.

CL: Yeah. So let's just start with like, can you take us back to the beginning? What problem first pulled you into the data space?

Micah: Yeah, actually my first interaction with data came at a company called Sift Science, which was one of the first companies that was fighting internet fraud using machine learning.

This was pre-AI, so we called it machine learning. But just like today, there's a lot of data involved there, especially in fraud. The problem is to take everything we know about a user: every interaction they've had on a website, everything they've bought; combine that with data from other websites, other places the credit card or an address has been used; and when they actually go to purchase something and we need to authorize the credit card, figure out, basically, are we going to allow this transaction to go through.

So this turns out to be this massive data problem where you have to pull in all of this historical data, combine it with real time data, build up all these features and push it through these machine learning models in order to get a score at the end.
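The scoring flow Micah describes (join the live event with stored history, build features, run a model) can be sketched in miniature. All names, weights, and thresholds below are hypothetical illustrations, not Sift's actual system:

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    user_id: str
    amount: float
    country: str

# Hypothetical historical aggregates, e.g. precomputed in a low-latency store
HISTORY = {
    "user-42": {"txns_last_week": 3, "avg_amount": 25.0, "countries": {"US"}},
}

def build_features(txn: Transaction) -> list[float]:
    """Join the live event with stored history to form a feature vector."""
    hist = HISTORY.get(
        txn.user_id,
        {"txns_last_week": 0, "avg_amount": 0.0, "countries": set()},
    )
    return [
        txn.amount,
        float(hist["txns_last_week"]),
        txn.amount / (hist["avg_amount"] or 1.0),          # spend vs. typical spend
        0.0 if txn.country in hist["countries"] else 1.0,  # new-country flag
    ]

def score(features: list[float]) -> float:
    # Stand-in for the ML model: a weighted sum clipped to [0, 1]
    weights = [0.001, -0.05, 0.1, 0.5]
    raw = sum(w * f for w, f in zip(weights, features))
    return max(0.0, min(1.0, raw))

def authorize(txn: Transaction, threshold: float = 0.5) -> bool:
    """Approve the transaction if its fraud score is below the threshold."""
    return score(build_features(txn)) < threshold
```

The latency pressure comes from `build_features`: within the ~300 ms budget mentioned below, the history lookup has to hit a store designed for exactly these reads.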

And we were kind of early in that big data space where you really had to self-serve for all of this stuff and you had to really be an expert. If you're using a database, you had to know how that database worked. You had to usually push changes into the database, contribute to the projects. And so that was my introduction to all of this stuff.

CL: Just give us a sense of what year was that?

Micah: Yes. This is like early 2010s.

CL: Right. And then I'm assuming that all this credit card transaction approval needs to happen like sub-second and then you have to do this feature engineering and then do this data preparation and revision of the model is all like a whole thing at a time, right?

Micah: Yeah, exactly. Yeah. I mean typically you want to respond in like 300 milliseconds. So it's not a lot of time to pull back all this.

Dori: So it's pretty real time, right?

Micah: Right.

Dori: What were things you found that improved latency when you were doing that modeling?

Micah: Yeah. So we had to build, as I said, an entire in-house data stack. We were built on top of HBase at that time, which, if you were doing any kind of real-time, low-latency processing, was one of the main technologies people were using.

And the really key thing was how we were storing the data. So we were doing a lot of these aggregates, like we need to know how many transactions this user had over the past hour, week, month, year. And so there's a lot of very careful data engineering, basically, to structure all of the events in the database in such a way that you could immediately calculate those without having to pull back a lot of unrelated data to answer that question.
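The storage trick Micah alludes to can be illustrated with time-bucketed counters: write each event into a coarse bucket keyed by (user, time bucket), so a "how many events in the last N hours" query reads only N small cells instead of scanning raw events. A toy in-memory sketch, not the actual HBase schema:

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # one counter per user per hour

counters: dict[tuple[str, int], int] = defaultdict(int)

def record_event(user_id: str, ts: int) -> None:
    """Increment the hourly bucket this event falls into."""
    counters[(user_id, ts // BUCKET_SECONDS)] += 1

def count_last_hours(user_id: str, now: int, hours: int) -> int:
    """Answer 'how many events in the last N hours' by reading only N buckets."""
    current = now // BUCKET_SECONDS
    return sum(
        counters.get((user_id, b), 0)
        for b in range(current - hours + 1, current + 1)
    )
```

In HBase terms the same idea shows up in row-key design: keys built from `user_id` plus a time bucket keep a user's recent counters adjacent on disk, so a window query becomes a short range scan rather than a wide read.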

And then all the data processing code was custom stuff we were writing in, at that time, Java. You really couldn't use off-the-shelf things back then. We kind of played around with Flink, which was pretty early at that time, and found it didn't quite serve our needs.

But I think that the big shift that's happened since then is that just the public technologies available has gotten so much better that you can solve those problems without self-serving in the way that you really had to back then.

Dori: Yeah. And when I think about blocking frauds, this would be like if I was traveling or like I picked up a transaction somewhere in another country and it's like, "what are you doing?"

That type of blocking?

Micah: Yeah. So this is actually on the merchant side. So an interesting thing, a little far divorced from data, but in the credit card world, merchants actually bear almost all of the risk for what are called "card not present" transactions. So when you're paying for something on the internet.

The banks kind of intermediate that and they might block stuff themselves, but ultimately if the merchant accepts a bad credit card, they're the ones who have to pay it back if there's a chargeback. So they're the ones that are incentivized to really try to figure out if a user is fraudulent, if they're using a stolen credit card.

Dori: Because it's a real cost to them.

Micah: Exactly.

Dori: Yeah. Makes sense.

Micah: Yeah. They potentially are shipping out the thing and then they have to give back the money, which they don't love.

Dori: And then they're out that product as well.

CL: Yeah. So you work on this kind of foundational data system for these real world use cases and then what eventually sparked you to create Arroyo?

Micah: Yeah, so I mean I've been thinking about these real time data problems for a long time. From Sift, I went to Lyft, which is confusing, but they obviously have a lot of these real time data problems as well.

They have to figure out where drivers are, where people who want rides are. They have to figure out how much they're going to charge for a ride, how long it's going to take each of these drivers to get to a particular place which involves ingesting all this real time traffic data, predicting road speeds, as well as all the front end safety stuff that I had been working on.

And at Lyft, we were heavy adopters of Flink. We were certainly one of the early big adopters, starting in like 2017, when it was just starting to get stable enough to run at scale. And we were running at enormous scale, tens of millions of events per second.

And we kind of operated as like a platform team inside of Lyft. We were trying to provide Flink basically as a service to data engineers, data scientists, product engineers who are trying to build these real time features or real time pipelines.

Arroyo really came out of my frustration trying to use Flink in that setting where we really were trying to give it to people who weren't experts in streaming or sometimes even experts in data.

Dori: Yeah.

Micah: And we wanted them to be able to self serve and be successful, solving their problems. And with Flink, we just never were able to get there. It was just so complicated both to build these pipelines and also to operate them reliably.

And even doing a lot of work on Flink itself and building tooling and building other interfaces to it, we built an early SQL interface as well as a Python interface. You just never got to that point where people really could self-serve. Out of the hundred-ish engineers I worked with, maybe two or three really became successful on their own without our team being really involved.

And ultimately I realized that this wasn't a fundamental problem with the space, it was just Flink. It was just the first product that tried to solve this well. And with the first product in the space, you always make mistakes or you do things that you would later learn weren't the best way to do it.

And so it felt like an opportunity to just start from first principles, take all the things we'd learned from working on Flink and apply it to a new system that was designed for both this problem space of giving to non-technical or non-expert users, as well as modern infrastructure, which had also changed a lot in the years since Flink was created.

Dori: Yeah, tell me more about the interfaces that you built. Why didn't the SQL and the Python interfaces work? And I'm coming at this from somebody who worked on safety products at Uber, doing data work there.

Micah: It's great talking to Uber people because we built exactly parallel infrastructure. I mean we were solving the same problem, so it makes sense. I always see the rhyme of what we built when I see Uber infrastructure.

Dori: As an Uber person, it would be Lyft rhyming with what we built.

Micah: Well, we usually did it first, so.

Dori: Competition never dies, y'all, it never dies.

Micah: But yeah, I would say that the core challenge with both the Python and the SQL stuff that we built is that it was always just a really thin layer on top of the underlying Java API. And for those who haven't used Flink, the main programming API is what's called the Data Stream API. And it's extremely low level. It's the same API that Flink itself is built in.

So all the kind of like core operators, that's the API that users are given and it's incredibly powerful but just really, really hard to use. And the core of that challenge is how it interacts with state, which is just like the core challenge of any streaming engine is you have to remember data and you have to manage that, it has to integrate with the checkpointing system and the recovery system.

And when you just give people this API they just absolutely do not know how to use it. And then you can try to hide details like have a SQL API that compiles down to it. We did that at Lyft and there's now an official one that's been part of the project for a number of years.

But it's very, very easy to fall through into that lower-level API and start seeing Java stack traces when something goes wrong. And to get good performance or even correct results you really have to understand what is that lower level that it's compiling into.

So when we were starting Arroyo we knew we wanted to do SQL because that was by far the most requested feature when we're building this platform at Lyft. But we also knew that SQL had to be first class, that had to be the actual language of the engine as opposed to this layer that was going to get compiled down into this Java or other lower-level API.

Dori: Yeah, that makes sense to me. SQL is just I think the lowest common denominator language that connects so many people that rely on data.

CL: So you mentioned that Arroyo was born out of the frustration with Flink when you were at Lyft trying to democratize the use of streaming data. But I'm assuming the users are still technical, they're just not necessarily familiar with the API. Right? So by turning to this SQL way, that makes it more accessible for that audience? Or did you realize you have a very different audience?

Micah: Yeah, the users are definitely still technical. These are people who are data engineers or data scientists or product engineers, like people who are experts in their own domain. They're just not streaming experts which is a very, very niche world even within the broader data world.

And our goal was always to make it as easy as BigQuery, make it as easy as Snowflake. If people can write the query, they should be able to write a real-time version of that query without understanding all of the complex details of state, of checkpointing, of the time semantics that you get in streaming.
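To make "a real-time version of that query" concrete: a streaming SQL engine compiles a windowed GROUP BY into incremental state updates, emitting a window's result once the event-time watermark passes its end. A minimal Python sketch of tumbling-window count semantics (an illustration of the idea, not Arroyo's implementation):

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

class TumblingCount:
    """Incrementally maintain per-(window, key) counts, emitting a window
    once the event-time watermark passes its end."""

    def __init__(self) -> None:
        self.state: dict[tuple[int, str], int] = defaultdict(int)

    def on_event(self, key: str, event_time_ms: int) -> None:
        # Assign the event to the window containing its event time.
        window_start = event_time_ms - event_time_ms % WINDOW_MS
        self.state[(window_start, key)] += 1

    def on_watermark(self, watermark_ms: int) -> list[tuple[int, str, int]]:
        """Emit and drop every window whose end is at or before the watermark."""
        done = [(w, k) for (w, k) in self.state if w + WINDOW_MS <= watermark_ms]
        return sorted((w, k, self.state.pop((w, k))) for (w, k) in done)
```

All the things the SQL user never sees live in this little class: the keyed state, when it can be emitted, and when it can be discarded. That state is exactly what has to integrate with checkpointing and recovery in a real engine.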

Dori: Yeah, when I have thought about streaming data, the deepest level I got to was when I worked at a company, Mux, that did a real-time video product. One of the things I was doing there was figuring out how we have metrics around the performance of the video itself.

Mux has a data product where they were doing actual live metrics on it. And it was really interesting and I had so many problems around what you're talking about with timestamps and everything. It just, when you think about it from a non-streaming perspective, going to streaming, it just becomes this continuity. It's just a very different way to think about it.

CL: Right. So I want to dive into, like you mentioned streaming is a niche problem. But if we look at a broader data engineering ecosystem, I guess some people argue that if you do micro batches, it's almost like streaming and then a lot of times you don't really need real-time.

But I want to get your opinion on, not just streaming, but in general data engineering today, what is the hardest part that nobody really talks about?

Micah: Yeah, well, I'd just say I think micro batches versus streaming is really a usability thing. Micro batches force you into a certain set of semantics, and I think that can actually make things more complicated to understand, or to express the kind of logic that you want. But under the hood, it's always sort of a spectrum of how often you're batching, on a performance basis.

I just think that the micro batch API is very limiting in a way that people eventually just decide, whatever, we're just going to do batch processing. But yeah, I mean in terms of what is really challenging, I think this is true both of streaming, maybe more so of streaming, but across the board--

The problem is always when you make changes. If you have a pipeline and it's running and nothing's changing, that's usually pretty manageable. But as soon as anything changes, you have a new schema, you have new fields, you have a new product you need to onboard, you have a new data set, that's always been the unsolved problem here.

And there's a lot of tools that have been built over the years to make it better, and it's certainly gotten better. Back when I was starting in this space, there was no structure at all. If you wanted to make a change in production, you literally were just editing the SQL queries that were running directly. There was no source control, no change management.

But then, particularly on the streaming side, now that we have all of these data catalogs and data lakes and things, just expressing those changes, being able to deploy changes safely, being able to test them, being able to evolve the data that already exists, everyone just has a pile of hacks that sort of makes this work. But we still haven't figured out as an industry how to actually do this in a really well defined software engineering sort of way.

CL: This actually overlaps with what we are doing, because when you say any change, it's really the logic of the pipeline, whether it's streaming or not. And I think traditionally big data is a huge volume of data, but the logic is relatively stable.

But nowadays we see a lot of demand for making amendments, making changes to how the pipeline works. And we are running into this problem as an industry where there's no way to guarantee it's safe to do that before production. So what do you think is solving that problem?

Micah: Yeah, I mean, I think obviously tools like DBT have been a big sea change here, where we have actual testing infrastructure, we have Git, like we have source control, like all these software engineering lessons that other parts of the industry learned, you know, maybe a decade or two earlier are finally really coming into the data space.

But I think now we kind of have the next set of challenges there. Like how do we actually evolve our data pipelines in production? How do we integrate that with our schema management tools, our data lakes, our catalogs? And I think that still feels like quite an unsolved problem from my side.

Dori: Yeah. And you mean this kind of across all these different data layers and where it gets exposed?

Micah: Exactly.

Dori: Yeah. Not just in the code.

Micah: Yeah. In any one particular layer you can figure it out. But all of our data stacks now are these incredibly complicated, giant systems that involve all of these different products and potentially different vendors.

Especially as we move to a modern data stack situation where everything is decomposed, it really just means there's more pieces that you yourself have to keep in sync in some way. That's an unsolved challenge.

Dori: Yeah, there's been some consolidation in the market. Right? Like Fivetran buying SQLMesh and DBT I think is an example right here. Do you think that type of consolidation is going to help solve this problem? I mean, it certainly takes away all the disparate tools, for better or for worse.

Micah: Yeah, I mean, well, we'll see how everyone manages to integrate things. Just because it's part of one company doesn't necessarily mean that it's going to work well together, as anyone who's used AWS's suite of data products can attest to.

But yeah, I mean, we obviously are in this world of consolidation right now, and I think we probably will see more of this good integration happening. I'm definitely interested to see what happens with Fivetran and DBT and SQLMesh, if that survives. But I think people are still going to want to decompose their data stacks in various ways.

People have learned that when you give all of your control to one vendor, that vendor is incentivized over time to just keep finding ways to charge you more and more money.

Talking to heads of data platform teams, data engineering teams, CTOs over the past several years, the overwhelming message is that we pay Snowflake too much money. We pay Databricks too much money. And I think the pull of having more control over your data and being able to mix and match vendors more freely, I think that trend is going to increase just because people have learned the painful cost of being locked into these vendors. Even if you do get some benefits in terms of better integration.

Dori: Yeah, you're talking a lot about cost. How much do you see of it being performance? Are the tools performing better? Is that something they care about or is it truly just cost driven?

Micah: I think we go through different cycles and right now we're in a cost cycle. Post 2022, every team was basically asked from the board, CEO, whatever, on down to find ways of reducing costs. And so I think that's the cycle we're in right now.

Obviously there's a part of the industry, in the AI space, where funding is absolutely free. I think there they're much more interested in how do we improve performance. And there's been a whole explosion of tools specifically for their unique data sets.

But in the more traditional data engineering world, I think it's a very cost oriented moment and we'll see if that changes. But that's definitely the environment we're all operating in here.

Dori: Yeah, I've been on teams and I think some of our listeners are on teams possibly currently that have been just called cost centers at the company. And so having that cost reduction is important.

Do you see ways that teams, especially having worked on a bunch of platforms, can say, "hey, I'm not just a cost center. Here's the benefit," showing that value?

Micah: Yeah, I think the best way to do that is to actually be part of the product in some way. And we saw Lyft where we had use cases that were more product driven, we had use cases that were more analytical, post hoc.

But whenever we had a thing that actually drove revenue directly, for example our dynamic pricing system, that was what got attention and investment and essentially funded everything. And then the use cases that were kind of on the back end, where people were doing more traditional analytics, that was always a bigger fight to get funding and resources.

So the more you can like actually be part of that story whether it's producing model data, things that are exposed in the product directly, that's what basically gets attention.

Dori: Yeah. Turning it to the revenue generator, literally data as the product, not just as an internal product but externally.

Micah: Yeah.

CL: So we touched on the modern data stack and whether it's dying or not, or consolidating, right? And over the years we've seen the landscape change quite a bit. But if you had to rebuild one piece of the stack from scratch, what would it be? Well, definitely not streaming, because you've been doing that, but what about the rest of the stack?

Micah: Yeah. If you'd asked four years ago, that's what it would be. One thing that I feel particularly pained about and I think is a common source of pain is the CDC layer. We mostly run Debezium today to solve this problem. And Debezium is a very impressive piece of technology and I'm friends with the people who've created it.

But it brings this whole painful Java ecosystem and Kafka ecosystem to what seems like it should be a pretty lightweight sort of problem of I just want to read changes off of a database and shove them into some other data system. But we all end up just suffering through this Java stuff to make that happen.

Someone's going to do it. And there's been a lot of people solving bits of this problem. I saw Supabase just released a Postgres CDC tool written in Rust I think two weeks ago. Someone's going to eventually replace Debezium as a whole and it's going to be great.
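Stripped of the surrounding ecosystem, the core CDC loop Micah wants to be lightweight is small: consume an ordered log of row changes from the source database and apply them to a sink. A minimal sketch with hypothetical types (not Debezium's actual change-event model):

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class Change:
    op: Literal["insert", "update", "delete"]
    table: str
    key: int
    row: Optional[dict]  # latest row image; None for deletes

def apply_change(sink: dict[int, dict], change: Change) -> None:
    """Apply one change event from the source database's log to a sink table."""
    if change.op == "delete":
        sink.pop(change.key, None)
    else:  # insert or update: upsert the latest row image
        sink[change.key] = change.row

def replicate(sink: dict[int, dict], log: list[Change]) -> dict[int, dict]:
    """Replay an ordered change log; the sink converges to the source state."""
    for change in log:
        apply_change(sink, change)
    return sink
```

The hard parts in production (reading the database's replication protocol, exactly-once delivery, snapshotting existing data, schema changes) sit around this loop, which is why the real tools are heavier than the problem statement suggests.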

CL: And do you see that as a very specific thing, from transactional database to analytical database, like the CDC use case, or is it a more general data ingestion problem?

Micah: Yeah. So I mean, I think "I have data in my transactional database, I need to run analytical queries on it, but I can't do that in my transactional database," that sort of use case is so common. I see it across virtually every company once you hit a certain level of scale and you can't just run things in your Postgres.

Dori: I mean you can. It's just, do you want to bring down production when you do?

Micah: Right. And I mean, every company starts that way, right? You start with a reasonable Postgres, you're using it for transactional use cases, and you also run your monthly billing on it. And then you do hit a level where you can't vertically scale that database anymore. And then you try to introduce read replicas and you run your analytical queries on the read replicas.

But then at some point, even your read replicas can't keep up with your analytical query load. And you have to solve this problem and it becomes this extremely urgent thing. Like the company cannot exist unless you're able to run the billing--

CL: What are your thoughts on, for example, like Columnar extension for Postgres. Would that solve the problem or not really?

Micah: I think they don't really solve the problem. And I say this as someone who's close friends with the people behind the two main products in this space. Again, it's great at a certain level of scale where you really don't want to think about this problem, but you're hitting the limits of trying to do analytical processing in Postgres. This can give you more headroom.

But ultimately, if you're trying to share the same compute resources as your actual transactional database, you're going to hit that limit at some point. The real benefit of an analytical database is that we're able to scale out. And if you're trying to build this all in one system, you really lose that benefit.

And at the point where you're disaggregating storage anyways, you might as well just use CDC into a traditional system. Except the CDC is really hard because we're all trying to make Debezium work. So I think once someone really solves that problem, makes CDC off of Postgres easy, I'm not really sure what the room is for these kinds of specialized storage engines on top of Postgres directly.

Dori: Yeah, it's funny you bring up the billing issue. My partner is a backend engineer and unfortunately at the company he's at, he's been working on doing a billing database migration and then how to backfill as they redo calculations from a quarterly or monthly aggregation to a daily aggregation.

So it just really hit home to some of the rants I've heard him do recently. Except they migrated to Cockroach from Postgres, which did not solve their problems at all.

Micah: Yeah, I think anyone who's built a billing system, which I did once at Sift, either you decide that's your life's work or you never want to touch it again. And I'm definitely in that second category.

CL: Okay, I see this pointing to our next question. What's the most painful bug or failure you've seen in production?

Micah: I'll have to talk about a Flink bug. Not to rag on Flink more than I already do. Maybe this also gives some color on why Flink is so hard. This was at Splunk where we were also trying to build a real time data platform on top of Flink for our customers. We were using this particular Flink feature called Broadcast State.

This is a pretty important piece of the Flink state story. If you have any kind of global data, for example, you're reading off of Kafka and you have to store the partition offset for each of the partitions that you're reading from, and, on recovery, you need to restore that on whatever node is now reading from that partition, you can use this feature called Broadcast State, which is basically a global state replication system.

It has a really tricky API, it turns out, because each node gets all of the state, but they're only supposed to write their own particular part of the state. So in this case only the Kafka partitions you in particular are reading from.

If you accidentally write all of the state on every node, what happens is on every restart you get this exponential increase in the state because each one is reading 10 keys and writing 10 keys. And if you have 10 nodes doing this, then the next time you have a hundred keys, the next time you have 10,000 keys.
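That blow-up is easy to model: if each of N nodes incorrectly checkpoints the full broadcast state instead of only its own slice, every restart multiplies the total key count by N. A toy simulation of the arithmetic (hypothetical, not Flink's API):

```python
def restart_cycle(state_keys: int, nodes: int, buggy: bool) -> int:
    """Return the number of keys checkpointed after one restore/checkpoint cycle.

    Correct behavior: each node writes only its own 1/nodes share, so the
    total stays constant. Buggy behavior: every node writes *all* the keys
    it restored, so the total multiplies by the node count.
    """
    if buggy:
        return state_keys * nodes  # each of `nodes` nodes re-writes every key
    return state_keys              # per-node shares sum back to the original

def keys_after(restarts: int, initial_keys: int, nodes: int, buggy: bool) -> int:
    keys = initial_keys
    for _ in range(restarts):
        keys = restart_cycle(keys, nodes, buggy)
    return keys
```

With 10 nodes and 10 initial keys, each restart under the bug multiplies the state tenfold: 10 keys become 100, then 1,000, then 10,000.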

Dori: Oh my God.

Micah: And we had a bug that was doing this. It only triggered in a very particular situation, and there was a whole hilarious chain of other issues that caused it to trigger after two months. But it basically took down every Flink pipeline in the company. They all had terabytes of state by the time we realized what was going on, and it took us about three days to recover.

CL: Wow, that does sound painful. So I wanted to ask you, because you share a lot about your experience in the data platform team and then also the strategy where the data team positions themselves on the critical path of revenue. That's where you get funding, right?

What's your opinion about the relationship between data engineering teams and the consumers and the stakeholders? What are the good or bad relationships you're seeing in different places?

Micah: Yeah, I think the data teams that are really successful are the ones that see what they produce as a product, even though your consumers are generally internal, but having that product team mindset and really trying to understand what are your users trying to get out of the data that you're providing and how do you best serve them.

I think there's sometimes a trend in data teams to feel like you're kind of a consultant, or you're basically just doing what you're told to do. And that I've seen be much less successful. It just drives resentment, sometimes on both sides, where data teams always feel like they're under-provisioned and they can't ever make their customers happy, because their customers don't really understand what they do.

But really inverting the relationship and taking charge, I think is what makes that successful is actually acting like a product manager, trying to actually understand, what is the thing that these people want and how do we provide that?

Dori: Yeah. I have found working with a bunch of PMs through my career, one of the best skills you can have is getting at what do they really want? Because a lot of times they'll tell you a metric and you're like, "but what are you going to use this metric for? What is the narrative?"

And you realize, "hey, no, actually we should be looking at these other metrics that would be stronger for your argument."

And then that's where you kind of make the shift, like you were saying, from a consultant to a thought partner and almost a co-owner in a way, of the narrative they're building.

Micah: Yeah, exactly.

The people who are using this data have some business thing they're trying to achieve, and they don't really care about the data, they don't care about the processes. They're just trying to solve their own problem. How do you actually help them solve that problem? It can be extremely high impact.

I mean, I've seen when this goes badly, it can be extremely destructive to the company. These often feed into financial models. At public companies, these are the things that we're telling to analysts that are basically going to pave the way for our future stock price that affects all of us. And when the data is bad, it can be extremely disruptive to that process.

Dori: Yeah. What processes or things have you seen work to help make sure that we are putting out, as data stakeholders, good data. "Good" being a very wide term here.

Micah: Yeah, right. I mean, there's correct data, obviously, and then there's quality data that is easy to understand and doesn't lead to misinterpretations.

I think that the data quality story always has to start early. And this I think has been sort of an industry trend over the last few years: let's get our data well schematized at ingestion, let's try to find problems as early as possible in the whole data pipeline.

And certainly from the streaming side that's something we focus on a lot is actually getting the data into a good format, making sure we're finding problems in the ingestion pipeline, making sure it doesn't get into our actual data sets.

I think you also have to invest in testing at every stage. Make sure we are able to detect problems, that we're regularly validating our data sets, we're exercising them, we're running invariant tests against them.

And then from the quality perspective, I think it comes back to actually understanding how people are going to use the data and making sure that it actually answers the questions that they think it answers as opposed to the-- Your job is basically to take the incredible messiness of the underlying world that you're trying to model and hide that as much as possible from the ultimate consumers of this data.

Dori: Yeah, yeah.

Micah: Because maybe you have 100 different SKUs that are all slightly different. There's going to be a million details of whatever you're trying to model, and you just want to make sure that anything important is exposed to the end user, but there's always going to be a million irrelevant details that you can basically smooth over for them.

Dori: Yeah.

Micah: And make sure that they can just focus on the high level question that they're trying to answer.

Dori: Yeah, I always thought about it when I was doing a type of presentation as the, "so what?" Like I've done all this work, I'm a nerd, my background's research, but they just want to know, cool, this is a number. What do I do with it? So what?

CL: Right. So you touched on having a product mindset for, well I guess the entire data team is super important. Like understanding where the data is being used and why. And then almost like I don't know if there is a term, like the "data UX" as in like you expose this well modeled data for your consumer. That is the most important thing. And then make sure they're creating value.

Micah: Yeah.

CL: Okay, cool. So we want to fast forward five years. What's going to be feeling laughable and outdated about how we handle data today?

Micah: Yeah.

I do think this world where you give all of your data to one cloud vendor, pay them for compute on the data but also just to store it, and are totally locked into their world, that's going to look crazy in five years.

With the rise of open table formats and object storage, which under the hood everyone is going to be using anyway, the idea that we should just let them control it is just not going to make any sense.

Dori: So how do you see that transitioning? When we talk about the modern data stack, what would a setup look like?

Micah: Yeah, I mean, I think we still don't quite know. Obviously today everyone is talking about Iceberg and we're storing data in Parquet. I don't know if that's what it's going to look like in five years. I'm sure anyone who works with Iceberg has their long list of gripes and feels like we can do better. I think the same is true for Parquet.

There's at least five competing file formats now that are trying to replace Parquet and I think it's likely one of those will win out at some point. But ultimately it is going to be data on object storage with some sort of table format on top, some sort of metadata layer.

Object storage is just such a universal solution to this whole class of data problems, not just in the analytical data space. You're seeing it everywhere in the streaming space, in the vector database space, increasingly in the transactional database space. It's just such an incredible tool to simplify our systems. That trend is absolutely going to continue.
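The pattern Micah describes — data files on object storage with a thin metadata layer on top — can be sketched in a few lines. This is a toy illustration only: the local filesystem stands in for an object store, a JSON manifest stands in for a real table format like Iceberg, and every name here is made up for the example.

```python
import json
import tempfile
from pathlib import Path

def write_snapshot(table_dir: Path, rows: list[dict]) -> dict:
    """Write a data file, then record it in the table's metadata manifest.

    In a real table format the data files would be Parquet on object
    storage and the manifest would be Iceberg metadata; here both are
    plain JSON on local disk, purely for illustration.
    """
    table_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = table_dir / "manifest.json"
    manifest = (
        json.loads(manifest_path.read_text())
        if manifest_path.exists()
        else {"snapshots": []}
    )
    data_file = table_dir / f"data-{len(manifest['snapshots'])}.json"
    data_file.write_text(json.dumps(rows))
    # The metadata layer is the source of truth: readers follow the
    # manifest instead of listing the storage bucket directly.
    manifest["snapshots"].append({"file": data_file.name, "rows": len(rows)})
    manifest_path.write_text(json.dumps(manifest))
    return manifest

def read_table(table_dir: Path) -> list[dict]:
    """Read every row by following the manifest, not by scanning files."""
    manifest = json.loads((table_dir / "manifest.json").read_text())
    rows: list[dict] = []
    for snap in manifest["snapshots"]:
        rows.extend(json.loads((table_dir / snap["file"]).read_text()))
    return rows

if __name__ == "__main__":
    tmp = Path(tempfile.mkdtemp())
    write_snapshot(tmp / "events", [{"id": 1}, {"id": 2}])
    write_snapshot(tmp / "events", [{"id": 3}])
    print(read_table(tmp / "events"))
```

Because the manifest, not the storage listing, defines the table, any engine that understands the metadata format can read the same data — which is exactly why the combination decouples storage from compute vendors.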

Dori: Yeah. Because you talked earlier about making your work with Flink more accessible to non-technical or less technical people, do you see that as part of the five-year journey as well?

Micah: Yeah, I mean I think if we're going to talk about AI, for example. We haven't really talked about it yet in this conversation, which is probably illegal for a tech podcast.

But yeah, I mean, I think AI is absolutely going to make data more accessible to people. We've seen a lot of attempts at that over the past two years and I don't think any of them have been wildly successful, but it's absolutely going to happen. Users are going to be able to ask questions in English of their data and our challenge as practitioners is going to be figuring out how to support that, make sure that the AI has enough context to build these queries correctly.

But also I think it means there's just going to be a lot more queries happening. Whenever you make something more accessible to a wider audience, you get a lot more of that thing. And so that's also going to be a challenge both on the infra vendor side and also on the data team side of how do we support this fast increasing usage from agents and from people using AI tools to query our data?

Dori: Yeah, something we talked about earlier, you know, talking to stakeholders, that I've been thinking about is how do we get people to ask the right questions as well. You're saying there'll be a lot more queries run, but are they going to be asking the right questions when they do these queries? Because you can always get a number. It's just, is that going to be the best number for your narrative?

Micah: Yeah, it's true. Right. When there's a lot of black-box logic in between you asking the question and getting the answer, that's always the question: is this the right answer for the question I actually have?

And I think this is an open question. I don't know. Is AI going to get better at that level of thoughtfulness? But I do think our role as practitioners is to give it as much context as possible. So this means really investing in schema tools, in semantic layers, in lineage.

We need to be able to give all the context that we personally have about our data to the AI, because otherwise there's no chance that it's going to do the right thing.
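What "giving the AI all the context" might look like can be made concrete with a small sketch: gathering schema, semantic definitions, and lineage into one payload that accompanies the user's question. Everything here (the table names, the metric definitions, the `assemble_context` function) is hypothetical, not from any real catalog or semantic layer.

```python
import json

# Hypothetical metadata a data team might maintain; in practice this
# would come from a catalog, a semantic layer, and a lineage tool.
SCHEMA = {
    "orders": {"order_id": "string", "amount_usd": "double",
               "created_at": "timestamp"},
}
SEMANTICS = {"revenue": "SUM(orders.amount_usd) over completed orders only"}
LINEAGE = {"orders": ["raw.payments", "raw.checkout_events"]}

def assemble_context(question: str) -> str:
    """Bundle what we know about the data into one prompt block, so a
    model answering `question` isn't guessing at table shapes or
    metric definitions."""
    return json.dumps(
        {
            "question": question,
            "schemas": SCHEMA,
            "metric_definitions": SEMANTICS,
            "lineage": LINEAGE,
        },
        indent=2,
    )

if __name__ == "__main__":
    print(assemble_context("What was revenue last month?"))
```

The point of the sketch is the shape of the investment: the model only stands a chance of writing the right query if the definitions humans carry in their heads are serialized somewhere it can see them.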

Dori: Yeah, now you've mentioned semantic layer. Do you have a favorite semantic layer in mind?

Micah: I don't, I don't think any of them have really solved the problem at a level of quality yet, but I think something in that space is going to be very, very important as we're throwing more agents at all of these problems.

CL: Yeah. So I think we touched on AI a little bit. Providing context is probably going to be the most important thing because the agent can perform a lot of queries. It's no longer just text to SQL. It's a lot of figuring out what to do and then it could maybe do 10 times more queries than a human to figure out a problem and throw away bad answers.

So in this world, how do you see AI reshaping the role of the data engineer? Are we all becoming metadata management people, or?

Micah: Yeah, I mean, I think data engineering is still going to be incredibly important as the input to all of this. We still need to build these pipelines, and we still need good, clean data sets for the agents to use. There might be a little bit less work on the actual consumption side, but I think it gets so much more important to focus on quality everywhere up to that point. Because the AI, at least at current levels of technology, is not going to be able to self-serve on top of bad data sets; it's not going to be able to understand the nuances of those data sets.

So it's just so critical for data engineers to focus even more on these problems upstream of that point.

Dori: Yeah. What do you think about also expanding datasets since AI is really good at doing maybe some more non-traditional data sources, a lot more unstructured data. How do you see that playing in here?

Micah: Yeah, I think it's a fascinating area and there's a bunch of startups that are attacking this problem right now and we already touched on other file formats that are focused on this like Lance and Vortex. I think we will see more adoption of all of that stuff.

It's such a greenfield space, which is pretty exciting. It's something that Spark doesn't do well, and existing tools in the space don't handle well. So as a startup it's really exciting to basically not have any incumbents and just have this greenfield opportunity. But yeah, again, I think it's basically unsolved today.

A bunch of half-solutions and things that you can make work if you're willing to invest a lot of energy. But I'm sure in five years we'll have the really production-quality version of all this stuff for working with unstructured data, multimodal data, and other kinds of non-traditional data sets.

Dori: Yeah. If you were going to advise data engineers right now, early in their career, mid-career, or I guess even senior, what skills should they be building to future-proof themselves?

Micah: I mean in the world where AI takes all of our jobs, I don't know, best of luck. But if we're sort of in the same state where we are now but with slightly better models, honestly I think the fundamentals haven't changed.

The same things you're doing today I think will still be relevant in five years but it becomes even more important to be this product-focused person where you're actually trying to solve problems rather than just being a SQL monkey. That probably is a less useful skill set.

Dori: Do what the AI can't, and be a true thought partner who fully understands the context on both the input and the output side.

Micah: Yeah.

Dori: Yeah.

CL: Thank you so much, Micah. This has been a great conversation. What is something that we should have asked you but didn't?

Micah: Well, we've touched on the compositional data stack, modern data stack, a bit. I think what we're doing at Cloudflare is pretty interesting in that dimension. We're building everything on top of our R2 object storage product, which is, I think, a very powerful building block for all of these new data stacks because we don't charge egress fees, which means you can store your data on R2. You can query it from any vendor, any cloud.

We don't want to control your data. We just want to be the place you put it and we'll charge you a reasonable fee to store it. And, you know, we're building streaming ingestion into that. We have a managed Iceberg catalog and we're building a query engine on top of it.

But ultimately we want to be that pluggable piece for whatever bits you want to use. I mean, if the modern data stack wins out, I think that that's going to be a very powerful piece of it. So just needed to get that in there.

Dori: Absolutely. All right. As we wrap up here, some quick little, more popcorny questions, just off the top of your head. This is our lightning round; we're calling it the data debug round. I'll start with the first question.

What is the first programming language that you loved or hated?

Micah: Ruby. I got my start in programming building websites in Ruby on Rails in high school, and I still really wish Ruby had won out over Python for all of these kind of data use cases. It's a much more pleasant language to write these sort of transform pipelines in.

Dori: Hmm, good to know. All right. Tabs or spaces?

Micah: I was a tab person and I eventually gave in. Now it's four spaces.

Dori: Interesting.

Micah: Yeah, you're never gonna get everyone to render tabs consistently enough.

Dori: See, I'm a tab person, but as we've been doing this podcast, I've been realizing more and more that I'm in the minority, at least among this Data Renegades group. What's your go-to data set for testing?

Micah: New York City Taxi. I have the rideshare background. Actually, I would say a challenge of streaming is there are no good streaming data sets. There's some crypto ones. Bluesky now has a data set, but we really are lacking for just like, if I'm building a tutorial, I have no data set to point people to. I'm jealous of the batch world.

Dori: That makes sense. What is one lesson from outside of tech that influences how you build?

Micah: I think especially as I get more senior in my career, most of the challenges become organizational and social rather than technical. And I think this stuff all generalizes outside of tech as well.

Looking at people who are good leaders in the world of business, politics, sports, whatever. Those lessons I think really apply to anyone trying to do something big with a group of people who are smart and motivated, but maybe not always directed in the most productive path.

Dori: Anyone top of mind?

Micah: Well, I'm a big Warriors fan, so I'll just say Steve Kerr has been a big inspiration. Just watching him over the past decade of Warriors dominance and obviously some unruly, challenging people that he has to make work in the same direction. And he's been able to do that.

Dori: Yeah. And especially as a team unit. I'm also a Golden State Warriors fan, and so seeing how truly the whole team is there, that it's not just on Curry or on anyone else, that they really share the burden. Even though the non-Curry minutes still get us, you know.

What is the nerdiest thing you've automated in your own life?

Micah: Well, like probably many of us, I got very into home automation at one point. I was too frustrated with all of the built-in tools, so I actually wrote my own Bluetooth library in Rust. I run it on a Raspberry Pi, and it powers my lights, my window coverings, and various other things in my house.

Dori: All right, what's your favorite podcast or book that's not about data?

Micah: Oh man, I'm such a normie podcast listener. But I love This American Life. I've been listening to that since I was a kid on NPR and now I'm a regular listener of the podcast version of it.

I also really love the Odd Lots podcast from Bloomberg. It's ostensibly a financial podcast, but it's really just about how do things actually work in the world if we go down to the lowest level. All these things you've maybe wondered about, why are there so many pizza places in Manhattan? They will find a guest who can really answer that question.

And they just are very curious about an area where, I don't know, finance I feel like, gets a lot of polemics and people just get upset about it, but they're actually interested in understanding it and how does it actually work, and I think it's really fascinating.

Dori: I'm definitely checking it out.

CL: Well, one final question. Where can a listener find you?

Micah: Yeah well, you can find me on X, the Everything app. @mwylde. I'm on LinkedIn because I'm a former founder and it's basically obligatory, but I do post some stuff there.

And also on the Arroyo blog is where I tend to post deep, technical blog posts. It's gotten a little bit light since we've been busy with the Cloudflare stuff, but I'm hoping to bring that blog back and get some meaty stuff up there.

Dori: Awesome. Well, thank you so much. This has been just a fantastic podcast. We really enjoyed it.

Micah: Yeah, likewise. Thanks so much for having me on.