Data Renegades
51 MIN

Ep. #11, Contrarian Bets and AI Skepticism with Michael Stonebraker

about the episode

On episode 11 of Data Renegades, CL Kao sits down with Michael Stonebraker, legendary database pioneer and creator of Ingres and Postgres. They explore the origins of modern relational databases, why specialized systems outperform one-size-fits-all platforms, and what today’s AI hype gets wrong about enterprise data. Stonebraker also shares hard-earned lessons on startups, research, and staying relevant over five decades in tech.

Michael Stonebraker is a pioneering computer scientist best known for creating Ingres and Postgres, two foundational database systems that shaped modern data infrastructure. A recipient of the ACM A.M. Turing Award, he has also founded numerous successful startups including Vertica and StreamBase. For more than five decades, he has remained one of the most influential voices in databases, systems architecture, and applied computer science.

transcript

CL Kao: I'm CL, CEO and founder of Recce and your host on Data Renegades. Today, our guest is Mike Stonebraker.

Mike is a computer scientist who has shaped the way the world stores and manages data for over 50 years. He created Ingres and Postgres, which became PostgreSQL, now arguably the most important database in the world.

He has founded more than 10 venture backed startups as well. Well, Mike, thank you so much for being here. Welcome to the podcast.

Michael Stonebraker: My pleasure.

CL: Well, I can't tell you how pumped I am, talking to you on this podcast because, you know, I'm a computer science dropout and you are a Turing Award winner. So thank you again.

I wanted to start with your journey. Everyone talks about the Ingres and Postgres period, and I think at one point you joked about that being kind of your hustle as a fresh assistant professor fighting for tenure. But you read a paper and went on to design one of the first working relational databases. Did you have any sense how big this would become at the time?

Michael: Oh, absolutely not. I was just intent on getting tenure. I mean, as I say, when you take an assistant professor position, even now you're given five years to prove that you have research chops. And I was betting the ranch on Ingres.

And my colleague Gene Wong was the one responsible for suggesting we read Ted Codd's original 1970 CACM paper, and this was the next year. And we also read the CODASYL proposal, which is probably way before your time. And with the CODASYL proposal, I couldn't understand how anything that complicated could be good. And it had all kinds of horrible properties.

Like the minute the schema changed, you had to redo the entire database. And so Ted Codd's ideas made a lot of sense. And it was obvious that what would make sense would be to try an implementation. And so we did. And the only thing that's different is that a lot of people, you know, get prototypes to where they can make them work but no one else can.

And so that's the first 90% of the effort. The second 90% is getting it to where other people can use it. And we kept going and putting in the other 90%. And so we got to where Ingres was readily usable and we had maybe 100 installations running it around the country. And the rest is history.

CL: Right. Maybe walk us through kind of the leap from Ingres to Postgres. What was broken about Ingres that made you almost start over?

Michael: Well, it turned out Larry Ellison in 1979 said that Oracle was 10 times faster than Ingres, when in fact it didn't even run at all. And that got my hackles up. Also, we got asked by everyone who visited Berkeley, what's your biggest installation? And we'd be forced to admit not very big.

And this was made crystal clear in 1979 by Arizona State University, which wanted to put their entire student record system onto an Ingres database. And they could get by the fact that you had to get this unsupported database system from these goofy professors at Berkeley. They could get by the fact that you had to get an unsupported operating system, namely Unix, from some folks in New Jersey.

But the project came to a screeching halt when they realized that there was no COBOL available for Unix and they were a COBOL shop. So, unsupported operating system, unsupported database system, no COBOL basically doomed Ingres to be a curiosity. And the only way around that was to start a company. And so we did.

And so Ingres Corporation, as it turned out to be called, was started in 1980 and they had 10 engineers working on the code. They moved it to DEC VMS and started adding features. And it was crystal clear that we couldn't compete with that. The academic research project had to do something else.

What got my interest up was that one of the original use cases for Ingres was supposed to be geographic data. You know, that's points, lines, polygons, point-in-polygon, all that stuff.

CL: Yep.

Michael: And of course in a relational database system, that stuff, you know, worked horribly. I mean, it was incredibly slow. And a simpler use case, which was brought to my attention a little bit later, but it's easier to explain, is that I got a call one day from a programmer at one of those large investment banks.

And this was after the commercial version of Ingres had just implemented date and time. And so the ANSI standard was about to come out. And so I said, "What do you mean? We implemented date and time according to the Gregorian calendar, which is what the standards say, so what's wrong with it?"

And the guy proceeded to explain to me that at least at the time, in his corner of the bond market, you got the same amount of interest on a financial bond during every month, regardless of how long the month was.

So what he wanted was to take the date bought, subtract it from the date sold, and multiply by the interest rate. But of course that got the wrong answer. And so he said, why can't I overload subtraction and use a 360-day calendar divided into months of equal length? And of course Ingres wasn't written that way.
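For readers unfamiliar with the convention: what the banker wanted is the bond market's 30/360 day count, a 360-day year of twelve 30-day months. A minimal Python sketch (the day adjustment here is simplified relative to the full US 30/360 rule, and the dates are invented) shows how it diverges from plain Gregorian subtraction:

```python
from datetime import date

def days_30_360(bought: date, sold: date) -> int:
    """Simplified 30/360 day count: every month is treated as 30 days."""
    d1, d2 = min(bought.day, 30), min(sold.day, 30)  # simplified adjustment
    return ((sold.year - bought.year) * 360
            + (sold.month - bought.month) * 30
            + (d2 - d1))

bought, sold = date(2024, 2, 1), date(2024, 3, 1)
print((sold - bought).days)       # 29: Gregorian subtraction (leap-year February)
print(days_30_360(bought, sold))  # 30: one full "month" of interest
```

Overloading subtraction to mean `days_30_360` is exactly the user-defined-function idea he describes next.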

And so that plus the previous example of wanting geographic types made it clear to me that you had to support what came to be called user defined types and user defined functions. And it was crystal clear that wasn't going to be put into the commercial system anytime soon.

So the academic project, you know, threw the academic Ingres code off the cliff and started building Postgres.

CL: Right, and then that was actually before SQL, right?

Michael: Yes.

CL: So this is its own DSL, right?

Michael: Yes.

CL: Okay.

Michael: But this was a QUEL-based system. And one of the things that's kind of interesting was that Oracle Corporation started a year earlier than we did. And so they had kind of a little bit of a head start, but we were gaining traction against them and winning a lot of deals.

And that abruptly stopped in 1984 when IBM released DB2, because Oracle had a SQL system and we didn't. And although we got SQL within 18 months, the damage was done and Oracle rocketed ahead. So the only reason Ingres and Oracle didn't change places in the commercial landscape was DB2.

CL: So Postgres was created almost 40 years ago, but at that time you kind of moved on to work on early columnar store or stream processing databases, among others. Right? And then meanwhile the community kind of picked up and added SQL to Postgres to become PostgreSQL.

Michael: Well, that's not quite right.

CL: Okay, tell the story.

Michael: What happened was Postgres was a QUEL-based system. And in 1994 two Berkeley students, Andrew Yu and Jolly Chen, picked up a SQL parser and created PostgreSQL. That was done by the academic project.

Then, in the miracle of open source, the way open source is supposed to be, this pickup team of programmers who had nothing to do with Berkeley, whom I didn't know, picked up the public domain Berkeley PostgreSQL and ran with it, and they have been running with it for the last 30 years.

And I think that Postgres is the only database system that is not controlled by some commercial venture. And I think it's the way open source is supposed to work.

CL: Yeah, totally. It just feels so amazing that you weren't like really directly involved in that transition. Right?

Michael: No.

CL: And then how does it feel, watching kind of like a child growing up without you? Or is it exactly what you expected for open source, like you say?

Michael: Well, I mean at the time I was involved in commercializing Postgres with a different venture. And so the fact that some other people picked up the Berkeley code line I thought was wonderful. And so, I mean it didn't at all conflict with what we were trying to do and so it was totally independent of me or the commercial Postgres company.

CL: Okay, so I want to talk to you about your paper in 2005 arguing one size fits all was bad, and that the general purpose database would fracture into purpose-built ones like columnar stores, streaming, in-memory, and all that.

What was the signal or what you observed in the industry that kind of drew that conclusion?

Michael: Well, it wasn't that at all. So I was involved in a company called StreamBase. And so we had a streaming platform way back when and it was architected nothing like the Postgres row stores of the world. And I had started to be involved in an enterprise that had the gist of an idea for a column store. And it was totally obvious that column stores were an order of magnitude faster than row stores on data warehouses.

So here were three technical examples of markets where you could just beat the heck out of a traditional row store with something else. And so based on a sample of three, there were at least three perfectly reasonable, totally different implementations that were each wildly faster than the other guys in a vertical market.

And so that was the gist of the 2005 paper and I think it's completely and totally true today. I mean, I think ClickHouse is an exemplar of a new system, I think-- Is it Pine Tree that does vectors?

CL: Oh. Pinecone?

Michael: Pinecone, yeah.

CL: Right, right, right. And ClickHouse.

Michael: Right and ClickHouse. And so I think, you know, there continue to be examples of special purpose systems that are considerably faster than the general purpose ones. And I think meanwhile, you know, Postgres is of course being extended to support all kinds of stuff, but it is still a single node system and it is not a column store.

And they've deliberately chosen not to make it a column store because in effect if you want to support a column store and a row store, it's two different engines supported by a common parser. And the Postgres guys decided not to do that. Also in the data warehouse market the databases get gigantic and you need a multi node system. And so again the Postgres guys decided to pass on that.
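To make the row-versus-column point concrete, here is a toy sketch in pure Python (nothing like a real engine's compressed on-disk pages, and the table is invented): an analytic query that touches one column scans far less data in a columnar layout.

```python
# Toy illustration of row vs. column layout for an analytic scan.
N = 1000

# Row store: each record is stored together, so summing one column
# still walks every field of every record.
rows = [(i, f"name{i}", i * 1.5) for i in range(N)]
total_row = sum(record[2] for record in rows)

# Column store: each column lives in its own array, so the same query
# touches only the one column it needs.
amounts = [i * 1.5 for i in range(N)]
total_col = sum(amounts)

print(total_row == total_col)  # True: same answer, far less data touched
```

On a wide warehouse table the untouched columns dominate, which is roughly where the order-of-magnitude gap he cites comes from.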

So I think they've engineered really smart for a lowest common denominator system that has been extended with all kinds of stuff. And simultaneously there are a bunch of special purpose engines and I expect that to continue. I mean I don't see that stopping.

CL: Yeah. I think the interesting part is that, for example, PostgreSQL is so extensible. So there are, say, vector store extensions or columnar extensions. And it seems like people are still putting things into that. Right? Even the stored procedures or different language runtimes.

Michael: One of the things that mystifies me is that stored procedures, user-defined types, and user-defined functions are really a terrific idea, and you would think they would just take over the world because they work beautifully. And so back to my example of the 360-day calendar.

If you do that 360-day computation inside the database, that's a factor of three faster than if you have to retrieve the two dates out to user code, do the subtraction out there, and then put the answer back. It's just wildly performant in all kinds of areas.

The only explanation I can give is that there isn't really a good debugger for stored procedures. It would be a great idea if somebody built the tooling necessary to support types, functions and stored procedures. And I hope that happens someday.
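To see the round-trip point concretely: Python's built-in sqlite3 module lets you register a user-defined function with the engine, a small stand-in for the Postgres UDFs he's describing (the bonds table and its numbers are invented for illustration). The interest computation then runs inside the query instead of shipping dates out to application code:

```python
import sqlite3
from datetime import date

def days_30_360(bought: str, sold: str) -> int:
    """Simplified 30/360 day count, callable from SQL."""
    b, s = date.fromisoformat(bought), date.fromisoformat(sold)
    return (s.year - b.year) * 360 + (s.month - b.month) * 30 + (s.day - b.day)

conn = sqlite3.connect(":memory:")
conn.create_function("days_30_360", 2, days_30_360)  # register the UDF
conn.execute("CREATE TABLE bonds (bought TEXT, sold TEXT, rate REAL)")
conn.execute("INSERT INTO bonds VALUES ('2024-02-01', '2024-03-01', 0.05)")

# The whole computation happens in the query; no rows round-trip to user code.
interest = conn.execute(
    "SELECT days_30_360(bought, sold) * rate / 360.0 FROM bonds"
).fetchone()[0]
print(interest)  # 30 days at 5%, over a 360-day year
```

In Postgres the equivalent would be a `CREATE FUNCTION` in PL/pgSQL or another procedural language; the debugging-tooling gap he mentions applies to those.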

CL: So what I'm hearing is that stored procedures, or kind of putting compute next to the data, make sense because of course you don't have to round-trip to userland.

Michael: Right, exactly.

CL: But ergonomics for debugging or-- That is harder than the other way around.

Michael: Yes.

CL: Okay.

Michael: So I mean it's tough to build stored procedures and, you know, user-defined functions. It's tough to build them, tough to debug them. So the tooling just isn't there.

CL: Yeah, I had firsthand experience. It was early, I guess almost 10 years ago, trying to build an entire REST server into Postgres through PLV8. And, yeah, as you said, it's very hard to debug. Haha.

So I think there is still room for innovation there. And then I'm curious about one of your latest ventures, DBOS. Right? It's essentially Postgres as kind of a durable execution engine. Can you tell us more about that?

Michael: Well sure. I mean like all the other things I've done, that started off as an academic research prototype. And I think it started off with, you know, Linux is 40 years old and it doesn't scale worth the crap. And so one of my colleagues, Matei Zaharia, who was at Stanford at the time, is now at Berkeley and was the founder of Databricks.

So he told me, this was maybe 2019 or anyway quite a while ago, that Databricks on a day-in, day-out basis was managing a million Spark instances, trying to execute them in parallel, and they needed a scheduler that would operate on a million things. And they quickly discarded the Linux scheduler, saying it just doesn't scale. And so they said, well, the answer is you put the scheduling data in a Postgres database and then you run the scheduler as a database application.

And that made me realize that not only scheduling, but the file system and most everything else in Linux would be better off as a database application.
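A toy version of "the scheduler as a database application" is easy to sketch. DBOS actually runs on Postgres; this sqlite3 stand-in (with an invented tasks table and priorities) just shows the shape of the idea Zaharia described: scheduling state lives in tables, and claiming work is a transaction, so it inherits the database's durability and concurrency control.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, state TEXT, priority INTEGER)")
conn.executemany("INSERT INTO tasks (state, priority) VALUES (?, ?)",
                 [("ready", 5), ("ready", 1), ("running", 9)])

# Claim the highest-priority ready task; the SELECT and UPDATE commit together.
with conn:  # one transaction
    task_id = conn.execute(
        "SELECT id FROM tasks WHERE state = 'ready' "
        "ORDER BY priority DESC LIMIT 1"
    ).fetchone()[0]
    conn.execute("UPDATE tasks SET state = 'running' WHERE id = ?", (task_id,))

print(task_id)  # the priority-5 task is claimed first
```

Because the claim is transactional, a crash mid-claim leaves no half-scheduled task, which is the "scheduling was transactional" property he mentions next.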

So as a research project we constructed DBOS, which was intended to be a Linux replacement. And we got enough of it to work that it was clear it was performance viable and had all kinds of nice features. Scheduling was transactional, the file system was transactional, it ran multi-node and multi-user, and the research prototype ran that way. It had all kinds of really nice features.

The students were, like students everywhere, completely focused on let's do a startup. So we said, let's do a startup. We had this replacement for Linux and we also had this bunch of extensions to JavaScript that would allow you to very easily provide durability.

And so those were the two ideas we could leverage. And the VCs, when we started trying to raise money, immediately barfed all over the replace-Linux idea and said, maybe in your next lifetime; focus on this programming environment that does durability. And so this was basically a requirement of getting funding.

And so that's what the commercial DBOS company has focused on is the programming environment. And they've moved to run on top of Postgres, they've moved to TypeScript and Go and Java. And so you know, they're building out, you know a complete runtime and they seem to be doing very well.

So check it out. It's very, very easy to use and blindingly fast.

CL: All right, we'll definitely link it in the show notes. So in 2010 you were pretty loudly critical of NoSQL, when MongoDB and Cassandra were the hottest thing in tech. And then you called Hadoop dead before the rest of the industry caught on. So both times you were right.

So how do you develop such conviction against, I guess that was the hype at that time, right?

Michael: Well who do you want me to pick on? Hadoop or Cassandra or Mongo?

CL: Let's do Mongo because they're still public. Haha.

Michael: Okay, so Mongo, I mean at the time was a complete piece of crap. I mean their transactions didn't really work. They had a bunch of ad hoc, I mean they-- Here was the marketing message: Transactions are too slow, so we're going to do an anemic version of them and you don't want a high level language because it's too slow.

And over 40 years I've learned: don't ever bet against the compiler.

CL: Okay.

Michael: Compilers, database optimizers, think of them as a SQL compiler, may not have been terrific in the 80s, but they've gotten awfully good. And so betting against the optimizer and betting against transactions just seems stupid.

And of course at the time the NoSQL world was 40 or 50 different systems, all with different query notations. There was no standard. Over time what's happened is Mongo has junked their implementation and replaced it with one they bought from WiredTiger, I think, which had a real transaction system in it. And they've added joins and, you know, now they have this strange mix of SQL and NoSQL.

And so if you squint, Mongo has moved towards SQL. But I think at the time, you know, it was technically foolish to bet against the compiler, and the entire world needs transactions. Even though at the time Jeff Dean of Google claimed eventual consistency was the answer, it was crystal clear to most everybody that that was nonsense.

CL: Yeah, I remember playing with MongoDB and Couchbase when they were first out. It was like, okay, so you are expecting me as a user to do all this kind of denormalization myself because there is no transaction. Okay.

Michael: Right. I think, you know, Mongo was never very fast. You know, this was just all marketing bullshit. But I think in their defense, they had an astronomic amount of technical debt. They managed to get enough sales that they got enough money that they could then replace the implementation. And so now they just have a thing that doesn't quite look like anything.

But I think what's really true is that Mongo and Cassandra are moving toward looking like SQL. Meanwhile, Mongo and Cassandra did have a couple of very good ideas.

The trouble with a relational database system is you gotta define a schema up front, you gotta define tables upfront, you gotta define cursors, you gotta define a query. And so there's all this crap you have to do in order to get going. So the out of box experience is miserable, whereas it's much better with the NoSQL guys.

And so the relational guys are moving in that direction. Also supporting-- The big thing that Mongo and Cassandra have done is support JSON as a data type. And the SQL guys have all done that and that's a good way to encode what amounts to sparse matrices. So I think the two camps have drifted very close together.

CL: Yeah, I feel it's almost like the industry trying to innovate on that front. And then I remember after the NoSQL movement, people were saying "not just SQL," so traditional SQL plus the document store. Especially Postgres, which has not only JSON but JSONB, the binary format, so it's extremely fast to do queries for JSON traversal.
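In Postgres the traversal CL describes uses the jsonb operators, for example `body -> 'user' ->> 'name'`, often backed by a GIN index. As a runnable stand-in, SQLite's built-in JSON functions show the same pattern of traversing a document inside the query; the document here is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (body TEXT)")  # would be a jsonb column in Postgres
conn.execute(
    'INSERT INTO docs VALUES (\'{"user": {"name": "ada", "logins": 3}}\')'
)

# Traverse into the document inside the query, with no app-side JSON parsing;
# json_extract('$.user.name') plays the role of Postgres's body->'user'->>'name'.
name = conn.execute(
    "SELECT json_extract(body, '$.user.name') FROM docs"
).fetchone()[0]
print(name)  # ada
```

This is the "JSON as a data type" convergence Stonebraker describes: document traversal becomes just another expression the SQL engine can evaluate and index.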

Michael: My favorite joke is to say at the beginning the NoSQL guys said "no, don't use SQL." Then that quickly morphed into "not only SQL," which is what you mentioned. In my opinion, it's now moving toward "not yet SQL."

CL: Haha. Not yet SQL. Indeed. So for those kinds of contrarian takes, have you ever been contrarian but wrong?

Michael: Have I ever been wrong? Let me talk about large language models, agentic AI.

CL: Oh great.

Michael: We started playing with trying to do what's now called Text-to-SQL. And so there are a couple of public benchmarks, Spider and BIRD, where large language models are pretty good at it. Like the leaderboard is 80-something percent in accuracy. And so you would think that the stuff ought to work. But we've been trying large language models and agentic AI on real data warehouses; we've got access to four of them.

One of them is the MIT Data Administration Warehouse, which is an Oracle database with 1,400 tables. And we tried out the LLMs that everyone else was touting, as well as our own, and accuracy was about 10%. Not 80%, 10%. And if you gave the LLM the FROM clause, meaning here are all the tables you need to use, and the join terms, meaning here are all the join connectors you need to use, then accuracy improved to 30-something percent. But again, there's a 50% gap.

And so on real warehouses, LLMs and agentic AI don't work worth the crap. And the reason for it is actually not unusual. Number one, the MIT data warehouse is of course not in what's called the pile, so LLMs can't train on it. And the general wisdom is that if you haven't seen the data a couple times before, you have no chance of being able to retrieve it.

So data not in the pile is number one. Number two, real data warehouse queries are like 100 lines of SQL and Spider and BIRD are 20 lines of SQL. So undergraduates producing simple SQL is just not the workload that real warehouses are subjected to. So more complicated queries.

The third thing is idiosyncratic data. So for example, at MIT, buildings have a number, they don't have a name. So the Stata Center, which is the computer science building, is officially Building 32. And MIT has a thing called the J term, which is a one-month term in January. And so there's just all kinds of idiosyncratic data, which of course an LLM trained on the pile is not going to do well on.

And then the fourth thing is that real data warehouses have materialized views, have overlapping semantic schemas, you know, they're not clean schemas. And the names of columns are often bizarre things that are not mnemonic. So bizarre schema, bizarre data, complicated queries and not in the pile sort of makes it extremely difficult for Text-to-SQL to work. And so I'm not expecting it to work in the absence of being able to train--

You know, I suppose you could train, you know, a large language model on the actual MIT data, but that's a petabyte of stuff and that's not exactly cheap. And so I'm not expecting the technology to work, but the entire world thinks it will.

And so what I've been working on is very contrarian, which is the following: unless you give it the join terms and the FROM clause, you know, it's hopeless on real warehouses, unless they for some reason are in the public domain, or have simple queries, or don't have idiosyncratic data, or have clean schemas. If all of those hold, then I withdraw my objection.

But I've never seen a data warehouse that looks like that. So anyway you've got to give it the FROM clause and the JOIN terms and that takes a human.

CL: Right.

Michael: And so we've designed an interactive system called Rubicon, where you have to tell it this metadata information. And in addition a lot of the-- I guess my favorite example is we started working with the Munich Germany Department of Transportation because my student who's working on this stuff is German and is back at the Technical University of Munich.

So the Department of Transportation, which is called the Department of Mobility in German, so think of it as the Department of Transportation. They have six full-time engineers who are answering citizens' queries or citizens' complaints. And a typical complaint is: I don't have time to cross this intersection before either the trolley goes by or the light turns.

And so the data that they have available to be responsive to that sort of query is: one, German federal government regulations, which are text; City of Munich regulations, which are text; CAD drawings of all the intersections; the trolley schedule, which is a SQL database; and the light sequencing, which is a SQL database.

So you've got to query those five databases. And it seems to me that wanting to downscale all this stuff to text and then try to do the joins in an LLM or in agentic AI is just not very likely to work; you want to upscale everything to SQL. And so we are upscaling data to tables and then doing the join in what looks like a query optimizer.

So instead of betting on an LLM as the overall orchestrator, we're betting on SQL and using an LLM where it looks like it will work. So this is a very contrarian point of view and one that I'd say a trillion dollars is arrayed against us saying no, no, no, no, no let's do it with traditional agentic AI.

CL: We'll fix everything. Haha.

Michael: We'll see how it works out.

But in the research racket you never are successful by doing what the other guys are doing. It always pays to be a renegade.

CL: Yeah. What an amazing theme. Okay, so you say the project is called Rubicon.

Michael: Called Rubicon, yeah.

CL: Okay. Okay. And I know you're also involved in the Beaver benchmark set.

Michael: Yeah.

CL: How are they connected?

Michael: Well Beaver is a Text-to-SQL benchmark on which the best stuff in the world gets 30ish percent. Even after you give it the JOIN terms and the FROM clause. And so we've abstracted the details of these various data warehouses into a benchmark. And I'm tired of people saying BIRD is a good benchmark. So my systematic answer is prove you can do this on Beaver and then we can talk.

CL: Mhm. Right.

Michael: And so we've learned from how difficult Beaver is in building Rubicon. So it's the same group that's doing this.

CL: Okay, very cool. And I think there was just a new benchmark set released by Berkeley called DAB. Also cross database queries and all that looks pretty interesting.

Michael: Yes. Rubicon and Beaver are focused on being able to retrieve information. The Berkeley benchmark is focused on taking actions.

CL: Mhm.

Michael: Which is fundamentally a great deal harder.

CL: I see. I see.

Michael: And so I'm not optimistic about anything harder than what we're doing-- I say Rubicon is a query notation dumbed down to something that looks like it might work. And I think the Berkeley guys, that's a great goal, but without a breakthrough in LLMs and agentic orchestration, I think this stuff is turning out to be very, very, very hard.

And I keep reading that the share of successful production agentic AI projects is like 20% or 10%. People are having trouble just getting the basics. And so I expect the going is going to be a lot tougher than people think. And picking the low-hanging fruit I think is of course a great idea. And I think Claude and Anthropic is a fabulous success story. But I think that this stuff is not going to universally be true. We'll see.

CL: Yeah. So I wanted to ask you about the venture side of things, because I don't think I know anyone that has started more than 10 companies. And you had a lot of bets, a lot of them coming from academia. But which idea have you killed along the way that you didn't think was worthwhile turning into a venture?

Michael: I think--

I have this philosophy to only work on stuff for which somebody cares what the answer is. I mean that's the difference between theory and systems. And so everything I've done has been because somebody was interested in the answer.

We commercialized a system called Morpheus back in the late 90s and that turned out not to go anywhere. I mean that was a wide scale data integration project which didn't really work very well.

I think the problem with data integration is the data is always dirty and the cost of cleaning it is really, really high.

And so my ideas in that area haven't been very successful. And I think, you know, StreamBase was ahead of its time. You know, it got an exit, but it wasn't a very good one. I think Vertica was very, very successful. Postgres was very successful. Ingres was very successful.

I think Tamr is still a private company, but I think it will be very successful, and I have high hopes for DBOS. I think that will be very successful. And I think the gist of DBOS is that the world is working on foundation models, but I think the big boys are going to win that battle because they have 100 times more money than anybody else.

And that infrastructure is where the big play is, which is what DBOS is doing. And a bunch of other companies are providing the underpinnings underneath foundation models. So I think that's where the real action is. And so I'm a big fan of companies in that area, which is again, kind of contrarian.

CL: Not chasing the fad. Right. And then I think this is such inspiring wisdom: you only work on things where people care about the answer. Right? And then you famously, I think, say that the best way to get your idea into the real world is to start a company and threaten the elephants.

Michael: Well, two things. One is, for most academics, you know, it takes shoe leather to go find real world applications. And that's, you know, painful. You know, like getting access to four real world data warehouses was a huge pain in the butt. And I think, you know, going to that level of effort is something almost nobody is willing to do, because you get rewarded for writing papers, not for making sure that you're doing something that people care about.

So I think the whole academic field is moving into what I call least publishable units, which is working on problems that are easy to solve, not problems that are important and for which who knows if anyone's interested. And so I think for some reason I just can't do that.

First of all, I'm not a very good theoretician and so, I'm using my skills at what I'm better at, which is not theory. And so after you start a few companies, you get to know a lot of people. And so I can talk to a lot of people and they all tell me where their pain points are. And that's incredibly helpful.

CL: Yeah. Solve hard problems. So I think like over the years you've trained like generations of engineers and researchers at both Berkeley and MIT. What's your advice for someone kind of relatively new to data or their career? What should they learn and what should they ignore?

Michael: I have a whole bunch of reactions.

First of all, I think 90% of all enterprise programmers are doing maintenance. And I think figuring out how to get Claude to do maintenance for you in an existing system is really something that could bear a lot of fruit.

And I think the problem is that-- Here's my cynical way of describing how real development works. You started out 30 years ago with a green field and you wrote something fairly clean, and it has been migrated and patched and extended and changed once per year since then. And it's now a complete mess.

And it's difficult to do maintenance on code that's a complete mess. It remains to be seen how successful Claude would be, but applying LLMs to something important, I think, is the first piece of advice.

Another piece of advice is there's no reason for me to believe, again, this is my cynical view of things, that you start off with-- You have a big application doing something or other and it's split up across groups until you've got four or five groups or more working on any given release.

And so you have to coordinate between all of them. And you start off with management saying, I want this implemented as quickly as possible with minimum budget. So the way to do that is to make the current mess a bigger mess.

CL: Yeah. Cheap, fast, good. Haha.

Michael: And so I think somehow that has to change. And so I would be working on how you can cost effectively, inside the enterprise, not just pile band aids upon band aids. I think schema evolution is a huge problem, which is what we're talking about. And I think somehow getting better at doing that--

The other thing: one of my good friends is Andy Palmer, who used to be the CEO of Tamr and Vertica. And for a while he worked at Novartis, the drug company. And his whole shtick was you want to outsource everything that isn't your crown jewels.

There's no reason in the world why anybody is running their own email system, for example. MIT is running its own email system, and that's the dumbest thing on the planet. The lab I'm in, CSAIL, is also running its own email system.

So go through and figure out what your crown jewels are and try and outsource everything else. You know, try and knock down the amount of stuff you've got to deal with.

If you're in the research racket and you have a PhD, get an academic job at the absolute best place where you get an offer. When I took an academic job, I accepted the offer at Berkeley, which was by far the lowest paid position. So bank on high prestige and low pay, because that will get you the best students.

CL: Right.

Michael: Because without good students you have no chance of doing anything. And then get a mentor like Gene Wong was for me, who can help you get going. So somebody who knows the ropes. So don't take any position where you're the only person in an area because that means you don't have a mentor.

And if you absolutely have to, figure out how to deal with that. So for instance, when I arrived at MIT in 2000, there was nobody in databases, no courses, no faculty, no students, no nothing.

CL: Wow you're bootstrapping that. Haha.

Michael: And so what I did was, well, Brown, Worcester Polytechnic, Brandeis, and UMass Boston all had a database presence. So we formed a multi-campus research group, in effect, and that allowed us to get going. But figure out how you can get a research group off the ground.

And if you're in industry, my point of view is go work for a startup. There's absolutely no downside to doing that because if you fail, you can just go get another job and you'll get much better experience than if you work for one of the big boys.

And I think the only counterexample to that is the big web platforms have all the data and they don't share it with anybody. And so if you need to do cloud oriented research, you pretty much have to work for one of the big boys. I don't see how to get the data any other way.

CL: Yeah, thank you for sharing the wisdom. I think the thread is clear: pick the important thing, prioritize, and execute, regardless of the field. Right? So thank you for sharing that.

Michael: Sure.

CL: One thing I want to ask before we go into the lightning round: well, you are 82 and still starting companies, publishing papers, and fighting with incumbents. What keeps you going? What's the motivation for you?

Michael: I would be bored to tears otherwise. I'm always expecting to run out of ideas and become irrelevant. And when that happens, I hope I have the good sense to vanish into the sunset like a lot of people my age have done.

But until that time, I like what I'm doing, I'm having fun, and I don't really have any interest in spending my life on the golf course. And hanging around MIT, it has some of the smartest people on the planet and they really keep you on your toes, which I really like. So I can't imagine going off to a gated community in Florida or any such thing.

CL: Yeah, and the ideas keep flowing, you keep having new ideas, and you keep doing it.

Michael: So far.

CL: So, before we wrap, we're going to put you in the Data Debug round. Quick-fire questions, short answers. Are you ready?

Michael: Sure.

CL: Okay, first programming language you loved or hated?

Michael: Fortran.

CL: Love or hate? Haha.

Michael: I loved it. I mean, the first language I ever learned was Fortran. And of course Luddites, you know, like the first thing they learned.

CL: What is your go-to data set if you're testing something?

Michael: The dirty secret, which will be told to you by most of my colleagues is I don't really code, so the answer is, "not applicable."

CL: Okay, so it depends on what their go-to data sets are.

Michael: I mean, I think the answer is there are 10 startups where I know people, and depending on what I'm interested in, I get them to test out whatever I want to know.

CL: Right.

Michael: And also I get students to do it too.

CL: Right. And then you don't dictate what data set they use?

Michael: No.

CL: What's one lesson outside of tech that influences how you build and work?

Michael: Simplicity is the answer. It's always the answer. Anything that's complicated is unlikely to be a good idea.

CL: Wow, that's real wisdom there. Okay, what's your favorite podcast or book that is not about tech or data or anything in our field?

Michael: Well, there's a lady named Heather Cox Richardson. I don't know if you know her. She's a professor at Boston College and writes political and historical commentary that I really enjoy. So I read her religiously.

I read Apple News religiously. I read pieces of the New York Times. I keep up on current affairs. And I read some technical blogs these days, mostly on agentic AI.

CL: Thank you for sharing. Well, that's all we have today. Thank you so much, Mike, again, for being here on the podcast.

Michael: Sure, CL. Thanks for all the interesting questions.