Generationship
35 MIN

Ep. #54, Human-like Memory with Vishakha Gupta

about the episode

On episode 54 of Generationship, Rachel Chalmers sits down with Vishakha Gupta to explore the hidden infrastructure challenges behind modern AI. They unpack why multimodal data systems are still fragmented, how graph and vector approaches can be unified, and what it takes to build production-ready AI pipelines. This conversation offers a systems-level perspective on scaling AI beyond prototypes.

Vishakha Gupta is the founder and CEO of ApertureData, a company focused on building unified data infrastructure for multimodal AI systems. She previously worked at Intel Labs, where she developed systems technologies at the intersection of hardware and software. Vishakha holds a PhD in Computer Science from Georgia Tech and a Master’s degree in Information Networking from Carnegie Mellon University.

transcript

Rachel Chalmers: Today I am thrilled to have Vishakha Gupta on the call. Vishakha is the founder of ApertureData. She has experience at Intel Labs in designing and developing systems technologies optimized for new hardware.

She has a PhD in computer science from Georgia Tech and a Master's of Science in Information Networking from Carnegie Mellon University. She's a wonderful thinker and speaker about all kinds of topics. Vishakha, thank you so much for coming on the show.

Vishakha Gupta: Thank you so much for inviting me, Rachel.

Rachel: You and Luis, your co-founder, must have been really frustrated to leave Intel and do a startup. What was a frustration or an insight with the way visual and multimodal data were being handled at Intel that made you decide to leave?

Vishakha: Yeah, you know, our time at Intel itself was really fun because we got to work with people who are so close to the hardware, software co-design boundary, which is something that-- That's the level of the stack that I really like to build in. But when we started talking to the data scientists and the AI team, this was the time when ImageNet became popular.

And then suddenly we are like, oh, we won't have to be observing or watching the images and videos to understand what's going on. We could automate everything and the models would help us figure out was the person walking, was there some security threat somewhere? So there was all this stuff happening. There was a lot of excitement around what machine learning could help us accomplish.

But people were building data solutions like they were just streaming information. And what I mean by that is, you know how you can just look up a file, like give the date of this podcast and who the speakers are, and you just find it? And you stream it, right? You're not thinking about extracting more and more information. Like, what were the intonations, what were the emotions? When were the two people arguing? When were they talking cordially?

There's all this stuff that as machine learning got better and today, like today's VLMs and LLMs get better, you can get more and more information, which means you're not stuck with the original just timestamp and file name metadata. You go back and enhance it, you go back and keep enriching it, and then you can do better and better analytics on it. And you don't always have to run the entire AI machine learning pipeline.

But we kept seeing the machine learning models or the AI models that were coming up. They were the shiny things like, "can we improve this model? Oh, now it can detect a brain cancer in our test data set." But no one was thinking about the challenges at scale or what it means when you go into production. And everyone would just bring together different data sets in various folder hierarchies, kind of dump some metadata in Postgres and try to link it all externally.

And every once in a while someone would say, "hey, maybe vector search would help us find things better. How do we extract embeddings?" But that was all it was. It was literally like trying to make a Michelin star dish with grass from the backyard, just because that's all we got. So that was one side of it, because--

There was also the reality that data scientists and AI teams in general are more trained to look at data and understand it and then draw conclusions or build something on top of it. They are not the people who are necessarily trained, or should be expected, to spend their time doing this database cobbling or even knowing what's the best database. But that's what they were expected to do. So they had to wing it or they would create Frankenstein's data monster.

Rachel: Right. They're Michelin chefs, they're not the farmers. They need somebody else to produce the ingredients.

Vishakha: Exactly. And so it was very easy to say, hey, production is going to be in the future. First we have to prove the viability of AI itself. It was true that, you know, first you had to prove that the promise of machine learning and AI was real. So I can understand that.

But you don't create a database or the right data infrastructure in a matter of minutes. Even today with vibe coding, I bet it'll still take us a very long time to get the data infrastructure right.

And so somebody had to think about it. And you know, we at Intel, because Intel had like, everybody was a customer, right? Every database vendor, everything. So it was conflicting. So we would do reference prototypes and we are like, well, if this has to be real, it needs to be built like a product. So all of those things coming together were the trigger to leave Intel and build ApertureDB as a product that companies can use as they build their AI pipelines.

Rachel: How did you get it so right? I mean this was a while ago that you left. How did you see where the market was going so accurately?

Vishakha: Well, I can't say that we predicted that we'll all be fearing for our jobs today because AI would get so good at creating code and everything. But there were certain things that we observed. One of the differences in our approach, we were coming at it from the systems level.

If you think back to that time, back when labeling was a big deal, back when suddenly you realized vector search could give you search in vast collections of data, people were coming at it from like, hey, I'm trying to label, but I am having a hard time indexing my annotations or finding what data is not annotated. They weren't necessarily always looking at the entire pipeline.

For example, there is the data collection phase, there's the data curation, then labeling, then training, then validation and inference, and then analytics. Coming from a systems standpoint, everything is an application. If you are at the OS or middleware layer, you look at, okay, what are my applications? What are the characteristics that we have to face?

So we actually looked at that and we realized-- So for one thing, machine learning and AI would make it possible for us to start making complex data more accessible. That was a given based on what was going on even at that very moment: models could tell you what the activities in a video are. They were not the best, but you could imagine that they would get better over time, right?

So that was the aspect of the data itself. Then the other aspect was every time you have this sort of large data, there's always some meta information that you start with, even if you have not annotated it. Just when did you collect it? Where did you collect it, who collected it? That's just fundamental metadata that's there. And over time it can just get richer and richer. So we are like we got to represent the metadata somehow.

We looked at relational databases, which is everyone's go-to solution, right? But this information was always very connected. So if you were looking at a document, who created the document, which org does that person belong to, which organization owns that? If you start drawing those relationships, you very quickly arrive at this graph structure. And it so happened we had also worked on a really fast in-memory graph database at Intel just prior to us going down this path.

So we were very open to the fact that not only can graphs be made scalable and low latency, they can actually be the right data model to represent metadata like this. So we went with graph because it was the right data model. And if you fast forward to today, people have again come back to realizing the value of knowledge graphs and how to think about them and why they are important in capturing some of this information.
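To make that connected-metadata point concrete, here is a minimal sketch in plain Python with networkx of how the document/creator/organization example above naturally becomes a graph and can be enriched later. The node types, properties, and relationship names are illustrative assumptions, not ApertureDB's actual schema or API.

```python
import networkx as nx

g = nx.MultiDiGraph()
g.add_node("doc:42", kind="Document", title="Q3 roadmap", created="2024-06-01")
g.add_node("person:vishakha", kind="Person", name="Vishakha")
g.add_node("org:aperturedata", kind="Organization", name="ApertureData")

g.add_edge("person:vishakha", "doc:42", rel="CREATED")
g.add_edge("person:vishakha", "org:aperturedata", rel="MEMBER_OF")

# Enrichment later: attach a topic extracted by a model, no schema migration needed.
g.add_node("topic:recommendations", kind="Topic")
g.add_edge("doc:42", "topic:recommendations", rel="MENTIONS")

# Traversal: which orgs own documents that mention this topic?
for doc, _ in g.in_edges("topic:recommendations"):
    creators = [u for u, _v, d in g.in_edges(doc, data=True) if d["rel"] == "CREATED"]
    for person in creators:
        orgs = [v for _u, v, d in g.out_edges(person, data=True) if d["rel"] == "MEMBER_OF"]
        print(doc, "created by", person, "from", orgs)
```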

And then the third thing we realized was, well, we are also getting to a point where you can use a machine learning model to extract signatures and start doing similarity search. And if you're talking about such large volumes of data-- That was our theory. At the time, I remember one statistic about an exabyte of video data, right? And this wasn't even counting how much document stuff there is, how many images there are.

So it was going to be really large. And so if you could find by similarity and narrow down the search space, that sounded like the right thing to do: first narrow down your search space and then get to the precise things with the metadata you manage. So we kind of arrived at it step by step by talking to people and looking at what was happening.
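A rough sketch of that "narrow by similarity, then get precise with metadata" flow, with plain numpy standing in for a real vector index; the product fields and the brute-force search here are illustrative assumptions, not how ApertureDB implements it.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 512)).astype(np.float32)   # one vector per catalog item
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
metadata = [{"id": i, "in_stock": bool(i % 3), "brand": "A" if i % 2 else "B"}
            for i in range(10_000)]

def search(query_vec, top_k=100, **filters):
    q = query_vec / np.linalg.norm(query_vec)
    scores = embeddings @ q                        # cosine similarity on unit vectors
    candidates = np.argsort(-scores)[:top_k]       # stage 1: narrow the search space
    # stage 2: get precise with the metadata you manage
    return [metadata[i] for i in candidates
            if all(metadata[i].get(k) == v for k, v in filters.items())]

hits = search(rng.normal(size=512).astype(np.float32), top_k=50, in_stock=True, brand="A")
print(len(hits), "similar, in-stock, brand-matched items")
```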

And then we realized, well, one of the focuses for us was simplifying the life of AI teams, which meant we could not live with this data in buckets or file systems, metadata in a graph, and then vectors in some vector search index. We had to unify it. We had to make a query engine that would itself be able to decide, okay, this is where I go to the metadata, this is where I go to the embeddings.

So that led us to certain choices: represent everything in the graph so that even when you did vector search, you could trace it back to where things are; support some fundamental metadata queries, kind of like what you do with relational databases; and make it easy to access data. And then we mapped it to, like, what do we know about databases? We understand you can query for salaries of people and do an average.

Traditionally that is what is very common. And so we're like, what does that look like for multimodal objects? And we were a bit more influenced at the time by training workloads. So we're like, oh, if we can represent annotations, we could pull pixels in the regions of interest, or if we were feeding this to a training pipeline-- Every time, we see there is a step where you fetch large data and scale it down to small data because that's what the model accepts.
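As a concrete illustration of that fetch-and-scale-down step, here is a minimal sketch using Pillow; the file name, box coordinates, and target size are made-up assumptions, and a production system would run this kind of operation inside the database or pipeline rather than in client code.

```python
from PIL import Image

def fetch_roi(path, box, size=(224, 224)):
    """Crop an annotated region of interest and scale it down for a model."""
    with Image.open(path) as img:
        return img.crop(box).resize(size)   # box = (left, upper, right, lower)

# e.g. fetch_roi("frame_000123.jpg", (40, 80, 540, 580)) -> a 224x224 patch
```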

So we started building those operations into our query language. So it became like this go-beyond-the-database thing, because it supported some pre-processing. But the data scientists using our database, for example, to do any image training or video training at the time, they were very focused on computer vision. They did not have to deal with FFmpeg, they did not have to understand cloud buckets, they did not have to go to multiple databases.

They talked to one database. That was the goal we had. And because we kind of approached it from that systems standpoint, I think we captured the fundamental requirements of what happens when data needs to be stored, what happens when it needs to be searched, what happens when it needs to be retrieved. And we had this whole, I call it the Intel blue blood of performance and scale, kind of brought into our system.

So we were always like optimizing for it, which meant production level at large companies, it would still keep working and scaling the way people want it. So I think that's how we kind of got it right. I wouldn't say like we had this major vision or something, but we followed and listened and observed and then used our own skills and expertise and background applied to it.

Rachel: There's so much I want to respond to in that. But listeners if you can hear tiny meows, it's because the podcast cat Alice just climbed onto my lap. She's clearly a multimodal database fan. Haha.

I mean you talked about the teams who are still cobbling together vector DBs and graphs and object storage. And honestly that's still the state of the art in a lot of places. Can you give us a real example where that daisy chain of different products is slowing things down and how ApertureDB can speed things back up again?

Vishakha: Yeah. So I can give you grounded examples from some verticals that we have worked with, retail being one. So an example there is you want to build a recommendation system and you want to recommend based on how we shop. How do we shop? We look at things, we like the texture. There's something that appeals and catches our eye. Which means you want to capture or use all those images per product that you have in your catalog, but you also want to ground it in metadata: Is this product available? Is this the brand the user buys?

So it essentially becomes this combination of find similar images to let's say the user saw a sofa in somebody's house and they wanted to buy something similar, but it needed to fit in their house. So you would want them to be able to take a picture, ask you for similarity and then adjust like what sort of colors, metrics, all these. So that's a vector search point. And then you have to do the metadata filtering in terms of is this sofa available, can it be shipped?

And things like that. Right? And then you actually have to display the images in the browser in whatever form factor the user wants. We ran into situations where, because product metadata sat in relational databases with URLs to the asset management system where the images were, and vector search would be something outside, people were just writing their DIY solution to first query, then get the images.

If a million images matched, they had to be able to do the parallel access or become good at downloading all of that stuff, have enough resources to do it and then generate embeddings.

So the solution was let's just pre-generate these embeddings and populate them in a key value store that you can just find similar when needed. But that keeps you in this stale data cycle. You're not using the latest data, you're not combining the latest metadata and you can end up recommending products that when you click on the link, they are out of stock or not necessarily the latest of the user's expectation. And it gets worse as you go from text to multimodal.

So when you're just doing text-based matching, okay, you can store all of that in one place, potentially maybe just an additional vector database. But when you want to start fetching images, or maybe you want to capture signals from product reviews and things like that, the data is scattered across different buckets. You might not have permission to those buckets. You spend weeks, and this is me talking more at a larger enterprise scale.

In smaller companies the problem is less acute because we're not talking like millions and millions of data points. But even then you're talking about a growing set of requirements. Like today you only need a vector search, but tomorrow maybe you realize, hey, a graph RAG would ground us better. So now you got to go evaluate a graph database.

So you have to think about it in terms of not just what your today's problem is, but now you've seen enough. Like going forward, am I only going to be able to stick to my relational database queries? If not, what price am I willing to pay for my engineers spending time in bringing all of these things together? Even if I can automate it, how efficient is it, how secure is it, how consistent is it, how do I do governance on all of this stuff?

It's one of those problems that is very easy to punt because you don't want to pay attention to it, but it's one of those things that if you did, you would have really different outcomes. And the reason I say that is that the delays come in insidious ways in this case.

Like one of the Fortune 50 retailers we work with. When we poked and prodded at how much time people were spending, and then started attaching time and dollar values to it, we essentially realized that every team, to accomplish some stuff, can spend six to nine months just cobbling some of these things together and getting them right if they wanted to scale. And it can cost $2 million per team. And when you have tens of such teams, millions of dollars.

But more importantly, more than the resources, more than how much extra you spend on cloud costs while doing all of this Frankenstein solution, you can very easily reach the wrong conclusion that AI doesn't work if you didn't give it all of the data that was relevant at the right time in the right manner. Well, I mean, you know, it can't divine something on its own. It needs to see what's actually happening and then, you know, correlate things and reach conclusions.

And I've spoken with teams where they were asked to, you know, work with data that they had no way of visualizing because it was complex data. They would manage to query one or two samples out of it, spend three months across three different people, a data engineer, a data scientist, and a data analyst, and reach the conclusion that, oh, it's not getting us the accuracy we want, it's not recommending the right products, let's scrap the whole thing.

It was the opportunity cost, not just in terms of how delayed you get, but also what you conclude. And based on what we discussed about how we built ApertureDB, this is exactly the kind of problem that it avoids.

It helps you unify things. It helps you connect the dots. It helps you query data of different modalities so that you can feed very rich information to your AI models and give it the capability to reason and think across the board like a human, or even better. It has the advantage of speed and the ability to connect a lot more dots than one human can. But you've got to give it the dots to connect. So that's what we fix.

Rachel: Let's come back to that note about human memory because there's a lot there that I want to dig into. What are some of the worst misconceptions that people have around multimodal data and graphs?

Vishakha: That's a great question.

It's one of my pet peeves. People think that if they're just working with text and tables that they don't need multimodal. They think multimodal only means images and videos.

So like I can't tell you how many times I get this thing about, "we are not doing images yet, so we don't need a multimodal thing." Are you doing vector search? Are you doing metadata in your tables? Are you working with text? Then you are multimodal and you can benefit from what we have. It's been a very interesting conceptual mismatch there. Somehow multimodal has become synonymous with image video stuff, which it is not.

Then with graphs it's even worse. And I've written blogs on each of these topics to kind of start to clarify the terminologies. Somehow there is this-- And you work with graphs a lot, so you know the fundamental thing behind graphs is the whole connected data representation. So you know, if you are doing a lot of foreign keys across your tables, you should give graphs a chance. Right?

But people somehow think that they need to have a very advanced schema definition. Like they need to know their schema well in advance if they should use graphs, which is the opposite of what it is. Schemas can be evolved a lot more easily in graph databases compared to relational tables. Then there are the other misconceptions, like graphs can't scale, graphs can't perform. And I think I partially blame the popular options people have, which have in the past not scaled and not given the right response times.

But that's not a blanket statement. With the graph queries in ApertureDB, we can support sub-15-millisecond lookups on a billion-scale graph, and on one machine we've easily scaled to over a billion entities and as many relationships among them.

It's possible to do it if you build it right. It's not fundamentally that the graph model or the graph database can't do something, it just has to be done right.

Then the other thing I recently heard was, well, we just need to do simple metadata lookups. Like we want to find the names of all employees or something. A graph can't do that, it can only do connections. I'm like, no, it's like any database you have. Okay, instead of a people table, there is a person node. Person has a name. You can index the name or index the email and just look it up like you do with any database.
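A tiny sketch of that point in plain Python, independent of any particular graph database: a person node with an indexed property answers a simple lookup just like a row in a people table, while traversals stay available. All names and fields here are illustrative.

```python
people = [
    {"_id": 1, "label": "Person", "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "label": "Person", "name": "Luis", "email": "luis@example.com"},
]
edges = [{"src": 1, "dst": 2, "rel": "REPORTS_TO"}]

# An index on a node property makes point lookups cheap,
# just like a secondary index on a relational column.
email_index = {p["email"]: p for p in people}

print(email_index["asha@example.com"]["name"])        # simple property lookup
print([p["name"] for p in people])                    # "names of all employees"
# ...and traversals (the graph's equivalent of joins) are still there:
print([e["dst"] for e in edges if e["src"] == 1 and e["rel"] == "REPORTS_TO"])
```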

So any fundamental query that a relational database can support, a graph database can support whether that particular graph database tool decided to implement it or not. That's different. But you can. And then it gets even better and better as you start to do more traversals, which is equivalent in terms of joins and stuff. So those are some of the things that they always like-- I don't know how to address them because if I try to write all of this stuff on a website, it'd become a really complicated website. Haha.

Rachel: It's a good thing we have a podcast.

Vishakha: Yeah.

Rachel: Back to human memory. What was your aha moment there?

Vishakha: I don't know. At some point I was reading a lot about these agents and how they are going to be a representation of us, or they're going to start doing some of these tasks. But you know, I kind of consciously started thinking about how I operated, and I was mainly focused on this whole enterprise context, not necessarily on personal life. I think a lot of these things apply in personal life too.

But, you know, since we are more like an enterprise, more like a B2B play, I was very influenced by, you know, how organizations at different levels think. So I kind of went back to the Intel days and I was thinking like, how did I operate? Why were certain things I did successful? Why were some not? And I realized there were some things I just knew, and some things where I knew who knew them, to ask them.

So I could ask those people and then I could search up existing data sources I could match with outside research that was coming in. It was a combination. I would form connections in my head about concepts. I would remember snippets of conversations, some images of like, "hey, this is the person I met at this time in this hallway and we talked about this topic. Let me follow up on that."

That is what made my response or the research direction align and do the right thing well, if I had the right intuition. The intuition part will always be a variable. Right?

But that's what made me realize if you want AI agents to be human quality or better, they have to have a way of doing some of these things at enterprise scale.

Right? It's not only the ability to deal with multimodal data, because I mean, if you notice in that thought process, I was remembering snippets, I was remembering scenes. It wasn't just text, it wasn't tables of information. My brain is not tabular, I don't think anybody is. We just force it into that structure. Right? And so it has to not only be able to natively deal with multimodal data, but also search through vast amounts of information and connect the dots. Like we're talking earlier.

Vector databases give you one of these three things: search through the vast collection, and multimodal embeddings can let you find by text-to-image, text-to-video, or vice versa. I mean, a lot of them don't scale easily, but that's different. But they certainly miss out on the graph nature of multimodal objects. What I mean by that is a video can have clips, frames that are interesting. A document can have paragraphs that are interesting. There's a hierarchy in the representation.

And if you don't understand multimodal natively, you miss out on the specific pieces that you need to pull. So just trying to build something with vector databases doesn't get you the whole way. And so thinking about human memory and all the pieces, you realize you need the graph, you need the vector, and you need the multimodal native. So all of that needs to come together at scale, with the security and privacy stuff that's required, at least for memory in an enterprise context.
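One way to picture that "graph nature of multimodal objects" is a small hierarchy where each level can carry its own embedding and a pointer back to its parent, so a vector hit on a frame can be traced up to its clip and video. The structure and field names below are illustrative assumptions, not ApertureDB's data model.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Node:
    kind: str                              # "video" | "clip" | "frame" | "paragraph" ...
    ref: str                               # pointer into object storage
    embedding: "np.ndarray | None" = None  # per-level embedding, if any
    children: list = field(default_factory=list)
    parent: "Node | None" = None

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

video = Node("video", "s3://bucket/keynote.mp4")
clip = video.add(Node("clip", "keynote.mp4#t=120,180"))
frame = clip.add(Node("frame", "keynote.mp4#t=135", embedding=np.zeros(512)))

# After a vector hit on the frame, walk back up to its clip and video.
hit, lineage = frame, []
while hit:
    lineage.append((hit.kind, hit.ref))
    hit = hit.parent
print(lineage)   # [('frame', ...), ('clip', ...), ('video', ...)]
```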

Even in a personal context, I would say the scale isn't necessarily as big and there is less coordination and less breaking down of silos and stuff, because hopefully you can pull information from different parts of your life together. But those are some of the characteristics where I was like, oh, for memory, you know, that's what we need to do. We need to bring all that and make it feasible to represent it, to query it, to share it, to keep it secure, and then, you know, be able to continue to add rich relationships.

Rachel: I wish my human memory worked that well.

Vishakha: You know, if you think consciously, you already do. Like, you know, the number of times you have offered to help me, it's through joining what I'm telling you with who you know, and who has told you before, maybe in an event that you attended.

Rachel: Yeah, no, the challenge is retrieval. There's a lot in there. It's just indexing it properly.

Vishakha: Yeah.

Rachel: You and Luis take the big jump. You hold hands, you jump into space, and like all good founders, you go and talk to customers. What surprised you, talking to real customers in this space?

Vishakha: At the beginning-- So remember, we were coming from research and suddenly decided to sell and expect paying customers. But very quickly we realized we care a lot about scale and performance, and we would tout the fact that we give you a 35x improvement in your data set preparation. Like I was saying, it's that Intel training. Right? But AI teams and, you know, the very first customer we landed was this Fortune 50 retailer.

And what they liked the most was the productivity boost because it was an AI team. Right? They weren't getting enough attention from IT platform teams to build something custom to them understanding their requirements, which was understandable too. I mean, like, IT teams had their own website and everything to pay attention to. They weren't going to suddenly create a new system that's easy to use by data science, which wasn't even like, okay, it's always evolving. What are you building?

And DIY for them was painful. And so that was one. It was surprising. I had shifted the presentations mainly to focus on the productivity aspect of things. And then performance kind of just tagged along, which was like such a shocker, right?

The other thing we realized was that the world wasn't ready for computer vision. We were, but the world wasn't, because there were other aspects like data collection, permission to use the data, the model accuracy on those things.

I mean, it was good to show in the lab, but I literally had a model run on a brain scan and call it a telephone lobe. I mean, that's where we were at, at some point. And it's still gradually getting there. We've gotten really good with text, but vision language models and things like that, they still need improvement. They're getting better. We're almost there. But we were too early in pushing for it.

And then we talked about the concept confusions around multimodality and graphs. I think I used to confuse a lot of people back then. Now, at least, people have come to terms with, okay, there is this embedding stuff and there is also this knowledge graph stuff. So there is more openness and understanding about it. And then the biggest thing that, you know, I've realized in the last year or so, focusing a lot more on go-to-market and these things, is that there is a lot of human variance that plays a role, and relationships matter so much more than I had previously realized.

You know, a champion who truly sees what you build, who really believes, can make a deal happen in two months. Whereas there are situations where you can prove all your technical chops, you can be really the best product that they've evaluated, but somebody has some other side deal which, you know, I mean, we've all gotten stuck with products we don't like, but they're forced on us because of some level of decision making we don't fully understand.

And that has been like the biggest lesson, in terms of, the product alone can't solve the whole thing. There is this people problem that we gotta solve.

Rachel: That's a perfect segue into my next question. What's one piece of advice you wish you'd gotten earlier about moving from Intel to a startup?

Vishakha: Build a network or a brand online, or both. It's mainly the distribution channel. That's really what it comes down to. You know, as a researcher it's very, I want to say, easy to just not consciously think about the value of marketing. And by marketing, I'm not talking about, like, another SEO growth engine, just finding a way to let people know you are there and what you can solve for them. That's all I mean by marketing, and the sales connection, and the key growth partnerships.

And we think like, oh, if we get the product right today, we can figure the other things out later. But I think time and time again, especially in today's world, if you look at a lot of the companies, they go viral. It's not always because the product is perfect or it works really well. One day you see, oh, this is this awesome tool, and the very next day you see all these articles about the security holes and how it crashes and how it sent off your data to something that it shouldn't have.

So there are more things like, you know, you just have to understand that whole distribution channel really well and pay attention to it from the very beginning.

Rachel: Totally makes sense. Vishakha, how do you stay so current and so knowledgeable about AI? What are your favorite sources for learning?

Vishakha: So I like the emails I get from DeepLearning.AI. Andrew Ng summarizes concepts really well. Then there are certain Substacks like Data Engineering Weekly and some newsletters that I follow. And then I talk a lot to Copilot, Gemini, Perplexity, and then go off and validate with some articles, and I sometimes go attend events, especially upcoming founder events. There's a lot that happens in the San Francisco Bay Area.

It's been interesting. Now I'm basically getting into the phase where, okay, I'm going to build my agents, use Aperture as part of it, not only to educate myself, but also make it so that our documentation is better structured towards all the new tools. And if somebody asks me a question, like occasionally when I'm doing a presentation, I get a question.

How can we improve our search results? What models do you recommend? And if someone asked me a question, I really like to know the answer. So now I'm like, gotta build it myself.

Rachel: I love everything you've told us. So I'm gonna make you Prime Minister of the Solar System. If everything goes exactly how you think it should go for the next five years, what does the future look like?

Vishakha: You know, I think I am a very strong proponent of AI should be replacing mundane jobs, not things we actually enjoy doing. You know, all this narrative around coding being replaced, art being replaced and all these things. In some ways I feel like, okay, it's a step towards it, but hopefully we do the right thing and we replace mundane jobs.

Like I, personally am not a big fan of loading the dishwasher. I would rather have all these advances we do in physical and embodied AI get to a point where that stuff is taken care of and we get time to enjoy time with kids, but also have the money to support all of this because we haven't lost our jobs.

Rachel: Yes.

Vishakha: And we get to do more. We solve bigger problems like diseases, all these wars, travel, space travel. I'm a big Star Trek fan. Yes. Won't happen in the next five years, but I'm hoping someday. Especially the transporter, you know, that's what I'm rooting for.

Rachel: That's a big ask, the transporter.

Vishakha: But I mean, you know we are saying AGI is here, right? A lot of people are saying that.

Rachel: Right.

Vishakha: It's supposed to help us get to that world where we get the things we really want. So transporter, I mean especially, you know, if you have family far away in places like India, that's the first thing you think about.

Rachel: It really is. To visit my family in Australia would be super handy. Well, I don't have a transporter for you, but I do have a generation ship, a starship that takes more than 100 years to get to its destination, so it supports multiple human generations. This is your prime ministerial vehicle. So what would you like to name it?

Vishakha: That's a very interesting question. And you know, I'm divided between two names: Mystery or Cacao. Mystery is just, you know, like mysterious. Like, you don't know what happened in the hundred years, you've gotta learn, or what is happening in the future.

But Cacao, it's just like-- I feel like if you are able to travel to other places, what if you don't find chocolate?

Rachel: That would be appalling. Earth is the only source of chocolate that we know of. Oh my God. What an uncaring universe. Haha.

Vishakha: Imagine that! Haha.

Rachel: Vishakha, what a delight to have you on the show. Thank you so much for spending the time with us. It's been absolutely great.

Vishakha: Thanks a lot for inviting me. This was a great discussion.