Best Practices for Developing Data Pipelines in Regulated Spaces
Heavybit
How to Think About Data Pipelines in Regulated Spaces
Tech teams standing up new AI programs, or scaling existing ones, need to determine how to build and operate data pipelines, a key infrastructure component for enterprise AI. Without good data pipelines, their data will suffer from quality, accuracy, and freshness problems – no matter which vendors they rely on. In this article, we’ll cover expert insights on best practices for data pipelines in regulated spaces.
Roshan Nanu, PhD, Director of AI at Prompt Health, is an expert on building data pipelines to support AI applications in healthcare, a highly-regulated industry with strict data privacy and transport regulations. We spoke with him about why data pipelines are so important, walked through the best practices and common mistakes, and learned how tech teams can implement these ideas in any industry.
AI Is Only as Good as the Data Underpinning It
The first mistake tech teams tend to make happens well before the first data pipeline plans are even scoped. Once an engineering team takes on a data pipeline project, it’s tempting to assume the data in question is all from the same domain and follows the same standards.
Nanu explains that the data conversation has to start well before the project gets to the engineering team. “I'm talking to our sales team, our marketing team, and our billing team about their practices,” Nanu explains. Over time, he says, “It becomes more and more important for these teams to create good free-text data.”
Generative AI has, of course, become dramatically more sophisticated in the past few years, but it still abides by a classic principle: Garbage in, garbage out. The information in the data coming from your sales, marketing, and billing teams might be very meaningful, but if it’s not formatted for the data pipeline you’re building, that meaning can be lost.
Before you build the data pipeline, you need to zoom out and assess the quality of your entire data corpus. “With companies wanting to leverage AI and develop LLMs (large language models) to scale, we can only get reliability if we have a good corpus of data to start with,” Nanu explains.
Reliability for AI applications is critical. If reliability is low, otherwise-effective AI applications can face adoption issues. This is especially true for retrieval augmented generation (RAG) applications. If an enterprise wants to enable users to get precise answers to nuanced questions, users need to be able to trust those answers.
Nanu explains, “We can only make good RAG pipelines to get information to people if we have good documents to use as sources.” For the sake of AI, “good documents” go far beyond the quality of the information within them. Good documents also exhibit high data quality, being uniform and standardized, ensuring that terms are consistent and formatting is clean.
Nanu frequently has conversations with non-engineering teams, showing them how to adapt their processes to support AI applications that will, in turn, help them. “I’m telling them to standardize their processes, write down their heuristics and how they do things, and write down good documentation based on their processes and best practices,” he explains.
Investing in this work before building your data pipeline saves significant effort down the road. Rushing into it risks discovering months later that teams have been formatting data differently or entering it into inconsistent fields. Retroactive standardization is always harder than proactive standardization.
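One way to make proactive standardization concrete is to validate records against the agreed format at the point of entry, so inconsistencies surface immediately rather than months later. Here is a minimal sketch of that idea; the field names, date format, and rules are illustrative assumptions, not any team's actual schema:

```python
from datetime import datetime

# Hypothetical shared schema agreed on with the teams producing the data
# (field names and rules here are illustrative assumptions).
REQUIRED_FIELDS = {"patient_id", "visit_date", "note_text"}
DATE_FORMAT = "%Y-%m-%d"

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "visit_date" in record:
        try:
            datetime.strptime(record["visit_date"], DATE_FORMAT)
        except ValueError:
            problems.append(f"bad date format: {record['visit_date']!r}")
    if not record.get("note_text", "").strip():
        problems.append("empty note_text")
    return problems
```

Records that fail validation can be flagged back to the producing team immediately, which is exactly the proactive conversation the retroactive approach forces you to have too late.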
The Type of Data You’re Working With Matters
All data is not created equal, and the type of data you’re working with can be a huge influence on how you build your data pipeline. A focus of Nanu’s, for example, is automating note-taking for healthcare providers – an otherwise time-consuming manual process.
Different companies approach the challenge in different ways, using either scribes, intake-based generation, or note re-formatting/re-writing. The outputs may end up basically the same, but the inputs vary wildly.
“My dataset needs to have whatever I want as my input,” Nanu explains, “And then your output needs to be in a format that I can actually use.” The challenge is in the refinement, not the collection or transformation of the data.
With companies wanting to leverage AI and develop LLMs to scale, we can only get reliability if we have a good corpus of data to start with.
-Roshan Nanu, Director of AI/Prompt Health
In a healthcare setting, data meshes can be a helpful addition. A data mesh is an approach to data architecture that decentralizes data ownership and management, allowing self-service among different teams.
“We have a huge advantage in that we are an electronic medical record (EMR) company,” Nanu explains. “All of our customers’ data is stored in the same way, even if they don't create their data in the same way.” A data mesh helps ensure that the data from domain experts – physicians, in this case – can be stored correctly without overburdening those experts with data work.
That said, a data mesh is neither easy to implement nor a panacea. “It's still a very hard problem to create a data mesh that works well, that our AI developers can use and pull from for their individual use cases,” Nanu warns.
Figure Out What Your Data Really Means
When you look for technical guides around data pipelines, many assume you’re starting from square one. In reality, most tech teams inherit datasets and data infrastructure built years ago. AI might not have even been on the horizon when your system was first built, but now it’s a priority, which means you need to deeply understand your data before doing anything else.
“With developing use cases across manufacturing, across healthcare, and more, the big gap is actually less about what data is available – it’s about whether AI engineers know what it means,” Nanu says.
A developer and a physician have very different knowledge bases, and even physicians in the same hospital will likely have different ways of transcribing records. No amount of sheer AI experience will make mutual understanding simple; you have to cross the gap first.
Otherwise, Nanu says, “Your AI developer will ask, ‘How do I map these inputs to what I actually need to get?’ ‘What do all these terms mean?’ ‘What do these data columns mean?’” It can take a lot of time, he warns, but it’s worth it if it means turning that sometimes obscure domain expertise into a resource.
It can be helpful to break down the process of figuring out your data into multiple steps:
- Procure the data
- Cleanse the data as needed
- Format the data so that engineers and data scientists can query it
- Define the data so engineers and data scientists can understand what they’re modeling and trying to accomplish
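The four steps above can be sketched end to end as one small function. Everything below – the `id` field, the column-naming rule, the idea of a data dictionary supplied by domain experts – is an illustrative assumption, not a description of any real pipeline:

```python
def run_pipeline(raw_rows, data_dictionary):
    """Illustrative walk through the four steps; all names are assumptions."""
    # 1. Procure: raw_rows stands in for whatever extraction yields.
    rows = list(raw_rows)
    # 2. Cleanse: drop rows missing a key and trim stray whitespace.
    rows = [
        {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
        for r in rows
        if r.get("id") is not None
    ]
    # 3. Format: normalize column names so engineers can query consistently.
    rows = [{k.lower().replace(" ", "_"): v for k, v in r.items()} for r in rows]
    # 4. Define: check every column against the domain experts' data
    #    dictionary, so engineers know what they are modeling.
    columns = {c for r in rows for c in r}
    undefined = columns - data_dictionary.keys()
    return rows, undefined  # undefined columns go back to the domain experts
```

The return value makes the fourth step actionable: any column without a definition becomes a question for the domain experts rather than a guess by the AI team.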
The earlier steps only pay off once you reach the final one: defining the data. Defining datasets can feel like humble work, just understanding the data at hand, but it’s significant progress. “Just getting that domain expertise and having it available to your AI engineers is such a big step,” Nanu says.
Looking at it another way, this progress isn’t just a big step for your team but, cumulatively, a big step for the software industry as a whole.
“Right now, we're in a world where middlemen developed the software,” Nanu explains. “Your domain experts are using some piece of software, and you're brought in now to build on top of that, but you're stuck in whatever infrastructure your software engineers set up, without any thought to the fact that this data would be used later.”
Ingest and Store Data to Support Present and Future Use Cases
Once you understand your data and have processes in place to standardize it, your goal is to build a data pipeline that is consistent, reliable, and scalable. Companies have a lot of documentation and data, almost by definition, and your data pipeline isn’t ready for production until it can ingest, process, and store all of it.
“The next step is developing that consistent data pipeline, getting something in place to ingest your own company's documentation, your patient data, your sales data, your data from every other third-party software you're using,” Nanu explains.
Ingestion, however, is only one component. Once you can reliably and consistently ingest data across all of the sources your company uses, you need to be able to store it.
“Get all of that into a data warehouse,” Nanu says. Storage policies need to be clear and strictly reinforced. “Data needs to be in a format that your AI engineers can query for the use case at hand.”
Nanu emphasizes an upfront infrastructure investment because AI is such a rapidly evolving field.
“Right now, every time you turn around, someone has a new use case for AI they want to throw at you,” Nanu says. “Today it’s ‘Hey, automate documentation for providers.’ Tomorrow, it's ‘Hey, figure out if patients are going to show up or not.’ Or, internally, ‘Hey, get us better reporting to know what the sentiment of our customer support is.’”
The use cases for AI are already plentiful, but they’re still growing. An effective data pipeline, relying on effective data infrastructure, can pipe data in and make it usable for use cases that haven’t even been built out yet. “For all AI use cases, we need data sets that we can go to that are trustworthy,” Nanu says.
Support Your Data Pipeline with Automation
As mentioned, many data teams inherit their datasets and infrastructure, so an early investment in improving this starting point frequently pays off.
As you build your data pipeline, consider what you can automate. Once you get full access to your data and standardize it, you can start building automated processes that work from those standardized formats and allow you to avoid a lot of manual work.
“If you have a lot of that standardization in place, the most important thing is to decide on a cadence, to set up your ETL pipeline, and to get it running such that you are updating that data frequently,” Nanu says. If your data isn’t frequently updated and fresh, even an otherwise effective data pipeline can be unreliable.
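The shape of that automation is a single, repeatable ETL pass triggered on the agreed cadence. A minimal sketch, in which the source data, the transform, and the list-backed "warehouse" are all stand-in assumptions:

```python
def extract():
    # Stand-in for pulling from source systems (assumption: returns dicts).
    return [{"id": 1, "value": " raw "}]

def transform(rows):
    # Stand-in cleanup step: strip whitespace from the payload field.
    return [{**r, "value": r["value"].strip()} for r in rows]

def load(rows, sink):
    # Stand-in for writing to the warehouse.
    sink.extend(rows)

def etl_run(sink):
    """One complete ETL pass; runs once per cadence interval."""
    load(transform(extract()), sink)

# In production this single pass would be triggered on the agreed cadence -
# e.g. by cron or an orchestrator - rather than run ad hoc, which is what
# keeps the data fresh without manual work.
```

The point of structuring it this way is that "deciding on a cadence" becomes a one-line scheduling concern, separate from the pipeline logic itself.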
It makes sense, but many feel tempted to delay automation. “Honestly, putting that in the backlog takes so much time to recover from,” Nanu says. “If you have the opportunity, if you're getting it set up, automate that data pipeline first so it's always up to date.”
Choosing not to automate your data pipeline is, in many cases, choosing to take on tech debt. “Once you're up and running, your company expects you to put out new use cases or make things shift faster,” Nanu explains. Without a mature pipeline supported by automation, he says, “It becomes harder and harder to carve out time to go and optimize that back end or go and improve on processes.”
Much of this comes back to foundational engineering best practices. Nanu has seen companies try to build pipelines with only data scientists, not engineers, “And it was the most inefficient thing I've ever experienced because no one on that team knew how to write good software,” he says.
If you have a lot of that standardization in place, the most important thing is to decide on a cadence, to set up your ETL pipeline, and to get it running such that you are updating that data frequently. If you have the opportunity…automate that data pipeline first so it's always up to date.
That said, there are limitations, and you can’t expect a perfect pipeline set up on day one. “Automation is not always feasible,” Nanu explains, in cases where the AI team is new or the company is first identifying its primary use cases. “You don't know your data markets yet until you go through a few processes. And it's hard also to get access to that data,” Nanu says.
Some enterprises can take four to six months to spin up an AI team and get them access to production data to work with and train on. In such cases, developers shouldn’t just sit and wait. Instead, Nanu says, they can “use an assumption of what they think the data might look like once they get it for their initial models.”
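That "assumption of what the data might look like" can be made concrete as placeholder records generated from a guessed schema, which the team swaps out once real access lands. Every field name and distribution below is an explicit guess, not real data:

```python
import random

# Assumed schema for the data the team expects to receive; every field
# name and value range here is a guess, to be revised against real data.
ASSUMED_FIELDS = {
    "age": lambda rng: rng.randint(18, 90),
    "visit_type": lambda rng: rng.choice(["intake", "follow_up", "discharge"]),
    "no_show": lambda rng: rng.random() < 0.2,
}

def make_placeholder_records(n, seed=0):
    """Generate deterministic placeholder records for initial model work."""
    rng = random.Random(seed)
    return [
        {name: gen(rng) for name, gen in ASSUMED_FIELDS.items()}
        for _ in range(n)
    ]
```

Seeding the generator keeps early experiments reproducible, so results from the placeholder phase can be compared cleanly once real data replaces it.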
Companies will always have new use cases, and they will expect data pipelines to keep up and evolve. A stable foundation and robust automation are essential. “If you can, get it done early,” Nanu says. “If you can't, hire for it. Hire someone whose sole job is to come in behind that AI team and develop those backend systems to do that data and software engineering.”
Store and Manage Data Securely and Efficiently
Data privacy is important in every industry, but it’s especially important in highly regulated industries, such as healthcare, finance, insurance, and manufacturing. If an AI application were to regurgitate private information about a patient, for example, there wouldn’t just be reputational damage – there could be penalties, fines, and lawsuits.
As Nanu says, working from his experience in healthcare, “All it takes is 20 instances of private healthcare information to leak for a lawsuit. 20 individuals.” As a result, a ‘mostly’ secure application just isn’t enough. Tech teams have to thoroughly de-identify the data that’s fed into their models to ensure privacy risks are as unlikely as possible. But that’s easier said than done.
“It actually takes quite a long time to set up a robust de-identification,” Nanu warns. “And then you have to go through the extra step of getting a third party to review a sample of your dataset and your de-identification process to certify that it is sufficiently de-identified to no longer be protected under HIPAA. And they go through that data painstakingly.”
There are no shortcuts, Nanu says. Even if you purchase high-quality de-identification models from a vendor, you still have to build your own systems to do additional redactions and keep your data clean.
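One of those additional redaction layers can be as plain as a rule-based pass over free text. To be clear about scope: the patterns below are illustrative, and a regex layer like this is only one component sitting alongside model-based detection and the third-party review Nanu describes – on its own it is nowhere near sufficient de-identification:

```python
import re

# Illustrative rule-based redaction patterns; a real de-identification
# stack layers these under model-based detection and external review.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def redact(text: str) -> str:
    """Replace obviously structured identifiers with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```

The value of keeping this layer in your own codebase is exactly the point Nanu makes: even with a vendor model in front, you own the final pass that keeps your data clean.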
Of course, even properly de-identified data is only as useful as it is accessible. Once you have the data de-identified, you need to store it effectively and enable developers to query it efficiently.
This component of the pipeline is integral, and Nanu recommends assigning a whole person or team to data management. “From a logistical standpoint, it's definitely worth having a person or team managing your data,” Nanu says. “That’s probably the most important thing that you have.”
Working with someone who can focus on data management is crucial because it’s not work that you can take lightly. “When getting things set up, I would be very cautious about what resources you're using to do so,” Nanu says. “A lot of people are going to dump stuff into S3 and say, ‘Hey, it's good enough,’ but it's really hard to index and search that data.”
The other obvious option isn’t good either. “On the flip side,” Nanu says, “sticking it on RDS gets very costly, very fast for big data sets, and it's not overly efficient.”
There’s a tough balance to strike: You want to index your data properly, but you also want to store it cost-efficiently. “At this point, data is money, data is power, and data is what you need,” Nanu explains. “You need to be able to store all your data efficiently, but also query it efficiently.”
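One way to reason about that balance is to index only the fields engineers actually filter on, rather than paying for a fully indexed relational cluster or settling for unsearchable bulk storage. The sketch below uses SQLite purely for illustration – the table and column names are assumptions, and a production warehouse would be a different system, but the indexing trade-off is the same:

```python
import sqlite3

def build_store(path=":memory:"):
    """Create a small note store with an index only on the query-heavy column."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes "
        "(id TEXT PRIMARY KEY, visit_date TEXT, body TEXT)"
    )
    # Index the field engineers filter on, so lookups avoid full scans;
    # leaving other columns unindexed keeps storage and write costs down.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_notes_date ON notes (visit_date)")
    return conn

def add_note(conn, note_id, visit_date, body):
    conn.execute(
        "INSERT OR REPLACE INTO notes VALUES (?, ?, ?)", (note_id, visit_date, body)
    )

def notes_on(conn, visit_date):
    return conn.execute(
        "SELECT id, body FROM notes WHERE visit_date = ?", (visit_date,)
    ).fetchall()
```

The design choice being illustrated: "store efficiently" and "query efficiently" pull in opposite directions, and deciding which columns earn an index is where the balance gets struck.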
How to Think About AI Agents
The generative AI paradigm shift has blossomed into innumerable sub-trends, and right now, AI agents are capturing a lot of attention. Agents are relatively new, and they’ll require high-quality data to underpin them if they have any hope of working.
“If you have one agent that is reliable, then you’ll have to put a lot of time and effort into building that agent and training it and fine-tuning it,” Nanu explains. Until you have a robust data pipeline, agents might not warrant your focus yet.
That said, if agents are what excite company leaders and agents are what’s drawing a budget, then don’t shy away. Nanu recommends caution: “If you're stuck building agents, it really requires thinking through every step of your chain of thought pipeline. Don't leave a lot to chance.”
The more you leave to chance, the less reliable the agents will be, leading to a degraded experience. “You get an explanation one day that you don't the next day. There's a small update, you switched to a different GPU, and now you're getting slightly different results,” Nanu explains. That’s not an experience you can put in front of a customer.
The good news is that even the newest trends still point back to building a solid foundation first: Your data pipeline. But if you’re building for agents, you must pay particular attention to data diversity.
“There's just so much of a gap in reliability that you really do need a very large and diverse dataset in order to support good agents,” Nanu explains. “That means you need several thousand examples of realistic input and very high-quality output.”

Develop Failover and Fallback Plans
Many teams are rightfully wary of becoming too dependent on any given vendor, including foundation model providers. However, for early-stage startups and teams new to AI, buying into a foundation model, even at a relatively cheap plan, is a good start.
“A lot of companies want to retain ownership of models,” Nanu explains, “But those large proprietary companies, those source models, provide something very valuable at first, and that is speed and quality.” The idea of standing up your own models can be appealing, but it’s unlikely you’ll match the providers’ speed and quality, and you certainly won’t get close to that level quickly.
Generative AI is a scaling game, and until you have the data, it’s safer to risk some vendor lock-in. “Your fine-tuned, self-hosted, open-source 7 billion parameter model is going to perform absolutely terribly until you have that 5,000-or-more data point set,” Nanu says. “Until you get to that point, you're better off throwing your prompts at a foundational model.”
At this stage, the goal isn’t to launch the ideal AI strategy on day one; the goal is to get started.
That doesn’t mean, however, that you have to throw caution to the wind. Nanu acknowledges the risks and warns, “More and more, we're seeing vendor outages. And at any point, these closed-source vendors can raise their prices from 10 cents per million tokens to ten dollars per million tokens – they can do that overnight.”
A lot of companies want to retain ownership of models, but those large proprietary companies provide something very valuable: Speed and quality. Your fine-tuned, self-hosted, open-source 7 billion parameter model is going to perform absolutely terribly until you have that 5,000-or-more data point set.
As a result, you should balance the power you’re giving these vendors with the ability to fall back in the event of outages or other issues. Nanu, for example, uses models from these providers but keeps backups ready if anything happens.
“We build out our datasets to train our own custom adapters that we own, that we host, and that we can fall back on if those closed-source models ever go down,” Nanu says.
That safety measure even backs out to how they use the models when they’re up and running. “That's why, even though GPT-4o has that 128,000 token context window, we actually keep most of our prompt engineering and our work under an 8,000 token context window just so that we can easily fall back to our open-source models on our self-hosted infra,” Nanu explains.
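That constraint can be enforced in code as a routing layer: every prompt is held under the budget the self-hosted fallback can handle, and requests route to the backup when the primary fails. Everything here is a sketch under assumptions – the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer, and both model callables are hypothetical:

```python
# Shared budget chosen so the self-hosted fallback model can always
# accept the same prompt as the primary vendor model (assumption).
TOKEN_BUDGET = 8_000

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 chars per token); a real system would use
    # the model's actual tokenizer.
    return max(1, len(text) // 4)

def complete(prompt, primary, fallback):
    """Send the prompt to the primary model, falling back on failure."""
    if estimate_tokens(prompt) > TOKEN_BUDGET:
        raise ValueError("prompt exceeds the shared budget; trim before sending")
    try:
        return primary(prompt)
    except Exception:
        # Vendor outage or sudden price change: same prompt, same budget,
        # self-hosted model instead.
        return fallback(prompt)
```

Because every prompt already fits the fallback's window, switching providers is a routing decision rather than a prompt-engineering rewrite – which is the whole point of keeping work under the smaller budget.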
There’s No Free Lunch
It’s easy to feel behind the AI wave, but everything is still new, and best practices remain unsettled. “There's no free lunch,” Nanu says, “And at this point, all the systems for AI are all constantly in development, and are brand new. The whole industry is still figuring out best practices.”
Instead of waiting for a playbook that can get you from A to Z, focus on building the foundational processes of your data pipeline:
- Build a plan with goals
- Scope out data sources
- Map out data processing processes
- Determine storage
- Determine data flows
- Build / connect / test
- Monitor and observe
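The checklist above can even be encoded as a declarative spec that fails fast when a foundational step is skipped. Every name in this sketch – the sources, transforms, sink, and monitoring keys – is an illustrative assumption:

```python
# The foundational checklist sketched as a declarative pipeline spec;
# all specific values here are illustrative assumptions.
PIPELINE_SPEC = {
    "goal": "queryable, de-identified notes for AI use cases",
    "sources": ["emr_export", "billing_csv"],
    "transforms": ["validate", "redact", "normalize_columns"],
    "sink": "warehouse.notes",
    "monitoring": {"freshness_max_hours": 24, "alert_channel": "#data-oncall"},
}

def check_spec(spec):
    """Fail fast if the plan skips a foundational step from the checklist."""
    required = {"goal", "sources", "transforms", "sink", "monitoring"}
    missing = required - spec.keys()
    if missing:
        raise ValueError(f"pipeline spec missing: {sorted(missing)}")
    return True
```

Treating the plan as a checked artifact, rather than a document that drifts, is one small way to apply the DevOps mindset the next paragraph describes.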
Throughout, treat your data pipeline like operational architecture, similar to the approach DevOps teams take to other foundational pieces of software infrastructure. While there’s a lot of excitement around AI, standing up an AI program isn’t something to build on the cheap. A data pipeline is infrastructure, and you need to build it knowing that it’s something to maintain and evolve over time.