Data Privacy, Security, and Identity in the Age of GenAI
What Generative AI Means for Privacy, Security, and Identity
Generative AI’s potential as a productivity enhancer for software development is exciting–44% of developers reportedly use it in their dev processes. However, the way GenAI apps such as ChatGPT and Bard use large language models (LLMs)–machine learning models that are “trained” on large amounts of data to make them “smarter” and more performant–also poses potential risks. Modern AI models’ insatiable hunger for data has led popular content and social media publishers such as Twitter, Reddit, and Stack Overflow to close off free, open API access and move to paid tiers. In this article, we’ll explore the potential challenges AI poses for software developers, particularly in terms of data privacy.
Beyond sci-fi movie deepfakes and more-common threats such as malware and identity theft, why should devs be concerned about privacy when using GenAI? Modern AI chatbots ingest data, learn from it, and can then potentially output those learnings to any public user who can properly engineer a prompt–a properly structured input or question that elicits a response. One study suggests that 46% of senior executives believe their colleagues have unwittingly shared sensitive corporate information (potentially including trade secrets or other intellectual property) with OpenAI’s popular chatbot ChatGPT. Massive enterprises have responded to potential privacy risks by banning the chatbot internally, including Samsung, which infamously suffered a data privacy breach when an engineer fed it proprietary code.
To get a better understanding of the landscape, and future challenges and opportunities offered by generative AI systems, we spoke with experts to cover:
- Data Privacy Tactics for GenAI: Day-to-day best practices to implement now to improve the privacy and security of your software development.
- Long-Term GenAI Privacy and Security Strategy: Important long-term privacy and security factors to consider without blocking your dev team’s productivity.
- Market Opportunities in the Privacy and Identity Space: How the nuances of GenAI are potentially creating new opportunities and niches to fill.
LOOKING FOR MORE RESOURCES ON AI, PRIVACY, AND SECURITY?
- Article: Digging Deeper into Building on LLMs
- Article: The Demographics of AI Developers
- Article: The State of Security with Tonic, Tailscale, Aserto
- Video: Building Securely for User Privacy with Marten Mickos
- Video: Essential Cloud Infrastructure Security with Cisco and HashiCorp
Data Privacy Tactics for GenAI: What to Do Today
Ian Coe is a software and product veteran who has served tours of duty at Palantir and Tableau, and has since co-founded Tonic.ai, a startup that helps software development teams securely generate de-identified data for QA, testing, and development. He recommends the following best practices for engineering orgs concerned about data privacy, as well as some decision-making suggestions for startups that are thinking about building their own models:
- Data Privacy Options to Consider:
- High Cost: Defining/training your own model for internal use - The most costly and time-consuming option. Initially training an internal LLM for your own team and maintaining it over time may require more time and resources than early-stage startups possess. It may also require significant upfront data de-identification in order to protect the data, since models can learn sensitive data attributes.
- Medium Cost: Using conventional AI tools while redacting your data - AI vendors such as OpenAI offer highly involved data removal request forms which are not trivial to fill out and track, and it’s not clear how quickly or to what extent the vendors will respond. However, for nontechnical verticals such as e-commerce that are less likely to be feeding sensitive internal code to AI systems, this option might make sense. Auto-redaction products can also be implemented to scan each prompt and mask sensitive content before it is sent to a third-party hosted LLM (see the sketch following this list).
- Low Cost: Limit the info your team inputs into LLMs - Potentially the most lightweight option for privacy threat assessments, but possibly the most challenging to enforce, particularly for larger engineering teams with high autonomy.
- Implement Standardized Testing for Fine Tuning: When fine-tuning–retraining generative AI models to better respond to prompts they couldn’t originally handle–it’s important to standardize your test practices for consistency, reproducibility, and the ability to make objective comparisons to determine what changes actually helped your model improve.
- Implement Prompt Tracking: Logging the prompts you use and their results over time will help you unearth which ones reveal sensitive information and identify potential vulnerabilities. This can also point you toward any data of your own you may need to redact in order to safely interact with LLMs.
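The auto-redaction and prompt-tracking tactics above can be combined in a single pre-flight step before a prompt ever leaves your network. The Python sketch below is illustrative only: the regex patterns, file path, and function names are assumptions rather than any vendor’s API, and a production system would rely on a vetted detection library or a dedicated auto-redaction product.

```python
import hashlib
import json
import re
import time

# Illustrative patterns only -- a real deployment would use a vetted
# detection library or a commercial auto-redaction product.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "api_key": re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{20,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(prompt: str) -> tuple[str, list[str]]:
    """Mask sensitive substrings before the prompt leaves your network."""
    found = []
    for label, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(prompt):
            found.append(label)
            prompt = pattern.sub(f"[REDACTED_{label.upper()}]", prompt)
    return prompt, found

def log_prompt(original: str, redacted: str, found: list[str],
               path: str = "prompt_log.jsonl") -> None:
    """Track prompts over time; store a hash of the original, not the raw text."""
    record = {
        "ts": time.time(),
        "prompt_sha256": hashlib.sha256(original.encode()).hexdigest(),
        "redacted_prompt": redacted,
        "sensitive_types_found": found,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def safe_prompt(prompt: str) -> str:
    """Redact, log, and return the version of the prompt safe to send to a hosted LLM."""
    redacted, found = redact(prompt)
    log_prompt(prompt, redacted, found)
    return redacted
```

The idea is that only the value returned by `safe_prompt()` is ever sent to a third-party LLM, while the log gives you an audit trail of which prompts contained sensitive content and which data of yours may need redaction upstream.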
Discussion: Data Privacy Today (and Tomorrow) with Ian Coe of Tonic
In addition to prescribing the above tactics, Coe recommends avoiding processes that may be too complicated or time-consuming for dev teams to follow regularly. “For any security- or compliance-related issues, a good rule of thumb is: The easier you can make your process, the more people will be willing to take the necessary steps to protect themselves.”
Still, Coe points out that many of the engineering orgs he speaks to are “fairly early” in their AI journey and potentially not focusing much attention on privacy concerns, or on where things will eventually head with the enterprise. “We’re getting a lot of security questions about how to use generative AI without sharing sensitive data, for which we recommend trying to redact the data, but there are lots of other big questions, such as: How do I build my own models? Should I use prompt engineering? Should I fine-tune a larger or smaller model?”
Regarding how privacy will eventually look, particularly as a consequence of the ongoing battle between proprietary foundation models–privately-held LLMs built by commercial companies–and open-source models, Coe suggests that it’s unlikely there will be a single, dominant LLM...though privacy, governance, and security practices may play a role in determining whether proprietary vs. OSS models will succeed at the enterprise level.
“A good rule of thumb is: The easier you can make your process, the more people will be willing to take the necessary steps to protect themselves.” - Ian Coe, Co-founder / Tonic.ai
“You may have seen the leaked memo from Google asserting there's ‘no moat,’” Coe points out. “Still, it seems fairly unlikely that the long-term future is ‘one model to rule them all.’ However, I see a lot of opportunities for companies to make it possible to run models in a way that's comfortable for enterprises specifically. For example, I don't know that I would view the enterprise AI field as being ‘OpenAI versus everyone else,’ but I might be skeptical that the dominant enterprise players would be 100% open source, since enterprise’s requirements around security and governance can make broad enterprise adoption challenging.”
Coe also points out that while government regulation that expands on GDPR or California’s CCPA may eventually pick up some of the slack, it’s not clear whether regulation or privacy laws will be timely enough or comprehensive enough to protect every org’s data. “Government regulators can be slow to take action, so I’m curious if we’ll also see the private sector take the initiative–for example, introducing a kind of SOC 2 for a particular LLM. There have been some pretty public concerns already, such as the Copilot lawsuit and Stable Diffusion going so far as to [accidentally] render the Getty Images watermark on images it generates. There might even be a business opportunity for something along these lines. My guess is, it may be a combination of public and private regulatory solutions. Certification might come more from the private sector. SOC 2 is something your company does to certify you’re safe–maybe we’ll see a SOC 2 or GDPR for AI to ensure customers don’t unknowingly contribute their sensitive data to models.”
Long-Term GenAI Privacy and Security Strategy: Regulation, Customer Controls
Patrick Coughlin is a security and intelligence expert, having previously worked with Booz Allen Hamilton and Good Harbor Security Risk Management before co-founding the cybersecurity threat intelligence platform TruStar (acquired by Splunk). Coughlin predicts that, in addition to day-to-day concerns about cyberattacks, GenAI may also make these strategic privacy issues crucial in the future:
- Right to Be Forgotten May Set the Bar for Customer Expectations: Government privacy regulation isn’t uniform across the globe, but the European data privacy regulation GDPR set a precedent for data collection and Right to Be Forgotten expectations that may set the tone for future privacy conversations and security measures.
- Traceability to Potentially Grow in Importance: As incidents of AI products unwittingly surfacing personally identifiable information (PII) to the public occur, software providers will increasingly need the ability to trace that confidential information (including, but not limited to, individual user data) back to its point of origin: How was the model fed that personal data, and where, when, and by whom?
- Data as “The New PII” for Companies Under Threat of Prompt Injection Attacks: As companies come to rely on AI language models for more operational use cases, the proprietary dataset they use to train their models could itself become as prized an asset as personal data, vulnerable either to pollution from external prompt injection attacks–prompts purposefully crafted to bypass an AI chatbot’s default privacy and security instructions–or to theft by competitors.
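To make the prompt injection risk above more concrete, here is a deliberately naive Python sketch of one layer of defense: screening untrusted input against common injection phrasings and keeping it clearly delimited from trusted instructions. The heuristic patterns and function names are illustrative assumptions; real defenses also need model-side guardrails, output filtering, and least-privilege access to data, since regexes alone are easy to evade.

```python
import re

# Naive, illustrative heuristics -- prompt-injection defenses in practice also
# rely on input isolation, output filtering, and guardrails, not regexes alone.
INJECTION_HEURISTICS = [
    re.compile(r"ignore (all|any|previous|prior) (instructions|rules)", re.I),
    re.compile(r"disregard (the )?(system|above) prompt", re.I),
    re.compile(r"reveal (your|the) (system prompt|instructions|training data)", re.I),
]

def looks_like_injection(user_input: str) -> bool:
    """Flag user input that matches common injection phrasings for review."""
    return any(p.search(user_input) for p in INJECTION_HEURISTICS)

def build_prompt(system_prompt: str, user_input: str) -> str:
    """Keep untrusted input clearly delimited from trusted instructions."""
    if looks_like_injection(user_input):
        raise ValueError("Possible prompt injection detected; routing to review.")
    return f"{system_prompt}\n\n--- UNTRUSTED USER INPUT ---\n{user_input}"
```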
Discussion: How Data Privacy Needs to Evolve with GenAI with Patrick Coughlin of Splunk
“Every new disruptive technology that comes in usually introduces a new definition of infrastructure, new types of services, and new types of dependencies in applications that expand the attack surface for the enterprise. So more and different things that need to be protected,” Coughlin points out. “We saw this with the move to desktops and then to mobile, and then to cloud and containers, and every new architectural leap forward introduces a different kind of attack surface area.”
Does AI’s potential to be a game-changer in software development mean that privacy and security will eventually be impossible? Coughlin doesn’t think so. “Everybody thinks it's going to be the end of the world. This is a very important new wave, but like other waves that we've seen in the past, you usually get some new critical assets, you get some new breadth and vectors in the attack surface area. There are new ways to lose data and have IP exfiltrated, new compliance requirements and regulation coming in for the defenders. For attackers–as always with these new technologies, they get some sort of improved economics. Maybe they can move faster. For example, you may have seen how bad guys are using ChatGPT to help them write better phishing emails.”
“As much as there is to be scared about when we talk about AI, we have to remember that security is going to be one of the biggest beneficiaries of AI as well.” - Patrick Coughlin, VP - Technical GM/Splunk
Coughlin is bullish on the impact GenAI will have on software dev teams’ ability to keep their data secure and private, particularly as it acts as a lever to ramp up new developers in the future. “But don't forget that we have tools here. When you look across your security operations, your IT operations, your DevOps teams, you're going to need to have these teams continue to work together on challenges around AI and ML. Also, remember that it's always a double-edged sword: For as much as there is to be scared about when we talk about AI threats to security, we have to remember that security is going to be one of the biggest beneficiaries of AI as well.”
“We'll be able to use generative AI to help us have better detections, smarter detections of when things are going wrong. We'll be able to generate predictive response playbooks that actually take actions to automatically remediate things. And we'll be able to reduce the barrier of entry into these fields like coding and DevOps and ITOps and SecOps, where you previously had to have all these certifications and years of experience with frameworks and tools to actually have an impact. In the future, maybe you'll be able to hunt for needles in haystacks and leverage natural language to get the maximum capabilities of your tooling and your processes.”
Market Opportunities in the Privacy and Identity Space
Shiven Ramji is a veteran of software development and product launches, having worked at Amazon and DigitalOcean, and served as Chief Product Officer at Auth0 before the company was acquired by Okta. Ramji suggests that the tectonic shifts GenAI has brought to software development may affect the trajectory of new and existing companies, but also unearth interesting opportunities for startups to explore:
- Verifying Humans vs. AI Will Just Be the Beginning: Generative AI already produces human-like content at a rapid pace, and the ability of algorithms and generative AI tools to emulate human speech and mannerisms will only improve–which will make issues such as identity and authentication even more complex.
- Companies That Focus Entirely on Building Models Risk Commoditization: Startups whose entire stock-in-trade is building and selling an LLM will likely find their products commoditized, as actual market advantages are likely to come from proprietary data rather than from how quickly a model can process tokens. For example, it’s quite possible that database companies will all move to build AI capabilities to layer onto their data, with bespoke permissions to use only their own internal data and not expose that data to the outside world.
- Data Gathered by Early-Generation AI Pioneers May Pave a Path to Next-Generation Opportunity: While established verticals such as enterprise-scale identity and access management might require seismic economic shifts–such as delivering the same types of services, but at 10x speed, 10x more cheaply–earlier AI startups (Jasper, Databricks, etc.) that have been collecting data from users for years may be able to leverage their data stores to pivot into new directions.
Discussion: GenAI in Enterprise Data Privacy Today and Tomorrow, with Shiven Ramji
Ramji contextualizes the differing privacy needs of early-stage startups with a much smaller footprint, whose biggest day zero problem is just getting users into the product, versus growing companies that need to move upmarket and sell to orgs with stricter privacy and compliance requirements. “It’s over time you realize–you’re trying to sell to companies that say, ‘Well, I need to use my own enterprise connection federation,’ or they'll say, ‘My security team tells me I need compliance, auditing, rules-based access control.’ Those problems are near and dear to startups, because those are the things you're going to address early on. Most startups don't get to understand the authorization problem until much later in their journey. Authorization actually needs to start from the enterprise level…which is kind of counterintuitive. But I think we're finally in a place where cloud capabilities, latency availability, and data stores can be deployed globally with caching. Finally, I think we can have highly available, four-nines authorization as a service powered by the cloud.”
Regarding the way AI is changing the face of privacy and identity management, Ramji points out that modern IAM is still a complex process, requiring machine learning teams to constantly retrain models. “You start with the attack. So you need to protect the identity tenant against unauthorized access, such as from large DDoS attacks, but wait–they may attack the users, and you need to be mindful of your ability to detect data breaches for passwords or even credential stuffing attacks, which let them take over the user's account. So, you need account takeover protection. After that, you need to worry about product abuse, fraud and other factors. We’re taking feeds from different third-party security vendors and analyzing the way this ecosystem is going to work. Is there a signal exchange between companies? If you know of an IP or an attack from a source that needs to be blocked, then the sooner you share that signal, the sooner everybody is protected. But threat actors are changing so fast. We have an entire ML detection team that's constantly tuning or training our model with new inputs to make sure we can detect every pattern–and make sure our products are protecting against all types of attacks.”
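Ramji’s point about signal exchange can be illustrated with a toy sketch. The Python below is not Okta’s system or any real vendor’s feed; it simply assumes a hypothetical shared exchange where one tenant publishes an attacking IP and every subscriber can check inbound requests against it before the attack spreads.

```python
from dataclasses import dataclass
import time

@dataclass
class ThreatSignal:
    ip: str
    reason: str          # e.g. "credential_stuffing", "ddos"
    reported_at: float

class SharedSignalExchange:
    """A toy in-memory stand-in for a cross-vendor threat-signal feed."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self._signals: dict[str, ThreatSignal] = {}
        self._ttl = ttl_seconds

    def publish(self, ip: str, reason: str) -> None:
        """Share a signal as soon as an attack source is identified."""
        self._signals[ip] = ThreatSignal(ip, reason, time.time())

    def is_blocked(self, ip: str) -> bool:
        """Check an inbound request against recently shared signals."""
        signal = self._signals.get(ip)
        if signal is None:
            return False
        if time.time() - signal.reported_at > self._ttl:
            del self._signals[ip]
            return False
        return True

# Usage: one tenant publishes, every subscriber benefits immediately.
exchange = SharedSignalExchange()
exchange.publish("203.0.113.7", "credential_stuffing")
assert exchange.is_blocked("203.0.113.7")
```

In practice this role is played by commercial threat-intelligence feeds and shared-signal standards rather than an in-memory dictionary, but the economics Ramji describes are the same: the sooner a signal is shared, the sooner everyone downstream is protected.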
“There are really brilliant people out there working in AI. But even the folks who manage data lakes and work in machine learning are saying it's still really hard and painful.” - Shiven Ramji, President - Customer Identity / Okta
While Ramji is skeptical of how the AI hype cycle has left the door open to startups “AI-washing” their otherwise conventional offerings, there are still question marks about data management that could turn into opportunities for ambitious startups in the future. “I haven't really seen many breakthrough companies yet that make me think, ‘Wow, this is the one!’ I think [GenAI is making] everybody get efficient, faster, maybe getting you the right answer sooner. But it's likely going to take some interrogation at the technical level to understand whether [a startup] is ‘AI-washing’ or actually something that has AI natively built in, is ‘smart,’ and actually adds something new.”
Ramji closes by suggesting that managing data could become the next big challenge for AI. “There are really brilliant people out there working in AI. But even the folks who manage data lakes and work in machine learning are saying it's still really hard and painful. So we're going to have to build machine learning models to make it easier. There's an irony in that. It's so hard to build a data lake. It's hard to keep up the ETL for the data, for the different data sources. And then, you've got to spend so much time tuning and iterating on machine learning models. And even if someone tries to build and run an entire data pipeline for you, there’s still the question of what to do when the input to that pipeline is different from what you built the model for in the first place. So, there are still a lot of questions to answer.”
Conclusion
As proprietary training data becomes an increasingly valuable resource in AI-related projects, startups will need to pay more attention to internal privacy best practices, to security against external threats, and to identity and auth issues for their growing customer bases. Learn more about how software development veterans are implementing AI at the DevGuild: Artificial Intelligence event.