Training Your LLM Dragons: Why DSPM is the Key to AI Security

Parag Bajaria

December 18, 2024

AI’s transformative potential comes with a price—its complexity and reliance on sensitive data make it a prime target for security threats. For most organizations, the two primary use cases, custom large language models (LLMs) and tools like Microsoft Copilot, introduce unique challenges.

Custom LLMs often require extensive training on organizational data, creating risks of embedding sensitive information into models. Meanwhile, Microsoft Copilot integrates with enterprise applications and processes, potentially exposing personal, financial, and proprietary data if not properly governed. Whether through intentional attacks or accidental mishandling, these implementations demand a robust security approach to prevent data exposure and ensure compliance.

Key threats to AI implementations include:

Prompt Injection Attacks: Crafty prompts can manipulate models into disclosing sensitive information indirectly, bypassing traditional security measures.
Training Data Poisoning: Malicious actors or oversights can embed sensitive or biased data into training sets, leading to unethical or insecure model outputs.
Data Leakage in Outputs: Poorly configured models may inadvertently expose private data during user interactions or as part of their outputs.
Compliance Failures: AI systems that mishandle regulated data risk steep fines under laws like GDPR, CCPA, or HIPAA and erode customer trust.

In my recent webinar, Training Your LLM Dragons: Why DSPM is Foundational for Every AI Initiative, I discussed these risks with Vamsi Koduru, Director of Product Management at Normalyze, and together we walked through practical strategies and demos illustrating how to tackle them head-on.

Use Case 1: Securing Custom LLMs

Custom LLMs allow organizations to fine-tune AI models to meet specific business needs, but they also create significant risks. Sensitive data can enter the model during training or through interactions, potentially leading to inadvertent disclosures. Security teams can secure custom LLMs with these steps:

Audit and Sanitize Training Data:
- Regularly review datasets for sensitive or regulated information before using them in training.
- Implement data anonymization techniques, such as masking or encryption, to protect PII and other critical data.
Monitor Data Lineage:
- Use tools like Normalyze to map how data flows from ingestion to model training and outputs.
- Ensure traceability to maintain compliance and quickly address vulnerabilities.
Set Strict Access Controls:
- Enforce role-based permissions for data scientists and engineers interacting with training datasets.
- Limit access to sensitive datasets to only those who absolutely need it.
Proactively Monitor Outputs:
- Analyze model responses to ensure they don’t unintentionally reveal sensitive information, particularly after updates or retraining cycles.

How Normalyze Helps

In the webinar demo, we demonstrated how Normalyze’s DSPM platform can automatically discover sensitive data across cloud environments and classify it, ensuring comprehensive visibility into structured and unstructured data sources. The platform provides a complete lineage view, illustrating how sensitive data flows through various stages, including its origin, connection to datasets, involvement in training pipelines, and its integration into custom AI models. This detailed lineage view enables organizations to trace the movement of sensitive information, maintain compliance with regulations like GDPR and CCPA, and build trust with their users. Additionally, the platform proactively notifies teams if sensitive data is being used inappropriately—whether in training data, model responses, or user interactions—empowering them to address potential risks immediately. This capability ensures businesses can mitigate data exposure, reduce accidental leakages, and secure their AI initiatives against emerging threats.

Use Case 2: Mitigating Risks in Microsoft Copilot

Microsoft Copilot delivers accurate, contextually relevant responses through a process called grounding. By accessing Microsoft Graph and the Semantic Index, grounding pulls context from across your organizational applications to generate more specific and tailored prompts for its LLM. While this enhances response quality, it also introduces risks of data leakage or misuse if sensitive or poorly governed data sources are accessed during the process. Security teams can secure Copilot implementations with these steps:

Enforce Sensitivity Labels:
- Map sensitive data to Microsoft Information Protection (MIP) labels to ensure proper access restrictions.
- Assign labels consistently across files and applications to govern what data Copilot can access.
Curate Approved Data Sources:
- Consider using a curated set of approved SharePoint sites or datasets for Copilot to minimize exposure of unvetted data.
- Ensure all included datasets are sanitized for sensitive or regulated content.
Monitor Prompt Behavior and Outputs:
- Log and analyze prompts to identify unusual or malicious behavior.
- Use tools to monitor Copilot’s outputs and flag sensitive information in real time.
Limit Access by Role:
- Configure Copilot’s access based on user roles to ensure employees only see data relevant to their responsibilities.

How Normalyze Helps

The second demo in the webinar showcased how Normalyze integrates seamlessly with Microsoft MIP labels to enhance classification and governance of sensitive content by mapping discovered data classes to existing sensitivity labels. This integration ensures that access controls and compliance requirements are consistently enforced across environments. The platform identifies potential risks tied to sensitive outputs, such as files or information surfaced through Copilot interactions, including inadvertent exposure of confidential data or outputs generated without proper contextual safeguards. By analyzing sensitive data flows and monitoring outputs, Normalyze can detect and alert teams to inappropriate or unauthorized access attempts, even when stemming from sophisticated scenarios like unauthorized prompts or accidental data leakage. This proactive approach enables organizations to respond swiftly and effectively, reducing the likelihood of sensitive information being compromised while maintaining robust data governance across AI-driven tools like Copilot.

Build a Secure AI Framework

Regardless of the use case, a proactive and layered approach is essential to securing AI infrastructure. Here’s a summary of the steps organizations should take:

Discover and Classify Sensitive Data: Use automated tools to identify PII, intellectual property, and regulated data across your cloud and on-premises environments.
Ensure Data Lineage Visibility: Track how sensitive data moves through your AI workflows, from ingestion to model training and beyond.
Establish Role-Based Access Controls: Limit access to sensitive data and ensure permissions align with employees’ responsibilities.
Audit and Anonymize Data: Sanitize training datasets and ensure outputs don’t inadvertently disclose sensitive information.
Continuously Monitor Interactions: Track user inputs, model prompts, and outputs to identify and mitigate risks as they arise.

The Path Forward

AI is a transformative tool, but its reliance on sensitive data creates unique challenges for security teams. By adopting a structured approach to securing AI infrastructure, organizations can unlock the potential of custom LLMs and tools like Microsoft Copilot without compromising data integrity, compliance, or trust.

For a deeper dive into these strategies—and to watch the live demos showing how Normalyze can help—watch the full webinar recording.

Frequently Asked Questions

1. What is DSPM, and why is it critical for AI implementations?

Data Security Posture Management (DSPM) is a strategy and set of tools designed to discover, classify, and monitor valuable and sensitive data as well as user access across an organization’s cloud and on-premises environments . For AI implementations like custom LLMs and Microsoft Copilot, DSPM is crucial for ensuring that sensitive or regulated data is properly governed, reducing the risk of data leakage, misuse, or compliance violations.

2. What are the main risks of using custom LLMs in organizations?

Custom LLMs can introduce risks such as:

Embedding sensitive data in models during training due to unsanitized datasets.
Inadvertent data leakage in model outputs.
Compliance failures if regulated data (e.g., PII) is mishandled.
Security vulnerabilities like training data poisoning or prompt injection attacks.

These risks highlight the importance of auditing training data, monitoring data flows, and enforcing strict access controls.

3. How does Microsoft Copilot use organizational data, and what risks does this create?

Microsoft Copilot uses a process called grounding, where it accesses data from Microsoft Graph and the Semantic Index to provide contextually relevant responses. While this improves accuracy, it also creates risks, such as:

Data leakage if sensitive files or emails are improperly governed.
Misuse of confidential information if role-based access controls are inadequate.
Exposure of regulated data if sensitivity labels are not consistently applied.

4. How can organizations secure sensitive data in AI workflows?

Organizations can secure AI workflows by:

Discovering and classifying data to identify sensitive or regulated information.
Enforcing role-based access controls to limit who can access what data.
Monitoring data lineage to track how data flows into and out of AI systems.
Auditing and anonymizing data to ensure sensitive information is masked or encrypted.
Continuously monitoring interactions to identify and mitigate risks in real time.

5. How does Normalyze help mitigate AI security risks?

Normalyze’s DSPM platform helps organizations secure their AI infrastructure by:

Automatically discovering and classifying sensitive data across cloud and on-premises environments.
Mapping data lineage to track how sensitive information flows into AI models.
Integrating with tools like Microsoft MIP labels for enhanced data governance.
Proactively identifying risks and notifying teams of unauthorized access or sensitive data usage, enabling quick remediation.

Unlocking the Value of AI: Safe AI Adoption for Security Practitioners

Jan 13, 2025 | All Resources, Blog Posts

As a security practitioner or CISO, you likely find yourself in a rapidly evolving landscape where the adoption of AI is both a game-changer and a challenge

Data resilience: the conversation security and data teams can’t afford to miss

Dec 9, 2024 | All Resources, Documents

Three useful ESG assets in one pack.

Cybersecurity Awareness: 31 Essential Data Security Tips

Nov 1, 2024 | All Resources, Blog Posts

Thirty-one practical data security tips, organized by category, help your team safeguard sensitive data and reduce risks.

Parag Bajaria

Parag Bajaria brings over 15 years of seasoned leadership in product management, with expertise in cloud security, DevOps, and fostering the growth of startup companies. His strategic foresight has been pivotal in developing pioneering products in CIEM, CWPP, CSPM, and CNAPP, steering the market toward innovative security solutions. At the helm of product teams, Parag's approach blends cutting-edge technical insight with keen market instincts, leading to the development and launch of innovative, user-centric solutions. At Normalyze, Parag spearheads the development of the next generation of AI-powered Data Security Platforms to discover, classify and secure data no matter where it is.

Gartner® Innovation Insight: Data Security Posture Management

DSPM Buyer's Guide

CYBER 60: The fastest-growing startups in cybersecurity

Training Your LLM Dragons: Why DSPM is the Key to AI Security

Use Case 1: Securing Custom LLMs

How Normalyze Helps

Use Case 2: Mitigating Risks in Microsoft Copilot

How Normalyze Helps

The Path Forward

Frequently Asked Questions

1. What is DSPM, and why is it critical for AI implementations?

2. What are the main risks of using custom LLMs in organizations?

3. How does Microsoft Copilot use organizational data, and what risks does this create?

4. How can organizations secure sensitive data in AI workflows?

5. How does Normalyze help mitigate AI security risks?

Unlocking the Value of AI: Safe AI Adoption for Security Practitioners

Data resilience: the conversation security and data teams can’t afford to miss

Cybersecurity Awareness: 31 Essential Data Security Tips

Parag Bajaria