A practical guide to building RAG systems with customer data while handling GDPR obligations. Lineage tables, retrieval authorization, embedding inversion, and erasure planning.
The European Data Protection Supervisor's TechSonar entry on RAG opens with a fictional scenario about a car rental company that hooks an AI chatbot up to its customer database. Within a paragraph, the chatbot is leaking the wrong things to the wrong people. The point is not the cute scenario. The point is that the EDPS, the regulator that watches EU institutions, considers RAG important enough to publish about. That tells you where this is going.
I think the lineage table is the single most undervalued part of a RAG architecture, and the cheapest one to skip during a proof of concept. Five things actually matter when you build RAG over customer data. None of them are exotic. All of them save you from a regulator question you cannot answer.
If you want the prior question (are the embeddings themselves personal data?), read "Are vector embeddings personal data under GDPR?". This piece assumes you have already decided yes, and walks through the build.
Get a clear picture before anything else. Which documents feed your RAG pipeline? Which personal data do those documents contain?
Common sources that carry personal data into RAG systems:

- Support tickets and helpdesk exports
- CRM records and sales notes
- Email threads and chat transcripts
- Internal wikis and meeting minutes
Most teams underestimate how much personal data flows in. A support ticket that looks like "just a bug report" often contains the customer's name, email, company, and a description of what they were doing when the issue occurred. All of that gets chunked and embedded.
Build a simple inventory. Source type, data categories, volume, whether the data is filtered before embedding. You need this inventory to determine your legal basis under GDPR Article 6 and to respond to data subject requests later.
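The inventory described above can be as small as one record per source. A minimal sketch (the field names and example values are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class SourceInventoryEntry:
    """One row of the data inventory: what feeds the pipeline, and on what basis."""
    source_type: str                 # e.g. "support_tickets"
    data_categories: list            # e.g. ["name", "email", "company"]
    approx_docs_per_month: int
    filtered_before_embedding: bool
    legal_basis: str                 # GDPR Article 6 basis, to be confirmed with legal

inventory = [
    SourceInventoryEntry("support_tickets", ["name", "email", "company"], 4000, False, "contract"),
    SourceInventoryEntry("crm_notes", ["name", "phone"], 1200, True, "legitimate interest"),
]

# Quick audit: which sources embed unfiltered personal data?
unfiltered = [e.source_type for e in inventory if not e.filtered_before_embedding]
```

Even this flat structure is enough to answer the two questions a regulator will ask first: what goes in, and under which legal basis.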
This is the single most impactful step you can take.
Raw personal data in a vector store is the highest-risk configuration available. Embedding inversion research has been compounding since 2023. The original Vec2Text paper recovered 32-token inputs with up to 92% precision. A February 2025 paper called ALGEN, published at ACL 2025, made the attacks much easier. ALGEN showed that embedding spaces from different encoders are nearly isomorphic. About 1,000 leaked embedding-text pairs are enough to invert vectors from any modern encoder using a one-step linear alignment, achieving ROUGE-L scores of 45-50. The successor paper ZSInvert removed the paired-data requirement entirely. OWASP added "Vector and Embedding Weaknesses" to its GenAI Top 10:2025 for the same reason.
Treat embeddings as reversible. The cost of pretending otherwise has been climbing every year.
Pseudonymization means replacing direct identifiers (names, emails, account numbers) with tokens before the data enters your embedding pipeline. Store the mapping between tokens and real identifiers separately, under stricter access controls.
```python
import hashlib
import re

# Simple pseudonymization before embedding.
# In production, use a proper PII detection library (Presidio, custom NER).
IDENTIFIER_PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "phone": r"\+?[\d\s\-()]{10,15}",
}

def pseudonymize_chunk(text: str, mapping: dict) -> str:
    for label, pattern in IDENTIFIER_PATTERNS.items():
        for match in re.finditer(pattern, text):
            original = match.group()
            if original not in mapping:
                token = f"[{label.upper()}_{hashlib.sha256(original.encode()).hexdigest()[:8]}]"
                mapping[original] = token
            text = text.replace(original, mapping[original])
    return text
```
This is a simplified example. Production systems need proper PII detection (consider Microsoft's Presidio or custom NER models). The point is that this step happens before the text reaches your embedding model.
If your RAG use case does not require identifying individuals at all (product documentation, general knowledge bases), aim for full anonymization. Be honest about whether that is achievable. The EDPB stated in Opinion 28/2024 that AI models trained with personal data cannot, in all cases, be considered anonymous. The same logic applies to embedding stores, and the 2025 ALGEN result makes the bar harder to clear.
Most teams skip this during the initial build and regret it later.
Under GDPR Article 17 (right to erasure), a customer can request deletion of their personal data. Under Article 15 (right of access), they can request a copy of all data you hold on them. Both require you to know which vectors in your store relate to which person.
Most vector databases (Pinecone, Milvus, Weaviate, Qdrant) support deletion of individual vectors by ID. Identifying which vectors to delete is the hard part. You need a lineage table. Every vector ID maps to a source chunk. Every source chunk maps to a source document. Every source document maps to the data subjects it contains.
```python
# Lineage record stored alongside your vectors
lineage = {
    "vector_id": "vec_a8f3e2",
    "chunk_id": "chunk_0042",
    "source_document": "ticket_2024_18837.json",
    "data_subjects": ["customer_id_9201", "customer_id_4455"],
    "ingested_at": "2026-03-15T10:30:00Z",
    "pseudonymization_applied": True,
}
```
When customer 9201 submits an erasure request, query the lineage table for all entries referencing that customer ID, delete the corresponding vectors, and remove or re-embed the affected source chunks.
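That erasure flow can be sketched in a few lines. The `vector_store` and `document_store` clients and their `delete(ids=...)` methods are hypothetical stand-ins for whatever database APIs you actually use:

```python
def handle_erasure_request(subject_id, lineage_table, vector_store, document_store):
    """Sketch of the Article 17 flow: find, delete, and report.

    `lineage_table` is a list of lineage records like the one above;
    `vector_store` and `document_store` are hypothetical clients with
    a `delete(ids=...)` method.
    """
    affected = [r for r in lineage_table if subject_id in r["data_subjects"]]

    # Delete every vector derived from this person's data.
    vector_store.delete(ids=[r["vector_id"] for r in affected])

    # Documents mentioning only this person can be deleted outright.
    # Shared documents need redaction and re-embedding instead.
    sole_docs = {r["source_document"] for r in affected
                 if r["data_subjects"] == [subject_id]}
    document_store.delete(ids=sorted(sole_docs))

    return {"vectors_deleted": len(affected), "documents_deleted": len(sole_docs)}
```

Returning a summary record matters: logging what was deleted, and when, is part of demonstrating compliance.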
Build the lineage table before your first embedding. Retrofitting it onto a populated vector store is a multi-week reconciliation project, and the failure mode is silent: you discover the gap when an erasure request arrives and you cannot prove which vectors belong to which person. Day-one cost: an extra column in your ingestion pipeline. Retrofit cost: a sprint, plus the legal exposure for any erasure request that arrived in the meantime.
A RAG system that retrieves any document for any query is a data leak waiting to happen.
If your system serves multiple customers, a query from Customer A should never retrieve documents belonging to Customer B. If your system serves internal teams, a query from the marketing team should not surface HR records.
Enforce this at the retrieval layer, not at the application layer. Before chunks enter the LLM's context window, verify that the querying user or tenant is authorized to access each document.
Approaches that work:

- A separate namespace or collection per tenant, so cross-tenant retrieval is impossible by construction.
- Metadata filters applied at query time (tenant ID, team, document classification) that the user cannot override.
- A post-retrieval authorization check against your existing permission system before any chunk enters the prompt.
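A metadata-filter approach can be sketched as follows. The `vector_store.search` call with a `filter` argument is a stand-in for your database's metadata filtering (Pinecone, Qdrant, Weaviate, and Milvus all offer an equivalent, with differing syntax):

```python
def authorized_search(query_embedding, user, vector_store, top_k=5):
    """Sketch of retrieval-layer enforcement with a tenant filter."""
    results = vector_store.search(
        vector=query_embedding,
        top_k=top_k,
        # The filter comes from the authenticated session, never from query text.
        filter={"tenant_id": user.tenant_id},
    )
    # Defense in depth: re-check each hit before it reaches the context window.
    return [r for r in results if r.metadata.get("tenant_id") == user.tenant_id]
```

The redundant post-filter looks paranoid, but it catches misconfigured indexes and vectors ingested before the filter field existed.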
The EDPS TechSonar page specifically warns that "certain user queries could be specific enough to cause RAG systems to retrieve and disclose personal data." Access controls at the retrieval layer are your primary defense against this. Whether the EDPS TechSonar guidance will harden into formal EDPB opinion is genuinely uncertain. TechSonar pages are informal monitoring, not binding interpretation. But the direction is clear, and the practical engineering is the same either way.
You will get these requests. Plan the workflow now, not when the first one arrives.
For access requests (Article 15): The data subject wants to know what data you hold on them. Your response comes from the source document store, not the vector store. Vectors are a derived representation. Maintain your original source data in a searchable format alongside the vector database. When a request arrives, query the lineage table for the data subject's ID, retrieve the associated source documents, and provide those.
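The access flow is simpler than erasure because nothing is modified. A sketch, where `document_store.get(doc_id)` is a hypothetical lookup returning the original source text:

```python
def handle_access_request(subject_id, lineage_table, document_store):
    """Sketch of the Article 15 flow: answer from source documents, not vectors."""
    doc_ids = sorted({r["source_document"] for r in lineage_table
                      if subject_id in r["data_subjects"]})
    return {doc_id: document_store.get(doc_id) for doc_id in doc_ids}
```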
If your RAG system makes automated decisions that produce legal or significant effects on individuals, Article 22 likely applies. You would then also need to provide meaningful information about the logic involved.
For erasure requests (Article 17): Query the lineage table. Identify all vectors associated with the data subject. Delete them. Delete or redact the source documents. Log the action.
If the same source document contains data about multiple people, you cannot delete the entire document when only one person requests erasure. You need to redact the requesting person's data and re-embed the remaining content. This is the most common failure mode I have seen flagged in published RAG architecture discussions, and the lineage table from step 3 is what makes the redact-and-re-embed flow possible. Without it, you are left with a binary keep-or-delete-everything choice that fails the proportionality requirement under Article 17.
Build a retention policy for your source documents and vector store. Define how long data stays, under what legal basis, and what triggers deletion.
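A retention policy only helps if it is machine-checkable. A minimal sketch; the categories, periods, and legal bases below are illustrative examples, not legal advice for your situation:

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention rules per source type.
RETENTION_POLICY = {
    "support_tickets": {"legal_basis": "contract", "retain_days": 730,
                        "delete_trigger": "account_closure"},
    "chat_transcripts": {"legal_basis": "legitimate interest", "retain_days": 365,
                         "delete_trigger": "age"},
}

def is_expired(source_type, ingested_at, now=None):
    """True when a document has outlived its retention window."""
    now = now or datetime.now(timezone.utc)
    limit = timedelta(days=RETENTION_POLICY[source_type]["retain_days"])
    return now - ingested_at > limit
```

Run a check like this on a schedule against the lineage table, and expired vectors get the same delete path as erasure requests.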
A few things to keep on your radar that do not fit neatly into one of the five steps.
The "we are not training" defense has limits. RAG avoids the training-data debate but storage and retrieval are still processing under GDPR. You still need a legal basis. You still need to inform data subjects. You still need to handle their rights. The "it's not training" framing should not lull you into thinking GDPR does not apply.
Prompt injection through retrieved documents. This is the top security risk for RAG systems per OWASP LLM01:2025. If an attacker can modify or insert documents in your knowledge base, those documents can contain instructions that manipulate the LLM's output. Research shows that as few as 5 carefully crafted documents can manipulate AI responses over 90% of the time. Sanitize documents at ingestion. Monitor for anomalous content patterns. Consider output filtering as an additional layer.
Output leakage is subtle. Even with access controls on retrieval, the LLM's response might combine information from multiple chunks in ways that expose personal data. A response about "customers in Amsterdam who reported billing issues in March" can be specific enough to identify someone. Post-generation filtering for personal data is worth considering, especially for customer-facing RAG systems.
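A post-generation filter can be as simple as redacting identifier-shaped strings before the response leaves your system. A minimal sketch; the patterns are illustrative and deliberately incomplete (a production filter would combine patterns with NER):

```python
import re

# Redact responses that contain identifier-shaped strings.
PII_PATTERNS = [
    re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),  # email
    re.compile(r"\+?\d[\d\s\-()]{9,14}"),                            # phone
]

def redact_response(text: str) -> str:
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

This catches verbatim leaks, not inferential ones like the Amsterdam example above; those need human review of query logs or an LLM-based output classifier.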
Enforcement is real but messy. The Italian Garante banned DeepSeek in January 2025 within 48 hours of opening an inquiry. But the Garante's earlier €15 million fine against OpenAI was annulled by the Court of Rome on 18 March 2026. Plan your RAG architecture for the regulator-strict end of the distribution. Even when courts pull back specific fines, the underlying obligations have not changed.
If you have a RAG system that ingests customer data, check one thing first. Do you have a document-to-vector lineage table? If you do not, start building one. Without it you cannot comply with erasure requests, you cannot respond to access requests, and you cannot audit what personal data lives in your vector store. Everything else in this guide follows from that foundation.