Vector embeddings of personal data are likely personal data under GDPR. Here is the legal test, the 2025 attack research, the regulator convergence, and how to document your position.
Most articles about vector databases under GDPR either repeat the OWASP "vector and embedding weaknesses" entry or dodge the legal question with a vendor pitch. The actual question for a builder is narrower. When a customer files an Article 15 access request, are the embeddings in your Pinecone instance personal data you have to return?
I think the answer is yes for most production RAG stacks built on commercial embedding models in 2026. The rest of this piece walks the legal test, the 2025 attack research, the regulator position, and the documented anonymity assessment that would let you argue otherwise.
If you want the implementation playbook (lineage tables, pseudonymization, retrieval-time access control), read Building RAG with customer data. This article is the layer above. Should you treat your vector store as a personal-data store at all?
GDPR's definition of personal data is a chain of three conditions. Each one matters for the embedding question.
Article 4(1). Personal data is "any information relating to an identified or identifiable natural person." A name is information. A transaction record is information. A vector is information too, if it relates to an identifiable person.
Recital 26. Identifiability is judged against a specific standard:
"...account should be taken of all the means reasonably likely to be used, either by the controller or by another person, to identify the natural person directly or indirectly."
Whether you intend to re-identify is irrelevant. The test asks whether reasonable means exist.
CJEU C-582/14 (Breyer). In 2016 the Court of Justice ruled on dynamic IP addresses. A website operator could not identify visitors directly. ISPs could, with legal process. The Court held the IP addresses were still personal data for the operator, because reasonable means existed to combine sources and identify individuals. The Breyer ruling on EUR-Lex is the canonical "another person with reasonable means" precedent.
The implication for embeddings is direct. If anyone with reasonable means can recover the underlying personal data from a vector, the vector is personal data in your hands. The question becomes empirical. How reasonable are those means in 2026?
Three years of attack research, summarised through the one lens that matters for the Recital 26 test: how much technical sophistication does an attacker need?
The 2023 baseline. John Morris and colleagues at Cornell Tech published Text Embeddings Reveal (Almost) As Much As Text at EMNLP 2023. Their method, vec2text, treats embedding inversion as controlled generation. It iteratively guesses text, re-embeds it, and corrects toward the target vector. The headline result: 92% of 32-token inputs recovered exactly from text-embedding-ada-002 vectors. The paper recovered full names from a clinical notes dataset. Cornell Tech gave it a best paper award.
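The guess, re-embed, correct loop is easier to see in miniature. Here is a toy sketch of the idea, with a character-bigram count standing in for a real embedding model; the attack pattern only needs a black-box `embed()` it can call repeatedly, which is exactly what an embedding API provides.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def embed(text):
    # Toy "embedding": multiset of character bigrams. A real model
    # maps text to a dense float vector, but the loop is the same.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def distance(a, b):
    # L1 distance between bigram count vectors.
    return sum(abs(a[k] - b[k]) for k in set(a) | set(b))

def invert(target_vec, length):
    # vec2text-style loop: guess text, re-embed it, keep any edit
    # that moves the guess's embedding closer to the target vector.
    guess = ["a"] * length
    improved = True
    while improved:
        improved = False
        for pos in range(length):
            best_char = guess[pos]
            best_dist = distance(embed("".join(guess)), target_vec)
            for ch in ALPHABET:
                guess[pos] = ch
                d = distance(embed("".join(guess)), target_vec)
                if d < best_dist:
                    best_char, best_dist = ch, d
                    improved = True
            guess[pos] = best_char
    return "".join(guess)

secret = "data"
recovered = invert(embed(secret), len(secret))
print(recovered)  # "data" -- recovered exactly from its vector alone
```

The toy recovers the input exactly because short texts leave a nearly unique fingerprint in the vector. Morris's result is that the same holds, at 92% exact recovery, for real 1536-dimensional ada-002 embeddings.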
The 2024 mitigation paper. A November 2024 paper, Mitigating Privacy Risks in LLM Embeddings from Embedding Inversion, proposed defenses. Defenses are now a research subfield, which tells you the underlying problem is real.
The 2025 generalisation. Two papers in 2025 made the attacks easier and more general, not harder. The first, ALGEN, published at ACL 2025, showed that embedding spaces from different encoders are nearly isomorphic at sentence level. A one-step linear alignment from as few as one leaked embedding-text pair lets an attacker invert vectors from a model they have never trained against. About 1,000 alignment samples saturate performance, achieving ROUGE-L scores of 45-50, on par with prior attacks that needed orders of magnitude more data. The second line of work, ZSInvert and Zero2Text, achieved zero-shot inversion across arbitrary encoders via adversarial decoding, with no paired training data at all. The same algorithm works for every embedding model.
The 1,000-sample threshold from ALGEN is the practical headline for the legal test. Before 2025, an attacker needed to train a custom inversion model against your specific encoder. Now they need a thousand example pairs and one linear alignment step. Any defense premised on "our embedding model is closed-source so vec2text cannot be trained against it" is a defense that died in February 2025.
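The "one linear alignment step" is not an exotic operation; it is ordinary least squares. A toy numpy sketch, with synthetic spaces standing in for two real encoders, shows how little machinery the attacker needs once the leaked pairs exist:

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n_pairs = 64, 48, 1000

# Two encoders' spaces, synthetically related by an unknown linear map.
# ALGEN's observation is that real sentence-embedding spaces are close
# to linearly alignable in exactly this way.
true_map = rng.normal(size=(d_src, d_tgt))
src = rng.normal(size=(n_pairs, d_src))  # leaked victim-model vectors
tgt = src @ true_map + 0.01 * rng.normal(size=(n_pairs, d_tgt))

# One-step alignment: ordinary least squares on the ~1,000 leaked pairs.
W, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# Fresh victim vectors now map into the attacker's own space, where an
# inversion model the attacker already controls can decode them.
new_src = rng.normal(size=(10, d_src))
aligned = new_src @ W
err = np.linalg.norm(aligned - new_src @ true_map) / np.linalg.norm(new_src @ true_map)
print(f"relative alignment error: {err:.4f}")
```

With a thousand pairs the fitted map is essentially exact; the hard part of the attack, the inversion decoder, is trained once against a model the attacker controls and reused against yours.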
Independent confirmation. Tonic.ai published a reproducibility study in 2025 that re-ran the Morris method. Roughly 40% of sensitive data in sentence-length embeddings recovers exactly. Over 10% in essay-length text. A separate July 2025 reproducibility paper confirmed the original Morris results and noted that quantization can blunt vec2text while preserving retrieval quality, which gives mitigation work a foothold.
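The quantization finding is worth unpacking, because it is one of the few mitigations with evidence behind it. The claim has two halves: retrieval quality survives, inversion degrades. The first half is easy to see: an int8 round-trip barely moves the vector in cosine terms. The second half you cannot demonstrate with a toy; it has to be measured against a real attack, which is what the documented assessment in the next section is for.

```python
import numpy as np

rng = np.random.default_rng(1)
vec = rng.normal(size=768).astype(np.float32)

# Scalar int8 quantization: map the float range onto 256 levels.
scale = np.abs(vec).max() / 127
q = np.round(vec / scale).astype(np.int8)      # what you store
deq = q.astype(np.float32) * scale             # what retrieval sees

cos = np.dot(vec, deq) / (np.linalg.norm(vec) * np.linalg.norm(deq))
print(f"cosine similarity after int8 round-trip: {cos:.4f}")
```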
Two things matter for the legal test.
First, vec2text and its successors are public. The code is on GitHub. ALGEN and ZSInvert have open implementations. "Means reasonably likely to be used by another person" no longer requires a research lab. It requires a GPU and an afternoon.
Second, even partial recovery counts. Recital 26 does not require complete reconstruction. It asks whether the controller or another person can identify the individual. A recovered first name plus a recovered city plus a recovered job title is usually enough. Embeddings preserve exactly that kind of semantic context.
I am not certain how the EU courts will treat the ALGEN and ZSInvert generation of attacks specifically. But I would not bet on a court deciding in 2026 that "one thousand leaked pairs and a linear alignment" counts as unreasonable means.
Three European regulators have published guidance that touches the embedding question. They converge.
EDPB Opinion 28/2024. In December 2024 the European Data Protection Board adopted Opinion 28/2024 at the request of the Irish DPA. The opinion is about AI models, not embedding stores specifically. The reasoning transfers cleanly.
The EDPB sets a high bar for claiming an AI model is anonymous. Both of these risks must be "insignificant":

- the likelihood of direct (including probabilistic) extraction of personal data from the model, and
- the likelihood of obtaining that personal data, intentionally or not, from queries to the model.
Assessment is case-by-case. The burden is on the controller to demonstrate anonymity, not on regulators to disprove it. The EDPB explicitly notes that AI models trained on personal data cannot, in all cases, be considered anonymous.
Translate that to a vector store. Your embeddings are derived representations of personal data. The "extraction" risk is embedding inversion. The "query" risk is similarity search returning chunks that contain or imply personal data. You would have to show both are insignificant. Vec2text-style attacks make the extraction risk hard to argue away. Standard RAG retrieval makes the query risk almost guaranteed.
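The query risk needs no attack at all; it is the system working as designed. A minimal similarity search over chunks, sketched here with random vectors standing in for real embeddings, returns the personal data verbatim:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy knowledge base: chunks with personal data, plus stand-in vectors.
# Real embeddings come from a model; random unit vectors are enough to
# show the mechanics of the query-side leak.
chunks = [
    "Invoice dispute raised by Maria Keller, maria.keller@example.com",
    "Quarterly uptime report, no personal data",
    "Refund approved for the account of Jonas Brandt, Berlin",
]
index = rng.normal(size=(len(chunks), 8))
index /= np.linalg.norm(index, axis=1, keepdims=True)

def retrieve(query_vec, k=2):
    scores = index @ query_vec
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Any query that lands near chunk 0 surfaces a name and an email.
query = index[0] + 0.1 * rng.normal(size=8)
top = retrieve(query / np.linalg.norm(query))
print(top[0])  # the personal-data chunk ranks first
```

No inversion, no leaked pairs, no GPU. This is why the query risk is "almost guaranteed" for production RAG: retrieval returning relevant chunks is the product.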
CNIL Q&A on generative AI. The French DPA's Q&A on generative AI systems is direct on RAG. The deployer connecting a generative AI system to its own knowledge base is responsible for that processing when the knowledge base contains personal data. CNIL also recommends on-premise deployment for sensitive RAG and warns about extraction risks from third-party hosting. This is the clearest regulator statement that RAG vector stores carry full controller obligations.
ICO consultation outcomes. The UK regulator's response to its generative AI consultation, published December 2024, says developers must design systems that allow specific individuals' personal data to be identified at every stage, including the training set and the model itself. UK GDPR mirrors EU GDPR on personal data definitions, so the legal reasoning is the same.
Three regulators, one direction. The controller has to prove non-personal. The default is personal.
Enforcement is real but messy. The Italian Garante banned DeepSeek in January 2025 within 48 hours of opening an inquiry. That is the direction of travel. But the Garante's earlier €15 million fine against OpenAI was annulled by the Court of Rome on 18 March 2026, which means even high-confidence regulatory action can be overturned in court. Plan for both. The CNIL and ICO positions on RAG remain unchallenged so far, and EDPB Opinion 28/2024 has not been litigated at all.
If you want to argue your embeddings are not personal data, you need a written assessment. EDPB Opinion 28/2024 requires controllers to keep documentation of the anonymity analysis on request. Verbal assertions do not survive a regulator's questions.
A working structure. Treat this as a starting template, not legal advice.
| Section | What you document | What "passes" looks like |
|---|---|---|
| 1. Inputs | What personal data enters the embedding pipeline. Categories, volume, sensitivity. | Honest inventory, including indirect identifiers. |
| 2. Embedding model | Model name, version, dimension, provider, hosting jurisdiction. Whether weights are public. | Specific identification, dated. |
| 3. Extraction risk | Threat model for embedding inversion. Attacks tested or referenced (vec2text, ALGEN, ZSInvert). Quantified recovery rates. | Demonstrated negligible risk against current public attacks. Hard to achieve in 2026. |
| 4. Query leakage risk | Threat model for retrieval-based leakage. Will similarity search return personal data in chunks? | Demonstrated that retrieval cannot surface identifiable information. Almost never true for production RAG. |
| 5. Mitigations | Encryption at rest, access control, dimension reduction, quantization, noise injection, on-premise hosting. | Each mitigation documented with its residual risk. |
| 6. Residual risk | The honest sum of risks 3, 4, and 5. | Insignificant per Recital 26. |
| 7. Decision | Personal data or not, with reasoning. Date and author. | Defensible to a regulator. |
If you can complete sections 3 and 4 with "negligible," congratulations. Most teams cannot. The honest outcome of the exercise, for almost all production RAG systems built on commercial embedding models in 2026, is the conclusion that the embeddings are personal data.
That is fine. Once you accept the classification, the practical playbook applies. Build a document-to-vector lineage table, pseudonymize before embedding, run a DPIA, and plan for Article 15 and Article 17 requests.
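The lineage table is the piece teams most often skip, so here is the shape of it. A minimal sketch with illustrative names: one row per (document, chunk, vector), so an Article 17 erasure request for a source document resolves to an exact list of vector IDs to delete from the store.

```python
import sqlite3

# Minimal document-to-vector lineage table. Column and ID names are
# illustrative, not a prescribed schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lineage (doc_id TEXT, chunk_no INTEGER, vector_id TEXT)")
db.executemany(
    "INSERT INTO lineage VALUES (?, ?, ?)",
    [("doc-42", 0, "vec-a1"), ("doc-42", 1, "vec-a2"), ("doc-99", 0, "vec-b7")],
)

def vectors_for_erasure(doc_id):
    # Article 17 request for doc_id -> the exact vectors to delete.
    rows = db.execute(
        "SELECT vector_id FROM lineage WHERE doc_id = ? ORDER BY chunk_no",
        (doc_id,),
    )
    return [r[0] for r in rows]

print(vectors_for_erasure("doc-42"))  # ['vec-a1', 'vec-a2']
```

Without this table, erasure means either re-embedding the entire corpus or arguing to a regulator that you could not comply. Neither is a good meeting.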
The seven-section assessment is the artifact a regulator will ask for during an audit. Even if you decide your embeddings are personal data and skip sections 3 and 4, having sections 1, 2, 5, 6, and 7 in writing puts you ahead of most teams shipping RAG features in 2026.
The arguments teams make to themselves about why their embeddings are not personal data, and why each one fails when you write it down.
"The embedding model is closed-source, so vec2text cannot be trained against it." ALGEN proved this wrong. Roughly a thousand leaked pairs from any model let an attacker invert vectors from any other model in the same class. ZSInvert removed the paired-data requirement entirely. Closed-source is a speed bump, not a defense.
"The vector store is encrypted at rest." Encryption protects against unauthorized access by third parties. It does not change whether the data you process is personal. Your own application decrypts to query. The processing is the issue.
"We applied random projection or dimensionality reduction." Quantization helps too, per the 2025 reproducibility study. None of these zero out the recoverable signal. Document the exact method and the post-mitigation recovery rate against a current attack, ideally ALGEN or ZSInvert.
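A quick sketch of why projection alone does not help: the thing inversion attacks exploit is the similarity structure of the space, and random projection is designed to preserve exactly that. Synthetic clustered points standing in for embeddings:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out = 256, 64

# Clustered points standing in for embeddings of related documents.
centers = rng.normal(size=(5, d_in))
labels = rng.integers(0, 5, size=50)
X = centers[labels] + 0.5 * rng.normal(size=(50, d_in))

P = rng.normal(size=(d_in, d_out)) / np.sqrt(d_out)  # random projection
Y = X @ P

def pairwise_cos(M):
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return (M @ M.T)[np.triu_indices(len(M), k=1)]

# The similarity structure, and with it much of the invertible
# signal, survives the 4x dimension reduction.
corr = np.corrcoef(pairwise_cos(X), pairwise_cos(Y))[0, 1]
print(f"similarity structure preserved: r = {corr:.3f}")
```

That preserved structure is the mitigation's weakness, which is why the assessment should record a measured post-mitigation recovery rate rather than the fact that a projection was applied.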
"The data is pseudonymized before embedding." Pseudonymized data remains personal data under GDPR Recital 26. Pseudonymization lowers risk and supports your legal basis but does not exempt you from controller obligations.
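The mechanics also show why: pseudonymization keeps a token map, and as long as that map exists the data is re-identifiable by design. A toy regex pass (production systems use NER-based PII detection, not hand-written patterns):

```python
import re

# Illustrative pseudonymization pass before embedding. The token map
# must itself be protected: while it exists, the data is pseudonymized,
# not anonymized, and remains personal data under Recital 26.
token_map = {}

def pseudonymize(text, patterns):
    def swap(match):
        value = match.group(0)
        return token_map.setdefault(value, f"PERSON_{len(token_map)}")
    for pattern in patterns:
        text = re.sub(pattern, swap, text)
    return text

patterns = [r"[\w.]+@[\w.]+", r"\bMaria Keller\b"]  # email, then a known name
clean = pseudonymize("Ticket from Maria Keller (maria.keller@example.com)", patterns)
print(clean)  # "Ticket from PERSON_1 (PERSON_0)"
```

What it buys you is lower extraction and query risk (sections 3 and 4 of the assessment improve), plus the ability to answer an access request from the token map. What it does not buy you is an exit from GDPR.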
"We only store metadata, not full text." Depends on the metadata. A row of `(ticket_id, customer_id, embedding)` still relates to an identifiable person. The classification follows the relation to a natural person.
"This is debated in legal circles." The debate exists, but the regulator-side direction is one-sided. The EDPB, CNIL, and ICO all moved toward "presume personal." The Court of Rome's reversal of the Garante OpenAI fine in March 2026 was about transparency obligations and breach notification, not about whether AI training data was personal data. The personal data question is still settled in the EDPB direction.
Open a document called embedding-anonymity-assessment.md in your repo. Try to write the seven sections from the table above for one of your existing or planned vector stores. Two outcomes are possible. You produce a defensible position to keep on file, or you discover the embeddings are personal data and you start building the lineage table before the system goes to production. Both outcomes are useful.
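If a scaffold helps, the seven sections from the table generate in a few lines. The section names come from the table above; the heading layout is my own suggestion, not a prescribed format.

```python
from pathlib import Path

SECTIONS = [
    "Inputs", "Embedding model", "Extraction risk", "Query leakage risk",
    "Mitigations", "Residual risk", "Decision",
]

# Scaffold the assessment file with one heading per section.
doc = "# Embedding anonymity assessment\n\n" + "\n".join(
    f"## {i}. {name}\n\nTODO\n" for i, name in enumerate(SECTIONS, start=1)
)
Path("embedding-anonymity-assessment.md").write_text(doc)
print(doc.splitlines()[2])  # "## 1. Inputs"
```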
The status quo of "we never thought about it" is the only outcome that fails an audit.