How to log AI features without violating GDPR storage limitation or failing the EU AI Act audit. Three-tier architecture, PII redaction defaults that 2026 observability vendors get wrong, and the Article 12 vs Article 26 split.
GDPR Article 5(1)(e) is the storage-limitation principle. Personal data must be kept "in a form which permits identification of data subjects for no longer than is necessary". Necessary for the original processing purpose, that is. Once the purpose is done, the data should be deleted, anonymised, or have its retention justified by a different lawful basis. The CNIL has been enforcing this in earnest. On 13 January 2026 the regulator sanctioned Free and Free Mobile for €42 million combined (€27 million for Free Mobile, €15 million for Free), partly for retaining millions of former-subscriber records without justification for an excessive period. That was not an AI case. It does not need to be. The logic is the same the moment you point an LLM at customer data and start capturing the requests for debugging.
Article 12 of the EU AI Act says the opposite, sort of. High-risk AI systems must "technically allow for the automatic recording of events (logs) over the lifetime of the system." Article 26 makes deployers of those systems keep the logs for at least six months. Article 19 (provider-side) goes further on documentation retention, ten years on the technical file. These rules become applicable for high-risk systems on 2 August 2026.
So one regulator wants the data gone and the other wants the data kept. Both are right. Both are enforceable. The teams that get this wrong tend to do one of two things: either log everything and hope nobody looks, or log nothing and discover six months in that they cannot debug a hallucination that hit a real customer. Neither is the answer.
The answer is architectural. The two regulations are pointing at different categories of information that happen to live in the same Elasticsearch index in your current setup. Pull them apart and the tension dissolves.
Treat AI system logging as three distinct streams, not one. Each tier has a different content profile, a different retention window, and a different access policy. Most of the engineering work is just deciding which fields go in which stream.
Tier 1: Technical metadata. Model name and version, inference parameters (temperature, max tokens, top-p), token counts in and out, latency, status code, HTTP error class, deployment ID, request hash, cost per call, the time-of-day, the region the call hit. None of this is personal data. None of it should ever contain a customer name. Keep it for as long as it has analytic value. The AI Act technical-documentation duty under Article 11 reaches up to ten years for high-risk systems and Tier 1 is where most of the artefacts live. This is the cheap log to keep.
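To make the field boundary concrete, here is a minimal sketch of a Tier 1 event builder. The field names and the `tier1_event` helper are illustrative, not a standard; the point is that content only ever enters this tier as a hash.

```python
import hashlib
import time

def tier1_event(model: str, version: str, params: dict, usage: dict,
                latency_ms: float, status: int, deployment_id: str,
                region: str, raw_request: str) -> dict:
    """Build a Tier 1 metadata event. No customer content, ever."""
    return {
        "model": model,
        "model_version": version,
        "temperature": params.get("temperature"),
        "max_tokens": params.get("max_tokens"),
        "top_p": params.get("top_p"),
        "token_count_in": usage["prompt_tokens"],
        "token_count_out": usage["completion_tokens"],
        "latency_ms": latency_ms,
        "status": status,
        "deployment_id": deployment_id,
        "region": region,
        # Hash for correlating with other tiers; the content itself never lands here.
        "request_hash": hashlib.sha256(raw_request.encode()).hexdigest(),
        "ts": time.time(),
    }
```

The `request_hash` is what lets you join a Tier 1 row to a Tier 3 record during an incident window without the metadata stream ever carrying the text.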
Tier 2: Operational logs. Pseudonymised session identifiers (a hash of the user ID, not the user ID), prompt template ID (the version of the template that was used, not the rendered text), the guardrail that triggered, the safety classifier output, the fallback that fired, the human-oversight decision if there was one, the anomaly flag, the cost-budget breach flag. This tier may carry pseudonymised references to people but not the people themselves. The window that fits both regulations is six to twelve months. Six is the AI Act floor for deployers under Article 26. Twelve is a defensible upper bound if your incident-investigation purpose is documented and your DPIA references it.
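One implementation note on "a hash of the user ID": a plain SHA-256 of a known identifier can be reversed by brute-forcing the ID space, so a keyed hash is the safer default. A minimal sketch, assuming a secret pepper held outside the log pipeline (the env var name is illustrative):

```python
import hashlib
import hmac
import os

# Secret key held outside the log pipeline (e.g. KMS-backed). Name is illustrative.
PEPPER = os.environ.get("LOG_PSEUDONYM_KEY", "dev-only-key").encode()

def pseudonymise_user_id(user_id: str) -> str:
    """Keyed hash: stable per user for session correlation,
    not reversible without the key."""
    return hmac.new(PEPPER, user_id.encode(), hashlib.sha256).hexdigest()[:16]
```

The same user always maps to the same token, so incident investigation across sessions still works; rotating the key severs the link when the purpose expires.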
Tier 3: Content logs. The raw prompt, the raw output, the retrieved context for RAG, the user's message, the assistant's reply. This tier is where the personal data lives. If you have to log it at all, redact it before storage and keep it for a window measured in days, not months. Seven to thirty days is the band most legal teams will defend. The retention has to be tied to a specific purpose: hallucination triage, output-quality review, safety incident investigation. Tied to a purpose, automated for deletion, access-controlled at the row level if your observability vendor supports it.
The failure mode I see most often is teams that built tiers in name only. They created three log streams in Datadog, then routed every event to all three because the templating library made it easy. Six months later the "Tier 1 metadata" stream contains the full rendered prompt under a debug.context field that nobody planned for. The discipline is not the diagram. It is the one engineer whose job it is to refuse the field at ingestion.
If Tier 3 has to exist (it usually does), the data has to be redacted before it lands in the log store. After-the-fact redaction is much harder. Once unredacted text is in Elasticsearch or Datadog or your observability vendor's storage, deleting it reliably is a project. Redacting at ingestion is a function call.
Two redaction approaches, used together:
Pattern matching for structured PII. Email addresses, phone numbers, IBANs, credit card numbers, national IDs, IP addresses, postal codes. Regex catches these reliably and fast. The CRAPII benchmark and similar evaluations put pattern-only recall around 0.65 on unstructured text, meaning regex alone misses about a third of the PII in a typical prompt.
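The pattern layer is a few lines. These regexes are illustrative and deliberately loose; production pattern sets (Presidio's, for example) are broader and validated:

```python
import re

# Illustrative patterns only; real pattern libraries are broader and validated.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact_patterns(text: str) -> str:
    """Replace each structured-PII match with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```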
Named-entity recognition for the rest. People's names, organisation names, locations, freeform addresses, timestamps that double as identifiers, anything that depends on context. NER models from the Hugging Face hub or the spaCy ecosystem catch these. They are slower than regex and they produce occasional false positives (a person named "Berlin" is no fun) but they cover what the regex layer cannot reach.
Microsoft Presidio is the open-source library that ties both layers together with a stable API. It runs locally, supports a hybrid pattern-plus-NER pipeline by default, ships GPU-accelerated NER via GLiNER and Stanza in the recent releases, and has batch processing over a REST surface for high-volume pipelines. The hybrid configuration on the CRAPII evaluation reaches an F1 of around 0.94, with precision and recall in the same band. That is not perfect (nothing in this space is) but it is a long way past pattern-only.
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact(text: str, language: str = "en") -> str:
    findings = analyzer.analyze(text=text, language=language)
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

# In the AI client, before the log call:
log_payload = {
    "model": model,
    "prompt_template_id": template.id,
    "redacted_prompt": redact(rendered_prompt),
    "redacted_output": redact(model_response),
    "token_count_in": usage.prompt_tokens,
    "token_count_out": usage.completion_tokens,
}
```
Two things to know about this in production. First, redaction is not free. The NER pass is the slow step and on a CPU it adds tens of milliseconds per request. For high-volume traffic, sample. Run 100% of requests through pattern matching and 5-10% through the full hybrid pipeline. Elastic's reference architecture for PII detection in observability uses roughly that ratio. You get a representative sample for quality monitoring without paying the full cost on every call. Second, redaction has to happen before the log payload leaves your application boundary. If you let the unredacted prompt hit the observability SDK and then try to scrub it on the vendor side, you have already shipped the personal data to the vendor. The scrub is now a deletion request, not a redaction.
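The sampling split can be a cheap, deterministic per-request decision. A sketch, assuming `fast` and `full` are your two redaction pipelines passed in as callables; hashing the request ID means the same request always lands in the same bucket, so retries and replays are consistent:

```python
import hashlib
from typing import Callable

FULL_PIPELINE_RATE = 0.05  # fraction of requests that also get the slow NER pass

def redact_for_logging(text: str, request_id: str,
                       fast: Callable[[str], str],
                       full: Callable[[str], str]) -> str:
    """Every request gets the cheap regex pass; a deterministic hash-based
    sample of request IDs also goes through the full hybrid pipeline."""
    text = fast(text)
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    if bucket < FULL_PIPELINE_RATE * 10_000:
        text = full(text)
    return text
```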
This is the section nobody puts on the planning doc.
The minute you connect an LLM observability tool to an application that processes personal data, that tool is a sub-processor of personal data under GDPR. The DPA needs to exist, the sub-processor needs to be on your processing register, and the data flow needs to fit your transfer-impact framework. That is the legal layer. The harder layer is what the tool actually captures by default in 2026, because the defaults have shifted in the wrong direction.
If send_default_pii=True is set in sentry_sdk.init() (commonly turned on for non-AI debugging because it lets you see request headers and IPs in error events), the Sentry OpenAI integration starts shipping LLM inputs and outputs to Sentry. The opt-out that keeps the rest of the PII context is to add OpenAIIntegration(include_prompts=False) explicitly. Datadog LLM Observability captures inputs and outputs of every instrumented LLM call by default; the Sensitive Data Scanner is the redaction layer and it has to be enabled separately. Langfuse data masking is an Enterprise-tier feature; the open-source self-hosted build does not redact for you. Audit your sentry_sdk.init(), your Datadog LLM Observability configuration, and your Langfuse tier before adding any of these to a system that processes personal data.
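In configuration terms, the Sentry opt-out looks like this. The DSN is a placeholder; send_default_pii, OpenAIIntegration, and include_prompts are the real Sentry SDK names:

```python
import sentry_sdk
from sentry_sdk.integrations.openai import OpenAIIntegration

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    # Keeps request headers and IPs in error events for non-AI debugging...
    send_default_pii=True,
    integrations=[
        # ...while stopping the integration from shipping prompts and outputs.
        OpenAIIntegration(include_prompts=False),
    ],
)
```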
The vendor-by-vendor reality:
Langfuse is open-source under the MIT licence, Berlin-based, and the only major LLM observability vendor that lets you self-host the full stack on your own infrastructure for free. The cloud product has an EU region. The Enterprise tier is where SOC 2 Type II, the BAA for HIPAA, the GDPR DPA, and the data-masking feature live. If you are running the open-source self-hosted build, the masking layer is your responsibility.
Datadog LLM Observability is GA throughout 2025 with native instrumentation for OpenAI, Anthropic, the OpenAI Agents SDK, Pydantic AI, and a growing list of frameworks. Datadog acts as a processor, has a DPA, and supports EU regions. The Sensitive Data Scanner is a separate product that has to be enabled and configured for the LLM Observability traces. Datadog's own documentation has carried language to the effect that the platform is "not generally intended to be used for processing personal data," which is worth reading carefully against your use case.
LangSmith (LangChain) offers managed cloud, bring-your-own-cloud, and self-hosted options. Self-hosting requires an enterprise licence. Their public materials are less granular on the GDPR specifics than Langfuse's; budget time for the procurement conversation if you go this route.
Helicone, Phoenix (Arize), and the OpenTelemetry GenAI semantic conventions sit alongside the bigger vendors. The semantic-convention layer is interesting in its own right because it standardises which span attributes carry prompts and completions (gen_ai.prompt, gen_ai.completion), which means the redaction question moves from "configure each vendor differently" to "configure your OTel exporter once." It is not a complete answer yet but it is heading the right way.
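Because the conventions standardise the attribute names, the redaction can be one function over span attributes rather than per-vendor configuration. A minimal sketch over a plain attribute dict; in a real pipeline this logic would live in a SpanProcessor or a wrapping exporter so it runs before the span leaves the process, and `redact` stands in for whatever redaction function you use:

```python
from typing import Callable

# Attribute prefixes that carry content under the GenAI semantic conventions.
CONTENT_PREFIXES = ("gen_ai.prompt", "gen_ai.completion")

def scrub_genai_attributes(attributes: dict,
                           redact: Callable[[str], str]) -> dict:
    """Return a copy of span attributes with content attributes redacted;
    metadata attributes (token counts, model name) pass through untouched."""
    return {
        key: redact(value)
        if isinstance(value, str) and key.startswith(CONTENT_PREFIXES)
        else value
        for key, value in attributes.items()
    }
```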
The minimum-viable vendor checklist for any of them:

- A signed DPA, with the vendor on your sub-processor register and inside your transfer-impact framework.
- A documented answer to what the tool captures by default, LLM inputs and outputs included.
- A redaction or masking layer you can actually enable on your tier, configured before production traffic flows.
- An EU region or a self-hosting path if data residency matters to you.
- Retention controls you can set per log stream, not one global dial.
A retention window is not a number you pick at the kickoff and forget. It is a defended choice with a documented purpose, an automated deletion job, and a place in your Record of Processing Activities under Article 30 GDPR.
A starting set:
| Log tier | Retention | Justification |
|---|---|---|
| Tier 3 (content, redacted) | 7-30 days | Hallucination triage, output quality review |
| Tier 2 (operational, pseudonymised) | 6-12 months | AI Act Article 26 deployer floor, incident investigation |
| Tier 1 (technical metadata) | 1-3 years | Performance analysis, model version comparison |
| Article 11 high-risk technical documentation | 10 years | EU AI Act provider obligation |
Two things make these defensible.
The first is automation. "We delete logs when they age out" is not a control unless the deletion job exists, runs on a schedule, and writes its own audit trail. Datadog index TTLs, Elasticsearch ILM policies, S3 lifecycle rules, BigQuery partition expiration, and Postgres pg_partman policies all work for the operational tier. The CNIL Free Mobile decision turned partly on the absence of a working deletion process, not on the retention period itself. The fine was not because the company kept former-subscriber records too long. It was because nobody had built the mechanism that would have deleted them on time.
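What "the deletion job exists and writes its own audit trail" means in the smallest possible form, sketched against SQLite as a stand-in for whatever store holds your Tier 3 content (table and column names are illustrative):

```python
import sqlite3
import time

RETENTION_SECONDS = 30 * 24 * 3600  # Tier 3 upper bound: 30 days

def run_deletion_job(conn: sqlite3.Connection) -> int:
    """Delete Tier 3 rows past retention and record the run in an audit table."""
    cutoff = time.time() - RETENTION_SECONDS
    cur = conn.execute(
        "DELETE FROM tier3_content_logs WHERE created_at < ?", (cutoff,))
    deleted = cur.rowcount
    # The audit row is the evidence that the control ran, not just existed.
    conn.execute(
        "INSERT INTO deletion_audit (ran_at, cutoff, rows_deleted) VALUES (?, ?, ?)",
        (time.time(), cutoff, deleted))
    conn.commit()
    return deleted
```

In production the equivalent is usually a lifecycle policy plus a scheduled check that the policy actually fired; the audit table is the part teams forget.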
The second is purpose specificity. "Necessary for AI debugging" is too vague. The CNIL and the EDPB CEF 2025 enforcement programme both want to see retention tied to a named purpose with a named end condition. "Hallucination triage on customer support outputs, retained until the next quarterly model evaluation, then deleted" is the shape that survives review.
This is the part most teams gloss and most regulators care about.
Article 12 EU AI Act sits in the provider chapter. It says the provider of a high-risk AI system must design the system so it automatically generates logs over its lifecycle. The duty is on the entity that builds and places the system on the market. The logs themselves are an architectural feature of the product.
Article 26 sits in the deployer chapter. It says the deployer of that high-risk AI system must keep the automatically generated logs (to the extent they are under the deployer's control) for at least six months, and must monitor the system in operation. The duty is on the entity that puts the system to use in a real workflow.
The practical consequence: when you sit down to design the logging stack, write the obligation you are designing for at the top of the page. If the answer is "we do not know," resolve that before you write the schema. The schema you build for a deployer monitoring a third-party model is not the same shape as the schema you build for a provider proving system safety to a notified body.
There is one more wrinkle worth knowing. Article 12 requires logs that allow identification of risk situations and post-market monitoring. The AI Office and the European Commission have not yet published the implementing acts that will detail exactly which fields belong in those logs (the implementing acts are expected through 2026 and 2027). The current frame is "everything reasonably necessary," which is the kind of phrase that does the regulator's work and makes engineering planning hard. I think the safe move is to design the schema for the use cases that are clearly required (model version, input class, output class, decision, override) and treat the rest as forward-compatible columns you can fill once the implementing acts arrive. (Caveat: the implementing-acts timeline is not on a clean schedule, so build the column-add path before you need it.)
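The "clearly required fields plus forward-compatible columns" idea can be as simple as one schema with an extensible slot. A sketch; the field names are my reading of the clearly-required set named above, and `extras` is the column-add path for whatever the implementing acts eventually mandate:

```python
from dataclasses import dataclass, field

@dataclass
class HighRiskLogEvent:
    """Fields treated as clearly required today; `extras` absorbs
    future mandated fields without a schema migration."""
    model_version: str
    input_class: str       # coarse input category, never raw content
    output_class: str
    decision: str
    human_override: bool
    extras: dict = field(default_factory=dict)
```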
Block out a half-day this sprint and answer five questions about your current AI logging setup, in writing, in the engineering handbook page nobody has written yet:

1. For each AI system you run, are you the provider or the deployer, and which logging obligation (Article 12 or Article 26) are you designing for?
2. Which tier does each field you currently log belong to: technical metadata, operational, or content?
3. Where in the request path does redaction run, and does any unredacted prompt ever leave your application boundary?
4. What is the retention window for each stream, and does the deletion job that enforces it exist, run on a schedule, and write an audit trail?
5. Which observability vendors receive your AI telemetry, what do they capture by default, and is each one on your sub-processor register with a DPA?
If you cannot answer any of these in five minutes, that is your audit. The fix is not a tool purchase. It is a documented architecture, a redaction layer at the right point in the request path, and the discipline to keep Tier 3 small. The audit-by-grep article in this set is the natural starting point if you also want to know where the personal data is entering your codebase to begin with.
Related in this set:

- A practical, surface-by-surface audit recipe for finding personal data flowing to AI services. Covers prompt templates, observability defaults, embedding pipelines, and the limits of audit-by-grep in agent mode.
- The April 2026 trilogue reshaped the deadline. What binds you regardless, what the Omnibus will probably move, and the deployer obligations most dev teams underestimate.
- An operational guide for AI data leaks. GDPR Article 33 timing, containment, evidence preservation, notification templates, three worked incident walkthroughs, and the regulator differences that catch teams off guard.
AI Data Flow Checker
Map how personal data flows through your AI integrations and spot the privacy risks before they spot you.