How to log AI features without violating GDPR storage limitation or failing the EU AI Act audit. Three-tier architecture, PII redaction defaults that 2026 observability vendors get wrong, and the Article 12 vs Article 26 split.
GDPR Article 5(1)(e) is the storage-limitation principle. Personal data must be kept "in a form which permits identification of data subjects for no longer than is necessary". Necessary for the original processing purpose, that is. Once the purpose is done, the data should be deleted, anonymised, or have its retention justified by a different lawful basis. The CNIL has been enforcing this in earnest. On 13 January 2026 the regulator sanctioned Free and Free Mobile for €42 million combined (€27 million for Free Mobile, €15 million for Free), partly for retaining millions of former-subscriber records without justification for an excessive period. That was not an AI case. It does not need to be. The logic is the same the moment you point an LLM at customer data and start capturing the requests for debugging.
Article 12 of the EU AI Act says the opposite, sort of. High-risk AI systems must "technically allow for the automatic recording of events (logs) over the lifetime of the system." Article 26 makes deployers of those systems keep the logs for at least six months. Article 19 (provider-side) goes further on documentation retention, ten years on the technical file. These rules become applicable for high-risk systems on 2 August 2026.
So one regulator wants the data gone and the other wants the data kept. Both are right. Both are enforceable. The teams that get this wrong tend to do one of two things: either log everything and hope nobody looks, or log nothing and discover six months in that they cannot debug a hallucination that hit a real customer. Neither is the answer.
The answer is architectural. The two regulations are pointing at different categories of information that happen to live in the same Elasticsearch index in your current setup. Pull them apart and the tension dissolves.
Treat AI system logging as three distinct streams, not one. Each tier has a different content profile, a different retention window, and a different access policy. Most of the engineering work is just deciding which fields go in which stream.
Tier 1: Technical metadata. Model name and version, inference parameters (temperature, max tokens, top-p), token counts in and out, latency, status code, HTTP error class, deployment ID, request hash, cost per call, the time-of-day, the region the call hit. None of this is personal data. None of it should ever contain a customer name. Keep it for as long as it has analytic value. The AI Act technical-documentation duty under Article 11 reaches up to ten years for high-risk systems and Tier 1 is where most of the artefacts live. This is the cheap log to keep.
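To make the field boundary concrete, here is a minimal sketch of a Tier 1 event builder. The field names and the `tier1_event` helper are illustrative, not a standard; the point is that content only ever enters this tier as a hash.

```python
import hashlib
import time

def tier1_event(model: str, version: str, params: dict, usage: dict,
                latency_ms: float, status: int, deployment_id: str,
                region: str, raw_request: str) -> dict:
    """Build a Tier 1 metadata event. No customer content, ever."""
    return {
        "model": model,
        "model_version": version,
        "temperature": params.get("temperature"),
        "max_tokens": params.get("max_tokens"),
        "top_p": params.get("top_p"),
        "token_count_in": usage["prompt_tokens"],
        "token_count_out": usage["completion_tokens"],
        "latency_ms": latency_ms,
        "status": status,
        "deployment_id": deployment_id,
        "region": region,
        # Hash for correlating with other tiers; the content itself never lands here.
        "request_hash": hashlib.sha256(raw_request.encode()).hexdigest(),
        "ts": time.time(),
    }
```

The `request_hash` is what lets you join a Tier 1 row to a Tier 3 record during an incident window without the metadata stream ever carrying the text.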
Tier 2: Operational logs. Pseudonymised session identifiers (a hash of the user ID, not the user ID), prompt template ID (the version of the template that was used, not the rendered text), the guardrail that triggered, the safety classifier output, the fallback that fired, the human-oversight decision if there was one, the anomaly flag, the cost-budget breach flag. This tier may carry pseudonymised references to people but not the people themselves. The window that fits both regulations is six to twelve months. Six is the AI Act floor for deployers under Article 26. Twelve is a defensible upper bound if your incident-investigation purpose is documented and your DPIA references it.
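One implementation note on "a hash of the user ID": a plain SHA-256 of a known identifier can be reversed by brute-forcing the ID space, so a keyed hash is the safer default. A minimal sketch, assuming a secret pepper held outside the log pipeline (the env var name is illustrative):

```python
import hashlib
import hmac
import os

# Secret key held outside the log pipeline (e.g. KMS-backed). Name is illustrative.
PEPPER = os.environ.get("LOG_PSEUDONYM_KEY", "dev-only-key").encode()

def pseudonymise_user_id(user_id: str) -> str:
    """Keyed hash: stable per user for session correlation,
    not reversible without the key."""
    return hmac.new(PEPPER, user_id.encode(), hashlib.sha256).hexdigest()[:16]
```

The same user always maps to the same token, so incident investigation across sessions still works; rotating the key severs the link when the purpose expires.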
Tier 3: Content logs. The raw prompt, the raw output, the retrieved context for RAG, the user's message, the assistant's reply. This tier is where the personal data lives. If you have to log it at all, redact it before storage and keep it for a window measured in days, not months. Seven to thirty days is the band most legal teams will defend. The retention has to be tied to a specific purpose: hallucination triage, output-quality review, safety incident investigation. Tied to a purpose, automated for deletion, access-controlled at the row level if your observability vendor supports it.
The failure mode I see most often is teams that built tiers in name only. They created three log streams in Datadog, then routed every event to all three because the templating library made it easy. Six months later the "Tier 1 metadata" stream contains the full rendered prompt under a debug.context field that nobody planned for. The discipline is not the diagram. It is the one engineer whose job it is to refuse the field at ingestion.
If Tier 3 has to exist (it usually does), the data has to be redacted before it lands in the log store. After-the-fact redaction is much harder. Once unredacted text is in Elasticsearch or Datadog or your observability vendor's storage, deleting it reliably is a project. Redacting at ingestion is a function call.
Two redaction approaches, used together:
Pattern matching for structured PII. Email addresses, phone numbers, IBANs, credit card numbers, national IDs, IP addresses, postal codes. Regex catches these reliably and fast. The CRAPII benchmark and similar evaluations put pattern-only recall around 0.65 on unstructured text, meaning regex alone misses about a third of the PII in a typical prompt.
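The pattern layer is a few lines. These regexes are illustrative and deliberately loose; production pattern sets (Presidio's, for example) are broader and validated:

```python
import re

# Illustrative patterns only; real pattern libraries are broader and validated.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact_patterns(text: str) -> str:
    """Replace each structured-PII match with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```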
Named-entity recognition for the rest. People's names, organisation names, locations, freeform addresses, timestamps that double as identifiers, anything that depends on context. NER models from the Hugging Face hub or the spaCy ecosystem catch these. They are slower than regex and they produce occasional false positives (a person named "Berlin" is no fun) but they cover what the regex layer cannot reach.
Microsoft Presidio is the open-source library that ties both layers together with a stable API. It runs locally, supports a hybrid pattern-plus-NER pipeline by default, ships GPU-accelerated NER via GLiNER and Stanza in the recent releases, and has batch processing over a REST surface for high-volume pipelines. The hybrid configuration on the CRAPII evaluation reaches an F1 of around 0.94, with precision and recall in the same band. That is not perfect (nothing in this space is) but it is a long way past pattern-only.
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact(text: str, language: str = "en") -> str:
    findings = analyzer.analyze(text=text, language=language)
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

# In the AI client, before the log call:
log_payload = {
    "model": model,
    "prompt_template_id": template.id,
    "redacted_prompt": redact(rendered_prompt),
    "redacted_output": redact(model_response),
    "token_count_in": usage.prompt_tokens,
    "token_count_out": usage.completion_tokens,
}
```
Two things to know about this in production. First, redaction is not free. The NER pass is the slow step and on a CPU it adds tens of milliseconds per request. For high-volume traffic, sample. Run 100% of requests through pattern matching and 5-10% through the full hybrid pipeline. Elastic's reference architecture for PII detection in observability uses roughly that ratio. You get a representative sample for quality monitoring without paying the full cost on every call. Second, redaction has to happen before the log payload leaves your application boundary. If you let the unredacted prompt hit the observability SDK and then try to scrub it on the vendor side, you have already shipped the personal data to the vendor. The scrub is now a deletion request, not a redaction.
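The sampling split can be a cheap, deterministic per-request decision. A sketch, assuming `fast` and `full` are your two redaction pipelines passed in as callables; hashing the request ID means the same request always lands in the same bucket, so retries and replays are consistent:

```python
import hashlib
from typing import Callable

FULL_PIPELINE_RATE = 0.05  # fraction of requests that also get the slow NER pass

def redact_for_logging(text: str, request_id: str,
                       fast: Callable[[str], str],
                       full: Callable[[str], str]) -> str:
    """Every request gets the cheap regex pass; a deterministic hash-based
    sample of request IDs also goes through the full hybrid pipeline."""
    text = fast(text)
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    if bucket < FULL_PIPELINE_RATE * 10_000:
        text = full(text)
    return text
```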
This is the section nobody puts on the planning doc.
The minute you connect an LLM observability tool to an application that processes personal data, that tool is a sub-processor of personal data under GDPR. The DPA needs to exist, the sub-processor needs to be on your processing register, and the data flow needs to fit your transfer-impact framework. That is the legal layer. The harder layer is what the tool actually captures by default in 2026, because the defaults have shifted in the wrong direction.
If send_default_pii=True is set in sentry_sdk.init() (commonly turned on for non-AI debugging because it lets you see request headers and IPs in error events), the Sentry OpenAI integration starts shipping LLM inputs and outputs to Sentry. The opt-out that keeps the rest of the PII context is to add OpenAIIntegration(include_prompts=False) explicitly. Datadog LLM Observability captures inputs and outputs of every instrumented LLM call by default; the Sensitive Data Scanner is the redaction layer and it has to be enabled separately. Langfuse data masking is an Enterprise-tier feature; the open-source self-hosted build does not redact for you. Audit your sentry_sdk.init(), your Datadog LLM Observability configuration, and your Langfuse tier before adding any of these to a system that processes personal data.
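In configuration terms, the Sentry opt-out looks like this. The DSN is a placeholder; send_default_pii, OpenAIIntegration, and include_prompts are the real Sentry SDK names:

```python
import sentry_sdk
from sentry_sdk.integrations.openai import OpenAIIntegration

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    # Keeps request headers and IPs in error events for non-AI debugging...
    send_default_pii=True,
    integrations=[
        # ...while stopping the integration from shipping prompts and outputs.
        OpenAIIntegration(include_prompts=False),
    ],
)
```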
The vendor-by-vendor reality:
Langfuse is open-source under the MIT licence, Berlin-based, and the only major LLM observability vendor that lets you self-host the full stack on your own infrastructure for free. The cloud product has an EU region. The Enterprise tier is where SOC 2 Type II, the BAA for HIPAA, the GDPR DPA, and the data-masking feature live. If you are running the open-source self-hosted build, the masking layer is your responsibility.
Datadog LLM Observability is GA throughout 2025 with native instrumentation for OpenAI, Anthropic, the OpenAI Agents SDK, Pydantic AI, and a growing list of frameworks. Datadog acts as a processor, has a DPA, and supports EU regions. The Sensitive Data Scanner is a separate product that has to be enabled and configured for the LLM Observability traces. Datadog's own documentation has carried language to the effect that the platform is "not generally intended to be used for processing personal data," which is worth reading carefully against your use case.
LangSmith (LangChain) offers managed cloud, bring-your-own-cloud, and self-hosted options. Self-hosting requires an enterprise licence. Their public materials are less granular on the GDPR specifics than Langfuse's; budget time for the procurement conversation if you go this route.
Helicone, Phoenix (Arize), and the OpenTelemetry GenAI semantic conventions sit alongside the bigger vendors. The semantic-convention layer is interesting in its own right because it standardises which span attributes carry prompts and completions (gen_ai.prompt, gen_ai.completion), which means the redaction question moves from "configure each vendor differently" to "configure your OTel exporter once." It is not a complete answer yet but it is heading the right way.
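Because the conventions standardise the attribute names, the redaction can be one function over span attributes rather than per-vendor configuration. A minimal sketch over a plain attribute dict; in a real pipeline this logic would live in a SpanProcessor or a wrapping exporter so it runs before the span leaves the process, and `redact` stands in for whatever redaction function you use:

```python
from typing import Callable

# Attribute prefixes that carry content under the GenAI semantic conventions.
CONTENT_PREFIXES = ("gen_ai.prompt", "gen_ai.completion")

def scrub_genai_attributes(attributes: dict,
                           redact: Callable[[str], str]) -> dict:
    """Return a copy of span attributes with content attributes redacted;
    metadata attributes (token counts, model name) pass through untouched."""
    return {
        key: redact(value)
        if isinstance(value, str) and key.startswith(CONTENT_PREFIXES)
        else value
        for key, value in attributes.items()
    }
```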
The minimum-viable vendor checklist for any of them:

- A signed DPA, with the vendor on your sub-processor register and inside your transfer-impact framework.
- A documented answer to what the tool captures by default, LLM inputs and outputs included.
- A redaction or masking layer you can actually enable on your tier, configured before production traffic flows.
- An EU region or a self-hosting path if data residency matters to you.
- Retention controls you can set per log stream, not one global dial.
A retention window is not a number you pick at the kickoff and forget. It is a defended choice with a documented purpose, an automated deletion job, and a place in your Record of Processing Activities under Article 30 GDPR.
A starting set:
| Log tier | Retention | Justification |
|---|---|---|
| Tier 3 (content, redacted) | 7-30 days | Hallucination triage, output quality review |
| Tier 2 (operational, pseudonymised) | 6-12 months | AI Act Article 26 deployer floor, incident investigation |
| Tier 1 (technical metadata) | 1-3 years | Performance analysis, model version comparison |
| Article 11 high-risk technical documentation | 10 years | EU AI Act provider obligation |
Two things make these defensible.
The first is automation. "We delete logs when they age out" is not a control unless the deletion job exists, runs on a schedule, and writes its own audit trail. Datadog index TTLs, Elasticsearch ILM policies, S3 lifecycle rules, BigQuery partition expiration, and Postgres pg_partman policies all work for the operational tier. The CNIL Free Mobile decision turned partly on the absence of a working deletion process, not on the retention period itself. The fine was not because the company kept former-subscriber records too long. It was because nobody had built the mechanism that would have deleted them on time.
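What "the deletion job exists and writes its own audit trail" means in the smallest possible form, sketched against SQLite as a stand-in for whatever store holds your Tier 3 content (table and column names are illustrative):

```python
import sqlite3
import time

RETENTION_SECONDS = 30 * 24 * 3600  # Tier 3 upper bound: 30 days

def run_deletion_job(conn: sqlite3.Connection) -> int:
    """Delete Tier 3 rows past retention and record the run in an audit table."""
    cutoff = time.time() - RETENTION_SECONDS
    cur = conn.execute(
        "DELETE FROM tier3_content_logs WHERE created_at < ?", (cutoff,))
    deleted = cur.rowcount
    # The audit row is the evidence that the control ran, not just existed.
    conn.execute(
        "INSERT INTO deletion_audit (ran_at, cutoff, rows_deleted) VALUES (?, ?, ?)",
        (time.time(), cutoff, deleted))
    conn.commit()
    return deleted
```

In production the equivalent is usually a lifecycle policy plus a scheduled check that the policy actually fired; the audit table is the part teams forget.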
The second is purpose specificity. "Necessary for AI debugging" is too vague. The CNIL and the EDPB CEF 2025 enforcement programme both want to see retention tied to a named purpose with a named end condition. "Hallucination triage on customer support outputs, retained until the next quarterly model evaluation, then deleted" is the shape that survives review.
This is the part most teams gloss and most regulators care about.
Article 12 EU AI Act sits in the provider chapter. It says the provider of a high-risk AI system must design the system so it automatically generates logs over its lifecycle. The duty is on the entity that builds and places the system on the market. The logs themselves are an architectural feature of the product.
Article 26 sits in the deployer chapter. It says the deployer of that high-risk AI system must keep the automatically generated logs (to the extent they are under the deployer's control) for at least six months, and must monitor the system in operation. The duty is on the entity that puts the system to use in a real workflow.
The practical consequence: when you sit down to design the logging stack, write the obligation you are designing for at the top of the page. If the answer is "we do not know," resolve that before you write the schema. The schema you build for a deployer monitoring a third-party model is not the same shape as the schema you build for a provider proving system safety to a notified body.
There is one more wrinkle worth knowing. Article 12 requires logs that allow identification of risk situations and post-market monitoring. The AI Office and the European Commission have not yet published the implementing acts that will detail exactly which fields belong in those logs (the implementing acts are expected through 2026 and 2027). The current frame is "everything reasonably necessary," which is the kind of phrase that does the regulator's work and makes engineering planning hard. I think the safe move is to design the schema for the use cases that are clearly required (model version, input class, output class, decision, override) and treat the rest as forward-compatible columns you can fill once the implementing acts arrive. (Caveat: the implementing-acts timeline is not on a clean schedule, so build the column-add path before you need it.)
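The "clearly required fields plus forward-compatible columns" idea can be as simple as one schema with an extensible slot. A sketch; the field names are my reading of the clearly-required set named above, and `extras` is the column-add path for whatever the implementing acts eventually mandate:

```python
from dataclasses import dataclass, field

@dataclass
class HighRiskLogEvent:
    """Fields treated as clearly required today; `extras` absorbs
    future mandated fields without a schema migration."""
    model_version: str
    input_class: str       # coarse input category, never raw content
    output_class: str
    decision: str
    human_override: bool
    extras: dict = field(default_factory=dict)
```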
Block out a half-day this sprint and answer five questions about your current AI logging setup, in writing, in the engineering handbook page nobody has written yet:

1. For each AI system you run, are you the provider or the deployer, and which logging obligation (Article 12 or Article 26) are you designing for?
2. Which tier does each field you currently log belong to: technical metadata, operational, or content?
3. Where in the request path does redaction run, and does any unredacted prompt ever leave your application boundary?
4. What is the retention window for each stream, and does the deletion job that enforces it exist, run on a schedule, and write an audit trail?
5. Which observability vendors receive your AI telemetry, what do they capture by default, and is each one on your sub-processor register with a DPA?
If you cannot answer any of these in five minutes, that is your audit. The fix is not a tool purchase. It is a documented architecture, a redaction layer at the right point in the request path, and the discipline to keep Tier 3 small. The audit-by-grep article in this set is the natural starting point if you also want to know where the personal data is entering your codebase to begin with.
Related in this set:

- A practical, surface-by-surface audit recipe for finding personal data flowing to AI services. Covers prompt templates, observability defaults, embedding pipelines, and the limits of audit-by-grep in agent mode.
- The April 2026 trilogue reshaped the deadline. What binds you regardless, what the Omnibus will probably move, and the deployer obligations most dev teams underestimate.
- An operational guide for AI data leaks. GDPR Article 33 timing, containment, evidence preservation, notification templates, three worked incident walkthroughs, and the regulator differences that catch teams off guard.
AI Data Flow Checker
Map how personal data flows through your AI integrations and spot the privacy risks before they spot you.