A 2026 decision framework for dev teams choosing between self-hosting an open-weight LLM and calling a cloud API. Refreshed with Llama 4, the Latombe DPF challenge, and Azure / Bedrock EU data zones.
The framing "self-host vs cloud API" is a 2023 question. In 2026 the actual decision is about how much regulatory exposure your team can absorb for the next eighteen months while the EU-US transfer regime is being litigated, what your real token volume is, and whether the use case lives in the band where Llama 4 or Qwen 3 already meet your quality bar. The technical picture changed so fast between 2024 and 2026 that the cost and quality numbers most teams remember are wrong by an order of magnitude.
This page is a decision framework. The comparison table is the summary, the five factors below are the structured weighing exercise, and the three "when to choose" sections are the clean output. Skip to whichever block the team needs.
| | Self-hosted open-weight LLM | Cloud API |
|---|---|---|
| Data control | Full. Inference data never leaves your infrastructure | Partial. Each prompt is sent to the provider for processing |
| Training risk | Zero. You control the model weights | Provider terms. Team / Business / Enterprise / API tiers are no-train by default in 2026 |
| GDPR processor relationship | None for inference (still applies to your own infra) | Yes. DPA, transfer mechanism, legal basis review, sub-processor monitoring |
| Cost (under 2M tokens/day) | High. GPU hardware or rental is mostly fixed | Low. Pay-per-token wins clearly |
| Cost (over 2M tokens/day) | Lower per token; 3–5 month break-even on owned hardware | Linear scaling; expensive at high volume |
| Model quality (most tasks) | Llama 4 Scout, Llama 4 Maverick, Qwen 3, DeepSeek V3 close enough | Best-in-class on long-horizon reasoning and multimodal |
| Time to first prototype | Days to weeks (Ollama / vLLM / BentoML stack) | Hours |
| Maintenance | Yours: model updates, GPU patching, observability | Provider's |
| Regulatory exposure to DPF / SCCs | None | Yes; depends on Latombe outcome at the CJEU |
| Best for | Sensitive data, regulated industries, predictable high volume | Rapid prototyping, multimodal, low-volume features |
This is the load-bearing factor. With a cloud API, your inference data travels to the provider's infrastructure. OpenAI's commercial endpoint is US-based by default. Anthropic processes in the US, with a significant Google Cloud TPU expansion announced in October 2025 that elevated Google to a more prominent sub-processor inside the Anthropic stack. Google Vertex AI offers EU regions for Gemini.
Under GDPR, sending personal data to a US-based processor is legal if you have a valid transfer mechanism. The EU-US Data Privacy Framework currently covers this for self-certified US companies. Standard Contractual Clauses are the fallback. The two mechanisms can be layered, and most privacy lawyers I have read recommend wiring SCCs alongside DPF certification as belt-and-suspenders.
The Anthropic / Google Cloud TPU expansion is an example of how cloud-API sub-processor relationships actually evolve. A team that chose Anthropic in 2024 because "Anthropic processes only on AWS" had a quietly different sub-processor stack by Q4 2025, with no contract renegotiation triggered. The DPA you signed lists sub-processor change notification, not sub-processor stability. See the AI sub-processor cascade for the full mechanics.
With a self-hosted open-weight model, the inference path stays inside your infrastructure. There is no third-party processor for the inference step, no DPA to negotiate for it, no transfer mechanism question, no sub-processor list to track. Your data protection obligations on your own systems do not change. The surface area gets smaller, not zero.
This used to be a clear cloud-API win. The gap has closed faster than most teams expect. BentoML's frontier model tracking puts open-weight models trailing the proprietary frontier by about three months on average in 2026, and the use cases where that three-month gap actually matters keep shrinking.
Where open-weight models in 2026 reach the quality bar:

- Classification, summarisation, and structured extraction
- RAG over your own documents
- Code assistance against codebases you control
- Drafting and editing tasks with a human in the loop

Where cloud APIs still dominate:

- Long-horizon reasoning across many steps
- Multimodal input and output
- Teams whose product genuinely depends on the three-month frontier window
I think the model-quality gap is no longer load-bearing for most production tasks. For a team running classification or RAG, picking GPT-4.1 or Claude 4.6 over Llama 4 Scout is a habit, not a quality decision. The honest test is to run the same fifty representative inputs through both and look at the outputs side by side. Most teams that do this end up surprised.
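The fifty-input test above can be sketched as a small harness. Since vLLM and Ollama both expose an OpenAI-compatible `/v1/chat/completions` route, one function can query either side; the endpoint URLs, model names, and key handling below are placeholders, not a recommended setup.

```python
import json
import urllib.request

# Hypothetical endpoints and model names -- replace with your own.
BACKENDS = {
    "cloud": {"url": "https://api.example.com/v1/chat/completions",
              "model": "gpt-4.1-mini", "key": "CLOUD_API_KEY"},
    "local": {"url": "http://localhost:8000/v1/chat/completions",
              "model": "llama-4-scout", "key": "unused"},
}

def complete(backend: dict, prompt: str) -> str:
    """POST one prompt to an OpenAI-compatible chat endpoint."""
    payload = json.dumps({
        "model": backend["model"],
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # keep outputs comparable across runs
    }).encode()
    req = urllib.request.Request(
        backend["url"], data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {backend['key']}"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

def side_by_side(prompts: list[str]) -> list[dict]:
    """Run the same inputs through both backends for manual review."""
    return [{"prompt": p,
             "cloud": complete(BACKENDS["cloud"], p),
             "local": complete(BACKENDS["local"], p)}
            for p in prompts]
```

Dump the result of `side_by_side` on your fifty representative inputs to JSON and read it with the team; the point is the blind comparison, not the tooling.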
The math depends on your token volume. The break-even point as of April 2026 lands around two million tokens per day on owned hardware, by the cost-modelling I have seen published in recent dev community pieces.
Under 2M tokens/day. Cloud APIs win on price even before counting infrastructure overhead. GPT-4.1 mini sits around $0.40 per million input tokens / $1.60 output. Claude Haiku 4.5 is roughly $1.00 / $5.00. Self-hosting requires GPU infrastructure that costs hundreds to low thousands per month even when idle, plus the operator time.
2M to 50M tokens/day. Mixed. Self-hosted costs are mostly fixed (hardware or GPU rental); cloud API costs scale linearly. The break-even point depends on which model you run, your power costs, and whether you can amortise the GPU across other workloads. Self-hosting an 8B model on a single RTX 4090 amortises to roughly $0.009 per million tokens at 24/7 utilisation, one to two orders of magnitude cheaper than the cloud equivalent.
Over 50M tokens/day. Self-hosting wins on per-token cost, generally by an order of magnitude. The constraint becomes whether the team can keep an inference cluster running.
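The break-even arithmetic is worth doing with your own numbers rather than remembering a headline figure. A minimal sketch, using the GPT-4.1 mini prices quoted above as defaults; the output-token fraction and the $900/month fixed self-host figure are assumptions to replace with your own.

```python
def monthly_cloud_cost(tokens_per_day: float,
                       in_price: float = 0.40,   # $/1M input (GPT-4.1 mini)
                       out_price: float = 1.60,  # $/1M output
                       out_frac: float = 0.25) -> float:
    """Linear cloud bill for a 30-day month; out_frac is an assumption."""
    blended = (1 - out_frac) * in_price + out_frac * out_price
    return tokens_per_day * 30 / 1e6 * blended

def monthly_selfhost_cost(fixed: float = 900.0) -> float:
    """Mostly-fixed self-host bill: assumed GPU amortisation plus power."""
    return fixed

def break_even_tokens_per_day(fixed: float = 900.0, **prices) -> float:
    """Volume where the linear cloud bill crosses the fixed self-host bill."""
    return fixed / monthly_cloud_cost(1, **prices)

print(monthly_cloud_cost(2_000_000))   # ~$42/month: under 2M/day the API wins
print(break_even_tokens_per_day())     # ~43M/day against mini-tier prices
```

Note what the output shows: against the cheapest mini-tier API prices, break-even sits far above 2M tokens/day, while against frontier-tier prices it lands much lower. That spread is exactly why the 2M-50M band is "mixed" and why the comparison model matters as much as the volume.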
The hidden costs of self-hosting that the headline math omits:

- Operator time: model updates, GPU driver and security patching, observability
- Capacity planning and on-call for the inference cluster
- Idle GPU spend whenever traffic runs below provisioned capacity
- Hiring or retaining the MLOps expertise to keep all of the above running
For a five-to-twenty person team, the MLOps overhead alone often makes self-hosting impractical unless someone on the team already has the expertise. The salary line of one specialist easily exceeds the cloud API bill.
Cloud API compliance work for an EU team is not theoretical. Each provider needs a Data Processing Agreement under Article 28. Transfers outside the EEA need a valid mechanism: DPF certification for the provider plus, almost always, SCCs as a belt-and-suspenders backstop. Privacy notices need to mention the provider as a third-party processor under Articles 13–14, and after the CJEU ruling in Dun & Bradstreet (C-203/22) the boilerplate "we use machine learning to personalise your experience" is no longer enough. The DPIA decision usually lands on the side of "yes, do one." You then have ongoing monitoring obligations every time the provider updates terms or shifts sub-processors.
The MEP Philippe Latombe challenge to the EU-US Data Privacy Framework reached the General Court in 2025 and is currently on appeal at the CJEU. I am not sure how it lands. The Schrems II framing cuts both ways: the Court may accept the EU Commission's position that the new framework's Data Protection Review Court closes the gaps Schrems II identified, or it may decide it does not. If the framework is invalidated or narrowed, the compliance cost of every US-based cloud API stack flips overnight for any team that has not already wired SCCs alongside DPF. The cheap insurance is to ship SCCs as the primary mechanism today, treat DPF as a belt-and-suspenders extra, and re-read your contracts at the next renewal. See the 2026 state of EU-US AI transfers for the full picture.
Self-hosting an open-weight model removes the entire transfer-mechanism layer. The compliance work that remains is the work you would do for any service running on your own infrastructure: access control, encryption at rest and in transit, retention, the same DPIA you would have done anyway under Article 35. For teams in regulated sectors (healthcare, finance, legal, anything with sectoral data residency rules), that simplification is often worth the entire infrastructure cost on its own.
Cloud APIs add network latency that you cannot tune. For real-time applications (inline code completion, voice agents, customer-facing chat) the round trip adds 200–500ms minimum, more for complex requests. Self-hosted models on local GPUs return first tokens in tens of milliseconds and scale predictably for batch workloads.
Cloud APIs also have rate limits and outages that become your outages. If your product depends on AI inference for a core feature, a third-party incident is on your status page. Self-hosting moves that uptime responsibility to you, with the trade-off that the constraint is now physical.
The under-discussed factor is upgrade control. With a cloud API, the model behind the same endpoint changes underneath you. Quality regressions, retraining shifts, and silent default changes are the norm. With a self-hosted model, you choose when to upgrade and you can A/B the new version against the old. For teams whose evals matter (regulated outputs, anything graded against historical benchmarks) the ability to pin a model version is itself a feature.
I think the hybrid pattern is the right answer for almost every team I have read about that ships AI features for EU customers in 2026, and the framing of "pick one" is itself the mistake. Four concrete patterns are worth knowing.
Routing by data sensitivity. Cloud API for non-sensitive work (drafting copy, summarising public documents, code assistance against open-source repositories). Self-hosted model for anything touching personal data, internal source code, or contractual material. The router lives in your application layer and the rules belong in your AI acceptable use policy.
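The router really is this small. A minimal sketch of the application-layer decision; the detection rules here are illustrative placeholders, and a production version would enforce whatever your acceptable use policy actually defines as sensitive.

```python
import re

# Illustrative detection rules -- a real policy comes from your AI
# acceptable use policy, not from three regexes.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),              # email address
    re.compile(r"\b\d{2}[./-]\d{2}[./-]\d{4}\b"),        # date of birth
    re.compile(r"\bIBAN\s+[A-Z]{2}\d{2}[A-Z0-9]{4,}\b"), # bank account
]

def is_sensitive(prompt: str) -> bool:
    return any(p.search(prompt) for p in PII_PATTERNS)

def route(prompt: str) -> str:
    """Pick the backend: personal data stays on the self-hosted path."""
    return "self_hosted" if is_sensitive(prompt) else "cloud_api"

route("Summarise this public press release")             # -> "cloud_api"
route("Email jane.doe@example.com about her invoice")    # -> "self_hosted"
```

The useful design choice is that the router fails closed: anything that trips a rule goes to the self-hosted path, so a false positive costs quality, not compliance.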
An anonymisation pipeline is the cheapest middle ground for use cases where the cloud-API quality matters but the inputs contain personal data. Strip names, emails, account numbers, and location specifics before the prompt leaves your system; replace with deterministic tokens; call the cloud API; reinsert the personal data locally on the response. Microsoft Presidio, Nightfall, and Cyberhaven each ship the redaction primitive. The pattern works for summarisation, classification, and structured extraction. It does not work for anything that needs the actual personal data to make sense (personalised replies, account-specific reasoning).
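A toy version of the deterministic-token pattern, handling only email addresses; Presidio and the other tools above ship far broader entity coverage, so treat this as an illustration of the redact-call-reinsert shape rather than a redaction engine.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def pseudonymise(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct email with a deterministic token before the
    prompt leaves your system; return the redacted text and a reverse map."""
    mapping: dict[str, str] = {}
    def sub(match: re.Match) -> str:
        value = match.group(0)
        if value not in mapping:                 # same value, same token
            mapping[value] = f"<EMAIL_{len(mapping)}>"
        return mapping[value]
    redacted = EMAIL.sub(sub, text)
    return redacted, {tok: val for val, tok in mapping.items()}

def reinsert(text: str, reverse: dict[str, str]) -> str:
    """Restore the original values in the model's response, locally."""
    for token, value in reverse.items():
        text = text.replace(token, value)
    return text

red, rev = pseudonymise("Contact jane@acme.com and bob@acme.com")
# red == "Contact <EMAIL_0> and <EMAIL_1>"; the cloud API sees only red,
# and reinsert(response, rev) restores the addresses on your side.
```

Deterministic tokens matter because the model can still reason about identity ("the first sender also appears in paragraph three") without ever seeing the identity itself.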
EU-region managed deployments. Azure OpenAI Data Zones (launched Q4 2024) keep inference inside the EEA across all Azure OpenAI EU regions for any deployment labelled "DataZone". AWS Bedrock has EU residency for the frontier models it hosts. Google Vertex AI offers EU regions for Gemini. None of these eliminate the third-party processor relationship, but each removes the transatlantic transfer question, which is the hardest part of the compliance math. The trade-off is that the latest models often arrive in EU regions weeks or months after the US ones.
Managed private deployments. Azure OpenAI's data zones at the higher tier and AWS Bedrock's dedicated capacity offer model instances inside your own cloud tenancy. The provider runs the model; the data does not leave your account. The cost is between shared API and fully self-managed; the operational simplicity is closer to the API end. For teams that need the privacy posture of self-hosting without the MLOps capability, this is usually the cleanest answer.
Pick the architecture that matches your actual regulatory exposure and your actual token volume, not the theoretical maximum risk and the theoretical quality bar. For most EU dev teams in 2026 that means routing by data sensitivity, with EU-region managed deployments for the cloud half and an anonymisation pipeline for the in-between. The single biggest move is to wire SCCs as the primary transfer mechanism today, before the Latombe outcome forces you to do it under time pressure.
A code-review walk through the seven things a senior reviewer should ask before an AI API integration ships, with the EU regulatory anchors that make each one load-bearing in 2026.
What changed for the three providers in 2025-2026: Anthropic's August 2025 consumer shift, the October 2025 Google TPU sub-processor expansion, the Court of Rome OpenAI annulment, and the Latombe DPF appeal pending at the CJEU.
Section 702 sunsets April 20. The April 2026 state of EU-US AI transfers, what the DPF actually rests on, and the contract review you should do this week.