Build vs buy is not one decision. It is five — one for each layer of the AI stack. The five layers, the question that decides each, and the AI Act trap that catches teams who build the wrong layer.
The reason "build vs buy" is the wrong frame for AI tooling in 2026 is that the AI stack has at least five layers and each one can be built or bought independently. A team can buy the foundation model from Anthropic, run inference on their own VPC, build the application layer themselves, buy a managed RAG vendor for the data pipeline, and rely on Microsoft's DPA for the compliance layer. That is one team. The decision to "build" or "buy" was made five different times for five different layers, and the answers were not the same.
The framing this article uses is layered ownership. For each layer of the stack, what question decides whether you build or buy that layer? The questions differ by layer, and so do the trade-offs each answer creates. The article walks the five layers, names the decisive question for each, and ends with the hybrid router pattern that most teams converge on once they have made the five decisions deliberately rather than by default.
The model is the layer most discussions of build-vs-buy actually mean. It is also the layer where the build-vs-buy spectrum has the most options.
The four real choices in 2026 are: a frontier commercial model via API (GPT-5, Claude Opus 4.5, Gemini 3), an open-weight model self-hosted on your own infrastructure (Llama 4, DeepSeek V3.2, Qwen 3, Mistral Medium 3.1), an open-weight model hosted by a third party that is not the model creator (Together, Replicate, Fireworks, Bedrock, Vertex), or a custom model trained or heavily fine-tuned from scratch (rare; specific high-volume use cases).
The decisive question for Layer 1 is what gap, if any, exists between the open-source state of the art and the frontier you actually need. As of April 2026, the state of the open-source frontier is genuinely close to the commercial frontier on most business workloads. Llama 4 (released by Meta on 5 April 2025), DeepSeek V3.2, and Qwen 3 are competitive on summarisation, classification, Q&A, translation, and standard code generation. Mistral Medium 3.1 claims approximately 90% of Claude Sonnet 3.7 quality at one-eighth the cost. Where the gap remains visible is in complex multi-step reasoning, frontier coding tasks (SWE-bench), agentic workflows with tool use, and very long context analysis. Buying the frontier still wins for the use cases that need it. For everything else, the open-source option is technically sufficient.
The cost crossover point matters less than most write-ups suggest. Below roughly 3 to 5 billion tokens per day (the breakeven for self-hosting an 8×H100 node running DeepSeek V3.2 or Qwen 3-32B), commercial API inference is cheaper per token than running your own GPUs at full utilisation. Above that volume, self-hosting wins on raw inference cost. But "raw inference cost" is one line in a larger total. Self-hosting adds 10-20 hours per month of engineering maintenance (model updates, GPU driver issues, eval pipelines, monitoring), and for a team under 10 engineers that is the line that decides the question regardless of the per-token math.
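The breakeven arithmetic above can be sketched in a few lines. Every number here (node price, throughput, utilisation, API price) is an illustrative assumption, not a quote; plug in your own figures.

```python
def self_host_cost_per_million_tokens(
    node_cost_per_hour: float,   # e.g. ~$60/h for a rented 8xH100 node (assumption)
    tokens_per_second: float,    # aggregate throughput at your batch sizes (assumption)
    utilisation: float,          # fraction of the day the node serves real traffic
) -> float:
    """Raw inference cost only; excludes the 10-20 h/month of engineering upkeep."""
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return node_cost_per_hour / tokens_per_hour * 1_000_000

# Illustrative comparison against a commercial API price (all numbers assumptions):
api_price = 0.50  # $ per million tokens, blended input/output
self_host = self_host_cost_per_million_tokens(60.0, 10_000, 0.7)
# Self-hosting only beats the API once throughput and utilisation push
# self_host below api_price -- which is exactly the volume question.
```

At these assumed numbers self-hosting is several times more expensive per token than the API, which is why the crossover only arrives at very high daily volume and high utilisation.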
The check that fits Layer 1: measure the gap between the frontier model you would buy and the open-source model you would self-host on the specific tasks your product runs. If the gap is zero on those tasks, the open-source option is on the table. If the gap is real, you are buying.
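That check can be run as a tiny eval harness: call both models on tasks sampled from production traffic and measure the mean score difference. The stub models and exact-match scorer below are placeholders; substitute your real model calls and whatever metric fits the task.

```python
from typing import Callable

def model_gap(tasks: list[dict],
              frontier: Callable[[str], str],
              open_source: Callable[[str], str],
              score: Callable[[str, str], float]) -> float:
    """Mean score difference (frontier minus open-source) on your own tasks.

    `tasks` are {"prompt": ..., "reference": ...} pairs drawn from production
    traffic; `score` is whatever metric fits (exact match, rubric, LLM judge).
    """
    diffs = []
    for t in tasks:
        f = score(frontier(t["prompt"]), t["reference"])
        o = score(open_source(t["prompt"]), t["reference"])
        diffs.append(f - o)
    return sum(diffs) / len(diffs)

# Toy usage with stub models and an exact-match scorer (placeholders):
exact = lambda out, ref: 1.0 if out.strip() == ref.strip() else 0.0
tasks = [{"prompt": "2+2?", "reference": "4"}]
gap = model_gap(tasks, frontier=lambda p: "4", open_source=lambda p: "4", score=exact)
# gap == 0.0 here: on these tasks the open-source stub is sufficient.
```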
The serving layer is where the model actually runs. It is also the layer that gets conflated with Layer 1 most often.
The four serving options in 2026 are: the model creator's own hosted API (OpenAI, Anthropic, Google directly), a hyperscaler-hosted version of the model (Azure OpenAI, AWS Bedrock, GCP Vertex AI), a specialist GPU provider (Together, Replicate, Fireworks, Spheron, RunPod), or your own GPUs in a colocation or cloud-hosted VPC (a self-managed cluster of H100s or successors).
The decisive question for Layer 2 is what controllership posture you need on the data flowing through inference. The four options give you four different postures. Calling Anthropic's API directly puts Anthropic in the processor seat and creates a controller-processor-sub-processor chain whose lower layers (Google Cloud TPUs since the 23 October 2025 expansion, AWS Bedrock as another deployment path) are not under your control. Calling Anthropic via AWS Bedrock or Azure OpenAI reroutes the controllership chain through your existing hyperscaler DPA: sometimes simpler, sometimes the same complexity in a different shape. Calling a specialist GPU provider or running your own VPC removes the model creator from the data path entirely; the only external party that sees the prompts is the infrastructure provider you pay for the GPUs.
For most teams, the cheapest correct answer at Layer 2 is to keep the model creator in the inference path (use Anthropic's API directly, or use Azure OpenAI for OpenAI models) and accept the controller-processor chain. The hyperscaler-hosted route is the right answer for teams that already have an enterprise relationship with that hyperscaler and want one DPA covering both inference and the rest of the stack. The self-managed VPC is the right answer when the data is sensitive enough that no external party can be in the inference path at all, and when the team has the operational maturity to run a GPU cluster without the failure modes (driver issues, model update churn, eval-pipeline maintenance) eating engineering time.
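The heuristic in the paragraph above reduces to a small decision function. This is a sketch of the stated rule with hypothetical parameter names, not architectural or legal advice:

```python
def serving_posture(sensitive_data: bool,
                    has_hyperscaler_dpa: bool,
                    can_run_gpu_cluster: bool) -> str:
    """Layer 2 heuristic as stated in the text (a sketch, not advice)."""
    if sensitive_data:
        if can_run_gpu_cluster:
            return "self-managed VPC"       # no external party in the inference path
        return "specialist GPU provider"    # model creator out of the data path
    if has_hyperscaler_dpa:
        return "hyperscaler-hosted"         # one DPA covers inference and the stack
    return "model creator API"              # cheapest correct default
```

The ordering matters: data sensitivity dominates commercial convenience, and operational maturity decides how far down the self-hosting path a sensitive workload can go.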
EU residency matters here separately. OpenAI launched in-region GPU inference on 16 January 2026, which means EU API calls now stay on EU GPUs by default for the first time. Anthropic models run on US infrastructure unless invoked via AWS Bedrock in the Frankfurt or Paris regions, with the caveat that the 23 October 2025 Anthropic-Google Cloud TPU expansion may shift the underlying compute geography. Self-hosting in an EU region (Nebius in Finland or Paris, OVHcloud in France, DataCrunch in the Netherlands) is the route that gives the simplest residency story without depending on a hyperscaler's residency commitments holding through the next sub-processor change.
The application layer is what the user actually touches. The build-vs-buy spectrum here is the widest of all five layers.
The four real options at Layer 3 are: a fully off-the-shelf SaaS product where the AI feature is bundled in (Notion AI, ChatGPT Enterprise, GitHub Copilot, Microsoft 365 Copilot), a vendor product with a customisation surface (a Copilot Studio agent, a Glean enterprise search instance, a Vectara RAG product), a custom application built on top of a bought model and bought serving (the standard "we built our own UI on top of OpenAI's API" pattern), or a custom application on top of a self-hosted model (the same pattern with the GPU question moved into Layer 2).
The decisive question for Layer 3 is whether the application is differentiating or undifferentiated. The hatchworks.com framing of "buy the heavy core, build what differentiates" is honest. If the application is "we want a chat interface for our customer support team", that is undifferentiated and there are 30 vendors who already built it. Buy. If the application is "we want a structured-output generation pipeline that is unique to how our domain works", that is differentiating and there is no off-the-shelf product. Build.
The trap at Layer 3 is buying the demo and underestimating the customisation work. Most off-the-shelf SaaS-AI products demo well on a clean dataset and degrade badly when connected to a real production data lake with the real permission structure, the real schema drift, and the real "this column is sometimes a JSON blob" data quality. The full TCO of "buy the SaaS product" includes the integration work, the prompt and behaviour customisation, the eval and regression infrastructure, and the operational support, and that work is rarely 10% of the cost of the licence. For teams that under-budget the integration work, "buy" turns out to be more expensive than "build" ever would have been.
The check that fits Layer 3: is the application layer something your competitors could build identically on top of the same model? If yes, buy it. If no, the differentiation lives at this layer and you should build it (probably on top of a bought model and bought serving).
The data pipeline is the layer where the model meets your data. RAG indices, embedding pipelines, chunking strategies, retrieval logic, observability for the query → context → response loop. This layer is where most AI projects fail in production, and it is also the layer where build-vs-buy is the least obvious.
The three real options are: a fully managed RAG vendor (Vectara, Pinecone Assistant, Weaviate Cloud, Glean), a self-hosted vector store paired with a custom RAG pipeline (Qdrant, Pinecone Standard, pgvector + your own retrieval logic), or a custom data pipeline built end-to-end on raw infrastructure (your own embedding model, your own chunking, your own retrieval).
The decisive question for Layer 4 is how much of the data quality problem is general and how much is domain-specific. General-purpose RAG (chunking text into 500-token windows, embedding with a generic model, doing top-k retrieval) is undifferentiated and there are good vendors. Domain-specific RAG (where the chunking is structured by your domain ontology, the embeddings are fine-tuned on your data, the retrieval is graph-walked rather than vector-only) is differentiating and the vendors will not build it for you. Most teams under-build the data layer because the demos make it look easy and over-build the model layer because the model conversations are more interesting.
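The undifferentiated core is small enough to show in full: generic top-k vector retrieval is a few lines, which is why vendors can sell it cheaply. Everything domain-specific lives in how the index is built and walked, which this sketch deliberately leaves out.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float],
          index: list[tuple[str, list[float]]],
          k: int = 3) -> list[str]:
    """Generic top-k retrieval: the part every vendor already sells."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in scored[:k]]
```

If your retrieval quality comes from logic this generic, buy it; if it comes from ontology-aware chunking, fine-tuned embeddings, or graph-walked retrieval, that logic is the moat.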
The compliance angle on Layer 4 is the EDPB Opinion 28/2024 anonymity test. Embeddings derived from personal data are rarely anonymous and almost always inherit the original record's controller obligations through the entire RAG chain. Buying a managed RAG vendor adds them to your sub-processor list; building a self-hosted vector store keeps the controllership chain shorter but moves the operational burden onto your team. The cross-link to read alongside this layer is the article on whether vector embeddings are personal data under GDPR. The short answer is "almost always yes, treat them like the source text".
The check that fits Layer 4: is the chunking, embedding, and retrieval logic the part that makes your AI feature actually work better than a generic competitor's? If yes, that work is the moat and you should build it. If no, buy a vendor and put the engineering hours into the application layer above it instead.
The compliance layer is the one most build-vs-buy articles do not mention, and it is the one with the most expensive trap.
Under the EU AI Act Article 25, a deployer of an AI system can be reclassified as a provider if they (a) put their name or trademark on a high-risk AI system already on the market, (b) make a "substantial modification" to a high-risk AI system, or (c) modify the intended purpose of a non-high-risk system in such a way that it becomes high-risk. The definition of "substantial modification" in Article 3 is any change after the system was placed on the market that was not foreseen in the original conformity assessment and that affects compliance with Chapter III Section 2 obligations or modifies the intended purpose for which the system was assessed.
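The three triggers can be encoded as a small predicate; any one of them flips the classification. This is a sketch of the rule's shape only, not legal advice:

```python
def becomes_provider(puts_own_brand_on_high_risk: bool,
                     substantial_modification_of_high_risk: bool,
                     repurposes_system_into_high_risk: bool) -> bool:
    """The three Article 25 triggers from the text.

    Any single trigger reclassifies a deployer as a provider; there is no
    combination logic to exploit and no partial classification.
    """
    return (puts_own_brand_on_high_risk
            or substantial_modification_of_high_risk
            or repurposes_system_into_high_risk)
```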
The provider/deployer distinction is the difference between conformity assessment, technical documentation, post-market monitoring, registration in the EU database, and quality management systems on one side (provider) and DPIA-plus-Article-26-deployer-obligations on the other (deployer). The cost differential is roughly an order of magnitude. The August 2026 enforcement date applies to both, but the surface area for a provider is dramatically larger.
For Layer 5, the decisive question is: does any combination of the four layers above turn you into a provider? The trap is that the move from deployer to provider happens by accident. Three accidental paths recur: white-labelling, where the team puts its own name or trademark on a bought high-risk system; modification, where fine-tuning or pipeline changes go beyond what the original conformity assessment foresaw; and repurposing, where a general-purpose system is pointed at a use case that makes it high-risk.
The compliance layer is the only layer where building costs you more than buying in almost every scenario. The provider obligations are large, fixed, and only worth paying if the system is genuinely a strategic asset. For most teams, the right answer at Layer 5 is to buy a system whose original provider has done the conformity assessment, accept the deployer obligations, and avoid the modifications and rebranding moves that flip the classification.
The hybrid pattern is what teams converge on once they have made the five decisions independently. It is rarely "all build" or "all buy". The most common production shape in 2026 is: buy the model (Layer 1) from a frontier provider for the high-quality tasks, buy the serving (Layer 2) through a hyperscaler relationship the team already has, build the application (Layer 3) on top, buy a managed RAG vendor (Layer 4) for the general data pipeline and build a thin domain-specific layer above it, and buy a deployer-classified system (Layer 5) so the Article 25 reclassification trap stays closed.
Two routing patterns are worth knowing.
The data classification router. Incoming requests are classified by sensitivity at the gateway. Sensitive requests (containing PII, health data, financial records, legal documents) are routed to a self-hosted model on a self-managed VPC. Non-sensitive requests (general content, public data analysis, code assistance on open-source code) are routed to a commercial API. The two paths share the application layer above and the data pipeline alongside, but the model and serving layers fork. This works when the sensitivity classification can be automated reliably enough that the routing is invisible to end users. If your team has to manually decide which path each request takes, the pattern will not scale and someone will route sensitive data the wrong way.
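A minimal sketch of the gateway fork. The regex patterns here are hypothetical stand-ins; a production gateway would use a proper PII/NER detector rather than regexes, precisely because the routing has to be reliable enough to stay invisible.

```python
import re

# Hypothetical sensitivity patterns -- stand-ins for a real PII detector.
SENSITIVE = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-shaped number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def route(prompt: str) -> str:
    """Fork at the gateway: sensitive traffic to the VPC, the rest to the API."""
    if any(p.search(prompt) for p in SENSITIVE):
        return "self-hosted-vpc"
    return "commercial-api"

route("Summarise the ticket from jane@example.com")  # -> "self-hosted-vpc"
route("Explain Rust lifetimes")                      # -> "commercial-api"
```

The two return values name the forked model-and-serving paths; the application layer above the gateway stays identical for both.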
The anonymisation proxy. PII is stripped from the prompt before it leaves the team's network, the anonymised prompt is sent to a commercial API, the response is processed and the identifiers are reinserted locally. Microsoft Presidio is the open-source library most teams use for the redaction step (CRAPII F1 around 0.94 with comparable precision and recall in the published benchmarks). The pattern lets the team use frontier model quality on data that would otherwise have to stay self-hosted. The honest caveat is that anonymisation is not the same as anonymity under EDPB Opinion 28/2024. For many use cases the residual re-identification risk means the controller obligations still apply, and the anonymisation proxy is a defence-in-depth pattern, not a regulatory escape hatch.
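The proxy's shape, stripped to its essentials. A real deployment would use Presidio's analyzer for the detection step; the single email regex here is a stand-in to show the redact, call, reinsert loop.

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")  # stand-in for a real detector

def redact(prompt: str) -> tuple[str, dict[str, str]]:
    """Replace identifiers with placeholders before the prompt leaves the network."""
    mapping: dict[str, str] = {}
    def swap(m: re.Match) -> str:
        key = f"<PII_{len(mapping)}>"
        mapping[key] = m.group(0)
        return key
    return EMAIL.sub(swap, prompt), mapping

def reinsert(response: str, mapping: dict[str, str]) -> str:
    """Restore the identifiers locally after the API response comes back."""
    for key, value in mapping.items():
        response = response.replace(key, value)
    return response

anon, mapping = redact("Draft a reply to alice@example.com")
# anon == "Draft a reply to <PII_0>"; the commercial API never sees the address.
reinsert("Dear <PII_0>, ...", mapping)  # -> "Dear alice@example.com, ..."
```

Note the mapping never leaves the local network, which is the whole point: the commercial API sees placeholders, and the join back to real identifiers happens on your side of the boundary.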
The third option is to run a single architecture and accept the trade-offs of one approach. For teams under 10 engineers, this is often the right answer because two architectures means two evaluation pipelines, two deployment paths, two prompt sets, two regression suites, and two on-call rotations. The hybrid pattern is structurally cleaner but operationally heavier; the all-buy pattern is operationally cleaner but creates the largest sub-processor surface; the all-build pattern is operationally heaviest and only makes sense when the data sensitivity or AI Act classification forces it.
The right time to walk the five questions is before the team commits to an architecture for the next twelve months. Most teams decide build-vs-buy implicitly during the first sprint (someone signs up for an OpenAI API key, or a senior engineer spins up an H100 box to "try Llama 4") and then the rest of the architecture grows around that early default. The default rarely turns out to be the right decision at all five layers, and rebuilding any one layer six months later is expensive.
The S&P Global figure that 42% of enterprise AI initiatives were scrapped in 2025 (up from 17% the year before) is the load-bearing data point on this. The failure mode is rarely "the model was wrong". It is "we built the layer we should have bought and bought the layer we should have built, and the architecture cannot be fixed without starting over". Walking the five layers deliberately before the architecture commits is the cheapest insurance against that pattern.