Context engineering decides what an AI actually sees beyond the prompt. Anthropic published the canonical 2025 reference, and it now drives production agents.

What Is Context Engineering? The 2026 Explainer

Six labeled inputs feeding into a single AI context window: system prompt, tools, retrieved knowledge, memory, examples, output schema.

Most AI failures are context failures. The model wasn't wrong because the prompt was wrong. It was wrong because what it saw alongside the prompt was incomplete or contradictory. Context engineering names the work of deciding what shows up in the model's window before it answers.

The term hardened over 2025. Andrej Karpathy named it in June.¹ Anthropic published the engineering canon in September.² By November, MIT Technology Review framed the transition from vibe coding to context engineering as one of the year's defining shifts in software.³

Key Takeaways

Context engineering is the discipline of curating everything the model sees: system prompt, tools, retrieved docs, memory, examples, and output schema.

Long context windows didn't solve the problem. Anthropic's research on context rot found that recall drops as tokens grow, so packing more in usually makes things worse.²

Treat context as an active design problem, not a place to dump everything you have.

Why prompt engineering stopped being enough

Prompting has gotten better over time, but the bottleneck for quality outputs is constantly shifting. Long-context models removed the size excuse. Claude Sonnet 4.6 ships with a 1M-token context window.⁴ What didn't scale was the model's ability to actually use that space. Anthropic found that as the token count grows, accurate recall from inside the window degrades. They named it context rot.²

Karpathy framed context engineering as "the delicate art and science of filling the context window with just the right information for the next step," reframing the discipline as composition rather than single-shot prompting.¹

Simon Willison endorsed the term the same week, arguing that prompt engineering had collapsed into mere typing into a chatbot while the real work was composing context across system prompts, tools, retrieved knowledge, and memory.⁵

What context engineering actually is

Anthropic's working definition: "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference."²

The post framed the question every builder should answer first: "What configuration of context is most likely to generate our model's desired behavior?"²

It sits above prompting and below the agent loop. Anything that ends up in the model's window during inference falls inside its scope, including the prompt itself, retrieved documents, tool definitions, conversation memory, examples, and the output schema. The orchestration of those pieces (and the decisions about what to include, exclude, compress, or persist) is the new layer.

The pieces that make up context

A modern LLM call has six common slots. Each one is something you design, not something the model fills in for you.

System prompt and instructions. The role, the rules, the output expectations. Anthropic's guidance is to keep it specific enough to guide behavior but flexible enough to leave the model room to think. The wrong altitude here either over-constrains the model or leaves it guessing.²

Tool definitions. Function-calling schemas and Model Context Protocol (MCP) servers that tell the model what actions are available. Tool sprawl is a common failure mode. Defining 30 tools when 4 will do confuses the model and burns tokens.

Retrieved knowledge. RAG-style document chunks pulled at query time. Retrieval quality, not retrieval volume, drives outcomes. A 2026 industry assessment put it directly: most teams use a fraction of their context window, so the bottleneck is which information actually drives a decision, not how much fits.⁶

Memory. Short-term (this conversation) and long-term (everything before this conversation). The distinction matters because long-term memory needs an external store. Short-term doesn't.

Few-shot examples. Concrete demonstrations that show the model what good looks like. Often more effective than another paragraph of instructions.

Output schema. A structured response format (JSON Schema, Pydantic, function-call shape) that constrains what the model returns. Underused and high-leverage when the downstream system has to parse the answer.

A given LLM call rarely uses all six. A simple chatbot uses system prompt and memory. A RAG question-answer system uses system prompt and retrieved knowledge. An agentic workflow uses every slot. The discipline is recognizing which slots a particular task needs and engineering each one rather than letting any of them stay implicit.

Six context slots arranged in a panel: system prompt, tools, retrieved knowledge, memory, examples, output schema.

Context engineering vs prompt engineering vs RAG vs fine-tuning

These get conflated. They're related but they're not the same thing.

Prompt engineering is wording. It's how you phrase the request and the rules. It's a part of context engineering, not a competitor to it.

RAG is one technique for filling the retrieved-knowledge slot. It does not cover memory, tool definitions, system prompts, or output schemas. The newer framing from RAG vendors themselves: RAG is a subset of context engineering, not the parent discipline.⁷

Fine-tuning changes the model itself. Context engineering changes what the model sees at inference time. Fine-tuning is expensive, slow to update, and often the wrong tool when a better context would do the same job.

Context engineering is the orchestration layer. It decides which pieces are loaded, in what order, in what format, and what gets evicted when the budget tightens.

Four disciplines compared: prompt engineering, context engineering, RAG, and fine-tuning, each with a one-line definition.

Where teams actually hit the wall

Anthropic's post is candid about failure modes that production teams keep rediscovering.²

Context rot. The dominant one. As tokens grow, recall accuracy drops. Stuffing the window full of just-in-case docs makes the model worse, not better.

Tool overload. A 30-tool agent makes 30-tool decisions. Most agents do better with a small, well-described set.

Conflicting instructions across layers. System prompt says one thing, retrieved doc says another, memory says a third. The model picks one and looks unreliable.

Stale memory. A long-running agent that remembers something no longer true is worse than an agent that remembers nothing.

Irrelevant retrieval. A retrieval system tuned for recall over precision dumps borderline-relevant chunks into the window. Borderline-relevant context is what context rot eats first.

LangChain's State of Agent Engineering survey identified context engineering and managing context at scale as a top-cited difficulty among teams shipping agents in production.⁸

The four moves: write, select, compress, isolate

LangChain's framework is the cleanest summary of what teams actually do once they take context engineering seriously.⁹

Write. Save context outside the window. A scratchpad, a notes file, a project memory store. The model offloads work and the window stays clean.

Select. Pick the slice of available context relevant to this turn. Not every tool, not every document, not every memory. The selection logic is its own design problem.

Compress. Summarize older turns. Claude Code runs auto-compact at 95% of window utilization, replacing older tool calls and conversation with a structured summary.⁹

Isolate. Split work across sub-agents with their own narrower contexts so the parent context doesn't bloat. Multi-agent architectures are an isolation pattern.

You don't need all four on day one. Most teams start with select (better retrieval) and write (an external memory store) and add compress and isolate as agents run longer.

Where to start

If you're building an agent or a long-running AI feature in 2026, five steps cover most of the gap between first prototype and production-shaped.

Write the system prompt at the right altitude. Specific enough to remove ambiguity, broad enough not to over-constrain. Read Anthropic's reference for the calibration.²
Define the tool surface deliberately. Fewer, clearer tools beat a large catalog. Each tool needs an unambiguous name and description.
Decide where memory lives. Short-term in the conversation, long-term in an external store. Memory layers are an emerging tooling category. The slot is small but it's the one most teams underweight.
Compress before you have to. Build summarization into the agent loop early, not when you hit the window ceiling.
Instrument every layer. Log the prompt, the retrieved chunks, the tool calls, and the final response. You can't tune what you can't see, and the failures that matter (context rot, stale memory, irrelevant retrieval) all live one layer below the visible output.

A reasonable first iteration takes a long weekend. The hard part isn't the pieces individually. It's the loop where you change one thing, run the eval, and discover the system regressed in a slot you didn't touch. That feedback loop is what context engineering, as a practice, is really about.

Context engineering tools: the 2026 stack

The tooling landscape splits into five layers. The pieces are real, but most production teams still hand-roll the glue between them.

Orchestration frameworks. LangChain, LangGraph, and LlamaIndex own this layer. They define the agent loop, sequence the tool calls, manage the message history, and expose hooks for compression and routing. LangChain's State of Agent Engineering report shows context handling at the top of the difficulty list, which is why their own framework added explicit context-engineering primitives in 2025.⁸

Tool definition standards. Anthropic's Model Context Protocol (MCP) is the de facto standard in 2026. By December 2025, more than 10,000 MCP servers had been published, and OpenAI announced MCP adoption in March 2025 (ChatGPT desktop support rolled out thereafter). AGENTS.md is the cross-tool repo-context standard, with adoption past 60,000 open-source projects by late 2025.¹⁰

Retrieval infrastructure. Vector databases (Pinecone, Weaviate, Qdrant, Chroma, pgvector) handle the embeddings and similarity search behind RAG. Hybrid search (vector plus BM25) and reranking models are now common because pure vector retrieval over-recalls borderline-relevant chunks. The Anthropic post is explicit that retrieval quality is the bottleneck, not retrieval volume.²

Memory layers. The newest slot in the stack. Short-term memory lives in the conversation. Long-term memory needs an external store designed to be queried, updated, and compacted across sessions. Letta, MemoryPlugin, and Supermemory are the open-source and consumer-facing options that have shipped. Most teams that need long-term memory still build a thin layer over a vector store plus a relational table for facts they want to reason over.

Evaluation and observability. Once context is the design problem, the question becomes how you know your context is working. LangSmith, Helicone, Phoenix, and the eval-focused side of LangChain all live here. The standard pattern: log every prompt, retrieved chunk, tool call, and response, then run regression evals when you change anything in the context pipeline.

The honest assessment of this stack: the orchestration layer is mature, the tool definition standard is settling, and the retrieval and memory layers are still consolidating. Expect more movement in 2026.

How memory fits inside context engineering

Memory deserves its own treatment because it's the slot most teams underweight and the one where context rot does the most damage.

Two distinctions matter. Short-term memory is the current conversation, which lives inside the window automatically. Long-term memory is everything from outside this conversation that should still inform the model. Long-term memory always lives in an external store. The store has to handle three operations: write a new fact, retrieve relevant facts at query time, and decide what to drop or compact when the store grows.

For teams that don't want to build this from scratch, the consumer-facing memory layer category is starting to formalize. The same providers named in the tools section (Letta, MemoryPlugin, Supermemory) plus emerging cross-AI extensions like MemoryBase cover most of the no-build options. Pick by which AIs you actually use, what scope you need, and whether you want SDK-style integration or extension-style coverage.

Why the term replaced prompt engineering

Prompt engineering was 2023 thinking. Models were small, contexts were short, and the highest-leverage move was wording the instruction better. Models in 2026 are different. Context windows are large but lossy. Tools, memory, and retrieval all live in the same scarce attention budget. The work moved from clever phrasing to deliberate composition.

LangChain's State of Agent Engineering survey identifies context engineering and managing context at scale as among the top-cited difficulties for teams shipping agents in production.⁸ The teams that do graduate treat context as the design problem, not the prompt.

For a closer look at one piece of the stack (memory), see how to make ChatGPT remember across conversations. For the cross-tool angle, see share context between ChatGPT and Claude. For the underlying fragmentation problem, see stop repeating context to AI.

Frequently asked questions

Is context engineering replacing prompt engineering?

Not replacing. Subsuming. Prompt engineering is one piece of context engineering. Karpathy and Anthropic both frame the prompt as a single component inside a larger composition that also includes tools, memory, retrieved knowledge, and examples.¹² Wording still matters. It's no longer the highest-leverage move.

What's the difference between context engineering and RAG?

RAG fills one slot in the context window: retrieved documents. Context engineering covers the whole window, including system prompt, tools, memory, examples, and output schema. Vendors in the RAG space now describe RAG as a subset of context engineering, not the parent discipline.⁷

How is context engineering different from fine-tuning?

Fine-tuning changes the model's weights. Context engineering changes what the model sees at inference time. Fine-tuning is slow and expensive to update. Context engineering can be redesigned per turn. When a behavior change is needed in production, context engineering is the first place to look.

When do I need context engineering instead of just a better prompt?

The moment the model has access to more than one source of information. Any agent with tools, any chatbot with memory, any RAG system, any long-running task is a context engineering problem. A one-shot single-prompt task can still be solved with prompt engineering alone.

What tools help with context engineering?

The pieces are emerging. LangChain and LangGraph for orchestration, Anthropic's MCP for tool definitions, vector databases for retrieval, and memory layers like MemoryBase, Letta, and MemoryPlugin for the persistence slot. The tooling is a year or two behind the concept, which is why most production teams still hand-roll major parts.

Sources

Andrej Karpathy (June 25, 2025), post on context engineering vs prompt engineering.
Anthropic Engineering (September 29, 2025), Effective context engineering for AI agents. Retrieved 2026-05-07.
MIT Technology Review (November 5, 2025), From vibe coding to context engineering: 2025 in software development.
Anthropic, Context windows (Claude API documentation). Retrieved 2026-05-07.
Simon Willison (June 27, 2025), Context engineering.
SwirlAI (March 2026), State of Context Engineering in 2026.
Elastic Search Labs, Context engineering: components, techniques, and best practices. Retrieved 2026-05-07.
LangChain, State of Agent Engineering. Retrieved 2026-05-07.
LangChain Blog (July 2025), Context Engineering for Agents.
Linux Foundation (December 9, 2025), Linux Foundation Announces the Formation of the Agentic AI Foundation. Retrieved 2026-05-12.