Context Engineering
Also known as: Context Curation, Context Window Management
Decides which tokens fill the model's window, in what order, at what fraction of budget.
Claude Code
- Put always-on project rules in `CLAUDE.md`. Claude reads it at the start of every session before any tool call.
- Move trigger-bearing rules into skills: a `SKILL.md` loads only when its trigger fires, preserving always-on budget.
- Keep `CLAUDE.md` under 3 K tokens. Estimate the count as `wc -w` × 1.8; every extra line crowds out the actual task context.
- Use `settings.json` to scope tool permissions per project, and pin invariant config at the head so it cache-hits every session.
- When a rule starts with "when X" or "before Y", it belongs in a skill, not here; keep `CLAUDE.md` to invariants.
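The word-count rule of thumb for `CLAUDE.md` can be scripted as a quick budget check. This is a minimal sketch, assuming the 1.8 tokens-per-word ratio and the 3 K budget quoted in the list; `estimateTokens` and `withinBudget` are hypothetical helper names, and a real tokenizer will report different counts.

```typescript
// Rough token estimate: whitespace-delimited words × 1.8 (the heuristic from the list).
// This approximates a tokenizer; it does not replace one.
function estimateTokens(text: string): number {
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0).length
  // Integer arithmetic (× 18 / 10) avoids floating-point drift before rounding up.
  return Math.ceil((words * 18) / 10)
}

// Hypothetical guard: true when the estimated size fits the always-on budget.
function withinBudget(text: string, budgetTokens = 3000): boolean {
  return estimateTokens(text) <= budgetTokens
}
```

Running a check like this in CI would flag a rules file that has grown past the budget, though the threshold itself remains a judgment call.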
Cursor
- Set `alwaysApply: true` in `.cursor/rules/*.mdc` only for project-wide invariants; every always-applied rule spends token budget on every turn.
- Use `globs` in rule frontmatter to scope rules to specific file types; a rule for `**/*.tsx` fires only when those files are in context.
- Keep total `alwaysApply` rule content under ~500 lines combined; beyond that, the working context fills before the task description lands.
- Use @-references to inject specific files, folders, or terminal output on demand rather than loading everything up front.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| Indicated when the candidate set of relevant tokens (transcripts, retrievals, tool outputs, scratchpad notes) regularly exceeds the model window or your latency and cost budget for one turn. | When every candidate token comfortably fits the window with budget to spare and the task is single-shot, ranking and packing add latency without changing the answer. |
| Justified where the agent runs many turns and the same prefix can be cached across calls. Pin invariant tokens at the head so cache reads pay for themselves. | Without a feedback signal that ties context choices to outcome quality (eval set, user thumbs, downstream tool success), tuning the layout becomes superstition that survives by inertia. |
| A good fit when accuracy depends on which evidence the model attends to, not just on whether the evidence is in the prompt at all (long-context recall degrades by position). | When the task is purely retrieval-bottlenecked and the model is not the limit, work the retrieval pipeline first. Engineering the layout cannot rescue evidence that was never fetched. |
| Useful when you want to swap retrieval strategies, memory policies, or summarisation cadences without changing the agent's control flow. The engineering layer is the seam. | |
In the wild
| Source | Claim |
|---|---|
| cursor.com | Cursor's @-reference UI lets users hand-pick files, folders, terminals, past chats, and git diffs to inject into the agent's prompt, and falls back to its own search when the user does not. The surface exposes both the user-curated and model-curated halves of the context-selection step. |
| anthropic.com | Anthropic's Applied AI team frames context engineering as an explicit design discipline distinct from prompt engineering, with named techniques (just-in-time retrieval, compaction, structured note-taking, and sub-agent isolation) that production agent runtimes use to keep the attention budget on the highest-signal tokens. |
| docs.anthropic.com | Anthropic's prompt-caching API exposes the layout decision as a billing primitive: marking up to four cache breakpoints in the request lets the runtime reuse the matching prefix at a tenth of the input price, which is why production agents pin invariant context (system prompt, tool definitions, long documents) at the head of the window. |
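The cache-breakpoint layout in the last row can be sketched as a request payload. This assumes the Anthropic Messages API shape, in which a `cache_control: { type: 'ephemeral' }` marker on a content block ends a cacheable prefix; `buildRequest` and its parameters are hypothetical, and the model ID is passed in rather than assumed.

```typescript
type TextBlock = { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } }

// Hypothetical helper: builds a payload with invariant context pinned at the head.
function buildRequest(model: string, invariantRules: string, longDocument: string, question: string) {
  const system: TextBlock[] = [
    { type: 'text', text: invariantRules },
    // Breakpoint: everything up to and including this block forms the cacheable prefix.
    { type: 'text', text: longDocument, cache_control: { type: 'ephemeral' } },
  ]
  return {
    model,
    max_tokens: 1024,
    system,
    // Per-turn content sits after the breakpoint, so changing it leaves the prefix reusable.
    messages: [{ role: 'user' as const, content: question }],
  }
}
```

Only the question varies between calls here, so the rules and the document can be served from cache on later turns, provided the prefix stays byte-identical.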
Reader gotcha
Liu et al. document a U-shaped recall curve: language models given a long context attend most reliably to tokens at the start and end and least reliably to tokens in the middle, with accuracy on multi-document QA dropping sharply when the relevant passage sits at position 10 of 20. Stuffing the window is not the same as engineering the window. Placement is load-bearing, and the practitioner who concatenates retrievals in arbitrary order risks landing the key evidence in the worst part of the curve.
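One way to act on the U-shaped curve is to reorder ranked evidence so the strongest items sit at the edges and the weakest in the middle. The following is a minimal sketch of that idea, not a method taken from Liu et al.; `placeAtEdges` is a hypothetical name, and the input is assumed to be sorted best-first.

```typescript
// Takes items sorted best-first and alternates them between the head and the tail,
// so rank 1 lands first, rank 2 lands last, and the lowest ranks end up mid-window.
function placeAtEdges<T>(ranked: T[]): T[] {
  const head: T[] = []
  const tail: T[] = []
  ranked.forEach((item, i) => {
    if (i % 2 === 0) head.push(item)
    else tail.push(item)
  })
  return [...head, ...tail.reverse()]
}

// placeAtEdges(['a', 'b', 'c', 'd', 'e']) returns ['a', 'c', 'e', 'd', 'b']
```

Whether this ordering helps for a given model and task is an empirical question, so it is worth checking against an eval set before adopting it.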
Implementation sketch
```ts
import { embed, generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

type Candidate = { id: string; text: string; tokens: number; recencyBoost?: number }

// Supplied by the host application: the candidate pool, a similarity function,
// and precomputed embeddings keyed by candidate id.
declare const candidates: Candidate[]
declare function cosine(a: number[], b: number[]): number
declare const embeddings: Record<string, number[]>

async function buildContext(question: string, budgetTokens = 6000): Promise<string> {
  // Relevance signal: embedding similarity plus an optional recency boost.
  const { embedding: q } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: question,
  })
  const ranked = candidates
    .map((c) => ({ c, score: cosine(q, embeddings[c.id]!) + (c.recencyBoost ?? 0) }))
    .sort((a, b) => b.score - a.score)

  // Budget: greedy packing in rank order; items that do not fit are skipped.
  const packed: Candidate[] = []
  let used = 0
  for (const { c } of ranked) {
    if (used + c.tokens > budgetTokens) continue
    packed.push(c)
    used += c.tokens
  }

  // Layout: numbered evidence first, question last.
  const evidence = packed.map((p, i) => `[${i + 1}] ${p.text}`).join('\n\n')
  const { text } = await generateText({
    model: openai('gpt-4o'),
    system: 'You are answering using only the numbered evidence below. Cite by bracketed index.',
    prompt: `Evidence:\n${evidence}\n\nQuestion: ${question}`,
  })
  return text
}

export {}
```
- LangChain
- LangGraph
- Vercel AI SDK
- Mastra
References
- Docs: Prompt caching
Context engineering is the discipline of deciding what fills the model's window before each call: which system instructions, which retrieved passages, which tool outputs, which scratchpad notes, which prior turns, in what order, and at what fraction of the budget. The pattern treats the window as a finite resource the runtime allocates, not as a place to dump everything that might be relevant. Selection happens upstream of generation: a step ranks candidate items by signal-to-cost, packs the highest-yield ones into a budget, and lays them out so the model attends to the parts that matter most. Longer windows did not eliminate the cost of choosing; they made the choice more consequential.
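The "signal-to-cost" ranking mentioned above can be sketched as a greedy pack ordered by score per token. This is an illustrative sketch only: the `Packable` type and `packBySignalPerToken` are hypothetical names, and the scores are assumed to come from whatever relevance signal the runtime already computes.

```typescript
type Packable = { id: string; score: number; tokens: number }

// Greedy packing by signal density: items with the highest score per token go first.
function packBySignalPerToken(items: Packable[], budgetTokens: number): Packable[] {
  const ranked = items
    .filter((it) => it.tokens > 0) // guard against division by zero
    .sort((a, b) => b.score / b.tokens - a.score / a.tokens)
  const packed: Packable[] = []
  let used = 0
  for (const it of ranked) {
    if (used + it.tokens > budgetTokens) continue // skip what does not fit, keep trying smaller items
    packed.push(it)
    used += it.tokens
  }
  return packed
}
```

Ranking by density favours short, high-signal items over long ones with a marginally higher raw score, which is the trade the paragraph describes; greedy packing is a heuristic and does not guarantee the optimal subset.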
Background · context and trade-offs
The pattern is broader than retrieval-augmented generation, which is one specific way to populate the window. RAG answers where new tokens come from; context engineering answers which tokens, of all the candidates, get a seat at the table this turn. It also sits next to, but is distinct from, prompt chaining, which decomposes the task into stages: chaining changes what the model is asked, while context engineering changes what the model sees while answering. A typical implementation composes both: the chain decides the stage, and the context-engineering step decides which retrieved chunks, prior summaries, and tool transcripts that stage will read.
The mechanism leans on three operational moves. A relevance signal (embedding similarity, recency, a learned ranker, an explicit user @-reference) orders candidates. A budget (token count, dollar cost, latency target) caps what survives. A layout (system prompt, then evidence, then history, then question) places the surviving tokens where attention is strongest, because models do not weight positions evenly across a long window. The cost is operational: someone has to pick the signal, set the budget, instrument the cache hit rate, and decide what to evict when the window fills. Without those decisions the agent silently degrades as transcripts grow.
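The layout and eviction decisions can be sketched together. This minimal example assumes the system, evidence, history, question order given above and evicts the oldest history turns first; `WindowParts`, `countTokens`, and `layoutWindow` are hypothetical names, and the word count stands in for a real tokenizer.

```typescript
type WindowParts = { system: string; evidence: string[]; history: string[]; question: string }

// Crude size proxy for this sketch: whitespace-delimited word count.
const countTokens = (s: string): number => s.split(/\s+/).filter(Boolean).length

// Lays out system, evidence, history, question in that order and, when over budget,
// evicts the oldest history turn first. System, evidence, and question stay pinned.
function layoutWindow(w: WindowParts, budgetTokens: number): string {
  const history = [...w.history] // copy so the caller's array is not mutated
  const render = () => [w.system, ...w.evidence, ...history, w.question].join('\n\n')
  while (history.length > 0 && countTokens(render()) > budgetTokens) history.shift()
  return render()
}
```

Oldest-first eviction is only one policy; summarising evicted turns or evicting by relevance are alternatives, and the pinned sections can still exceed the budget on their own, which this sketch does not handle.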