RAG
Also known as: Retrieval-Augmented Generation, Retrieve-and-Generate
Pulls documents from an external store and uses them as context for the model's answer.
Claude Code
- Use an MCP server to expose your retrieval index as a `search` tool; Claude calls it inside the normal tool-use loop.
- Declare the retrieval tool in `settings.json` under `permissions.allow` so it fires without prompting on every session.
- Pin the prompt template (evidence layout, citation instruction) in `CLAUDE.md` so the grounding format is consistent across calls.
- Use a subagent for retrieval-heavy tasks so the main session context stays free of bulk passage text.
Cursor
- Add a retrieval MCP server to `.cursor/mcp.json`; the agent calls it as a tool inside the standard Agent mode loop.
- Use `@docs` to index a documentation URL directly into Cursor's context; no embedding pipeline setup is required for known doc sites.
- Reference retrieved files via `@file` to give the model grounding passages without embedding them in the prompt by hand.
- Use `@codebase` for codebase-native retrieval: Cursor indexes the repo and retrieves relevant snippets automatically.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| Best for cases where the answer must be grounded in a corpus that changes faster than you can retrain or fine-tune (product manuals, internal wikis, regulatory filings, customer ticket history). | When the question requires combining facts across multiple documents that no single chunk surfaces, single-pass retrieval will miss the bridge and the answer will look confidently wrong. |
| Justified where the user's question is well-formed and self-contained: a single retrieval step is enough to surface the relevant passage. | When the corpus is small enough to fit in the model's context window: long-context prompting is simpler, has no retrieval failure mode, and benchmarks competitively for corpora under ~100k tokens. |
| A good fit when hallucination cost is high and you need an inspectable audit trail: every cited claim points back to a specific chunk in a specific document. | Without a way to evaluate retrieval quality independently of generation quality, the system's failures are unattributable and tuning becomes superstition. |
| Useful when the corpus is large enough that fitting it in the context window is wasteful or impossible, but small enough that the index fits in a single vector store. | |
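
The table's warning about evaluating retrieval separately from generation can be made concrete with two standard retrieval metrics. The sketch below is a minimal illustration, not a full evaluation harness: the `RetrievalRun` shape is an assumption, and it presumes you already have a small set of queries with human-labeled relevant chunk ids and the ranked ids your retriever returned for each.

```typescript
// One evaluation record: the ids the retriever returned (best first) and the
// ids a human labeled as relevant. This shape is an assumption for the sketch.
type RetrievalRun = { retrieved: string[]; relevant: string[] }

// Share of queries where at least one relevant chunk appears in the top k.
function hitRateAtK(runs: RetrievalRun[], k: number): number {
  if (runs.length === 0) return 0
  const hits = runs.filter(({ retrieved, relevant }) => {
    const wanted = new Set(relevant)
    return retrieved.slice(0, k).some((id) => wanted.has(id))
  }).length
  return hits / runs.length
}

// Mean reciprocal rank: 1 / rank of the first relevant chunk, 0 if none was retrieved.
function meanReciprocalRank(runs: RetrievalRun[]): number {
  if (runs.length === 0) return 0
  const total = runs.reduce((sum, { retrieved, relevant }) => {
    const wanted = new Set(relevant)
    const rank = retrieved.findIndex((id) => wanted.has(id))
    return sum + (rank === -1 ? 0 : 1 / (rank + 1))
  }, 0)
  return total / runs.length
}
```

Because neither function calls the generator, a drop in these numbers points at the index, chunking, or embedding model rather than at the prompt.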
In the wild
| Source | Claim |
|---|---|
| docs.perplexity.ai → | Perplexity issues a search for every conversational turn, retrieves a few pages of results, and conditions its answer on those passages with inline citations. The product is consumer-facing RAG with a UI that exposes the retrieved sources next to each claim. |
| glean.com → | Glean indexes an enterprise's documents, chats, and tickets behind permissions-aware vector search, then answers employee questions by quoting the matching passages with links back to the source system. |
| anthropic.com → | Anthropic's Contextual Retrieval writeup documents a production RAG variant that prepends a one-paragraph chunk-context preamble before embedding, raising recall on their own evals by reducing the rate at which a relevant chunk goes unretrieved. |
Reader gotcha
Embedding a chunk in isolation strips the surrounding context that disambiguates it: "the company reported $X" loses meaning when severed from the document that names the company. Anthropic's Contextual Retrieval evaluation shows that prepending a short LLM-written context to each chunk before embedding cuts the failed-retrieval rate substantially. It is the cheap fix that practitioners most often skip.
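
The idea of prepending context before embedding might look like the sketch below. It is an illustration of the general shape only, not Anthropic's implementation: `RawChunk`, `ContextWriter`, `contextualize`, and `titleOnly` are hypothetical names, and the context writer is kept synchronous here so the example runs without a model. In a real pipeline that hook would be an asynchronous LLM call that sees the whole document.

```typescript
// Hypothetical chunk shape for this sketch.
type RawChunk = { id: string; docTitle: string; text: string }

// Hypothetical hook that produces the situating context for a chunk.
// A production version would call an LLM with the full document and the chunk.
type ContextWriter = (chunk: RawChunk) => string

// Build the text that actually gets embedded: context first, then the chunk itself.
function contextualize(
  chunks: RawChunk[],
  writeContext: ContextWriter,
): { id: string; textToEmbed: string }[] {
  return chunks.map((chunk) => ({
    id: chunk.id,
    textToEmbed: `${writeContext(chunk).trim()}\n\n${chunk.text}`,
  }))
}

// Stand-in writer that uses only the document title (an assumption, to keep the sketch offline).
const titleOnly: ContextWriter = (chunk) => `From the document "${chunk.docTitle}".`
```

Only the embedded text changes; the passage shown to the model at answer time can remain the original chunk text.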
Implementation sketch
```typescript
import { generateText, embed } from 'ai'
import { openai } from '@ai-sdk/openai'

type Chunk = { id: string; text: string; embedding: number[] }

// Vector store interface assumed by this sketch; supply your own implementation.
declare const store: {
  search(query: number[], k: number): Promise<Chunk[]>
}

async function ragAnswer(question: string, k = 5): Promise<string> {
  // Embed the question so it can be scored against the indexed chunks.
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: question,
  })

  // Retrieve the top-k passages and number them so the model can cite them.
  const passages = await store.search(embedding, k)
  const context = passages.map((p, i) => `[${i + 1}] ${p.text}`).join('\n\n')

  // Generate an answer grounded only in the retrieved passages.
  const { text } = await generateText({
    model: openai('gpt-4o'),
    system: 'Answer using only the numbered passages. Cite the bracketed index for each claim. If the passages do not support an answer, say so.',
    prompt: `Passages:\n${context}\n\nQuestion: ${question}`,
  })
  return text
}

export {}
```
- LangChain
- LangGraph
- Vercel AI SDK
- Mastra
Retrieval-Augmented Generation pairs a frozen language model with an external store of documents the model was never trained on. At query time the system embeds the question, scores it against a precomputed index of chunked text, and selects the top handful of passages by cosine similarity or hybrid sparse-dense ranking. Those passages are concatenated into the prompt as evidence the model is told to ground its answer in. The model then generates a response that, in the well-tuned case, cites or quotes the retrieved snippets rather than fabricating from parametric memory.
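
The scoring step described above can be sketched as a brute-force search over an in-memory index. This is a toy stand-in for a real vector store: `cosineSimilarity` and `topK` are illustrative names, and a production index would normally use an approximate nearest-neighbor structure instead of scanning every chunk.

```typescript
type Chunk = { id: string; text: string; embedding: number[] }

// Cosine similarity between two vectors of equal length; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}

// Score every chunk against the query embedding and keep the k best, highest score first.
function topK(query: number[], index: Chunk[], k: number): Chunk[] {
  return index
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(({ chunk }) => chunk)
}
```

The returned chunks are what gets concatenated into the prompt as evidence.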
Background · context and trade-offs
The pattern's leverage is that knowledge can change without retraining: swap the index, update a chunk, redeploy nothing. The cost is that retrieval quietly becomes the factor that dominates accuracy. Chunk size, overlap, embedding model, distance metric, top-k, prompt template, and reranker each bring their own free parameters, and each interacts with the others. Even the embedding's distance function is load-bearing: when embeddings are not unit-normalized, a passage that ranks first under cosine similarity can rank well below that under dot product on the same vectors, because dot product also rewards vector magnitude. Most production RAG quality work is retrieval engineering wearing a generation costume.
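
The point about the distance function can be checked on a toy case. The vectors below are made up for illustration and are deliberately not unit-normalized; `rankBy` is a hypothetical helper. With normalized embeddings the two metrics produce the same ordering.

```typescript
const dot = (a: number[], b: number[]): number => a.reduce((sum, x, i) => sum + x * b[i], 0)
const norm = (a: number[]): number => Math.sqrt(dot(a, a))
const cosine = (a: number[], b: number[]): number => dot(a, b) / (norm(a) * norm(b))

// Rank document ids by a scoring function, highest score first.
function rankBy(
  query: number[],
  docs: Record<string, number[]>,
  score: (a: number[], b: number[]) => number,
): string[] {
  return Object.entries(docs)
    .sort(([, x], [, y]) => score(query, y) - score(query, x))
    .map(([id]) => id)
}

// Made-up vectors: `aligned` points exactly along the query, `long` is 45 degrees off but larger.
const query = [1, 0]
const docs = {
  aligned: [1, 0],
  long: [3, 3],
}

const byCosine = rankBy(query, docs, cosine) // ['aligned', 'long']  (scores 1 vs ~0.707)
const byDot = rankBy(query, docs, dot) // ['long', 'aligned']  (scores 3 vs 1)
```

The practical takeaway is to use the metric the embedding model was designed for, or to normalize vectors at index time so the choice stops mattering.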
The vanilla pattern is single-pass and stateless: one query in, one prompt out. That makes it cheap and easy to debug: every retrieved chunk and every token in the augmented prompt is inspectable without a stack trace. It also makes it fragile on multi-hop questions where the answer depends on combining facts from multiple documents, on ambiguous queries that require clarification, and in domains where the relevant chunk is not lexically close to the user's phrasing. The agentic variant, in which the model rewrites queries, retrieves iteratively, and decides when to stop, addresses those gaps but is documented separately as a different pattern.