Agentic RAG
Also known as: Iterative RAG, Adaptive RAG, Self-RAG, Active Retrieval
Lets the model issue retrieval as a tool call and decide when to stop searching.
Claude Code
- Expose your retrieval index as an MCP tool; Claude calls it iteratively inside the normal tool-use loop without extra orchestration.
- Write the retrieval tool description to instruct re-querying — tell the model to refine and re-issue when initial passages are insufficient.
- Use a subagent for retrieval-heavy tasks; it accumulates passages in its own context and returns a summary, keeping the main session clean.
- Pin the stopping heuristic in `CLAUDE.md`: specify how many retrieval rounds are appropriate before the agent should answer on available evidence.
Cursor
- Add a retrieval MCP server to `.cursor/mcp.json`; in Agent mode, the model issues multiple `retrieve` calls iteratively until it has enough evidence.
- Use `@docs` to pre-index documentation sources; Cursor fetches and re-ranks on each retrieval call automatically.
- Pass the initial query result back via `@file` and prompt the agent to refine the query if the result is insufficient.
- Use Plan mode to sketch the retrieval strategy before the loop starts so you can review the intended hops.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| Apply when questions are multi-hop or ambiguous and a single similarity search will miss the bridging fact (cross-document reasoning, follow-up questions that depend on what the first hop returned, vague queries that need a clarifying search). | When the question is well-formed and self-contained, vanilla single-shot RAG is cheaper, easier to debug, and competitive on quality. The loop pays no rent on a one-hop question. |
| Justified where the agent has access to multiple corpora or tools and must decide which one to consult (internal docs, the public web, a structured database) rather than treating all evidence as one undifferentiated index. | When latency budgets are tight (sub-second chat, autocomplete), the extra retrieval round-trips and reasoning tokens compound into a worse user experience than answering from a single retrieval and being wrong sometimes. |
| Worth the cost when the cost of an extra retrieval is small relative to the cost of an unsupported answer (research assistants, regulatory or medical Q&A, customer-facing support that escalates on uncertainty). | Without an answer-level evaluator that catches premature termination, the agent will confidently stop on the first plausible hit and the loop adds cost without measurable accuracy improvement over the single-shot baseline. |
| A better fit than a hand-rolled query-rewriter chain when the number of hops is data-dependent and a fixed pipeline would either over- or under-fetch. | |
In the wild
| Source | Claim |
|---|---|
| docs.perplexity.ai → | Perplexity issues fresh search queries on every conversational turn and reformulates them as the dialogue refines what the user is actually asking, then conditions its grounded answer on the iteratively retrieved passages with inline citations. |
| platform.claude.com → | Anthropic's web_search tool lets a Claude model decide mid-trajectory to issue one or more searches, read the results, and either answer or search again. The pattern is documented as a first-party tool the model invokes inside a normal tool-use loop. |
| langchain-ai.github.io → | LangGraph's agentic-RAG tutorial wires a graph in which the model first decides whether to retrieve, then grades the retrieved passages, and either rewrites the query and re-retrieves or generates an answer. It implements the iterative control flow this pattern names, runnable end-to-end. |
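The decide, grade, then rewrite-or-generate control flow described in the last row can be sketched without any framework. This is a minimal illustration, not LangGraph's API: `runGradedRag`, the `Nodes` shape, and the stub functions are all hypothetical, and each node is a plain synchronous function standing in for what would usually be an asynchronous model or index call.

```typescript
type Passage = { id: string; text: string }

// Each node is a pluggable function; in a real system most would be model calls.
// All names here are hypothetical and belong to no library.
type Nodes = {
  shouldRetrieve: (question: string) => boolean
  retrieve: (query: string) => Passage[]
  grade: (question: string, passages: Passage[]) => boolean
  rewrite: (query: string) => string
  generate: (question: string, passages: Passage[]) => string
}

function runGradedRag(question: string, nodes: Nodes, maxRewrites: number) {
  const trace: string[] = []
  let supported: Passage[] = [] // only passages that passed grading reach the generator
  if (nodes.shouldRetrieve(question)) {
    let query = question
    for (let attempt = 0; attempt <= maxRewrites; attempt++) {
      const passages = nodes.retrieve(query)
      trace.push(`retrieve(${query})`)
      if (nodes.grade(question, passages)) {
        supported = passages
        break
      }
      if (attempt < maxRewrites) {
        query = nodes.rewrite(query) // failed grading: rewrite and try again
        trace.push('rewrite')
      }
    }
  }
  trace.push('generate')
  return { answer: nodes.generate(question, supported), trace }
}

// Stubs: the first query finds nothing, the rewritten query finds one passage.
const stubs: Nodes = {
  shouldRetrieve: () => true,
  retrieve: query => (query.indexOf('refined') !== -1 ? [{ id: 'p1', text: 'relevant passage' }] : []),
  grade: (_question, passages) => passages.length > 0,
  rewrite: query => `${query} refined`,
  generate: (_question, passages) =>
    passages.length > 0 ? `answer citing ${passages[0].id}` : 'The corpus does not support an answer.',
}

const result = runGradedRag('q', stubs, 2)
console.log(result.trace.join(' -> '))
// prints "retrieve(q) -> rewrite -> retrieve(q refined) -> generate"
```

If grading never passes within `maxRewrites`, the generator receives no passages, so it can decline instead of answering from off-topic text.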
Reader gotcha
Self-RAG's evaluation reports that adaptive retrieval helps most when the model is trained to emit explicit reflection tokens that decide whether to retrieve and whether the retrieved passages are relevant. Bolting an unmodified instruction-tuned model into a retrieve-anything-anytime loop tends to over-retrieve on questions it could have answered from parametric memory and under-verify on questions where the first hit was off-topic. The decision to retrieve is itself a learned skill that the loop alone does not provide.
Implementation sketch
import { generateText, tool } from 'ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'
// Stand-in for your retrieval index; any client with a search(query, k) method fits.
declare const index: { search(q: string, k: number): Promise<{ id: string; text: string }[]> }
// The tool description doubles as the re-query instruction the model sees.
const retrieve = tool({
  description: 'Search the corpus and return top-k passages. Call again with a refined query if the first set is insufficient.',
  parameters: z.object({ query: z.string(), k: z.number().int().min(1).max(8).default(4) }),
  execute: async ({ query, k }) => ({ passages: await index.search(query, k) }),
})
// Option names follow AI SDK v4; v5 renames `parameters` to `inputSchema` and replaces `maxSteps` with `stopWhen: stepCountIs(n)`.
const { text } = await generateText({
  model: openai('gpt-4o'),
  tools: { retrieve },
  maxSteps: 5, // bounded retrieval loop; model decides when to stop and answer
  system: 'Use retrieve as many times as needed. Cite passage ids. If the corpus does not support an answer, say so.',
  prompt: 'Which of our SLOs changed between the 2025 and 2026 reliability reviews, and why?',
})
export {} // marks the file as a module so top-level await is allowed
- LangGraph
- Vercel AI SDK
- Mastra
- LangChain
References
- Docs: Web search tool
Agentic RAG turns retrieval into a tool the model can call as many times as the task warrants. The agent inspects the question, issues a first query, reads what came back, and then decides, based on what it found and what it still needs, whether to rephrase, narrow to a subtopic, fetch a different corpus, or stop and answer. Retrieval is no longer a fixed pre-step that runs before generation; it becomes one action in a Thought–Action–Observation loop, indistinguishable from any other tool call. The runtime owns the bound on iteration; the model owns the decision to keep going.
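A minimal, dependency-free sketch of that division of labour: the runtime enforces the step budget while the model chooses each next action. Every name here (`runAgenticRag`, `Policy`, `Search`, the stubs) is hypothetical, and the policy is a plain synchronous function standing in for a model call that would be asynchronous in practice.

```typescript
// Hypothetical types for illustration; none of these names come from a real SDK.
type Passage = { id: string; text: string }
type Action =
  | { kind: 'retrieve'; query: string }
  | { kind: 'answer'; text: string }

// Stand-in for the model: given the question and every passage observed so far,
// choose the next action. Synchronous for brevity; a real model call is async.
type Policy = (question: string, observed: Passage[]) => Action
type Search = (query: string) => Passage[]
type Result = { answer: string | null; queries: string[]; stoppedBy: 'model' | 'budget' }

function runAgenticRag(question: string, policy: Policy, search: Search, maxSteps: number): Result {
  const observed: Passage[] = []
  const queries: string[] = []
  for (let step = 0; step < maxSteps; step++) {
    const action = policy(question, observed) // the model owns the decision
    if (action.kind === 'answer') {
      return { answer: action.text, queries, stoppedBy: 'model' }
    }
    queries.push(action.query)
    observed.push(...search(action.query)) // the observation feeds the next decision
  }
  // The runtime owns the bound: the budget ran out before the model chose to answer.
  return { answer: null, queries, stoppedBy: 'budget' }
}

// Stub policy: keeps searching until it has seen two passages, then answers.
const stubPolicy: Policy = (question, observed) =>
  observed.length < 2
    ? { kind: 'retrieve', query: `${question} (attempt ${observed.length + 1})` }
    : { kind: 'answer', text: `Based on ${observed.map(p => p.id).join(', ')}` }

// Stub index: returns one freshly numbered passage per query.
let nextId = 0
const stubSearch: Search = query => [{ id: `p${++nextId}`, text: `result for ${query}` }]

const withinBudget = runAgenticRag('example question', stubPolicy, stubSearch, 5)
const overBudget = runAgenticRag('example question', stubPolicy, stubSearch, 2)
console.log(withinBudget.stoppedBy, withinBudget.queries.length) // model 2
console.log(overBudget.stoppedBy, overBudget.answer) // budget null
```

Returning `stoppedBy` separately from the answer lets a caller tell a model-chosen stop apart from a budget cut-off, which is useful when tuning the step limit.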
Background · context and trade-offs
The pattern sits one layer above vanilla RAG (authored separately at /agentic-design-patterns/rag), which fires a single similarity search and concatenates the top-k chunks into the prompt. That single-shot pipeline is cheap and inspectable but folds on multi-hop questions where the bridging fact lives in a chunk no embedding of the original phrasing will surface. The agentic variant addresses that gap by letting the model rewrite the query mid-trajectory: read the first hit, notice the entity it actually needs, search again. Self-RAG goes further and trains the model to emit reflection tokens that decide when to retrieve, when retrieved passages are relevant, and when the draft is supported. The same control flow is learned end-to-end rather than orchestrated externally.
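A toy worked example of the multi-hop gap, under loud assumptions: the three-sentence corpus is invented, word overlap stands in for embedding similarity, and the second query is hard-coded to represent what a model might write after reading the first hit. It shows only the mechanics, not how any real retriever ranks.

```typescript
// Toy corpus; the ids and sentences are invented for illustration.
type Doc = { id: string; text: string }
const corpus: Doc[] = [
  { id: 'd1', text: 'The checkout service is owned by the Payments team.' },
  { id: 'd2', text: 'The Payments team is managed by Dana Reyes.' },
  { id: 'd3', text: 'The search service is owned by the Discovery team.' },
]

const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? []

// Count distinct document words that also appear in the query.
// A crude stand-in for embedding similarity, chosen so the example has no dependencies.
function overlap(query: string, doc: Doc): number {
  const q = tokenize(query)
  const seen: string[] = []
  for (const t of tokenize(doc.text)) {
    if (q.indexOf(t) !== -1 && seen.indexOf(t) === -1) seen.push(t)
  }
  return seen.length
}

function search(query: string, k: number): Doc[] {
  return corpus.slice().sort((a, b) => overlap(query, b) - overlap(query, a)).slice(0, k)
}

const question = 'Who manages the team that owns the checkout service?'

// Single shot: the original phrasing ranks d1 and d3 above the bridging fact in d2.
const firstHop = search(question, 2)
console.log(firstHop.map(d => d.id)) // [ 'd1', 'd3' ]

// Hypothetical rewrite after reading d1 and noticing the entity "Payments team".
const secondHop = search('Payments team managed by', 1)
console.log(secondHop.map(d => d.id)) // [ 'd2' ]
```

Raising `k` on the first query would eventually surface `d2` in a corpus this small; in a large index the bridging chunk can sit far below any practical cut-off, which is where the rewrite pays for itself.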
The cost of this flexibility is a longer tail. Each extra retrieval is another similarity search, another set of tokens in the context, and another decision the model can get wrong; runs that should have terminated after one hop drift into four. Two failure modes recur: the agent keeps re-querying because nothing in the trajectory disconfirms its hypothesis, or it terminates early on a confident-but-wrong first hit and never asks the second question that would have surfaced the contradiction. A hard step budget and an answer-level evaluator that scores against ground truth, not retrieval recall, are both load-bearing.
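To make the distinction between answer-level scoring and retrieval recall concrete, here is a small offline-evaluation sketch. The record shapes (`Run`, `Gold`), the sample runs, and the use of normalized exact match as the correctness check are all assumptions for illustration; a real evaluator would likely need a more forgiving comparison.

```typescript
// Hypothetical record shapes for an offline evaluation set.
type Run = { answer: string | null; retrievedIds: string[]; hops: number }
type Gold = { answer: string; supportingIds: string[] }

const normalize = (s: string): string => s.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim()

// Answer-level check: normalized exact match against the reference answer (an assumption).
function answerCorrect(run: Run, gold: Gold): boolean {
  return run.answer !== null && normalize(run.answer) === normalize(gold.answer)
}

// Retrieval-level check: fraction of supporting passages that were fetched.
function retrievalRecall(run: Run, gold: Gold): number {
  const hits = gold.supportingIds.filter(id => run.retrievedIds.indexOf(id) !== -1).length
  return hits / gold.supportingIds.length
}

const gold: Gold = { answer: 'Dana Reyes', supportingIds: ['d1', 'd2'] }

// Stopped early on a plausible first hit.
const premature: Run = { answer: 'The Payments team', retrievedIds: ['d1'], hops: 1 }
// Fetched every supporting passage but still answered wrongly.
const misread: Run = { answer: 'The Payments team', retrievedIds: ['d1', 'd2'], hops: 2 }
// Fetched every supporting passage and answered correctly.
const good: Run = { answer: 'Dana Reyes.', retrievedIds: ['d1', 'd2'], hops: 2 }

const runs: [string, Run][] = [['premature', premature], ['misread', misread], ['good', good]]
for (const [name, run] of runs) {
  console.log(name, {
    correct: answerCorrect(run, gold),
    recall: retrievalRecall(run, gold),
    hops: run.hops, // tracked per run so the long tail of extra hops stays visible
  })
}
```

The `misread` run has perfect recall and a wrong answer, so a recall-only metric would score it the same as `good`; only the answer-level check separates them.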