last week, i watched a traditional RAG system confidently answer "ACME Corp's Q2 2023 revenue grew 3%" when the user asked about TechCorp's performance. why? because the retrieved chunk said "the company's revenue grew by 3%" without specifying which company.
(this is exactly why context-less retrieval fails in production)
the problem with traditional RAG isn't retrieval—it's that retrieval happens blindly. chunks get embedded without context. search happens without verification. results get injected into prompts without validation.
agentic RAG fixes this by making retrieval intelligent. agents reason about what information they need, actively search for it, validate what they find, and iterate until they have what's required.
what makes RAG "agentic"?
the shift from traditional to agentic RAG is about moving from passive retrieval to active information seeking.
traditional RAG: the static approach
user query → embed query → similarity search → return top-k chunks → inject into prompt → generate answer
this works... until it doesn't. problems emerge:
- no context awareness - chunks lose surrounding information during embedding
- single-shot retrieval - can't adapt if initial results are insufficient
- no validation - blindly trusts that top-k results contain the answer
- query dependency - completely reliant on how user phrased their question
agentic RAG: the intelligent approach
user query → agent analyzes information needs → formulates search strategy → retrieves → evaluates results → refines search if needed → validates completeness → synthesizes answer
the agent actively manages the retrieval process:
- query decomposition - breaks complex questions into targeted sub-queries
- adaptive retrieval - iteratively refines search based on what it finds
- result validation - checks if retrieved information actually helps
- multi-strategy search - combines semantic search, keyword search, metadata filters
- synthesis - intelligently combines information from multiple sources
the contextual retrieval breakthrough
before diving into agentic patterns, we need to talk about Anthropic's contextual retrieval research. this fundamentally changes how we should think about chunking and embedding.
the chunking problem nobody talks about
traditional RAG chunks documents into 200-500 token pieces. this destroys context.
example from Anthropic's research:
original chunk: "The company's revenue grew by 3% over the previous quarter."
without context, this chunk doesn't specify:
- which company?
- which quarter?
- what year?
result? the chunk gets retrieved for the wrong queries, or doesn't get retrieved for the right ones.
contextual embeddings: adding context to chunks
Anthropic's solution: prepend chunk-specific explanatory context before embedding.
using Claude to generate context for each chunk:
original_chunk = "The company's revenue grew by 3% over the previous quarter." contextualized_chunk = "This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter."
now the chunk carries its own context. semantic search works dramatically better.
contextual BM25: exact matching with context
while embeddings capture semantic meaning, BM25 (Best Match 25) provides exact keyword matching. it's especially powerful for:
- unique identifiers ("error code TS-999")
- technical terms
- specific names or codes
- precise phrase matching
combining contextual embeddings with contextual BM25 creates a hybrid retrieval system that handles both semantic similarity and exact matching.
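to make that concrete, here's a minimal sketch of the hybrid merge using reciprocal rank fusion. the rank_bm25 package is real; the vector_search callable, the candidate counts, and the fusion constant are assumptions for illustration, not a specific library's API.

```python
from rank_bm25 import BM25Okapi

def hybrid_search(query, contextualized_chunks, vector_search, k=20):
    """merge BM25 and embedding rankings with reciprocal rank fusion (RRF)."""
    # keyword side: BM25 over the contextualized chunks
    tokenized = [chunk.lower().split() for chunk in contextualized_chunks]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(query.lower().split())
    bm25_ranked = sorted(range(len(contextualized_chunks)),
                         key=lambda i: scores[i], reverse=True)[:150]

    # semantic side: vector_search is assumed to return chunk indices by similarity
    semantic_ranked = vector_search(query, top_k=150)

    # RRF: each ranking contributes 1 / (60 + rank) to a chunk's fused score
    fused = {}
    for ranking in (bm25_ranked, semantic_ranked):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + rank)

    top = sorted(fused, key=fused.get, reverse=True)[:k]
    return [contextualized_chunks[i] for i in top]
```

the constant 60 is the conventional RRF damping factor; nudge it if one retriever consistently drowns out the other.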
the numbers: contextual retrieval performance
according to Anthropic's research on contextual retrieval, the improvements are substantial:
- contextual embeddings alone: 35% reduction in retrieval failure rate (5.7% → 3.7%)
- contextual embeddings + contextual BM25: 49% reduction in failure rate (5.7% → 2.9%)
- adding reranking: 67% reduction in failure rate (5.7% → 1.9%)
these aren't marginal improvements. this is the difference between a RAG system that works in demos versus one that handles production edge cases reliably.
agentic RAG patterns in practice
now that we understand contextual retrieval, let's look at how agents use it intelligently.
pattern 1: query decomposition and planning
complex questions often require multiple pieces of information. agentic RAG breaks them down.
user query: "How did ACME Corp's revenue growth in Q2 2023 compare to their competitors?"
agent's query decomposition:
- what was ACME Corp's Q2 2023 revenue growth?
- who are ACME Corp's main competitors?
- what was each competitor's Q2 2023 revenue growth?
- synthesize comparison
each sub-query gets targeted retrieval. the agent validates it has the necessary information before proceeding to the next step.
as i covered in my agentic design patterns guide, this is a combination of planning patterns (breaking down the task) and tool use patterns (retrieval as a tool).
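a minimal sketch of the decomposition step using the anthropic python SDK; the model name and the JSON-array output contract are assumptions you'd adapt to your own setup.

```python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def decompose_query(query: str) -> list[str]:
    """ask the model for the minimal set of retrieval sub-queries."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumption: swap in whatever model you run
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Break the following question into the minimal set of retrieval "
                "sub-queries needed to answer it. Respond with a JSON array of "
                "strings and nothing else.\n\n"
                f"Question: {query}"
            ),
        }],
    )
    # assumes the model honored the JSON-only instruction; add error handling in production
    return json.loads(response.content[0].text)

sub_queries = decompose_query(
    "How did ACME Corp's revenue growth in Q2 2023 compare to their competitors?"
)
```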
pattern 2: adaptive retrieval with ReAct
sometimes you don't know what you need until you see what you find. this is where ReAct (reason + act) shines.
```
thought: need to find ACME Corp's Q2 2023 performance
action: search("ACME Corp Q2 2023 revenue")
observation: found SEC filing mentioning 3% growth
thought: 3% growth over what baseline? need context
action: search("ACME Corp Q1 2023 revenue")
observation: Q1 2023 revenue was $314 million
thought: now i can calculate Q2 revenue: $314M * 1.03 = $323.42M
action: search("ACME Corp competitors 2023")
observation: main competitors are TechCorp, InnovateInc, DataCo
thought: need competitor Q2 2023 performance for comparison
action: search("TechCorp Q2 2023 revenue growth")
...
```
the agent adapts its search strategy based on what it finds. each retrieval informs the next.
built a customer research agent last month using this exact pattern. instead of trying to retrieve everything upfront, it explores iteratively—much more reliable when you don't know exactly what the knowledge base contains.
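the loop behind a trace like that is small. a stripped-down sketch, where llm_step and search are hypothetical stand-ins for your model call and retrieval tool:

```python
def react_retrieve(question, llm_step, search, max_steps=8):
    """minimal ReAct loop: reason about what's missing, search, observe, repeat."""
    transcript = f"question: {question}\n"
    for _ in range(max_steps):
        # llm_step returns either a search action or a final answer,
        # e.g. {"thought": ..., "action": "search" | "finish", "query": ..., "answer": ...}
        step = llm_step(transcript)
        transcript += f"thought: {step['thought']}\n"
        if step["action"] == "finish":
            return step["answer"], transcript
        observation = search(step["query"])
        transcript += f"action: search({step['query']!r})\nobservation: {observation}\n"
    return None, transcript  # cap the loop instead of searching forever
```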
pattern 3: multi-source retrieval with validation
production RAG systems often need to query multiple sources: internal docs, databases, external APIs, vector stores.
agentic RAG manages this orchestration:
- source selection: agent decides which sources are relevant for the query
- parallel retrieval: queries multiple sources simultaneously
- cross-validation: checks if sources agree or contradict
- confidence scoring: weights results based on source reliability and consistency
- synthesis: combines information intelligently
example: financial analysis agent queries:
- internal database for company records
- vector store for historical analysis documents
- external APIs for market data
- structured database for financial metrics
when sources disagree, the agent flags the discrepancy instead of silently choosing one. in production, this transparency is critical.
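a rough sketch of the fan-out-and-flag step with asyncio; the source callables and the .value field on their results are placeholders for whatever your stack returns.

```python
import asyncio

async def multi_source_retrieve(query, sources):
    """fan out to several sources in parallel, then surface disagreement instead of hiding it."""
    # sources: dict mapping a source name to an async callable that takes the query
    names = list(sources)
    results = await asyncio.gather(*(sources[name](query) for name in names))
    by_source = dict(zip(names, results))

    # placeholder consistency check; real logic compares figures within a tolerance
    values = {getattr(r, "value", None) for r in results}
    values.discard(None)

    return {
        "results": by_source,
        "discrepancy": len(values) > 1,  # flagged for the agent to report, not silently resolve
    }

# run with: asyncio.run(multi_source_retrieve(query, sources))
```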
pattern 4: retrieval with self-critique
one of the most powerful agentic patterns: the agent evaluates whether retrieved information actually helps.
after retrieval, the agent asks itself:
- "does this information answer the user's question?"
- "is anything missing or contradictory?"
- "do i need additional context?"
- "are there gaps in my understanding?"
if the answer is "no" or "partially", the agent reformulates the query and tries again.
implemented this for a legal research agent. first-pass retrieval accuracy was ~60%. with self-critique and adaptive retrieval, it climbed to 91%. the agent catches its own mistakes.
pattern 5: progressive context building
instead of retrieving everything at once, agents can build context progressively:
- initial broad search: get overview of relevant documents
- analyze summaries: identify most promising sources
- targeted deep dive: retrieve detailed information from selected sources
- fill gaps: additional targeted searches for missing pieces
this approach:
- reduces token usage (don't retrieve unnecessary detail)
- improves accuracy (focuses on most relevant sources)
- handles large knowledge bases efficiently
as covered in my context engineering guide, this is a form of context selection strategy—pulling in exactly what's needed, when it's needed.
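a sketch of the two-stage version, assuming you index one-paragraph summaries alongside full chunks; summary_index, chunk_index, and agent_pick are illustrative names, not a particular library.

```python
def progressive_retrieve(query, summary_index, chunk_index, agent_pick, top_docs=5):
    """stage 1: cheap search over summaries; stage 2: detailed chunks from the winners only."""
    # broad, cheap pass over short document summaries
    candidate_summaries = summary_index.search(query, top_k=50)

    # let the agent (or a scoring model) pick the documents worth reading in full
    selected_doc_ids = agent_pick(query, candidate_summaries)[:top_docs]

    # targeted, expensive pass: pull detailed chunks only for the selected documents
    detailed_chunks = []
    for doc_id in selected_doc_ids:
        detailed_chunks.extend(
            chunk_index.search(query, filter={"doc_id": doc_id}, top_k=10)
        )
    return detailed_chunks
```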
implementing agentic RAG: the practical stack
here's what a production agentic RAG system actually looks like:
1. document preprocessing with contextual chunking
traditional approach:
```python
# naive chunking
chunks = split_text(document, chunk_size=500)
embeddings = embed_chunks(chunks)
store_in_vector_db(chunks, embeddings)
```
agentic RAG approach:
```python
# contextual chunking
chunks = smart_chunk(document)  # semantic boundaries

for chunk in chunks:
    # use Claude to generate context
    context = generate_chunk_context(
        chunk,
        document_metadata,
        surrounding_chunks
    )
    contextualized_chunk = context + chunk

    # dual indexing
    embedding = embed(contextualized_chunk)
    bm25_index.add(contextualized_chunk)
    vector_db.add(contextualized_chunk, embedding)
```
the prompt for generating context (adapted from Anthropic's contextual retrieval guide):
```
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
```
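one way to wire the generate_chunk_context helper from the sketch above to that prompt, using the anthropic python SDK with a simplified signature; the model name is an assumption, and Anthropic's cookbook additionally uses prompt caching to keep the per-chunk cost down.

```python
import anthropic

client = anthropic.Anthropic()

CONTEXT_PROMPT = """<document>
{whole_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

def generate_chunk_context(chunk: str, whole_document: str) -> str:
    """produce the short context string that gets prepended to the chunk before indexing."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumption: any current Claude model works here
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                whole_document=whole_document, chunk_content=chunk
            ),
        }],
    )
    return response.content[0].text.strip()
```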
2. agent-driven retrieval layer
the agent orchestrates retrieval intelligently:
```python
class AgenticRetriever:
    def retrieve(self, query, context):
        # agent analyzes the query
        sub_queries = self.decompose_query(query)

        results = []
        for sub_query in sub_queries:
            # hybrid retrieval
            semantic_results = self.vector_search(sub_query)
            keyword_results = self.bm25_search(sub_query)

            # combine and rerank
            combined = self.merge_results(
                semantic_results, keyword_results
            )
            reranked = self.rerank(combined, sub_query)

            # validate sufficiency
            if self.validate_results(reranked, sub_query):
                results.extend(reranked)
            else:
                # adaptive refinement
                refined_query = self.refine_query(
                    sub_query, reranked
                )
                additional = self.retrieve(refined_query, context)
                results.extend(additional)

        return self.deduplicate_and_score(results)
```
3. reranking for precision
after initial retrieval, reranking dramatically improves relevance. using something like Cohere's reranker:
```python
# retrieve top-150 candidates
candidates = retriever.retrieve_top_k(query, k=150)

# rerank to top-20
reranked = reranker.rerank(
    query=query,
    documents=candidates,
    top_n=20
)

# use top-20 for generation
context = format_context(reranked)
```
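with Cohere specifically, the call looks roughly like this; the model name and response fields follow my reading of their rerank API, so check the current docs before copying it.

```python
import cohere

co = cohere.Client()  # api key via the CO_API_KEY environment variable, or pass it explicitly

response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[c.text for c in candidates],  # assumes candidate objects expose a .text field
    top_n=20,
)

# keep the original chunk objects, reordered by reranker relevance
reranked = [candidates[r.index] for r in response.results]
```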
according to Anthropic's research, reranking on top of contextual retrieval gives you that 67% reduction in retrieval failures. the combination is more powerful than either technique alone.
4. synthesis and validation
after retrieval, the agent synthesizes information and validates completeness:
```python
# generate answer with retrieved context
answer = agent.generate(
    query=user_query,
    context=retrieved_chunks,
    instructions="cite sources, flag uncertainties"
)

# self-critique
critique = agent.critique(
    query=user_query,
    answer=answer,
    context=retrieved_chunks
)

if critique.needs_improvement:
    # additional retrieval or refinement
    additional_info = agent.retrieve(
        critique.missing_information
    )
    answer = agent.regenerate(
        query=user_query,
        context=retrieved_chunks + additional_info
    )
```
production lessons from building agentic RAG
after deploying agentic RAG systems for three different companies over the past year:
1. contextual chunking is non-negotiable
i tried building agentic RAG with traditional chunking first. "the agent will figure it out," i thought.
wrong. garbage chunks produce garbage retrievals, no matter how smart your agent is.
invest in proper contextual chunking from day one. the 35-49% reduction in retrieval failures compounds through the entire agent loop.
2. hybrid retrieval beats pure semantic search
semantic embeddings are powerful but fail on exact matches. someone searches for "API endpoint /v2/users" and semantic search returns general documentation about users instead of the specific endpoint.
combining semantic search (embeddings) with keyword search (BM25) handles both use cases. production systems need both.
3. reranking is worth the latency
initially skipped reranking to keep latency low. mistake.
users would rather wait an extra 200ms for an accurate answer than get a fast wrong answer. reranking adds minimal latency but dramatically improves relevance.
the trick: retrieve 100-150 candidates fast, then rerank to top-20. best of both worlds.
4. agents need retrieval tools, not just search
don't just give agents a "search" function. give them structured retrieval capabilities:
- semantic_search(query, filters) - for concept-based retrieval
- keyword_search(exact_terms) - for precise matching
- metadata_filter(criteria) - for structured filtering
- get_related_documents(doc_id) - for exploring connections
- temporal_search(query, date_range) - for time-bounded queries
richer tool set = more intelligent retrieval strategies.
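on the Anthropic API these map naturally onto tool definitions passed to the messages endpoint. a trimmed sketch covering two of them; the schemas are illustrative, not a standard:

```python
retrieval_tools = [
    {
        "name": "semantic_search",
        "description": "Search the knowledge base by meaning. Use for concept-level questions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "filters": {"type": "object", "description": "optional metadata filters"},
            },
            "required": ["query"],
        },
    },
    {
        "name": "keyword_search",
        "description": "Exact-match (BM25) search. Use for identifiers, error codes, and precise phrases.",
        "input_schema": {
            "type": "object",
            "properties": {"exact_terms": {"type": "string"}},
            "required": ["exact_terms"],
        },
    },
]

# passed as the tools parameter of client.messages.create(...)
```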
5. validation catches silent failures
the scariest bug in RAG systems: retrieving irrelevant information but confidently generating an answer anyway.
agentic RAG with validation catches this:
- agent checks if retrieved chunks actually contain relevant information
- flags low-confidence retrievals
- admits uncertainty when appropriate
- requests clarification instead of hallucinating
this self-awareness transforms reliability. users trust the system more when it admits what it doesn't know.
6. progressive disclosure reduces costs
retrieving everything upfront is expensive and often unnecessary.
pattern that works:
- retrieve document summaries first (cheap)
- identify most relevant documents
- retrieve detailed content only for top candidates (expensive but focused)
one customer support agent reduced token usage by 60% using this approach, with no accuracy loss.
when to use agentic RAG vs traditional RAG
agentic RAG isn't always the answer. here's when to use each:
use traditional RAG when:
- queries are straightforward and single-purpose
- knowledge base is small and well-organized
- latency is critical (need sub-second responses)
- retrieval accuracy is already high (>95%)
- cost constraints are tight
use agentic RAG when:
- queries are complex or multi-faceted
- knowledge base is large or poorly organized
- accuracy matters more than speed
- retrieval often requires multiple steps
- sources need validation or cross-referencing
- users ask questions that require synthesis
in my experience: start with traditional RAG for simple use cases. upgrade to agentic RAG when retrieval failures become a bottleneck.
the cost-accuracy tradeoff
let's be honest about costs. agentic RAG is more expensive than traditional RAG:
traditional RAG per query:
- embedding query: ~$0.0001
- vector search: negligible
- generation with context: ~$0.01
- total: ~$0.01
agentic RAG per query:
- query decomposition: ~$0.005
- multiple retrievals: ~$0.0005
- contextual retrieval overhead: ~$0.002
- reranking: ~$0.01
- generation with validation: ~$0.015
- potential refinement loops: ~$0.01
- total: ~$0.04-0.06
4-6x more expensive. but:
- 67% fewer retrieval failures
- dramatically better answer quality
- reduced hallucinations
- higher user trust
for production applications where accuracy matters, the cost increase is justified.
real implementation: financial analysis agent
concrete example from a project last quarter:
the use case
financial analysts needed to query SEC filings, earnings calls, and market data to answer complex questions like: "How has ACME Corp's profit margin trended relative to competitors over the past 3 years, and what were the key drivers?"
the implementation
data layer:
- 10,000+ SEC filings preprocessed with contextual chunking
- dual indexing: vector embeddings (Voyage AI) + BM25
- structured database for financial metrics
- metadata tagging (company, date, document type, financial quarter)
agent layer:
- query decomposition using Claude Sonnet
- ReAct loop for adaptive retrieval
- parallel sub-agents for different companies/time periods
- cross-validation of financial figures
- synthesis with source citation
retrieval strategy:
- decompose query into sub-questions
- identify relevant companies and time periods
- parallel retrieval for each company/period
- metadata filtering for specific quarters
- semantic search for qualitative analysis (driver identification)
- keyword search for exact financial terms
- reranking top 150 candidates to top 20
the results
- accuracy: 89% vs 62% with traditional RAG
- retrieval failures: 2.1% vs 6.8% with traditional RAG
- average query time: 8 seconds vs 2 seconds (acceptable tradeoff)
- analyst time savings: 4+ hours per day
- user trust score: 4.6/5 vs 3.1/5 with previous system
the agentic approach handled complex multi-part questions that traditional RAG couldn't. analysts trusted the system enough to use it for actual research, not just preliminary exploration.
implementing your first agentic RAG system
practical steps to get started:
phase 1: upgrade your chunking (week 1)
- implement contextual chunking using Claude
- dual-index: embeddings + BM25
- add metadata to all chunks
- benchmark retrieval accuracy before/after
this alone gives you the ~35% reduction in retrieval failures. do this first before adding agent complexity.
phase 2: hybrid retrieval (week 2)
- implement semantic search (embeddings)
- implement keyword search (BM25)
- create result merging logic
- add reranking step
now you have the foundation for reliable retrieval.
phase 3: basic agentic layer (week 3-4)
- query decomposition tool
- retrieval tools (semantic, keyword, metadata filter)
- result validation logic
- synthesis with citation
start simple. single agent with ReAct pattern. add complexity only when needed.
phase 4: iterate based on failures (ongoing)
- track retrieval failures
- analyze failure patterns
- add targeted improvements (better chunking, more tools, refined prompts)
- measure impact
use Anthropic's contextual retrieval cookbook as a practical implementation guide—they provide working code examples.
key takeaways
- agentic RAG transforms retrieval - from passive search to active information seeking with reasoning and validation
- contextual retrieval is foundational - a 35-67% reduction in retrieval failures through contextual chunking, hybrid search, and reranking
- hybrid search outperforms - combining semantic embeddings with BM25 keyword search handles both concept and exact matching
- agents enable adaptive retrieval - query decomposition, progressive refinement, multi-source orchestration, and self-critique
- validation prevents silent failures - agents check if retrieved information actually helps, reducing hallucinations
- reranking is worth the cost - adds minimal latency but dramatically improves relevance, especially combined with contextual retrieval
- costs increase 4-6x - but accuracy improvements justify the cost for production applications where correctness matters
- start with contextual chunking - biggest single improvement before adding agent complexity
- progressive disclosure reduces costs - retrieve summaries first, then detailed content only for relevant sources
- real production gains - 89% accuracy vs 62% with traditional RAG in actual financial analysis deployment
resources worth exploring
- Anthropic's contextual retrieval guide - comprehensive research and cookbook for implementing contextual embeddings and BM25
- Full-stack AI agent mastery: agentic RAG lesson - practical implementation examples and code
- agentic design patterns - understanding ReAct, planning, and multi-agent patterns that power agentic RAG
- context engineering for agents - managing context and retrieval strategies in production agents
- AI agents: the complete guide - foundational concepts for building agent systems
building agentic RAG systems? i help companies design and implement intelligent retrieval systems that actually work in production, from contextual chunking to agent orchestration. let's discuss your retrieval challenges.