here's something nobody tells you when you start building ai agents: getting the model to work is easy. getting it to work reliably is where most projects die.
and in my experience (building agents for 8+ clients over the past two years), the difference between "works in demo" and "works in production" usually comes down to one thing:
how you manage context.
wait, what even is context engineering?
if you've been following ai development, you've heard about prompt engineering... crafting the perfect instructions to get llms to do what you want.
context engineering is different. it's about managing the entire information landscape your agent operates in.
think of it this way:
- prompt engineering = writing good instructions
- context engineering = deciding what information your agent sees, when it sees it, and how much of it survives to the next interaction
according to anthropic's recent engineering post, context engineering is about "optimizing the utility of tokens against the inherent constraints of llms."
(which is a fancy way of saying: you have limited space, use it wisely)
the problem nobody talks about: context rot
here's what i wish someone had told me two years ago...
llms don't just "run out of space" when you hit the context limit. they start getting confused way before that.
this is called context rot (closely related to the "lost in the middle" problem).
the technical reality
based on research from anthropic and others, here's what happens:
- llms use transformer architecture where every token "attends to" every other token
- this creates n² pairwise relationships for n tokens
- as context grows, the model's "attention budget" gets stretched thin
- performance degrades gradually, not suddenly
working on a customer service agent last quarter, i noticed something weird... accuracy was great at 20k tokens. still good at 40k. but at 60k? the agent started making mistakes on basic facts that were right there in the conversation history.
the context window was 200k tokens. we weren't even close to the limit. but the agent was already losing track.
this is context rot in action.
the fundamental principle (that changed how i build)
after a year of debugging context issues, i've internalized one core principle from anthropic's work:
find the smallest possible set of high-signal tokens that maximize the likelihood of your desired outcome
sounds simple. incredibly hard to implement.
let me show you what this means in practice...
context engineering in practice: what actually works
1. system prompts at the "right altitude"
this one took me forever to figure out.
your system prompt needs to be specific enough to guide behavior, but general enough to let the model reason flexibly.
too low altitude (overly specific):
```
if user asks about refunds and order date > 30 days ago, say "outside refund window"
if user asks about refunds and order date < 30 days, check order status
if order status is "delivered" and < 7 days, process refund
if order status is "delivered" and >= 7 days, escalate to manager
...
```
this breaks constantly. every edge case needs a new rule.
too high altitude (too vague):
```
you are a helpful customer service agent. help customers with their requests.
```
this is useless. no guidance on policies, escalation, or handling edge cases.
right altitude (goldilocks zone):
```
you handle customer support with access to order data and refund tools.

## refund policy
- standard window: 30 days from order date
- condition: product must be delivered
- if eligible: process immediately
- if edge case: escalate with context

## your approach
1. gather relevant order information
2. check policy eligibility
3. take appropriate action
4. explain decision clearly

use tools to check actual dates and status rather than relying on customer statements.
```
this gives clear guidance while letting the agent reason about specific situations.
result: 40% fewer hallucinations, 60% fewer unnecessary escalations.
2. tool design for context efficiency
here's a mistake i made that cost me a week of debugging...
i built a research agent with these tools:
- `search_documents(query)` - returned full document text (avg 8k tokens)
- `get_document_by_id(id)` - also returned full text
- `list_documents()` - returned all 200 documents with metadata
agent would search once... boom, 8k tokens gone. search again... another 8k. three searches and we're at 24k tokens of mostly duplicate information.
better approach (what i should've done):
- `search_documents(query, max_results=3)` - returns only titles and summaries (200 tokens avg)
- `get_document_sections(id, sections)` - fetches specific sections, not the entire doc
- `search_within_document(id, query)` - targeted search within a known doc
same functionality. 90% less context pollution.
every tool call is a context tax. design tools that return minimal, high-value information.
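to make that concrete, here's a minimal sketch of what those leaner tools can look like (plain python - the in-memory store and field names are illustrative, not any particular framework's api):

```python
# minimal sketch of context-efficient tool returns. the in-memory DOCS dict is
# a stand-in for whatever document backend you actually use.
DOCS = {
    "doc-1": {
        "title": "refund policy",
        "summary": "30-day refund window, delivered items only.",
        "sections": {"eligibility": "...", "process": "..."},
    },
}

def search_documents(query: str, max_results: int = 3) -> list[dict]:
    """return titles + short summaries (~200 tokens each), never full documents."""
    hits = [
        {"doc_id": doc_id, "title": d["title"], "summary": d["summary"]}
        for doc_id, d in DOCS.items()
        if query.lower() in (d["title"] + " " + d["summary"]).lower()
    ]
    return hits[:max_results]

def get_document_sections(doc_id: str, sections: list[str]) -> dict:
    """fetch only the requested sections instead of the whole document."""
    doc = DOCS[doc_id]
    return {name: doc["sections"][name] for name in sections if name in doc["sections"]}
```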
3. the rag vs agentic search debate
this is where context engineering gets interesting...
traditional approach (rag - retrieval augmented generation):
- pre-compute embeddings for all your data
- when query comes in, find semantically similar chunks
- stuff relevant chunks into context
- hope you grabbed the right information
agentic search (what i'm using more lately):
- give agent tools to explore information space
- let it decide what to retrieve based on what it's learned
- iterative refinement - can go deeper if needed
- only loads what's actually needed
example from last month... building a legal document analyzer.
rag approach:
- user asks: "what are the termination clauses?"
- rag retrieves 5 chunks about "termination"
- includes termination of employees, termination of services, termination of agreement...
- context filled with 80% irrelevant info

agentic search approach:
- user asks: "what are the termination clauses?"
- agent uses `search_sections({"type": "termination_clause"}, contract_type="employment")`
- gets targeted results
- if ambiguous, can follow up with `get_related_clauses()`
- builds context incrementally
agentic search is slower (multiple llm calls) but way more accurate for complex queries.
my current rule: use rag for speed on simple queries, agentic search for accuracy on complex ones.
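in practice that rule can be a routing check up front. here's a sketch - the heuristic and both answer paths are placeholders for your own pipeline:

```python
# illustrative routing sketch. is_simple() is a toy heuristic, and the two answer
# paths are stubs standing in for your actual rag pipeline and agent loop.
def is_simple(query: str) -> bool:
    """toy heuristic: short, single-part questions go to rag."""
    return len(query.split()) < 15 and " and " not in query.lower()

def rag_answer(query: str) -> str:
    # single pass: embed the query, grab top-k chunks, answer in one model call
    return f"[rag path] {query}"

def agentic_answer(query: str) -> str:
    # iterative loop: the agent picks tools, inspects results, digs deeper if needed
    return f"[agentic path] {query}"

def answer(query: str) -> str:
    return rag_answer(query) if is_simple(query) else agentic_answer(query)
```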
4. context compaction for long-running agents
this is critical for any agent that runs longer than 10-15 interactions...
basic idea: when approaching context limits, summarize the conversation and start fresh.
sounds simple. execution is nuanced.
naive compaction (what i did first):
"summarize the conversation so far in 500 tokens"
result: lost critical details. agent forgot important context from early conversations.
better compaction (learned from anthropic's approach):
```
create a structured summary preserving:
- key decisions made
- open questions/blockers
- important facts established
- relevant file/data references
- user preferences revealed

keep the most recent 5 messages in full (recent context matters most)
compress everything before that into the structured summary
```
working on a code migration agent, i implemented this compaction strategy. the agent maintained coherence across 4-hour sessions spanning 100+ file modifications.
anthropic's claude code uses a similar approach - they preserve recent messages and architectural decisions while compressing tool outputs.
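here's a minimal sketch of that compaction loop (summarize() stands in for a real llm call using the structured format above):

```python
# minimal compaction sketch: keep the last few messages verbatim, compress the
# rest into a structured summary. summarize() is a placeholder for your llm call.
KEEP_RECENT = 5
COMPACT_AT = 0.75  # trigger when ~75% of the context budget is used

def summarize(messages: list[dict]) -> str:
    # placeholder: in practice, one llm call asking for the structured summary
    # format above (decisions, blockers, facts, references, preferences)
    return "\n".join(m["content"][:200] for m in messages)

def maybe_compact(messages: list[dict], tokens_used: int, token_budget: int) -> list[dict]:
    if tokens_used < token_budget * COMPACT_AT:
        return messages  # plenty of room left, nothing to do
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    if not old:
        return messages
    summary = summarize(old)
    return [{"role": "user", "content": f"summary of the conversation so far:\n{summary}"}] + recent
```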
5. structured note-taking (agentic memory)
this technique is underused but incredibly powerful...
instead of keeping everything in context, let the agent write persistent notes.
example from a customer onboarding agent i built:
```markdown
# customer_profile.md

## company context
- industry: fintech (specifically crypto trading)
- team size: 47 engineers
- current pain point: api latency during high volume

## technical environment
- stack: python/fastapi + react
- cloud: aws (considering gcp migration)
- database: postgresql + redis

## decision makers
- cto (technical decisions): prefers pragmatic solutions over cutting-edge
- vp eng (budget): cost-conscious, needs clear roi

## preferences discovered
- likes: specific metrics, code examples, direct communication
- dislikes: buzzwords, vague timelines
- communication style: technical depth over high-level pitches
```
this lives outside the context window. agent reads it at start of each session, updates it as it learns new information.
result: continuity across sessions without burning context tokens on repeated information.
anthropic recently launched a memory tool in their api that makes this pattern even easier to implement.
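if you want to roll the pattern yourself, it's only a few lines. a rough sketch - the file path and tool name are just examples:

```python
# rough sketch of file-based agentic memory: read notes at session start, expose
# an update tool so the agent can write back what it learns. the path and prompt
# wording are just examples.
from pathlib import Path

NOTES_PATH = Path("customer_profile.md")

def read_notes() -> str:
    """load persistent notes for the system prompt at session start."""
    return NOTES_PATH.read_text() if NOTES_PATH.exists() else "(no notes yet)"

def update_notes(new_content: str) -> str:
    """exposed to the agent as a tool: overwrite the notes with an updated version."""
    NOTES_PATH.write_text(new_content)
    return "notes updated"

system_prompt = (
    "you are a customer onboarding agent.\n\n"
    "what you already know about this customer:\n"
    f"{read_notes()}\n\n"
    "call update_notes whenever you learn something worth remembering."
)
```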
6. multi-agent architectures for context isolation
when tasks get complex enough, single-agent context management becomes untenable.
solution: specialized sub-agents with isolated contexts.
built a market research system last month:
- orchestrator agent: maintains high-level plan and coordinates subagents
- search agent: explores information sources, returns summaries
- analysis agent: deep dives on specific data points
- synthesis agent: combines findings into final report
each subagent might use 50k+ tokens during its work. but it only returns 1-2k tokens of distilled insights to the orchestrator.
this is what anthropic calls "context isolation" - detailed work happens in subagent contexts, orchestrator maintains only high-level state.
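stripped to its skeleton, the pattern looks something like this (run_subagent is a stub standing in for a full agent loop with its own context and tools):

```python
# context-isolation sketch: each subagent works in its own (potentially 50k+ token)
# context and hands back only a short digest. run_subagent() is a stub for a real
# agent loop with a dedicated system prompt and tools.
def run_subagent(role: str, task: str) -> str:
    # real version: spin up a fresh message history, let the subagent use its tools,
    # then return a 1-2k token digest of what it found
    return f"[{role} digest] {task}"

def research(topic: str) -> str:
    # the orchestrator only ever holds high-level state plus the digests
    findings = run_subagent("search", f"find and summarize sources on {topic}")
    analysis = run_subagent("analysis", f"deep dive on the key data points in: {findings}")
    report = run_subagent("synthesis", f"combine into a final report: {analysis}")
    return report
```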
the system analyzed 200+ research papers in one session. a single agent would've choked at ~30 papers.
common mistakes (that i've made, multiple times)
mistake 1: stuffing everything into system prompt
saw a 12k token system prompt last week. included:
- complete product documentation
- all company policies
- 50+ example conversations
- detailed api specifications
agent was slow, confused, and expensive.
better: minimal system prompt + tools to retrieve specific information when needed.
mistake 2: not clearing old tool results
agent searches database, gets 200 rows back (5k tokens).
analyzes them.
those rows stay in context forever.
five searches later: 25k tokens of stale database results.
anthropic just launched "tool result clearing" - it automatically removes old tool outputs while keeping the fact that the tool was called.
implement this pattern. your context will thank you.
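even without a platform feature, you can get most of the benefit by pruning stale tool results yourself before each model call. roughly like this - the message shape is generic, not any specific sdk's:

```python
# sketch: keep only the newest tool results verbatim and replace older ones with a
# short stub. the message shape here is generic, not any specific sdk's format.
KEEP_LAST_TOOL_RESULTS = 2

def prune_tool_results(messages: list[dict]) -> list[dict]:
    tool_indexes = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(tool_indexes[:-KEEP_LAST_TOOL_RESULTS])
    return [
        {"role": "tool", "content": "[result cleared - tool was called and already used]"}
        if i in stale else m
        for i, m in enumerate(messages)
    ]
```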
mistake 3: ignoring recency bias
fun fact: llms pay more attention to information at the start and end of context.
middle? gets lost.
structure your context strategically:
- start: system instructions, critical constraints
- middle: reference information, examples (can be lossy)
- end: current task, recent messages, immediate context
made this change to a document analysis agent - accuracy improved 23% on long documents.
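structurally, this is just about how you order the pieces when assembling the prompt. a sketch:

```python
# sketch: put critical instructions at the start, lossy reference material in the
# middle, and the current task at the end. purely illustrative.
def build_context(system_rules: str, reference_docs: list[str],
                  recent_messages: list[str], task: str) -> str:
    parts = [
        system_rules,                    # start: instructions + hard constraints
        "\n\n".join(reference_docs),     # middle: reference info (ok if it gets skimmed)
        "\n".join(recent_messages),      # end: recent conversation
        f"current task: {task}",         # end: what to do right now
    ]
    return "\n\n---\n\n".join(parts)
```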
mistake 4: premature optimization
spent two weeks building a sophisticated caching system for frequently-accessed context...
then realized 90% of queries were unique. cache hit rate: 12%.
measure before you optimize. most context engineering is about good defaults, not clever tricks.
practical implementation: a context engineering checklist
when i start a new agent project, i now run through this checklist:
phase 1: design (before writing code)
- define context budget
  - how much context do i have? (model's limit)
  - how much do i actually want to use? (aim for 60-70% max)
  - what's my average interaction length?
- map information sources
  - what data does agent need access to?
  - what's static vs dynamic?
  - what can be retrieved just-in-time vs preloaded?
- design tool returns
  - each tool should return minimal, structured info
  - avoid returning full documents/datasets
  - include pagination for large results
phase 2: implementation
- write minimal system prompt
  - test with bare minimum first
  - add specificity only where agent fails
  - use structured sections (xml tags or markdown headers)
- implement context monitoring (see the sketch after this phase)
  - track tokens used per interaction
  - alert when approaching limits
  - log context at each agent decision point
- add compaction strategy
  - trigger at 70-80% of context limit
  - preserve recent messages + structured summary
  - test that agent maintains coherence post-compaction
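for the monitoring step, something as simple as this sketch goes a long way (the 4-characters-per-token estimate is a rough stand-in for your provider's actual token counter):

```python
# sketch of a simple context monitor: estimate usage before each model call and
# flag when it's time to compact. the 4-chars-per-token estimate is a rough
# stand-in for your provider's real token counter.
CONTEXT_LIMIT = 200_000
COMPACT_THRESHOLD = 0.75

def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"]) // 4 for m in messages)

def check_context(messages: list[dict]) -> bool:
    """log usage and return True when it's time to compact."""
    used = estimate_tokens(messages)
    utilization = used / CONTEXT_LIMIT
    print(f"context usage: {used} tokens ({utilization:.0%} of limit)")
    return utilization >= COMPACT_THRESHOLD
```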
phase 3: optimization (after initial deployment)
- analyze context usage patterns
  - which tools return most tokens?
  - where does context get wasted?
  - what information is rarely used?
- refine based on failure modes
  - when agent hallucinates, what context was it missing?
  - when agent gets confused, what context was noisy?
- iterate on information density
measuring success: metrics that matter
you can't improve what you don't measure. here's what i track:
context efficiency metrics
- average tokens per interaction - should stay well below limit
- context utilization rate - percentage of context containing relevant info
- token cost per successful outcome - measures efficiency of achieving goals
quality metrics
- hallucination rate - often correlates with context pollution
- task completion rate - are you providing enough context?
- coherence across compactions - does agent maintain state after summarization?
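none of this needs fancy tooling to start - one log line per interaction is enough. a sketch (field names and the cost rate are illustrative):

```python
# sketch: log the handful of numbers that matter, once per interaction.
# field names and the cost rate are illustrative, not from any particular provider.
import json
import time

COST_PER_1K_TOKENS = 0.01  # placeholder rate

def log_interaction(tokens_used: int, context_limit: int,
                    task_succeeded: bool, hallucinated: bool) -> None:
    record = {
        "ts": time.time(),
        "tokens": tokens_used,
        "utilization": round(tokens_used / context_limit, 3),
        "cost": round(tokens_used / 1000 * COST_PER_1K_TOKENS, 4),
        "success": task_succeeded,
        "hallucinated": hallucinated,
    }
    print(json.dumps(record))  # in production, ship this to your metrics store
```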
real numbers from production
customer service agent before context optimization:
- avg tokens: 85k per session
- hallucination rate: 12%
- cost per interaction: $0.34
after optimization:
- avg tokens: 32k per session (62% reduction)
- hallucination rate: 4% (67% improvement)
- cost per interaction: $0.11 (68% reduction)
same functionality. better performance. way cheaper.
the future of context engineering
context windows are getting bigger. claude 3.5 sonnet has 200k tokens. gemini 1.5 pro has 1 million. gpt-4 turbo has 128k.
does this make context engineering obsolete?
no. actually makes it more important.
larger windows mean more opportunities to pollute context. context rot still happens - just at higher token counts.
based on anthropic's research and my own testing, models still perform best with focused, well-curated context regardless of window size.
the future isn't about stuffing more into context. it's about smarter context management at any scale.
key takeaways (if you remember nothing else)
- context is finite - treat it like a precious resource, even with large windows
- less is more - smallest possible set of high-signal tokens wins
- design for context efficiency - tools should return minimal, structured info
- implement compaction early - don't wait until you hit limits
- measure everything - track tokens, costs, quality metrics
- agentic search > rag - for complex queries where accuracy matters
- use structured memory - persistent notes outside context window
- consider multi-agent - for complex tasks requiring deep exploration
resources for going deeper
if you want to dive deeper into context engineering:
- anthropic's context engineering guide - comprehensive technical deep-dive
- microsoft's context engineering course - practical implementation strategies
- my claude agent sdk guide - related agent development patterns
related reading
if you found this helpful, you might also enjoy:
- Building Agents with Claude Agent SDK - practical agent development with the agent loop
- AI Agents: The Complete Guide - foundational concepts and architecture
- LangGraph State Management - building stateful agents with memory
struggling with context management in your ai agents? i help companies design and implement production-grade context engineering strategies. let's talk about your specific challenges. schedule a conversation