here's something nobody tells you when you start building ai agents: getting the model to work is easy. getting it to work reliably is where most projects die.
and in my experience (building agents for 8+ clients over the past two years), the difference between "works in demo" and "works in production" usually comes down to one thing:
how you manage context.
wait, what even is context engineering?
if you've been following ai development, you've heard about prompt engineering... crafting the perfect instructions to get llms to do what you want.
context engineering is different. it's about managing the entire information landscape your agent operates in.
think of it this way:
- prompt engineering = writing good instructions
- context engineering = deciding what information your agent sees, when it sees it, and how much of it survives to the next interaction
according to anthropic's recent engineering post, context engineering is about "optimizing the utility of tokens against the inherent constraints of llms."
(which is a fancy way of saying: you have limited space, use it wisely)
the problem nobody talks about: context rot
here's what i wish someone had told me two years ago...
llms don't just "run out of space" when you hit the context limit. they start getting confused way before that.
this is called context rot (closely related to the "lost in the middle" problem).
the technical reality
based on research from anthropic and others, here's what happens:
- llms use transformer architecture where every token "attends to" every other token
- this creates n² pairwise relationships for n tokens
- as context grows, the model's "attention budget" gets stretched thin
- performance degrades gradually, not suddenly
working on a customer service agent last quarter, i noticed something weird... accuracy was great at 20k tokens. still good at 40k. but at 60k? the agent started making mistakes on basic facts that were right there in the conversation history.
the context window was 200k tokens. we weren't even close to the limit. but the agent was already losing track.
this is context rot in action.
the fundamental principle (that changed how i build)
after a year of debugging context issues, i've internalized one core principle from anthropic's work:
find the smallest possible set of high-signal tokens that maximize the likelihood of your desired outcome
sounds simple. incredibly hard to implement.
let me show you what this means in practice...
context engineering in practice: what actually works
1. system prompts at the "right altitude"
this one took me forever to figure out.
your system prompt needs to be specific enough to guide behavior, but general enough to let the model reason flexibly.
too low altitude (overly specific):
```
if user asks about refunds and order date > 30 days ago, say "outside refund window"
if user asks about refunds and order date < 30 days, check order status
if order status is "delivered" and < 7 days, process refund
if order status is "delivered" and >= 7 days, escalate to manager
...
```
this breaks constantly. every edge case needs a new rule.
too high altitude (too vague):
```
you are a helpful customer service agent. help customers with their requests.
```
this is useless. no guidance on policies, escalation, or handling edge cases.
right altitude (goldilocks zone):
```
you handle customer support with access to order data and refund tools.

## refund policy
- standard window: 30 days from order date
- condition: product must be delivered
- if eligible: process immediately
- if edge case: escalate with context

## your approach
1. gather relevant order information
2. check policy eligibility
3. take appropriate action
4. explain decision clearly

use tools to check actual dates and status rather than relying on customer statements.
```
this gives clear guidance while letting the agent reason about specific situations.
result: 40% fewer hallucinations, 60% fewer unnecessary escalations.
2. tool design for context efficiency
here's a mistake i made that cost me a week of debugging...
i built a research agent with these tools:
- `search_documents(query)` - returned full document text (avg 8k tokens)
- `get_document_by_id(id)` - also returned full text
- `list_documents()` - returned all 200 documents with metadata
agent would search once... boom, 8k tokens gone. search again... another 8k. three searches and we're at 24k tokens of mostly duplicate information.
better approach (what i should've done):
- `search_documents(query, max_results=3)` - returns only titles and summaries (200 tokens avg)
- `get_document_sections(id, sections)` - fetches specific sections, not the entire doc
- `search_within_document(id, query)` - targeted search within a known doc
same functionality. 90% less context pollution.
every tool call is a context tax. design tools that return minimal, high-value information.
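to make that concrete, here's a minimal sketch of what those leaner tools can look like (plain python - the in-memory store and field names are illustrative, not any particular framework's api):

```python
# minimal sketch of context-efficient tool returns. the in-memory DOCS dict is
# a stand-in for whatever document backend you actually use.
DOCS = {
    "doc-1": {
        "title": "refund policy",
        "summary": "30-day refund window, delivered items only.",
        "sections": {"eligibility": "...", "process": "..."},
    },
}

def search_documents(query: str, max_results: int = 3) -> list[dict]:
    """return titles + short summaries (~200 tokens each), never full documents."""
    hits = [
        {"doc_id": doc_id, "title": d["title"], "summary": d["summary"]}
        for doc_id, d in DOCS.items()
        if query.lower() in (d["title"] + " " + d["summary"]).lower()
    ]
    return hits[:max_results]

def get_document_sections(doc_id: str, sections: list[str]) -> dict:
    """fetch only the requested sections instead of the whole document."""
    doc = DOCS[doc_id]
    return {name: doc["sections"][name] for name in sections if name in doc["sections"]}
```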
3. the rag vs agentic search debate
this is where context engineering gets interesting...
traditional approach (rag - retrieval augmented generation):
- pre-compute embeddings for all your data
- when query comes in, find semantically similar chunks
- stuff relevant chunks into context
- hope you grabbed the right information
agentic search (what i'm using more lately):
- give agent tools to explore information space
- let it decide what to retrieve based on what it's learned
- iterative refinement - can go deeper if needed
- only loads what's actually needed
example from last month... building a legal document analyzer.
rag approach:
- user asks: "what are the termination clauses?"
- rag retrieves 5 chunks about "termination"
- includes termination of employees, termination of services, termination of agreement...
- context filled with 80% irrelevant info

agentic search approach:
- user asks: "what are the termination clauses?"
- agent uses `search_sections({"type": "termination_clause"}, contract_type="employment")`
- gets targeted results
- if ambiguous, can follow up with `get_related_clauses()`
- builds context incrementally
agentic search is slower (multiple llm calls) but way more accurate for complex queries.
my current rule: use rag for speed on simple queries, agentic search for accuracy on complex ones.
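in practice that rule can be a routing check up front. here's a sketch - the heuristic and both answer paths are placeholders for your own pipeline:

```python
# illustrative routing sketch. is_simple() is a toy heuristic, and the two answer
# paths are stubs standing in for your actual rag pipeline and agent loop.
def is_simple(query: str) -> bool:
    """toy heuristic: short, single-part questions go to rag."""
    return len(query.split()) < 15 and " and " not in query.lower()

def rag_answer(query: str) -> str:
    # single pass: embed the query, grab top-k chunks, answer in one model call
    return f"[rag path] {query}"

def agentic_answer(query: str) -> str:
    # iterative loop: the agent picks tools, inspects results, digs deeper if needed
    return f"[agentic path] {query}"

def answer(query: str) -> str:
    return rag_answer(query) if is_simple(query) else agentic_answer(query)
```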
4. context compaction for long-running agents
this is critical for any agent that runs longer than 10-15 interactions...
basic idea: when approaching context limits, summarize the conversation and start fresh.
sounds simple. execution is nuanced.
naive compaction (what i did first):
"summarize the conversation so far in 500 tokens"
result: lost critical details. agent forgot important context from early conversations.
better compaction (learned from anthropic's approach):
```
create a structured summary preserving:
- key decisions made
- open questions/blockers
- important facts established
- relevant file/data references
- user preferences revealed

keep the most recent 5 messages in full (recent context matters most)
compress everything before that into the structured summary
```
working on a code migration agent, i implemented this compaction strategy. the agent maintained coherence across 4-hour sessions spanning 100+ file modifications.
anthropic's claude code uses a similar approach - they preserve recent messages and architectural decisions while compressing tool outputs.
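here's a minimal sketch of that compaction loop (summarize() stands in for a real llm call using the structured format above):

```python
# minimal compaction sketch: keep the last few messages verbatim, compress the
# rest into a structured summary. summarize() is a placeholder for your llm call.
KEEP_RECENT = 5
COMPACT_AT = 0.75  # trigger when ~75% of the context budget is used

def summarize(messages: list[dict]) -> str:
    # placeholder: in practice, one llm call asking for the structured summary
    # format above (decisions, blockers, facts, references, preferences)
    return "\n".join(m["content"][:200] for m in messages)

def maybe_compact(messages: list[dict], tokens_used: int, token_budget: int) -> list[dict]:
    if tokens_used < token_budget * COMPACT_AT:
        return messages  # plenty of room left, nothing to do
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    if not old:
        return messages
    summary = summarize(old)
    return [{"role": "user", "content": f"summary of the conversation so far:\n{summary}"}] + recent
```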
5. structured note-taking (agentic memory)
this technique is underused but incredibly powerful...
instead of keeping everything in context, let the agent write persistent notes.
example from a customer onboarding agent i built:
```markdown
# customer_profile.md

## company context
- industry: fintech (specifically crypto trading)
- team size: 47 engineers
- current pain point: api latency during high volume

## technical environment
- stack: python/fastapi + react
- cloud: aws (considering gcp migration)
- database: postgresql + redis

## decision makers
- cto (technical decisions): prefers pragmatic solutions over cutting-edge
- vp eng (budget): cost-conscious, needs clear roi

## preferences discovered
- likes: specific metrics, code examples, direct communication
- dislikes: buzzwords, vague timelines
- communication style: technical depth over high-level pitches
```
this lives outside the context window. agent reads it at start of each session, updates it as it learns new information.
result: continuity across sessions without burning context tokens on repeated information.
anthropic recently launched a memory tool in their api that makes this pattern even easier to implement.
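if you want to roll the pattern yourself, it's only a few lines. a rough sketch - the file path and tool name are just examples:

```python
# rough sketch of file-based agentic memory: read notes at session start, expose
# an update tool so the agent can write back what it learns. the path and prompt
# wording are just examples.
from pathlib import Path

NOTES_PATH = Path("customer_profile.md")

def read_notes() -> str:
    """load persistent notes for the system prompt at session start."""
    return NOTES_PATH.read_text() if NOTES_PATH.exists() else "(no notes yet)"

def update_notes(new_content: str) -> str:
    """exposed to the agent as a tool: overwrite the notes with an updated version."""
    NOTES_PATH.write_text(new_content)
    return "notes updated"

system_prompt = (
    "you are a customer onboarding agent.\n\n"
    "what you already know about this customer:\n"
    f"{read_notes()}\n\n"
    "call update_notes whenever you learn something worth remembering."
)
```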
6. multi-agent architectures for context isolation
when tasks get complex enough, single-agent context management becomes untenable.
solution: specialized sub-agents with isolated contexts.
built a market research system last month:
- orchestrator agent: maintains high-level plan and coordinates subagents
- search agent: explores information sources, returns summaries
- analysis agent: deep dives on specific data points
- synthesis agent: combines findings into final report
each subagent might use 50k+ tokens during its work. but it only returns 1-2k tokens of distilled insights to the orchestrator.
this is what anthropic calls "context isolation" - detailed work happens in subagent contexts, orchestrator maintains only high-level state.
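stripped to its skeleton, the pattern looks something like this (run_subagent is a stub standing in for a full agent loop with its own context and tools):

```python
# context-isolation sketch: each subagent works in its own (potentially 50k+ token)
# context and hands back only a short digest. run_subagent() is a stub for a real
# agent loop with a dedicated system prompt and tools.
def run_subagent(role: str, task: str) -> str:
    # real version: spin up a fresh message history, let the subagent use its tools,
    # then return a 1-2k token digest of what it found
    return f"[{role} digest] {task}"

def research(topic: str) -> str:
    # the orchestrator only ever holds high-level state plus the digests
    findings = run_subagent("search", f"find and summarize sources on {topic}")
    analysis = run_subagent("analysis", f"deep dive on the key data points in: {findings}")
    report = run_subagent("synthesis", f"combine into a final report: {analysis}")
    return report
```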
the system analyzed 200+ research papers in one session. a single agent would've choked at ~30 papers.
common mistakes (that i've made, multiple times)
mistake 1: stuffing everything into system prompt
saw a 12k token system prompt last week. included:
- complete product documentation
- all company policies
- 50+ example conversations
- detailed api specifications
agent was slow, confused, and expensive.
better: minimal system prompt + tools to retrieve specific information when needed.
mistake 2: not clearing old tool results
agent searches database, gets 200 rows back (5k tokens).
analyzes them.
those rows stay in context forever.
five searches later: 25k tokens of stale database results.
anthropic just launched "tool result clearing" - it automatically removes old tool outputs while keeping the fact that the tool was called.
implement this pattern. your context will thank you.
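even without a platform feature, you can get most of the benefit by pruning stale tool results yourself before each model call. roughly like this - the message shape is generic, not any specific sdk's:

```python
# sketch: keep only the newest tool results verbatim and replace older ones with a
# short stub. the message shape here is generic, not any specific sdk's format.
KEEP_LAST_TOOL_RESULTS = 2

def prune_tool_results(messages: list[dict]) -> list[dict]:
    tool_indexes = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(tool_indexes[:-KEEP_LAST_TOOL_RESULTS])
    return [
        {"role": "tool", "content": "[result cleared - tool was called and already used]"}
        if i in stale else m
        for i, m in enumerate(messages)
    ]
```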
mistake 3: ignoring recency bias
fun fact: llms pay more attention to information at the start and end of context.
middle? gets lost.
structure your context strategically:
- start: system instructions, critical constraints
- middle: reference information, examples (can be lossy)
- end: current task, recent messages, immediate context
made this change to a document analysis agent - accuracy improved 23% on long documents.
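structurally, this is just about how you order the pieces when assembling the prompt. a sketch:

```python
# sketch: put critical instructions at the start, lossy reference material in the
# middle, and the current task at the end. purely illustrative.
def build_context(system_rules: str, reference_docs: list[str],
                  recent_messages: list[str], task: str) -> str:
    parts = [
        system_rules,                    # start: instructions + hard constraints
        "\n\n".join(reference_docs),     # middle: reference info (ok if it gets skimmed)
        "\n".join(recent_messages),      # end: recent conversation
        f"current task: {task}",         # end: what to do right now
    ]
    return "\n\n---\n\n".join(parts)
```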
mistake 4: premature optimization
spent two weeks building a sophisticated caching system for frequently-accessed context...
then realized 90% of queries were unique. cache hit rate: 12%.
measure before you optimize. most context engineering is about good defaults, not clever tricks.
practical implementation: a context engineering checklist
when i start a new agent project, i now run through this checklist:
phase 1: design (before writing code)
- define context budget
  - how much context do i have? (model's limit)
  - how much do i actually want to use? (aim for 60-70% max)
  - what's my average interaction length?
- map information sources
  - what data does agent need access to?
  - what's static vs dynamic?
  - what can be retrieved just-in-time vs preloaded?
- design tool returns
  - each tool should return minimal, structured info
  - avoid returning full documents/datasets
  - include pagination for large results
phase 2: implementation
- write minimal system prompt
  - test with bare minimum first
  - add specificity only where agent fails
  - use structured sections (xml tags or markdown headers)
- implement context monitoring (see the sketch after this phase)
  - track tokens used per interaction
  - alert when approaching limits
  - log context at each agent decision point
- add compaction strategy
  - trigger at 70-80% of context limit
  - preserve recent messages + structured summary
  - test that agent maintains coherence post-compaction
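for the monitoring step, something as simple as this sketch goes a long way (the 4-characters-per-token estimate is a rough stand-in for your provider's actual token counter):

```python
# sketch of a simple context monitor: estimate usage before each model call and
# flag when it's time to compact. the 4-chars-per-token estimate is a rough
# stand-in for your provider's real token counter.
CONTEXT_LIMIT = 200_000
COMPACT_THRESHOLD = 0.75

def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"]) // 4 for m in messages)

def check_context(messages: list[dict]) -> bool:
    """log usage and return True when it's time to compact."""
    used = estimate_tokens(messages)
    utilization = used / CONTEXT_LIMIT
    print(f"context usage: {used} tokens ({utilization:.0%} of limit)")
    return utilization >= COMPACT_THRESHOLD
```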
phase 3: optimization (after initial deployment)
- analyze context usage patterns
  - which tools return most tokens?
  - where does context get wasted?
  - what information is rarely used?
- refine based on failure modes
  - when agent hallucinates, what context was it missing?
  - when agent gets confused, what context was noisy?
- iterate on information density
measuring success: metrics that matter
you can't improve what you don't measure. here's what i track:
context efficiency metrics
- average tokens per interaction - should stay well below limit
- context utilization rate - percentage of context containing relevant info
- token cost per successful outcome - measures efficiency of achieving goals
quality metrics
- hallucination rate - often correlates with context pollution
- task completion rate - are you providing enough context?
- coherence across compactions - does agent maintain state after summarization?
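none of this needs fancy tooling to start - one log line per interaction is enough. a sketch (field names and the cost rate are illustrative):

```python
# sketch: log the handful of numbers that matter, once per interaction.
# field names and the cost rate are illustrative, not from any particular provider.
import json
import time

COST_PER_1K_TOKENS = 0.01  # placeholder rate

def log_interaction(tokens_used: int, context_limit: int,
                    task_succeeded: bool, hallucinated: bool) -> None:
    record = {
        "ts": time.time(),
        "tokens": tokens_used,
        "utilization": round(tokens_used / context_limit, 3),
        "cost": round(tokens_used / 1000 * COST_PER_1K_TOKENS, 4),
        "success": task_succeeded,
        "hallucinated": hallucinated,
    }
    print(json.dumps(record))  # in production, ship this to your metrics store
```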
real numbers from production
customer service agent before context optimization:
- avg tokens: 85k per session
- hallucination rate: 12%
- cost per interaction: $0.34
after optimization:
- avg tokens: 32k per session (62% reduction)
- hallucination rate: 4% (67% improvement)
- cost per interaction: $0.11 (68% reduction)
same functionality. better performance. way cheaper.
the future of context engineering
context windows are getting bigger. claude 3.5 sonnet has 200k tokens. gemini 1.5 pro has 1 million. gpt-4 turbo has 128k.
does this make context engineering obsolete?
no. actually makes it more important.
larger windows mean more opportunities to pollute context. context rot still happens - just at higher token counts.
based on anthropic's research and my own testing, models still perform best with focused, well-curated context regardless of window size.
the future isn't about stuffing more into context. it's about smarter context management at any scale.
key takeaways (if you remember nothing else)
- context is finite - treat it like a precious resource, even with large windows
- less is more - smallest possible set of high-signal tokens wins
- design for context efficiency - tools should return minimal, structured info
- implement compaction early - don't wait until you hit limits
- measure everything - track tokens, costs, quality metrics
- agentic search > rag - for complex queries where accuracy matters
- use structured memory - persistent notes outside context window
- consider multi-agent - for complex tasks requiring deep exploration
resources for going deeper
if you want to dive deeper into context engineering:
- anthropic's context engineering guide - comprehensive technical deep-dive
- microsoft's context engineering course - practical implementation strategies
- my claude agent sdk guide - related agent development patterns
related reading
if you found this helpful, you might also enjoy:
- Building Agents with Claude Agent SDK - practical agent development with the agent loop
- AI Agents: The Complete Guide - foundational concepts and architecture
- LangGraph State Management - building stateful agents with memory
struggling with context management in your ai agents? i help companies design and implement production-grade context engineering strategies. let's talk about your specific challenges. schedule a conversation