
context engineering for agents: the practical guide nobody writes

Aankit Roy
16 min read
Context Engineering · AI Agents · Production AI · Agent Memory · Multi-Agent Systems

context engineering isn't just about fitting more into a context window. it's about understanding what to include, what to exclude, and how to structure information so your agent actually works in production.

the LangChain blog on context engineering lays out four core strategies: write, select, compress, and isolate. these are the right categories. but understanding the nuances of when and how to apply them makes the difference between agents that work and agents that don't scale.

this post digs deeper into the practical implications of each strategy, based on what we can learn from production systems like Claude Code, ChatGPT, Cursor, and Windsurf, as well as research from Anthropic, Cognition, and others building at the frontier.

why context engineering matters

as the LangChain article points out, Andrej Karpathy's analogy is perfect: think of the LLM as a CPU and the context window as RAM. context engineering is about managing that limited working memory.

here's what makes this critical for agents specifically: agents accumulate context rapidly. every tool call adds feedback. every turn adds messages. long-running tasks can easily hit hundreds of thousands of tokens.

Drew Breunig identified four specific failure modes that longer context causes:

  • context poisoning: when a hallucination makes it into context and compounds
  • context distraction: when the context overwhelms the model's training
  • context confusion: when superfluous context influences the response
  • context clash: when parts of the context disagree with each other

both Anthropic and Cognition have called out context engineering as the "#1 job of engineers building AI agents." poor context management is often the primary bottleneck in agent reliability.

strategy 1: writing context (scratchpads and memories)

the first strategy is about writing context—saving it outside the context window for later use.

scratchpads: how anthropic's multi-agent researcher uses them

according to the LangChain article, Anthropic's multi-agent researcher provides a clear example. the LeadResearcher agent "begins by thinking through the approach and saving its plan to Memory to persist the context, since if the context window exceeds 200,000 tokens it will be truncated and it is important to retain the plan."

this reveals something important: even with 200k token windows, you can't rely on keeping everything in context. critical information needs to be explicitly saved.

scratchpads can be implemented as:

  • tool calls that write to files
  • fields in a runtime state object that persists during the session
  • external databases or storage systems
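
as a concrete example of the first option, here's a minimal sketch of a file-backed scratchpad. the function names and file path are illustrative, not taken from Anthropic's system:

```python
import json
from pathlib import Path

SCRATCHPAD = Path("scratchpad.json")  # illustrative location

def save_plan(plan: str) -> str:
    """persist the agent's plan outside the context window."""
    SCRATCHPAD.write_text(json.dumps({"plan": plan}))
    return "plan saved"

def load_plan() -> str:
    """re-read the plan after the context has been truncated or compacted."""
    if SCRATCHPAD.exists():
        return json.loads(SCRATCHPAD.read_text())["plan"]
    return ""
```

exposed as tool calls, these let the agent write its plan once and recover it later, instead of hoping it survives truncation in the message history.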

memory systems: what production systems teach us

several production agents now have memory capabilities:

  • ChatGPT: stores user-specific memories across sessions
  • Cursor and Windsurf: use rules files for procedural memories
  • Claude Code: uses CLAUDE.md for project-specific context

these systems distinguish between different memory types:

  • episodic memories: examples of desired behavior (few-shot examples)
  • procedural memories: instructions and rules to steer behavior
  • semantic memories: facts relevant to the task
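
one way to make the distinction concrete is to tag stored memories by type and route each type to a different part of the prompt. a hypothetical sketch (the routing scheme is an assumption, not how any of the systems above work internally):

```python
from dataclasses import dataclass
from typing import Literal

MemoryType = Literal["episodic", "procedural", "semantic"]

@dataclass
class Memory:
    type: MemoryType
    content: str

def assemble_context(memories: list[Memory]) -> dict:
    """route each memory type to where it belongs in the prompt."""
    return {
        # procedural memories steer behavior -> system prompt / rules
        "system": [m.content for m in memories if m.type == "procedural"],
        # episodic memories become few-shot examples
        "examples": [m.content for m in memories if m.type == "episodic"],
        # semantic memories are injected as task-relevant facts
        "facts": [m.content for m in memories if m.type == "semantic"],
    }
```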

the research cited in the LangChain article includes concepts from Reflexion (self-generated memories after each turn) and Generative Agents (synthesizing memories periodically from collections of past feedback).

memory selection becomes critical when you have large collections. Simon Willison shared an example at the AI Engineer World's Fair: ChatGPT fetched his location from memories and unexpectedly injected it into a requested image. this type of unexpected retrieval can make users feel like the context window "no longer belongs to them."

strategy 2: selecting context

selecting context means pulling it into the context window to help an agent perform a task.

tool selection: the RAG for tools approach

agents can become overloaded if provided with too many tools. tool descriptions often overlap, causing model confusion about which tool to use.

the LangChain article cites research showing that applying RAG (retrieval augmented generation) to tool descriptions—fetching only the most relevant tools for a task—can yield a 3-fold improvement in tool selection accuracy.

LangGraph's Bigtool library implements this approach: use semantic search over tool descriptions to select the most relevant subset for each task.
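
the underlying idea is easy to sketch without any particular framework. this is a generic illustration, not the Bigtool API; `embed` stands in for whatever embedding model you use:

```python
import numpy as np

def select_tools(task: str, tools: list[dict], embed, k: int = 5) -> list[dict]:
    """RAG over tool descriptions: return the k tools most relevant to the task.

    embed is a stand-in for any text-embedding function (str -> np.ndarray).
    """
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    task_vec = embed(task)
    scored = [(cosine(task_vec, embed(t["description"])), t) for t in tools]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:k]]
```

only the selected subset gets bound to the model for that turn. in practice you'd precompute and cache the description embeddings rather than re-embedding on every call.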

knowledge retrieval: what windsurf learned about code agents

Varun from Windsurf captured some key challenges with RAG for code agents:

"Indexing code ≠ context retrieval ... [We are doing indexing & embedding search ... [with] AST parsing code and chunking along semantically meaningful boundaries ... embedding search becomes unreliable as a retrieval heuristic as the size of the codebase grows ... we must rely on a combination of techniques like grep/file search, knowledge graph based retrieval, and ... a re-ranking step where [context] is ranked in order of relevance."

this highlights that simple embedding search often isn't enough. production systems combine:

  • embedding-based semantic search
  • keyword search (grep/file search)
  • knowledge graph retrieval
  • re-ranking steps
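
a hedged sketch of how those pieces compose; the individual retrievers and the re-ranker are stand-ins for whatever you actually run:

```python
from typing import Callable

Retriever = Callable[[str], list[str]]

def hybrid_retrieve(
    query: str,
    retrievers: list[Retriever],  # e.g. embedding search, grep, graph lookup
    rerank: Callable[[str, list[str]], list[str]],  # relevance re-ranker
    k: int = 10,
) -> list[str]:
    """union candidates from several retrievers, then re-rank and truncate."""
    candidates: list[str] = []
    seen: set[str] = set()
    for retriever in retrievers:
        for doc in retriever(query):
            if doc not in seen:  # dedupe across retrievers
                seen.add(doc)
                candidates.append(doc)
    # the re-ranking step is what makes the union usable:
    # order everything by relevance to the query, keep only the top k
    return rerank(query, candidates)[:k]
```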

strategy 3: compressing context

compressing context involves retaining only the tokens required to perform a task.

context summarization: how claude code does it

if you've used Claude Code, you've seen this in action. according to the LangChain article, Claude Code runs "auto-compact" after you exceed 95% of the context window and summarizes the full trajectory of user-agent interactions.

summarization can be applied at different points:

  • trajectory-level: compress entire conversation histories
  • tool-level: post-process token-heavy tool calls
  • agent-to-agent: summarize at handoff boundaries in multi-agent systems

Cognition mentioned using a fine-tuned model for summarization to ensure specific events and decisions are captured accurately.
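
a minimal sketch of threshold-triggered, trajectory-level compaction in the spirit of auto-compact; `count_tokens` and `llm_summarize` are stand-ins, and the 95% threshold comes from the Claude Code behavior described above:

```python
def maybe_compact(
    messages: list[dict],
    count_tokens,    # stand-in: tokenizer over the full message list
    llm_summarize,   # stand-in: LLM summarization call
    window: int = 200_000,
    threshold: float = 0.95,
) -> list[dict]:
    """summarize the full trajectory once usage crosses the threshold."""
    if count_tokens(messages) < threshold * window:
        return messages  # plenty of room left; do nothing
    summary = llm_summarize(messages)
    # keep the system prompt, replace everything else with the summary
    return [
        messages[0],
        {"role": "user", "content": f"summary of prior work:\n{summary}"},
    ]
```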

context trimming: the simpler alternative

as Drew Breunig points out, trimming can often filter or "prune" context without needing an LLM. this can use hard-coded heuristics like removing older messages from a list.

there's also research on trained context pruners like Provence (for Question-Answering) that can learn what to remove while maintaining performance.
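
the heuristic version is almost trivially simple. a sketch, where keeping the last 20 messages is an arbitrary illustrative choice:

```python
def trim_messages(messages: list[dict], keep_last: int = 20) -> list[dict]:
    """hard-coded pruning: keep the system prompt plus the most recent turns."""
    if len(messages) <= keep_last + 1:
        return messages
    return [messages[0]] + messages[-keep_last:]
```

no LLM call, so it's cheap and deterministic; the trade-off is that anything older than the cutoff is simply gone.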

strategy 4: isolating context

isolating context involves splitting it up to help an agent perform a task.

multi-agent: the 15x token multiplier

one of the most popular ways to isolate context is to split it across sub-agents. OpenAI's Swarm library was motivated by this separation of concerns: a team of agents, each handling a specific sub-task.

Anthropic's multi-agent researcher reported that many agents with isolated contexts outperformed single-agent approaches because each subagent context window can be allocated to a more narrow sub-task.

however, there's a significant cost: Anthropic reported up to 15x more tokens than single-agent approaches.

challenges with multi-agent include:

  • token usage (that 15x multiplier)
  • need for careful prompt engineering to plan sub-agent work
  • coordination complexity
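
the isolation mechanic itself is simple to sketch; here `run_agent` stands in for one complete agent loop that returns a final answer:

```python
def run_subagents(subtasks: list[str], run_agent) -> list[str]:
    """give each sub-agent a fresh, isolated context containing only
    its own sub-task (no shared message history)."""
    return [
        run_agent([{"role": "user", "content": sub}])
        for sub in subtasks
    ]
```

each sub-agent explores in its own window, and only the final answers flow back to the lead agent. that's where the narrow-sub-task benefit comes from, and also where the token multiplier comes from: every sub-agent burns its own context.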

context isolation with environments

HuggingFace's deep researcher shows another approach. most agents use tool calling APIs that return JSON objects. HuggingFace uses a CodeAgent, which outputs code that runs in a sandbox.

as they noted, this allows for "better handling of state ... Need to store this image / audio / other for later use? No problem, just assign it as a variable in your state and you [use it later]."

this lets you isolate token-heavy objects in the environment, outside the LLM's context window.
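
a hypothetical sketch of the mechanic, not HuggingFace's implementation. heavy objects persist as sandbox variables; only a short confirmation string re-enters the context (`exec` here is just to show the state-persistence idea; a real system would use an actual sandbox such as E2B or Pyodide):

```python
class Sandbox:
    def __init__(self):
        self.state: dict = {}

    def run(self, code: str) -> str:
        """execute agent-written code; variables persist across calls."""
        exec(code, self.state)  # heavy results land in self.state
        names = sorted(k for k in self.state if not k.startswith("__"))
        # a token-cheap summary goes back into context instead of raw data
        return f"ok; variables in scope: {names}"
```

the agent can write `result = fetch_big_dataset()` in one step and reference `result` by name in a later one, without the payload itself ever entering the prompt.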

state-based isolation

an agent's runtime state object can serve a similar purpose. a state object can be designed with a schema that has fields where context can be written. one field (like messages) can be exposed to the LLM at each turn, while the schema isolates information in other fields for more selective use.

this provides the benefits of context isolation without the coordination overhead of multi-agent architectures.
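
a sketch of such a schema, in the TypedDict style commonly used for LangGraph state; the field names are illustrative:

```python
from typing import TypedDict

class AgentState(TypedDict):
    # exposed to the LLM at every turn
    messages: list[dict]
    # isolated: kept in state, surfaced only when a node needs them
    raw_tool_outputs: list[str]  # token-heavy payloads stay out of the prompt
    plan: str                    # the scratchpad plan from strategy 1
```

nodes can read and write every field, but only messages is rendered into the prompt by default; everything else enters context only when a node selects it explicitly.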

how LangGraph supports these strategies

LangGraph was designed to support all four context engineering strategies:

write context

  • checkpointing: persists agent state across all steps (short-term memory)
  • long-term memory: persist context across many sessions
  • LangMem: abstractions for memory management

select context

  • state access: fetch state within each node for fine-grained control
  • memory retrieval: embedding-based retrieval on memory collections
  • Bigtool library: semantic search over tool descriptions

compress context

  • message utilities: built-in utilities for summarization and trimming
  • custom nodes: add summarization at specific points in your agent
  • tool-specific compression: post-process specific tool calls

isolate context

  • state schemas: isolate context in different state fields
  • sandbox support: E2B sandbox and Pyodide for context isolation
  • multi-agent libraries: supervisor and swarm patterns

practical principles

whether you use LangGraph or build from scratch, these principles apply:

  1. measure before optimizing

    use tools like LangSmith to track token usage and understand where context is going.

  2. context poisoning is real

    hallucinations that make it into context compound. design systems to detect and remove them.

  3. multi-agent has real costs

    the 15x token multiplier isn't hyperbole. use multi-agent when parallelism and specialization justify the cost.

  4. consider state isolation first

    before jumping to multi-agent, try isolating context in state fields.

  5. summarization requires care

    Cognition uses fine-tuned models. critical information can't be lost in compression.

  6. retrieval is harder than it looks

    as Windsurf shows, production RAG combines multiple techniques and re-ranking.

key takeaways

  1. context engineering is foundational - both Anthropic and Cognition call it the #1 job for agent builders
  2. four strategies matter - write, select, compress, and isolate provide a complete mental model
  3. production systems teach us - Claude Code, ChatGPT, Cursor, and Windsurf show these strategies in action
  4. costs compound - multi-agent's 15x multiplier, poor retrieval, and lack of compression create expensive systems
  5. frameworks help - LangGraph provides primitives for all four strategies, but you still need to design carefully
  6. measure everything - LangSmith and similar observability tools are essential for understanding context usage

working on production agent systems? i help companies design agent architectures that balance context management with performance and cost. let's discuss your specific challenges.


Aankit Roy

ai strategy & engineering leadership consultant. helping companies build production agents that balance performance, reliability, and cost through thoughtful context engineering.