last week, i spent 6 hours debugging an agent that kept writing python scripts to create excel files instead of just... calling the excel API. turns out, i completely misunderstood how the claude agent sdk approaches tool selection.
(this is the messy reality nobody talks about in those polished case studies)
wait... claude code is now for more than coding?
here's what clicked for me after reading the anthropic engineering blog...
anthropic built claude code originally as an internal tool for their developers. gave it access to terminals, file systems, the ability to run commands... basically everything a developer needs to write and debug code.
but then something interesting happened.
people at anthropic started using it for deep research. video creation. note-taking. all sorts of non-coding work. because when you give an AI agent access to a computer... it can do way more than just write code.
(which is why they renamed it from "claude code sdk" to "claude agent sdk" - reflects what it actually does)
if you've been following my writing, you know i've covered ai agents extensively and compared different frameworks. the claude agent sdk sits in a unique position...
the core insight that changed how i build agents
anthropic's key design principle: claude needs the same tools that people use every day.
not some abstract API wrapper or sanitized interface. the actual terminal. actual file system. the messy, powerful tools we use.
this is fundamentally different from how i was building agents before... and it explains why my early attempts felt so constrained.
the agent loop (that actually works in production)
after working with the sdk for the past few months, i've noticed all successful implementations follow the same pattern:
gather context → take action → verify work → repeat
sounds simple, right? but each stage has specific considerations that'll make or break your agent.
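here's the shape of that loop as a bare python skeleton (a conceptual sketch - the gather/act/verify stubs are placeholders for your real logic, not sdk code):

```python
# conceptual sketch of the agent loop - the stubs are placeholders, not sdk internals
def gather_context(task: str) -> str:
    # e.g. grep relevant files, read past conversations, pull calendar data
    return f"context for: {task}"

def take_action(task: str, context: str) -> str:
    # e.g. call a tool, run a bash command, or have claude generate code
    return f"draft output for: {task}"

def verify(output: str) -> list[str]:
    # e.g. run linters, check formats, screenshot and inspect
    return []  # empty list = no problems found

def run_agent(task: str, max_iterations: int = 5) -> str:
    output = ""
    for _ in range(max_iterations):
        context = gather_context(task)
        output = take_action(task, context)
        problems = verify(output)
        if not problems:
            break  # verified - stop looping
        task = f"{task}\nfix these problems: {problems}"
    return output

print(run_agent("summarize last week's customer feedback"))
```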
gathering context (the right way)
most agent failures happen here. they either pull in too much context (hitting token limits) or too little (making uninformed decisions).
here's what i've learned building 8 different agents with the sdk:
agentic search > semantic search (usually)
controversial take... but agentic search (where claude uses bash tools like grep, tail, find) is more reliable than semantic search for most use cases.
why? because:
- it's transparent - you can see exactly what claude searched for
- no chunking artifacts or embedding drift
- works with any file type without special handling
- easier to debug when things go wrong (and they will)
i only add semantic search when i need faster results across massive datasets. start simple.
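to make "agentic search" concrete: this is the kind of transparent search an agent runs when it has bash access, reproduced via python's subprocess so you can try it yourself (the search term and the conversations/ path are just examples):

```python
import subprocess

# the same kind of transparent search an agent runs with bash access
# (the conversations/ path and search term are just examples)
result = subprocess.run(
    ["grep", "-ril", "refund request", "conversations/"],  # -r recurse, -i ignore case, -l list matching files
    capture_output=True,
    text=True,
)
matching_files = result.stdout.splitlines()
print(matching_files)

if matching_files:
    # narrow down with line numbers, exactly how a developer would
    detail = subprocess.run(
        ["grep", "-in", "refund request", *matching_files],
        capture_output=True,
        text=True,
    )
    print(detail.stdout)
```

every step is a command you can read in the logs and rerun by hand - that's the debuggability argument in practice.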
subagents for parallelization
this feature is underrated...
working on a research agent last month, i needed to analyze 47 different customer feedback documents. sequential processing would've taken forever and burned through context.
solution: spin up multiple subagents in parallel, each analyzing 5-6 documents, returning only the key insights to the main agent.
result: 6.7x faster processing, used 40% less context, maintained accuracy.
(the sdk handles this automatically with isolated context windows for each subagent)
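the fan-out pattern itself is simple. here's a sketch of the shape (not the sdk's internals - `analyze_batch` stands in for whatever call hands a batch of documents to a subagent):

```python
import asyncio

async def analyze_batch(batch_id: int, documents: list[str]) -> str:
    # placeholder for a subagent call - in practice this is where you hand
    # 5-6 documents to an isolated subagent and get back only the key insights
    await asyncio.sleep(0.1)  # simulate work
    return f"batch {batch_id}: key insights from {len(documents)} documents"

async def analyze_all(documents: list[str], batch_size: int = 6) -> list[str]:
    batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
    # run every batch concurrently; each subagent keeps its own context window
    return await asyncio.gather(*(analyze_batch(i, b) for i, b in enumerate(batches)))

docs = [f"feedback_{n}.md" for n in range(47)]
insights = asyncio.run(analyze_all(docs))
print("\n".join(insights))
```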
compaction for long-running agents
if your agent runs for more than a few interactions, you need compaction. the sdk automatically summarizes previous messages when approaching context limits.
learned this the hard way when an email management agent kept failing after ~20 interactions. added compaction, problem solved.
taking action (where most tutorials stop)
here's where it gets interesting... the sdk gives you several ways for agents to take action, and choosing the right one matters more than i initially thought.
tools = primary actions
tools are prominent in claude's context window. which means claude will consider them first when deciding how to complete a task.
so your tool design directly impacts agent behavior.
building a customer service agent recently, i made the mistake of creating 23 different tools (fetchUserData, getUserPreferences, checkOrderStatus, getShippingInfo, etc...)
claude got confused. kept choosing the wrong tools. high latency from analyzing all the options.
refactored to 5 core tools with parameters. much better.
lesson: fewer, well-designed tools > many specialized tools
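for reference, the consolidated version looks something like this in the standard anthropic tool-schema format (the tool name and fields here are illustrative, not the exact tools from that project):

```python
# one parameterized tool instead of fetchUserData / getUserPreferences / checkOrderStatus / ...
# (illustrative schema - the names and fields are made up for the example)
customer_lookup_tool = {
    "name": "get_customer_info",
    "description": "look up customer data: profile, preferences, orders, or shipping status",
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string", "description": "internal customer id"},
            "section": {
                "type": "string",
                "enum": ["profile", "preferences", "orders", "shipping"],
                "description": "which slice of customer data to return",
            },
            "order_id": {
                "type": "string",
                "description": "required when section is 'orders' or 'shipping'",
            },
        },
        "required": ["customer_id", "section"],
    },
}
```

one tool, one decision for claude to make - the `section` enum does the routing that 23 separate tools used to do.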
bash for flexible operations
the sdk's bash access is incredibly powerful... and slightly terrifying.
example from last week: user wanted an agent that could process PDF attachments in emails.
instead of building a custom pdf-to-text tool, claude just wrote a bash script:
```bash
# claude's approach (works first try)
pdftotext attachment.pdf - | grep "invoice" | awk '{print $3}'
```
simple. composable. exactly how a developer would solve it.
code generation for complex operations
anthropic's file creation feature in claude.ai? entirely powered by code generation.
claude writes python scripts to create excel spreadsheets, powerpoint presentations, word documents. ensures consistent formatting and complex functionality.
i've been using this pattern for data transformation tasks. instead of building rigid data pipelines, let claude write the transformation code based on actual requirements.
result: more flexible, easier to debug (it's just code), and surprisingly reliable.
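a minimal version of that pattern, assuming the python `claude_agent_sdk` package and its `query()` / `ClaudeAgentOptions` interface (the file names and prompt are made up):

```python
import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions  # assumes the python claude agent sdk is installed

async def transform_feedback() -> None:
    options = ClaudeAgentOptions(
        # let claude read the raw data, write a transformation script, and run it
        allowed_tools=["Read", "Write", "Bash"],
        permission_mode="acceptEdits",
    )
    prompt = (
        "read raw_feedback.csv, write a python script that normalizes the ratings "
        "to a 1-5 scale and deduplicates rows, run it, and save the result to clean_feedback.csv"
    )
    async for message in query(prompt=prompt, options=options):
        print(message)  # stream progress / results as they arrive

asyncio.run(transform_feedback())
```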
mcps for external services
the model context protocol (mcp) handles authentication and api calls automatically. huge time saver.
building an internal tool agent that needed slack, asana, and google drive access. without mcp? would've spent days on oauth flows and api integration.
with mcp? connected all three services in an afternoon.
(the mcp ecosystem is growing fast - check available servers before building custom integrations)
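the wiring is mostly configuration. a sketch assuming the python sdk's `mcp_servers` option with stdio-based servers - the server commands and env vars below are placeholders, check each service's actual mcp server docs:

```python
from claude_agent_sdk import ClaudeAgentOptions

# hypothetical server commands and env vars - substitute the real mcp servers you use
options = ClaudeAgentOptions(
    mcp_servers={
        "slack": {
            "type": "stdio",
            "command": "npx",
            "args": ["-y", "your-slack-mcp-server"],
            "env": {"SLACK_TOKEN": "xoxb-..."},
        },
        "asana": {
            "type": "stdio",
            "command": "npx",
            "args": ["-y", "your-asana-mcp-server"],
            "env": {"ASANA_TOKEN": "..."},
        },
        "gdrive": {
            "type": "stdio",
            "command": "npx",
            "args": ["-y", "your-gdrive-mcp-server"],
        },
    },
)
# the servers' tools show up to claude with prefixed names like mcp__slack__<tool>
```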
verifying work (the critical piece everyone skips)
this is what separates toy demos from production agents.
agents that can check and improve their own output are fundamentally more reliable. they catch mistakes before they compound.
defined rules > vague guidelines
code linting is the perfect example of rules-based feedback.
working on a typescript generation agent, i originally just had claude "check if the code looks good." useless. too vague.
switched to proper linting: typescript compiler errors, eslint warnings, specific rule violations.
quality improved dramatically. fewer iterations needed.
same principle applies to non-code work. for an email agent, i defined rules: valid email format, check if recipient exists in history, tone consistency with previous messages.
specific rules = specific feedback = better results
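for the email agent, those rules are just plain checks that return specific failures claude can act on. a minimal sketch (the exact rules and the tone heuristic are illustrative):

```python
import re

def verify_email_draft(draft: dict, known_recipients: set[str]) -> list[str]:
    """return a list of specific problems - an empty list means the draft passes."""
    problems = []

    # rule 1: valid email format
    if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", draft["to"]):
        problems.append(f"invalid recipient address: {draft['to']}")

    # rule 2: recipient must exist in conversation history
    if draft["to"].lower() not in known_recipients:
        problems.append(f"no prior conversation with {draft['to']}")

    # rule 3: crude tone-consistency check (illustrative - swap in whatever signal you trust)
    if draft["body"].isupper():
        problems.append("draft is all caps, which doesn't match previous messages")

    return problems

draft = {"to": "jamie@example.com", "body": "thanks for the update, shipping friday."}
print(verify_email_draft(draft, known_recipients={"jamie@example.com"}))  # -> []
```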
visual feedback for ui/design work
when building agents that generate visual output (html emails, reports, dashboards), visual feedback is incredibly valuable.
example: email formatting agent that generates html emails.
i have it screenshot the rendered email and provide that back to claude for verification:
- layout correct?
- colors/fonts as intended?
- responsive at different sizes?
- anything look broken?
using playwright mcp server makes this trivial - screenshot, analyze, iterate.
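the screenshot step itself is tiny. here's a version using plain playwright in python instead of the mcp server, if you want to run it outside the agent (assumes playwright is installed and `playwright install chromium` has been run; the html path is made up):

```python
from playwright.sync_api import sync_playwright

# render the generated html email and capture screenshots at two widths
with sync_playwright() as p:
    browser = p.chromium.launch()
    for width in (375, 1200):  # phone-ish and desktop-ish
        page = browser.new_page(viewport={"width": width, "height": 900})
        page.goto("file:///tmp/draft_email.html")  # path to the generated email - adjust as needed
        page.screenshot(path=f"email_{width}px.png", full_page=True)
        page.close()
    browser.close()
# feed the png files back to claude and ask the verification questions above
```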
llm as judge (use sparingly)
having another llm judge output quality... i'm honestly skeptical of this approach.
it's slow, expensive, and not particularly reliable in my testing.
but... for highly subjective quality assessments where defined rules don't work, it's sometimes your only option.
i used it for tone analysis in an email drafting agent. separate subagent judges if the draft tone matches user's previous messages.
adds 2-3 seconds latency. worth it for this specific use case. not worth it for most things.
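the judge is just a small, separate call. a sketch using the `anthropic` python client (the model id, prompt, and scoring format are placeholders to tune):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_tone(draft: str, previous_messages: list[str]) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder - pick a cheap, fast model for judging
        max_tokens=200,
        messages=[
            {
                "role": "user",
                "content": (
                    "previous messages from this user:\n"
                    + "\n---\n".join(previous_messages)
                    + f"\n\nnew draft:\n{draft}\n\n"
                    "does the draft's tone match the previous messages? "
                    "answer 'match' or 'mismatch', then one sentence of reasoning."
                ),
            }
        ],
    )
    return response.content[0].text

print(judge_tone("hey! quick update - invoice is attached.", ["hey! thanks a ton for the fast turnaround."]))
```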
what i've learned from production deployments
after deploying 5 different agents built with the sdk over the past few months:
start with the agent loop in mind
before writing any code, map out:
- what context does my agent need? (and where does it live?)
- what actions should it take? (tools, bash, code?)
- how will it verify its work? (defined rules, visual feedback?)
this simple framework prevents most design mistakes.
file system = context engineering
how you organize files directly impacts agent performance.
for an email agent, i store previous conversations in a structured folder hierarchy:
```
conversations/
  2025/
    09/
      customer_name/
        thread_001.md
        thread_002.md
```
makes agentic search fast and accurate. claude can grep through relevant conversations without loading everything into context.
iteration beats perfection
my best-performing agent (customer support automation, handling ~300 tickets/day) started as a 150-line prototype that handled exactly one use case.
shipped it. watched how it failed. improved.
now it's ~800 lines and handles 12 different ticket types. but i would've never gotten there trying to build the perfect system upfront.
common mistakes (that i made so you don't have to)
too many tools
already mentioned this, but worth repeating: fewer, well-designed tools with parameters > many specialized tools
ignoring error cases
what happens when:
- api call fails?
- file doesn't exist?
- bash command returns error?
- subagent times out?
if you haven't explicitly handled these cases, your agent will fail in production. guaranteed.
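a concrete example for the bash case - a sketch where the timeout and retry counts are arbitrary defaults, not recommendations:

```python
import subprocess

def run_bash(command: list[str], timeout_s: int = 30, retries: int = 2) -> str:
    """run a command the agent asked for, and return either output or a specific error message."""
    for attempt in range(retries + 1):
        try:
            result = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue  # treat a timeout like a transient failure and retry
        except FileNotFoundError:
            return f"error: command not found: {command[0]}"
        if result.returncode == 0:
            return result.stdout
        # non-zero exit: surface stderr so the agent can decide what to do next
        if attempt == retries:
            return f"error (exit {result.returncode}): {result.stderr.strip()}"
    return f"error: {command[0]} timed out after {retries + 1} attempts"

print(run_bash(["ls", "does_not_exist/"]))  # -> "error (exit ...): ls: cannot access ..."
```

the point isn't this exact wrapper - it's that every failure mode becomes a specific message the agent (or you) can act on instead of a silent crash.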
not testing with real data
synthetic test data is clean, well-formatted, consistent.
real data is messy, inconsistent, full of edge cases.
my research agent worked perfectly in testing (10/10 success rate). deployed to production, success rate dropped to 60% because real customer feedback documents were way messier than my test data.
test with real data early and often.
overcomplicating context gathering
i tried building a sophisticated caching system for frequently accessed data... spent a week on it.
then realized claude's agentic search was already pretty efficient. didn't need complex caching for most use cases.
premature optimization strikes again.
specific use cases i've built
concrete examples with actual numbers:
email management agent
context gathering: searches past conversations folder, checks sender history, reviews current calendar
actions: drafts replies, schedules meetings (google calendar mcp), files important messages, creates follow-up tasks
verification: checks email format validity, tone matching against previous messages, confirms meeting times don't conflict
results: processes 40-50 emails/day, saves about 15 minutes per day, zero calendar conflicts in 2 months
document analysis agent
context gathering: uses subagents to parallel process documents, agentic search for specific terms, builds summary index
actions: generates structured analysis reports (via code generation), extracts key metrics, creates comparison tables
verification: cross-references extracted numbers with source documents, validates report structure against template
results: analyzed 200+ customer feedback docs in 12 minutes (previously took 6+ hours of manual work)
code modernization agent
context gathering: scans codebase for patterns, identifies dependencies, checks current language versions
actions: rewrites deprecated code, updates syntax, runs tests, creates git commits
verification: runs full test suite, lints all changes, checks for breaking changes
results: migrated 23k lines of python 2 to python 3 (with 94% test pass rate on first run)
when to use claude agent sdk vs other approaches
honest assessment after working with multiple agent frameworks:
use claude agent sdk when:
- you need flexible tool execution (bash, code generation, file manipulation)
- your agent needs to work with files and file systems
- you want transparent, debuggable agent behavior
- you're building agents that need to compose tools creatively
- you need reliable mcp integration
consider alternatives when:
- you need highly specialized domain tools (might need custom agent framework)
- you're building simple api wrappers (sdk is overkill)
- you need specific orchestration patterns the sdk doesn't support
- your use case requires custom llm routing logic
getting started (practical next steps)
if you're thinking about building with the claude agent sdk:
- read the anthropic engineering blog post - seriously, it's good (link: anthropic.com/engineering/building-agents-with-the-claude-agent-sdk)
- start with one clear use case - don't try to build a general-purpose agent first
- map the agent loop - context sources, actions needed, verification methods
- prototype with minimal tools - add complexity only when needed (minimal example below)
- test with real data immediately - synthetic data will mislead you
- deploy small and iterate - you'll learn more from one production deployment than 10 perfect prototypes
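to make "prototype with minimal tools" concrete, a first prototype can be a single file (again assuming the python `claude_agent_sdk` package; the tool list and prompt are just a starting point):

```python
import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions  # assumed package: claude-agent-sdk

async def main() -> None:
    options = ClaudeAgentOptions(
        allowed_tools=["Read", "Grep", "Bash"],  # start read-heavy; add Write/Edit once you trust it
        max_turns=10,  # keep the first prototype on a short leash
    )
    async for message in query(
        prompt="find every mention of 'refund' in ./conversations and summarize the common complaints",
        options=options,
    ):
        print(message)

asyncio.run(main())
```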
final thoughts
the claude agent sdk represents a fundamentally different approach to building agents... giving ai the same tools humans use, rather than creating sanitized abstractions.
it's powerful. it's flexible. and it's surprisingly easy to shoot yourself in the foot with bash access and code generation.
but after months of working with it, i genuinely believe it's one of the better approaches to building agents that actually work in production.
the key insight: good agent design is more important than fancy technology.
gather context efficiently. take appropriate actions. verify your work. repeat.
simple in theory. nuanced in practice. worth learning.
related reading
if you found this helpful, you might also enjoy:
- AI Agents: The Complete Guide - foundational concepts and architecture patterns
- AI Agent Frameworks: The Honest Comparison - how claude agent sdk compares to langgraph, crewai, and others
- LangGraph State Management and Memory - building stateful agents with persistent memory
building production agents? i help companies design and implement ai agent systems that actually work at scale. let's talk about your specific use case. schedule a conversation