The demos are dazzling. An AI agent books a flight, writes a report, and files an expense claim while you drink your morning coffee. But build one yourself, and the shine fades quickly.
I spent the last three months deep in the agent stack — LangChain, CrewAI, AutoGen, and a fair amount of raw OpenAI function calling. What I found was not the autonomous future we were promised, but something more interesting: a set of powerful primitives that only work when you understand exactly where they break.
The Planning Problem
Most agent frameworks sell the dream of goal-oriented autonomy. You give the agent a high-level objective — "plan the company offsite" — and expect it to decompose, execute, and adapt. In practice, this rarely works without extensive scaffolding.
The core issue is planning. Humans maintain a rich hierarchical model of a task: we know which steps depend on others, what resources each requires, and when to backtrack. Current agents, by contrast, produce their plans one token at a time, typically with no explicit model of dependencies or resources to check the plan against. They are local optimisers trying to solve global problems. The result is competent micro-execution and catastrophic macro-planning.
The teams building the most reliable agent systems have converged on a hybrid model: human-defined workflow graphs with agent-powered nodes. You specify the structure; the agents handle the improvisation within nodes. This is less thrilling than full autonomy, but it ships.
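To make the hybrid model concrete, here is a minimal sketch in plain Python rather than any particular framework's API. The Workflow class, the node names, and the edge labels are all hypothetical, and propose_venue is a deterministic stand-in for what would be an LLM call in a real system. The point is only that the graph is fixed by a human, while each node is free to improvise inside its own step.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = dict
# Each node returns the updated state plus the label of the edge it wants to take.
Node = Callable[[State], Tuple[State, str]]


class Workflow:
    """A human-defined graph: nodes do the work, edges constrain what may follow."""

    def __init__(self, nodes: Dict[str, Node],
                 edges: Dict[Tuple[str, str], Optional[str]], start: str):
        self.nodes = nodes
        self.edges = edges  # (node, label) -> next node, or None to stop
        self.start = start

    def run(self, state: State, max_steps: int = 20) -> Tuple[State, List[str]]:
        current: Optional[str] = self.start
        trace: List[str] = []
        while current is not None:
            if len(trace) >= max_steps:
                raise RuntimeError("step budget exceeded")
            state, label = self.nodes[current](state)
            trace.append(current)
            key = (current, label)
            if key not in self.edges:
                # A node may only pick edges the human author declared.
                raise ValueError(f"node {current!r} chose unknown edge {label!r}")
            current = self.edges[key]
        return state, trace


def gather_requirements(state: State) -> Tuple[State, str]:
    return {**state, "headcount": 12}, "ok"


def propose_venue(state: State) -> Tuple[State, str]:
    # Stand-in for an LLM call: the first proposal is too expensive, the second is not.
    attempt = state.get("attempts", 0) + 1
    cost = 900 if attempt == 1 else 400
    return {**state, "attempts": attempt, "venue_cost": cost}, "ok"


def check_budget(state: State) -> Tuple[State, str]:
    return state, "within" if state["venue_cost"] <= state["budget"] else "over"


workflow = Workflow(
    nodes={
        "gather_requirements": gather_requirements,
        "propose_venue": propose_venue,
        "check_budget": check_budget,
    },
    edges={
        ("gather_requirements", "ok"): "propose_venue",
        ("propose_venue", "ok"): "check_budget",
        ("check_budget", "over"): "propose_venue",  # loop back and try again
        ("check_budget", "within"): None,           # done
    },
    start="gather_requirements",
)

final_state, trace = workflow.run({"budget": 500})
```

In this toy run the budget check sends the workflow back to propose_venue once before it terminates. The backtracking lives in the graph, where it can be read and tested, and not in the model's free-text reasoning.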
State Is Everything
One underappreciated truth: the hardest part of building agents is not the LLM calls. It is state management. A non-trivial agent may touch dozens of tools, run for minutes or hours, and encounter errors that require recovery strategies. Without durable state, a single API timeout destroys hours of progress.
Patterns that work:
- Event-sourced architectures — every tool call and observation is logged to a stream that can be replayed.
- Checkpointing at decision boundaries — before any expensive or irreversible action, persist state.
- Structured output schemas — force the model to emit parseable planning artifacts, not just free-text reasoning.
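The first two patterns can be sketched together in a few dozen lines. What follows is an illustration under assumptions, not a production design: the EventLog class, the JSON Lines file format, the event kinds, and the step names are all invented for this example. Every tool call is written to an append-only log before it runs, every result is written after, and a restarted run rebuilds its state by replaying the log.

```python
import json
import tempfile
from pathlib import Path


class EventLog:
    """Append-only JSON Lines log of agent events (hypothetical format)."""

    def __init__(self, path):
        self.path = Path(path)

    def append(self, kind, payload):
        event = {"kind": kind, "payload": payload}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        return event

    def replay(self):
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def rebuild_state(events):
    """Fold the event stream into the current state."""
    state = {"completed": [], "results": {}}
    for event in events:
        if event["kind"] == "tool_result":
            step = event["payload"]["step"]
            state["completed"].append(step)
            state["results"][step] = event["payload"]["result"]
    return state


def run_steps(log, steps, tools):
    state = rebuild_state(log.replay())
    for step in steps:
        if step in state["completed"]:
            continue  # finished in an earlier run, so skip it
        log.append("tool_call", {"step": step})  # checkpoint before acting
        result = tools[step]()
        log.append("tool_result", {"step": step, "result": result})
        state["completed"].append(step)
        state["results"][step] = result
    return state


# Demo: the second tool times out on its first attempt.
calls = []


def search():
    calls.append("search")
    return "3 venues found"


def flaky_book():
    calls.append("book")
    if calls.count("book") == 1:
        raise TimeoutError("simulated API timeout")
    return "booked"


tools = {"search": search, "book": flaky_book}
log = EventLog(Path(tempfile.mkdtemp()) / "events.jsonl")

try:
    run_steps(log, ["search", "book"], tools)
except TimeoutError:
    pass  # the first run dies here

state = run_steps(log, ["search", "book"], tools)  # the restart resumes from the log
```

On the restart, search is not called again because its result is already in the log. A tool_call event with no matching tool_result marks an action that was interrupted; for irreversible actions, a real system would need to check whether that call actually took effect before retrying it.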
The Tool Gap
Agents are only as capable as their tools. And most real-world tools (internal APIs, legacy databases, vendor systems) were not designed to be driven by an autonomous caller. Using them reliably requires authentication handling, rate-limit handling, idempotency logic, and context-specific interpretation of their responses.
I have seen teams spend 80% of their agent development time not on the agent, but on robust tool wrappers. The lesson: invest in your tooling layer first. An agent with five rock-solid, well-documented tools outperforms one with fifty flaky ones.
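As a rough illustration of what a tool wrapper has to do, the sketch below combines two of those concerns: retrying transient failures with exponential backoff, and reusing one idempotency key across retries so that a repeated request cannot create a duplicate. The names TransientError, call_with_retry, and create_invoice are hypothetical, and the idempotency_key parameter assumes the underlying API supports such a key, which many do not.

```python
import time
import uuid


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, rate limits)."""


def call_with_retry(tool, payload, *, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call tool(payload, idempotency_key=...) with exponential backoff."""
    # One key per logical call, reused on every retry of that call.
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(payload, idempotency_key=key)
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a fake tool that is rate limited twice before succeeding.
seen_keys = []


def create_invoice(payload, idempotency_key):
    seen_keys.append(idempotency_key)
    if len(seen_keys) < 3:
        raise TransientError("rate limited")
    return {"status": "created", "amount": payload["amount"]}


delays = []  # record the requested delays and skip the actual sleeping
result = call_with_retry(create_invoice, {"amount": 120}, sleep=delays.append)
```

The sleep function is injected so the wrapper can be tested without waiting. Non-transient errors are deliberately not caught: a malformed request should fail fast and visibly, not be retried.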
What Actually Works Now
Despite these constraints, there are clear domains where agent architectures deliver real value:
- Software engineering assistance — agents that explore codebases, suggest refactors, and write tests within bounded contexts. Cursor and Copilot are the vanguard here.
- Research synthesis — multi-step retrieval, cross-referencing, and summarisation across large document collections.
- Customer support triage — gathering context, classifying intent, and routing to the right human specialist.
The common thread: bounded domains, clear success criteria, and supervisory boundaries that prevent runaway behaviour.
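The support triage case shows how small such a supervisory boundary can be. In the sketch below, the ROUTES table, the queue names, the confidence threshold, and the keyword-based classify function are all made up for illustration; classify stands in for what would be an LLM call. The agent may only choose from a fixed set of routes, and anything it is unsure about goes to a human.

```python
from dataclasses import dataclass

# The only destinations the agent is allowed to pick (hypothetical queue names).
ROUTES = {
    "billing": "billing-team",
    "outage": "on-call",
    "account": "account-support",
}
FALLBACK = "human-triage"


@dataclass
class Classification:
    intent: str
    confidence: float


def classify(ticket: str) -> Classification:
    """Stand-in for an LLM classifier, using keyword rules for the demo."""
    text = ticket.lower()
    if "refund" in text or "invoice" in text:
        return Classification("billing", 0.9)
    if "down" in text or "outage" in text:
        return Classification("outage", 0.8)
    return Classification("unknown", 0.3)


def route(ticket: str, classifier=classify, min_confidence: float = 0.7) -> str:
    result = classifier(ticket)
    # Unknown intents and low-confidence guesses both fall back to a person.
    if result.intent not in ROUTES or result.confidence < min_confidence:
        return FALLBACK
    return ROUTES[result.intent]
```

For example, route("I need a refund for last month") returns "billing-team", while route("hello") falls through to "human-triage". The success criterion is equally easy to state: the share of tickets that land in the right queue.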
The Year Ahead
We are not on the verge of artificial general intelligence. We are in the midst of a pragmatic revolution: LLMs as universal interface adapters that can glue together software systems with unprecedented flexibility. The best outcomes will come not from chasing autonomy for its own sake, but from designing thoughtful human-agent partnerships.
The hype will subside. But the engineers who learn to build reliable, observable, and well-instrumented agent systems will find themselves with skills that are genuinely scarce — and genuinely useful — for years to come.