I can build an agent that reads every deal in my CRM through MCP tool calls and tells me where to focus. It costs $15 a run and the output is mediocre. The same answer from an algorithm over a Postgres table costs fractions of a cent. That gap is not a model problem. It is an architecture problem, and most teams shipping agents have not confronted it yet.
Here is the pattern that works. Prototype on pure reasoning — let the agent call MCP tools freely while you figure out what the workflow actually requires. That phase is legitimate and fast. But then read the logs. Wherever the model rebuilds the same bash script, the same search, the same data pull every single run, you are paying model prices for deterministic work. Extract it. Make it a Python function the agent calls as a tool, with a SQL query that hands the model exactly the data it needs and nothing more. The teams doing this well treat skills as real software, not markdown instructions.
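A minimal sketch of what that extraction can look like. The `deals` table, its columns, and the "stale open deals, largest first" rule are all hypothetical stand-ins, and `sqlite3` is used in place of Postgres only so the example runs without a server. The point is the shape: one parameterized query, one function, and a small result the model can reason over.

```python
import sqlite3

# Hypothetical schema for illustration; a real CRM mirror will differ.
# stage is one of 'open', 'won', 'lost'.
SCHEMA = """
CREATE TABLE deals (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    amount REAL NOT NULL,
    stage TEXT NOT NULL,
    days_since_contact INTEGER NOT NULL
)
"""

# Assumed focus rule: open deals nobody has touched recently, largest first.
FOCUS_QUERY = """
SELECT id, name, amount, days_since_contact
FROM deals
WHERE stage = 'open' AND days_since_contact >= ?
ORDER BY amount DESC
LIMIT ?
"""


def deals_needing_attention(conn, stale_after_days=14, limit=3):
    """Tool the agent calls: returns only the rows the model needs."""
    rows = conn.execute(FOCUS_QUERY, (stale_after_days, limit)).fetchall()
    return [
        {"id": r[0], "name": r[1], "amount": r[2], "days_since_contact": r[3]}
        for r in rows
    ]


def demo_connection():
    """Build an in-memory database with a few made-up deals."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO deals (name, amount, stage, days_since_contact) "
        "VALUES (?, ?, ?, ?)",
        [
            ("Acme renewal", 50000, "open", 30),
            ("Globex pilot", 12000, "open", 3),
            ("Initech expansion", 80000, "won", 40),
            ("Umbrella upsell", 20000, "open", 21),
        ],
    )
    return conn


if __name__ == "__main__":
    for deal in deals_needing_attention(demo_connection()):
        print(deal)
```

The model never sees the full table. It receives a handful of rows and spends its tokens on the judgment call, not on the data pull.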
The same curve applies to the integration layer itself. Aggregator MCP providers — the services that hand you a hundred OAuth connectors in an afternoon — are the prototyping medium for connections. They are an easy button of variable quality: when a connector works, you have built nothing and shipped something. When it turns out to be mediocre — and you usually find out the moment a real workflow leans on it — you write the native integration and swap the agent over with a one-line instruction. The decision rule is identical to the one for reasoning: pay the convenience tax while you are learning what the workflow needs, then harden the parts that matter.
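One way to keep that swap cheap is to put both connectors behind the same small interface from day one. The sketch below is an assumption about structure, not a description of any particular aggregator: `CrmConnector`, `AggregatorConnector`, `NativeConnector`, and the payloads they return are hypothetical.

```python
from typing import Protocol


class CrmConnector(Protocol):
    """The only surface the agent's tool depends on (hypothetical)."""

    def list_open_deals(self) -> list[dict]: ...


class AggregatorConnector:
    """Stand-in for a generic aggregator-backed connector."""

    def list_open_deals(self) -> list[dict]:
        # Imagined generic payload: amounts arrive as loosely typed strings.
        return [{"name": "Acme renewal", "amount": "50000"}]


class NativeConnector:
    """Stand-in for a hand-written integration against the CRM's own API."""

    def list_open_deals(self) -> list[dict]:
        # Same shape, but typed the way the workflow needs it.
        return [{"name": "Acme renewal", "amount": 50000.0}]


CONNECTORS = {"aggregator": AggregatorConnector, "native": NativeConnector}


def get_connector(name: str) -> CrmConnector:
    """Look up a connector by name; raises KeyError for unknown names."""
    return CONNECTORS[name]()


# Switching the agent over is a one-line change to this setting.
ACTIVE_CONNECTOR = "native"

if __name__ == "__main__":
    print(get_connector(ACTIVE_CONNECTOR).list_open_deals())
```

Because the tool depends on the interface and not on the provider, hardening one connector does not touch the rest of the agent.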
The serious operators go further. One team I work with — automating tax and payroll for German SMEs — runs a small, cheap model as a triage layer in front of every task: how big is this, how many cycles does it deserve, which model should handle it, where is the cutoff. Frontier reasoning is reserved for genuine ambiguity. Everything else flows through code paths that cost almost nothing and never hallucinate.
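A rough sketch of such a triage layer, under loud assumptions: the labels, the cycle budgets, and the `stub_classifier` rules below are invented for illustration. In the setup described above, the classifier would be a call to a small model, which is why it is passed in as a plain function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    handler: str      # which path handles the task
    max_cycles: int   # how many reasoning cycles the task is allowed


# Hypothetical routing table: label -> handler and budget.
ROUTES = {
    "routine": Route("code_path", 0),
    "simple": Route("small_model", 3),
    "ambiguous": Route("frontier_model", 10),
}


def stub_classifier(task: dict) -> str:
    """Placeholder for the small triage model; hand-written rules only."""
    if task.get("matches_known_workflow"):
        return "routine"
    if task.get("open_questions", 0) <= 1:
        return "simple"
    return "ambiguous"


def triage(task: dict, classify: Callable[[dict], str] = stub_classifier) -> Route:
    """Decide which path handles a task and how much budget it gets."""
    label = classify(task)
    # Unrecognized labels escalate to the frontier model instead of failing.
    return ROUTES.get(label, ROUTES["ambiguous"])


if __name__ == "__main__":
    print(triage({"matches_known_workflow": True}))
    print(triage({"open_questions": 4}))
```

Escalating unknown labels is a design choice in this sketch: a misrouted task costs more tokens, but it does not get silently forced down a code path that cannot handle it.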
They have also started automating the hardening itself. After every successful run, a background agent reads the execution logs and asks a simple question: what did this run do that the next one should skip? When it catches the model writing the same bash script for the fifth time, it proposes promoting that script to a skill and tightening the prompt. The proposal is posted to Slack, approved with a thumbs-up, and shipped as a reviewable change in Git. The effect compounds: run data and human feedback compile down into cheaper, more deterministic agents over time.
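The detection step does not itself need a model. As a hedged sketch, assume the logs can be reduced to a mapping from run id to the scripts the model wrote in that run. That log shape, the normalization rule, and the function names are assumptions. Hashing a normalized script body is enough to spot one that keeps reappearing.

```python
import hashlib
from collections import defaultdict


def normalize(script: str) -> str:
    """Drop blank lines, comment lines, and indentation so trivial edits match."""
    lines = [line.strip() for line in script.splitlines()]
    return "\n".join(line for line in lines if line and not line.startswith("#"))


def promotion_candidates(runs: dict, min_runs: int = 5) -> list:
    """Return scripts that appear in at least `min_runs` distinct runs.

    `runs` maps a run id to the list of script strings written in that run.
    """
    seen_in = defaultdict(set)   # digest -> ids of runs containing the script
    sample = {}                  # digest -> normalized script text
    for run_id, scripts in runs.items():
        for script in scripts:
            body = normalize(script)
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
            seen_in[digest].add(run_id)
            sample.setdefault(digest, body)
    return [
        {"script": sample[digest], "runs": sorted(run_ids)}
        for digest, run_ids in seen_in.items()
        if len(run_ids) >= min_runs
    ]


if __name__ == "__main__":
    export = "psql -c 'COPY deals TO STDOUT' > deals.csv"
    logs = {f"run-{i}": [f"# attempt {i}\n  {export}\n"] for i in range(1, 6)}
    logs["run-6"] = ["ls -la"]
    for candidate in promotion_candidates(logs):
        print(candidate["runs"], candidate["script"])
```

Exact-match hashing only catches scripts that are identical after normalization. Scripts that differ in arguments or variable names would need a looser comparison, which is where a reviewing agent or a human still earns its place.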
This is the same maturity curve software has always followed — explore expensively, then optimize what repeats. What is different now is the bill. Token costs that look tolerable at ten runs a day become a line item the CFO notices at ten thousand, and the reckoning on token spend is coming faster than most agent roadmaps assume.
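The arithmetic is worth doing explicitly. The sketch below takes the $15-a-run figure from the CRM example and applies it, purely as an illustration, to both volumes. Real per-run costs vary by workflow, and `daily_cost_usd` is a hypothetical helper.

```python
def daily_cost_usd(cost_per_run_usd: int, runs_per_day: int) -> int:
    """Daily spend if every run costs the same amount."""
    return cost_per_run_usd * runs_per_day


AGENT_COST_PER_RUN_USD = 15  # figure from the CRM example; illustrative here

if __name__ == "__main__":
    for runs in (10, 10_000):
        cost = daily_cost_usd(AGENT_COST_PER_RUN_USD, runs)
        print(f"{runs:>6} runs/day -> ${cost:,} per day")
```

At ten runs a day that is $150. At ten thousand it is $150,000 a day for the same unoptimized loop.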
Production agents will not be pure LLM loops. They will be mixed flows: traditional software doing the repeatable ninety percent, model reasoning handling the ambiguous ten. If you want to know where you stand, take your most expensive agent, read its last ten runs, and count how many times it redid work it had already done in an earlier run. That number is your roadmap.
Key takeaways
- An agent that pulls an entire CRM through tool calls to rank deals costs dollars per run — the same answer from a function over Postgres costs fractions of a cent.
- Read your agent's logs and harden every repeated query, script, and transformation into a deterministic tool the agent calls.
- Production agent flows mix traditional software with model reasoning, reserving tokens for genuine ambiguity.
- The same curve applies to integrations — generic OAuth-aggregator connectors are fine for prototyping, but swap in native implementations once a connector proves low quality.
FAQ
Is it wrong to build agents with pure MCP tool calls?
No — it is the right way to prototype. Letting the agent reason freely over MCP tools shows you what the workflow actually requires. The mistake is shipping that prototype to production unchanged, where you pay model prices every run for work a function could do deterministically.
How do you decide which parts of an agent to harden into code?
Read the run logs. Wherever the model rebuilds the same script, query, or transformation across runs, that is a deterministic operation wearing an expensive costume. Extract it into a Python function or SQL query the agent calls as a tool, and keep reasoning for the genuinely ambiguous steps.
Should you use aggregator services for agent integrations or build native ones?
Start with the aggregator — it is an easy button that gets you dozens of OAuth connections without writing anything. But the quality varies connector by connector. When one proves mediocre in practice, build the native integration and point the agent at it. Prototype on the generic connector, productionize on the native one.