
A client came to me last year with what looked like a finished product. Their support agent was the best demo their team had ever shown. The goal was always production-ready AI agents, but what they actually had was a brilliant prototype. It read a customer's email, looked up their order, decided whether a refund was warranted, and issued it, all in one smooth loop. Leadership loved it, so they shipped it behind a feature flag to five percent of tickets. Within a week, their finance team flagged that three customers had been refunded twice for the same order, and one had been refunded for an order that never existed.
That gap, the one between a flawless demo and a system you can trust with real money and real customers, is where most agent projects quietly die. This is the story of how we closed it for them, and the engineering discipline that got us there.
The team's first instinct had been the obvious one. Tool calls to their order API occasionally timed out, the agent would throw, and the ticket would land in a dead-letter queue. So they wrapped the agent step in a retry. If it failed, run it again.
The duplicate refunds got worse, not better.
Here is what they had missed. When an agent step fails after a side effect has already happened (say the refund went through, but the API response timed out), retrying replays the side effect. The model had no memory that it had already issued the refund. From its perspective, every attempt was the first attempt.
The next thing they tried was a bigger model. Frontier model upgrades improved the reasoning, but reliability barely moved, because the failures were never reasoning failures. They then spent two weeks tightening prompts, adding "do not issue duplicate refunds" in increasingly stern language. It helped at the margins and plateaued fast. You cannot prompt your way out of a systems problem.
The breakthrough came when we stopped treating the agent as a magic box and started treating it like what it really is, a distributed system that happens to make decisions with an LLM. We added structured tracing around every single step. That meant the model's reasoning, the exact tool name and arguments it chose, the raw tool response, token counts, and latency.
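A minimal sketch of what one trace record per step might look like. The `StepTrace` shape, the `traceStep` helper, and the in-memory `sink` are assumptions for illustration, not the client's actual tracing stack, and the wrapper is synchronous for brevity where a real tool call would be awaited.

```typescript
// Hypothetical shape of one trace record per agent step (field names are assumptions).
interface StepTrace {
  ticketId: string;
  step: number;
  reasoning: string;   // the model's stated reasoning for this step
  tool: string;        // exact tool name the model chose
  args: unknown;       // exact arguments the model chose
  response?: unknown;  // raw tool response, if the call returned
  error?: string;      // error message, if the call threw
  tokens: number;      // token count reported for the model call
  latencyMs: number;   // wall-clock time spent in the tool call
}

// In-memory sink standing in for a real tracing backend.
const sink: StepTrace[] = [];

// Runs a tool call and records a trace whether it succeeds or throws.
function traceStep<T>(
  meta: Omit<StepTrace, "response" | "error" | "latencyMs">,
  call: () => T,
): T {
  const start = Date.now();
  try {
    const response = call();
    sink.push({ ...meta, response, latencyMs: Date.now() - start });
    return response;
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    sink.push({ ...meta, error, latencyMs: Date.now() - start });
    throw err; // tracing must never swallow the failure
  }
}
```

The point of recording on both paths is that a failed step leaves the same evidence as a successful one, so "agent failed" always comes with the tool, the arguments, and the error attached.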
Three patterns jumped out almost immediately:
The model occasionally hallucinated tool arguments, passing an order_id that did not exist because the customer had pasted a tracking number instead.
Roughly four percent of tool calls were timing out at the network layer, and the retry was replaying side effects on every one of them.
When a step failed for a genuinely unrecoverable reason, the agent often "recovered" by inventing a plausible-sounding response to the customer, a silent failure that looked like success in their metrics.
None of this had been visible before, because the system had logs that said "agent ran" and "agent failed," and nothing in between. The LLM was rarely the problem. The system around the LLM was missing every safeguard we would take for granted in any other production service.
The fix was not a clever prompt. It was applying boring, battle-tested distributed systems discipline to the agent loop. Four changes did the heavy lifting.
Every action that touched money or external state got an idempotency key derived from the ticket ID and the action type. The refund API now rejected a second call carrying the same key. A retry could replay the intent as many times as it liked, but the side effect happened exactly once. This single change eliminated duplicate refunds entirely.
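To make the mechanism concrete, here is a minimal sketch of the receiving side, assuming an in-memory store. The names `processed`, `idempotencyKey`, and `issueRefund` are hypothetical stand-ins for the real refund service. In this sketch a repeated key returns the original result flagged as a replay; an API could equally reject the call with an error, as long as no second refund is issued.

```typescript
// Hypothetical result shape for the refund service.
interface RefundResult {
  refundId: string;
  orderId: string;
  amountCents: number;
  replayed: boolean; // true when the key was seen before and no new refund was issued
}

// In-memory stand-in for the service's idempotency store.
const processed: Record<string, RefundResult> = {};
let refundsIssued = 0;

// Key derived from the ticket ID and the action type.
function idempotencyKey(ticketId: string, action: string): string {
  return `${ticketId}:${action}`;
}

function issueRefund(key: string, orderId: string, amountCents: number): RefundResult {
  const seen = processed[key];
  if (seen !== undefined) {
    // Same key as an earlier call: return the stored result, move no money.
    return { ...seen, replayed: true };
  }
  refundsIssued += 1;
  const result: RefundResult = {
    refundId: `rf_${refundsIssued}`,
    orderId,
    amountCents,
    replayed: false,
  };
  processed[key] = result;
  return result;
}
```

A production store would need to be durable and to handle two concurrent calls with the same key, which this sketch does not attempt.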
We stopped trusting the model's tool arguments blindly. Every tool got a schema, and arguments were validated, and where possible resolved against real data, before execution. If the model passed an order_id that did not exist, the tool returned a structured error the agent could reason about, rather than crashing or, worse, acting on garbage.
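A dependency-free sketch of that two-stage check follows: shape first, then resolution against real data. The `knownOrders` table, the argument names, the error codes, and the amount rule are all assumptions for illustration; in practice the lookup would hit the order API and the shape check would more likely come from a schema library.

```typescript
// Hypothetical order store; in practice this would be a lookup against the order API.
const knownOrders: Record<string, { totalCents: number }> = {
  "ORD-1001": { totalCents: 4999 },
};

type RefundArgCheck =
  | { status: "ok"; orderId: string; amountCents: number }
  | { status: "error"; code: "bad_shape" | "order_not_found" | "amount_exceeds_order"; message: string };

// Validates the shape of the model's refund arguments, then resolves them against real data.
function validateRefundArgs(args: unknown): RefundArgCheck {
  if (typeof args !== "object" || args === null) {
    return { status: "error", code: "bad_shape", message: "arguments must be an object" };
  }
  const { order_id, amount_cents } = args as Record<string, unknown>;
  if (
    typeof order_id !== "string" ||
    typeof amount_cents !== "number" ||
    !(amount_cents > 0) ||
    Math.floor(amount_cents) !== amount_cents
  ) {
    return { status: "error", code: "bad_shape", message: "order_id must be a string and amount_cents a positive integer" };
  }
  // Resolve against real data: the order has to exist before anything executes.
  const exists = Object.prototype.hasOwnProperty.call(knownOrders, order_id);
  if (!exists) {
    return { status: "error", code: "order_not_found", message: `no order with id ${order_id}` };
  }
  // Illustrative business check (an assumption): never refund more than the order total.
  if (amount_cents > knownOrders[order_id].totalCents) {
    return { status: "error", code: "amount_exceeds_order", message: "refund exceeds order total" };
  }
  return { status: "ok", orderId: order_id, amountCents: amount_cents };
}
```

Because the error is a plain value with a code and a message, it can be handed back to the model as a tool result, which gives the agent something concrete to reason about, such as asking the customer for the correct order number.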
We split failures into two buckets. Transient errors covered things like network timeouts and rate limits. Terminal errors covered things like an invalid order or a business rule violation. Only transient errors triggered a bounded, backed-off retry. Terminal errors were surfaced and never retried. That is the difference between resilience and a loop that confidently does the wrong thing five times in a row.
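One way the two buckets and the bounded backoff might be encoded is sketched below. The error codes, the 200 ms base delay, the 5 second cap, and the three-attempt limit are illustrative assumptions, and the sketch treats any unrecognized error as terminal so that unknown failures are never retried.

```typescript
// Hypothetical failure shape; a real tool wrapper would map HTTP statuses and exceptions onto codes.
interface ToolFailure {
  code: string;
  message: string;
}

// Codes considered safe to retry. Everything else is terminal by default.
const TRANSIENT_CODES = ["timeout", "rate_limited", "connection_reset"];

function isTransient(failure: ToolFailure): boolean {
  return TRANSIENT_CODES.indexOf(failure.code) !== -1;
}

// Exponential backoff with a cap: 200ms, 400ms, 800ms, ... up to maxMs. No jitter, for clarity.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

type RetryDecision = { action: "retry"; delayMs: number } | { action: "surface" };

// Decides what to do after the zero-based attempt `attempt` has failed.
function nextAction(failure: ToolFailure, attempt: number, maxAttempts = 3): RetryDecision {
  if (!isTransient(failure)) {
    return { action: "surface" }; // terminal: report it, never retry
  }
  if (attempt + 1 >= maxAttempts) {
    return { action: "surface" }; // transient, but the retry budget is spent
  }
  return { action: "retry", delayMs: backoffDelayMs(attempt) };
}
```

Keeping the decision in a pure function makes the retry policy easy to test on its own, separately from the loop that sleeps and calls the tool again.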
For any action above a risk threshold, such as refunds over a certain amount or account changes, the agent stopped and queued the action for human approval instead of executing on its own. Counterintuitively, this increased how much the client was willing to trust it with, because the blast radius of a mistake was now capped.
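The gate itself can be very small. The sketch below assumes a hypothetical `ProposedAction` shape, an in-memory `approvalQueue`, and an arbitrary threshold of 10,000 cents; the real threshold and action types were specific to the client.

```typescript
// Hypothetical shape of an action the agent wants to take.
interface ProposedAction {
  ticketId: string;
  type: "refund" | "account_change" | "reply";
  amountCents?: number;
}

// Illustrative threshold only; the right value is a business decision.
const REFUND_APPROVAL_THRESHOLD_CENTS = 10000;

// In-memory stand-in for a real review queue.
const approvalQueue: ProposedAction[] = [];

function requiresApproval(action: ProposedAction): boolean {
  if (action.type === "account_change") {
    return true;
  }
  if (action.type === "refund") {
    // A refund with no amount is treated as high risk rather than waved through.
    return action.amountCents === undefined || action.amountCents > REFUND_APPROVAL_THRESHOLD_CENTS;
  }
  return false;
}

// Executes low-risk actions directly and parks everything else for a human.
function dispatch(
  action: ProposedAction,
  execute: (a: ProposedAction) => void,
): "executed" | "queued" {
  if (requiresApproval(action)) {
    approvalQueue.push(action);
    return "queued";
  }
  execute(action);
  return "executed";
}
```

The check lives in ordinary code outside the model, so the cap on blast radius does not depend on the model following an instruction.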

The numbers told a clear story once the dust settled. Duplicate action incidents went from several a week to zero, and stayed there. The p99 latency stabilized, because the system was no longer silently retrying timed-out calls into oblivion. Most importantly, mean time to resolution for agent bugs dropped from days to under an hour, because every failure now arrived with a trace that told us exactly which step broke and why. We could replay a single failed ticket end to end instead of guessing.
Observability turned out to be the highest-leverage investment of the entire project. You cannot fix what you cannot see, and for the first month the team had been flying blind.
Reliability is not an engineering luxury. It is what unlocks scope. Once the agent stopped making expensive mistakes, the client was comfortable raising its autonomy from five percent of tickets to nearly sixty. That translated into a measurable drop in support cost per ticket and a response time their customers actually noticed.
Just as important, the budget caps and model routing we added (cheap models for triage, the expensive model only for genuinely ambiguous cases) kept their inference bill flat even as volume grew tenfold. For a founder, that is the difference between an AI feature that erodes margin and one that compounds it.
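A rough sketch of how routing and a budget cap can fit together. The confidence score, the 0.8 threshold, and the per-call costs are assumptions made for illustration; the actual routing signal and prices are not described here.

```typescript
type ModelTier = "cheap" | "expensive";

interface SpendBudget {
  limitCents: number;
  spentCents: number;
}

// Hypothetical flat per-call costs in cents; real costs depend on provider pricing and token counts.
const COST_CENTS: Record<ModelTier, number> = { cheap: 1, expensive: 20 };

// Tickets the triage step is confident about stay on the cheap model; ambiguous ones escalate.
function chooseModel(triageConfidence: number, threshold = 0.8): ModelTier {
  return triageConfidence >= threshold ? "cheap" : "expensive";
}

// Picks a model and charges the budget, refusing the call rather than overspending.
function route(budget: SpendBudget, triageConfidence: number): ModelTier | "over_budget" {
  const tier = chooseModel(triageConfidence);
  const cost = COST_CENTS[tier];
  if (budget.spentCents + cost > budget.limitCents) {
    return "over_budget";
  }
  budget.spentCents += cost;
  return tier;
}
```

What should happen on `over_budget` (queue the ticket, hand it to a human, or fall back to the cheap model) is a product decision this sketch leaves open.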
The lesson I keep coming back to is simple. The model is the easy part. Production-ready AI agents are won or lost on the unglamorous engineering around the model: idempotency, validation, error classification, and observability. Treat your agent like a distributed system, because that is exactly what it is.
// Naive agent loop: retries replay side effects, no validation
async function runAgentStep(ticket: Ticket) {
  const decision = await llm.run(agentPrompt(ticket));
  // The model picked a tool and arguments, and we trust them blindly
  const { tool, args } = decision;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await tools[tool](args); // issueRefund(args) runs AGAIN on every retry
    } catch (err) {
      // Retry everything, including actions that already succeeded
      if (attempt === 2) throw err;
    }
  }
}
Why it fails in production: if issueRefund succeeds but the network response times out, the catch block retries the tool call and issues a second refund. There is no schema check, so hallucinated arguments execute directly, and there is no trace, so the failure stays invisible until finance notices the money.
// Production loop: idempotency, validation, typed errors, observability
async function runAgentStep(ticket: Ticket, span: Trace) {
  const decision = await llm.run(agentPrompt(ticket));
  const { tool, args } = decision;
  // 1. Validate arguments against a schema before doing anything
  const parsed = toolSchemas[tool].safeParse(args);
  if (!parsed.success) {
    span.record("invalid_args", parsed.error);
    return { status: "tool_error", error: parsed.error }; // hand back to the model
  }
  // 2. Idempotency key makes side effects exactly once
  const idempotencyKey = `${ticket.id}:${tool}`;
  let attempt = 0;
  while (attempt < 3) {
    try {
      span.record("tool_call", { tool, attempt });
      return await tools[tool](parsed.data, { idempotencyKey });
    } catch (err) {
      // 3. Only retry transient errors, never business or terminal ones
      if (!isTransient(err) || attempt === 2) {
        span.record("tool_failed", { tool, err });
        throw err;
      }
      await backoff(attempt++);
    }
  }
}
In plain language: validate first so garbage never executes, stamp every side effect with an idempotency key so a retry becomes a no-op instead of a duplicate, retry only the errors that are actually safe to retry, and emit a trace span at every step so failures show up in seconds rather than days.
