
AI Development

How I Shipped Production-Ready AI Agents for a Client

production engineering
ai agents
agentic workflows
tool calling
idempotency
ai reliability
llm observability
distributed systems
Jun 11, 2026

When a Client's Refund Bot Started Paying People Twice

A client came to me last year with what looked like a finished product. Their support agent was the best demo their team had ever shown. The goal was always production-ready AI agents, but what they actually had was a brilliant prototype. It read a customer's email, looked up their order, decided whether a refund was warranted, and issued it, all in one smooth loop. Leadership loved it, so they shipped it behind a feature flag to five percent of tickets. Within a week, their finance team flagged that three customers had been refunded twice for the same order, and one had been refunded for an order that never existed.

That gap, the one between a flawless demo and a system you can trust with real money and real customers, is where most agent projects quietly die. This is the story of how we closed it for them, and the engineering discipline that got us there.

Why "Just Add a Retry" Made Everything Worse

The team's first instinct had been the obvious one. Tool calls to their order API occasionally timed out, the agent would throw, and the ticket would land in a dead-letter queue. So they wrapped the agent step in a retry. If it failed, run it again.

The duplicate refunds got worse, not better.

Here is what they had missed. When an agent step fails after a side effect has already happened (the refund went through, but the API response timed out), retrying replays the side effect. The model had no memory that it had already issued the refund. From its perspective, every attempt was the first attempt.

The next thing they tried was a bigger model. Frontier model upgrades improved the reasoning, but reliability barely moved, because the failures were never reasoning failures. They then spent two weeks tightening prompts, adding "do not issue duplicate refunds" in increasingly stern language. It helped at the margins and plateaued fast. You cannot prompt your way out of a systems problem.

What the Traces Actually Revealed

The breakthrough came when we stopped treating the agent as a magic box and started treating it as what it really is: a distributed system that happens to make decisions with an LLM. We added structured tracing around every single step, capturing the model's reasoning, the exact tool name and arguments it chose, the raw tool response, token counts, and latency.
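To make that concrete, here is a minimal sketch of what one trace record per step might look like. The `StepTrace` shape, the `recordStep` helper, and the in-memory array are hypothetical names for illustration, not the client's actual schema or tracing backend.

```typescript
// Hypothetical shape for one traced agent step; field names are illustrative.
interface StepTrace {
  ticketId: string;
  step: number;
  reasoning: string;        // the model's stated rationale for this step
  tool: string;             // exact tool name the model chose
  args: unknown;            // exact arguments the model chose
  rawResponse: unknown;     // unmodified tool response (or error payload)
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  outcome: "ok" | "error";
}

// In-memory sink for the sketch; a real system would export to a tracing backend.
const traces: StepTrace[] = [];

function recordStep(trace: StepTrace): void {
  traces.push(trace);
}

// Collect every step for one ticket, in step order, so a failure can be replayed.
function stepsForTicket(ticketId: string): StepTrace[] {
  return traces
    .filter((t) => t.ticketId === ticketId)
    .sort((a, b) => a.step - b.step);
}
```

The point of the record is that "agent failed" becomes "step 3 called this tool with these arguments and got this response", which is the level of detail the patterns below depended on.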

Three patterns jumped out almost immediately:

  • The model occasionally hallucinated tool arguments, passing an order_id that did not exist because the customer had pasted a tracking number instead.

  • Roughly four percent of tool calls were timing out at the network layer, and the retry was replaying side effects on every one of them.

  • When a step failed for a genuinely unrecoverable reason, the agent often "recovered" by inventing a plausible-sounding response to the customer, a silent failure that looked like success in their metrics.

None of this had been visible before, because the system had logs that said "agent ran" and "agent failed," and nothing in between. The LLM was rarely the problem. The system around the LLM was missing every safeguard we would take for granted in any other production service.

Treating the Agent Like a Distributed System

The fix was not a clever prompt. It was applying boring, battle-tested distributed systems discipline to the agent loop. Four changes did the heavy lifting.

Idempotency on every side effect

Every action that touched money or external state got an idempotency key derived from the ticket ID and the action type. The refund API now rejected a second call carrying the same key. A retry could replay the intent as many times as it liked, but the side effect happened exactly once. This single change eliminated duplicate refunds entirely.
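A minimal sketch of the receiving side, assuming an in-memory store and made-up names (`processed`, `issueRefund`, `refundCounter`). Note one deliberate difference: this sketch returns the stored result when it sees a repeated key, whereas the client's API rejected the repeat. Either way, the money moves once.

```typescript
// Hypothetical refund record and in-memory store, for illustration only.
interface RefundResult {
  refundId: string;
  orderId: string;
  amountCents: number;
}

const processed = new Map<string, RefundResult>();
let refundCounter = 0; // stands in for the number of real money movements

// Key derived from the ticket ID and the action type.
function idempotencyKey(ticketId: string, action: string): string {
  return `${ticketId}:${action}`;
}

// Executes the refund only if the key is unseen; otherwise returns the stored result.
function issueRefund(key: string, orderId: string, amountCents: number): RefundResult {
  const existing = processed.get(key);
  if (existing) return existing; // replayed intent, no second side effect

  refundCounter += 1;
  const result: RefundResult = { refundId: `rf_${refundCounter}`, orderId, amountCents };
  processed.set(key, result);
  return result;
}
```

In a real service the check and the write would need to be atomic and durable (for example, a unique constraint on the key in the database), since an in-memory map does not survive restarts or protect against concurrent requests.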

Validate tool arguments before executing

We stopped trusting the model's tool arguments blindly. Every tool got a schema, and arguments were validated (and, where possible, resolved against real data) before execution. If the model passed an order_id that did not exist, the tool returned a structured error the agent could reason about, rather than crashing or, worse, acting on garbage.
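Here is a rough sketch of that two-stage check, written by hand rather than with a schema library so it stays self-contained. The `ToolResult` type, the error codes, and the `knownOrders` set are assumptions made for the example.

```typescript
// Hypothetical structured result a tool hands back to the agent.
type ToolResult<T> =
  | { status: "ok"; data: T }
  | { status: "tool_error"; code: string; message: string };

interface RefundArgs {
  orderId: string;
  amountCents: number;
}

// Stand-in for a real order lookup; the IDs are made up.
const knownOrders = new Set(["ord_1001", "ord_1002"]);

function validateRefundArgs(args: unknown): ToolResult<RefundArgs> {
  // Stage 1: shape check, before touching any real data.
  if (typeof args !== "object" || args === null) {
    return { status: "tool_error", code: "invalid_shape", message: "Arguments must be an object." };
  }
  const { orderId, amountCents } = args as Record<string, unknown>;
  if (
    typeof orderId !== "string" ||
    typeof amountCents !== "number" ||
    !Number.isInteger(amountCents) ||
    amountCents <= 0
  ) {
    return {
      status: "tool_error",
      code: "invalid_shape",
      message: "Expected orderId: string and amountCents: positive integer.",
    };
  }
  // Stage 2: resolve against real data. A well-formed ID can still point at nothing.
  if (!knownOrders.has(orderId)) {
    return { status: "tool_error", code: "order_not_found", message: `No order with id ${orderId}.` };
  }
  return { status: "ok", data: { orderId, amountCents } };
}
```

Returning an error value instead of throwing is the design choice that matters here: the agent gets a message it can act on, such as asking the customer for the correct order number.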

Classify errors, then retry only the safe ones

We split failures into two buckets. Transient errors covered things like network timeouts and rate limits. Terminal errors covered things like an invalid order or a business rule violation. Only transient errors triggered a bounded retry with backoff. Terminal errors were surfaced and never retried. That is the difference between resilience and a loop that confidently does the wrong thing five times in a row.
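One possible shape for the classification and backoff helpers, as a sketch. The error codes in `TRANSIENT_CODES` and the delay values are assumptions; a real integration would map them from whatever the order API and HTTP client actually return.

```typescript
// Illustrative set of codes treated as safe to retry; not from a real SDK.
const TRANSIENT_CODES = new Set(["timeout", "rate_limited", "connection_reset"]);

// Read a string `code` property off an unknown error, if it has one.
function errorCode(err: unknown): string | undefined {
  if (typeof err === "object" && err !== null && "code" in err) {
    const code = (err as { code: unknown }).code;
    return typeof code === "string" ? code : undefined;
  }
  return undefined;
}

// Anything not explicitly known to be transient is treated as terminal.
function isTransient(err: unknown): boolean {
  const code = errorCode(err);
  return code !== undefined && TRANSIENT_CODES.has(code);
}

// Exponential backoff delay: 200ms, 400ms, 800ms, ... capped at 5s.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Resolves after the computed delay for this attempt.
function backoff(attempt: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
}
```

Defaulting unknown errors to terminal is the conservative choice: an unclassified failure gets surfaced to a human rather than replayed.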

Make low-confidence actions ask for help

For any action above a risk threshold, such as refunds over a certain amount or account changes, the agent stopped and queued the action for human approval instead of executing on its own. Counterintuitively, this increased how much the client was willing to trust it with, because the blast radius of a mistake was now capped.
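A minimal sketch of such a gate, assuming a hypothetical `ProposedAction` shape and a placeholder limit. The real thresholds were business decisions and are not reproduced here.

```typescript
// Hypothetical action shape; the action types are illustrative.
interface ProposedAction {
  type: "refund" | "account_change" | "send_reply";
  amountCents?: number;
}

type Decision =
  | { route: "execute" }
  | { route: "human_approval"; reason: string };

const REFUND_AUTO_LIMIT_CENTS = 5000; // assumed cap, to be tuned per business

function gate(action: ProposedAction): Decision {
  if (action.type === "account_change") {
    return { route: "human_approval", reason: "account changes always need review" };
  }
  if (action.type === "refund") {
    // A refund with no amount is treated as high risk rather than waved through.
    if (action.amountCents === undefined || action.amountCents > REFUND_AUTO_LIMIT_CENTS) {
      return { route: "human_approval", reason: "refund amount missing or above auto-approve limit" };
    }
  }
  return { route: "execute" };
}
```

Because the gate is plain code that runs after the model has chosen an action, the cap holds no matter what the model decides.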

[Figure: Production AI agent architecture diagram]

The Engineering Impact

The numbers told a clear story once the dust settled. Duplicate action incidents went from several a week to zero, and stayed there. The p99 latency stabilized, because the system was no longer silently retrying timed-out calls into oblivion. Most importantly, mean time to resolution for agent bugs dropped from days to under an hour, because every failure now arrived with a trace that told us exactly which step broke and why. We could replay a single failed ticket end to end instead of guessing.

Observability turned out to be the highest-leverage investment of the entire project. You cannot fix what you cannot see, and for the first month the team had been flying blind.

The Business Impact

Reliability is not an engineering luxury. It is what unlocks scope. Once the agent stopped making expensive mistakes, the client was comfortable raising its autonomy from five percent of tickets to nearly sixty. That translated into a measurable drop in support cost per ticket and a response time their customers actually noticed.

Just as important, the budget caps and model routing we added (cheap models for triage, the expensive model only for genuinely ambiguous cases) kept their inference bill flat even as volume grew tenfold. For a founder, that is the difference between an AI feature that erodes margin and one that compounds it.
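The routing idea can be sketched in a few lines. Everything named here is an assumption for illustration: the confidence score from a triage pass, the escalation threshold, and the per-call cost estimate are placeholders, not the client's real figures.

```typescript
// Hypothetical model tiers; names are placeholders, not real model IDs.
type Tier = "cheap" | "expensive";

interface RoutingInput {
  triageConfidence: number; // 0..1, from the cheap model's first pass
  spentCents: number;       // inference spend so far in this budget window
  budgetCents: number;      // cap for the window
}

const ESCALATION_THRESHOLD = 0.8;    // assumed: below this, the case counts as ambiguous
const EXPENSIVE_CALL_COST_CENTS = 3; // assumed per-call estimate

function chooseTier(input: RoutingInput): Tier {
  const ambiguous = input.triageConfidence < ESCALATION_THRESHOLD;
  const affordable = input.spentCents + EXPENSIVE_CALL_COST_CENTS <= input.budgetCents;
  // Escalate only when the case is ambiguous AND the budget allows it.
  return ambiguous && affordable ? "expensive" : "cheap";
}
```

In this sketch an ambiguous case that would exceed the budget falls back to the cheap tier. Depending on the stakes, routing it to a human queue instead may be the safer fallback.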

The lesson I keep coming back to is simple. The model is the easy part. Production-ready AI agents are won or lost on the unglamorous engineering around the model: idempotency, validation, error classification, and observability. Treat your agent like a distributed system, because that is exactly what it is.


Common wrong implementation

// Naive agent loop: retries replay side effects, no validation
async function runAgentStep(ticket: Ticket) {
  const decision = await llm.run(agentPrompt(ticket));

  // The model picked a tool and arguments, and we trust them blindly
  const { tool, args } = decision;

  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await tools[tool](args); // issueRefund(args) runs AGAIN on every retry
    } catch (err) {
      // Retry everything, including actions that already succeeded
      if (attempt === 2) throw err;
    }
  }
}

Why it fails in production: if issueRefund succeeds but the network response times out, the catch block retries the whole step and issues a second refund. There is no schema check, so hallucinated arguments execute directly, and there is no trace, so the failure stays invisible until finance notices the money.

Production-ready solution

// Production loop: idempotency, validation, typed errors, observability
async function runAgentStep(ticket: Ticket, span: Trace) {
  const decision = await llm.run(agentPrompt(ticket));
  const { tool, args } = decision;

  // 1. Validate arguments against a schema before doing anything
  const parsed = toolSchemas[tool].safeParse(args);
  if (!parsed.success) {
    span.record("invalid_args", parsed.error);
    return { status: "tool_error", error: parsed.error }; // hand back to the model
  }

  // 2. The idempotency key lets the tool run the side effect once, even if the call is replayed
  const idempotencyKey = `${ticket.id}:${tool}`;

  let attempt = 0;
  while (attempt < 3) {
    try {
      span.record("tool_call", { tool, attempt });
      return await tools[tool](parsed.data, { idempotencyKey });
    } catch (err) {
      // 3. Only retry transient errors, never business or terminal ones
      if (!isTransient(err) || attempt === 2) {
        span.record("tool_failed", { tool, err });
        throw err;
      }
      await backoff(attempt++);
    }
  }
}

In plain language: validate first so garbage never executes, stamp every side effect with an idempotency key so a retry becomes a no-op instead of a duplicate, retry only the errors that are actually safe to retry, and emit a trace span at every step so failures show up in seconds rather than days.

[Figure: Naive vs. production AI agent retry flow]

Suggested Articles

  • Learn when to use autonomous systems versus structured automation in AI Agents vs AI Workflows Architecture Step by Step Guide.

  • Discover how to build scalable AI automation pipelines in Building Production-Ready AI Workflows with n8n, OpenAI, and Vector Databases.

  • Understand value-based pricing strategies in How Service-Based IT Companies Stop Undervaluing Their Work.

  • Explore common performance bottlenecks and scaling challenges in Why Most Next.js Apps Become Slow Over Time.


External Links

  • AWS Builder's Library, Making retries safe with idempotent APIs

  • Stripe Docs, Idempotent requests

  • Google SRE Book, Addressing cascading failures

  • OpenTelemetry, Documentation

  • Anthropic, Tool use (function calling) overview
