Frequently Asked Questions

Should AI agents act autonomously or require human approval?

It depends on blast radius. Low-risk, easily reversible actions can run fully autonomously. High-risk or irreversible actions, such as large refunds, account changes, or anything touching money or permissions, should be checked against a confidence or value threshold and queued for human approval when they cross it. Counterintuitively, adding a human-in-the-loop escape hatch usually lets you grant the agent more overall autonomy, because the cost of its worst mistake is capped. Start conservative, watch the traces, and widen autonomy as the data earns your trust.

How much observability do AI agents actually need?

More than most teams start with. At minimum, capture a trace for every step that includes the model's reasoning, the exact tool name and arguments chosen, the raw tool result, token counts, latency, and cost. Without this, "the agent failed" tells you nothing actionable, and silent failures look identical to successes in your metrics. Good tracing can turn mean time to resolution from days into minutes, because you can replay a single failed run end to end. It is typically the highest-leverage investment in the entire project.

What is the difference between transient and terminal errors in agent retries?

Transient errors are temporary and likely to succeed if you try again, such as network timeouts, throttling, or a briefly unavailable dependency. Terminal errors are deterministic failures that will never succeed on retry, such as an invalid order ID, a validation failure, or a business rule that forbids the action. Retrying transient errors with exponential backoff improves resilience. Retrying terminal errors just burns time and tokens while the agent confidently repeats a mistake. Classify the error first, retry only the transient class, and surface terminal errors back to the model or a human.

How do you stop an AI agent from repeating actions like duplicate payments?

Use idempotency keys. Derive a stable key for each side-effecting action, for example by combining the request or ticket ID with the action type, and pass it to the downstream API. The API records the key on first success and treats any later call with the same key as a no-op, returning the original result. This makes retries safe. The agent can replay its intent as many times as needed, but the actual side effect happens exactly once. Most payment and provisioning APIs support an idempotency key header for precisely this reason.

Why do AI agents fail in production when they work fine in a demo?

Demos run on the happy path, with clean inputs, fast networks, and one request at a time. Production introduces malformed inputs, timeouts, rate limits, concurrency, and edge cases the model never saw. The failures are usually not reasoning failures, since the LLM is mostly fine. They are systems failures. A tool call times out after a side effect lands, a hallucinated argument executes unchecked, or a step silently "recovers" by inventing a response. Treating the agent as a distributed system and adding idempotency, validation, error classification, and tracing closes most of the gap.

AI Development

How I Shipped Production-Ready AI Agents for a Client

production engineering

Jun 11, 2026

When a Client's Refund Bot Started Paying People Twice

A client came to me last year with what looked like a finished product. Their support agent was the best demo their team had ever shown. It read a customer's email, looked up their order, decided whether a refund was warranted, and issued it, all in one smooth loop. The goal was always production-ready AI agents, but what they actually had was a brilliant prototype. Leadership loved it, so they shipped it behind a feature flag to five percent of tickets. Within a week, their finance team flagged that three customers had been refunded twice for the same order, and one had been refunded for an order that never existed.

That gap, the one between a flawless demo and a system you can trust with real money and real customers, is where most agent projects quietly die. This is the story of how we closed it for them, and the engineering discipline that got us there.

Why "Just Add a Retry" Made Everything Worse

The team's first instinct had been the obvious one. Tool calls to their order API occasionally timed out, the agent would throw, and the ticket would land in a dead-letter queue. So they wrapped the agent step in a retry. If it failed, run it again.

The duplicate refunds got worse, not better.

Here is what they had missed. When an agent step fails after a side effect has already happened (the refund went through, but the API response timed out), retrying replays the side effect. The model had no memory that it had already issued the refund. From its perspective, every attempt was the first attempt.

The next thing they tried was a bigger model. Frontier model upgrades improved the reasoning, but reliability barely moved, because the failures were never reasoning failures. They then spent two weeks tightening prompts, adding "do not issue duplicate refunds" in increasingly stern language. It helped at the margins and plateaued fast. You cannot prompt your way out of a systems problem.

What the Traces Actually Revealed

The breakthrough came when we stopped treating the agent as a magic box and started treating it like what it really is, a distributed system that happens to make decisions with an LLM. We added structured tracing around every single step. That meant the model's reasoning, the exact tool name and arguments it chose, the raw tool response, token counts, and latency.

Three patterns jumped out almost immediately:

The model occasionally hallucinated tool arguments, passing an order_id that did not exist because the customer had pasted a tracking number instead.
Roughly four percent of tool calls were timing out at the network layer, and the retry was replaying side effects on every one of them.
When a step failed for a genuinely unrecoverable reason, the agent often "recovered" by inventing a plausible-sounding response to the customer, a silent failure that looked like success in their metrics.

None of this had been visible before, because the system had logs that said "agent ran" and "agent failed," and nothing in between. The LLM was rarely the problem. The system around the LLM was missing every safeguard we would take for granted in any other production service.
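The per-step trace described above can be sketched as a plain record plus a small log that supports replaying one ticket. This is a minimal illustration, not the client's actual tracing code: the `StepTrace` shape, the `TraceLog` class, and the `outcome` values are all hypothetical names chosen for this example, and a real system would more likely ship these records to a tracing backend than keep them in memory.

```typescript
// Hypothetical shape for one traced agent step (illustrative field names).
type StepTrace = {
  ticketId: string;
  step: number;
  reasoning: string; // the model's stated reasoning for this step
  tool: string; // exact tool name the model chose
  args: unknown; // exact arguments the model chose
  rawResult: unknown; // raw tool response, before any post-processing
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  outcome: "ok" | "tool_error" | "failed";
};

// In-memory stand-in for a trace store.
class TraceLog {
  private steps: StepTrace[] = [];

  record(step: StepTrace): void {
    this.steps.push(step);
  }

  // All steps for one ticket in step order, so a failed run can be replayed end to end.
  forTicket(ticketId: string): StepTrace[] {
    return this.steps
      .filter((s) => s.ticketId === ticketId)
      .sort((a, b) => a.step - b.step);
  }

  // Steps that did not succeed, instead of a bare "agent failed" log line.
  failures(): StepTrace[] {
    return this.steps.filter((s) => s.outcome !== "ok");
  }
}
```

Recording the raw tool result next to the model's chosen arguments is what makes a pasted tracking number or an invented reply visible after the fact.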

Treating the Agent Like a Distributed System

The fix was not a clever prompt. It was applying boring, battle-tested distributed systems discipline to the agent loop. Four changes did the heavy lifting.

Idempotency on every side effect

Every action that touched money or external state got an idempotency key derived from the ticket ID and the action type. The refund API now recognized a second call carrying the same key and returned the original result instead of issuing another refund. A retry could replay the intent as many times as it liked, but the side effect happened exactly once. This single change eliminated duplicate refunds entirely.
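To make the mechanism concrete, here is a minimal sketch of both halves: deriving the key on the agent side and honoring it on the service side. Everything in it is an assumption for illustration. `RefundRequest`, `RefundResult`, `RefundService`, and `idempotencyKey` are hypothetical names, and the in-memory `Map` stands in for whatever durable storage a real refund API would use. It shows one common behavior, returning the original result on a replayed key.

```typescript
// Hypothetical request and result shapes; not the client's actual API.
type RefundRequest = { ticketId: string; orderId: string; amountCents: number };
type RefundResult = { refundId: string; amountCents: number };

// The same ticket and action type always produce the same key.
function idempotencyKey(ticketId: string, action: string): string {
  return `${ticketId}:${action}`;
}

// In-memory stand-in for the downstream refund service.
class RefundService {
  private completed = new Map<string, RefundResult>();
  private issued = 0;

  issueRefund(req: RefundRequest, key: string): RefundResult {
    const prior = this.completed.get(key);
    if (prior !== undefined) {
      return prior; // replayed key: return the original result, no new side effect
    }
    this.issued += 1; // the actual side effect happens only on the first call
    const result: RefundResult = {
      refundId: `rf_${this.issued}`,
      amountCents: req.amountCents,
    };
    this.completed.set(key, result);
    return result;
  }

  get refundsIssued(): number {
    return this.issued;
  }
}
```

In a real service the lookup and the write would need to be atomic and durable, for example a unique constraint on the key, so that two concurrent retries cannot both pass the check.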

Validate tool arguments before executing

We stopped trusting the model's tool arguments blindly. Every tool got a schema, and arguments were validated, and where possible resolved against real data, before execution. If the model passed an order_id that did not exist, the tool returned a structured error the agent could reason about, rather than crashing or, worse, acting on garbage.
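A sketch of that two-stage check, shape first and then existence, might look like the following. It is hand-rolled so that it stays self-contained; the production example later in this article uses a schema library style `safeParse` call instead. The `ORD-<digits>` format, the `knownOrders` set, and the error codes are all invented for illustration.

```typescript
// Hypothetical shapes for illustration only.
type ToolError = { status: "tool_error"; code: string; message: string };
type RefundArgs = { orderId: string; amountCents: number };
type Validated<T> = { ok: true; value: T } | { ok: false; error: ToolError };

// Stand-in for a lookup against the real order store.
const knownOrders = new Set<string>(["ORD-1001", "ORD-1002"]);

function toolError(code: string, message: string): Validated<never> {
  return { ok: false, error: { status: "tool_error", code, message } };
}

function validateRefundArgs(args: unknown): Validated<RefundArgs> {
  // Stage 1: shape and format checks on whatever the model produced.
  if (typeof args !== "object" || args === null) {
    return toolError("bad_shape", "arguments must be an object");
  }
  const { orderId, amountCents } = args as Record<string, unknown>;
  if (typeof orderId !== "string" || !/^ORD-\d+$/.test(orderId)) {
    return toolError("bad_order_id", "orderId must look like ORD-<digits>");
  }
  if (typeof amountCents !== "number" || !Number.isInteger(amountCents) || amountCents <= 0) {
    return toolError("bad_amount", "amountCents must be a positive integer");
  }
  // Stage 2: resolve against real data so a well-formed but nonexistent ID is caught.
  if (!knownOrders.has(orderId)) {
    return toolError("order_not_found", `no order with id ${orderId}`);
  }
  return { ok: true, value: { orderId, amountCents } };
}
```

The structured error is returned rather than thrown, so the agent loop can hand it back to the model as an ordinary tool result and let it ask the customer for the right identifier.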

Classify errors, then retry only the safe ones

We split failures into two buckets. Transient errors covered things like network timeouts and rate limits. Terminal errors covered things like an invalid order or a business rule violation. Only transient errors triggered a bounded, backed-off retry. Terminal errors were surfaced and never retried. That is the difference between resilience and a loop that confidently does the wrong thing five times in a row.
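One way to express the two buckets in code is a classifier plus a bounded retry wrapper, sketched below. The specific status codes and error codes treated as transient, the 200 ms base delay, and the 5 second cap are assumptions picked for the example, not values from the client's system. Treating anything unrecognized as terminal is also a design choice made here: it prefers failing loudly over looping.

```typescript
type ErrorClass = "transient" | "terminal";

// Hypothetical failure shape: an HTTP-like status or a Node-style error code.
type ToolFailure = { status?: number; code?: string };

const TRANSIENT_STATUS = new Set<number>([408, 429, 502, 503, 504]);
const TRANSIENT_CODES = new Set<string>(["ETIMEDOUT", "ECONNRESET"]);

function classify(err: unknown): ErrorClass {
  if (typeof err !== "object" || err === null) return "terminal";
  const { status, code } = err as ToolFailure;
  if (status !== undefined && TRANSIENT_STATUS.has(status)) return "transient";
  if (code !== undefined && TRANSIENT_CODES.has(code)) return "transient";
  return "terminal"; // unknown failures are not retried
}

// Exponential backoff: 200 ms, 400 ms, 800 ms, ... capped at capMs.
function backoffMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Bounded retry: only transient failures are retried, at most maxAttempts calls in total.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const isLastAttempt = attempt === maxAttempts - 1;
      if (classify(err) === "terminal" || isLastAttempt) throw err;
      await sleep(backoffMs(attempt));
    }
  }
}
```

A retry wrapper like this is only safe for side-effecting calls when it is paired with idempotency keys, since a timeout cannot tell you whether the side effect already landed.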

Make low-confidence actions ask for help

For any action above a risk threshold, such as refunds over a certain amount or account changes, the agent stopped and queued the action for human approval instead of executing on its own. Counterintuitively, this increased how much the client was willing to trust it with, because the blast radius of a mistake was now capped.
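The gate itself can be a small pure function that runs before any tool executes. The sketch below is illustrative only: the action kinds, the `routeAction` name, and the 5000 cent auto-approval limit are hypothetical, since the article does not state the client's real thresholds.

```typescript
// Hypothetical action shapes for illustration.
type ProposedAction =
  | { kind: "refund"; amountCents: number }
  | { kind: "account_change"; field: string }
  | { kind: "send_reply"; body: string };

type Route = "execute" | "needs_approval";

// Assumed policy value: refunds above this amount wait for a human.
const REFUND_AUTO_LIMIT_CENTS = 5000;

function routeAction(action: ProposedAction): Route {
  switch (action.kind) {
    case "refund":
      return action.amountCents > REFUND_AUTO_LIMIT_CENTS ? "needs_approval" : "execute";
    case "account_change":
      return "needs_approval"; // hard to reverse, so always reviewed
    case "send_reply":
      return "execute"; // low risk and easy to correct
  }
}
```

Keeping the policy outside the prompt means the threshold can be raised as the traces earn trust, without touching the model's instructions.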

[Figure: production AI agent architecture diagram]

The Engineering Impact

The numbers told a clear story once the dust settled. Duplicate action incidents went from several a week to zero, and stayed there. The p99 latency stabilized, because the system was no longer silently retrying timed-out calls into oblivion. Most importantly, mean time to resolution for agent bugs dropped from days to under an hour, because every failure now arrived with a trace that told us exactly which step broke and why. We could replay a single failed ticket end to end instead of guessing.

Observability turned out to be the highest-leverage investment of the entire project. You cannot fix what you cannot see, and for the first month the team had been flying blind.

The Business Impact

Reliability is not an engineering luxury. It is what unlocks scope. Once the agent stopped making expensive mistakes, the client was comfortable raising its coverage from five percent of tickets to nearly sixty percent. That translated into a measurable drop in support cost per ticket and a response time their customers actually noticed.

Just as important, the budget caps and model routing we added (cheap models for triage, the expensive model only for genuinely ambiguous cases) kept their inference bill flat even as volume grew tenfold. For a founder, that is the difference between an AI feature that erodes margin and one that compounds it.

The lesson I keep coming back to is simple. The model is the easy part. Production-ready AI agents are won or lost on the unglamorous engineering around the model: idempotency, validation, error classification, and observability. Treat your agent like a distributed system, because that is exactly what it is.

Common wrong implementation

// Naive agent loop: retries replay side effects, no validation
async function runAgentStep(ticket: Ticket) {
  const decision = await llm.run(agentPrompt(ticket));

  // The model picked a tool and arguments, and we trust them blindly
  const { tool, args } = decision;

  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await tools[tool](args); // issueRefund(args) runs AGAIN on every retry
    } catch (err) {
      // Retry everything, including actions that already succeeded
      if (attempt === 2) throw err;
    }
  }
}

Why it fails in production: if issueRefund succeeds but the network response times out, the catch block retries the tool call and issues a second refund. There is no schema check, so hallucinated arguments execute directly, and there is no trace, so the failure stays invisible until finance notices the missing money.

Production-ready solution

// Production loop: idempotency, validation, typed errors, observability.
// llm, tools, toolSchemas, isTransient, and backoff are assumed to be defined elsewhere.
const MAX_ATTEMPTS = 3;

async function runAgentStep(ticket: Ticket, span: Trace) {
  const decision = await llm.run(agentPrompt(ticket));
  const { tool, args } = decision;

  // 1. Validate the tool name and its arguments before doing anything
  const schema = toolSchemas[tool];
  if (!schema) {
    span.record("unknown_tool", { tool });
    return { status: "tool_error", error: `unknown tool: ${tool}` }; // hand back to the model
  }
  const parsed = schema.safeParse(args);
  if (!parsed.success) {
    span.record("invalid_args", parsed.error);
    return { status: "tool_error", error: parsed.error }; // hand back to the model
  }

  // 2. A stable idempotency key lets the downstream API apply the side effect only once
  const idempotencyKey = `${ticket.id}:${tool}`;

  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      span.record("tool_call", { tool, attempt });
      return await tools[tool](parsed.data, { idempotencyKey });
    } catch (err) {
      // 3. Retry only transient errors, and only while attempts remain
      if (!isTransient(err) || attempt === MAX_ATTEMPTS - 1) {
        span.record("tool_failed", { tool, attempt, err });
        throw err;
      }
      await backoff(attempt);
    }
  }
}

In plain language: validate first so garbage never executes, stamp every side effect with an idempotency key so a retry becomes a no-op instead of a duplicate, retry only the errors that are actually safe to retry, and emit a trace span at every step so failures show up in seconds rather than days.

