
Agentic Workflow Systems In Production: A Field Report

Demo agents and production agents are different species. Most "agentic" frameworks fall apart the moment they touch real operations. Here is what survives.

Most agent demos are theater. They run a tidy three-step task in a sandbox, the model picks the right tool every time, the failure modes are hidden by the camera angle, and the demo ends before anyone has to handle a tool timeout, a hallucinated argument, or a state-machine race condition. Production agents are a different animal. They run thousands or millions of times a day, against tools that fail, in environments where partial completion is worse than no completion, and they have to be observable and reversible.

I have shipped agentic systems in regulated industries, in advertising, in operations workflows that touch real money. The patterns that work are not the ones in the popular frameworks. The frameworks optimize for the demo. Production optimizes for the long tail.

What "agentic" actually means in production

Strip the marketing. An agentic system in production is a control loop where a language model selects from a set of tools, takes an action, observes the result, and decides the next action — all within bounds defined by code that the model cannot violate. The model is the planner. Code is the runtime. The interesting engineering work is almost entirely in the runtime.
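
A minimal sketch of that loop in Python, with `call_model` and the tool registry as hypothetical placeholders; the point is that the step cap and the tool whitelist live in code, outside the model's reach:

```python
# Minimal agentic control loop. `call_model` and TOOLS are placeholders.
from typing import Any, Callable

TOOLS: dict[str, Callable[..., Any]] = {}  # name -> implementation, fixed in code
MAX_STEPS = 10                             # hard bound the model cannot negotiate

def run_agent(task: str, call_model: Callable[[str], dict]) -> Any:
    observations: list[str] = []
    for _ in range(MAX_STEPS):
        # The model proposes the next action as structured output.
        decision = call_model(task + "\n" + "\n".join(observations))
        if decision.get("done"):
            return decision.get("answer")
        tool = TOOLS.get(decision["tool"])
        if tool is None:
            observations.append(f"error: unknown tool {decision['tool']!r}")
            continue  # the runtime, not the model, decides what happens next
        result = tool(**decision.get("args", {}))
        observations.append(f"{decision['tool']} -> {result!r}")
    raise RuntimeError("step budget exhausted; escalate to a human")
```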

If your agent does not have an enforced state machine around it, you do not have an agentic system. You have a chatbot that calls APIs and writes files. The difference matters because production failure modes live in state, not in the prompts.

The five things that break

Every production agentic system I have built or rescued failed first at one or more of these. They are predictable. Plan for them.

1. Tool reliability

Tools fail. Third-party APIs return 502s. Internal services time out under load. A search tool returns an empty result. A model that is told "the tool returned nothing" will often respond by hallucinating a result instead of asking for help. This is not a prompt problem. This is a runtime problem. Wrap every tool with retry policies, circuit breakers, and explicit error semantics. The model needs to be told the difference between "tool returned no results" and "tool failed and you should escalate." If the runtime conflates those, the agent will too.
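
A sketch of what that wrapper can look like; the `ToolEmpty` and `ToolFailed` names are mine, not from any framework:

```python
import time

class ToolEmpty(Exception):
    """The tool ran fine and found nothing. The model may proceed."""

class ToolFailed(Exception):
    """The tool is broken or unreachable. The runtime should escalate."""

def with_retries(fn, *args, attempts=3, backoff=1.0, **kwargs):
    """Retry transient failures; never surface a raw exception to the model."""
    last_error = None
    for attempt in range(attempts):
        try:
            result = fn(*args, **kwargs)
            if result in (None, [], ""):
                raise ToolEmpty(f"{fn.__name__} returned no results")
            return result
        except ToolEmpty:
            raise  # a valid outcome, not a transient failure: do not retry
        except Exception as exc:
            last_error = exc
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise ToolFailed(f"{fn.__name__} failed after {attempts} attempts: {last_error}")
```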

2. State management

Long-running agents accumulate state. The context window is not state. The context window is a cache of state that the model is allowed to see. Real state lives in a database that survives restarts. I have seen teams ship agents whose entire memory was the conversation buffer. The first time the process crashed mid-task, every in-flight job was lost. State has to be checkpointed at every meaningful step, with idempotency keys on every external action so retries do not double-charge a customer or double-send an email.
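
One way to get the idempotency guarantee, assuming a hypothetical persistence layer `db`:

```python
import hashlib

def idempotency_key(task_id: str, step_index: int, action: str) -> str:
    """Deterministic key: the same step of the same task always maps to the
    same key, so a retry cannot double-charge or double-send."""
    return hashlib.sha256(f"{task_id}:{step_index}:{action}".encode()).hexdigest()

def execute_step(db, task_id: str, step_index: int, action: str, run):
    key = idempotency_key(task_id, step_index, action)
    cached = db.fetch_result(key)   # hypothetical persistence layer
    if cached is not None:
        return cached               # crash-recovery path: this step already ran
    result = run()                  # the actual external action
    db.save_result(key, result)     # checkpoint before moving on
    return result
```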

3. Cost spirals

An agent that loops is an agent that bankrupts you. The classic failure: the agent is asked to "complete this task," it fails on step three, decides to retry, the retry fails, it tries another approach, fails again, and burns through fifty thousand tokens producing nothing. I have seen single jobs hit $40 in inference cost on a workflow that should have taken twenty cents. The runtime needs hard budgets per task. Tokens, dollars, wall-clock time, tool calls. When any budget is hit, the agent escalates to a human with full context. No budget, no production.
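
A minimal version of that enforcement; the limits are illustrative defaults, not recommendations:

```python
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    """Raised by the runtime; the task escalates with full context attached."""

@dataclass
class Budget:
    # Illustrative limits. Tune per workflow.
    max_tokens: int = 50_000
    max_dollars: float = 0.50
    max_seconds: float = 300.0
    max_tool_calls: int = 25
    tokens: int = 0
    dollars: float = 0.0
    tool_calls: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, dollars: float = 0.0, tool_calls: int = 0) -> None:
        """Called by the runtime on every model and tool invocation."""
        self.tokens += tokens
        self.dollars += dollars
        self.tool_calls += tool_calls
        if (self.tokens > self.max_tokens
                or self.dollars > self.max_dollars
                or self.tool_calls > self.max_tool_calls
                or time.monotonic() - self.started > self.max_seconds):
            raise BudgetExceeded("budget hit; escalate to a human")
```

The exception, not the prompt, is what stops the loop; whatever catches it owns the handoff to the escalation queue.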

4. Observability gaps

You cannot debug what you cannot see. Most agent frameworks log the final answer and maybe the trace. That is not enough. You need every model input, every model output, every tool input, every tool output, every state transition, every retry, every cost increment, indexed and searchable. When a customer complaint comes in three weeks after a bad run, you need to be able to pull the full trace in under thirty seconds. If you cannot, your team will fly blind and the agent will be a black box that nobody trusts.
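
The logging itself is not sophisticated. A sketch, with stdout standing in for whatever indexed store you actually use:

```python
import json
import time
import uuid

def log_event(task_id: str, kind: str, payload: dict) -> None:
    """One append-only record per model call, tool call, state transition,
    retry, and cost increment, keyed by task_id for fast retrieval."""
    record = {
        "event_id": str(uuid.uuid4()),
        "task_id": task_id,
        "ts": time.time(),
        "kind": kind,        # e.g. "model_input", "tool_output", "retry"
        "payload": payload,
    }
    print(json.dumps(record))  # stand-in for a write to your indexed store
```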

5. Human-in-the-loop boundaries

The hardest design question in agentic systems is where humans intervene. Too much intervention and the agent provides no leverage. Too little and the agent ships errors at machine speed. The pattern that works: the agent ranks its own confidence on every consequential action, and any action below the confidence threshold goes to a queue for human review. Confidence is calibrated against ground truth, not assumed. I have seen teams ship "high confidence" agents whose self-reported confidence had a Pearson correlation of 0.1 with actual correctness. That is a coin flip with extra steps.
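
A sketch of the routing plus the calibration check, assuming Python 3.10 for `statistics.correlation`; the threshold is illustrative:

```python
from statistics import correlation  # Python 3.10+

REVIEW_THRESHOLD = 0.9  # set from labeled outcomes, not guessed

def route_action(action, confidence: float, review_queue, execute) -> None:
    """Consequential actions below the calibrated threshold go to a human."""
    if confidence >= REVIEW_THRESHOLD:
        execute(action)
    else:
        review_queue.put((confidence, action))

def confidence_calibration(self_reported: list[float], correct: list[bool]) -> float:
    """Pearson correlation between self-reported confidence and ground truth.
    Near zero means the confidence signal is noise and the threshold is fiction."""
    return correlation(self_reported, [1.0 if c else 0.0 for c in correct])
```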

The architecture I keep coming back to

After enough rebuilds, I converge on the same shape every time.

Three components. A planner that produces a plan as structured output — a sequence of steps, each with an explicit tool call and expected outcome. An executor that runs each step against the tool layer with retries, budgets, and idempotency. A critic that reviews each step's output against the expected outcome and decides whether to continue, retry with a revised plan, or escalate.
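
One way to make the planner's output structured enough for the executor and critic to act on; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str       # which tool the executor should call
    args: dict      # arguments, validated against the tool's schema
    expected: str   # what the critic compares the output against

@dataclass
class Plan:
    task_id: str
    steps: list[Step]  # fixed order; the executor cannot reorder or skip
```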

The planner runs once, ideally with a stronger model. The executor is mostly deterministic code with a small model in the loop for parsing. The critic uses a model again, but a cheap one, because its job is mostly comparison and triage. This split lets you spend tokens where they matter and use code where it does.

Persistent state lives in Postgres. Every step gets a row. Every tool call gets a row. Every model call gets a row. State transitions are atomic. If the process dies mid-task, a worker picks up where it left off, because the database has a complete picture and every external action is idempotent.
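
A sketch of what that schema can look like; table and column names are mine, not a prescription:

```python
# Illustrative DDL, executed once at deploy time against Postgres.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    task_id     uuid PRIMARY KEY,
    status      text NOT NULL,            -- queued | running | done | escalated
    created_at  timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE IF NOT EXISTS steps (
    task_id         uuid NOT NULL REFERENCES tasks (task_id),
    step_index      int  NOT NULL,
    status          text NOT NULL,        -- pending | running | done | failed
    idempotency_key text UNIQUE,          -- guards retried external actions
    result          jsonb,
    PRIMARY KEY (task_id, step_index)
);
"""
```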

Deterministic checkpoints sit at every step boundary. The runtime is allowed to retry within a step, but it cannot freely reorder or skip steps. The model cannot rewrite the plan mid-execution; if a plan revision is needed, the executor surfaces the failure and the planner is invoked again with the updated context. This is unfashionable. It also works.
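
In code, the constraint is just an ordered loop with checkpoints; `db` and its methods are hypothetical:

```python
def run_plan(plan, db, tools) -> tuple[str, int]:
    """Steps run strictly in order; retries happen inside the tool layer.
    A step that still fails ends execution and hands control back to the
    planner."""
    for i, step in enumerate(plan.steps):
        if db.step_done(plan.task_id, i):      # crash recovery: skip finished work
            continue
        try:
            result = tools[step.tool](**step.args)
        except Exception as exc:
            db.mark_failed(plan.task_id, i, str(exc))
            return ("replan", i)               # planner re-invoked with the failure
        db.mark_done(plan.task_id, i, result)  # atomic checkpoint at the boundary
    return ("done", len(plan.steps))
```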

The agent's freedom should be exactly as wide as your willingness to debug what it does. Anything wider is hubris.

Cost discipline

A few patterns I use on every system.

Tier the models. Use a frontier model for planning and final synthesis. Use a smaller, cheaper model for routine tool argument extraction and routing decisions. The cost spread between a flagship model and a workhorse model is often ten to thirty times. On a workflow with twelve steps, eleven of them do not need the flagship.
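
A sketch of the routing; the model names and prices are placeholders:

```python
# Illustrative tiers; model names and prices are placeholders.
MODELS = {
    "flagship":  {"name": "frontier-large",  "usd_per_1k_tokens": 0.015},
    "workhorse": {"name": "workhorse-small", "usd_per_1k_tokens": 0.0005},
}

def pick_model(step_kind: str) -> str:
    """Planning and final synthesis get the flagship; routine extraction
    and routing get the cheap model."""
    tier = "flagship" if step_kind in ("plan", "synthesize") else "workhorse"
    return MODELS[tier]["name"]
```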

Cache aggressively. Hash the prompt plus the tool spec plus the relevant state slice and store the result. Many "long-running agent tasks" actually repeat the same sub-tasks. A cache hit rate of 30 to 50 percent is common in real workloads, and it is free latency and free cost reduction.
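
A minimal in-process version, with `call_model` as a hypothetical placeholder; production would use Redis or a database table with TTLs:

```python
import hashlib
import json

_cache: dict[str, str] = {}  # stand-in for Redis or a database table

def cached_call(prompt: str, tool_spec: dict, state_slice: dict, call_model) -> str:
    """Key on everything that can change the answer, and nothing else."""
    key = hashlib.sha256(
        json.dumps([prompt, tool_spec, state_slice], sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]
```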

Truncate context with intent. Do not stuff the entire conversation into every call. Maintain a structured working memory — facts, decisions, open questions — and reconstruct the prompt from that memory each step. Tokens drop by half. Quality typically goes up because the model is not distracted by irrelevant history.
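
A sketch of that working memory; the three buckets mirror the ones above:

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    """Structured state the prompt is rebuilt from at every step, instead of
    replaying the full conversation history."""
    facts: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def render(self, task: str) -> str:
        return "\n".join([
            f"Task: {task}",
            "Facts: " + "; ".join(self.facts),
            "Decisions: " + "; ".join(self.decisions),
            "Open questions: " + "; ".join(self.open_questions),
        ])
```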

Set a per-task budget and enforce it in code, not in the prompt. The model will ignore "you have $0.50 to complete this task" the moment it gets stuck. The runtime will not.

Where agents actually win

Agents work best where the task is well-bounded, the tools are reliable, the cost of an error is low, and the throughput requirement is high enough that humans were never going to do it economically. Operational triage. Multi-step research. Document processing where the structure varies but the goal is fixed. Customer support escalation routing. Outbound personalization at scale. In those domains a well-built agentic system delivers ten to fifty times the throughput at a fraction of the unit cost.

Agents work poorly where the task is open-ended, the tools are unreliable, the cost of an error is high, and the data is novel every time. Code generation against a complex codebase. Anything involving novel quantitative reasoning. Anything where the user's intent is ambiguous and asking a clarifying question is socially expensive. In those domains agents accelerate failure as effectively as they accelerate success, and the math gets ugly fast.

The honest answer is that the band of "agents are clearly better than a script plus a human" is narrower than the marketing suggests, but inside that band the leverage is real and durable. The teams that ship the right agent for the right workflow pull ahead. The teams that try to make every workflow agentic burn money and trust simultaneously.

What to build first

If you are starting from scratch, build the runtime before you build the agent. State machine, tool layer with retries and budgets, observability, human escalation queue. That is a month of work and it makes every subsequent agent ten times easier to ship and a hundred times easier to operate. The teams that try to do it the other way around — starting with a clever prompt and adding the runtime later — never finish the runtime, because by then they are firefighting in production.

Agents are not magic. They are control loops with stochastic planners. Treat them like the systems they are, build the runtime first, and they will repay the discipline. Skip the runtime and they will eat your week, your month, and your customer trust in that order.

Ajit Samuel is a New York City-based founder and operator. He architects, ships, and operates production AI, agentic systems, real-time data platforms, advertising technology, and growth infrastructure. ajitsamuel.com.