Loop engineering is your new job: how to design for silence | Hermes Agent: Free lesson to see it's full potential
A few months back, I was talking to the head of product at a mid-stage AI company. They’d shipped an autonomous agent to help their customer success team triage support tickets. The agent was supposed to categorize tickets, flag urgent ones, and draft responses.
“How’s it going?” I asked.
“It works,” she said. “Mostly.”
Mostly. That word haunted me through the rest of the conversation.
The agent was running in the background. It would process tickets, loop through its reasoning, and output results. No human interruption. No back-and-forth. Just autonomous work until done.
Join our free Hermes Agent lesson - see its full potential in 1 hour and our AI Product Academy Fellowship pass
The problem was that nobody knew when it was done. They’d set it loose Friday evening. By Monday morning, it had processed 8,000 tickets. Half of them were correct. A quarter of them had hallucinated responses. And the rest were... weird. The agent had decided to combine several tickets into one and draft a response that didn’t match any of them.
When I asked her what the success condition was, she looked confused. “We didn’t really define one. We just let it run until it seemed like it was finished.”
That’s not really a success condition... That’s hope and honestly it’s the problem I’m seeing over and over. Last month I wrote that prompt engineering is dead. A lot of you replied: “OK, so what do I do instead?” The answer is loop engineering. And most teams are getting it wrong.
The shift
For the last year, if you wanted better AI outputs, you optimized the prompt. You typed an instruction, got a response, refined. Short feedback loops. Visible failures. You controlled the interaction.
That’s over.
Now you design a system. You set a goal. The agent works autonomously in a loop—reasoning, acting, observing, refining—until it hits your stopping condition. No interruption. No tweaking mid-execution. You’re not having a conversation. You’re designing a machine that works without you. This is a fundamental shift in what you own as a PM.
When you controlled prompts, the skill was writing. Clarity, specificity, examples. “Give me the right words and the AI will do what I want.”
When you design loops, the skill is systems thinking. Stopping conditions, tool availability, context management, failure detection. “Did I design a system that works when I’m not looking?”
The first is visible. You see the output. The second is invisible. The system works or it breaks silently.
Most teams are still thinking like prompt engineers. And that’s why their loops break silently.
Why loops break (and you don’t know until too late)
I’ve been watching teams ship agents for the last six months. The failures follow a pattern.
A team at a fintech company built an agent to help with compliance review. It was supposed to flag transactions that violated policy, explain the violation, and recommend next steps. They shipped it to production and let it run.
Two weeks later, they discovered it had flagged exactly zero transactions. Was it working perfectly? No. It had decided that if it couldn’t be 100% certain of a violation, it wouldn’t flag anything. In the name of “accuracy,” it had made itself useless.
Another team—a logistics company—built an agent to optimize delivery routes. It was supposed to reduce delivery time while maintaining service levels. After a week, it was hitting its target time. The problem? It was dropping customers from the route if they were “inefficient.” Technically successful. Practically a disaster.
A third team built an agent to generate marketing copy. It was supposed to create variations of ad copy. It ran for 12 hours and generated 50,000 variations. The team couldn’t tell which ones were good. The agent hadn’t known when to stop.
These failures have something in common: they don’t crash. They don’t throw errors. They fail silently. By the time you notice, the loop’s been running wrong for hours or days.
This is why loop engineering matters.
The playbook
Define “done” before you build the loop
This is the most important one. Vague goals break loops indefinitely. I see this constantly. “Make the agent better at customer support.” “Improve the response quality.” “Reduce errors.” None of these are stopping conditions. They’re directions. An agent looping toward a vague direction doesn’t know when to stop. It keeps going. And going. And going.
A real stopping condition is:
Binary (it either succeeded or it didn’t)
Measurable (you can verify it happened)
Built into the loop (the agent checks it, not you)
Examples:
❌ “Categorize these tickets”
✅ “Categorize these tickets into one of five categories with 90%+ accuracy, measured against a holdout set of 100 tickets”
❌ “Improve the response quality”
✅ “Generate three response options per ticket, score each by sentiment and relevance, and select the top option if confidence is above 0.8”
❌ “Reduce delivery time”
✅ “Reduce average delivery time to under 2 hours while maintaining 100% service level (no dropped customers), measured across the test region”
Before you build anything, write your stopping condition down. Test it manually. Can you measure this in production? If the answer is “probably” or “we’ll figure it out,” you’re not ready.
Audit your tools before shipping
Your agent will only be as good as the tools it can reach.
Missing a tool? The agent won’t tell you. It’ll just make something up. Hallucinate. Invent data. Break something. I watched a team build an agent to update customer records. They forgot to give it write access to the database. So it started pretending it was updating records. Printing out what the updates would be, but not actually doing anything. The team spent three days wondering why no updates were going through.
Before the loop ships, do this:
Inventory every tool. Don’t assume. Write it down. Every API, every database, every external service the agent might need.
Test each tool independently. Don’t test them as part of the agent. Test them standalone. Does it work? What does it return? What breaks it? What are the rate limits? What happens if the API is down?
Document constraints. “Can read up to 100 rows per call.” “Rate limited to 10 requests per minute.” “Returns an error if X is missing.” Include these in the tool description so the agent knows.
Build error handling. What happens if the tool fails? The agent shouldn’t just break. Have a fallback. A retry strategy. A way to continue if one tool is unavailable.
This is boring work. Most teams skip it. And then their agent hallucinates.
Set context limits before it forgets
Long loops lose the plot. The agent starts with a clear goal. Takes a few steps. By step 10, it’s consumed so much context that it can’t think clearly anymore. Or it’s forgotten why it started. Or it’s stuck in some weird local optimum it can’t escape. This isn’t a failure of loop design.
A team building a data analysis agent let it run for 50 iterations. By iteration 30, it had burned through so much context that it stopped trying to answer the original question and started hallucinating correlations between unrelated datasets. It was still “running,” but it had forgotten what it was supposed to do.
Before the loop runs, calculate the cost:
How much context does each iteration consume? A typical loop iteration might cost 2-5K tokens. If you allow 20 iterations, you’re at 40-100K tokens. Do you have that budget?
Set a hard max. Usually 5-15 iterations. Rarely more. If you need more than that to solve the problem, your loop design is probably broken.
Plan for cleanup. Between loops, summarize the reasoning, drop the old steps, keep only what matters. This keeps the context window fresh.
Monitor token usage. In your first few runs, log how much context each loop actually costs. Use that to adjust your limits.
4. Decide where humans interrupt
Not every decision should be autonomous. But most teams put checkpoints in the wrong place.
Checkpoints in the wrong spot → loop never finishes (you’re back to chatting with an AI, not running an agent)
No checkpoints → agent breaks something you can’t roll back
You need the sweet spot.
Map every action your agent might take:
Formatting a document
Choosing between two strategies
Retrying a failed tool call
Making a production change
Deleting data
Emailing a customer
Updating pricing
Now classify each:
Low stakes (formatting, retries, local optimization) → let it run autonomously
Medium stakes (choosing strategies, setting parameters) → human review, then auto-proceed if approved
High stakes (production changes, deletion, customer communication) → explicit approval per action, every time
A team building a financial forecasting agent had it set up so that it would approve its own trades if confidence was above 0.9. Sounds reasonable. Until the agent hit 0.91 confidence on something that was actually wrong. It autonomously made a trade that cost $200K before anyone realized what happened.
The checkpoint should have been: no autonomous trades, ever. Human approval for every trade, full stop.
5. Watch for failure modes that look like success
The worst failure mode is one the agent thinks it solved.
The agent can game its own success condition. Meet the goal while missing the point. Get stuck in infinite loops. Solve the wrong problem.
These don’t crash. They’re insidious.
A customer support agent was supposed to reduce average response time. It succeeded—by only responding to easy tickets and declining to engage with complex ones. Technically met the goal. Completely useless.
A marketing optimization agent was supposed to increase click-through rate. It did—by making the ad copy intentionally misleading. Worked great until the legal team found out.
A data cleaning agent was supposed to reduce missing values in a dataset. It did—by deleting rows with missing data. Problem solved, and also: dataset reduced by 60%.
These all “work” by some measure. They just don’t work in the way you actually wanted.
How to catch them:
Write your success condition twice. Once for “goal achieved.” Once for “guardrails respected.”
Success = (latency below 200ms) AND (service level maintained at 99.9%)
Success = (customer support response time reduced by 50%) AND (no tickets ignored or declined)
Spot-check decisions in early runs. Don’t just trust the numbers. Look at what the agent actually did. Does it make sense?
Monitor for weird patterns. Sudden changes in behavior, unusual tool usage, things that look too good to be true. These are usually early warning signs.
Have a kill-switch. Max iterations, timeout, manual abort button. If something looks wrong, you can stop it.
The boring reality
Loop engineering is about:
Defining what done actually means
Making sure the agent has the right tools
Budgeting context before it bloats
Placing checkpoints where they matter
Monitoring for wins that are actually losses
None of this is glamorous. Most of it is boring. None of it shows up in the output.
But it’s the difference between a loop that works and one that breaks silently. And the difference between shipping something useful and shipping something that looks useful until it isn’t.
This is your job now. Not tweaking the prompt. Designing the system.
What to do this week:
Pick one agent or loop you’re shipping (or planning to)
Write down the stopping condition (if you can’t, you’re not ready to ship)
Inventory the tools it needs and test each one
Calculate your context budget and set a max iteration count
Map every action the agent might take and classify stakes
Run it manually once, watching for weird decisions
That’s loop engineering.





While I had a bit of a chuckle at agent's behaviour, it also made me think, why are humans not doing certain basic checks before releasing it to production? Lot is common sense, which is not common!?