Building an AI Agent That Actually Works in Production

Most AI agents fail in production. Here's how to build one that scales, handles real edge cases, and doesn't cost you a fortune.

Most AI agents built by startups never make it past a polished demo. They work beautifully on the founder’s laptop, then hit production and collapse under the weight of real data, edge cases, and latency constraints. The gap between “works in notebooks” and “works at scale” is where most projects die.

We’ve built dozens of AI agents at Amora, from document processors for accountants to lead-scoring systems for sales teams, and the difference between the ones that survive and the ones that don’t isn’t magic. It’s architecture, pragmatism, and a clear-eyed view of what actually needs to be intelligent.

Stop Building Agents That Are Too Clever

The first mistake is trying to make your agent solve everything. You watch a demo of GPT-4 chaining thoughts across ten steps, and you think your agent needs to do the same. It doesn’t.

A production agent should do one thing well: take an input, make a decision or perform a task, and return a result. If your agent is supposed to process invoices, it processes invoices. It doesn’t also email the supplier, book the payment, and analyse spending trends, unless those are genuinely part of the same workflow, and even then, they’re often better as separate systems.

The reason is practical: every step you add compounds the chance of failure. If your agent has five steps and each succeeds 95% of the time, independently of the others, you’re at roughly 77% overall success. At ten steps, you’re at about 60%. In production, that translates to tickets, rollbacks, and users switching to a competitor.
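
The arithmetic behind those figures is just repeated multiplication. A minimal sketch, assuming every step fails independently and with the same probability (real pipelines rarely behave that neatly, so treat it as a rough guide):

```python
def chain_success_rate(step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


for steps in (1, 5, 10):
    rate = chain_success_rate(0.95, steps)
    print(f"steps={steps:>2}  overall success={rate:.1%}")
# steps= 1  overall success=95.0%
# steps= 5  overall success=77.4%
# steps=10  overall success=59.9%
```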

Build narrow agents. Chain them together if you need to. A narrow agent is faster, cheaper to run, easier to test, and easier to debug when something goes wrong.

Separate the AI from the Logic

Here’s the architecture pattern that works: your agent isn’t one monolith. It’s a thin AI layer sitting on top of deterministic functions.

Your LLM (usually GPT-4 or Claude) handles understanding and decision-making. Everything else (validation, database queries, API calls, state management) happens in ordinary code. When the agent needs to fetch data, it calls a function. When it needs to update state, it calls another function. The functions are boring, testable, and they don’t hallucinate.

This matters because:

  • You can test the logic without hitting the LLM API (faster, cheaper)
  • You know exactly what data the agent has access to
  • If something breaks, you know whether it’s a logic error or a model problem
  • You can swap models without rewriting everything
  • You can add monitoring and rate limits at the function layer
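
One way to picture that split is a small dispatcher: the model only names a function and its arguments, and ordinary code does the work. This is a sketch under assumptions, not a prescribed design. The tool name `get_invoice_total`, the in-memory invoice data, and the `call_model` parameter are all hypothetical; `call_model` stands in for whichever LLM client you use, which is what lets the test run without hitting an API.

```python
import json
from typing import Callable


# Deterministic tool: a plain function that can be unit-tested without a model.
def get_invoice_total(invoice_id: str) -> dict:
    invoices = {"INV-001": 1250.00}  # made-up data standing in for a database
    if invoice_id not in invoices:
        return {"error": f"unknown invoice {invoice_id}"}
    return {"invoice_id": invoice_id, "total": invoices[invoice_id]}


# The registry is the full list of things the agent is allowed to do.
TOOLS: dict[str, Callable[..., dict]] = {"get_invoice_total": get_invoice_total}


def run_agent(user_input: str, call_model: Callable[[str], str]) -> dict:
    """Ask the model which tool to call, then run that tool in ordinary code."""
    raw = call_model(user_input)
    try:
        request = json.loads(raw)
        tool = TOOLS[request["tool"]]
        return tool(**request["args"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # The model asked for something malformed or not in the registry.
        return {"error": f"bad tool request: {exc!r}"}


def fake_model(user_input: str) -> str:
    """Test double that always asks for the same tool call."""
    return json.dumps({"tool": "get_invoice_total", "args": {"invoice_id": "INV-001"}})


print(run_agent("What is the total on INV-001?", fake_model))
# {'invoice_id': 'INV-001', 'total': 1250.0}
```

Swapping models means swapping `call_model`; the tools and their tests stay untouched.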

A fintech platform we worked with started by asking GPT-4 to make payment decisions directly. Six months in, they separated the decision logic into a rules engine. The LLM now returns intent and the rules engine makes the actual decision. That change gave them auditability, compliance trails, and 40% lower latency.
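
To make the intent-versus-decision split concrete, here is an illustrative rules engine. It is not the client's actual system: the `Intent` fields, the rule names, and the 5,000 auto-approve limit are assumptions chosen for the example. The point is that the model fills in an `Intent`, while `decide` is plain code that records which rule fired.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """What the model extracted from the request (hypothetical fields)."""
    action: str
    amount: float
    payee_verified: bool


@dataclass(frozen=True)
class Decision:
    approved: bool
    rule: str  # which rule fired, kept for the audit trail


AUTO_APPROVE_LIMIT = 5_000.00  # hypothetical threshold


def decide(intent: Intent) -> Decision:
    """Deterministic decision: same intent in, same decision out."""
    if intent.action != "approve_payment":
        return Decision(False, "unsupported_action")
    if not intent.payee_verified:
        return Decision(False, "payee_not_verified")
    if intent.amount > AUTO_APPROVE_LIMIT:
        return Decision(False, "over_auto_approve_limit")
    return Decision(True, "within_limit_and_verified")


print(decide(Intent("approve_payment", 1_200.00, payee_verified=True)))
# Decision(approved=True, rule='within_limit_and_verified')
```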

Use Structured Output and Validation

Never rely on an LLM to return unstructured text that you then parse. That way lies madness. Use constrained output.

Modern models support structured outputs: OpenAI’s JSON mode and Structured Outputs, Anthropic’s tool use with JSON schemas, or libraries like Instructor that wrap several provider APIs. Your agent returns data that matches a schema: status code, decision, confidence score, reasoning. You validate against that schema before you use the output.

This does three things:

  1. It forces the model to be specific instead of verbose
  2. It lets you reject invalid outputs instead of trying to interpret gibberish
  3. It makes logging and debugging straightforward

If the model returns an invalid schema, log it, increment a counter, and fall back to a default behaviour. Don’t try to fix it at runtime.
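
A minimal sketch of that validate-or-fall-back step, using only the standard library. The field names follow the schema described above; the allowed decision values, the `FALLBACK` result, and the `metrics` counter are assumptions for illustration. In practice you might use a validation library such as Pydantic, but the shape is the same.

```python
import json
import logging
from collections import Counter
from dataclasses import dataclass

logger = logging.getLogger("agent")
metrics = Counter()  # stand-in for a real metrics client


@dataclass(frozen=True)
class AgentResult:
    status: str
    decision: str
    confidence: float
    reasoning: str


ALLOWED_DECISIONS = {"approve", "reject", "needs_review"}  # hypothetical values
FALLBACK = AgentResult("fallback", "needs_review", 0.0, "invalid model output")


def parse_result(raw: str) -> AgentResult:
    """Validate raw model output; on any problem, log, count, and fall back."""
    try:
        data = json.loads(raw)
        result = AgentResult(
            status=str(data["status"]),
            decision=str(data["decision"]),
            confidence=float(data["confidence"]),
            reasoning=str(data["reasoning"]),
        )
        if result.decision not in ALLOWED_DECISIONS:
            raise ValueError(f"unexpected decision {result.decision!r}")
        if not 0.0 <= result.confidence <= 1.0:
            raise ValueError("confidence out of range")
        return result
    except (KeyError, TypeError, ValueError) as exc:  # JSONDecodeError is a ValueError
        logger.warning("invalid agent output: %s | raw=%r", exc, raw)
        metrics["schema_validation_failures"] += 1
        return FALLBACK  # no attempt to repair the output at runtime
```

A call such as `parse_result("Sure! Here is the JSON you asked for...")` returns `FALLBACK` and bumps the counter instead of raising.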

Cost Control and Latency Are Real Constraints

An agent that costs AUD $2 per request sounds cheap until you’re processing 10,000 requests a day. That’s $20,000 a day, or roughly $600,000 a month. Add a 3-second latency on top and you’ve got a system users won’t use.

You need three levers:

  1. Caching: If you’re processing similar inputs, cache the results. A document classifier might see the same document type a hundred times a day. Cache the classification.
  2. Cheaper models for simple tasks: GPT-4 is smart but slow and expensive. GPT-4o mini or Claude Haiku handle classification, extraction, and routing much faster and for a fraction of the cost. Reserve GPT-4 for genuinely complex reasoning.
  3. Async processing: Don’t make users wait for the agent to finish. Process asynchronously, queue the work, and notify them when it’s done. A document analysis might take 10 seconds. That’s fine in the background, terrible in a request/response cycle.
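
The three levers can be sketched together in a few dozen lines. Everything named here is a placeholder: `small-model` and `large-model` are not real model IDs, `call_model` fakes the network call with a short sleep, and the in-memory dict would be something like Redis in a real deployment. A single background worker is used so the cache hit is easy to see; you would normally run several.

```python
import asyncio
import hashlib

CHEAP_MODEL = "small-model"   # placeholder names, not real model IDs
STRONG_MODEL = "large-model"
SIMPLE_TASKS = {"classify", "extract", "route"}

_cache: dict[str, str] = {}
model_calls: list[str] = []   # records which model each real call went to


def pick_model(task: str) -> str:
    """Lever 2: send simple tasks to the cheaper model."""
    return CHEAP_MODEL if task in SIMPLE_TASKS else STRONG_MODEL


def cache_key(task: str, text: str) -> str:
    """Lever 1: normalise case and whitespace so near-identical inputs share a key."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(f"{task}:{normalised}".encode()).hexdigest()


async def call_model(model: str, task: str, text: str) -> str:
    model_calls.append(model)
    await asyncio.sleep(0.01)  # stand-in for the network round trip
    return f"{model} handled {task}"


async def run_task(task: str, text: str) -> str:
    key = cache_key(task, text)
    if key not in _cache:
        _cache[key] = await call_model(pick_model(task), task, text)
    return _cache[key]


async def worker(queue: asyncio.Queue, results: list[str]) -> None:
    """Lever 3: work is pulled from a queue in the background."""
    while True:
        task, text = await queue.get()
        try:
            results.append(await run_task(task, text))
        finally:
            queue.task_done()


async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[str] = []
    background = asyncio.create_task(worker(queue, results))
    jobs = [("classify", "Tax Invoice"), ("classify", "tax   invoice"), ("analyse", "Q3 spend")]
    for job in jobs:
        queue.put_nowait(job)  # the caller returns immediately in a real system
    await queue.join()
    background.cancel()
    await asyncio.gather(background, return_exceptions=True)
    return results


results = asyncio.run(main())
print(results)      # three results
print(model_calls)  # ['small-model', 'large-model'] (second job was a cache hit)
```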

An ecommerce client we built a product recommendation agent for was going to hit their usage limits in month two. We switched from pure GPT-4 to a hybrid: GPT-4o mini for initial filtering (removing 80% of products instantly, costing 2¢), then GPT-4 for final ranking on the shortlist (costing 15¢). Cost per request dropped from $1.20 to $0.25. Latency went from 8 seconds to 2.
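
The shape of that hybrid is a two-stage filter-then-rank pipeline. The sketch below is illustrative rather than the client's code: `cheap_score` and `strong_score` are hypothetical callables standing in for the small and large model, and here they are faked with simple word overlap so the example runs offline.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]  # (query, product) -> relevance score


def recommend(
    query: str,
    products: Sequence[str],
    cheap_score: Scorer,
    strong_score: Scorer,
    keep_fraction: float = 0.2,
    top_n: int = 3,
) -> list[str]:
    """Stage 1 trims the catalogue cheaply; stage 2 ranks only the shortlist."""
    keep = max(1, round(len(products) * keep_fraction))
    shortlist = sorted(products, key=lambda p: cheap_score(query, p), reverse=True)[:keep]
    return sorted(shortlist, key=lambda p: strong_score(query, p), reverse=True)[:top_n]


strong_calls: list[str] = []  # tracks how often the expensive scorer runs


def overlap(query: str, product: str) -> float:
    """Stand-in for the cheap model: count shared words."""
    return float(len(set(query.split()) & set(product.split())))


def careful_overlap(query: str, product: str) -> float:
    """Stand-in for the strong model: share of the product's words that match."""
    strong_calls.append(product)
    return overlap(query, product) / len(product.split())


catalogue = [
    "red running shoes", "blue running shoes", "red dress", "garden hose",
    "office chair", "trail running shoes", "red wine glasses", "laptop stand",
    "yoga mat", "desk lamp",
]
print(recommend("red running shoes", catalogue, overlap, careful_overlap))
# ['red running shoes', 'blue running shoes']
print(len(strong_calls))  # 2: the expensive scorer only saw the shortlist
```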

Plan for Failure and Monitoring

Your agent will fail. Models hallucinate. APIs go down. Inputs are malformed. You need to know about it before your customer does.

Set up monitoring from day one, not day 100:

  • Log every input, output, and intermediate step (structured logs, not text blobs)
  • Monitor token usage and cost per request to catch runaway costs early
  • Track latency percentiles (p50, p95, p99), not just averages
  • Set alerts for schema validation failures, API errors, and unusual input patterns
  • Sample outputs and review them weekly. You’ll catch edge cases you didn’t anticipate
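
As a rough sketch of the first three bullets: one JSON object per request makes logs queryable, and percentiles fall out of the standard library. The field names in `log_request` are assumptions, not a standard; `statistics.quantiles` needs at least two data points.

```python
import json
import logging
import statistics

logger = logging.getLogger("agent.requests")


def log_request(
    request_id: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    latency_ms: float,
    schema_valid: bool,
) -> str:
    """Emit one structured log line per request and return it."""
    record = {
        "request_id": request_id,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "schema_valid": schema_valid,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50, p95, and p99 from a list of at least two latency samples."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Mostly fast requests with a slow tail: the mean hides what p99 shows.
samples = [200.0] * 95 + [4000.0] * 5
print(statistics.mean(samples))       # 390.0
print(latency_percentiles(samples))
```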

When something breaks, you want a paper trail. We’ve debugged dozens of agent issues, and 90% of them come from unclear logs. The other 10% come from not having logs at all.

Ship in Phases, Not Big Bang

You don’t need a perfect agent on day one. You need one that works for 80% of cases and fails clearly on the other 20%. Version 1 should handle common inputs reliably. Version 2 adds edge case handling. Version 3 adds performance optimisation.

Start with a rules-based fallback. If the agent can’t decide with confidence, the system returns a default answer or routes to a human. This isn’t failure; it’s pragmatism. A loan approval agent that says “I’m 73% confident, escalate to a human” is infinitely better than one that approves a risky loan at 95% confidence because the model was overconfident.
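
A small sketch of that fallback, with assumed numbers: the 0.85 confidence threshold and the 50,000 review limit are hypothetical and would be tuned per use case. The hard limit sits in front of the confidence check because a model's stated confidence can itself be wrong.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOutput:
    decision: str      # e.g. "approve" or "reject"
    confidence: float  # the model's self-reported confidence, 0 to 1


CONFIDENCE_THRESHOLD = 0.85   # hypothetical; below this, a human decides
ALWAYS_REVIEW_ABOVE = 50_000  # hypothetical loan amount that always gets review

ESCALATE = "escalate_to_human"


def route(output: ModelOutput, loan_amount: float) -> str:
    """Return the model's decision only when the rules allow it."""
    if loan_amount > ALWAYS_REVIEW_ABOVE:
        return ESCALATE  # rules-based guard, independent of model confidence
    if output.confidence < CONFIDENCE_THRESHOLD:
        return ESCALATE
    return output.decision


print(route(ModelOutput("approve", 0.73), loan_amount=10_000))  # escalate_to_human
print(route(ModelOutput("approve", 0.95), loan_amount=10_000))  # approve
print(route(ModelOutput("approve", 0.95), loan_amount=80_000))  # escalate_to_human
```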

When you’re ready to build, or if you want to talk through the architecture for your specific use case, talk to Amora about your build. We’ve shipped agents in 28 days that handle real production load, and we know exactly where the pitfalls are.

The founders and operators who win with AI aren’t the ones building the cleverest systems. They’re the ones building the systems that actually work when users show up.

Got something you want built?

Amora Digital is an Australian software and AI agency. We scope it, build it, and ship it – live in 28 days. No offshore teams. No surprises.
