RAG Done Right: A Practical Architecture for Australian Startups

Retrieval-Augmented Generation works—if you build it properly. Here's what actually matters when you're shipping AI products on a startup budget.

Retrieval-Augmented Generation (RAG) is the most talked-about pattern in AI right now, and for good reason. It lets you bolt current, proprietary data onto a language model without retraining it. The problem is most implementations fail because founders treat it like a checkbox instead of a system that needs proper architecture.

You’ve probably heard the pitch: “Just throw your documents into a vector database and ask questions.” That works in demos. In production, with real user data and variable-quality documents, it falls apart quickly. Slow responses. Hallucinations creeping back in. Ballooning token spend. Users complaining that the AI can’t actually answer their specific questions.

The good news: RAG is worth doing, and the fixes are concrete. This is what we’ve learned building RAG systems for Australian startups over the last two years.

Why RAG Fails (And What You’re Actually Building)

RAG looks simple on the surface. You embed documents, store vectors in a database, retrieve relevant chunks, and feed them to a language model. Four moving parts.

In reality, you’re building a retrieval system. And retrieval is hard. The language model is the easy bit.

Most failures come from one of these:

  • Poor document preparation: Dumping raw PDFs, Word docs, or web pages into a vector store without chunking or metadata. You end up retrieving irrelevant or incomplete context.
  • Wrong embedding model: Using a generic model trained on internet text when your domain is niche (biotech, legal, financial). The vectors don’t capture your domain semantics.
  • No reranking: You retrieve 10 chunks, but only 3 are actually useful. Without a reranking step, the language model wastes tokens on noise.
  • Missing evaluation: You don’t measure whether retrieved chunks actually answer the query. You ship, users complain, you shrug and tweak prompts.
  • Token efficiency ignored: Each API call to Claude or GPT-4 costs money. If your retrieval pulls in 50 pages when 5 would do, your unit economics collapse fast.

Start by accepting this: RAG is not a language model problem. It’s a search problem. Your architecture should reflect that.

The Minimal Production Architecture

Here’s what we recommend for Australian startups shipping their first RAG product, targeting a budget of roughly AUD $5,000-$15,000 in infrastructure and vendor costs for the first year:

  1. Document processing pipeline: Ingest raw documents (PDFs, Word, HTML, markdown). Extract text, identify sections, chunk intelligently (not just by token count, but by semantic boundaries). Store raw chunks plus metadata (source, date, section, confidence score). Build this once, use it for every update cycle.
  2. Embedding layer: Use a domain-aware embedding model or fine-tune one. For most Australian startups, start with OpenAI’s text-embedding-3-large or Cohere’s embed-english-v3.0. They’re good enough and battle-tested. Only specialise if your domain is very narrow.
  3. Vector store: Pinecone, Weaviate, or Milvus. Pinecone is the easiest to ship with (serverless, no ops burden). Weaviate gives you more control if you want to run it yourself. For early stage, Pinecone wins on speed-to-market.
  4. Retrieval + reranking: Query the vector store, get top 20 candidates. Run them through a reranker (Cohere’s rerank-english-v3.0 is excellent). Take top 3-5 results. This cuts your token spend and improves answer quality visibly.
  5. LLM layer: Claude 3.5 Sonnet or GPT-4o. Use the reranked context, add a prompt that tells the model when to say “I don’t know,” and return both the answer and a confidence score.
  6. Logging and feedback: Log every query, the chunks retrieved, the model response, and whether the user found it useful. This is your signal for improvement. Without it, you’re flying blind.
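To make steps 4 and 5 concrete, here is a minimal sketch of the retrieve, rerank, and prompt flow. It is deliberately toy-sized: `embed` and `rerank` are hypothetical stand-ins built on word counts, not calls to any real embedding model or reranker, and the function names are our own. In a real build you would swap them for your vendor's SDK while keeping the same two-stage shape.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 20) -> list[str]:
    """Stage 1: broad, cheap recall over every chunk."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Stage 2: toy stand-in for a reranker.

    Scores each candidate by the fraction of query terms it covers.
    """
    q_terms = set(tokenize(query))

    def coverage(chunk: str) -> float:
        if not q_terms:
            return 0.0
        return len(q_terms & set(tokenize(chunk))) / len(q_terms)

    return sorted(candidates, key=coverage, reverse=True)[:top_n]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the final prompt, with an explicit instruction to admit ignorance."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return (
        "Answer using only the context below. "
        'If the context does not contain the answer, say "I don\'t know."\n\n'
        f"Context:\n{numbered}\n\nQuestion: {query}"
    )
```

The point of the split is cost: stage 1 can afford to be generous because it is cheap, and only the handful of chunks that survive stage 2 ever reach the language model.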

That’s it. Six components. Each one should take a single engineer 1-2 weeks to implement properly, assuming your documents are ready.

Chunks, Metadata, and the Hidden Complexity

The thing that catches most teams is document preparation. You’ll spend more time here than anywhere else, and it’s not glamorous.

Raw documents are messy. PDFs have formatting noise. Web pages have navigation cruft. Word documents have tracked changes and comments. You need to:

  • Extract clean text (use libraries like pdfplumber for PDFs, BeautifulSoup for HTML).
  • Identify structure: is this a legal contract, a technical manual, a financial report? Different structures need different chunking.
  • Chunk by meaning, not just tokens. A 1,000-token limit that cuts a sentence in half is worse than a 2,000-token chunk that keeps a paragraph intact.
  • Add metadata: source document name, date added, section heading, author, confidence (if the extraction was clean vs. messy).
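As an illustration of chunking by meaning with metadata attached, here is a minimal sketch that packs whole paragraphs into chunks and never cuts one in half. The `Chunk` class, the metadata keys, and the use of word count as a rough proxy for token count are all assumptions made for this example; a real pipeline would use your tokenizer and your own metadata schema.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def _make_chunk(paragraphs: list[str], source: str, index: int) -> Chunk:
    """Join paragraphs into one chunk and attach basic metadata."""
    text = "\n\n".join(paragraphs)
    return Chunk(
        text=text,
        metadata={
            "source": source,
            "chunk_index": index,
            "word_count": len(text.split()),
        },
    )


def chunk_by_paragraph(text: str, source: str, max_words: int = 200) -> list[Chunk]:
    """Pack whole paragraphs into chunks of at most roughly max_words words.

    A paragraph is never split, so a single oversized paragraph becomes
    a single oversized chunk rather than two broken halves.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    current: list[str] = []
    current_words = 0

    for para in paragraphs:
        n_words = len(para.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_words + n_words > max_words:
            chunks.append(_make_chunk(current, source, len(chunks)))
            current, current_words = [], 0
        current.append(para)
        current_words += n_words

    if current:
        chunks.append(_make_chunk(current, source, len(chunks)))
    return chunks
```

Calling `chunk_by_paragraph(document_text, source="handbook.pdf")` (a hypothetical filename) returns chunks you can embed directly, with the metadata ready to store alongside each vector.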

For a startup with 500-2,000 documents, you can often do initial preparation with a script and human review. For 10,000+, you need some automation. Most Australian startups fall in the middle and solve it with a Python pipeline plus a small amount of manual QA.

The payoff is huge. Good chunks + good metadata means your retrieval works reliably. You spend less time tuning prompts and more time shipping features.

Measuring What Actually Works

You can’t improve what you don’t measure. Most teams skip this and regret it.

Build three basic metrics into your system from day one:

  1. Retrieval precision: Of the top-5 chunks retrieved, how many contain information relevant to the query? Aim for 80%+. If you’re below 70%, your retrieval is broken.
  2. User satisfaction: Does the user find the answer helpful? Add a thumbs-up / thumbs-down button on every response. Aim for 75%+ positive feedback.
  3. Token efficiency: How many tokens does each query cost on average? Track this per query type. If it’s climbing, your reranking isn’t working or your chunks are too large.
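Here is one way the three metrics could be computed from your query logs. The `QueryLog` structure and its field names are hypothetical; the sketch assumes each retrieved chunk has already been labelled relevant or not (by a person or an automated judge) and that thumbs feedback is optional.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryLog:
    relevant_flags: list[bool]   # one flag per retrieved chunk, in rank order
    thumbs_up: Optional[bool]    # None when the user gave no feedback
    tokens_used: int             # total tokens billed for this query


def retrieval_precision(logs: list[QueryLog], k: int = 5) -> float:
    """Mean precision@k: share of the top-k chunks that were relevant."""
    scores = []
    for log in logs:
        top = log.relevant_flags[:k]
        if top:
            scores.append(sum(top) / len(top))
    return sum(scores) / len(scores) if scores else 0.0


def satisfaction_rate(logs: list[QueryLog]) -> float:
    """Share of rated responses that got a thumbs-up; unrated ones are ignored."""
    rated = [log.thumbs_up for log in logs if log.thumbs_up is not None]
    return sum(rated) / len(rated) if rated else 0.0


def average_tokens(logs: list[QueryLog]) -> float:
    """Mean tokens used per query."""
    return sum(log.tokens_used for log in logs) / len(logs) if logs else 0.0
```

Run these over a rolling window (say, the last week of logs) and compare against the targets above: 80%+ precision and 75%+ positive feedback, with average tokens per query holding steady.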

These three numbers tell you whether to fix retrieval, the prompt, or your architecture. Everything else is detail.

Pitfalls to Avoid (And When to Ship Anyway)

You’ll be tempted to:

  • Wait for the perfect embedding model. Don’t. Start with a standard one. Swap it out in week 3 if needed. Your users care about answers, not the math inside the embedding layer.
  • Build a custom vector store. Don’t. Use Pinecone or Weaviate. You’ll save six months of ops headache.
  • Over-engineer the chunking strategy. Start simple: fixed-size chunks with overlap, plus metadata. If retrieval quality is low, then experiment with recursive chunking or semantic splitting.
  • Forget about failure modes. What happens when a query has no relevant documents? When the model can’t answer? When the database is down? Build explicit handling for each; don’t let any of them fail silently.
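The failure modes in that last point can be handled with a thin wrapper around your pipeline. This is a sketch under assumptions: `search` and `generate` are hypothetical callables you would supply (your vector store query and your LLM call), the `0.3` relevance threshold is a placeholder to tune on your own data, and we assume the store signals an outage by raising `ConnectionError`.

```python
from typing import Callable

NO_CONTEXT_MESSAGE = "I couldn't find anything relevant in the knowledge base."
UNAVAILABLE_MESSAGE = "Search is temporarily unavailable. Please try again shortly."
NO_ANSWER_MESSAGE = "I don't know."


def answer(
    query: str,
    search: Callable[[str], list[tuple[str, float]]],
    generate: Callable[[str, list[str]], str],
    min_score: float = 0.3,  # placeholder threshold, tune on real queries
) -> dict:
    """Run retrieval and generation, returning an explicit status for each failure mode."""
    # Failure mode: the vector store is down or unreachable.
    try:
        hits = search(query)
    except ConnectionError:
        return {"status": "retrieval_down", "answer": UNAVAILABLE_MESSAGE}

    # Failure mode: nothing retrieved is relevant enough to use.
    context = [text for text, score in hits if score >= min_score]
    if not context:
        return {"status": "no_context", "answer": NO_CONTEXT_MESSAGE}

    # Failure mode: the model returned nothing usable.
    reply = generate(query, context).strip()
    if not reply:
        return {"status": "no_answer", "answer": NO_ANSWER_MESSAGE}

    return {"status": "ok", "answer": reply}
```

Logging the `status` field alongside each query gives you a direct count of how often each failure mode occurs in production.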

Most of these refinements can wait until after launch. Ship a working MVP with basic RAG, measure real user queries, then iterate. If you’re an Australian founder building an AI product, cycle speed matters more than architectural perfection right now.

That said: if you’re considering a serious RAG build and want to avoid the mistakes we’ve seen, talk to Amora about your build. We’ve done this enough times to know which corners to cut and which ones matter.

The Bottom Line

RAG works. It’s not magic, and it’s not automatic. You need to treat retrieval as a first-class problem, measure ruthlessly, and iterate based on real user feedback.

For Australian startups: this is solvable with a small team, reasonable budget, and 8-12 weeks of focused work. The architecture is straightforward. The execution is disciplined. The result, an AI product that actually knows your data and answers questions reliably, is worth the effort.

Got something you want built?

Amora Digital is an Australian software and AI agency. We scope it, build it, and ship it – live in 28 days. No offshore teams. No surprises.

Book a discovery call
