Data Foundations: What to Get Right Before AI

Most founders I talk to want to build AI features yesterday. They’ve seen ChatGPT, they’ve read the hype, and they’re worried about being left behind. But here’s the thing: almost every AI project that fails early does so because the data underneath was never right to begin with. You can have the best model in the world, but garbage data produces garbage output. And that’s not a problem you fix in production.

This post covers what you actually need to sort out before you commit to an AI product or feature: the unglamorous foundation work that determines whether your AI is useful or just expensive.

1. Know Your Data Quality Before You Touch a Model

Data quality isn’t a nice-to-have. It’s the difference between an AI feature that works and one that costs you money and reputation.

Start with a real audit. Don’t guess. Walk through your actual database or data warehouse and answer these questions:

How much of your data is missing, null, or blank? (Anything above 5% in critical fields is a warning sign.)
How consistent is it? If you store customer names, are they normalised? Or do you have “John Smith”, “john smith”, “J. Smith” as three different records?
How fresh is it? If you’re building a pricing model but your cost data is six months old, you’re training on stale information.
Who enters it, and what’s their incentive? Data entered by salespeople under quota pressure looks different to data entered by an automated system.
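
The first three audit questions can be sketched in a few lines of Python. This is a minimal illustration, not a full profiling tool: the `customers` rows, the field names, and the 90-day freshness window are all made-up assumptions, and in practice you would run the same checks as queries against your own warehouse.

```python
from datetime import date

def null_rate(rows, field):
    """Share of rows where `field` is absent, None, or a blank string."""
    if not rows:
        return 0.0
    def is_missing(value):
        return value is None or (isinstance(value, str) and value.strip() == "")
    return sum(1 for r in rows if is_missing(r.get(field))) / len(rows)

def normalise_name(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())

def inconsistent_values(rows, field):
    """Raw values that collapse to the same normalised form."""
    groups = {}
    for r in rows:
        raw = r.get(field)
        if raw:
            groups.setdefault(normalise_name(raw), set()).add(raw)
    return {key: sorted(raws) for key, raws in groups.items() if len(raws) > 1}

def stale_share(rows, field, today, max_age_days):
    """Share of dated rows last updated more than `max_age_days` ago."""
    dated = [r[field] for r in rows if r.get(field)]
    if not dated:
        return 0.0
    return sum(1 for d in dated if (today - d).days > max_age_days) / len(dated)

# Hypothetical sample rows, for illustration only.
customers = [
    {"name": "John Smith", "email": "john@example.com", "updated": date(2024, 6, 1)},
    {"name": "john smith", "email": "", "updated": date(2024, 1, 5)},
    {"name": "J. Smith", "email": None, "updated": date(2023, 11, 20)},
    {"name": "Mei Tan", "email": "mei@example.com", "updated": date(2024, 5, 28)},
]

print(null_rate(customers, "email"))
print(inconsistent_values(customers, "name"))
print(stale_share(customers, "updated", date(2024, 6, 30), 90))
```

Note that simple normalisation catches “John Smith” versus “john smith” but not “J. Smith”. Abbreviations need fuzzy matching or a human review, which is one reason the audit takes longer than people expect.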

A fintech we worked with had been storing transaction amounts correctly, but the merchant category codes were hand-keyed with no validation. When they tried to build a fraud detection model, it was like training on sand. They spent three weeks cleaning that field before the model was even worth running.

The hard truth: data cleaning is often 60-70% of the work on a real AI project. Budget for it. Plan for it. Don’t pretend it won’t be there.

2. Define What Success Actually Looks Like

Before you build anything, you need to know what you’re measuring. This sounds obvious. It rarely is in practice.

“We want to use AI to improve customer retention” is not specific enough. What does improved retention mean? A 2% lift? A 10% lift? Over what timeframe? For which customer segment?

You need a baseline metric before you touch AI. Run a report right now on whatever you’re trying to improve. That number is your starting point. Without it, you’ll ship something, declare it a win because you built it, and never know if it actually worked.

For a classification problem (like predicting which leads will convert), you’re looking at metrics like precision and recall. For a ranking problem (like sorting which customers to contact first), you might measure lift in conversion rate or revenue in AUD per contact. For a generation problem (like creating email copy), you might use human review scores or actual click-through rates.
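
For the classification and ranking cases, the arithmetic is simple enough to write down. The sketch below is illustrative only: the labels, predictions, and contact counts are invented, and `lift` here means the model-driven conversion rate divided by the baseline rate.

```python
def precision_recall(actual, predicted):
    """Precision and recall for binary labels (1 = converted, 0 = did not)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)        # predicted yes, was yes
    fp = sum(1 for a, p in pairs if not a and p)    # predicted yes, was no
    fn = sum(1 for a, p in pairs if a and not p)    # predicted no, was yes
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def lift(model_conversions, model_contacts, base_conversions, base_contacts):
    """Model conversion rate divided by baseline conversion rate."""
    return (model_conversions * base_contacts) / (base_conversions * model_contacts)

# Hypothetical outcomes for eight leads.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0]

print(precision_recall(actual, predicted))
# Hypothetical campaign: 30 of 400 model-ranked contacts converted,
# against a baseline of 20 of 400.
print(lift(30, 400, 20, 400))
```

Precision tells you how many of the leads you chased were worth chasing. Recall tells you how many good leads you missed. Which one matters more depends on what a wasted call costs you compared with a lost deal.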

Write this down. Bind it to a number. You’ll thank yourself when you’re six weeks into the project and someone asks whether you’re shipping or not.

3. Sort Out Your Data Pipeline and Storage

Where does your data live, and how does it get there?

If you’re running AI models, especially ones that need to work in real time or near-real time, you need to be able to move data reliably. That usually means:

A data warehouse or lake where you can store and query everything in one place. (Snowflake, Redshift, or BigQuery are the common choices. They cost anywhere from a few hundred to a few thousand AUD per month depending on volume.)
A pipeline that keeps it fresh. Stale data means your model sees the world as it was, not as it is.
A way to keep training data separate from the data you test on. If you train a model and test it on the same records, the results will look better than they are: you won’t see the overfitting, and you’ll ship something that doesn’t work in the real world.
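
One common way to keep evaluation honest is to hold out the most recent records and test only on those, so the model is always judged on data it has never seen. The sketch below assumes each row carries a date. The `created` field name and the cutoff are hypothetical.

```python
from datetime import date

def time_split(rows, date_field, cutoff):
    """Train on rows before `cutoff`; test on rows from `cutoff` onwards."""
    train = [r for r in rows if r[date_field] < cutoff]
    test = [r for r in rows if r[date_field] >= cutoff]
    return train, test

# Hypothetical order records.
orders = [
    {"id": 1, "created": date(2024, 1, 10)},
    {"id": 2, "created": date(2024, 2, 3)},
    {"id": 3, "created": date(2024, 3, 21)},
    {"id": 4, "created": date(2024, 4, 2)},
    {"id": 5, "created": date(2024, 4, 30)},
]

train, test = time_split(orders, "created", date(2024, 4, 1))
print([r["id"] for r in train])
print([r["id"] for r in test])
```

Splitting by time rather than at random also mimics how the model will be used: trained on the past, run on the future.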

You don’t need a perfect data stack on day one. But you do need a deliberate one. If your data lives in five different spreadsheets and a Salesforce instance with no single source of truth, you’re not ready to train models yet. Fix that first.

A simple data warehouse, even a managed one, costs less than the engineering time you’ll waste trying to work around missing infrastructure.

4. Document Your Data, or Pay For It Later

Every field in your database should have a definition. Not because it’s fun, but because humans forget things.

What does “status” actually mean? Is it the customer’s current status or their status at the time the record was created? What’s a “high-value customer”? AUD 10,000 revenue, or AUD 100,000? Over what period?

Without documentation, your data scientists spend 20% of their time asking other people what things mean. Multiply that across a team, and you lose weeks.

Documentation also matters for compliance. If you’re handling personal data (and as an Australian business, you probably are), you need to be able to show an auditor exactly what you store, where it comes from, and how you’re using it. That’s easier with documentation from the start, not a panicked reverse-engineering job later.

A simple data dictionary (a spreadsheet or wiki that maps field names to what they mean, their data type, and whether they’re personally identifiable) takes a day to create and saves months of friction.
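
If a spreadsheet feels too loose, the same dictionary can live in code, where you can check it against your real tables. This is a rough sketch: the field names, definitions, and the `table_columns` list are invented for illustration, not a recommended schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldDoc:
    meaning: str
    data_type: str
    is_pii: bool  # personally identifiable information

# Hypothetical entries; yours would describe your own schema.
DATA_DICTIONARY = {
    "email": FieldDoc("Primary contact email", "text", True),
    "status": FieldDoc("Current status, updated on every change", "text", False),
    "revenue_12m": FieldDoc("Revenue in AUD over the trailing 12 months", "decimal", False),
}

def undocumented(columns, dictionary):
    """Columns that exist in a table but have no dictionary entry."""
    return sorted(c for c in columns if c not in dictionary)

def pii_fields(dictionary):
    """Documented fields flagged as personally identifiable."""
    return sorted(name for name, doc in dictionary.items() if doc.is_pii)

# Hypothetical column list, e.g. read from the warehouse's schema.
table_columns = ["email", "status", "revenue_12m", "seg_code"]

print(undocumented(table_columns, DATA_DICTIONARY))
print(pii_fields(DATA_DICTIONARY))
```

Running a check like `undocumented` on a schedule is a cheap way to stop the dictionary drifting out of date as new columns appear.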

5. Plan For Governance Before You’re Forced To

The moment you start using AI to make decisions about people (which emails to send them, which price to show them, whether they’re credit-worthy), you need guardrails.

This doesn’t mean hiring lawyers immediately. It means thinking about:

Who has access to what data? (If your junior developer can download your entire customer database, you have a problem.)
How do you test for bias? If your model makes different decisions about different groups of people (whether because of protected characteristics or correlated proxies), you need to know.
How do you audit what the model decided? If a customer complains that they were rejected for credit or shown a higher price, can you explain why?
How often do you retrain? If your model was trained in January and it’s now September, it’s probably seeing a different world. You need a schedule.
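
Two of these checks, bias and retraining, are easy to start measuring. The sketch below compares approval rates across groups and flags a model that has passed its retraining window. The group labels, decisions, and 90-day window are hypothetical, and a gap in rates is a prompt to investigate, not proof of unfairness on its own.

```python
from collections import defaultdict
from datetime import date

def approval_rates(decisions):
    """Approval rate per group from (group, approved) pairs."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        if ok:
            approved[group] += 1
    return {g: approved[g] / totals[g] for g in totals}

def rate_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means identical rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0

def retrain_due(trained_on, today, max_age_days):
    """True when the model is older than the agreed retraining window."""
    return (today - trained_on).days > max_age_days

# Hypothetical decision log: (group label, was the applicant approved?).
decisions = [
    ("A", True), ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", True), ("B", False), ("B", False), ("B", False),
]

rates = approval_rates(decisions)
print(rates)
print(rate_ratio(rates))
print(retrain_due(date(2024, 1, 15), date(2024, 9, 1), 90))
```

What ratio counts as acceptable, and how long a model can go between retrains, are judgment calls for your product and your regulators. The point is to have the numbers in front of you before someone asks.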

None of this requires expensive compliance infrastructure. It does require intentionality. Build it in at the start instead of bolting it on when something goes wrong.

Start Here, Not With Models

The teams that ship working AI products fast aren’t the ones who jump straight to fine-tuning language models or building neural networks. They’re the ones who spent two or three weeks getting their data house in order first, wrote down what success looked like, and then built something that actually solved a problem.

If you’re at the point where you’re thinking about building AI into your product or business, but you’re not sure whether your data is ready, talk to Amora about your build. We’ll help you figure out what’s actually a blocker and what’s just noise.

The data work isn’t exciting. But it’s the difference between shipping an AI feature in 28 days that actually works, and spending six months on something that never quite does.

Got something you want built?

Amora Digital is an Australian software and AI agency. We scope it, build it, and ship it – live in 28 days. No offshore teams. No surprises.
