You’ve decided to add AI to your product. Good instinct. The problem: LLM costs can spiral fast. A single poorly optimised feature can turn a $500/month API bill into $50,000. This post is about the technical and operational moves that keep that number reasonable.
Why LLM costs surprise founders
Most founders underestimate API expenses because they think in terms of features, not tokens. An LLM doesn’t charge per feature call-it charges per token (roughly every 4 characters). A single user interaction might fire off 3-5 API calls, each consuming 500-2,000 tokens, and suddenly that “free tier” feature is costing you AUD 0.03-0.10 per use.
Scale that to 10,000 monthly active users making just 5 requests each, and you’re looking at AUD 1,500-3,000 per month. Add more features, more calls, or higher-volume use cases, and you’re north of AUD 10,000 fast.
The surprise happens because costs aren’t visible in your codebase-they’re embedded in API calls. You don’t see a cost function; you just ship a feature and watch your bill climb.
Start with cheaper models where possible
Not every task needs GPT-4. Amora works with founders regularly on AI products, and the pattern is always the same: pick the smallest model that solves your problem.
Here’s the rough cost landscape (AUD per million input tokens, approximate):
- Claude 3.5 Haiku – AUD 0.80-1.20. Fast, cheap, solid for classification and simple summaries.
- GPT-4o Mini – AUD 0.15-0.40. Good all-rounder for moderate reasoning.
- GPT-4 – AUD 3-5. Overkill for 80% of use cases. Reserve for complex reasoning.
- Open-source models (via Together, Fireworks, etc.) – AUD 0.10-0.80. Self-hosted or API; lower cost, less hand-holding.
A feature that uses GPT-4 when Haiku works fine is burning 5-10x more than necessary. Before you pick a model, actually test cheaper ones. You’ll often find the difference in quality is negligible for your specific task.
Cache aggressively, process asynchronously
Two architectural moves kill most wasteful spending:
- Caching. If the same input or query appears twice-even within the same day-you’re paying twice. Cache the LLM response. Redis, simple file storage, or a database table. If a user edits a document and re-runs the same AI feature, don’t call the API again; return the cached result.
- Async processing. Don’t call the LLM synchronously in a user request. Queue it (use Bull, RabbitMQ, or a simple database job table), process it in the background, then notify the user when done. This reduces API timeouts, lets you batch calls, and gives you time to cancel redundant requests.
A fintech we worked with was running LLM summarisation on every document upload, waiting for the API to respond before returning the page. Response time: 4-8 seconds. Cost: AUD 8,000/month. We moved it to async, added caching, and dropped it to AUD 1,200/month. Same feature, faster UX, lower bill.
Batch and filter ruthlessly
Most teams send every user request to the LLM. Instead:
- Filter before you call. Is the input valid? Is it duplicate? Is it actually asking for an LLM-powered feature, or can you handle it with a simple rule or database query? A regex check or keyword search can eliminate 20-40% of unnecessary API calls.
- Batch requests. If you’re processing user uploads or periodic data, queue them and send 50 at once via batch APIs (OpenAI has batch processing; it’s 50% cheaper and slower, but fine for non-critical work).
- Set token limits. Don’t let the LLM ramble. Set `max_tokens` low. A summary should be 100 tokens, not 500. A classification should be 1-2 tokens. This is the easiest cost reduction you’ll miss if you don’t look.
Monitor and set hard limits
You can’t control what you don’t measure. Set up cost tracking immediately:
- Log every API call with model, tokens, and cost.
- Build a simple dashboard showing daily spend, cost per feature, cost per user.
- Set alerts: “Spend goes above AUD 100/day, email the team.”
- Implement rate limits: “Each user gets 10 API calls/day, no exceptions.”
The team that watches costs weekly will catch a runaway feature in days. The team that doesn’t look until month-end will find they’ve burnt AUD 20,000 on a feature that should cost AUD 2,000.
Hard limits also protect you operationally. If a bug or a malicious actor hammers your API, rate limiting caps your downside.
Consider building vs. buying
If you’re using LLMs for a core feature-not a nice-to-have-you might want to run your own model. This changes the cost structure entirely.
Open-source models (Llama, Mistral, etc.) can run on a single GPU instance (AUD 0.50-1.50/hour on AWS or Lambda). If you’re doing 100,000 inferences per month, that’s often cheaper than API calls. The trade-off: you manage infrastructure, latency is your problem, quality is slightly lower.
It’s worth investigating for high-volume, low-complexity tasks (classification, extraction, basic summarisation). It’s rarely worth it for reasoning-heavy work or low-volume use cases.
Know when to stop optimising
There’s a point of diminishing returns. If your LLM bill is AUD 200/month and your ARR is AUD 500,000, optimising further is theatre. If your bill is AUD 15,000 and your ARR is AUD 100,000, it’s existential.
Spend the engineering time where it matters. Most founders should focus on:
- Using a cheaper model.
- Adding caching and async processing.
- Setting token limits and rate limits.
That’s 80% of the savings. Everything else is tweaking.
Closing
LLM costs are real and they grow quietly. The founders who stay ahead are the ones who treat API spend like product metrics-measure it, alert on it, and make it visible to the team. If you’re shipping an AI feature and cost control feels unclear, talk to Amora about your build. We ship MVPs in 28 days, and cost-efficient architecture is built in from day one, not retrofitted later.
Got something you want built?
Amora Digital is an Australian software and AI agency. We scope it, build it, and ship it – live in 28 days. No offshore teams. No surprises.