Uncategorized

Controlling LLM Costs Before They Eat Your Runway

LLM API calls spiral fast. Here's how to architect for cost, choose the right model, and keep your AI product profitable.

If you’re building a product that talks to an LLM-whether that’s a chatbot, content generator, or AI agent-you’re probably already wondering when the bill arrives. The honest answer: sooner than you think, unless you design for cost from day one.

We’ve seen startups burn through 50K+ AUD in a single month on API calls that could’ve cost a quarter of that with smarter routing. The difference isn’t luck. It’s architecture.

Why LLM Costs Blow Up

Every API call to Claude, GPT-4, or Gemini costs money per token. A token is roughly 4 characters. A typical paragraph is 100-200 tokens. That sounds cheap until you multiply it: 10,000 users each sending 5 requests per day, each request costing 0.02 AUD in input tokens and 0.06 AUD in output tokens. That’s roughly 30,000 AUD monthly, and you haven’t scaled yet.

The real problem isn’t the cost per call. It’s scope creep in what you’re sending the model.

  • Full context injection: Sending your entire knowledge base with every request. Stop. Use retrieval instead.
  • No caching: Asking the same questions repeatedly without storing results. Build a cache layer.
  • Synchronous processing: Running inference in real-time for non-urgent tasks. Batch it.
  • Wrong model for the job: Using GPT-4 to classify simple text when GPT-4 Mini handles it fine at 1/10th the cost.
  • No rate limits: Letting users hammer your API without throttling. They will.

Most of these aren’t technical problems. They’re design decisions made before a single line of code runs.

Choose Your Model Stack Carefully

This is where most teams go wrong. They pick one model-usually the most capable one-and call it everywhere.

That’s fine if your budget is infinite. For everyone else, you need a tiered approach.

  1. Simple tasks (classification, extraction): Use Llama 3.1 (via Together AI or modal.com), GPT-4 Mini, or Claude Haiku. Cost: roughly 0.0001-0.0005 AUD per 1K input tokens.
  2. Moderate complexity (summarisation, moderate reasoning): GPT-4 Mini, Claude 3.5 Haiku, or Gemini 1.5 Flash. Cost: 0.0001-0.001 AUD per 1K input tokens.
  3. Hard problems (complex analysis, multi-step reasoning): Claude 3.5 Sonnet or GPT-4o. Use sparingly. Cost: 0.001-0.003 AUD per 1K input tokens.

The hierarchy matters. If 80% of your queries are “is this email spam?”, route those to the cheap models and save the expensive ones for edge cases.

Real numbers: a fintech product we worked with reduced costs by 60% just by moving straightforward KYC verification from GPT-4 to Claude Haiku. The output quality didn’t change. The users didn’t notice. The runway extended by several months.

Build a Cost-First Architecture

Before you write the first API call, design your system to minimise them.

Caching and storage: Don’t ask the LLM the same question twice. Implement a caching layer (Redis, in-memory, or even PostgreSQL with hashing). If a user asks “what are your shipping times?” and another user asks the same thing, one should hit cache, not the API.

Retrieval-augmented generation (RAG): If you’re building a product that answers questions about your data, don’t shove the entire dataset into the prompt. Use semantic search to pull only the relevant documents, then pass those to the LLM. This cuts token count dramatically.

Batch processing: If you need to process 100,000 customer support tickets tomorrow, don’t call the API 100,000 times in real-time. Use batch APIs (OpenAI and Anthropic both offer them). They cost 50% less and don’t compete for rate limits.

Local inference: For very high-volume, latency-insensitive tasks, consider running a smaller open-source model locally (Llama, Mistral) via Ollama or vLLM. Zero API costs. The quality tradeoff might be acceptable for your use case.

Output control: Tell the model to be concise. Instead of “write a summary,” say “write a 2-sentence summary.” This cuts token output by 60-70% and users often prefer it anyway.

Monitor and Cap Your Spending

Even with good architecture, you need visibility. Most founders don’t know their actual LLM costs until the bill arrives.

Set up monitoring:

  • Track tokens per user per day. Set alerts if it spikes 50% above baseline.
  • Log which models handle which requests. Understand your distribution.
  • Calculate cost per user cohort. If your enterprise customers cost 10x to serve, you have a problem.
  • Use API provider dashboards (OpenAI, Anthropic, Together) to catch runaway requests early.
  • Set hard spending caps on API keys. Most providers let you set maximum monthly usage.

This isn’t hypothetical. We’ve seen a single bug-a retry loop with exponential backoff misconfigured-burn through 8K AUD in a weekend. Monitoring would’ve caught it in hours.

Make Trade-offs Consciously

You can have fast, cheap, or good. Pick two. Then decide what you’re willing to sacrifice based on your actual constraints.

A customer support chatbot that takes 10 seconds to respond is worse than one that responds in 2 seconds, even if it’s slightly less intelligent. Use a faster model.

A content generator that costs 0.50 AUD per article but produces something saleable is better than one that costs 2.00 AUD and is marginally better. Users are okay with “good enough” if it’s fast and cheap.

A data analysis tool that’s 95% accurate and costs 1K AUD monthly beats 99% accurate at 50K AUD monthly if your users accept the tradeoff.

Explicit trade-offs are better than default choices. If you’re building an AI product and haven’t thought about cost vs. quality, talk to Amora about your build. We architect these decisions early so they don’t become problems later.

The Real Lesson

LLM costs aren’t a technical problem you solve after launch. They’re a product design problem you solve before. Choosing the right model, caching aggressively, batching where possible, and monitoring relentlessly will keep your burn rate sane while you find product-market fit.

Your runway is precious. Don’t let API calls waste it.

Got something you want built?

Amora Digital is an Australian software and AI agency. We scope it, build it, and ship it – live in 28 days. No offshore teams. No surprises.

Book a discovery call

Ready to stop guessing and start growing?

Book a 30-minute strategy call. No pitch, no pressure — just a clear read on what's working, what isn't, and where the lift is.

Book your strategy call