Cost & Scale Modeling

The demo costs nothing. The thousandth user is where the math gets real.

We model what the idea actually costs and how it performs at the scale you are aiming for, so you commit with a number you trust instead of a per-user guess and a brave face.

The projection

Cheap today. The question is the shape of that line at scale, and whether it stays under the budget or blows straight through it.

The problem

Per-user costs that round to zero in the demo do not round to zero at scale.

AI made it easy to build something impressive cheaply. A few hundred users, a modest bill, and it all looks sustainable. The trap is assuming the line stays flat. It rarely does. The cost per request that was a rounding error becomes the whole business model when you multiply it by real volume, and you find out at the worst time: right after you have grown.

It is not just the obvious bill. It is the model call that costs ten times more than you assumed under real usage, because every retry, every long context window, and every agent that loops through its tools multiplies the per-request cost long before that increase shows up on a monthly invoice. It is the latency that was fine for one user and unacceptable for a thousand hitting it at once. The infrastructure that scaled linearly until it hit a wall and needed a redesign. The free-tier service that stops being free exactly when you start to succeed. Each one turns "we are growing" into "we are growing and bleeding money," which is a worse problem than not growing.

Picture the version of you this happens to. You ship an AI feature, a few hundred people try it, the bill is a coffee a day, and everyone agrees it is sustainable. Six months later it works. Usage is ten times higher, the launch went well, and you open the invoice expecting ten times the coffee. Instead the number is forty times higher, because the heavy users you attracted hammer the expensive paths, retries piled up under load, and the logging you turned on to debug the launch is now its own line item. The product succeeded. The economics did not, and you found out the month after it was too late to choose differently. That is the moment we exist to move earlier.

We model it before you commit. The real cost per user, how it moves as volume climbs, where the latency breaks, and what the bill looks like at the scale you are actually aiming for. You get a number you can plan around, the levers that change it, and an early warning on the cliffs, so growth is something you priced in, not something that surprises you.

The non-obvious part

You cannot load-test a system that does not exist yet. So you model it.

Load testing answers a different question than modeling. A load test takes a thing that already exists and asks how it holds up. Useful, and we do it when there is something to point traffic at. But the expensive decisions happen before the code exists, when there is nothing to test. At that point the only tool that tells you the truth is a model: a deliberate sketch of how cost and load behave at a scale you have not reached, built from measured unit costs and named assumptions instead of a running system. Modeling is how you see the cliff while it is still cheap to steer around it.

This matters because cost curves are almost never straight lines, and the bends are where businesses die. A service is free until a tier ends, flat until a rate limit forces an upgrade, smooth until a single database is doing all the writes and replication starts falling behind. Data egress is the quiet one: it is charged per gigabyte, it scales with every user you add, and it sits off to the side where nobody looks. A back-of-the-envelope example makes the point. One hundred million requests a month at fifty kilobytes of response each is about five terabytes of egress, which at the rough going rate of nine cents a gigabyte is roughly four hundred and fifty dollars a month, climbing in lockstep with traffic. None of that shows up when a hundred people are testing the demo. All of it shows up later, and a spreadsheet would have shown it first.
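The egress arithmetic above is small enough to sketch in a few lines. This is an illustration only: the function name is made up for the example, the nine-cent rate is the rough figure from the paragraph and not any provider's quote, and it uses decimal units (one gigabyte as a million kilobytes).

```python
# Back-of-the-envelope egress cost, using the rough figures from the text.
# The default rate is an assumption, not a quote from any specific provider.

def monthly_egress_cost(requests_per_month: int,
                        response_kb: float,
                        usd_per_gb: float = 0.09) -> float:
    """Estimated monthly egress bill in USD, using decimal units."""
    total_gb = requests_per_month * response_kb / 1_000_000  # KB -> GB
    return total_gb * usd_per_gb


if __name__ == "__main__":
    # Same response size, growing traffic: the bill climbs in lockstep.
    for requests in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
        cost = monthly_egress_cost(requests, response_kb=50)
        print(f"{requests:>13,} requests/month -> ${cost:,.2f}")
```

The hundred-million row reproduces the roughly four hundred and fifty dollars a month from the example. The rows on either side show why nobody notices the line item early, and why it is hard to ignore later.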

The harshest version of the cliff is architectural. Some designs scale by adding more of the same: another server, another replica, another cache. Others scale fine until a fundamental limit hits, and then the only fix is a rewrite. Common wisdom in system design is to plan capacity for several times your expected growth, because what is comfortable at ten thousand users can break entirely at a hundred thousand, and by then the structure of the system is the problem, not the size of the bill. The difference between the two kinds of design is invisible in a demo and obvious in a model. You want to know which one you are about to build before you build it, not six months and a painful migration after.

That is the whole trade we are offering. A day with a sketch and honest numbers, against a quarter spent rebuilding something that worked beautifully right up until it succeeded. The cheapest place to discover a cliff is on paper, where moving the line costs an afternoon instead of an architecture.

How we run it

Model the bill before it arrives.

Cost and scale modeling is not a spreadsheet of optimism. It is a hard look at how the numbers actually behave as you grow, and these are the principles we run it on.

Measure the real unit cost, not the demo one

We find what a single user, request, or transaction genuinely costs under realistic conditions: the model calls, the infrastructure, the third-party services, the parts that hide in the demo because the volume is tiny. We count what the demo conveniently leaves out: the retries that fire under load, the long context windows that real users send, the agent loop that calls a tool three times instead of once. Those are exactly the things that make a per-request cost arrive ten times bigger than the back of your napkin said. The unit cost is the number everything else multiplies, so we get it honest before we project anything from it. Get this one wrong by a factor of three and every downstream projection is wrong by a factor of three too.
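A minimal sketch of how those multipliers stack, assuming a simple token-priced model call. Every number here is hypothetical: the token counts, the retry rate, the three calls per request, and the per-thousand-token prices are placeholders chosen for illustration, not measurements or any vendor's pricing.

```python
from dataclasses import dataclass


@dataclass
class RequestProfile:
    input_tokens: int         # prompt plus context sent on each model call
    output_tokens: int        # tokens generated on each model call
    calls_per_request: float  # model calls per user request (agent loops > 1)
    retry_rate: float         # extra attempts per call, e.g. 0.15 = 15% more


def cost_per_request(profile: RequestProfile,
                     usd_per_1k_input: float,
                     usd_per_1k_output: float) -> float:
    """Effective cost of one user request, including loops and retries."""
    per_call = (profile.input_tokens / 1000 * usd_per_1k_input
                + profile.output_tokens / 1000 * usd_per_1k_output)
    return per_call * profile.calls_per_request * (1 + profile.retry_rate)


# Hypothetical prices per thousand tokens.
PRICE_IN, PRICE_OUT = 0.001, 0.002

# What the demo sees: short prompt, one call, no retries.
demo = RequestProfile(input_tokens=500, output_tokens=200,
                      calls_per_request=1, retry_rate=0.0)
# What real usage might look like: long context, a three-step agent loop,
# and 15% of calls retried under load.
real = RequestProfile(input_tokens=2000, output_tokens=300,
                      calls_per_request=3, retry_rate=0.15)

demo_cost = cost_per_request(demo, PRICE_IN, PRICE_OUT)
real_cost = cost_per_request(real, PRICE_IN, PRICE_OUT)
ratio = real_cost / demo_cost

if __name__ == "__main__":
    print(f"demo: ${demo_cost:.5f}  real: ${real_cost:.5f}  ratio: {ratio:.1f}x")
```

With these placeholder inputs the realistic request comes out at about ten times the demo one, without any single input looking alarming on its own. The point of the sketch is the multiplication, not the specific values.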

Project the curve, not a single point

Cost rarely scales in a straight line. We model how it actually moves as volume climbs, where it is linear, where it jumps, and where a tier or a limit changes the slope, so you see the shape of the curve at 10x and 100x, not just one comfortable number at today's tiny scale. We plot it at several points along the way, not two, because the interesting behavior lives between them: the spot where a cheap tier runs out, the volume where a flat line suddenly bends. A single number at today's scale tells you almost nothing about the number that ends your runway.
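One way to picture a curve that is flat, then linear, then jumps is a piecewise function evaluated at several multiples of today's volume. This is a sketch under invented assumptions: the free allowance, the usage price, the upgrade threshold, and the plan fee are all hypothetical stand-ins for whatever your own services actually charge.

```python
# All four constants are assumptions for illustration only.
FREE_REQUESTS = 50_000         # free-tier allowance per month
USD_PER_REQUEST = 0.002        # usage price beyond the allowance
UPGRADE_THRESHOLD = 2_000_000  # volume where a rate limit forces a bigger plan
UPGRADE_FEE = 1_500.0          # flat monthly fee for that bigger plan


def monthly_cost(requests: int) -> float:
    """Piecewise monthly bill: free, then linear, then a step at the upgrade."""
    cost = max(0, requests - FREE_REQUESTS) * USD_PER_REQUEST
    if requests > UPGRADE_THRESHOLD:
        cost += UPGRADE_FEE
    return cost


def project(base_requests: int, multipliers):
    """Evaluate the bill at several multiples of today's volume."""
    return [(m, base_requests * m, monthly_cost(base_requests * m))
            for m in multipliers]


if __name__ == "__main__":
    for multiple, requests, cost in project(20_000, (1, 3, 10, 30, 100, 300)):
        print(f"{multiple:>4}x  {requests:>10,} requests  ${cost:>10,.2f}")
```

Plotting only the 1x and 300x rows would hide both bends. The intermediate points are where the free tier runs out and where the plan upgrade lands.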

Find the cliffs before you drive off them

The dangerous costs are the discontinuous ones: the free tier that ends, the rate limit that forces an upgrade, the architecture that scales smoothly until it suddenly does not. The worst cliff is the one where a single database is handling every write and replication starts lagging behind real traffic, because the fix is not a bigger invoice, it is a redesign. We map those cliffs in advance, name the volume at which each one hits, and put it on the calendar, so the moment you cross one is a planned decision, not a surprise invoice and an emergency rewrite.
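Putting a cliff on the calendar can be approximated with compound growth: given today's volume and an assumed monthly growth rate, solve for the month in which each threshold is first reached. The thresholds, the cliff names, and the 20% growth rate below are hypothetical, and real growth is rarely this smooth, so treat the result as a planning estimate.

```python
import math


def months_until(current: float, threshold: float, monthly_growth: float) -> int:
    """Whole months until compounding volume first reaches the threshold."""
    if current >= threshold:
        return 0
    if monthly_growth <= 0:
        raise ValueError("volume never reaches the threshold without growth")
    return math.ceil(math.log(threshold / current) / math.log(1 + monthly_growth))


# Hypothetical cliffs, expressed as monthly request volumes.
CLIFFS = {
    "free tier ends": 50_000,
    "rate limit forces plan upgrade": 2_000_000,
    "single write database saturates": 10_000_000,
}

if __name__ == "__main__":
    current_requests = 20_000
    growth = 0.20  # assumed 20% month-over-month growth
    for name, threshold in CLIFFS.items():
        months = months_until(current_requests, threshold, growth)
        print(f"{name}: about {months} months out")
```

Running the same calculation with a pessimistic and an optimistic growth rate gives a window for each cliff instead of a single date, which is usually the more honest thing to plan around.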

Model latency as well as money

Scale is not only a cost question; it is a speed one. Work that was instant for one user can crawl when a thousand hit it at once. We model how latency behaves under real concurrency: where requests start queueing, where a shared resource becomes the bottleneck, and how the slowest responses move as load climbs, because users feel the worst few percent of requests, not the average. That way the experience does not quietly fall apart at exactly the moment more people start to use it.
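A first-pass way to see where queueing starts is the textbook single-server queue (M/M/1), where response time depends on how close the arrival rate is to the service rate. This is a deliberate simplification and an assumption of the sketch: real systems have many workers, bursty traffic, and service times that are not exponential, so it shows the shape of the curve, not a forecast. The rates below are hypothetical.

```python
import math


def mm1_latency(arrival_rate: float, service_rate: float,
                percentile: float = 0.99):
    """Mean and tail response time in seconds for an M/M/1 queue.

    In this model the response time is exponentially distributed with
    rate (service_rate - arrival_rate), which gives both figures directly.
    """
    if arrival_rate >= service_rate:
        return math.inf, math.inf  # the queue grows without bound
    spare = service_rate - arrival_rate
    mean = 1.0 / spare
    tail = -math.log(1.0 - percentile) / spare
    return mean, tail


if __name__ == "__main__":
    service_rate = 100.0  # assumed capacity: 100 requests per second
    for load in (10, 50, 80, 90, 95, 99):
        mean, p99 = mm1_latency(load, service_rate)
        print(f"{load:>3} req/s  mean {mean * 1000:7.1f} ms  p99 {p99 * 1000:7.1f} ms")
```

Latency barely moves at low utilization and then climbs steeply as load approaches capacity, with the tail moving several times further than the mean. That bend is the thing a demo with one user never shows.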

Account for the costs nobody puts on the napkin

The bill people model is the compute and the model calls. The bill that surprises them is everything around it. Data egress charged per gigabyte that grows with every user. Storage that only ever goes up, because almost nothing gets deleted. Logging and observability volume that quietly becomes its own line item, the tax you pay to understand the system you built. We pull these out into the open and put a number on each, because the cost almost always lives somewhere other than where you guessed, and the line items off to the side are the ones that compound silently until they are the whole problem.
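A rough sketch of how the side costs behave over time, assuming per-request sizes and unit prices that are entirely hypothetical. The one structural point it encodes is that storage is cumulative: each month's bill covers everything ever written, while egress and logging track only that month's traffic.

```python
def side_costs(months: int,
               requests_month_one: float,
               monthly_growth: float,
               response_kb: float = 50,          # assumed response size
               stored_kb_per_request: float = 5,  # assumed data kept per request
               log_kb_per_request: float = 2,     # assumed log volume per request
               egress_usd_per_gb: float = 0.09,
               storage_usd_per_gb_month: float = 0.02,
               log_usd_per_gb: float = 0.50):
    """Month-by-month egress, storage, and logging cost in USD."""
    rows = []
    stored_gb = 0.0
    requests = float(requests_month_one)
    for month in range(1, months + 1):
        # Nothing is deleted, so stored data only ever accumulates.
        stored_gb += requests * stored_kb_per_request / 1e6
        rows.append({
            "month": month,
            "egress": requests * response_kb / 1e6 * egress_usd_per_gb,
            "storage": stored_gb * storage_usd_per_gb_month,
            "logging": requests * log_kb_per_request / 1e6 * log_usd_per_gb,
        })
        requests *= 1 + monthly_growth
    return rows


if __name__ == "__main__":
    for row in side_costs(12, requests_month_one=5_000_000, monthly_growth=0.15):
        print(f"month {row['month']:>2}: egress ${row['egress']:8.2f}  "
              f"storage ${row['storage']:8.2f}  logging ${row['logging']:8.2f}")
```

Even with flat traffic, the storage line keeps rising while the other two stay level, which is why it tends to be the one that gets noticed last.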

Show the levers, not just the number

A scary projection is only useful if you can do something about it. We lay out the levers that actually move it: caching the repeated work, routing the easy cases to a cheaper model and saving the expensive one for when it earns its keep, batching requests that do not need to be instant, negotiating or changing a tier. You are not just told it will cost too much; you are shown the handful of changes that bring it back under the line. And we rank them, because two of those levers usually do most of the work and the rest are noise, and knowing which two saves you from optimizing the things that do not matter.
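Ranking levers can be sketched as estimating what each one saves on the bill line it touches and sorting by the result. The baseline bill and the savings fractions below are invented for illustration; in practice each fraction would come from measurement, and levers that touch the same line do not simply add up when combined.

```python
def rank_levers(baseline: dict, levers: dict):
    """Rank levers by estimated monthly savings, largest first.

    baseline: bill line -> monthly cost in USD
    levers:   lever name -> (bill line it touches, fraction of that line saved)
    """
    savings = {name: baseline[line] * fraction
               for name, (line, fraction) in levers.items()}
    return sorted(savings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical monthly bill, split by cost driver.
BASELINE = {"model_calls": 8_000.0, "infrastructure": 1_500.0, "third_party": 500.0}

# Hypothetical savings estimates; each is applied to its line independently.
LEVERS = {
    "route easy cases to a cheaper model": ("model_calls", 0.40),
    "cache repeated work": ("model_calls", 0.30),
    "batch non-urgent requests": ("infrastructure", 0.15),
    "renegotiate the third-party tier": ("third_party", 0.20),
}

if __name__ == "__main__":
    total = sum(BASELINE.values())
    for name, saved in rank_levers(BASELINE, LEVERS):
        print(f"{name:<38} saves ${saved:8,.2f}/month ({saved / total:.0%} of bill)")
```

In this made-up bill the two levers that touch model calls dominate and the other two barely register, which is the usual pattern: the ranking follows where the money actually is.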

"Nobody is killed by the cost in the demo. They are killed by the cost at scale, discovered the month after they finally succeed. We do that math while it is still a decision."
Inferzo · Bending binaries to behave

What you get

A number you can plan a business around.

Not a vague "it should be fine." A modeled, defensible view of what it costs and how it performs at the scale you are aiming for, with the levers to change it.

The real unit cost: what one user, request, or transaction actually costs under realistic conditions

A cost curve projected to the scale you are targeting, not just today's volume

The cliffs flagged: the tiers, limits, and thresholds where the cost or the architecture jumps

A latency model showing how the experience holds up under real concurrency

The levers that move the numbers, ranked by how much they actually help

A clear verdict: sustainable as-is, sustainable with these changes, or not at this price

A written model you can take to a board, an investor, or a pricing decision

Worried your AI feature is cheap now and a problem at scale? Tell us the per-request cost and where you are headed, and we will model what the bill becomes.

Invoke us

Is this the right call

When this fits.

Good fit

You have an AI or usage-based product and need to know what it costs at real scale before you commit
Your costs look fine today and you are about to grow fast enough for that to change
You are setting a price and need your real cost per user, so you do not lose money on every one
An investor or board wants the unit economics to hold up at scale, not just at launch

Wrong call

Your costs are flat, well understood, and not going to change with scale. There is nothing to model.
You are pre-idea and have nothing to cost yet. Start with feasibility or a proof of concept first.
You want someone to tell you it will be cheap. We model the real number, and sometimes the real number is a no.

How it runs

Quick, concrete, and honest about the cliffs.

Cost and scale modeling is a short, focused exercise, not a standing cost-accounting function. It runs in days, sized to the decision in front of you, because its job is to give you a number you can act on now, not to monitor spend forever.

The output is concrete and defensible. Real measured unit costs, an explicit projection, named assumptions you can challenge, and the cliffs marked, so the model holds up when a CFO or an investor pushes on it instead of falling apart under the first hard question.

It connects to the build. If the numbers work, you proceed knowing your economics. If they only work with changes, those changes become part of the architecture from day one, designed in instead of retrofitted in a panic. If they do not work at any price worth paying, you found out before you built the thing, which is the cheapest possible time to find out.

What we settle before we begin: the scale you are actually targeting, what counts as affordable at that scale, and the cost drivers (model calls, infrastructure, third-party services) that the projection has to account for. Everything else follows from those three.

Ready to start

Tell us what it costs today and where you are headed.

Describe the product, what a request or a user costs now, and the scale you are aiming for. We will model what the bill and the latency actually become at that size, the cliffs along the way, and the levers that keep it under the line.

Invoke us