Why every cloud AI has usage caps (the unit economics)

Outlier · solo-built in Grand Rapids · published 2026-06-09 · Last updated 2026-06-09

Quick answer

Cloud inference costs the provider real money per token: GPU time, power, cooling, datacenter overhead.
A flat monthly fee plus unlimited usage would mean heavy users cost more than they pay. Caps close that gap.
That's why caps survive every pricing tier, including the $200/month ones.
Local AI has no marginal token cost to a provider, so there's nothing to cap. That's economics, not generosity.

People talk about usage caps like they're a temporary annoyance the providers will eventually fix. They won't, and it's worth understanding why. The cap is not a bug or a capacity hiccup. It's arithmetic. Every token a cloud model generates burns GPU-seconds someone has to pay for, and a flat $20 subscription only buys so many of them.

What a token actually costs

When a cloud model answers you, a rack of GPUs spends real seconds computing your tokens. Those GPUs cost tens of thousands of dollars each, draw kilowatts, and sit in datacenters with cooling and networking overhead. The provider pays for all of it whether you notice or not.

Nobody outside the labs knows the exact per-token cost, and it keeps falling. But it is not zero, and at flagship-model quality it's not close to zero. Multiply a non-zero cost by millions of subscribers and the arithmetic gets serious fast.

Why the flat subscription forces a cap

A subscription is a bet: the provider bets the average user consumes less compute than the fee covers. Light users subsidize heavy users. That works until the heavy users get too heavy, and AI made them very heavy. One person running coding agents all day can burn hundreds of times the compute of someone asking three questions a week.

So the provider has two choices: price the plan for the heaviest users (and lose everyone else), or cap usage so the heaviest users can't sink the pool. Every provider picked the cap. Anthropic added weekly limits on top of Claude's rolling five-hour windows in 2025 specifically because, in their own words, a small number of users were running Claude continuously. OpenAI meters its top models by message count and downgrades you when you run out.
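To make the pooling argument concrete, here is a minimal sketch of one month for a 100-person subscriber pool. Every number in it is an assumption picked for illustration: the $5 per million tokens cost, the usage figures, and the 10M-token cap are not real provider data. Only the $20 fee comes from the plans discussed here.

```python
# Hypothetical numbers throughout; only the $20 flat fee mirrors the article.
FEE = 20.00            # flat monthly fee per subscriber, in dollars
COST_PER_MTOK = 5.00   # ASSUMED provider cost per million tokens, in dollars


def pool_margin(users, cap_mtok=None):
    """Monthly revenue minus compute cost for a pool of subscribers.

    `users` is a list of monthly usage figures in millions of tokens.
    If `cap_mtok` is given, each user's served usage is clipped to it.
    """
    revenue = FEE * len(users)
    served = [u if cap_mtok is None else min(u, cap_mtok) for u in users]
    cost = COST_PER_MTOK * sum(served)
    return revenue - cost


# 95 light users at 0.5M tokens each, 5 heavy users at 150M tokens each (300x heavier).
users = [0.5] * 95 + [150.0] * 5

uncapped = pool_margin(users)                # the 5 heavy users sink the pool
capped = pool_margin(users, cap_mtok=10.0)   # same pool with a 10M-token cap

# Choice one from the text: a fee high enough to cover the heaviest user.
fee_to_cover_heaviest = COST_PER_MTOK * max(users)

print(f"uncapped margin: ${uncapped:,.2f}")
print(f"capped margin:   ${capped:,.2f}")
print(f"fee needed to cover the heaviest user: ${fee_to_cover_heaviest:,.2f}")
```

Under these assumptions, five users out of a hundred turn a profitable pool into a losing one, and the fee that would cover them is far beyond what the other ninety-five would pay. The cap is the cheaper fix.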

Why paying more doesn't remove the cap

The $100 and $200 tiers raise the limit because they cover more compute. They don't remove it, because the arithmetic never goes away. Whatever the fee, there's a usage level where you cost more than you pay, and the provider has to protect against it. That's why the fine print on every tier, including Max and Pro, keeps the throttle clause.
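The same arithmetic as a sketch: for any fee, the break-even usage is the fee divided by the cost per token. The $20, $100, and $200 fees are the tiers mentioned above; the per-token cost is an assumed placeholder, since the real figure isn't public.

```python
# ASSUMED provider cost per million tokens; the true figure is not public.
ASSUMED_COST_PER_MTOK = 5.00


def break_even_mtok(fee, cost_per_mtok=ASSUMED_COST_PER_MTOK):
    """Millions of tokens per month at which a subscriber's compute cost equals the fee."""
    if cost_per_mtok <= 0:
        raise ValueError("cost must be positive; at zero cost there is nothing to cap")
    return fee / cost_per_mtok


# Monthly fees for the tiers discussed in the article.
tiers = [20, 100, 200]
limits = {fee: break_even_mtok(fee) for fee in tiers}

for fee, mtok in limits.items():
    print(f"${fee}/month breaks even at {mtok:g}M tokens")
```

A higher fee moves the break-even point up in proportion, but the result is always a finite number. Past it, the subscriber costs more than they pay, whatever the tier.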

Receipts: Anthropic's help center documents the rolling reset windows; the 2025 weekly-cap change was announced by Anthropic and widely reported, with the company citing users running Claude "continuously in the background." OpenAI publishes per-plan message limits in its own documentation. None of this is hidden; it's just rarely added up.

The one architecture with nothing to cap

Run the model on your own machine and the marginal cost of a token lands on hardware you already own. There's no provider absorbing your GPU time, so there's no pool to protect and nothing to cap. This isn't a pricing decision a local tool gets to brag about. It's structural. Outlier doesn't cap usage for the same reason your text editor doesn't: your Mac does the work.

The honest other side: your Mac is slower than a rack of H100s. Local Core 27B runs about 20.7 tok/s where cloud flagships run 80 to 100 tok/s. You trade speed for a meter that doesn't exist. For long sessions and agent work, that trade usually wins. For quick one-off questions, it may not matter either way.
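Here is what that speed gap means in wall-clock time, as a rough sketch. The throughput figures are the ones quoted above; the 2,000-token response length is an assumption for illustration, and the calculation ignores prompt processing and network latency.

```python
def generation_seconds(tokens, tok_per_sec):
    """Seconds to generate `tokens` output tokens at a steady throughput."""
    return tokens / tok_per_sec


RESPONSE_TOKENS = 2000  # ASSUMED length of a long answer or code file

local = generation_seconds(RESPONSE_TOKENS, 20.7)        # local Core 27B figure from the text
cloud_slow = generation_seconds(RESPONSE_TOKENS, 80.0)   # low end of the cloud range
cloud_fast = generation_seconds(RESPONSE_TOKENS, 100.0)  # high end of the cloud range

print(f"local:  {local:.0f} s")
print(f"cloud:  {cloud_fast:.0f} to {cloud_slow:.0f} s")
print(f"slowdown: {local / cloud_slow:.1f}x to {local / cloud_fast:.1f}x")
```

So the local run takes roughly four to five times longer per response at these rates. That is the price of having no meter.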

Frequently asked questions

Will AI usage caps go away as compute gets cheaper?

They'll loosen, but the structure stays. As long as a flat fee covers a non-zero per-token cost, there's a usage level that loses the provider money, and they'll cap against it. Cheaper compute raises the cap; it doesn't delete it.

Isn't the API 'uncapped'?

The API has no hard usage wall, but it meters every token, bills you for it, and enforces per-minute rate limits. It trades a cap for a variable bill. Heavy agent use through an API routinely costs hundreds of dollars a month.
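A quick sketch of how a metered bill adds up. The per-token prices and the daily token volumes below are assumptions for illustration, not any provider's published rates; substitute the current price sheet and your own usage.

```python
# ASSUMED prices in dollars per million tokens; check the provider's current price sheet.
ASSUMED_INPUT_PRICE = 3.00
ASSUMED_OUTPUT_PRICE = 15.00


def monthly_api_bill(input_mtok_per_day, output_mtok_per_day, days):
    """Metered cost in dollars for a month of usage, given daily token volumes in millions."""
    daily = (input_mtok_per_day * ASSUMED_INPUT_PRICE
             + output_mtok_per_day * ASSUMED_OUTPUT_PRICE)
    return daily * days


# Hypothetical coding-agent workload: agents re-read a lot of context,
# so input tokens dominate. 22 working days in the month.
bill = monthly_api_bill(input_mtok_per_day=5.0, output_mtok_per_day=0.5, days=22)

print(f"monthly bill: ${bill:,.2f}")
```

Nothing stops the usage, but every extra token raises the bill. Under these assumptions a steady agent workload lands in the hundreds of dollars a month.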

How does Outlier afford to be uncapped?

Because your Mac supplies the compute. Outlier sells the software once (or $20/month); it doesn't pay for your GPU time, so heavy use costs it nothing. That's the structural difference between local and cloud, not a promotional choice.

Try Outlier free

Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (all 7 model tiers incl. Plus 397B). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.

Download for Mac