Why We Think Passing AI Token Costs to Customers is the Wrong Call
John Honovich
The pattern playing out across the industry
Over the last few months, companies that rolled out AI tools internally or built AI-powered products are pulling back. The stories are getting specific: one company burned through its entire annual AI coding budget in four months. Another reportedly spent $500,000 on a single AI tool in one month because nobody set usage limits. Individuals are costing themselves thousands of dollars in a day. Teams are waking up to bills that are multiples of what they budgeted, and a growing response is to ration access, impose hard caps, or cut back entirely.
That reaction is understandable. But the most common solution, passing variable token costs to customers through usage-based pricing, makes the problem worse, not better.
Why pass-through pricing damages the customer relationship
When you charge customers per token or per API call, you've offloaded the cost problem without solving it. Your customer didn't sign up to manage inference costs. They signed up to get a job done. When the bill comes in higher than expected, they don't think about how many tokens they consumed. They think about what they agreed to pay versus what they're being charged. It's the same dynamic as surprise cell phone overage charges (used to be decades ago): the customer used the service the way services get used, and now they're looking at a number they didn't budget for. The anger lands on the provider most times. You've turned a cost management challenge into a trust problem.
Why we went unlimited
We decided to go a different direction. Axamy charges a flat subscription fee with no usage caps. Token costs are our problem to manage, not our customers'. That decision is partly about customer experience, but it's also self-interested in a straightforward way. We have higher per-user prices than many AI tools, and we sell to groups rather than individuals, which gives us the margin to absorb these costs in a way a lower-priced individual-user product might not.
That's a real structural difference, and I won't pretend that unlimited works for every business. There are almost certainly cases where usage varies so wildly across customers that some form of limit makes sense, and we may find that ourselves at some point. But making token costs your customer's default problem is the wrong starting position.
Unlimited pricing is a forcing function
The deeper reason we chose this is that when the token bill lands on us, we have a direct financial incentive to actually fix the cost problem rather than pass it through. In the last two weeks, our team shipped a significant amount of work on exactly this. A few concrete examples:
Prompt cache busting: We found that rotating presigned image URLs were invalidating our prompt cache on every turn, because a new URL string counts as a cache miss and forces a full context rewrite. We traced one runaway session to 3.7 million written tokens in 42 calls. Fixed.
Volatile context restructure: We moved dynamic state to the tail of the prompt with explicit cache breakpoints, so mid-turn state changes cost a few hundred re-read tokens instead of invalidating the entire cache.
Lazy loading: We stopped injecting action descriptions into every turn by default. Uncached tokens on that piece dropped from roughly 2,000 to 500 per turn, a 75% reduction.
On-demand context expansion: We replaced a 31,000-token full plan dump injected on every turn with a 1,000-token roster digest. The agent pulls the detail only when it actually needs it.
Bypassing the LLM for simple operations: Status changes, assignee updates, and due date edits now happen directly without an LLM call, cutting both the token cost and a 20-second roundtrip.
Custom GitHub integration: Instead of using a generic third-party connector, we built our own GitHub app. It's smarter about what context it pulls and when, and it cut the token cost of GitHub-related operations by more than 75% compared to the off-the-shelf approach.
None of this is quick work. But it compounds: every optimization reduces costs on every session going forward, permanently.
The point
The companies rationing AI access right now aren't wrong that the costs are real. But rationing is a stopgap, and pass-through is an abdication. What actually solves it is building architecturally efficient systems. That work is hard, but the incentive structure matters enormously. Because we absorb the cost rather than pass it through, we have a direct financial reason to keep making the system cheaper to run. That alignment, where our interests and our customers' interests point in the same direction, is what unlimited pricing actually buys you.
