Every LLM Tool Call Needs an Output Budget

John Honovich

Jun 11, 2026

TL;DR

Tools are one of the main reasons AI agents are useful, but tool outputs can quietly explode cost, latency, and context usage. A short user request can become a huge model request if tools return verbose API objects, metadata, comments, logs, or too many results. The fix is not to give up on tools. It is to profile real agent traces and treat tool output as a first-class optimization problem: return the minimum useful information by default, with drill-down paths when more is needed.

Disclaimer

I’m sure strong engineering teams are already doing versions of this. I’m writing because I could not find many practical posts focused specifically on tool output budgets. If you know good examples, I’d appreciate links.

Excitement And Risk Over Tools

One of the most exciting things about modern AI systems is tools. Being able to connect models to search systems, databases, CRMs, browser automation, internal APIs, email systems, calendars, and third-party integrations is genuinely powerful. When an agent combines multiple tools to answer a question or complete a task, it can feel almost magical. It starts to feel less like a chatbot and more like a person piecing together information from different systems.

That excitement is real. It is also dangerous.

Once a tool works, the natural instinct is to add more. More APIs. More integrations. More actions. More data sources. The system becomes more capable, and the demos become more impressive.

But we started looking more carefully at the actual traces.

How many tool calls did the agent make?
How much did each tool call return?
Did a short user question trigger twenty tool calls?
Did each of those tool calls send thousands or tens of thousands of tokens back into the model?

That was the uncomfortable part. The agent could be doing the right thing at a high level while still using far more tokens than necessary.

This is not a post arguing against agents, tools, MCP, integration platforms, or third-party services. Quite the opposite. Tools are one of the main reasons modern AI systems are useful. But if tool outputs are treated as incidental, agents can become slow, expensive, and context-limited very quickly.

The Cost Problem

A normal LLM request can be tiny. A user might ask a short question. A model might produce a short answer. A simple math or reasoning question might involve only a few dozen input tokens and a few hundred output tokens. That can cost a tiny fraction of a cent.

But once tools enter the loop, the economics can change completely.

A user can ask a short question, the agent can call a tool, and the tool can send back a very large response. Sometimes that response includes useful information. Often it also includes metadata, timestamps, comments, audit fields, internal IDs, redundant fields, verbose records, irrelevant results, too many rows, too much page content, or too much log output.

Now the request is no longer small. The user’s prompt may have been 30 tokens, but the tool result might be 30,000 or 100,000 tokens.

That creates three problems immediately:

Cost, because model APIs charge by tokens.
Latency, because large contexts take longer to process.
Token maxing, because the request can hit the context limit before the model has done the useful part of the work.
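
To make the arithmetic concrete, here is a rough sketch. The price per million input tokens and the function name are assumptions for illustration, not any provider's actual rate. It also assumes the simplest agent loop: every tool result stays in context and is re-read on each later model call, with no prompt caching.

```python
# Hypothetical price; substitute your provider's real rate.
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost_usd(prompt_tokens: int, tool_tokens_per_call: int, tool_calls: int) -> float:
    """Estimate input cost when every tool result stays in context.

    Simplification: model call number k re-reads the prompt plus the
    results of all k earlier tool calls, and no prompt caching applies.
    """
    total_input_tokens = 0
    for step in range(tool_calls + 1):
        total_input_tokens += prompt_tokens + step * tool_tokens_per_call
    return total_input_tokens * ASSUMED_USD_PER_MILLION_INPUT_TOKENS / 1_000_000


small = input_cost_usd(prompt_tokens=30, tool_tokens_per_call=0, tool_calls=0)
large = input_cost_usd(prompt_tokens=30, tool_tokens_per_call=30_000, tool_calls=5)
print(f"no tools: ${small:.6f}")
print(f"5 verbose tool calls: ${large:.4f}")
```

Under these assumptions, the same 30-token question goes from roughly a hundredth of a cent to about $1.35 of input, almost all of it tool output that is re-read on every step.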

When people complain about agents producing unexpectedly large bills, burning through context, or feeling impractical at scale, I suspect this is often part of the reason. The problem is not necessarily that agents are inherently wasteful. The problem is often that the tools are not optimized for what the model actually needs.

The conclusion should not be:

Tools are the problem.

The better conclusion is:

Tool outputs need budgets.

The Thing That Changed Our Behavior: Per-Query Cost Breakdowns

The most useful thing we did was not a clever prompt.

It was instrumentation.

For every internal agent query, we made it possible for our team to inspect a cost and trace breakdown. Not just the final answer. The whole path.

Which tools were called?
How many times?
How much did each tool return?
How much context did each step consume?
How much did the model call cost?
Did the agent bounce between tools?
Did it call a tool, miss, call another tool, then come back to the first one?
Did it break a cache?
Did a short user question turn into a huge model request because one tool returned too much?

That visibility changed how we debugged the system.

Before that, a query could look fine from the outside. The answer might be correct. The UI might feel normal. But underneath, the agent might have made a chain of expensive calls, included too much tool output, or carried unnecessary context into later steps.

Once everyone internally could see the breakdown, patterns became obvious.

A tool returned 40 fields when 6 were needed.
A search returned 30 results when 5 would have been enough.
An integration included every comment, timestamp, and metadata field.
A tool call bypassed a cache.
An agent made 15 calls when 4 should have worked.

These were not philosophical problems. They were visible in the trace.

Every time we used the trace view internally, we learned something. Sometimes we found obvious bugs. Sometimes we found inefficient tool descriptions. Sometimes we found that a tool was doing the right thing but returning the wrong shape of output.

That is what pushed us toward the output-budget mindset.

Without per-query cost breakdowns, “tool output optimization” sounds abstract. With them, it becomes obvious. You can see exactly where the tokens are going.
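
A per-query breakdown does not need to be elaborate to be useful. The sketch below is one possible shape, not our actual implementation: the `ToolCallRecord` schema, the tool names, and the token numbers are all hypothetical. It assumes you already log, for each tool call, how many tokens the result added to the model context.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToolCallRecord:
    """One tool call observed in an agent trace (hypothetical schema)."""
    tool: str
    output_tokens: int  # tokens the tool result added to the model context
    cache_hit: bool = False


def breakdown(trace: list[ToolCallRecord]) -> list[dict]:
    """Aggregate a trace per tool, largest token consumers first."""
    per_tool = defaultdict(lambda: {"calls": 0, "tokens": 0, "cache_misses": 0})
    for record in trace:
        stats = per_tool[record.tool]
        stats["calls"] += 1
        stats["tokens"] += record.output_tokens
        stats["cache_misses"] += 0 if record.cache_hit else 1
    rows = [{"tool": name, **stats} for name, stats in per_tool.items()]
    return sorted(rows, key=lambda row: row["tokens"], reverse=True)


# Made-up trace for illustration.
trace = [
    ToolCallRecord("search_tickets", 12_000),
    ToolCallRecord("get_customer", 800, cache_hit=True),
    ToolCallRecord("search_tickets", 9_500),
]
for row in breakdown(trace):
    print(row)
```

Sorting by tokens puts the most expensive tool at the top, which is usually where the first optimization belongs.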

The Mistake We Made

Our mistake was not dramatic. We were not intentionally dumping entire databases into models. We were not sending millions of records into context. Many of the tools were reasonable. The system worked. The agents were useful.

The mistake was treating tool output as secondary.

If the tool returned the correct information, we initially tended to think the tool was working. But correctness is not enough. A tool can return the correct information and still return it inefficiently. It can include too many records, too many fields, too much metadata, too much surrounding context, or too little guidance about what to do next.

That was the shift in thinking for us.

We stopped asking only:

Did the tool work?

And started asking:

Did the tool return the right amount of information, in the right shape, at the right cost?

That is a different standard.

It is closer to performance optimization than basic feature development. The first version proves the tool works. The next version asks whether it works efficiently.

A tool would return a useful answer, but also include more fields than necessary. A search would return relevant results, but too many of them. An integration would return a valid API object, but the object was designed for software, not for an LLM context window. A tool description would say what the tool did, but not firmly enough constrain what should come back.

None of this looked catastrophic in isolation. But across many tools and many agent steps, the costs compounded quickly.

Once we started examining traces and token usage carefully, we found that a number of tool outputs could be reduced significantly. In some cases, the reduction was an order of magnitude. Not every tool had that much waste, and some of the savings came from fixing our own early mistakes. But the pattern was clear enough that we changed how we think about tools.

The lesson was simple:

Every LLM tool call needs an output budget.

Tool Outputs Become the Next Prompt

The key mental model is simple: a tool output is not just an API response. A tool output becomes part of the next prompt.

The model reads it. The provider charges for it. The context window has to hold it. The next step of the agent depends on it.

That means tool output design is not a secondary detail. It is prompt design, cost control, latency control, and context management all at once.

If a tool returns 50,000 unnecessary tokens, those tokens do not stay inside the tool. They become model input.

This is why “just use a bigger context window” is not a full answer. Bigger context windows help. They are useful. But they do not make excessive tool output free.

Even if a model can accept 200,000 tokens, sending 100,000 unnecessary tokens still costs money, adds latency, and increases the chance that the agent hits limits later in the task.

A bigger context window can absorb waste. It does not eliminate waste.

What an Output Budget Means

An output budget is not just a hard token cap. A hard cap is useful, but it is not enough.

A bad version of this idea is:

Return the first 10,000 characters and cut off the rest.

That is truncation. Sometimes it is necessary, but it is a crude last resort.

A better output budget asks:

What is the minimum useful output this tool should return by default?

Not the absolute minimum. Not a blindly shortened response. The minimum useful response.

That usually means enough information for the model to make the next decision, plus a way to ask for more if needed.
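
One way to picture the difference from truncation: instead of cutting a string at a character count, keep whole, ranked items until the budget is reached and tell the model how many were left out. This is a minimal sketch under stated assumptions. `fit_to_budget` and `estimate_tokens` are hypothetical names, the input is assumed to be ranked already, and the four-characters-per-token estimate is a crude heuristic; a real tokenizer would give different counts.

```python
import json


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def fit_to_budget(ranked_items: list[dict], max_tokens: int) -> dict:
    """Keep the highest-ranked items that fit, and report what was left out.

    Unlike character truncation, the result is always valid JSON and tells
    the model how many items remain so it can ask for more.
    """
    kept: list[dict] = []
    for item in ranked_items:
        candidate = {"items": kept + [item],
                     "omitted": len(ranked_items) - len(kept) - 1}
        if estimate_tokens(json.dumps(candidate)) > max_tokens:
            break
        kept.append(item)
    return {"items": kept, "omitted": len(ranked_items) - len(kept)}


# Made-up data: 50 ranked results, best first.
items = [{"id": f"result_{i}", "title": "x" * 40} for i in range(1, 51)]
shaped = fit_to_budget(items, max_tokens=100)
print(len(shaped["items"]), "kept,", shaped["omitted"], "omitted")
```

The `omitted` count is the drill-down signal: the model knows more exists without paying to read it.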

For example, a search tool often should not return every matching document by default. It can return something like:

{
  "count": 184,
  "summary": "Most matching results concern login failures after the latest firmware update.",
  "top_results": [
    {
      "id": "result_1",
      "title": "Login failures after firmware 5.2",
      "why_relevant": "Direct match to the reported issue"
    },
    {
      "id": "result_2",
      "title": "Session timeout increase",
      "why_relevant": "Likely related failure mode"
    }
  ],
  "handle": "search_abc123",
  "next_actions": [
    "fetch_result",
    "fetch_more_results",
    "search_within_results"
  ]
}
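
A response like that can be produced by a thin wrapper around the search backend. The sketch below is illustrative only: `shape_search_results`, `fetch_more_results`, and the in-memory store are hypothetical, and the input is assumed to be ranked already. It leaves out the `summary` and `why_relevant` fields, which would need a heuristic or a model call to generate.

```python
import uuid

# In-memory stand-in for wherever full results would be stored (assumption).
_RESULT_STORE: dict[str, list[dict]] = {}


def shape_search_results(results: list[dict], top_n: int = 2) -> dict:
    """Return a compact view of ranked results plus a handle to the full set."""
    handle = f"search_{uuid.uuid4().hex[:8]}"
    _RESULT_STORE[handle] = results
    return {
        "count": len(results),
        "top_results": [
            {"id": r["id"], "title": r["title"]} for r in results[:top_n]
        ],
        "handle": handle,
        "next_actions": ["fetch_result", "fetch_more_results"],
    }


def fetch_more_results(handle: str, offset: int, limit: int = 5) -> list[dict]:
    """Drill-down path: page through the stored results on request."""
    return _RESULT_STORE[handle][offset:offset + limit]


# Made-up data: 184 matching documents with large bodies.
results = [{"id": f"result_{i}", "title": f"Doc {i}", "body": "..." * 500}
           for i in range(1, 185)]
shaped = shape_search_results(results)
print(shaped["count"], [r["id"] for r in shaped["top_results"]])
```

The document bodies never enter the default response; the model pays for them only if it follows the handle.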

A log tool often works better when it returns clusters, counts, and representative lines before raw logs:

{
  "summary": "The dominant failure is AuthTimeoutError beginning at 14:03 UTC.",
  "error_clusters": [
    {
      "error": "AuthTimeoutError",
      "count": 3812,
      "first_seen": "14:03",
      "last_seen": "14:18"
    },
    {
      "error": "DatabaseConnectionError",
      "count": 102,
      "first_seen": "14:07",
      "last_seen": "14:09"
    }
  ],
  "representative_examples": [
    "14:03 AuthTimeoutError: token refresh exceeded 5000ms"
  ],
  "handle": "logs_456"
}
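
Clustering of this kind can often be done with plain string handling before any model is involved. The following is a minimal sketch, assuming each log line looks like "HH:MM ErrorName: message" and that all lines fall within a single day, so timestamps compare correctly as strings. The function name and the line format are assumptions, not a description of a real log tool.

```python
import re

# Assumed log line format: "HH:MM ErrorName: message"
LINE_PATTERN = re.compile(r"^(?P<time>\d{2}:\d{2}) (?P<error>\w+): (?P<message>.*)$")


def cluster_logs(lines: list[str], examples_per_cluster: int = 1) -> dict:
    """Group log lines by error name with counts, time range, and examples."""
    clusters: dict[str, dict] = {}
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not match the assumed format
        error, time = match["error"], match["time"]
        cluster = clusters.setdefault(
            error, {"error": error, "count": 0, "first_seen": time,
                    "last_seen": time, "examples": []})
        cluster["count"] += 1
        cluster["first_seen"] = min(cluster["first_seen"], time)
        cluster["last_seen"] = max(cluster["last_seen"], time)
        if len(cluster["examples"]) < examples_per_cluster:
            cluster["examples"].append(line)
    # Most frequent error first, so the dominant failure leads the output.
    ranked = sorted(clusters.values(), key=lambda c: c["count"], reverse=True)
    return {"error_clusters": ranked}


# Made-up log lines for illustration.
logs = [
    "14:03 AuthTimeoutError: token refresh exceeded 5000ms",
    "14:07 DatabaseConnectionError: pool exhausted",
    "14:18 AuthTimeoutError: token refresh exceeded 5000ms",
    "unparseable line",
]
print(cluster_logs(logs))
```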

A CRM or third-party integration often should not return every field the upstream API exposes. It should return the fields relevant to the task, with a way to fetch the full object only if needed.

The goal is not to hide information from the model. The goal is to avoid making the model pay to read information it does not need yet.

There are cases where the model really does need raw output. The point is not to forbid that. The point is to make raw output an explicit choice, not the default result of every integration.

The Output Adapter Pattern

One pattern we have found useful is putting an adapter between raw integrations and the LLM.

Instead of:

Integration → LLM

use:

Integration → Output Adapter → LLM

The adapter’s job is to turn system-shaped output into model-shaped output.

That can include:

Projection, removing fields the model does not need.
Ranking, returning the most relevant items first.
Aggregation, returning counts, totals, or grouped results instead of raw rows.
Clustering, grouping logs, errors, documents, or records into meaningful buckets.
Sampling, returning representative examples instead of all examples.
Summarization, compressing verbose output into a task-relevant summary.
Handle generation, storing the raw result elsewhere and returning a reference the model can expand later.
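
As an illustration of the simplest of these, projection, an adapter can be a small wrapper around the raw integration call. This is a sketch under assumptions, not a prescribed design: `make_adapter`, `fake_crm_fetch`, and the record fields are all hypothetical. Raw output stays available, but only when explicitly requested.

```python
from typing import Any, Callable

RawFetcher = Callable[[str], dict[str, Any]]


def make_adapter(fetch_raw: RawFetcher, fields: list[str]) -> Callable[..., dict[str, Any]]:
    """Wrap a raw integration so the model sees only projected fields by default."""

    def adapted(record_id: str, raw: bool = False) -> dict[str, Any]:
        record = fetch_raw(record_id)
        if raw:
            return record  # raw output is an explicit choice, not the default
        projected = {name: record[name] for name in fields if name in record}
        # Tell the model that more exists without making it read it.
        projected["omitted_fields"] = len(record) - len(projected)
        return projected

    return adapted


def fake_crm_fetch(record_id: str) -> dict[str, Any]:
    """Stand-in for a real integration call (hypothetical data)."""
    return {"id": record_id, "name": "Acme Corp", "status": "active",
            "created_at": "2024-01-05T10:00:00Z", "audit_log": ["..."] * 50,
            "internal_flags": {"a": 1, "b": 2}}


get_account = make_adapter(fake_crm_fetch, fields=["id", "name", "status"])
print(get_account("acct_1"))
```

The same wrapper shape extends to ranking, sampling, or handle generation by changing what happens between the raw fetch and the return.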

This adapter does not always require a vector database or another LLM call. Sometimes the best adapter is just a better query. Sometimes it is a ranking function. Sometimes it is a summary. Sometimes it is a cached handle. Sometimes it is a tool description that more firmly constrains what the tool should return.

The important point is that the raw output from an integration is rarely the ideal input for a model.

Third-Party Integrations Make This More Important

This becomes especially important when using third-party integration layers.

These systems are valuable because they let you connect to many external services quickly. That is the point. You can suddenly give an agent access to many useful actions and data sources without building every integration yourself.

That is powerful. It is also exactly why output budgeting matters.

The default shape of an external API response is not necessarily the right shape for an LLM. An API response may be designed for another program, a dashboard, a backend workflow, or a human developer inspecting JSON. It may include every field because that is convenient for a general-purpose API.

But an LLM context window is a different environment.

When integrating third-party tools, the question should not only be:

Can the agent call this service?

It should also be:

What exactly comes back, and how much of it should the model see by default?

If you skip that question, the integration may work beautifully at small scale and become painful once the agent starts chaining calls together.

The Database and RAG Analogy

A useful analogy is database query optimization, but only up to a point.

In normal software systems, we try not to fetch far more data than we need. We push filtering, projection, aggregation, and ranking closer to the data. If we need a count, we ask for a count. If we need five records, we ask for five records. If we need three fields, we do not return every field in the table.

RAG systems apply a similar idea to documents. The point of retrieval is that we do not send the whole corpus to the model. We search, rank, filter, and send the most relevant chunks.

Tool-heavy agents need the same discipline, but across a wider range of systems: CRMs, ticketing systems, search tools, browser tools, log tools, calendars, email, databases, and third-party integrations.

The difference is that the penalty for over-fetching is much higher when the recipient is an LLM. Sending extra data to a normal backend service is usually wasteful. Sending extra data into a model is wasteful in several ways at once: it increases cost, adds latency, fills the context window, and gives the model more material to sift through.

That is why tool output design needs to be treated as a first-class part of agent architecture.

The goal is not to force every tool into a database pattern. The goal is to apply the same underlying principle:

Do as much filtering, ranking, aggregation, and shaping as possible before the result enters the model context.
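
The database version of this principle is easy to show with Python's built-in `sqlite3` module. The table, columns, and data below are made up for illustration; the point is only the contrast between returning everything and letting the database filter, project, count, and limit first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT, title TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO tickets (status, title, body) VALUES (?, ?, ?)",
    [("open" if i % 4 == 0 else "closed", f"Ticket {i}", "long body " * 200)
     for i in range(1, 1001)],
)

# Over-fetching: every row and every column ends up in the tool result.
everything = conn.execute("SELECT * FROM tickets").fetchall()

# Pushed down: the database answers the count and returns only what is needed.
open_count = conn.execute(
    "SELECT COUNT(*) FROM tickets WHERE status = 'open'"
).fetchone()[0]
latest_open = conn.execute(
    "SELECT id, title FROM tickets WHERE status = 'open' ORDER BY id DESC LIMIT 5"
).fetchall()

print(len(str(everything)), "characters if everything is returned")
print(len(str(latest_open)), "characters for the shaped result")
print(open_count, latest_open)
```

Both versions answer "how many tickets are open and which are the newest," but only one of them is a reasonable thing to place in a model's context.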

What Changed For Us

The practical change was the workflow.

First, we instrumented real usage. For internal queries, we made it possible to inspect the trace: which tools were called, how many times, what each tool returned, how much context each step consumed, and where cost accumulated.

Then we used those traces to tune the system.

Sometimes the fix was a tool description that more clearly told the agent what to request or what not to request. Sometimes the fix was changing the default number of results. Sometimes it was removing fields. Sometimes it was returning a summary instead of raw records. Sometimes it was caching a result and returning a handle. Sometimes it was changing the tool itself so it performed more filtering before returning anything to the model.

The point is that we stopped treating tool outputs as incidental.

We made the tool output a first-class optimization target.

For each important tool, we started asking:

What does this tool return by default?
Is that default appropriate for the most common use case?
How many results should come back before the model asks for more?
Which fields are almost never useful to the model?
Should this return raw records, a summary, an aggregate, or a handle?
Can the tool perform filtering or ranking before the LLM sees the result?
Does the trace show the agent repeatedly calling this tool because the output is poorly shaped?
Is the tool output optimized for a software API, or for an LLM context window?
How will we inspect the cost and trace of this tool in real usage?

The second-to-last question matters. Many integrations return perfectly valid API responses that are still poor model inputs.

This is why I think every LLM tool call needs an output budget. Not just a maximum token cap, but an intentional default shape based on what the model usually needs next.

Tools Are Too Valuable To Waste Tokens On Bad Outputs

Tools and integrations are too valuable to treat their outputs casually.

If agents are going to use hundreds of tools across internal systems and third-party services, then tool output design has to become a first-class engineering concern.

Otherwise, we should not be surprised when agents become slow, expensive, or context-limited.

In our experience, fighting token maxing did not mean giving up on tools. It meant making tool outputs much more intentional.

Every LLM tool call needs an output budget because every unnecessary tool token is charged, processed, and carried over to the next model call.

At scale, that becomes one of the difference-makers between agents that feel magical and agents that feel impractical.

If you have seen good writing, tools, or patterns around this, I would appreciate links. I suspect many strong teams are already doing versions of this, but I would like to see the practice discussed more explicitly.
