Budgets

Budgets are real and they're non-negotiable. Every token is a debt owed to the provider's context window. Every in-flight artifact and open reader is a debt owed to your runtime. Treat either as unlimited and your agent fails — silently from truncation, catastrophically from overflow, or expensively from runaway tool output materializing in RAM. ADK is not a safety net. It is an accounting system. You must own the shedding policy or the window will own you.

This page is the primitives; Token Thrift is the philosophy

The context window is not a chat history — it is what you send for one dispatch. Build it subtractively: hold a large working set, then trim it down to the focused slice each call needs. The Token Thrift recipe makes that case (with the research, and when not to do it); the Punching Above Its Weights showcase builds it live on a real on-device model. This page is the accounting primitives both rely on.

ADK does not enforce a budget

Batteries do, and middleware composes on top of what they expose. If your loop has no battery configured to enforce a window, nothing in the runner will stop you from exceeding it. You built an unbounded loop, and it will behave like one.

Measure before you dispatch

Guessing token counts from string length is how you lose control before the request leaves your process. Every text-bearing primitive in ADK is a Tokenizable — Message.content, Memory.content, Thought.content, Retrievable.content, systemPrompt, standing instructions, Identity.representation. Every one of them spends from the same context budget. You must call value.estimateTokens(encoding) or you are budgeting with partial information.

const tokens = message.content.estimateTokens('o200k_base')

The encoding determines how much trust you can place in the number. Encodings backed by a tokenizer library are exact. Everything else, including `claude`, falls back to a character-count heuristic, which is better than nothing but not truth:

Encoding	Fidelity	Library or method
`gpt2`, `r50k_base`, `p50k_base`, `p50k_edit`, `cl100k_base`, `o200k_base`	Exact	js-tiktoken
`gemini`	Exact	@lenml/tokenizer-gemini
`llama2`	Exact	llama-tokenizer-js
`claude`	Approximate	~3.5 chars/token heuristic
anything else	Approximate	`ceil(length / 4)`

Counts are cached per encoding and invalidated on .set(), so repeated calls in middleware are cheap. Unknown or failing encodings fall back to ceil(length / 4): the number is always finite, but it is not always accurate. If you are running against a model with a heuristic tokenizer, widen your safety margins.
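As a rough illustration of the fallback arithmetic, the sketch below reimplements the two character-count heuristics from the table and pads the result with a safety margin. The names `approxTokens` and `withMargin`, the 15% margin, and the assumption that the ~3.5 chars/token heuristic rounds up are all illustrative; none of this is ADK's own code.

```typescript
// Hypothetical helpers; ADK's real estimator lives behind estimateTokens().

/** Character-count heuristic: ceil(length / charsPerToken). */
function approxTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

/** Pad an approximate estimate so a heuristic undercount is less likely to overflow. */
function withMargin(estimate: number, margin = 0.15): number {
  return Math.ceil(estimate * (1 + margin));
}

const sampleText = "x".repeat(1000);
const generic = approxTokens(sampleText);          // ceil(1000 / 4) = 250
const claudeLike = approxTokens(sampleText, 3.5);  // ceil(1000 / 3.5) = 286
const padded = withMargin(claudeLike);             // ceil(286 * 1.15) = 329
```

The margin only matters for approximate encodings. With an exact tokenizer the pre-flight count should already match what the adapter measures.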

Estimates are pre-flight guardrails, not billing receipts

The local tokenizer exists to stop you from overflowing the context window. The provider's usage metadata is the only authoritative count. If you bill clients based on estimateTokens, you will lose money.

The context window is a hard ceiling

A configured window number is not enforcement unless the battery can count against it. Set contextWindow without tokenEncoding and nothing enforces — the adapter has a limit and no way to measure whether you are approaching it.

new OpenAIChatCompletionsAdapter({
  tokenEncoding: 'o200k_base',
  contextWindow: 128_000,
  // ...
})

Without both fields, there is no enforceable ceiling. You have configuration, not a budget.

Bad budget wiring fails on first dispatch, not at construction

The adapter does not validate this at construction time. Misconfigure it and the error surfaces on the first dispatch as E_INVALID_OPENAI_CHAT_COMPLETIONS_OPTIONS. Your agent starts up cleanly and dies the moment it tries to run.

When both fields are set, the adapter counts every bucket before dispatching — system prompt, standing instructions, memories, retrievables, timeline. When the total exceeds the window, the adapter refuses the prompt and throws E_OPENAI_CHAT_COMPLETIONS_CONTEXT_OVERFLOW.

That exception is a tool, not just a failure. It carries the total count, the window limit, the encoding, and a per-bucket breakdown across system prompt, standing instructions, memories, retrievables, and timeline. Middleware reads this breakdown to make deliberate cuts — shed the weakest retrievables, compact the oldest timeline entries, or drop low-value memories — then retries. Without the breakdown, you are guessing at the failure.
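One way to turn the breakdown into deliberate cuts is sketched below. The `OverflowInfo` shape, its field names, and the `planCuts` helper are assumptions made for illustration; check the real exception for the actual property names before relying on any of them.

```typescript
// Hypothetical payload shape; the real exception's field names may differ.
type Bucket = "systemPrompt" | "standingInstructions" | "memories" | "retrievables" | "timeline";

interface OverflowInfo {
  total: number;                      // tokens counted for the refused prompt
  limit: number;                      // configured context window
  breakdown: Record<Bucket, number>;  // tokens per bucket
}

// Buckets middleware is allowed to cut, in the order it prefers to cut them.
const SHED_ORDER: Bucket[] = ["retrievables", "timeline", "memories"];

/** Plan how many tokens to cut from each sheddable bucket to get back under the limit. */
function planCuts(info: OverflowInfo, headroom = 0): Partial<Record<Bucket, number>> {
  let excess = info.total - info.limit + headroom;
  const cuts: Partial<Record<Bucket, number>> = {};
  for (const bucket of SHED_ORDER) {
    if (excess <= 0) break;
    const cut = Math.min(info.breakdown[bucket], excess);
    if (cut > 0) {
      cuts[bucket] = cut;
      excess -= cut;
    }
  }
  if (excess > 0) throw new Error(`cannot shed ${excess} more tokens from sheddable buckets`);
  return cuts;
}

const plan = planCuts({
  total: 130_000,
  limit: 128_000,
  breakdown: {
    systemPrompt: 2_000,
    standingInstructions: 1_000,
    memories: 7_000,
    retrievables: 1_500,
    timeline: 118_500,
  },
});
// plan is { retrievables: 1500, timeline: 500 }
```

The plan is only arithmetic. Which retrievables go and which timeline entries get compacted is still a policy decision your middleware makes.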

No Budget class. No global token bookkeeping.

The adapter's per-call accounting and the overflow exception's payload are the only enforcement layer. Middleware that needs pre-emptive shedding must call estimateTokens itself and trim ctx.turnMessages / ctx.turnMemories / ctx.turnRetrievables before the adapter sees them. No one is doing this for you.
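A minimal sketch of pre-emptive shedding, assuming each candidate item already carries a relevance score and a token estimate. The `Scored` shape and `fitToBudget` helper are hypothetical; in a real pipeline the `tokens` value would come from estimateTokens and the kept list would be written back to the turn context.

```typescript
// Hypothetical stand-in for a retrievable that has already been measured.
interface Scored {
  id: string;
  relevance: number;  // higher is more valuable (assumed field)
  tokens: number;     // pre-computed estimate for this item
}

/** Keep the most relevant items that fit in `budget`; report what was dropped. */
function fitToBudget(items: Scored[], budget: number): { kept: Scored[]; dropped: Scored[] } {
  const byRelevance = [...items].sort((a, b) => b.relevance - a.relevance);
  const kept: Scored[] = [];
  const dropped: Scored[] = [];
  let used = 0;
  for (const item of byRelevance) {
    if (used + item.tokens <= budget) {
      kept.push(item);
      used += item.tokens;
    } else {
      dropped.push(item);
    }
  }
  return { kept, dropped };
}

const fitted = fitToBudget(
  [
    { id: "a", relevance: 0.9, tokens: 400 },
    { id: "b", relevance: 0.2, tokens: 300 },
    { id: "c", relevance: 0.7, tokens: 500 },
  ],
  1000,
);
// fitted.kept holds a and c (900 tokens); fitted.dropped holds b
```

Returning the dropped list alongside the kept list keeps the cut visible, so telemetry can record what was shed rather than hiding it.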

Large outputs must be queried, not dumped

A Tokenizable body lives entirely in memory. A SpooledArtifact body does not — it lives behind a SpoolReader, backed by a stream, a file handle, or an in-memory buffer depending on the storage battery. Large tool outputs are a budget hazard in two directions: they can exhaust your process memory, and inlining them consumes the context window. SpooledArtifact keeps the body out of both until you ask for a slice.

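The range methods (head, tail, cat, grep) never materialize the full body in one allocation; reads happen line by line. The last two calls in the list, estimateTokens and asString, are the explicit full-body reads: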

artifact.byteLength()                  // total byte length
artifact.lineCount()                   // total line count
artifact.head(10)                      // first 10 lines
artifact.tail(50)                      // last 50 lines
artifact.cat(start, end)               // [start, end) line range; both args optional
artifact.grep(/pattern/)               // matching lines
artifact.estimateTokens('o200k_base')  // reads full body, then estimates
artifact.asString()                    // explicit "read everything"

Full-body reads are budget events — and both are explicit

asString() and estimateTokens() are the only methods that materialize the entire body. They are named to say exactly what they do. Call either on a 500MB artifact and accept the consequences. Range queries (head, tail, cat, grep) read line-by-line and never allocate the full buffer. There is no .toString() escape hatch — that path does not exist on SpooledArtifact.
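To make the difference concrete, here is a sketch of how line-oriented reads can work over a chunked source without ever joining the whole body. The chunk iterable stands in for whatever a SpoolReader wraps, and the free functions `lines`, `head`, `tail`, and `grep` are illustrative reimplementations of the technique, not ADK's internals.

```typescript
/** Yield complete lines from string chunks without joining the whole body. */
function* lines(chunks: Iterable<string>): Generator<string> {
  let carry = "";
  for (const chunk of chunks) {
    const parts = (carry + chunk).split("\n");
    carry = parts.pop() ?? "";   // last piece may be an unfinished line
    yield* parts;
  }
  if (carry !== "") yield carry; // final line without a trailing newline
}

/** First n lines; stops pulling chunks as soon as n lines are seen. */
function head(chunks: Iterable<string>, n: number): string[] {
  const out: string[] = [];
  if (n <= 0) return out;
  for (const line of lines(chunks)) {
    out.push(line);
    if (out.length >= n) break;
  }
  return out;
}

/** Last n lines, holding at most n lines in memory at a time. */
function tail(chunks: Iterable<string>, n: number): string[] {
  const ring: string[] = [];
  if (n <= 0) return ring;
  for (const line of lines(chunks)) {
    ring.push(line);
    if (ring.length > n) ring.shift();
  }
  return ring;
}

/** Lines matching a pattern, tested one line at a time. */
function grep(chunks: Iterable<string>, pattern: RegExp): string[] {
  const out: string[] = [];
  for (const line of lines(chunks)) {
    pattern.lastIndex = 0;       // keep global/sticky regexes stateless per line
    if (pattern.test(line)) out.push(line);
  }
  return out;
}

const sampleChunks = ["alpha\nbe", "ta\ngamma\n", "delta"];
const firstTwo = head(sampleChunks, 2);   // ["alpha", "beta"]
const lastTwo = tail(sampleChunks, 2);    // ["gamma", "delta"]
const matches = grep(sampleChunks, /et/); // ["beta"]
```

Peak memory here is bounded by one chunk plus the lines being returned, which is the property that separates a range query from asString().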

Inline the body only when the model must read all of it

By default, ToolCall.inline is false — a SpooledArtifact result renders as a handle, and its body stays out of the context window. This is the secure default for a reason: the battery already spool-wraps every tool result, so the bytes are sitting behind a reader either way; the only question is whether to pour them into the next prompt. Pouring by default is the failure mode the spool exists to prevent. You opt into inlining for output the model genuinely needs to read end-to-end.

new ToolCall({
  tool: 'render_diff',
  results: spooledDiffArtifact,
  inline: true,                        // the model must read the whole diff — inline the body
  // ...
})

Batteries do not measure result sizes against thresholds, do not guess intent, and do not silently switch modes. Silent budget policy is worse than no policy — nobody knows what got hidden, inlined, or dropped. The flag is policy. Batteries obey it; the default value is the only thing that is opinionated, and it is opinionated toward keeping the window clean.

With the default (inline: false) and a SpooledArtifact, the model does not receive the body. It receives a compact handle envelope containing the callId, artifact kind, byte and line counts, and the list of available artifact_* tools with instructions to use them. The model gets discovery information and a controlled read path. Your context window is not consumed by 500KB of log output the model only needs to grep.

Handles only make sense for a queryable artifact

A handle hands the model artifact_* tools to read a SpooledArtifact incrementally. A Tokenizable result has nothing to query — an ArtifactTool's answer or an error string IS the content — so it always renders inline regardless of the flag. That is not a misconfiguration; under handle-by-default it is the ordinary case for those two result kinds, and the battery does it silently.
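The rendering rule described above can be sketched as a small function. The `ToolResult` union, the `Rendered` shape, and the envelope's field names are hypothetical; the real envelope also lists the available artifact_* tools, which are left out here because their names are not given on this page.

```typescript
// Hypothetical result and envelope shapes, for illustration only.
type ToolResult =
  | { kind: "spooled"; byteLength: number; lineCount: number; read: () => string }
  | { kind: "tokenizable"; content: string };

interface Rendered {
  mode: "inline" | "handle";
  body: string;
}

function renderResult(callId: string, result: ToolResult, inline = false): Rendered {
  // A Tokenizable result has nothing to query, so the flag is irrelevant.
  if (result.kind === "tokenizable") return { mode: "inline", body: result.content };
  // Only an explicit inline: true pours a spooled body into the prompt.
  if (inline) return { mode: "inline", body: result.read() };
  // Default: describe the artifact and leave the body behind the reader.
  const envelope = { callId, kind: "spooled", bytes: result.byteLength, lines: result.lineCount };
  return { mode: "handle", body: JSON.stringify(envelope) };
}

const asHandle = renderResult("call_1", {
  kind: "spooled",
  byteLength: 512_000,
  lineCount: 9_000,
  read: () => "full log body",
});
// asHandle.mode is "handle"; read() was never called
```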

Own the inline decision before dispatch

The decision is about queryability, not raw size. A 50KB diff the model must read end-to-end is inline: true. A 5KB log the model needs to search rides the default handle. Size alone is a lazy heuristic that will produce wrong answers in both directions — but when you have not made an explicit call, the default keeps the body out of the window rather than in it.

There are two places to make the call:

At the source. The tool's handler wraps its result in a SpooledArtifact; the call is a handle by default. Set inline: true on the emitted ToolCall when the producer knows the output is meant for wholesale consumption, not interrogation.

At rendering time. Middleware calls TurnContext.mutateToolCall before the adapter sees the call. Use this for dynamic context pressure: estimate the cost, decide whether the body deserves space in the window, flip the flag either way.
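A render-time policy along these lines might look like the sketch below. The `InlineInputs` shape and `decideInline` function are assumptions for illustration; how your middleware learns that the model must read the whole body, and how it applies the result through TurnContext.mutateToolCall, is left out.

```typescript
// Hypothetical policy inputs; how you derive them is up to your middleware.
interface InlineInputs {
  mustReadWhole: boolean;   // the model needs the body end-to-end (e.g. a diff)
  estimatedTokens: number;  // cost of inlining the body
  remainingBudget: number;  // tokens left in the window after other buckets
}

/** Inline only when the model needs the whole body and the window can afford it. */
function decideInline(input: InlineInputs): boolean {
  if (!input.mustReadWhole) return false; // searchable output rides the handle
  return input.estimatedTokens <= input.remainingBudget;
}

// A large diff the model must read, with room to spare: inline it.
const diffInline = decideInline({ mustReadWhole: true, estimatedTokens: 14_000, remainingBudget: 40_000 });
// A small log the model only needs to search: keep the handle.
const logInline = decideInline({ mustReadWhole: false, estimatedTokens: 1_500, remainingBudget: 40_000 });
// The same diff under pressure: flip back to a handle.
const squeezed = decideInline({ mustReadWhole: true, estimatedTokens: 14_000, remainingBudget: 8_000 });
```

Size enters only as an affordability check after the queryability question is answered, which matches the point that size alone is the wrong primary signal.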

Your middleware owns the budget policy

The adapter obeys the flag blindly. It does not infer queryability or budget pressure. The default keeps bodies out of the window; if you flip something to inline: true and it converts the model's context into a log dump, that is your call, not the adapter's. The adapter will not intervene for you.

A real budget cuts before it crashes

A real budgeting pipeline is a trash compactor with telemetry: it knows what is low-value, crushes it before dispatch, and leaves enough structure to recover deliberately when the first cut was not enough.

Load. Input middleware loads memories, retrievables, and history — the material that will compete for the window.
Measure and shed. Input middleware counts each bucket with estimateTokens, identifies what is over budget, and cuts: calls TurnContext.deleteRetrievable to drop low-relevance retrievables, or compacts older timeline entries into a single summary Message.
Convert. Prior SpooledArtifact tool results are already handles by default, so the next dispatch sees references instead of bodies for free. Input middleware only intervenes the other way — setting inline: true on the few results the model genuinely needs in full this turn — or flips a previously-inlined body back to a handle under pressure via TurnContext.mutateToolCall.
Enforce. Dispatch runs. The adapter enforces the hard ceiling. If shedding was insufficient, it refuses the prompt with E_OPENAI_CHAT_COMPLETIONS_CONTEXT_OVERFLOW.
Recover. Dispatch failure is surfaced via runner.observe('error', ...) — the turn output pipeline does not run on a dispatch throw. Your error observer reads the per-bucket breakdown from the exception, applies targeted fallback shedding, and triggers a retry as an explicit higher-level decision (a new run() invocation). The runner does not quietly retry the same failed budget for you.
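The shed, enforce, and recover steps can be simulated end to end in a few lines. Everything here is a stand-in: `OverflowError`, `dispatch`, and `runTurn` are hypothetical, token costs are plain numbers, and the recovery that the text assigns to an error observer plus a fresh run() is collapsed into a single try/catch for brevity.

```typescript
// Stand-in for the adapter's refusal; not ADK's real exception class.
class OverflowError extends Error {
  constructor(public total: number, public limit: number) {
    super(`context overflow: ${total} > ${limit}`);
  }
}

interface Turn {
  fixed: number;          // system prompt + standing instructions (not sheddable)
  retrievables: number[]; // token cost per retrievable, least relevant last
  timeline: number;       // token cost of the timeline
}

const turnTotal = (t: Turn): number =>
  t.fixed + t.timeline + t.retrievables.reduce((a, b) => a + b, 0);

/** Measure and shed: drop the least relevant retrievables until the target fits. */
function shed(t: Turn, target: number): Turn {
  const retrievables = [...t.retrievables];
  while (retrievables.length > 0 && turnTotal({ ...t, retrievables }) > target) {
    retrievables.pop();
  }
  return { ...t, retrievables };
}

/** Enforce: refuse anything over the hard ceiling. */
function dispatch(t: Turn, limit: number): string {
  const n = turnTotal(t);
  if (n > limit) throw new OverflowError(n, limit);
  return `dispatched ${n} tokens`;
}

/** Recover: one explicit retry after compacting the timeline by the overflow amount. */
function runTurn(t: Turn, limit: number): string {
  const trimmed = shed(t, limit);
  try {
    return dispatch(trimmed, limit);
  } catch (err) {
    if (!(err instanceof OverflowError)) throw err;
    const timeline = Math.max(0, trimmed.timeline - (err.total - err.limit));
    return dispatch({ ...trimmed, timeline }, limit); // a second failure propagates
  }
}

const outcome = runTurn({ fixed: 1_000, retrievables: [500, 400, 300], timeline: 9_500 }, 10_000);
// outcome is "dispatched 10000 tokens" after shedding all retrievables and compacting 500 tokens
```

The retry happens exactly once and only after a targeted cut. If the fixed buckets alone exceed the window, the second dispatch throws and the caller has to deal with it.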

Three primitives. Your policy.

Budget enforcement is built from Tokenizable.estimateTokens, the SpooledArtifact range API, and ToolCall.inline. They do not make decisions. They are sharp tools, not a padded cell. Count, cut, handle, or fail.

Budgets are real. They do not become negotiable because you ignored them. If you do not build a pipeline that enforces them, your agent is a toy waiting for its first production prompt to break it.
