Budgets
Budgets are real and they're non-negotiable. Every token is a debt owed to the provider's context window. Every in-flight artifact and open reader is a debt owed to your runtime. Treat either as unlimited and your agent fails — silently from truncation, catastrophically from overflow, or expensively from runaway tool output materializing in RAM. ADK is not a safety net. It is an accounting system. You must own the shedding policy or the window will own you.
ADK does not enforce a budget
Batteries do, and middleware composes on top of what they expose. If your loop has no battery configured to enforce a window, nothing in the runner will stop you from exceeding it. You built an unbounded loop, and it will behave like one.
Measure before you dispatch
Guessing token counts from string length is how you lose control before the request leaves your process. Every text-bearing primitive in ADK is a Tokenizable — Message.content, Memory.content, Thought.content, Retrievable.content, systemPrompt, standing instructions, Identity.representation. Every one of them spends from the same context budget. You must call value.estimateTokens(encoding) or you are budgeting with partial information.
const tokens = message.content.estimateTokens('o200k_base')
The encoding determines how much trust you can place in the number. Known tokenizers are exact. Unknown encodings fall back to a crude approximation: better than nothing, not truth.
| Encoding | Fidelity | Library or heuristic |
|---|---|---|
| gpt2, r50k_base, p50k_base, p50k_edit, cl100k_base, o200k_base | Exact | js-tiktoken |
| gemini | Exact | @lenml/tokenizer-gemini |
| llama2 | Exact | llama-tokenizer-js |
| claude | Approximate | ~3.5 chars/token heuristic |
| anything else | Approximate | ceil(length / 4) |
Counts are cached per encoding and invalidated on .set(), so repeated calls in middleware are cheap. Unknown or failing encodings fall back to ceil(length / 4): the number is always finite, but it is not always accurate. If you are running against a model with a heuristic tokenizer, widen your safety margins.
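A minimal sketch of the caching and fallback behavior described above. SketchTokenizable, its encoders map, and safeBudget are hypothetical stand-ins, not ADK's implementation, and the 95% / 80% margins are assumptions to tune for your own models.

```typescript
// Hypothetical stand-in for a Tokenizable; not ADK's implementation.
type Encoder = (text: string) => number;

class SketchTokenizable {
  private cache = new Map<string, number>();

  constructor(
    private value: string,
    // Exact tokenizers keyed by encoding name (assumed shape).
    private encoders: Record<string, Encoder> = {},
  ) {}

  set(value: string): void {
    this.value = value;
    this.cache.clear(); // any edit invalidates every cached count
  }

  estimateTokens(encoding: string): number {
    const cached = this.cache.get(encoding);
    if (cached !== undefined) return cached;

    let count: number;
    try {
      const encode = this.encoders[encoding];
      if (!encode) throw new Error(`no tokenizer for ${encoding}`);
      count = encode(this.value);
    } catch {
      // Unknown or failing encoding: crude, but always finite.
      count = Math.ceil(this.value.length / 4);
    }
    this.cache.set(encoding, count);
    return count;
  }
}

// Leave more headroom when the count came from a heuristic.
// Integer math keeps the result exact; the percentages are assumptions.
function safeBudget(contextWindow: number, exact: boolean): number {
  return Math.floor((contextWindow * (exact ? 95 : 80)) / 100);
}
```

With a 128,000-token window this leaves 121,600 tokens to spend against an exact tokenizer and 102,400 against a heuristic one.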
Estimates are pre-flight guardrails, not billing receipts
The local tokenizer exists to stop you from overflowing the context window. The provider's usage metadata is the only authoritative count. Bill clients from estimateTokens and your invoices will drift from what the provider actually charged.
The context window is a hard ceiling
A configured window number is not enforcement unless the battery can count against it. Set contextWindow without tokenEncoding and nothing is enforced: the adapter has a limit and no way to measure whether you are approaching it.
new OpenAIChatCompletionsAdapter({
  tokenEncoding: 'o200k_base',
  contextWindow: 128_000,
  // ...
})
Without both fields, there is no enforceable ceiling. You have configuration, not a budget.
Bad budget wiring fails on first dispatch, not at construction
The adapter does not validate this at construction time. Misconfigure it and the error surfaces on the first dispatch as E_INVALID_OPENAI_CHAT_COMPLETIONS_OPTIONS. Your agent starts up cleanly and dies the moment it tries to run.
When both fields are set, the adapter counts every bucket before dispatching — system prompt, standing instructions, memories, retrievables, timeline. When the total exceeds the window, the adapter refuses the prompt and throws E_OPENAI_CHAT_COMPLETIONS_CONTEXT_OVERFLOW.
That exception is a tool, not just a failure. It carries the total count, the window limit, the encoding, and a per-bucket breakdown across system prompt, standing instructions, memories, retrievables, and timeline. Middleware reads this breakdown to make deliberate cuts — shed the weakest retrievables, compact the oldest timeline entries, or drop low-value memories — then retries. Without the breakdown, you are guessing at the failure.
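One way middleware might turn that breakdown into deliberate cuts. This is a sketch under assumptions: OverflowDetails is a hypothetical shape for the exception payload (the real field names may differ), and the shed order is a policy choice, not something ADK prescribes.

```typescript
type Bucket =
  | "systemPrompt"
  | "standingInstructions"
  | "memories"
  | "retrievables"
  | "timeline";

// Hypothetical payload shape; check the real exception for actual field names.
interface OverflowDetails {
  total: number;
  contextWindow: number;
  encoding: string;
  buckets: Record<Bucket, number>;
}

// Buckets we are willing to cut, weakest first (assumed policy).
// System prompt and standing instructions are never shed here.
const SHED_ORDER: Bucket[] = ["retrievables", "timeline", "memories"];

// Returns how many tokens to remove from each sheddable bucket.
function planCuts(details: OverflowDetails): Partial<Record<Bucket, number>> {
  let excess = details.total - details.contextWindow;
  const cuts: Partial<Record<Bucket, number>> = {};
  for (const bucket of SHED_ORDER) {
    if (excess <= 0) break;
    const take = Math.min(excess, details.buckets[bucket]);
    if (take > 0) {
      cuts[bucket] = take;
      excess -= take;
    }
  }
  if (excess > 0) {
    throw new Error(`cannot shed ${excess} more tokens from sheddable buckets`);
  }
  return cuts;
}
```

The plan is only arithmetic. Deciding which retrievables or timeline entries make up those tokens is still your middleware's job.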
No Budget class. No global token bookkeeping.
The adapter's per-call accounting and the overflow exception's payload are the only enforcement layer. Middleware that needs pre-emptive shedding must call estimateTokens itself and trim ctx.turnMessages / ctx.turnMemories / ctx.turnRetrievables before the adapter sees them. No one is doing this for you.
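A sketch of pre-emptive shedding for a single bucket. ScoredItem and shedToBudget are hypothetical; in real middleware the tokens field would come from estimateTokens and the dropped items would be removed from the turn context before the adapter sees them.

```typescript
// Hypothetical view of one retrievable: its estimated cost and a relevance score.
interface ScoredItem {
  id: string;
  tokens: number;
  score: number;
}

// Greedy by relevance: walk items from highest score to lowest and keep each
// one that still fits. An item that does not fit is dropped, and smaller
// lower-ranked items may still be kept after it.
function shedToBudget(
  items: ScoredItem[],
  budget: number,
): { keep: ScoredItem[]; drop: ScoredItem[] } {
  const ranked = [...items].sort((a, b) => b.score - a.score);
  const keep: ScoredItem[] = [];
  const drop: ScoredItem[] = [];
  let used = 0;
  for (const item of ranked) {
    if (used + item.tokens <= budget) {
      keep.push(item);
      used += item.tokens;
    } else {
      drop.push(item);
    }
  }
  return { keep, drop };
}
```

Greedy selection is a deliberate simplification. If skipping a large mid-ranked item in favor of a small low-ranked one is wrong for your data, stop at the first item that does not fit instead.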
Large outputs must be queried, not dumped
A Tokenizable body lives entirely in memory. A SpooledArtifact body does not — it lives behind a SpoolReader, backed by a stream, a file handle, or an in-memory buffer depending on the storage battery. Large tool outputs are a budget hazard in two directions: they can exhaust your process memory, and inlining them consumes the context window. SpooledArtifact keeps the body out of both until you ask for a slice.
The range methods (head, tail, cat, grep) never materialize the full body in one allocation; they read line by line. The last two calls below are the explicit full-body reads:
artifact.byteLength() // total byte length
artifact.lineCount() // total line count
artifact.head(10) // first 10 lines
artifact.tail(50) // last 50 lines
artifact.cat(start, end) // [start, end) line range; both args optional
artifact.grep(/pattern/) // matching lines
artifact.estimateTokens('o200k_base') // reads full body, then estimates
artifact.asString() // explicit "read everything"
Full-body reads are budget events, and both are explicit
asString() and estimateTokens() are the only methods that materialize the entire body. They are named to say exactly what they do. Call either on a 500 MB artifact and you pay for all 500 MB. Range queries (head, tail, cat, grep) read line by line and never allocate the full buffer. There is no .toString() escape hatch; that path does not exist on SpooledArtifact.
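To make "line by line" concrete, here is a sketch of how range queries can run over a lazy line source without ever holding the whole body. LineSource, the free functions, and fakeLog are hypothetical and are not ADK's SpoolReader; a real reader would pull lines from a stream or file handle.

```typescript
// Hypothetical lazy line source: each call starts a fresh pass over the body.
type LineSource = () => Iterable<string>;

function head(source: LineSource, n: number): string[] {
  const out: string[] = [];
  if (n <= 0) return out;
  for (const line of source()) {
    out.push(line);
    if (out.length >= n) break; // stop reading as soon as we have enough
  }
  return out;
}

function tail(source: LineSource, n: number): string[] {
  const window: string[] = [];
  if (n <= 0) return window;
  for (const line of source()) {
    window.push(line);
    if (window.length > n) window.shift(); // never hold more than n lines
  }
  return window;
}

function grep(source: LineSource, pattern: RegExp): string[] {
  // Drop g/y flags so test() carries no lastIndex state between lines.
  const re = new RegExp(pattern.source, pattern.flags.replace(/[gy]/g, ""));
  const out: string[] = [];
  for (const line of source()) {
    if (re.test(line)) out.push(line);
  }
  return out;
}

// Stand-in for a large log: 1000 lines produced on demand, never stored.
function* fakeLog(): Generator<string> {
  for (let i = 1; i <= 1000; i++) {
    yield i % 250 === 0 ? `ERROR at ${i}` : `ok ${i}`;
  }
}

const firstTwo = head(fakeLog, 2);
const errors = grep(fakeLog, /ERROR/);
```

Memory use is bounded by the size of the answer (n lines, or the matching lines), not by the size of the body.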
Give the model a handle when it should query
By default, ToolCall.inline is true — the battery inlines the full body into the context. You opt into handles; handles do not happen automatically based on size.
new ToolCall({
  tool: 'build_report',
  results: spooledMarkdownArtifact,
  inline: false, // render a query handle; keep the body out of context
  // ...
})
Batteries do not measure result sizes against thresholds, do not guess intent, and do not silently switch modes. Silent budget policy is worse than no policy: nobody knows what got hidden, inlined, or dropped. The flag is policy. Batteries obey it.
With inline: false and a SpooledArtifact, the model does not receive the body. It receives a compact handle envelope containing the callId, artifact kind, byte and line counts, and the list of available artifact_* tools with instructions to use them. The model gets discovery information and a controlled read path. Your context window is not consumed by 500KB of log output the model only needs to grep.
A handle requires a queryable artifact
If ToolCall.results is a Tokenizable and inline is false, the battery renders the body inline anyway and warns. A handle without a queryable artifact is a lie.
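The rendering rules above can be sketched as one function. Everything here is hypothetical: ArtifactLike, CallLike, the envelope field names, and the four tool names are assumptions for illustration (the text only promises "artifact_* tools"), and a real battery will differ.

```typescript
// Hypothetical shapes; ADK's real ToolCall and SpooledArtifact are richer.
interface ArtifactLike {
  kind: string;
  byteLength: number;
  lineCount: number;
  readAll(): string; // explicit full-body read
}

interface CallLike {
  callId: string;
  inline: boolean;
  results: string | ArtifactLike; // string stands in for a Tokenizable body
}

interface HandleEnvelope {
  callId: string;
  kind: string;
  byteLength: number;
  lineCount: number;
  tools: string[];
  instructions: string;
}

type Rendered =
  | { mode: "inline"; body: string; warning?: string }
  | { mode: "handle"; envelope: HandleEnvelope };

// Assumed tool names, for illustration only.
const ARTIFACT_TOOLS = ["artifact_head", "artifact_tail", "artifact_cat", "artifact_grep"];

function renderResult(call: CallLike): Rendered {
  const { results } = call;
  if (typeof results === "string") {
    // Nothing to query, so a handle would point at nothing: inline and warn.
    if (call.inline) return { mode: "inline", body: results };
    return {
      mode: "inline",
      body: results,
      warning: `call ${call.callId}: inline=false ignored, result is not queryable`,
    };
  }
  // The flag is policy: inline means a full-body read, whatever the size.
  if (call.inline) return { mode: "inline", body: results.readAll() };
  return {
    mode: "handle",
    envelope: {
      callId: call.callId,
      kind: results.kind,
      byteLength: results.byteLength,
      lineCount: results.lineCount,
      tools: ARTIFACT_TOOLS,
      instructions: "Use the artifact_* tools with this callId to read slices of the result.",
    },
  };
}
```

Note that the handle branch never calls readAll(): the body stays out of both the context window and process memory.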
Own the inline decision before dispatch
The decision is about queryability, not raw size. A 50KB diff the model must read end-to-end belongs inline. A 5KB log the model needs to search belongs behind a handle. Size alone is a lazy heuristic that will produce wrong answers in both directions.
There are two places to make the call:
At the source. The tool's handler wraps its result in a SpooledArtifact and sets inline: false before the ToolCall is emitted. Use this when the producer knows the output is intended for interrogation, not wholesale consumption.
At rendering time. Middleware calls TurnContext.mutateToolCall before the adapter sees the call. Use this for dynamic context pressure: estimate the cost, decide whether the body deserves space in the window, flip the flag.
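A sketch of the render-time decision as a pure function your middleware could call before flipping the flag. PendingResult and chooseInline are hypothetical names, and the rule for must-read bodies under pressure is one possible policy, not ADK's.

```typescript
// How the model is expected to use the result (assumed classification).
type Access = "read_end_to_end" | "search";

interface PendingResult {
  callId: string;
  access: Access;
  queryable: boolean; // true when the body lives in a SpooledArtifact
  estimatedTokens: number; // from estimateTokens, measured before dispatch
}

// Returns the value to set on ToolCall.inline.
function chooseInline(result: PendingResult, remainingBudget: number): boolean {
  // No artifact to query: a handle is impossible, so the body stays inline.
  if (!result.queryable) return true;
  // The model needs to search, not read: give it a handle at any size.
  if (result.access === "search") return false;
  // Must-read body: inline only while it fits the remaining window.
  return result.estimatedTokens <= remainingBudget;
}
```

Size only enters in the last branch, and only as context pressure. The first question is always how the model will use the result.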
Your middleware owns the budget policy
The adapter obeys the flag blindly. It does not infer queryability or budget pressure. If bodies are converting the model's context into a log dump, that is because your middleware did not intervene. The adapter will not intervene for you.
A real budget cuts before it crashes
A real budgeting pipeline is a trash compactor with telemetry: it knows what is low-value, crushes it before dispatch, and leaves enough structure to recover deliberately when the first cut was not enough.
- Load. Input middleware loads memories, retrievables, and history: the material that will compete for the window.
- Measure and shed. Input middleware counts each bucket with estimateTokens, identifies what is over budget, and cuts: it calls TurnContext.deleteRetrievable to drop low-relevance retrievables, or compacts older timeline entries into a single summary Message.
- Convert. Input middleware turns queryable prior tool results into handles by setting inline: false, so the next dispatch sees references instead of bodies.
- Enforce. Dispatch runs. The adapter enforces the hard ceiling. If shedding was insufficient, it refuses the prompt with E_OPENAI_CHAT_COMPLETIONS_CONTEXT_OVERFLOW.
- Recover. Dispatch failure is surfaced via runner.observe('error', ...); the turn output pipeline does not run on a dispatch throw. Your error observer reads the per-bucket breakdown from the exception, applies targeted fallback shedding, and triggers a retry as an explicit higher-level decision (a new run() invocation). The runner does not quietly retry the same failed budget for you.
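The recover step can be sketched as an explicit, bounded retry around a single run. runWithRecovery, runOnce, and the error fields (code, total, contextWindow) are hypothetical; only the error code string comes from the text above. A real run() may well be asynchronous; the sketch is synchronous to stay small.

```typescript
const OVERFLOW = "E_OPENAI_CHAT_COMPLETIONS_CONTEXT_OVERFLOW";

// Assumed minimal shape of the overflow error.
interface OverflowLike {
  code: string;
  total: number;
  contextWindow: number;
}

function isOverflow(err: unknown): err is OverflowLike {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === OVERFLOW
  );
}

// runOnce performs one full run and is told how many extra tokens to shed
// first. Each retry is a fresh run with a tighter budget, never a blind
// replay of the prompt that just failed.
function runWithRecovery<T>(runOnce: (extraShed: number) => T, maxAttempts = 3): T {
  let extraShed = 0;
  for (let attempt = 1; ; attempt++) {
    try {
      return runOnce(extraShed);
    } catch (err) {
      // Anything that is not an overflow, or one retry too many, propagates.
      if (!isOverflow(err) || attempt >= maxAttempts) throw err;
      // Shed at least the reported excess before the next run.
      extraShed += err.total - err.contextWindow;
    }
  }
}
```

The attempt cap matters: if shedding cannot close the gap, the failure should reach you rather than loop.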
Three primitives. Your policy.
Budget enforcement is built from Tokenizable.estimateTokens, the SpooledArtifact range API, and ToolCall.inline. They do not make decisions. They are sharp tools, not a padded cell. Count, cut, handle, or fail.
Budgets are real. They do not become negotiable because you ignored them. If you do not build a pipeline that enforces them, your agent is a toy waiting for its first production prompt to break it.