LLM batteries

Writing custom HTTP fetch blocks, manual Server-Sent Events (SSE) streaming loops, and retry logic by hand is usually not the interesting part of your agent. Use the batteries unless you are deliberately replacing the execution loop.

typescript

executorCallback: new OpenAIChatCompletionsAdapter({ model, apiKey, autoAck: true }).executor()

That is it. That single line resolves the entire DispatchExecutorFn interface. autoAck: true tells the executor to call ctx.ack() automatically after a tool-call-free response; the default is false, meaning the implementor owns turn completion.

ADK ships five LLM batteries: OpenAIChatCompletionsAdapter, WebLLMChatCompletionsAdapter, OllamaAdapter, LiteRtLmAdapter, and TransformersJsAdapter. They satisfy DispatchExecutorFn directly.

They handle SSE streaming, token math, safety envelopes, tool call dispatching, artifact forging, and transient error recovery. They are not small convenience helpers. They are executor implementations.

Compatible Endpoints

OpenAIChatCompletionsAdapter

The adapter works against any endpoint speaking the OpenAI Chat Completions wire format:

Cloud model APIs — anything that natively speaks this wire shape.
Self-hosted inference servers — any server that exposes a Chat Completions-compatible HTTP interface.
Proxy gateways and routing layers that expose a standard /v1/chat/completions interface.
Ollama's OpenAI-compat /v1 — point baseURL at http://localhost:11434/v1. For Ollama's richer native /api/chat (per-request context size, native thinking, structured output, model lifecycle), use the OllamaAdapter below instead.

Point baseURL at your endpoint. The adapter sends standard HTTP. It does not care what sits behind it.

WebLLMChatCompletionsAdapter

Runs models locally in the browser or supported JS runtimes via WebGPU. Use it for local-first or zero-server deployment models. It accepts WebLLMChatCompletionsAdapterOptions to configure loading and cache policies.

LiteRtLmAdapter (on-device LiteRT-LM via WebGPU)

Runs Google's .litertlm models on-device in the browser via WebGPU and a bundled wasm runtime, wrapping @litert-lm/core. Like the WebLLM battery, it is local-first and zero-server. Unlike the WebLLM battery, it is not a thin extension of the OpenAI Chat Completions wire shape — LiteRT-LM has its own API, so LiteRtLmAdapter is a standalone battery. It accepts LiteRtLmAdapterOptions.

@litert-lm/core is an optional peer dependency. The ADK does not bundle it (it ships its own ~19 MB wasm); install it yourself when you want this battery:

bash

pnpm add @litert-lm/core

typescript

import { LiteRtLmAdapter } from '@nhtio/adk/batteries/llm/litert_lm'

const adapter = new LiteRtLmAdapter({
  model: 'https://example.test/models/gemma3-1b-it-int4.litertlm',
  samplerParams: { type: 3 /* GREEDY */ },
  maxOutputTokens: 512,
  autoAck: true,
})

// executorCallback: adapter.executor()

The required model is a .litertlm URL string, a ReadableStream<Uint8Array>, or a Blob. The default loader (createEngine) self-bootstraps the wasm from jsdelivr and sets up the WebGPU device on first dispatch — you do not call loadLiteRtLm() yourself. Inject engine (a pre-built engine) or createEngine (a custom loader) to control that, or to mock the engine in tests without WebGPU. LiteRtLmAdapter.isAvailable() is a static navigator.gpu probe; isWebGPUAvailable overrides it per instance.

Generation is configured with LiteRT-native fields, not OpenAI sampling params:

Concern	LiteRT field	Notes
Sampler	`samplerParams`	`{ type, k, p, temperature, seed }`. `type` is the numeric `SamplerType`: `1` TOP_K, `2` TOP_P, `3` GREEDY.
Max generation	`maxOutputTokens`	per-turn output cap (`SessionConfig.maxOutputTokens`).
Context length	`maxNumTokens`	`EngineSettings.mainExecutorSettings.maxNumTokens`.
Backend	`backend`	numeric `Backend` enum (`CPU` / `GPU` / …); defaults to `GPU_ARTISAN`.
Multimodality	`audioModalityEnabled` / `visionModalityEnabled`	enable for models that accept it.
Thinking	`enableConstrainedDecoding` / `filterChannelContentFromKvCache`	influence how the model emits reasoning; see the text-parsing note below.

Tool calls and reasoning are parsed from text. The @litert-lm/core v0.13.1 JS runtime is text-in / text-out: it injects tool definitions into the conversation but the model emits tool calls and "thinking" as raw text in the assistant content, in the model family's own format — it does not return a populated tool_calls array or channels object (those wire fields exist only for feeding history back in). So this battery parses them out with the same shared, configurable parser layer the transformers.js battery uses: toolCallParser and reasoningParser (both default 'auto' — try the bundled family parsers in priority order; set a family name, 'none', or a custom function). Parsed tool-call arguments are a plain object (no JSON.parse); extracted reasoning becomes ADK thoughts; the cleaned prose is the assistant message. STASH_KEY is 'liteRtLm'. Its exceptions are E_INVALID_LITERT_LM_OPTIONS, E_LITERT_LM_CONTEXT_OVERFLOW, E_LITERT_LM_STREAM_ERROR, E_LITERT_LM_INVALID_TOOL_CALL_ARGS, and E_UNSUPPORTED_MEDIA_MODALITY.

The docs lag the library — trust the types AND a real run

The published @litert-lm/core JS guide trails the actual implementation: sampler controls and multimodality flags are real typed fields the public docs omit, so map against the installed .d.ts. But the .d.ts also over-promises — its Message.tool_calls / Message.channels fields are not populated on output by the v0.13.1 runtime (verified against the package's own README, which is text-in / text-out). The lesson cuts both ways: the types are the spec for inputs, but only a real model run tells you what the runtime actually emits. The dependency is young (pinned exact) and volatile — re-verify on upgrade.

Preview models are text-in / text-out

The current preview .litertlm models are text-only even though the types expose vision/audio. The adapter is built to the typed multimodal contract, but the native multimodal path cannot be exercised end-to-end until a multimodal .litertlm ships. Media that the configured model cannot consume is governed by unsupportedMediaPolicy (default 'throw' → E_UNSUPPORTED_MEDIA_MODALITY; switch to 'fallback-stash' or 'synthetic-description' to degrade to text).

TransformersJsAdapter (on-device, Node AND browser, via transformers.js)

Runs ONNX text-generation models on-device with @huggingface/transformers (transformers.js). Unlike the WebLLM and LiteRT-LM batteries — both WebGPU/browser-only — transformers.js is environment-neutral: its package auto-selects onnxruntime-node (native, runs in plain Node, no GPU) or onnxruntime-web (WASM + WebGPU in the browser). So this battery runs server-side and client-side from one codepath, and is not gated on navigator.gpu. It is a standalone battery accepting TransformersJsAdapterOptions.

@huggingface/transformers is an optional peer dependency — install it yourself:

bash

pnpm add @huggingface/transformers

typescript

import { TransformersJsAdapter } from '@nhtio/adk/batteries/llm/transformers_js'

const adapter = new TransformersJsAdapter({
  model: 'onnx-community/gemma-4-E2B-it-ONNX',
  dtype: 'q4f16',         // 'auto' | 'fp32' | 'fp16' | 'q8' | 'q4' | 'q4f16' | …
  device: 'webgpu',       // 'auto' | 'webgpu' | 'wasm' | 'cpu' | 'gpu' | … (omit for the env default)
  maxNewTokens: 512,
  autoAck: true,
})

// executorCallback: adapter.executor()

The required model is any ONNX text-generation model id. device/dtype pick the backend and quantization; inject pipeline (a pre-built pipeline), createPipeline (a custom loader), or createStreamer (a custom streaming sink) to control loading or to mock it in tests without importing the peer. Generation knobs map to transformers.js generate kwargs: maxNewTokens, doSample, temperature, topK, topP, repetitionPenalty, stopStrings.

Tool calls and reasoning are parsed from text — and the parser is the knob. transformers.js injects tool definitions into the chat template but does not return structured tool calls or reasoning; the model emits both as family-specific raw text. This is exactly how vLLM, SGLang, and Ollama handle open-weight models: one post-hoc, per-family parser, selected by a flag, run after generation. The battery ships that as two configurable options:

Option	Default	Values
`toolCallParser`	`'auto'`	`'auto'` (try-all, first match wins) · `'hermes'` · `'gemma'` · `'gpt_oss'` · `'phi'` · `'pythonic'` · `'llama3_json'` · `'mistral'` · `'qwen3_coder'` · `'none'` · a custom function
`reasoningParser`	`'auto'`	`'auto'` · `'think_tag'` (`<think>…</think>`) · `'harmony_analysis'` (gpt-oss) · `'gemma_channel'` · `'none'` · a custom function

'auto' runs the bundled family parsers in priority order (hermes → gemma → gpt_oss → phi → pythonic → llama3_json → mistral → qwen3_coder for tools) until one extracts a valid call; the marker-anchored families (phi is anchored on the literal functools token) run first — no cross-family false positives — and the weak-signal JSON/pythonic forms are gated on the call's name matching a real tool. The bundled defaults target the small ONNX models that actually run here — Gemma 4 E2B/E4B, gpt-oss:20b, Qwen3-Instruct, Llama 3.2, SmolLM (the Ollama-Cloud-tier open-weight families). Anything unlisted: supply a custom parser function. Parsed reasoning becomes ADK thoughts; the cleaned prose is the assistant message; tool-call arguments are a plain object.

STASH_KEY is 'transformersJs'. Its exceptions are E_INVALID_TRANSFORMERS_JS_OPTIONS, E_TRANSFORMERS_JS_CONTEXT_OVERFLOW, E_TRANSFORMERS_JS_STREAM_ERROR, E_TRANSFORMERS_JS_INVALID_TOOL_CALL_ARGS, E_TRANSFORMERS_JS_TOOL_PARSE_FAILED, and E_UNSUPPORTED_MEDIA_MODALITY.

Streaming stops prose at the first markup marker (v1)

Streaming reports prose deltas live, but reportMessage deltas cannot be un-emitted. So once the buffered output shows a tool-call / reasoning start-marker, the adapter stops streaming further prose and persists the authoritative clean message after generation completes. The persisted message is always correct; only live prose that precedes a tool call in the same turn is truncated mid-stream. Marker-aware live withholding is a planned refinement.

Tool-call formats drift; onnxruntime-node is a native addon

Family formats are model-specific and shift across versions (Gemma alone has three incompatible tool formats across generations — the bundled 'gemma' parser targets the E2B/E4B delimited form; gpt-oss's Harmony channel ordering is an active upstream issue). 'auto' fails gracefully and a custom parser is the escape hatch. Also note the Node backend (onnxruntime-node) is a compiled native addon — platform-specific for Lambda / Alpine / ARM deployments.

OllamaAdapter (native `/api/chat`)

Targets Ollama's native /api/chat endpoint, which exposes capabilities the OpenAI-compat /v1 layer cannot:

Capability	Native `/api/chat` (`OllamaAdapter`)	OpenAI `/v1` (`OpenAIChatCompletionsAdapter`)
Per-request context size	`options.num_ctx`	requires a Modelfile rebuild
Reasoning	native `think` + `message.thinking`	provider-specific `reasoning` field
Structured output	`format` (`'json'` or a JSON schema)	`response_format` subset
Model lifecycle	`keep_alive`	—
Streaming	NDJSON (`done: true`)	SSE (`data:` / `[DONE]`)
Tool-call arguments	JSON object	JSON string

Use OllamaAdapter when you want those native features; use OpenAIChatCompletionsAdapter against <host>/v1 when you only need Chat-Completions-shaped access.

Local vs cloud. The only difference is baseURL plus the auth header:

typescript

import { OllamaAdapter } from '@nhtio/adk/batteries/llm/ollama'

// Local — baseURL defaults to http://localhost:11434, no auth.
const local = new OllamaAdapter({ model: 'llama3.2', autoAck: true })

// Cloud — point at ollama.com and supply an API key (→ Authorization: Bearer).
const cloud = new OllamaAdapter({
  model: 'gpt-oss:120b',
  baseURL: 'https://ollama.com',
  apiKey: process.env.OLLAMA_API_KEY,
  autoAck: true,
})

Native request shape. Top-level native controls are think, format, and keep_alive; all sampling/runtime parameters go in a NESTED options block (this is the key structural difference from the OpenAI wire, where they sit at the top level):

typescript

const adapter = new OllamaAdapter({
  model: 'llama3.2',
  think: 'high', // boolean | 'low' | 'medium' | 'high'
  format: 'json', // 'json' or a JSON schema object
  keep_alive: '5m', // string | number (0 unloads the model)
  options: {
    temperature: 0.4,
    num_ctx: 8192, // server-side KV-cache size — independent of the ADK `contextWindow` token guard
    top_p: 0.9,
    seed: 42,
  },
})

num_ctx (server KV-cache size) and the ADK control field contextWindow (token-budget enforcement, paired with tokenEncoding) are independent and intentionally not auto-synced.

The think channel. think is conditional-presence on the wire: omit it and no think key is sent; think: false sends think: false; true / 'low' / 'medium' / 'high' pass through verbatim. When thinking is active, the model's message.thinking is surfaced and persisted as a Thought (separate from the answer Message); when off, no thought is produced.

Media. Native /api/chat supports only base64 images[]. Every non-image modality (audio, document, video) is "unsupported" and routes through unsupportedMediaPolicy ('throw' by default; 'fallback-stash' or 'synthetic-description' to degrade to text) — a wider unsupported set than the OpenAI battery, which natively carries audio and document blocks.

Tool calls. Native tool-call arguments arrive as a JSON object (no JSON.parse); the adapter validates it is an object and otherwise persists a self-correcting E_OLLAMA_INVALID_TOOL_CALL_ARGS error result. Native calls carry no id, so the adapter synthesizes one for correlation; tool-result history messages are labelled with tool_name (the originating tool), not tool_call_id. There is no tool_choice — native /api/chat does not support forcing a tool.

Generation stats. On the terminal done: true object the adapter reads Ollama's native token counts (prompt_eval_count, eval_count) and nanosecond durations (total_duration, eval_duration, …) plus done_reason, and emits them via the runner's dedicated generationStats observability hook — distinct from the diagnostic log channel:

typescript

await runner.dispatch({
  // …
  observers: {
    generationStats: (stats) => {
      // stats.promptTokens, stats.completionTokens, stats.totalDurationNs, stats.finishReason,
      // stats.provider === 'ollama', stats.raw (full native object) …
      metrics.record(stats)
    },
  },
})

Transport. Ollama is HTTP-only — there is no native Unix-socket or CLI transport. To reach a socket-bound deployment, front it with a bridge (e.g. nginx proxy_pass, or socat UNIX-LISTEN:/tmp/ollama.sock TCP:localhost:11434) and point baseURL at the bridge, or inject a custom fetch (e.g. undici with a unix: socket path). The adapter targets <baseURL>/api/chat.

Construction and Validation

The constructor validates baseline options immediately on startup. Config bugs fail loud and fast. If you pass junk into OpenAIChatCompletionsAdapter, it throws E_INVALID_OPENAI_CHAT_COMPLETIONS_OPTIONS right away. If you pass junk into WebLLMChatCompletionsAdapter, it throws E_INVALID_WEBLLM_CHAT_COMPLETIONS_OPTIONS right away. Merged executor and stash overrides are revalidated at dispatch time.

typescript

import { OpenAIChatCompletionsAdapter } from '@nhtio/adk/batteries/llm/openai_chat_completions'

const adapter = new OpenAIChatCompletionsAdapter({
  model: process.env.MODEL_ID!,
  apiKey: process.env.API_KEY,
})

model is the only strictly required field. Everything else is optional; some fields have runtime defaults.

Validation on Overrides

Bypassing the constructor does not bypass validation. If you inject malformed config into executor overrides or the iteration stash, OpenAIChatCompletionsAdapter will throw E_INVALID_OPENAI_CHAT_COMPLETIONS_OPTIONS and WebLLMChatCompletionsAdapter will throw E_INVALID_WEBLLM_CHAT_COMPLETIONS_OPTIONS at dispatch time.

Three-Layer Options Merging

The adapter merges configuration from three sources at each iteration:

1. Constructor Baseline2. Executor Overrides3. Stash Overrides

typescript

// Lowest precedence - the global fallback config
const adapter = new OpenAIChatCompletionsAdapter({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY,
  temperature: 0.7,
  autoAck: true,
})

typescript

// Mid precedence - applies to every turn run by this TurnRunner
const runner = new TurnRunner({
  ...storageAdapter,
  executorCallback: adapter.executor({
    temperature: 0.2, // Overrides 0.7 constructor baseline
    max_completion_tokens: 1024,
  }),
})

typescript

// Highest precedence - dynamic adjustments for a single iteration
const costControlMiddleware: DispatchPipelineMiddlewareFn = async (ctx, next) => {
  ctx.stash.set(OpenAIChatCompletionsAdapter.STASH_KEY, {
    model: 'gpt-4o-mini', // Downgrade model dynamically
    temperature: 0.0,
  })
  await next()
}

Bracket Access Mismatch

ctx.stash is a Registry instance, not a plain object. Bracket assignment like ctx.stash[STASH_KEY] = ... will not type-check, and the adapter reads only via .get(). Use ctx.stash.set(OpenAIChatCompletionsAdapter.STASH_KEY, ...).

Merging Rules

For headers, helpers, and retry: layers are merged key-by-key. A stash override that sets one custom header does not clear the headers defined in your constructor.
For all other fields: the highest-precedence layer with a defined value completely replaces lower-precedence configurations.

ADK Control Fields

These fields configure the adapter's runtime behavior:

Field	Type	Purpose
`model`	`string`	Required. Model identifier passed to the model endpoint.
`apiKey`	`string`	Bearer token for endpoint authentication.
`baseURL`	`string`	Endpoint URL. Defaults to `https://api.openai.com/v1`.
`headers`	`Record<string, string>`	Custom HTTP headers sent with every request.
`stream`	`boolean`	Toggles SSE streaming. Default `true`.
`streamIdleTimeoutMs`	`number`	Aborts request if the stream goes silent for this period.
`requestTimeoutMs`	`number`	Absolute timeout limit for the entire HTTP transaction.
`retry`	`ChatCompletionsRetryConfig`	Custom retry configuration for handling transient errors.
`fetch`	`typeof globalThis.fetch`	Custom HTTP fetch engine.
`contextWindow`	`number`	Total context budget; the adapter throws if this threshold is crossed.
`tokenEncoding`	`TokenEncoding \| null`	Token encoding used for local context calculations. Non-null requires `contextWindow`.
`selfIdentity`	`string`	Identifies the model for cleaning up raw reasoning traces.
`thoughtSurfacing`	`'all-self' \| 'latest-self' \| 'all'`	Controls which persisted thoughts are replayed into model history.
`replayCompatibility`	`ReadonlyArray<string>`	Forwards reasoning steps to compatibility-constrained endpoints.
`reasoningFieldPrecedence`	`ReadonlyArray<'reasoning' \| 'reasoning_content'>`	Order in which the adapter reads provider reasoning fields; on disagreement each surfaces as its own thought. Default `['reasoning', 'reasoning_content']`.
`bucketOrder`	`ChatCompletionsBucketOrder`	Sets the sorting order for system prompt segments.
`helpers`	`Partial<ChatCompletionsHelpers>`	Overrides specific translation steps.
`autoAck`	`boolean`	Automatically calls `ctx.ack()` after a tool-call-free response. Default `false`.
`strictToolChoice`	`boolean`	Halts execution if `tool_choice` demands an ephemeral artifact tool. Default `false`.
`unsupportedMediaPolicy`	`string`	Strategy when media inputs are incompatible with model modalities. Default `'throw'`.

autoAck

autoAck defaults to false. When false, the executor stores the assistant message and reports it, but does not call ctx.ack() — turn completion is the implementor's responsibility. This is the right default: auto-acking seizes turn-completion control from the output pipeline and prevents any quality gate (output filter, confidence check, human-in-the-loop approval) from running before the turn is declared done.

Set autoAck: true when you are building a single-shot executor with no output-side gate and you want the executor to own the full lifecycle. Every example in this page that wires an adapter directly into a TurnRunner sets autoAck: true so the turn ends after the first tool-call-free response. If you are building a pipeline that gates on output content, omit autoAck and call ctx.ack() yourself after your gate passes.

reasoningFieldPrecedence

Model reasoning/thinking output is not part of OpenAI's official Chat Completions spec — OpenAI hides it on Chat Completions and surfaces it only on the Responses API. OpenAI-compatible providers therefore invented their own field names and disagree: Ollama's /v1 and current vLLM emit reasoning, while legacy vLLM (≤0.8) and the DeepSeek API emit reasoning_content. The adapter reads every field named in reasoningFieldPrecedence that is present on the response, so a thinking model surfaces thoughts regardless of which convention its endpoint follows.

reasoningFieldPrecedence defaults to ['reasoning', 'reasoning_content'] (reasoning-first, matching Ollama and current vLLM). Precedence governs two things:

When more than one listed field is present with identical content (or only one is present), a single thought is emitted, attributed to the highest-precedence field — this is the common case.
When listed fields are present with divergent content, each surfaces as its own thought (ordered by precedence) rather than silently dropping one. In streaming mode both fields stream live as separate thought streams and are de-duplicated by content only at persistence time.

Reorder the array to prefer reasoning_content (e.g. ['reasoning_content', 'reasoning']), or pass a single-element array to read exactly one field.

Lifecycle & boot progress

The on-device batteries spend real wall-clock time before the first token: downloading weights (a .litertlm model is ~2 GB) and then booting the WebGPU/wasm runtime (engine creation, shader/graph compilation, accelerator registration). Each battery already accepts an onInitProgress — but that is the provider's own download-progress shape (different per provider, download-phase only, and LiteRT 0.13.1 exposes none at all). On top of it, every LLM battery (TransformersJsAdapter, LiteRtLmAdapter, WebLLMChatCompletionsAdapter) and the transformers.js embeddings battery accept an opt-in, normalized lifecycle callback surface — leave onInitProgress exactly as it was; this is additive.

Phases: loading → compiling → ready → generating → complete (or error). Subscribe to the aggregate firehose onLifecycle, the targeted per-phase hooks, or both — every transition fires onLifecycle AND the matching onPhase hook:

typescript

const adapter = new TransformersJsAdapter({
  model: 'onnx-community/gemma-4-E2B-it-ONNX',
  // Firehose: every phase transition.
  onLifecycle: (r) => console.log(`[${r.battery}] ${r.phase}`, r.detail ?? '', r.progress ?? ''),
  // Or targeted per-phase hooks (fire alongside the firehose):
  onLoading: (r) => updateProgressBar(r.progress), // r.progress is 0..1 when the provider reports it
  onCompiling: () => showSpinner('Compiling shaders…'), // coarse — the slow WebGPU boot before token 1
  onReady: () => hideSpinner(),
  onError: (r) => reportToSentry(r.error),
})

Each BatteryLifecycleReport carries { phase, battery, model, at, detail?, progress?, raw?, error? }. progress is normalized to 0..1 during loading when the provider reports it (transformers.js download events and WebLLM's InitProgressReport are both forwarded into a loading report; the provider's untouched payload rides along on raw). error is populated only on the error phase. The compiling phase marks the WebGPU/wasm shader/graph build between download and the first token — often the slowest part of a cold start, and otherwise invisible. It is a coarse marker: the on-device runtimes (LiteRT Engine.create, transformers.js from_pretrained) expose the boundary, not a granular progress stream, so progress is usually absent on a compiling report. (WebLLM refines it when its InitProgressReport text names a shader/cache stage.)

Semantics worth knowing:

Loads are single-flight cached, so loading/ready fire once per adapter instance (or once after reset()). A pre-built pipeline/engine (or a cached one) means nothing loaded — loading/ready are correctly skipped, and a turn emits just generating → complete.
generating/complete/error fire per turn.
LiteRT 0.13.1 reports no granular download bytes (its Engine.create takes no progress callback), so loading there is a single coarse "loading model + booting WebGPU runtime" marker followed by ready — the real observable the battery couldn't surface before.
Hooks are defensive: a callback that throws is swallowed and never breaks loading or a turn.

Model Request Body Fields

Schema-supported request body fields not explicitly defined in the ADK control group are forwarded in the JSON request body payload:

typescript

const adapter = new OpenAIChatCompletionsAdapter({
  model: 'gpt-4o',
  temperature: 0.7,
  max_completion_tokens: 2048,
  response_format: { type: 'json_object' },
  reasoning_effort: 'high',
  seed: 42,
})

Supported fields include: temperature, top_p, max_tokens, max_completion_tokens, stop, seed, presence_penalty, frequency_penalty, logit_bias, logprobs, top_logprobs, n, parallel_tool_calls, tool_choice, response_format, reasoning_effort, service_tier, store, metadata, and user.

Automatic Tool Forging

The Chat Completions adapter handles SpooledArtifact tool forging internally — it calls SpooledArtifact.forgeTools() for you.

Manual .bindContext() plumbing in your pipelines is unnecessary for local iteration-scope tools. The adapter merges via ToolRegistry — ToolRegistry.merge([ctx.tools, ...forged], { onCollision: 'replace' }) — dynamically during each dispatch iteration, then calls mergedRegistry.bindContext(ctx).

Overriding Translation Helpers

The adapter uses 18 translation hooks defined under ChatCompletionsHelpers to format core ADK types into standard Chat Completions message payloads. You do not need to rewrite all 18 from scratch; pass the specific fields you want to override via options.helpers.

Field note: the seams against a gateway nobody planned for

The first live media-pipeline tests ran against a Cloudflare Workers AI shim that calls itself OpenAI-compatible and then rejects two things the real OpenAI wire requires: content-block arrays on messages (it wants plain strings) and content: null on assistant tool-call messages (the spec's canonical shape). Neither quirk needed a library change. The arrays were flattened by overriding renderTimelineMessage and renderChatCompletionsToolCallResult — two of the 18 hooks — and the nulls were normalized in a ten-line injectable fetch wrapper. "Compatible with any endpoint speaking the wire shape" is marketing until an endpoint speaks the wire shape badly; the helpers-plus-fetch combination is what makes it true anyway. If your gateway misbehaves, the answer is almost certainly one of these two seams, not a fork.

typescript

const adapter = new OpenAIChatCompletionsAdapter({
  model: 'gpt-4o',
  autoAck: true,
  helpers: {
    renderStandingInstructions: (items) => {
      // items is Iterable<Tokenizable>
      return Array.from(items, (item) => `[INSTRUCTION]: ${String(item)}`).join('\n')
    },
    renderUntrustedContent: (content, attrs) => {
      return `[UNTRUSTED DATA id=${attrs.nonce}]\n${content}\n[END UNTRUSTED]`
    },
  },
})

The translation interface functions:

Helper Hook	Purpose
`renderUntrustedContent`	Fences third-party content using randomized nonces.
`renderTrustedContent`	Formats safe, first-party content blocks.
`renderStandingInstructions`	Compiles `Iterable<Tokenizable>` into a system prompt section.
`renderMemories`	Translates `Iterable<{ memory: Memory; attrs: MemoryAttrs }>` loaded memory records.
`renderRetrievableSafetyDirective`	Prepends instructions alerting the model to retrieval content boundaries.
`renderFirstPartyRetrievables`	Formats safe `Iterable<{ retrievable: Retrievable; attrs: RetrievableAttrs }>` records.
`renderThirdPartyPublicRetrievables`	Formats untrusted public search indexing records.
`renderThirdPartyPrivateRetrievables`	Formats restricted third-party data extractions.
`renderRetrievables`	Top-level dispatcher orchestrating the safe rendering of all retrievals.
`renderTimelineMessage`	Translates a single `Message` timeline record.
`renderThought`	Encapsulates model-generated chain-of-thought metadata.
`filterThoughts`	Truncates or selects thoughts according to the configured `thoughtSurfacing` policy.
`toolsToChatCompletionsTools`	Formats ADK `Tool` instances into API tool declarations.
`renderChatCompletionsSystemPrompt`	Concatenates all context blocks into the final primary system instructions.
`renderChatCompletionsToolCallResult`	Tool result → tool message content.
`descriptionToChatCompletionsJsonSchema`	Maps ADK type descriptions down to strict JSON schemas.
`buildChatCompletionsHistory`	Constructs the absolute request message list combining history, memories, system prompts, and tool sequences.
`createChatCompletionsToolCallDeltaAccumulator`	Manages streaming string accumulation for building completed tool structures.

The Battery as Reference Implementation

If you are determined to write a custom executor, study the OpenAIChatCompletionsAdapter source first. It is the broadest execution loop in the codebase. Pay specific attention to:

How configuration layers are merged securely and validated before calling the model.
How context components (ctx.turnMessages, ctx.turnMemories, ctx.turnRetrievables, and ctx.tools) are merged dynamically.
How SSE chunks are parsed, and how streamIdleTimeoutMs prevents silent hangs.
How the executor reports messages, thoughts, and tool calls via DispatchExecutorHelpers.
How the system ensures ctx.ack() and ctx.nack() are executed deterministically, especially when requests fail.

What each pipeline owns

Envelopes

Persistence

Identity and Reasoning

Media

LLM batteries

Compatible Endpoints

OpenAIChatCompletionsAdapter

WebLLMChatCompletionsAdapter

LiteRtLmAdapter (on-device LiteRT-LM via WebGPU)

TransformersJsAdapter (on-device, Node AND browser, via transformers.js)

OllamaAdapter (native `/api/chat`)

Construction and Validation

Three-Layer Options Merging

Merging Rules

ADK Control Fields

autoAck

reasoningFieldPrecedence

Lifecycle & boot progress

Model Request Body Fields

Automatic Tool Forging

Overriding Translation Helpers

The Battery as Reference Implementation

What each pipeline owns

Envelopes

Persistence

Identity and Reasoning

Media

LLM batteries ​

Compatible Endpoints ​

OpenAIChatCompletionsAdapter ​

WebLLMChatCompletionsAdapter ​

LiteRtLmAdapter (on-device LiteRT-LM via WebGPU) ​

TransformersJsAdapter (on-device, Node AND browser, via transformers.js) ​

OllamaAdapter (native /api/chat) ​

Construction and Validation ​

Three-Layer Options Merging ​

Merging Rules ​

ADK Control Fields ​

autoAck ​

reasoningFieldPrecedence ​

Lifecycle & boot progress ​

Model Request Body Fields ​

Automatic Tool Forging ​

Overriding Translation Helpers ​

The Battery as Reference Implementation ​

LLM batteries

Compatible Endpoints

OpenAIChatCompletionsAdapter

WebLLMChatCompletionsAdapter

LiteRtLmAdapter (on-device LiteRT-LM via WebGPU)

TransformersJsAdapter (on-device, Node AND browser, via transformers.js)

OllamaAdapter (native `/api/chat`)

Construction and Validation

Three-Layer Options Merging

Merging Rules

ADK Control Fields

autoAck

reasoningFieldPrecedence

Lifecycle & boot progress

Model Request Body Fields

Automatic Tool Forging

Overriding Translation Helpers

The Battery as Reference Implementation