22 min read · 4,323 words

A 3B model, a browser tab, and frontier-grade answers

No server. No API key. No tool-calling. A 4096-token window and a 3-billion-parameter model that has never heard of @nhtio/adk — answering questions about it correctly, with citations, on your GPU. Open the dialog in this site's header and try to break it. That's the agent. This page is how it's built.

TL;DR

A tool-less 3B model gives auditably grounded answers inside a documentation corpus because everything it can't do is built around it on @nhtio/adk: per-turn query rewrite + HyDE → hybrid retrieval → cross-encoder rerank → a sufficiency floor that abstains instead of guessing → injection as first-party Retrievables → an output gate that refuses to finish a turn until the answer cites a real link and emits no code. Big models win because they carry that focusing-and-structuring scaffolding baked into the weights; the mistake is leaning on that fuzzy internal capability instead of building the scaffolding outside the model, in code you can read, test, and fix. The model isn't the brain — it's a text-predictor on a leash, and the leash is the product. If your agent is bad, the leash is the part you skipped.

No hand-waving on the claim, because hype is what fills the space where shipping should be: this is frontier-grade within the retrieved envelope and nowhere else — docs Q&A grounded in a corpus the agent fetched, with a hard floor that makes it shut up rather than guess the moment the corpus comes up short. It is not a general reasoner. It will never pretend to be one, because pretending is the disease this whole page is a cure for. But inside that lane it humiliates models a hundred times its size — not because it got clever overnight, but because the lane is built, in code, by hand, while everyone else stands around waiting for a bigger model to make the problem disappear.

It's supposed to be impossible at this size. It isn't. It's just unfashionable, because doing it properly is engineering and prompt-magic is a demo you can ship before lunch and apologize for after launch. The model was never the bottleneck — the scaffolding around it was, and that part doesn't come in the box. A frontier model handed a vague prompt and a firehose of half-relevant context hallucinates with exactly the same confidence as a 3B, at ten times the cost — the bigger model just lies in better prose and invoices you for the privilege. Feed a small model the right paragraph and forbid it from inventing the rest, and it does the job. Feed a big one garbage and pray — which is precisely what most "agents" are — and you get garbage back, now with citations to nothing.

And there's no magic in the big ones either, so stop treating them like wizards. They win on three unromantic things: larger context windows, better training data, and learned internal subroutines that act as scaffolding the model carries inside itself — the focusing and structuring that otherwise has to be built by hand. Fine. When you genuinely can't narrow the focus from the outside, renting that internal scaffolding is the smart move — pay for it when the problem actually demands it. But the moment that fuzzy, baked-in black box becomes the foundation everything else stands on, the whole system is resting on a thing nobody can read, test, or fix when it knifes you in production — and it will. That's not architecture. It's superstition with a checkpoint file. The fix is to put the scaffolding where you can see it: outside the model, in code you control — built by hand, not rented from inside a bigger one. Do that, and a 3B with none of those baked-in advantages wins anyway. That's the whole bet of this page.

So the answer was never a smarter model. It's the scaffolding, and the rails that keep the model speaking only inside it. When an agent is bad, the model is almost always the one part that's fine — what's broken is everything around it that never got built. The weights aren't the bug; the missing machinery is. This page is the receipt: every claim above, in real code, with nothing hidden and nothing hand-waved. Most teams would rule this out before the first prototype — browser tab, 3B model, no backend. We built it. It works anyway.

Here is the actual thesis. A language model is the one component in this system that is non-deterministic by construction — sample the same prompt twice and you can get two different answers, and no amount of begging changes that. You can't make it deterministic. What you can do is build everything else deterministically and shrink the model's blast radius to the smallest possible box: a fixed pipeline decides what it sees, a coded threshold decides whether it's allowed to speak at all, and a coded gate decides whether what it said is allowed to stand. Query rewrite, retrieval, fusion, reranking, the sufficiency floor, token budgeting, citation validation, the retry policy — every one of those is deterministic, inspectable, and yours to fix. The model still rolls its dice; it just rolls them inside a cage you built, where a bad roll gets caught instead of shipped. This is about as deterministic as a system with a non-deterministic heart gets. Everything in this box except the dice-roll is code you can read, step, and fix — and the rest of this page is that code.

The rails are not suggestions. They're enforced in code, on every turn. Here is what the agent is not trusted to do — and therefore is mechanically prevented from doing, not asked nicely by a prompt:

What this agent is not allowed to do

  • Cannot browse. No network reach at answer time — it sees only the corpus the retrieval pipeline already fetched.
  • Cannot call tools. There is no tool surface. The model emits prose, nothing else.
  • Cannot answer without grounded links — and doesn't get many chances. An uncited answer fails the gate and regenerates; the failures never persist. Only if a small retry budget is exhausted does the last attempt stand, and the render path strips its phantom links to plain text so an invented URL never becomes a clickable 404.
  • Cannot show you code. A 3B mis-frames code it copies, so the gate rejects fenced blocks and code-like snippets — and the render path strips any that survive retry exhaustion, so code never reaches the screen regardless. The citation link carries you to the real, correctly-framed code instead.
  • Cannot retry retrieval to cover a bad answer. A failed gate regenerates against the same corpus — it cannot go re-fetch until the docs happen to agree with it.

Every one of those is a mechanism on this page, not a promise. The rest of this document is where each one lives in the code.
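
As a rough illustration of the output-side rails above, the sketch below shows one way a gate could test an answer and one way a render path could defuse invented links. It is not the Ask ADK gate: `checkAnswer`, `stripPhantomLinks`, `GateVerdict`, and the regular expressions are all hypothetical names. It assumes citations are Markdown links whose URLs exactly match a retrieved source, and it only detects fenced blocks, not the broader "code-like snippets" the real gate is described as rejecting.

```typescript
// Hypothetical sketch: names and patterns are illustrative, not the real gate.
type GateVerdict =
  | { ok: true }
  | { ok: false; reason: 'code' | 'no-grounded-link' }

// Collect the URL of every Markdown link of the form [text](url).
function extractLinkUrls(answer: string): string[] {
  const urls: string[] = []
  const linkPattern = /\[[^\]]*\]\(([^)\s]+)\)/g
  let match: RegExpExecArray | null
  while ((match = linkPattern.exec(answer)) !== null) {
    urls.push(match[1] ?? '')
  }
  return urls
}

function checkAnswer(answer: string, corpusUrls: Set<string>): GateVerdict {
  // Assumption: three consecutive backticks mean a fenced code block.
  if (/`{3}/.test(answer)) return { ok: false, reason: 'code' }
  // Pass only if at least one cited URL came from the retrieved corpus.
  const grounded = extractLinkUrls(answer).some((url) => corpusUrls.has(url))
  return grounded ? { ok: true } : { ok: false, reason: 'no-grounded-link' }
}

// Render-path fallback: links to unknown URLs collapse to their link text.
function stripPhantomLinks(answer: string, corpusUrls: Set<string>): string {
  return answer.replace(
    /\[([^\]]*)\]\(([^)\s]+)\)/g,
    (whole: string, text: string, url: string) => (corpusUrls.has(url) ? whole : text)
  )
}
```

A failing verdict would trigger a regeneration; `stripPhantomLinks` only matters for the last attempt that stands after the retry budget runs out.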

The shape of the turn

ADK owns the skeleton; the app decides which organs are worth having. Ask ADK is not another hand-rolled inference loop wearing a product name — it's a stock TurnRunner with the full storage-callback surface wired in and three middleware pipelines doing the real work. One run() is one turn.

The callback set is the entire ADK persistence contract, and the Ask ADK codebase wires exactly the slots it uses. The four fetch*/refresh* hydrators all do real work — messages, memories, retrievables, and standing instructions are what this agent traffics in. On the write side it's selective: storeMemory and storeStandingInstruction are real OPFS writes, storeMessage buffers (the accepted answer is persisted later, after the gate clears it — more below), and storeRetrievable is a no-op because retrievables are synthesized fresh each turn, never persisted. Everything else is a real-arity no-op: every *Thought*/*ToolCall* callback (a tool-less 3B has neither) plus the mutate*/delete* slots this agent never triggers. They exist because the contract requires them; they're empty because inventing fake work to look complete is how interfaces rot. That's how you satisfy an interface you don't fully use without lying to the next maintainer: wire what you use, stub the rest with the correct arity, fake nothing.

The three pipelines are where Ask ADK lives. turnInputPipeline hydrates history, retrieves and injects documents, recalls memory, loads standing instructions, and sheds context to fit the window — in that order, once per turn. dispatchOutputPipeline holds the citation gate. turnOutputPipeline extracts memory and persists the accepted answer.

ts
// Input pipeline: hydrate via ctx.fetch*() (delegates to the callbacks).
// Runs ONCE per turn — TurnRunner executes the turn input pipeline before
// DispatchRunner.dispatch(). Citation retries iterate the DISPATCH loop
// (dispatchInputPipeline/executor/dispatchOutputPipeline), which does NOT
// re-run these — so rewrite + retrieval happen exactly once. Do not move
// retrieval/standing/memory hydration into dispatchInputPipeline.
turnInputPipeline: [
  // Hydrate prior history from OPFS, then seed the CURRENT question
  // in-memory. The current message is intentionally NOT in OPFS yet
  // (persistence is deferred to turn-end), so we add it here directly so
  // the model sees what was just asked.
  (async (ctx: any, next: any) => {
    for (const m of await ctx.fetchMessages()) ctx.turnMessages.add(m)
    ctx.turnMessages.add(
      new Message({
        id: userMessageId,
        role: 'user',
        content: query,
        createdAt: DateTime.now(),
        updatedAt: DateTime.now(),
      })
    )
    await next()
  }) as TurnPipelineMiddlewareFn,
  (async (ctx: any, next: any) => {
    for (const r of await ctx.fetchRetrievables()) ctx.turnRetrievables.add(r)
    await next()
  }) as TurnPipelineMiddlewareFn,
  (async (ctx: any, next: any) => {
    for (const m of await ctx.fetchMemories()) ctx.turnMemories.add(m)
    await next()
  }) as TurnPipelineMiddlewareFn,
  (async (ctx: any, next: any) => {
    for (const si of await ctx.refreshStandingInstructions()) ctx.standingInstructions.add(si)
    await next()
  }) as TurnPipelineMiddlewareFn,
  // FINAL input step: Token-Thrift shedding. Everything that wants to be in
  // the prompt has now staked its claim; if the total would crowd out the
  // output reserve, shed lowest-priority buckets (RAG → memory → old turns)
  // until input fits contextWindow − reserve. Only emits a timeline step
  // when it ACTUALLY trims — a visible signal the window is under pressure.
  (async (ctx: any, next: any) => {
    const report = applyShedding(ctx, CONTEXT_WINDOW, SYSTEM_PROMPT_TOKENS)
    if (report) {
      const dropped: string[] = []
      if (report.droppedRag)
        dropped.push(`${report.droppedRag} chunk${report.droppedRag === 1 ? '' : 's'}`)
      if (report.droppedMemory)
        dropped.push(
          `${report.droppedMemory} ${report.droppedMemory === 1 ? 'memory' : 'memories'}`
        )
      if (report.droppedTurns)
        dropped.push(`${report.droppedTurns} old turn${report.droppedTurns === 1 ? '' : 's'}`)
      const droppedLabel = dropped.length ? dropped.join(', ') : 'nothing sheddable'
      bus.emit('step', {
        attempt: 0,
        kind: 'shed',
        state: report.refused ? 'failed' : 'done',
        label: report.refused
          ? `Context overflow — couldn't fit (${report.after}/${report.limit} tok)`
          : `Trimmed to fit window — dropped ${droppedLabel} (${report.before}→${report.after} tok)`,
        detail: {
          shed: {
            before: report.before,
            after: report.after,
            limit: report.limit,
            droppedRag: report.droppedRag,
            droppedMemory: report.droppedMemory,
            droppedTurns: report.droppedTurns,
            refused: report.refused,
          },
        },
      })
    }
    await next()
  }) as TurnPipelineMiddlewareFn,
],
turnOutputPipeline: [memoryAndStandingOutput, persistTurnOutput],
dispatchOutputPipeline: [citationGate],

Retrieval, rewrite, memory, and standing-instruction hydration run in the turn input pipeline, which executes exactly once before dispatch. Citation retries iterate the dispatch loop, which does not re-run the turn pipeline. That boundary is load-bearing: it's why a regeneration costs one more model call and not another full retrieval pass. (TurnRunner, Retrievable, and the pipeline contracts are ADK primitives — see The Loop and Assembly.)
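
To make that boundary concrete, here is a minimal sketch of the control flow, assuming a gate that returns a boolean and a small retry budget. `runTurnSketch` and its callback parameters are hypothetical and stand in for the TurnRunner and DispatchRunner machinery; the real calls are asynchronous, and this version is synchronous only to keep the shape visible.

```typescript
// Hypothetical sketch of the turn/dispatch boundary; not the ADK TurnRunner.
type TurnResult = { answer: string; accepted: boolean; attempts: number }

function runTurnSketch(
  retrieve: () => string[],
  generate: (corpus: string[], attempt: number) => string,
  gate: (answer: string) => boolean,
  maxAttempts: number
): TurnResult {
  // Turn-level work: executes exactly once, before any dispatch.
  const corpus = retrieve()

  // Dispatch-level work: each retry regenerates against the SAME corpus.
  let answer = ''
  const budget = Math.max(1, maxAttempts)
  for (let attempt = 1; attempt <= budget; attempt++) {
    answer = generate(corpus, attempt)
    if (gate(answer)) return { answer, accepted: true, attempts: attempt }
  }
  // Budget exhausted: the last attempt stands, flagged as not accepted.
  return { answer, accepted: false, attempts: budget }
}
```

Because `retrieve` sits outside the loop, a failed gate can cost at most one more generation per retry and can never trigger a second retrieval pass.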

Synthetic RAG: the Retrievable seam

The model doesn't know what @nhtio/adk is. It never will. Synthetic RAG starts with the admission most agent demos dodge — the model is not the authority, the documentation is: middleware fetches the relevant docs, wraps each as a first-party Retrievable, and the WebLLM battery renders them into a <retrieved_corpus> envelope before the model generates a single token. The model answers from the envelope. It has no idea where the text came from — only that it's there, tagged with the source URL it must cite.

There are no tools. The model does not get to perform curiosity. Retrieval code does the lookup before generation, and the model finds the corpus already in front of it.

ts
// Inject RAW chunk content (verbatim doc voice — see the no-author-code +
// anti-synthesis directives; we don't condense/paraphrase).
return finalHits.map(
  (h) =>
    new Retrievable({
      id: h.id,
      content: h.content,
      trustTier: 'first-party',
      source: `${h.pageUrl}#${h.anchor}`,
      kind: 'documentation',
      score: (h as { rerankScore?: number }).rerankScore ?? h.fusedScore,
      createdAt: now,
      updatedAt: now,
    })
)

trustTier: 'first-party' is a promise to the renderer: this is deployer-vetted material, render it as authoritative. Untrusted content (a user upload, an open-web scrape) would carry a different tier and a different envelope. Here the corpus is the product's own docs, so first-party is the truth, not a shortcut. The content is injected raw — verbatim doc prose, not a paraphrase — because the moment you let a 3B summarize its own context, you've put the arsonist in charge of writing the fire report.

This is the whole bet of synthetic RAG: ADK gives you a clean insertion point and a trust-tiered rendering contract; you build everything that decides what goes into it. The rest of this page is that "everything."

Recall first, judgment later: rewrite + HyDE

A conversational question is a bad search query. "And what about its arguments?" has no nouns a retriever can use. So before retrieval runs, the 3B does two small, non-streaming jobs — both of which only widen recall, neither of which is allowed to judge.

First, a rewrite: collapse the question (plus recent history, to resolve pronouns) into a few keywords. The prompt is deliberately tiny — one instruction, two examples, hard stop.

Field note: the elaborate-prompt trap

The tiny prompt is not aesthetic minimalism. It's scar tissue. An earlier, elaborate multi-rule rewrite prompt backfired completely: the 3B ignored every instruction and just answered the question instead, writing a Hello-World code sample where a search query belonged. The smaller the model, the shorter the leash — a long prompt is just more rope to wander off with.

Second, HyDE: write a short hypothetical documentation paragraph that would answer the question, and embed that. A made-up answer sits closer in vector space to the real doc chunk than a terse question does — it bridges the query-to-passage gap that bi-encoders fumble. The model doesn't know ADK, so its hypothetical may drift; that's fine, because HyDE only ever adds candidates to the pool. It cannot displace the right chunk, only fail to help.

ts
// Kept deliberately SHORT and single-purpose. An elaborate multi-rule prompt
// made the 3B IGNORE it and just answer the question (it wrote a Hello-World
// code sample instead of a search query). One instruction, two examples, hard
// stop. Vocabulary expansion is now carried by HyDE (a hypothetical answer is
// doc-shaped by nature), so the rewrite only needs to produce a short keyword
// query and resolve pronouns from history.
const SYSTEM_PROMPT = `Turn the user's message into a short documentation search query (3-8 keywords). Resolve pronouns using the conversation. Output ONLY the keywords, nothing else. Never answer the question. Never write code.
Example: "Show me a simple Hello World" -> "quickstart minimal example getting started setup"
Example: "What does it return?" (after talking about the executor) -> "executor return value ack nack"`

ts
export async function rewriteForRetrieval(
  engine: any,
  rawQuery: string,
  history: ConversationTurn[] = []
): Promise<string> {
  try {
    if (!engine?.chat?.completions?.create) return rawQuery
    const recent = history.slice(-4)
    const messages: Array<{ role: string; content: string }> = [
      { role: 'system', content: SYSTEM_PROMPT },
    ]
    for (const t of recent) messages.push({ role: t.role, content: t.content })
    messages.push({ role: 'user', content: rawQuery })
    const result = await engine.chat.completions.create({
      messages,
      stream: false,
      temperature: 0.1,
      max_tokens: 120,
    })
    const text = result?.choices?.[0]?.message?.content?.trim()
    if (!text) return rawQuery
    return text.replace(/^["'`]|["'`]$/g, '').trim() || rawQuery
  } catch {
    return rawQuery
  }
}

ts
// HyDE (Hypothetical Document Embeddings). Generate a short, plausible
// documentation paragraph that WOULD answer the question, then embed THAT for
// the cosine lane. A hypothetical answer sits closer in vector space to the
// real doc chunk than a terse question does — it bridges the query↔passage
// length/vocabulary gap that bi-encoders struggle with.
//
// IMPORTANT for a weak model: our 3B doesn't know ADK, so its hypothetical
// answer may use generic vocabulary and could drift OFF the right page. We
// therefore use HyDE only to WIDEN recall — the caller embeds both the
// rewritten query AND this doc and unions the candidates, so HyDE can only add
// hits, never displace the right one. The cross-encoder rerank then fixes
// ordering against the real question. Returns '' on any failure (caller skips
// the HyDE lane entirely).
const HYDE_SYSTEM_PROMPT = `Write a short, factual documentation paragraph (2-4 sentences) that would directly answer the user's question about a TypeScript library. Use precise technical terms and likely API/type names. Do not hedge, do not say "I think", do not mention that this is hypothetical. Output only the paragraph.`

export async function generateHydeDocument(engine: any, question: string): Promise<string> {
  try {
    if (!engine?.chat?.completions?.create) return ''
    const result = await engine.chat.completions.create({
      messages: [
        { role: 'system', content: HYDE_SYSTEM_PROMPT },
        { role: 'user', content: question },
      ],
      stream: false,
      temperature: 0.2,
      max_tokens: 160,
    })
    return String(result?.choices?.[0]?.message?.content ?? '').trim()
  } catch {
    return ''
  }
}
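
The "union, never displace" behavior described in the comment above can be sketched as follows. This is an illustration under assumptions, not the caller's actual code: `cosineCandidates`, `topByCosine`, and the `EmbeddedChunk` shape are hypothetical, vectors are assumed to share one dimension, and a chunk found by both embeddings simply keeps its higher score.

```typescript
// Hypothetical sketch of a cosine lane that unions query and HyDE candidates.
type EmbeddedChunk = { id: string; vector: number[] }
type LaneHit = { id: string; score: number }

// Cosine similarity; assumes both vectors have the same length.
function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  a.forEach((x, i) => {
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  })
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}

function topByCosine(query: number[], chunks: EmbeddedChunk[], limit: number): LaneHit[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
}

function cosineCandidates(
  queryVec: number[],
  hydeVec: number[] | null, // null when HyDE generation returned ''
  chunks: EmbeddedChunk[],
  limit: number
): LaneHit[] {
  const lanes = [topByCosine(queryVec, chunks, limit)]
  if (hydeVec) lanes.push(topByCosine(hydeVec, chunks, limit))
  // Union: every query-lane hit survives; HyDE can only add ids or raise scores.
  const best = new Map<string, number>()
  for (const lane of lanes) {
    for (const hit of lane) {
      best.set(hit.id, Math.max(best.get(hit.id) ?? -Infinity, hit.score))
    }
  }
  return Array.from(best, ([id, score]) => ({ id, score })).sort((x, y) => y.score - x.score)
}
```

Passing `null` for the HyDE vector reproduces the "skip the HyDE lane entirely" fallback: the result is just the query's own top hits.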

Both passes share the single loaded WebLLM engine with the answer model — the 1.6GB weights load once. They run serially, not concurrently: one worker engine processes one completion at a time. Latency is acceptable here; correctness is not optional.

Field note: one engine, one completion at a time

Serial is not a style choice. It is the physics of one engine. Firing the rewrite and HyDE passes concurrently against the single worker engine made them collide and both return empty. There is one set of weights on one GPU context; it does one completion at a time, and the pipeline is built to respect that.

Hybrid retrieval → rerank → abstain

A prebuilt index of documentation chunks plus their MiniLM embeddings is already loaded in the browser. Everything below runs against it, in-tab, with no network call.

This is the page's technical centerpiece: a three-layer relevance defense, because the 3B text-predictor cannot be trusted to notice that the context handed to it is poison.

Layer 1 — three lanes, fused. Both the rewritten query and the raw question go through three independent retrieval lanes: Orama's BM25, a hand-rolled BM25 over pre-tokenized terms, and cosine over the embedding vectors. The HyDE doc rides the cosine lane only — its hypothetical-answer embedding is scored against the chunk vectors and unioned into the cosine candidates, widening recall without voting in the lexical lanes. Three lanes that disagree are the point — fusing three copies of the same mistake is just numerology; Reciprocal Rank Fusion only earns its keep when its inputs are independent. The hand-rolled BM25 exists precisely so two of the three lanes aren't the same library scoring the same way.

ts
function bm25Rank(query: string[], chunks: AskAdkChunk[]): Array<{ id: string; score: number }> {
  // Tiny BM25: k1=1.2, b=0.75; doc length normalised to avg doc length.
  const k1 = 1.2
  const b = 0.75
  const N = chunks.length
  const df = new Map<string, number>()
  for (const c of chunks) {
    const seen = new Set(c.bm25Tokens)
    for (const t of seen) df.set(t, (df.get(t) ?? 0) + 1)
  }
  let avgdl = 0
  for (const c of chunks) avgdl += c.bm25Tokens.length
  avgdl = avgdl / Math.max(1, N)
  const idf = new Map<string, number>()
  for (const t of query) {
    const d = df.get(t) ?? 0
    idf.set(t, Math.log(1 + (N - d + 0.5) / (d + 0.5)))
  }
  const scores: Array<{ id: string; score: number }> = []
  for (const c of chunks) {
    const tf = new Map<string, number>()
    for (const tok of c.bm25Tokens) tf.set(tok, (tf.get(tok) ?? 0) + 1)
    let s = 0
    for (const q of query) {
      const f = tf.get(q) ?? 0
      if (!f) continue
      const i = idf.get(q) ?? 0
      const dl = c.bm25Tokens.length
      s += i * ((f * (k1 + 1)) / (f + k1 * (1 - b + b * (dl / avgdl))))
    }
    if (s > 0) scores.push({ id: c.id, score: s })
  }
  scores.sort((x, y) => y.score - x.score)
  return scores.slice(0, PER_LANE_LIMIT)
}
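
The fusion step that combines the lanes is not shown on this page, so here is a minimal sketch of Reciprocal Rank Fusion under stated assumptions. `reciprocalRankFusion` is a hypothetical name, the constant `k = 60` is the conventional default rather than a value taken from this codebase, and each lane is assumed to arrive already sorted best first.

```typescript
// Hypothetical sketch of Reciprocal Rank Fusion; k = 60 is an assumed default.
type RankedHit = { id: string }
type FusedHit = { id: string; fusedScore: number }

function reciprocalRankFusion(lanes: RankedHit[][], k = 60): FusedHit[] {
  const scores = new Map<string, number>()
  for (const lane of lanes) {
    lane.forEach((hit, index) => {
      // Rank is 1-based: the top hit of a lane contributes 1 / (k + 1).
      const contribution = 1 / (k + index + 1)
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + contribution)
    })
  }
  return Array.from(scores, ([id, fusedScore]) => ({ id, fusedScore })).sort(
    (x, y) => y.fusedScore - x.fusedScore
  )
}
```

Only ranks enter the sum, never raw scores, which is why lanes with incomparable scales (BM25 weights against cosine similarities) can be fused without normalization. Any number of ranked lists can be passed, for example one per lane per query form.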

ts
// PAGE-COVERAGE BOOST. RRF ranks chunks independently, so a question whose
// real answer is spread across one page's many sections (e.g. "Bring your own
// LLM" has 28 chunks) can lose to a single distinctive chunk from a tangential
// page that happens to share vocabulary. That's how "how do I write my own LLM
// backend" ended up answered from byo-retrieval instead of byo-llm. We fix it
// by rewarding topical concentration: a page that contributes more total
// fused mass (across all its hit chunks) is the center of the query, so every
// chunk on it gets lifted. Multiplicative, capped, so a strong single-chunk
// hit still survives — this nudges, it doesn't override.
const pageMass = new Map<string, number>()
for (const h of fused) pageMass.set(h.pageUrl, (pageMass.get(h.pageUrl) ?? 0) + h.fusedScore)
const maxMass = Math.max(1e-9, ...pageMass.values())
for (const h of fused) {
  const massShare = (pageMass.get(h.pageUrl) ?? 0) / maxMass // 0..1, 1 = densest page
  h.fusedScore *= 1 + 0.5 * massShare // up to +50% for chunks on the densest page
}

fused.sort((a, b) => b.fusedScore - a.fusedScore)

The page-coverage boost is ugly in exactly the useful way production fixes are ugly. RRF ranks chunks independently, so a question whose answer is spread across one page's many sections can lose to a single loud chunk from a tangential page that happens to share vocabulary. The fix: reward topical concentration. A page contributing more total fused mass is the center of the query, so every chunk on it gets lifted — multiplicatively, capped, so a genuine single-chunk hit still survives. It nudges. It is not allowed to crown a loser.

Field note: the page that out-shouted the right one

This is not hypothetical. "How do I write my own LLM backend" once got answered from the retrieval docs instead of the LLM docs — one loud, vocabulary-matching chunk from the wrong page beat the many quieter, correct chunks spread across the right one. The boost exists because being right on average is worthless if the single loudest chunk is wrong.

Layer 2 — the generator is not the judge. The fused pool is a suspect list, not a verdict. A bi-encoder compressed query and document into separate vectors, so it scores "shares vocabulary" and "answers the question" almost the same. A cross-encoder (ms-marco-MiniLM, running on ONNX in the browser) reads query and passage jointly and emits a real relevance score. Crucially, that score comes from a cross-encoder actually trained for relevance — a different model entirely, upstream of the 3B, which isn't. Each chunk is scored against two query forms — the raw question and the keyword rewrite — and keeps its best, because any one phrasing from a small model may whiff. (HyDE widened recall already; it's deliberately kept out of the rerank, which judges relevance to what the user actually asked.)

ts
export async function rerankBest(
  queries: Array<string | undefined | null>,
  hits: RetrievalHit[]
): Promise<RerankedHit[]> {
  if (hits.length === 0) return []
  const forms = [...new Set(queries.map((q) => (q ?? '').trim()).filter(Boolean))]
  if (forms.length === 0) return hits.map((h) => ({ ...h, rerankScore: Number.NaN }))

  const best = new Map<string, number>() // chunk id → best score across forms
  let anyWorked = false
  for (const form of forms) {
    const scored = await rerankHits(form, hits)
    for (const h of scored) {
      if (Number.isNaN(h.rerankScore)) continue
      anyWorked = true
      const prev = best.get(h.id)
      if (prev === undefined || h.rerankScore > prev) best.set(h.id, h.rerankScore)
    }
  }
  if (!anyWorked) return hits.map((h) => ({ ...h, rerankScore: Number.NaN }))

  const out: RerankedHit[] = hits.map((h) => ({ ...h, rerankScore: best.get(h.id) ?? 0 }))
  out.sort((a, b) => b.rerankScore - a.rerankScore)
  return out
}

Layer 3 — abstain instead of bluff. If the best chunk still scores below a sufficiency floor, nothing retrieved actually answers the question. The pipeline does not hand the 3B tangential material and then perform the industry ritual of asking nicely for honesty — on the abstain path the model's output is never used for the answer. A chunk-none sentinel is injected and the turn still runs, but whatever the model improvises is discarded; the response the user gets is a deterministic refusal assembled in code after the run, naming the closest pages as links. The model is cut out of the one situation where it's most tempted to invent a "related feature" to fill the silence.

ts
// Deterministic refusal for the abstain path — assembled in code, never
// generated, so the model can't fabricate a "related feature" or fake API to
// fill the silence. Markdown so the dialog renders the links as anchors.
function buildAbstainRefusal(
  _query: string,
  closest: Array<{ title: string; url: string }>
): string {
  const lead =
    "The documentation doesn't cover this. I only answer from the @nhtio/adk docs, and nothing in them addresses your question."
  if (closest.length === 0) return lead
  const links = closest.map((p) => `- [${p.title}](${p.url})`).join('\n')
  return `${lead}\n\nThe closest pages, in case they help:\n\n${links}`
}
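
The decision that routes a turn onto the abstain path is a plain threshold check. The sketch below shows the idea; `shouldAbstain` and the `ScoredHit` shape are hypothetical, the floor value in the usage note is a placeholder rather than the tuned number, and treating "no usable rerank score" as insufficient is an assumption (a real pipeline might fall back to fused scores instead).

```typescript
// Hypothetical sketch of the sufficiency floor; the floor value is a placeholder.
type ScoredHit = { id: string; rerankScore: number }

function shouldAbstain(hits: ScoredHit[], floor: number): boolean {
  let best = -Infinity
  for (const hit of hits) {
    // NaN marks a chunk the reranker could not score; it never counts as evidence.
    if (!Number.isNaN(hit.rerankScore) && hit.rerankScore > best) best = hit.rerankScore
  }
  // Abstain when even the best chunk sits below the floor (or nothing was scored).
  return best < floor
}
```

With a placeholder floor of `0`, a pool whose best score is `-2.1` abstains and a pool containing a `3.4` proceeds to packing. The check runs in code before generation, so the model's opinion of its own context is never consulted.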

Only the survivors of all three layers get packed into the per-turn token budget: the packer walks the reranked order and stops when the next chunk would overflow, always keeping at least the top chunk even if that chunk alone exceeds the budget. There is no fixed top-K; the model gets as much corpus as fits, best first.

ts
export function packToBudget(hits: RetrievalHit[], tokenBudget: number): RetrievalHit[] {
  const packed: RetrievalHit[] = []
  let used = 0
  for (const hit of hits) {
    const cost = Tokenizable.estimateTokens(hit.content, 'cl100k_base')
    if (used + cost > tokenBudget && packed.length > 0) break
    packed.push(hit)
    used += cost
  }
  return packed
}

The 4096-token window bites

The window is not a preference knob. 4096 is the compiled limit for this model's WebLLM build — force it higher and the model doesn't get roomier, it gets broken: token-level garbage. So every one of those tokens gets spent on purpose.

The generic budget split gives retrieval ~10% of the window. For a tool-using agent with a fat conversation, fine. For Ask ADK — no tools, short conversations, whose entire job is grounding answers in retrieved docs — that split is architectural malpractice: it starves the only grounding source, then acts surprised when the model hallucinates. Token-Thrift does RAG-first budgeting: every other slice gets a tight floor, and retrieval absorbs the remainder. The system-prompt cost is measured once and threaded in, because the old code assumed a 500-token prompt while the real one was ~2000 — so the "reserve" was fiction and input quietly ate the room the model needed to answer.

ts
export function computeBudget(
  contextWindow = DEFAULT_CONTEXT_WINDOW,
  // Measured token cost of the REAL system prompt. The old code assumed a 500-tok
  // floor while the actual prompt was ~2000 — so RAG was over-allocated by ~1500
  // tokens and the "reserve" was fiction: input could swell to fill the window,
  // leaving the model almost no room to answer. Pass the real measurement so the
  // RAG bucket (the remainder) honestly reflects what's left after the prompt.
  actualSystemPromptTokens?: number
): TokenThriftBudget {
  // RAG-FIRST allocation. The generic Cortex split gives RAG ~10% — correct for
  // a tool-using agent with a fat conversation, dead wrong here. Ask ADK has NO
  // tools, short conversations, and its ENTIRE job is grounding answers in
  // retrieved docs. So RAG is the dominant bucket: every other slice gets a
  // tight floor and RAG absorbs the remainder of the context window.
  //
  // The earlier 0.1×cw RAG bucket (~409 tok on a 4096 window) fit barely one
  // chunk — the model was starved and hallucinated APIs that don't exist. This
  // is the fix for that.
  const reserve = Math.max(768, Math.floor(0.25 * contextWindow)) // output side; matches the answer adapter's max_tokens
  const systemPrompt = Math.max(actualSystemPromptTokens ?? 500, Math.floor(0.12 * contextWindow))
  const standingInstructions = Math.max(120, Math.floor(0.03 * contextWindow))
  const currentUserMessage = Math.max(250, Math.floor(0.06 * contextWindow))
  const rawTurns = Math.max(400, Math.floor(0.1 * contextWindow))
  const memory = Math.max(120, Math.floor(0.03 * contextWindow))
  const toolOutputs = 0 // Ask ADK has no tools.
  // HEADROOM: slack for tokenizer drift (we size with cl100k_base; MLC counts
  // with the real Llama tokenizer, ~10-15% higher on prose) plus a little give.
  // Retry overhead is NOT a factor here anymore: failed attempts are pruned from
  // the thread on regeneration (see ask_adk_runtime), and retrieved chunks are
  // injected raw but packed to a token budget first (see packToBudget), so the
  // RAG footprint stays bounded.
  const headroom = Math.max(450, Math.floor(0.1 * contextWindow))
  const claimed =
    reserve +
    systemPrompt +
    standingInstructions +
    currentUserMessage +
    rawTurns +
    memory +
    toolOutputs +
    headroom
  const rag = Math.max(700, contextWindow - claimed)
  const total = claimed + rag
  return {
    contextWindow,
    reserve,
    systemPrompt,
    standingInstructions,
    currentUserMessage,
    rawTurns,
    memory,
    rag,
    toolOutputs,
    total,
  }
}

ts
export function applyShedding(
  ctx: ShedCtxLike,
  contextWindow: number,
  actualSystemPromptTokens?: number
): ShedReport | null {
  const budget = computeBudget(contextWindow, actualSystemPromptTokens)
  const limit = contextWindow - budget.reserve
  const before = ctxInputTotal(ctx)
  if (before <= limit) return null

  let droppedRag = 0
  let droppedMemory = 0
  let droppedTurns = 0

  // T4: drop lowest-scored RAG chunks from the tail (turnRetrievables is added in
  // fused/reranked order, best first, so the last-inserted are the weakest).
  while (ctxInputTotal(ctx) > limit && ctx.turnRetrievables.size > 0) {
    const last = [...ctx.turnRetrievables].at(-1)!
    ctx.turnRetrievables.delete(last)
    droppedRag++
  }
  // T3: drop memory items (tail).
  while (ctxInputTotal(ctx) > limit && ctx.turnMemories.size > 0) {
    const last = [...ctx.turnMemories].at(-1)!
    ctx.turnMemories.delete(last)
    droppedMemory++
  }
  // T2: drop raw turns from the head (oldest first), one message at a time,
  // down to a floor. The most recent FLOOR_RAW_TURNS messages are kept, so the
  // current question (always the newest) is never shed.
  while (ctxInputTotal(ctx) > limit && ctx.turnMessages.size > FLOOR_RAW_TURNS) {
    const oldest = [...ctx.turnMessages][0]
    ctx.turnMessages.delete(oldest)
    droppedTurns++
  }

  const after = ctxInputTotal(ctx)
  return { before, after, limit, droppedRag, droppedMemory, droppedTurns, refused: after > limit }
}

Shedding is the final step of the input pipeline: once everything has staked its claim, if the total still overflows the window minus the output reserve, drop the lowest-priority buckets — tail RAG first, then memory, then the oldest conversation turns — until it fits. That looks like it contradicts RAG-first budgeting, and doesn't: RAG gets the largest allocation, and what sheds first is its tail — the lowest-reranked chunks, the ones least likely to matter. The best chunks sit at the head and are the last thing to go. Allocation favors RAG; eviction trims it from the bottom up. No summarization, no clever compression. Dropping is lossy but predictable; summarizing just hands the weak model one more chance to lie.
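
The eviction order is easier to see on toy data. The sketch below is a simplified stand-in, not the real function: it uses plain arrays where the code above uses Set-like collections, counts tokens from a hand-written field, and every name in it (ToyCtx, shedSketch, makeCtx) is hypothetical.

```typescript
interface Item {
  id: string
  tokens: number
}

interface ToyCtx {
  rag: Item[] // best-ranked chunk first, weakest last
  memory: Item[]
  turns: Item[] // oldest message first, current question last
}

const sum = (items: Item[]): number => items.reduce((n, i) => n + i.tokens, 0)
const inputTotal = (c: ToyCtx): number => sum(c.rag) + sum(c.memory) + sum(c.turns)

// Same order as the shedding above: RAG tail, then memory tail, then the
// oldest turns, stopping as soon as the input fits under the limit.
function shedSketch(ctx: ToyCtx, limit: number, floorRawTurns: number): string[] {
  const dropped: string[] = []
  while (inputTotal(ctx) > limit && ctx.rag.length > 0) dropped.push(ctx.rag.pop()!.id)
  while (inputTotal(ctx) > limit && ctx.memory.length > 0) dropped.push(ctx.memory.pop()!.id)
  while (inputTotal(ctx) > limit && ctx.turns.length > floorRawTurns) {
    dropped.push(ctx.turns.shift()!.id)
  }
  return dropped
}

const makeCtx = (): ToyCtx => ({
  rag: [
    { id: 'r1', tokens: 300 },
    { id: 'r2', tokens: 300 },
    { id: 'r3', tokens: 300 },
  ],
  memory: [{ id: 'm1', tokens: 50 }],
  turns: [
    { id: 'u1', tokens: 100 },
    { id: 'a1', tokens: 100 },
    { id: 'u2', tokens: 100 },
  ],
})

const mild = shedSketch(makeCtx(), 800, 2) // ['r3', 'r2']
const severe = shedSketch(makeCtx(), 250, 2) // ['r3', 'r2', 'r1', 'm1', 'u1']
```

Under mild pressure only the two weakest chunks go and the best-ranked one survives. Under severe pressure everything below the turn floor is gone before the current question is ever at risk.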

Asking a 3B to emit a synthetic citation marker — some [chunk-id] grammar you invented — is a format fetish that the model will fail. And think about what you're actually asking for: a model that could reliably follow a bespoke output grammar is a model that could reliably emit a tool call, and if it could do that, none of this page would exist — we'd hand it a retrieve() tool and go home. It can't. That's the entire premise. A 3B cannot reproduce a made-up token on demand any more than it can structure a function call. What it can do — the one string operation small models are genuinely reliable at — is copy a URL that's sitting right there in the corpus envelope.

So the citation contract is exactly that, and nothing fancier: an ordinary Markdown link to the chunk's source URL. The model writes [the executor callback](/assembly/byo-llm) inline, copying the URL verbatim. Validation is a page-path match against the retrieved set: a link to a retrieved page is a real citation; an internal link to a page that was not retrieved is a hallucinated path and gets counted as invalid. The contract is shaped around the one thing the model can already do, not around what would be tidy.
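
A page-path match of this kind can be sketched in a few lines. The regular expressions and the normalization rules below are assumptions made for illustration (the article does not show the real MARKDOWN_LINK_RE, INTERNAL_PATH_RE, or normalizePath), and countCitations is a hypothetical name.

```typescript
// Hypothetical normalization: drop query and fragment, a trailing ".html",
// and trailing slashes, so "/assembly/byo-llm#ack" and "/assembly/byo-llm/"
// both compare equal to "/assembly/byo-llm".
function normalizePathSketch(target: string): string {
  const path = target.split(/[?#]/)[0]
  const trimmed = path.replace(/\.html$/, '').replace(/\/+$/, '')
  return trimmed === '' ? '/' : trimmed
}

function countCitations(
  text: string,
  retrievedPaths: ReadonlySet<string>
): { valid: number; invalid: number } {
  // Assumed patterns: [label](target) with no whitespace in the target, and
  // "internal" meaning a root-relative path that is not protocol-relative.
  const linkRe = /\[([^\]]+)\]\(([^)\s]+)\)/g
  const internalRe = /^\/(?!\/)/
  let valid = 0
  let invalid = 0
  let m: RegExpExecArray | null
  while ((m = linkRe.exec(text)) !== null) {
    const target = m[2]
    if (!internalRe.test(target)) continue // external link or bare fragment
    if (retrievedPaths.has(normalizePathSketch(target))) valid++
    else invalid++
  }
  return { valid, invalid }
}

const retrieved = new Set<string>(['/assembly/byo-llm'])
const answer =
  'Use [the executor callback](/assembly/byo-llm#ack), not [this page](/made/up), ' +
  'and see [MDN](https://developer.mozilla.org) for the rest.'
const coverageSketch = countCitations(answer, retrieved) // { valid: 1, invalid: 1 }
```

The model may type a wrong fragment or a trailing slash and still be credited, because only the page path is compared. An internal path that was never retrieved is counted against it, and the external link is ignored either way.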

ts
export function validateCitationCoverage(
  text: string,
  retrievables: ReadonlyMap<string, AskAdkChunk>
): { valid: number; invalid: number } {
  const idx = buildCitationIndex(
    [...retrievables.values()].map((c) => ({
      id: c.id,
      pageUrl: c.pageUrl,
      anchor: c.anchor,
      title: c.title,
    }))
  )
  let valid = 0
  let invalid = 0
  for (const m of text.matchAll(MARKDOWN_LINK_RE)) {
    const target = (m[2] ?? '').trim()
    if (!INTERNAL_PATH_RE.test(target)) continue // external/fragment — ignore
    if (idx.byPath.has(normalizePath(target))) valid++
    else invalid++
  }
  return { valid, invalid }
}
ts
// ─── Post-render citation-link decoration ────────────────────────────────────
// Citations are ordinary Markdown links the model wrote to the documents, so by
// the time MarkdownIt → DOMPurify runs they're already <a> elements in the HTML.
// This pass walks those anchors and:
//   - For an internal link whose path matches a RETRIEVED page: mark it as a
//     citation (class, new-tab, canonical href + title) so it's styled and
//     points at the precise anchor we retrieved — not whatever (possibly wrong)
//     fragment the model typed.
//   - For an internal link whose path matches NO retrieved page: the model
//     invented a doc URL. Unwrap it to plain text (drop the bad href) and count
//     it, so a hallucinated link never becomes a clickable 404.
//   - External links (http/https/mailto) and pure #fragments: left untouched.

export interface CitationTarget {
  url: string
  title: string
}

/**
 * Decorate/clean citation links in an already-sanitized HTML fragment.
 * `targets` maps normalized internal path → the canonical { url, title } for the
 * retrieved page at that path. Operates on a detached <template>; returns HTML.
 */
export function decorateCitationLinks(
  safeHtml: string,
  targets: ReadonlyMap<string, CitationTarget>,
  ctx: CitationContext
): string {
  if (typeof document === 'undefined') return safeHtml
  if (safeHtml.indexOf('<a') === -1) return safeHtml

  const tpl = document.createElement('template')
  tpl.innerHTML = safeHtml

  for (const a of Array.from(tpl.content.querySelectorAll('a'))) {
    const href = (a.getAttribute('href') ?? '').trim()
    if (!INTERNAL_PATH_RE.test(href)) continue // external / fragment — leave alone
    const target = targets.get(normalizePath(href))
    if (target) {
      a.className = 'ask-adk-cite'
      a.setAttribute('href', target.url)
      a.setAttribute('title', target.title)
      a.setAttribute('target', '_blank')
      a.setAttribute('rel', 'noopener')
    } else {
      // Ungrounded internal link — model invented a doc path. Unwrap to text.
      ctx.invalidCount++
      const text = document.createTextNode(a.textContent ?? '')
      a.parentNode?.replaceChild(text, a)
    }
  }

  return tpl.innerHTML
}

By the time the answer renders (MarkdownIt → DOMPurify), citations are already <a> elements. A final DOM pass decorates the grounded ones (canonical href, the precise anchor we retrieved, new tab) and unwraps the hallucinated ones to plain text — a doc URL the model invented never becomes a clickable 404.

The model is also forbidden from emitting code at all. It describes in prose and links to the real, correctly-framed code in the docs.

Field note: why a real snippet is as dangerous as a fake one

You might expect copying code from the corpus to be safe. It isn't. The 3B mis-frames code it copies — it once lifted a battery-wiring one-liner and captioned it "how to write a custom executor." A mislabeled real snippet is exactly as harmful as an invented one, and the model can't be trusted to tell the difference. So it doesn't get to emit code at all; the link does that job.
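
Enforcing a no-code rule needs a detector, and a deterministic one is enough. The sketch below is an assumption about how such a check might look, not the real detectCode: it only borrows the result shape the gate reads (hasFence and inlineCode), and the patterns are illustrative. Backticks are written as \u0060 so the example can sit inside a fenced block.

```typescript
interface CodeDetection {
  hasFence: boolean
  inlineCode: string[]
}

function detectCodeSketch(text: string): CodeDetection {
  // A fence is three backticks or three tildes at the start of a line.
  const hasFence = /^\s*(\u0060{3}|~{3})/m.test(text)

  // Remove fenced blocks (closed or left open) before looking for inline
  // spans, so fence delimiters are not misread as inline code.
  const withoutFences = text.replace(/(\u0060{3}|~{3})[\s\S]*?(?:\1|$)/g, '')

  const inlineCode: string[] = []
  const inlineRe = /\u0060([^\u0060\n]+)\u0060/g
  let m: RegExpExecArray | null
  while ((m = inlineRe.exec(withoutFences)) !== null) inlineCode.push(m[1])

  return { hasFence, inlineCode }
}

const prose = detectCodeSketch('Wire it up as described in [the guide](/assembly/byo-llm).')
const inline = detectCodeSketch('Call \u0060ctx.ack()\u0060 when the answer passes.')
const fenced = detectCodeSketch('\u0060\u0060\u0060ts\nconst x = 1\n\u0060\u0060\u0060')
```

Prose with a link passes clean, a single inline span is reported by content, and a fenced block trips hasFence without also being counted as inline code.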

The output gate owns ack

This is where ADK's dispatch seam stops being an abstraction and starts earning rent. The answer adapter is constructed with autoAck: false, which means the executor finishes generating and does not signal turn completion. That single option hands control to a dispatchOutputPipeline middleware — the citation gate — which inspects the whole answer and decides whether the turn is allowed to finish.

ts
async preload(): Promise<void> {
  if (this.#answerAdapter) return
  // Answer adapter: autoAck FALSE — the citation gate in the dispatch output
  // pipeline owns turn completion (ack/iterate). See ask_adk_citation_gate.ts.
  // tokenEncoding + contextWindow turn ON the battery's real per-bucket context
  // accounting: it sums actual tokens (system prompt + standing + memories +
  // retrievables + timeline) against the window and emits a
  // `context-window-usage` log on every turn — the source for the UI meter.
  //
  // stream FALSE on purpose. The citation gate must validate the WHOLE answer
  // before the user sees a single token; streaming would paint an un-vetted
  // (possibly un-cited, soon-to-be-discarded) answer live. We generate the
  // complete answer, gate it, and the dialog simulates a typewriter reveal of
  // the ACCEPTED text — the "thinking" feel without showing rejected output.
  this.#answerAdapter = this.#makeAdapter({
    stream: false,
    max_tokens: 1024,
    autoAck: false,
    tokenEncoding: 'cl100k_base',
    contextWindow: CONTEXT_WINDOW,
  })
  // preload() RETURNS the loaded engine. The adapter exposes no getEngine()
  // method and no public .engine property — the old `getEngine?.() ?? .engine`
  // resolved to undefined, silently leaving #engine null. That broke engine
  // SHARING (the rewrite adapter loaded the 1.6GB model a second time, or its
  // own engine probe failed) AND broke rewrite/HyDE, which fell back to the
  // raw query on every turn. Capture the returned engine directly.
  this.#engine = await this.#answerAdapter.preload()
  // Rewrite adapter: autoAck TRUE — single-shot completion with no output gate,
  // so it must self-terminate after one generation. Shares the answer engine.
  this.#rewriteAdapter = this.#makeAdapter({
    stream: false,
    max_tokens: 80,
    autoAck: true,
    engine: this.#engine ?? undefined,
  })
  // Warm the cross-encoder reranker in the background (don't block the engine
  // load on it). It's a small ONNX model; first turn shouldn't pay the load.
  void preloadReranker()
}

stream: false is the other half. The gate has to validate the complete answer before the user sees any of it — streaming would paint an un-vetted, possibly-uncited answer live and then yank it. So generation is non-streaming; the dialog reveals the accepted text with a typewriter effect afterward. The theater of thinking, without the malpractice of streaming rejected text.

The gate runs three checks against one shared retry budget: the answer must link at least one retrieved page, must contain no code, and must not gesture at "the documentation" in bare prose instead of linking it. Pass, and it calls ctx.ack(). Fail with budget left, and it does two things to the live turn state and withholds the ack: it deletes the failed assistant attempt (a wrong, uncited answer is dead weight in a 4096-token window and would only anchor the model to its own mistake), and it folds a corrective directive into the live user message. Then the dispatch loop iterates and generation repeats against the same corpus, with the directive now part of the prompt. The live turn state is the retry — no side-channel stash, no synthetic correction turn; just the original user message, amended.

ts
// Validate the SAME text the renderer will, resolved the SAME way. The 3B
// can't reproduce our synthetic chunk ids, but it reliably copies the real
// source URL / page out of the envelope. resolveCitations rewrites whatever
// form it emitted (markdown link, wrong-index slug, attribution paren) into
// canonical bare markers carrying REAL retrieved ids — by URL/page match,
// not strict id equality. The renderer runs the identical resolve, so the
// gate's verdict equals what the user will see rendered as ↗ links.
const chunkList = [...opts.retrievables.values()]
const text = resolveCitations(latest.content ?? '', chunkList)
const coverage = validateCitationCoverage(text, opts.retrievables)
if (coverage.invalid > 0) opts.bus.emit('hallucinated-citations', { count: coverage.invalid })

// Three independent quality checks, one shared retry budget:
//   1. citation grounding — the answer links at least one retrieved page;
//   2. no code — no fenced/inline code (a mislabeled snippet gets pasted);
//   3. no unlinked reference — no gesturing at "the documentation" in bare
//      prose instead of linking it (a dangling pointer + the forbidden
//      third-person voice). The fix is the same as grounding: LINK, don't
//      gesture. Code is highest severity; report it first when several fail.
const citationPassed = coverage.valid > 0
const code = detectCode(text)
const hasCode = code.hasFence || code.inlineCode.length > 0
const hasUnlinked = hasUnlinkedReference(text)
const passed = citationPassed && !hasCode && !hasUnlinked
const reason: 'citation' | 'code' | 'unlinked' | undefined = passed
  ? undefined
  : hasCode
    ? 'code'
    : !citationPassed
      ? 'citation'
      : 'unlinked'
opts.bus.emit('verify', {
  attempt,
  passed,
  valid: coverage.valid,
  invalid: coverage.invalid,
  reason,
})

if (passed || attempt >= opts.maxRetries) {
  // All checks pass, or the retry budget is spent: accept the current attempt.
  markGenerateDone()
  ctx.ack()
  await next()
  return
}

ctx.iteration is the attempt index; when the budget is exhausted the current attempt is accepted rather than looping forever. Retrieval does not re-run on a retry — the corpus was never the problem; the answer just tripped one of the three checks, so only generation repeats. (The ack/nack invariant and the autoAck battery option are documented under Bring your own LLM.)
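
Stripped of the middleware machinery, the retry policy is a small loop. The sketch below is a simplified, synchronous stand-in under stated assumptions: generate and inspect are injected fakes, the directive wording is invented for illustration, and runGateSketch is a hypothetical name. The priority order for the failure reason is taken from the gate code above.

```typescript
type Reason = 'citation' | 'code' | 'unlinked'

interface Checks {
  hasCitation: boolean
  hasCode: boolean
  hasUnlinked: boolean
}

// Same priority as the gate: code first, then missing citation, then unlinked.
function pickReason(c: Checks): Reason | undefined {
  if (c.hasCode) return 'code'
  if (!c.hasCitation) return 'citation'
  if (c.hasUnlinked) return 'unlinked'
  return undefined
}

// Hypothetical wording; the real directive text is not shown in the article.
const DIRECTIVES: Record<Reason, string> = {
  code: 'Do not include code. Describe it in prose and link the page.',
  citation: 'Cite at least one retrieved page with a Markdown link.',
  unlinked: 'Link the page you mean instead of referring to "the documentation".',
}

interface GateResult {
  answer: string
  attempt: number
  passed: boolean
}

function runGateSketch(
  question: string,
  generate: (prompt: string) => string,
  inspect: (answer: string) => Checks,
  maxRetries: number
): GateResult {
  let prompt = question
  for (let attempt = 0; ; attempt++) {
    const answer = generate(prompt)
    const reason = pickReason(inspect(answer))
    if (reason === undefined || attempt >= maxRetries) {
      return { answer, attempt, passed: reason === undefined }
    }
    // The failed answer is not kept anywhere. The directive is folded into the
    // original question; rebuilding from the original each time (rather than
    // stacking directives) is a choice made for this sketch.
    prompt = question + '\n\n' + DIRECTIVES[reason]
  }
}
```

Retrieval never appears in the loop, which matches the text above: only generation repeats. Once attempt reaches maxRetries the current answer is returned with passed set to false, so the caller can still tell an accepted answer from one that merely ran out of budget.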

What else ships

The spine above is the agent. The rest is plumbing — which is exactly where agent demos quietly become production failures, so it's worth naming:

  • Persistence is delayed until acceptance. The executor stores its assistant message every dispatch iteration, but a rejected attempt must never reach durable history. So storeMessage buffers, and a turnOutputPipeline middleware writes the {user, answer} pair only after the gate has accepted it. It persists the original user question — not the directive-amended copy the gate folded into the live turn — so the corrective text stays ephemeral to the retry loop and never pollutes durable history or the UI. Failed attempts stay dispatch-local and vanish.
  • Memory is cheap; standing instructions are loaded weapons. Memory candidates are extracted with deterministic regex (no extra LLM call) and auto-stored — a wrong memory costs a few ignored tokens. A standing instruction changes every future turn, so it's never auto-stored: it's surfaced to the user for confirmation first. The friction is asymmetric because the blast radius is asymmetric. (ADK's TurnGate / ctx.waitFor() is the product seam for exactly this human-in-the-loop pause.)
  • OPFS is bytes; the app owns the schema. ADK's storage battery is a byte-level spool — read/write/delete. Ask ADK layers conversation manifests, per-message blobs, and memory/standing-instruction records on top. That's the right boundary: the battery shouldn't guess your product's schema.
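
The first bullet, persistence delayed until acceptance, reduces to a buffer with a single commit point. The sketch below assumes a much simpler shape than the real storeMessage and turnOutputPipeline wiring: PendingTurnBuffer and StoredTurn are hypothetical names, and durable history is modeled as a plain array.

```typescript
interface StoredTurn {
  user: string
  answer: string
}

class PendingTurnBuffer {
  private readonly originalQuestion: string
  private latestAnswer: string | null = null

  constructor(originalQuestion: string) {
    // Captured once, before any corrective directive is folded into the turn.
    this.originalQuestion = originalQuestion
  }

  // Called on every dispatch iteration. Overwrites rather than appends, so a
  // rejected attempt is replaced by the next one and never accumulates.
  storeAttempt(answer: string): void {
    this.latestAnswer = answer
  }

  // Called once, after the gate has accepted the answer.
  commit(durable: StoredTurn[]): void {
    if (this.latestAnswer === null) return
    durable.push({ user: this.originalQuestion, answer: this.latestAnswer })
    this.latestAnswer = null
  }
}

const history: StoredTurn[] = []
const pending = new PendingTurnBuffer('How do I take over ack?')
pending.storeAttempt('It just works.') // rejected by the gate, never persisted
const lengthAfterRejected = history.length // still 0
pending.storeAttempt('See [the executor callback](/assembly/byo-llm).')
pending.commit(history)
```

Durable history ends up with exactly one pair: the question as the user typed it and the answer that passed. The rejected attempt and the directive-amended prompt leave no trace.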

What it costs

Synthetic RAG is not "add search to chat." It's agreeing that the model is not the authority, and then paying the full tax of becoming the authority yourself: chunk boundaries, index freshness, query rewriting, HyDE drift, hybrid retrieval, reranking, sufficiency thresholds, token budgets, citation validation, retry policy, persistence hygiene, and knowing when to abstain. You give up the fantasy of one clever prompt and accept a pile of deterministic machinery instead.

So stop pretending the model is a brain. It isn't. It's an extraordinarily capable predictive algorithm tuned to autocomplete the next snippet of text from what it saw in training — and pretending it's anything more than that is the delusion at the root of every agent that doesn't work. @nhtio/adk gives you the seams to build the scaffolding that puts that one real capability to work; Ask ADK is the proof that hand-built rails around a small model hold. Every missing chunk, bad threshold, or stale vector is now your bug — which is the good news, because it's a bug you can actually fix, instead of a prompt you can only beg. You can't make the one non-deterministic component deterministic, so you make everything else deterministic and put it in a box the model can't break out of. That is the price of running a real documentation agent in a browser tab, on a 3B model, with no backend — and it's the closest thing to a deterministic agent you can build out of a part that, left alone, is anything but. That's why this is worth showing off: not because the model is magic, but because the magic was removed.