Scrapper

This is a featured battery

Scrapper breaks the stateless-tool mold the same way SearXNG does — it's a configured factory, not a constant — and reading the web is something nearly every agent eventually needs. That earns it its own page. If you read nothing else, read the next section: it's why this battery exists.

Why Scrapper, of all the ways to read a page?

An agent that needs to read the web has two bad options, and Scrapper is the third.

The first option is a renderless fetcher — raw fetch, httpx, readability-on-curl. Clean, stateless, fast. And blind: the moment a page renders its content with JavaScript — which is most of the interesting web now — your fetcher gets an empty shell, a loading spinner serialized to HTML, or a bot wall. You can't read what isn't in the initial document.

The second option is to drive a real browser — Playwright, Puppeteer — yourself. Now you can see the JS-rendered page. But you've dragged a browser's entire stateful machinery into your agent loop: persistent contexts, cookies, localStorage, a disk cache, logged-in sessions. State that leaks from one call into the next, cross-contaminates unrelated requests, and quietly carries credentials across turns you never meant to link. "Read this page" has become "manage a browser's lifecycle," and a tool the model can call arbitrarily many times across arbitrary untrusted URLs is exactly where you least want sticky state.

Scrapper threads the needle. It is a real headless browser — so it renders the JS-heavy pages a renderless tool can't — but it's exposed as a plain stateless HTTP call. Each request runs in a fresh incognito context (incognito defaults on), writes nothing to disk, stores no credentials, and shares no session with any other request. Ephemeral by construction. You get browser-grade reading power with none of the browser-grade statefulness — which is the precise property you want for a tool an agent reaches for again and again, against pages you don't control. This battery wraps that service; it doesn't try to reinvent it.

The two verbs

Scrapper does two things, so the battery is two factories (well, four — each verb has an async and a sync form, more on that below):

  • createScrapperArticleTool (GET /api/article): load a page, extract the readable article (title, byline, text, metadata).
  • createScrapperLinksTool (GET /api/links): load a page, collect its links, each as { url, text }.
typescript
import { createScrapperArticleTool } from '@nhtio/adk/batteries/tools/scrapper'
import { ToolRegistry } from '@nhtio/adk/common'

const article = await createScrapperArticleTool({
  instanceUrl: 'https://scrapper.example.org',
  headers: { 'X-API-Key': process.env.SCRAPPER_KEY! }, // auth to the instance
})

const registry = new ToolRegistry([article])

The model gets a scrapper_article tool whose only required argument is url. Everything else is optional and, unless you say otherwise, the model's to set.
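For orientation, each tool call boils down to one stateless GET against the instance. The sketch below shows how such a request URL could be assembled. `buildScrapperUrl` is a hypothetical helper, not part of the battery's API, and it assumes the page address travels as a `url` query parameter and that the instance is served from the host root.

```typescript
type ScrapperVerb = 'article' | 'links'

// Hypothetical helper: builds the GET URL for one stateless Scrapper call.
function buildScrapperUrl(
  instanceUrl: string,
  verb: ScrapperVerb,
  pageUrl: string,
  params: Record<string, string> = {},
): string {
  // Resolve the endpoint against the instance (assumes it lives at the host root).
  const endpoint = new URL(`/api/${verb}`, instanceUrl)
  // searchParams percent-encodes the page URL, so its own query string survives.
  endpoint.searchParams.set('url', pageUrl)
  for (const [key, value] of Object.entries(params)) {
    endpoint.searchParams.set(key, value)
  }
  return endpoint.toString()
}

const requestUrl = buildScrapperUrl(
  'https://scrapper.example.org',
  'article',
  'https://example.com/post?id=1',
)
```

Nothing in the request depends on an earlier call, which is what makes the tool safe for the model to invoke repeatedly.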

Why async? (and the sync escape hatch)

Same reason as the SearXNG battery: the artifact option accepts a dynamic import, and resolving it is async, so the factory is async. When you reference the artifact class directly (or don't set one), reach for createScrapperArticleToolSync / createScrapperLinksToolSync and skip the await. Everything else is identical; the sync variants simply reject an async artifact resolver (both at compile time and, defensively, at runtime).
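The async/sync split can be pictured with a small stand-alone sketch. Every name here (`JsonArtifactStandIn`, `resolveArtifact`, `resolveArtifactSync`) is hypothetical and only models the idea; the battery's real types and checks may differ.

```typescript
// Stand-in for an artifact class; the real classes live in the ADK.
class JsonArtifactStandIn {
  constructor(readonly body: string) {}
}

type Ctor = new (body: string) => unknown
type SyncResolver = () => Ctor
type AsyncResolver = () => Promise<Ctor | { default: Ctor }>

// Async path: awaits whatever the resolver returns, so a dynamic import works.
async function resolveArtifact(resolver: SyncResolver | AsyncResolver): Promise<Ctor> {
  const resolved = await resolver()
  // A dynamic import yields a module object; unwrap its default export.
  return typeof resolved === 'function' ? resolved : resolved.default
}

// Sync path: the parameter type excludes async resolvers at compile time,
// and the runtime check catches anything that was cast past the compiler.
function resolveArtifactSync(resolver: SyncResolver): Ctor {
  const resolved: unknown = resolver()
  if (typeof resolved !== 'function') {
    throw new TypeError('sync factory received an async artifact resolver')
  }
  return resolved as Ctor
}
```

The point of the sketch is that only the async path can wait for a promise, so a factory that must support dynamic imports has to be async itself.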

Two kinds of "headers" — don't conflate them

This trips people up, so it's worth being explicit. There are two header channels, and they point in opposite directions:

| | config.headers | the extra_http_headers param |
|---|---|---|
| Who sends it | the ADK → the Scrapper instance | the scraper's browser → the target site |
| What it's for | authenticating to your Scrapper (X-API-Key, Basic) | spoofing/auth at the page being scraped |
| Format | a headers object or a (sync/async) resolver | a string, 'K:v;K2:v2' |

typescript
await createScrapperArticleTool({
  instanceUrl: 'https://scrapper.example.org',
  headers: async () => ({ 'X-API-Key': await mintKey() }), // → the instance
  defaults: { extra_http_headers: 'Referer:https://example.com' }, // → the target site
})

config.headers takes a static object or a resolver — sync or async — that runs on every request, so a refreshing token is always fresh. Your headers merge over the tool's Accept: application/json default, and yours win.
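The merge described here can be sketched in a few lines. `resolveRequestHeaders` and `DEFAULT_HEADERS` are hypothetical names, and the sketch compares header names case-sensitively, which a real implementation would probably normalise.

```typescript
type HeaderMap = Record<string, string>
type HeaderSource = HeaderMap | (() => HeaderMap | Promise<HeaderMap>)

// Assumed default, taken from the description above.
const DEFAULT_HEADERS: HeaderMap = { Accept: 'application/json' }

// Called once per request: runs the resolver (if any) and lays the
// caller's headers over the defaults, so the caller's values win.
async function resolveRequestHeaders(source?: HeaderSource): Promise<HeaderMap> {
  const supplied = typeof source === 'function' ? await source() : source ?? {}
  return { ...DEFAULT_HEADERS, ...supplied }
}
```

Because the resolver runs inside the per-request path rather than at factory time, a token minted on each call never goes stale between requests.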

Per-parameter disposition: who controls each knob

Scrapper has a lot of knobs — cache, screenshot, timeout, wait_until, sleep, scroll_down, device, proxy_server, and more. For each one you decide who sets it, with three buckets:

| You put it in… | In the model's schema? | Sent to Scrapper? |
|---|---|---|
| fixed | No; the model can't touch it | Always, as your value |
| defaults | Yes, pre-filled | Your default, unless the model overrides |
| neither | Yes, optional | Only if the model sets it |

typescript
await createScrapperArticleTool({
  instanceUrl: 'https://scrapper.example.org',
  fixed: { proxy_server: 'http://corp-proxy:3128' }, // locked — model never sees it
  defaults: { wait_until: 'networkidle', timeout: 30000 }, // model may override
  // `screenshot`, `cache`, etc. left open for the model to decide per call
})

url is always a required model argument — pinning the page to scrape would defeat the point. And for the truly long tail — a Scrapper param we haven't modeled, or one a newer instance added — there's fixedQuery, a raw Record<string, string> of kebab-case wire params that are always sent and never shown to the model. That escape hatch is what keeps the battery generic: it works against any Scrapper instance, including versions newer than this code.
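One way to picture the three buckets plus fixedQuery is as a merge with a fixed precedence. This is a sketch under assumptions, not the battery's code: `modelVisibleKnobs`, `buildWireParams`, and the `KNOBS` subset are hypothetical, and the snake_case to kebab-case mapping is inferred from the description of fixedQuery as kebab-case wire params.

```typescript
type Params = Record<string, string | number | boolean>

interface Disposition {
  fixed?: Params
  defaults?: Params
  fixedQuery?: Record<string, string>
}

// Hypothetical subset of the knobs named above.
const KNOBS = ['cache', 'screenshot', 'timeout', 'wait_until', 'sleep', 'scroll_down', 'proxy_server']

// Knobs the model may see: everything except what `fixed` pins.
function modelVisibleKnobs(config: Disposition): string[] {
  const pinned = new Set(Object.keys(config.fixed ?? {}))
  return KNOBS.filter((knob) => !pinned.has(knob))
}

// Assumed mapping from config names to wire names.
const toWireName = (name: string): string => name.replace(/_/g, '-')

function buildWireParams(config: Disposition, modelArgs: Params): Record<string, string> {
  // Precedence, lowest to highest: defaults, model arguments, fixed.
  const merged: Params = { ...config.defaults, ...modelArgs, ...config.fixed }
  const wire: Record<string, string> = {}
  for (const [name, value] of Object.entries(merged)) {
    wire[toWireName(name)] = String(value)
  }
  // fixedQuery is already in wire form, so it is copied through untouched.
  return { ...wire, ...config.fixedQuery }
}
```

Placing `fixed` last in the merge means a pinned value wins even if a model argument with the same name somehow arrives.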

Output: shape and artifact

Like the SearXNG battery, output is configurable at two levels. resultFormat: 'normalized' (the default-ish, trimmed shape) or 'raw' (the full Scrapper JSON) can be pinned at the factory — which removes the format argument from the model's schema — or left 'either' so the model picks per call. The normalized article keeps { url, title, byline, excerpt, siteName, lang, length, publishedTime, date, textContent } (plus content/fullContent/screenshotUri when relevant) and drops the meta/query/resultUri noise; normalized links are { url, title, domain, date, links: [{ url, text }] }.
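As a rough illustration of the normalized article shape, normalization can be thought of as a field pick over the raw JSON. `normalizeArticle` is a hypothetical function written for this sketch, and it assumes the raw response already uses the same field names as the normalized one.

```typescript
// Fields the normalized article keeps, per the list above.
const ARTICLE_FIELDS = [
  'url', 'title', 'byline', 'excerpt', 'siteName', 'lang', 'length',
  'publishedTime', 'date', 'textContent',
  'content', 'fullContent', 'screenshotUri', // only present on some responses
] as const

type ArticleField = (typeof ARTICLE_FIELDS)[number]
type NormalizedArticle = Partial<Record<ArticleField, unknown>>

function normalizeArticle(raw: Record<string, unknown>): NormalizedArticle {
  const normalized: NormalizedArticle = {}
  for (const field of ARTICLE_FIELDS) {
    // Copy a field only when the instance actually returned a value for it;
    // anything not on the list (meta, query, resultUri, ...) is dropped.
    if (raw[field] !== undefined && raw[field] !== null) normalized[field] = raw[field]
  }
  return normalized
}
```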

The artifact resolver (default () => SpooledJsonArtifact) decides how output is wrapped — pass () => SpooledMarkdownArtifact (with an output stage that renders markdown into ctx.output) or a dynamic import. It's the open-resolver pattern described on the SearXNG page, shared verbatim.

Pipelines

Both verbs take input and output middleware pipelines: the same (ctx, next) onion the core runners use, built on @nhtio/middleware. Input stages mutate the outgoing url/params/headers, or call shortCircuit(result) to skip the fetch entirely (your cache-hit path); output stages reshape ctx.result, mutate ctx.raw, or set ctx.output verbatim.

typescript
await createScrapperArticleTool({
  instanceUrl: 'https://scrapper.example.org',
  outputPipeline: [
    async (ctx, next) => {
      // Trim a long article down to what the model needs.
      if (ctx.result.textContent && ctx.result.textContent.length > 4000) {
        ctx.result.textContent = ctx.result.textContent.slice(0, 4000) + '…'
      }
      await next()
    },
  ],
})

A fresh middleware runner is minted per invocation, so pipelines are safe to reuse across calls.
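For readers new to the onion model, here is a self-contained sketch of a (ctx, next) runner with a cache-hit stage. It is not @nhtio/middleware: `runPipeline`, `FetchCtx`, and `fakeFetch` are hypothetical, and the sketch models short-circuiting by simply not calling `next()`, which may differ from how the battery's shortCircuit helper works.

```typescript
type Next = () => Promise<void>
type Stage<C> = (ctx: C, next: Next) => Promise<void> | void

// Minimal onion runner: each stage may run code before and after `next()`.
async function runPipeline<C>(
  stages: Stage<C>[],
  ctx: C,
  core: (ctx: C) => Promise<void>,
): Promise<void> {
  const dispatch = async (index: number): Promise<void> => {
    if (index === stages.length) return core(ctx) // innermost layer
    await stages[index](ctx, () => dispatch(index + 1))
  }
  await dispatch(0)
}

interface FetchCtx { url: string; result?: string; trace: string[] }

const cache = new Map<string, string>([['https://example.com/a', 'cached body']])

const cacheStage: Stage<FetchCtx> = async (ctx, next) => {
  const hit = cache.get(ctx.url)
  if (hit !== undefined) {
    ctx.result = hit // cache hit: skip the fetch by not calling next()
    ctx.trace.push('cache-hit')
    return
  }
  await next()
  ctx.trace.push('after-fetch') // runs on the way back out of the onion
}

// Stand-in for the real HTTP call.
const fakeFetch = async (ctx: FetchCtx): Promise<void> => {
  ctx.trace.push('fetch')
  ctx.result = `body of ${ctx.url}`
}
```

Running `runPipeline([cacheStage], ctx, fakeFetch)` with a cached URL never reaches `fakeFetch`; with an uncached URL the stage wraps the fetch on both sides.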

From results to RAG: the web_retrieval glue

Reading a page is half the job; the other half is putting what you read where the model can use it. The shared web_retrieval module turns scrape (and search) results into Retrievable records:

typescript
import { scrapperArticleToRetrievable, storeRetrievables } from '@nhtio/adk/batteries/tools/web_retrieval'
import { Retrievable } from '@nhtio/adk/common'

// `article` is a normalized Scrapper article (the tool's output, or built in a pipeline).
const raw = scrapperArticleToRetrievable(article, { kind: 'web-article' })
await storeRetrievables(ctx, [raw], { retrievable: Retrievable })

Three things make this glue worth its own module rather than a copy-paste in each battery:

  • The converters are pure. scrapperArticleToRetrievable / scrapperLinksToRetrievables / searxngResultsToRetrievables return plain RawRetrievable data — they never construct a core class, so importing them costs nothing at runtime. The one function that does build a Retrievable, storeRetrievables, takes the constructor through a resolver (Retrievable, () => Retrievable, or a dynamic import).

  • Long pages don't get chunked — they get spooled. An article can be enormous, and Retrievable.content accepts a SpooledArtifact precisely for that: pass a spool hook and the converter hands it the artifact constructor it recommends for the content (an open resolver — markdown for a rendered article, JSON for raw, text otherwise), so the wrapped content keeps the right query tools and the model pages through it on demand instead of swallowing it whole. No chunker, no fixed window — the artifact is the chunking.

    typescript
    const raw = scrapperArticleToRetrievable(
      article,
      {
        asMarkdown: true,
        // Persist the bytes, wrap the reader — content stays out of the permanent heap.
        spool: (id, text, recommended) => {
          const reader = ctx.storeRetrievableBytes(id, text) // your ByteStore
          const Ctor = recommended() // the recommended SpooledArtifact subclass
          return new Ctor(reader)
        },
      },
      { markdown: () => SpooledMarkdownArtifact },
    )

  • Trust is declared, not sniffed. Web content defaults to trustTier: 'third-party-public', a deliberate constant for open-web data that is never inferred from the URL (the ADK forbids URL-based trust inference). Override it when you know better.

The downstream seams are Retrievable Glue and Bring your own retrieval; the trust-tier vocabulary and rendering live there.

When Scrapper says no

Gotchas worth knowing up front

  • scroll_down needs a positive sleep. Scrolling for lazy-loaded content only does anything if the browser waits afterward; scroll_down without sleep is a no-op (an upstream behavior, not ours).
  • resultUri / screenshotUri aren't guaranteed to match your instance. They can come back instance-relative, and http:// even when you called over HTTPS. Don't assume they share instanceUrl's scheme or host.
  • There's no health endpoint. Scrapper exposes /api/article and /api/links; there's no /ping to probe.
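The first two gotchas lend themselves to small defensive helpers. The functions below are hypothetical, not part of the battery, and the `/result/abc123` path is only an illustrative value.

```typescript
// Resolves a possibly instance-relative URI against the instance URL.
// Absolute URIs pass through unchanged, including an unexpected http:// scheme.
function resolveInstanceUri(uri: string, instanceUrl: string): string {
  return new URL(uri, instanceUrl).toString()
}

// True when the resolved URI has the same scheme, host, and port as the instance.
function sharesInstanceOrigin(uri: string, instanceUrl: string): boolean {
  return new URL(uri, instanceUrl).origin === new URL(instanceUrl).origin
}

// scroll_down only has an effect when paired with a positive sleep.
function scrollWillTakeEffect(params: { scroll_down?: number; sleep?: number }): boolean {
  return (params.scroll_down ?? 0) > 0 && (params.sleep ?? 0) > 0
}

const instance = 'https://scrapper.example.org'
const relative = resolveInstanceUri('/result/abc123', instance)
const foreign = sharesInstanceOrigin('http://localhost:3000/result/abc123', instance)
```

Whether to rewrite a mismatched origin or just surface it is a deployment decision; the sketch only detects the mismatch.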

Failures degrade gracefully: a non-OK response (Scrapper uses HTTP 422 for a missing/invalid url), a network error, a timeout (requestTimeoutMs, default 65s), or a thrown pipeline stage all come back as Error: strings the model can read, with Scrapper's { detail: [{ msg }] } body parsed into that message. Only two things throw: bad tool arguments (E_INVALID_TOOL_ARGS) and bad config, which throws E_INVALID_SCRAPPER_CONFIG at factory-call time.
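To make the error path concrete, here is a sketch of turning a Scrapper error body into a model-readable string. `formatScrapperError` is a hypothetical name, and the exact wording of the resulting message (including the `HTTP <status>` fallback) is an assumption rather than the battery's actual output.

```typescript
interface ScrapperErrorBody {
  detail?: Array<{ msg?: string }>
}

// Builds an "Error: ..." string from a non-OK response.
function formatScrapperError(status: number, body: unknown): string {
  const detail = (body as ScrapperErrorBody | null)?.detail
  // Collect every usable message from the detail array, if there is one.
  const messages = Array.isArray(detail)
    ? detail.map((entry) => entry.msg).filter((msg): msg is string => typeof msg === 'string')
    : []
  // Fall back to the status code when the body carries nothing readable.
  const reason = messages.length > 0 ? messages.join('; ') : `HTTP ${status}`
  return `Error: ${reason}`
}

const readable = formatScrapperError(422, { detail: [{ msg: 'field required' }] })
```

Returning a string instead of throwing keeps the agent loop alive: the model sees the failure and can retry with a corrected url.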

Where this sits in an assembly

It's a tool battery: call a factory, register the returned Tool, and the model can read the web. Pair it with SearXNG for the search-then-read loop, and with the web_retrieval glue above to land what you read in the turn's retrievables. See Tools batteries for the rest of the bundled tools and Bring your own tools for the Tool contract these factories produce.