---
url: 'https://adk.nht.io/batteries/tools/scrapper.md'
description: >-
  A web-extraction tool for any Scrapper instance — browser-grade page reading
  without browser-grade state. Custom-header auth, per-parameter disposition,
  async/sync factories, and RAG glue.
---

# Scrapper

## LLM summary — Scrapper battery

* Web-extraction tools for a [Scrapper](https://github.com/amerkurev/scrapper) instance (a headless-browser service), from `@nhtio/adk/batteries/tools/scrapper`. Two verbs: `/api/article` (readable article) and `/api/links` (page links). Each verb has an **async** factory (`Promise<Tool>`, accepts a dynamic-import `artifact` resolver) and a **sync** variant: `createScrapperArticleTool`/`createScrapperArticleToolSync`, `createScrapperLinksTool`/`createScrapperLinksToolSync`.
* Factories, not constants (per-deployment config) — MUST NOT be bulk-registered via `Object.values(batteries)`. Why Scrapper: a real headless browser (renders JS pages a plain fetch can't) exposed as a **stateless** HTTP call — fresh incognito context per request, no stored session/cookies/credentials. Ephemeral by construction.
* `config`: `instanceUrl` (required), `headers` (instance auth: static or sync/async resolver), `requestTimeoutMs` (default 65000), `resultFormat: 'normalized'|'raw'|'either'` (default `'either'`), `artifact` (open resolver, default `() => SpooledJsonArtifact`), `name`/`description`, and the disposition trio.
* **Per-parameter disposition**: `fixed` (pinned — sent always, removed from the model schema), `defaults` (model-overridable defaults), and any modeled param in neither is model-settable; `fixedQuery` is a raw kebab passthrough for un-modeled params (always sent, never model-visible). `url` is always a required model arg.
* **Two distinct "headers"**: `config.headers` authenticates to the Scrapper INSTANCE; the `extra_http_headers` PARAM (`'K:v;K2:v2'`) is what the scraper's browser sends to the TARGET site. Do not conflate.
* Model params (snake\_case → kebab wire): `url` (required), `cache`, `screenshot`, `incognito`, `timeout`, `wait_until` (load|domcontentloaded|networkidle|commit), `sleep`, `scroll_down` (needs sleep>0), `device`, `user_agent`, `extra_http_headers`, `proxy_server`; article adds `full_content`; links adds `text_len_threshold`/`words_threshold`. Plus `format` when `resultFormat: 'either'`.
* Article normalized → `{ url, title, byline, excerpt, siteName, lang, length, publishedTime, date, textContent, content?, fullContent?, screenshotUri? }`. Links normalized → `{ url, title, domain, date, links: [{ url, text }], screenshotUri? }`. `raw` → full Scrapper JSON.
* Input/output middleware pipelines (`@nhtio/middleware`, fresh runner per call, `shortCircuit` to skip the fetch). Errors degrade to `Error:` strings (parses `{detail:[{msg}]}`; missing url → HTTP 422); bad args → `E_INVALID_TOOL_ARGS`; bad config → `E_INVALID_SCRAPPER_CONFIG`.
* RAG glue (`@nhtio/adk/batteries/tools/web_retrieval`): `scrapperArticleToRetrievable(article, opts?, recommend?)`, `scrapperLinksToRetrievables(payload, opts?)`, `storeRetrievables(ctx, raws, { retrievable })`. Long article text → reader-backed `SpooledArtifact` via a `spool` hook (no chunker); `trustTier` defaults to `third-party-public`.
* Gotchas: `scroll_down` needs a positive `sleep`; `resultUri`/`screenshotUri` are instance-relative and may be `http://` even over HTTPS — don't assume they match `instanceUrl`.

::: tip This is a featured battery
Scrapper breaks the stateless-tool mold the same way [SearXNG](./searxng) does — it's a configured factory, not a constant — and reading the web is something nearly every agent eventually needs. That earns it its own page. If you read nothing else, read the next section: it's *why* this battery exists.
:::

## Why Scrapper, of all the ways to read a page?

An agent that needs to read the web has two bad options, and Scrapper is the third.

The first option is a **renderless fetcher** — raw `fetch`, `httpx`, readability-on-curl. Clean, stateless, fast. And blind: the moment a page renders its content with JavaScript — which is most of the interesting web now — your fetcher gets an empty shell, a loading spinner serialized to HTML, or a bot wall. You can't read what isn't in the initial document.

The second option is to **drive a real browser** — Playwright, Puppeteer — yourself. Now you can see the JS-rendered page. But you've dragged a browser's entire stateful machinery into your agent loop: persistent contexts, cookies, `localStorage`, a disk cache, logged-in sessions. State that leaks from one call into the next, cross-contaminates unrelated requests, and quietly carries credentials across turns you never meant to link. "Read this page" has become "manage a browser's lifecycle," and a tool the model can call arbitrarily many times across arbitrary untrusted URLs is exactly where you least want sticky state.

[Scrapper](https://github.com/amerkurev/scrapper) threads the needle. It *is* a real headless browser — so it renders the JS-heavy pages a renderless tool can't — but it's exposed as a plain **stateless HTTP call**. Each request runs in a fresh incognito context (`incognito` defaults on), writes nothing to disk, stores no credentials, and shares no session with any other request. **Ephemeral by construction.** You get browser-grade reading power with none of the browser-grade statefulness — which is the precise property you want for a tool an agent reaches for again and again, against pages you don't control. This battery wraps that service; it doesn't try to reinvent it.

## The two verbs

Scrapper does two things, so the battery is two factories (well, four — each verb has an async and a sync form, more on that below):

* **`createScrapperArticleTool`** → `GET /api/article`: load a page, extract the readable article (title, byline, text, metadata).
* **`createScrapperLinksTool`** → `GET /api/links`: load a page, collect its links — each `{ url, text }`.

```typescript
import { createScrapperArticleTool } from '@nhtio/adk/batteries/tools/scrapper'
import { ToolRegistry } from '@nhtio/adk/common'

const article = await createScrapperArticleTool({
  instanceUrl: 'https://scrapper.example.org',
  headers: { 'X-API-Key': process.env.SCRAPPER_KEY! }, // auth to the instance
})

const registry = new ToolRegistry([article])
```

The model gets a `scrapper_article` tool whose only required argument is `url`. Everything else is optional and, unless you say otherwise, the model's to set.

::: info Why async? (and the sync escape hatch)
Same reason as the SearXNG battery: the `artifact` option accepts a **dynamic import**, and resolving it is async, so the factory is async. When you reference the artifact class directly (or don't set one), reach for `createScrapperArticleToolSync` / `createScrapperLinksToolSync` and skip the `await`. Everything else is identical; the sync variants simply reject an async `artifact` resolver (both at compile time and, defensively, at runtime).
:::
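The sync/async split can be pictured with a small stand-in. This is a sketch of the idea only: `ArtifactCtor`, `resolveArtifactAsync`, and `resolveArtifactSync` are hypothetical names, and the `TypeError` is a placeholder for whatever the battery actually throws.

```typescript
// Hypothetical stand-in types; the battery's real resolver types are not shown on this page.
type ArtifactCtor = new (reader: unknown) => object
type ArtifactResolver = () => ArtifactCtor | Promise<ArtifactCtor>

function isPromiseLike(value: unknown): value is PromiseLike<unknown> {
  return (
    (typeof value === 'object' || typeof value === 'function') &&
    value !== null &&
    typeof (value as { then?: unknown }).then === 'function'
  )
}

// Async path: awaiting covers both a direct class and a dynamic import.
async function resolveArtifactAsync(resolver: ArtifactResolver): Promise<ArtifactCtor> {
  return await resolver()
}

// Sync path: the narrower parameter type rejects `async () => ...` at compile time,
// and the runtime check catches callers that cast their way past the types.
function resolveArtifactSync(resolver: () => ArtifactCtor): ArtifactCtor {
  const value = resolver()
  if (isPromiseLike(value)) {
    throw new TypeError('sync factory received an async artifact resolver')
  }
  return value
}
```

The point of the sketch is the shape of the contract: the async form can always accept a sync resolver, but not the other way around.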

## Two kinds of "headers" — don't conflate them

This trips people up, so it's worth being explicit. There are two header channels, and they point in opposite directions:

| | `config.headers` | the `extra_http_headers` **param** |
| --- | --- | --- |
| Who sends it | the ADK → the **Scrapper instance** | the scraper's browser → the **target site** |
| What it's for | authenticating *to your Scrapper* (`X-API-Key`, Basic) | spoofing/auth *at the page being scraped* |
| Format | a headers object or a (sync/async) resolver | a string, `'K:v;K2:v2'` |

```typescript
// `mintKey()` stands in for your own token-minting helper.
await createScrapperArticleTool({
  instanceUrl: 'https://scrapper.example.org',
  headers: async () => ({ 'X-API-Key': await mintKey() }), // → the instance
  defaults: { extra_http_headers: 'Referer:https://example.com' }, // → the target site
})
```

`config.headers` takes a static object or a resolver — sync or async — that runs on every request, so a refreshing token is always fresh. Your headers merge over the tool's `Accept: application/json` default, and yours win.
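To make the two channels concrete, here is a minimal sketch of each. Both helpers are hypothetical (they are not exports of the battery): one mimics the "defaults first, yours win" merge for instance headers, the other builds the `'K:v;K2:v2'` string for the target site.

```typescript
type HeaderMap = Record<string, string>
type HeaderSource = HeaderMap | (() => HeaderMap | Promise<HeaderMap>)

// Hypothetical helper: resolve the instance headers for ONE request.
async function resolveInstanceHeaders(source?: HeaderSource): Promise<HeaderMap> {
  // A resolver is called on every request; a static object is used as-is.
  const own = typeof source === 'function' ? await source() : (source ?? {})
  // Default first, caller's headers last, so the caller wins on a clash.
  return { Accept: 'application/json', ...own }
}

// Hypothetical helper: build the 'K:v;K2:v2' string for `extra_http_headers`.
function formatExtraHttpHeaders(headers: HeaderMap): string {
  return Object.entries(headers)
    .map(([key, value]) => `${key}:${value}`)
    .join(';')
}
```

Two caveats about the sketch itself: the object spread is case-sensitive (`accept` would not replace `Accept`), and a header value containing `;` would be ambiguous in the joined string.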

## Per-parameter disposition: who controls each knob

Scrapper has a lot of knobs — `cache`, `screenshot`, `timeout`, `wait_until`, `sleep`, `scroll_down`, `device`, `proxy_server`, and more. For each one you decide *who* sets it, with three buckets:

| You put it in… | In the model's schema? | Sent to Scrapper? |
| --- | --- | --- |
| `fixed` | No — the model can't touch it | Always, as your value |
| `defaults` | Yes, pre-filled | Your default, unless the model overrides |
| neither | Yes, optional | Only if the model sets it |

```typescript
await createScrapperArticleTool({
  instanceUrl: 'https://scrapper.example.org',
  fixed: { proxy_server: 'http://corp-proxy:3128' }, // locked — model never sees it
  defaults: { wait_until: 'networkidle', timeout: 30000 }, // model may override
  // `screenshot`, `cache`, etc. left open for the model to decide per call
})
```

`url` is always a required model argument — pinning the page to scrape would defeat the point. And for the truly long tail — a Scrapper param we haven't modeled, or one a newer instance added — there's `fixedQuery`, a raw `Record<string, string>` of kebab-case wire params that are always sent and never shown to the model. That escape hatch is what keeps the battery **generic**: it works against any Scrapper instance, including versions newer than this code.
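The three buckets plus `fixedQuery` amount to a merge with a fixed precedence. The sketch below is one plausible reading, not the battery's code: `buildQuery` is a hypothetical helper, values are serialized with `String(...)` as an assumption, and applying `fixedQuery` last is also an assumption.

```typescript
type ParamValue = string | number | boolean
type Params = Record<string, ParamValue>

interface Disposition {
  fixed?: Params
  defaults?: Params
  fixedQuery?: Record<string, string>
}

// snake_case (model-facing) -> kebab-case (wire), e.g. wait_until -> wait-until
const toKebab = (name: string): string => name.replace(/_/g, '-')

function buildQuery(config: Disposition, modelArgs: Params): URLSearchParams {
  // Later spreads win: defaults < model args < fixed.
  const merged: Params = {
    ...(config.defaults ?? {}),
    ...modelArgs,
    ...(config.fixed ?? {}),
  }

  const query = new URLSearchParams()
  for (const [name, value] of Object.entries(merged)) {
    query.set(toKebab(name), String(value))
  }
  // Raw passthrough: already kebab-case, always sent (assumed to be applied last).
  for (const [name, value] of Object.entries(config.fixedQuery ?? {})) {
    query.set(name, value)
  }
  return query
}
```

In the real tool a pinned parameter never reaches the model's schema, so the model cannot supply it at all; spreading `fixed` last simply makes the sketch safe even if it did.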

## Output: shape and artifact

Like the SearXNG battery, output is configurable at two levels. `resultFormat` can be **pinned** at the factory to `'normalized'` (the trimmed shape) or `'raw'` (the full Scrapper JSON), which removes the `format` argument from the model's schema, or left at the default `'either'` so the model picks per call. The normalized article keeps `{ url, title, byline, excerpt, siteName, lang, length, publishedTime, date, textContent }` (plus `content`/`fullContent`/`screenshotUri` when relevant) and drops the `meta`/`query`/`resultUri` noise; normalized links are `{ url, title, domain, date, links: [{ url, text }] }`.
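Conceptually, article normalization is an allowlist over the raw payload. A rough sketch, assuming fields are copied by name and that missing or `null` values are simply omitted (the battery may treat them differently); `pickNormalizedArticle` is a hypothetical name:

```typescript
// Field names are the ones listed above; everything else is dropped.
const ARTICLE_KEYS = [
  'url', 'title', 'byline', 'excerpt', 'siteName', 'lang', 'length',
  'publishedTime', 'date', 'textContent',
  'content', 'fullContent', 'screenshotUri',
] as const

type ArticleKey = (typeof ARTICLE_KEYS)[number]

function pickNormalizedArticle(
  raw: Record<string, unknown>,
): Partial<Record<ArticleKey, unknown>> {
  const out: Partial<Record<ArticleKey, unknown>> = {}
  for (const key of ARTICLE_KEYS) {
    const value = raw[key]
    // Assumption: absent and null fields are left out rather than copied.
    if (value !== undefined && value !== null) out[key] = value
  }
  return out
}
```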

The `artifact` resolver (default `() => SpooledJsonArtifact`) decides how output is wrapped — pass `() => SpooledMarkdownArtifact` (with an output stage that renders markdown into `ctx.output`) or a dynamic import. It's the open-resolver pattern described on the [SearXNG page](./searxng), shared verbatim.

## Pipelines

Both verbs take input and output middleware pipelines: the same `(ctx, next)` onion the core runners use, built on `@nhtio/middleware`. Input stages mutate the outgoing `url`/`params`/`headers`, or call `shortCircuit(result)` to skip the fetch entirely (your cache-hit path); output stages reshape `ctx.result`, mutate `ctx.raw`, or set `ctx.output` verbatim.

```typescript
await createScrapperArticleTool({
  instanceUrl: 'https://scrapper.example.org',
  outputPipeline: [
    async (ctx, next) => {
      // Trim a long article down to what the model needs.
      if (ctx.result.textContent && ctx.result.textContent.length > 4000) {
        ctx.result.textContent = ctx.result.textContent.slice(0, 4000) + '…'
      }
      await next()
    },
  ],
})
```

A fresh middleware runner is minted per invocation, so pipelines are safe to reuse across calls.
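If the `(ctx, next)` onion is unfamiliar, the ordering is easiest to see in a toy runner. This is a hypothetical stand-in written for illustration, not the `@nhtio/middleware` API: `runOnion` and its `core` callback are invented names.

```typescript
type Next = () => Promise<void>
type Stage<C> = (ctx: C, next: Next) => Promise<void> | void

// Toy runner: each stage wraps everything after it; `core` sits at the center.
async function runOnion<C>(
  stages: Stage<C>[],
  ctx: C,
  core: (ctx: C) => Promise<void> | void,
): Promise<void> {
  const dispatch = async (index: number): Promise<void> => {
    const stage = stages[index]
    if (!stage) {
      // Past the last stage: run the wrapped work (for the tool, the fetch).
      await core(ctx)
      return
    }
    await stage(ctx, () => dispatch(index + 1))
  }
  await dispatch(0)
}
```

Code before `await next()` runs on the way in, code after it runs on the way out, and a stage that returns without calling `next()` keeps everything inside it from running. Because `dispatch` closes over nothing but its arguments, each call to `runOnion` starts from a clean slate, which is the same property the per-invocation runner gives the battery.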

## From results to RAG: the `web_retrieval` glue

Reading a page is half the job; the other half is putting what you read where the model can use it. The shared `web_retrieval` module turns scrape (and search) results into `Retrievable` records:

```typescript
import { scrapperArticleToRetrievable, storeRetrievables } from '@nhtio/adk/batteries/tools/web_retrieval'
import { Retrievable } from '@nhtio/adk/common'

// `article` is a normalized Scrapper article (the tool's output, or built in a pipeline).
const raw = scrapperArticleToRetrievable(article, { kind: 'web-article' })
await storeRetrievables(ctx, [raw], { retrievable: Retrievable })
```

Three things make this glue worth its own module rather than a copy-paste in each battery:

* **The converters are pure.** `scrapperArticleToRetrievable` / `scrapperLinksToRetrievables` / `searxngResultsToRetrievables` return plain `RawRetrievable` data — they never construct a core class, so importing them costs nothing at runtime. The one function that *does* build a `Retrievable`, `storeRetrievables`, takes the constructor through a **resolver** (`Retrievable`, `() => Retrievable`, or a dynamic import).

* **Long pages don't get chunked — they get spooled.** An article can be enormous, and `Retrievable.content` accepts a `SpooledArtifact` precisely for that: pass a `spool` hook and the converter hands it the artifact constructor it *recommends* for the content (an open resolver — markdown for a rendered article, JSON for raw, text otherwise), so the wrapped content keeps the right query tools and the model pages through it on demand instead of swallowing it whole. No chunker, no fixed window — the artifact *is* the chunking.

  ```typescript
  const raw = scrapperArticleToRetrievable(
    article,
    {
      asMarkdown: true,
      // Persist the bytes, wrap the reader — content stays out of the permanent heap.
      spool: (id, text, recommended) => {
        const reader = ctx.storeRetrievableBytes(id, text) // your ByteStore
        const Ctor = recommended() // the recommended SpooledArtifact subclass
        return new Ctor(reader)
      },
    },
    { markdown: () => SpooledMarkdownArtifact },
  )
  ```

* **Trust is declared, not sniffed.** Web content defaults to `trustTier: 'third-party-public'` — a deliberate constant for open-web data, never inferred from the URL (the ADK forbids URL-based trust inference). Override it when you know better.

The downstream seams are [Retrievable Glue](../vector/retrievable) and [Bring your own retrieval](../../assembly/byo-retrieval); the trust-tier vocabulary and rendering live there.

## When Scrapper says no

::: warning Gotchas worth knowing up front

* **`scroll_down` needs a positive `sleep`.** Scrolling for lazy-loaded content only does anything if the browser waits afterward; `scroll_down` without `sleep` is a no-op (an upstream behavior, not ours).
* **`resultUri` / `screenshotUri` aren't guaranteed to match your instance.** They can come back instance-relative, and `http://` even when you called over HTTPS. Don't assume they share `instanceUrl`'s scheme or host.
* **There's no health endpoint.** Scrapper exposes `/api/article` and `/api/links`; there's no `/ping` to probe.
:::
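The first two gotchas are cheap to guard against in your own code. A small sketch using the standard `URL` class; the helper names and the `/screenshot/abc` path are illustrative, not part of the battery or of Scrapper's documented routes.

```typescript
// Resolve a possibly instance-relative URI; absolute input is kept exactly as returned.
function resolveAgainstInstance(uri: string, instanceUrl: string): URL {
  return new URL(uri, instanceUrl)
}

// True only when scheme, host, and port all match the instance.
function sharesInstanceOrigin(uri: URL, instanceUrl: string): boolean {
  return uri.origin === new URL(instanceUrl).origin
}

// `scroll_down` only has an effect when paired with a positive `sleep`.
function scrollDownTakesEffect(args: { scroll_down?: number; sleep?: number }): boolean {
  return (args.scroll_down ?? 0) > 0 && (args.sleep ?? 0) > 0
}
```

Checking the origin before fetching a screenshot lets you decide deliberately whether to follow an `http://` link or rewrite it, rather than discovering the mismatch as a mixed-content or connection error.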

Failures degrade gracefully: a non-OK response (Scrapper uses HTTP `422` for a missing/invalid `url`), a network error, a timeout (`requestTimeoutMs`, default 65s), or a thrown pipeline stage all come back as `Error:` strings the model can read, with Scrapper's `{ detail: [{ msg }] }` body parsed into that message. Only two things throw: bad *arguments* (`E_INVALID_TOOL_ARGS`) and bad *config* (`E_INVALID_SCRAPPER_CONFIG`), the latter at factory-call time.
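As a rough picture of the error-string path, here is a sketch that turns a status code and response body into a readable message. `describeFailure` is a hypothetical helper and the exact wording of the string is an assumption; only the `Error:` prefix and the `{ detail: [{ msg }] }` shape come from the text above.

```typescript
function describeFailure(status: number, bodyText: string): string {
  let messages: string[] = []
  try {
    const body: unknown = JSON.parse(bodyText)
    const detail = (body as { detail?: unknown } | null)?.detail
    if (Array.isArray(detail)) {
      // Collect every string `msg`; ignore entries that do not carry one.
      messages = detail
        .map((entry) => (entry as { msg?: unknown } | null)?.msg)
        .filter((msg): msg is string => typeof msg === 'string')
    }
  } catch {
    // Body was not JSON (an HTML error page, say); fall back to a generic reason.
  }
  const reason = messages.length > 0 ? messages.join('; ') : 'request failed'
  return `Error: Scrapper responded with HTTP ${status}: ${reason}`
}
```

Returning a string instead of throwing keeps the failure inside the conversation, where the model can react to it (retry with a different `url`, for instance) rather than aborting the turn.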

## Where this sits in an assembly

It's a tool battery: call a factory, register the returned `Tool`, and the model can read the web. Pair it with [SearXNG](./searxng) for the search-then-read loop, and with the `web_retrieval` glue above to land what you read in the turn's retrievables. See [Tools batteries](../../assembly/batteries-tools) for the rest of the bundled tools and [Bring your own tools](../../assembly/byo-tools) for the `Tool` contract these factories produce.
