Media
Primitives covers the eight-primitive overview.
A Media is a typed handle to a binary asset — an image, an audio clip, a video, a document — that the loop can hand to a tool, a message, or a provider without ever having to inline the bytes into a string. Every other primitive on this page carries text; Media is the one that carries bytes. It rides on two surfaces: Message.attachments (the dialogue surface just introduced — a human drops in a screenshot, a model returns generated audio) and ToolCall.results (the action surface, introduced later in the page — a tool returns an image or a PDF the provider can render natively). It is the primitive every modern provider's native image/audio/document content block is asking for, and it is the alternative to the two unhappy paths that exist without it: base64-encoding bytes into a Tokenizable and lying to the model about what is in the buffer, or wrapping bytes in a SpooledArtifact subclass and surfacing handle tools — which works fine for documents the model wants to query, and is wasteful for an image the provider can render inline.
Media vs. SpooledArtifact — pick by what the model is doing with it
Use Media when the provider can render it natively (image/audio/document content blocks). Use SpooledArtifact when the model needs to work with the content through handle tools — grep a log, page through a JSON tree, query a Markdown document by heading. Media is not a strict upgrade over artifacts; it's a different silo for a different job.
Media is dual-peer on purpose. It is silo-peer to Tokenizable: it sits in the ToolCall.results slot alongside Tokenizable and SpooledArtifact as one of the three shapes a result can take, and the executor renders it through its own provider-specific content block (an OpenAI Chat Completions image_url, an input_audio block, a file block; other providers use their own shapes). It is also handle-peer to SpooledArtifact: the bytes are not held on the primitive itself but reached through a MediaReader contract — the framework owns the contract, the implementor owns the storage backend (in-memory buffer, OPFS file, S3 object, signed URL, whatever the case demands). Same posture, tuned for opaque binary streaming rather than line-indexed text. The two reader contracts are deliberately disjoint: there is no useful notion of "the third line of a JPEG" and no useful notion of "the byte-stream of a Markdown grep result," so the framework refuses to overload either reader with the other shape's surface.
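As a rough illustration of the two peerings, the sketch below models the three result shapes as a tagged union and gives each handle its own reader contract. Everything in it is an assumption made for the example: the `shape` tag, the `ArtifactReader` members (`lineCount`, `readLines`), `describeResult`, and `inMemoryReader` are hypothetical names, and the framework's real types may look quite different.

```typescript
// Hypothetical sketch; none of these type names or members are confirmed ADK API.

// Byte-oriented contract: opaque binary streaming, no notion of lines.
interface MediaReader {
  stream(): AsyncIterable<Uint8Array>;
}

// Text-oriented contract: line-indexed access, no notion of a byte stream.
interface ArtifactReader {
  lineCount(): number;
  readLines(start: number, end: number): string[];
}

// The three shapes a tool result can take, tagged so a renderer can dispatch.
type ToolResult =
  | { shape: "tokenizable"; text: string }
  | { shape: "artifact"; reader: ArtifactReader }
  | { shape: "media"; mimeType: string; reader: MediaReader };

// Picks a rendering path per shape. The `never` assignment makes the compiler
// reject this function if a fourth shape is added without a branch.
function describeResult(result: ToolResult): string {
  switch (result.shape) {
    case "tokenizable":
      return "inline text";
    case "artifact":
      return `handle tools over ${result.reader.lineCount()} lines`;
    case "media":
      return `native content block for ${result.mimeType}`;
    default: {
      const unreachable: never = result;
      return unreachable;
    }
  }
}

// Storage backends plug in behind the contract; here, an in-memory buffer.
function inMemoryReader(bytes: Uint8Array): MediaReader {
  return {
    async *stream() {
      yield bytes;
    },
  };
}

const logLines = ["boot", "ready", "shutdown"];
const logReader: ArtifactReader = {
  lineCount: () => logLines.length,
  readLines: (start, end) => logLines.slice(start, end),
};

const results: ToolResult[] = [
  { shape: "tokenizable", text: "done" },
  { shape: "artifact", reader: logReader },
  { shape: "media", mimeType: "image/png", reader: inMemoryReader(Uint8Array.from([137, 80, 78, 71])) },
];

const summary = results.map(describeResult);
```

Because the two reader interfaces share no members, a renderer cannot accidentally ask a JPEG for its third line; the type checker rejects the call before it runs.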
Bytes are lazy. A Media instance passed through middleware, persisted via a storage hook, or serialised onto a telemetry event never materialises its bytes unless someone calls Media.stream, Media.asBytes, or Media.asBase64. Media.toJSON emits a metadata-only record (id, kind, mimeType, filename, source, trustTier, modalityHazard, stash) so naive event log serialisation does the safe thing by default. Render code that needs the buffer drains the stream once at the wrap site; render code that can forward the stream pipes it through without buffering; logging code never reads bytes at all.
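A minimal sketch of that laziness, assuming a reader-backed handle whose `toJSON` returns metadata only. `SketchMedia`, `makeMedia`, and `countingReader` are hypothetical names, and the metadata record is trimmed to three of the fields listed above to keep the example short.

```typescript
// Hypothetical sketch; field names follow the prose, everything else is assumed.
interface MediaReader {
  stream(): AsyncIterable<Uint8Array>;
}

interface SketchMedia {
  id: string;
  kind: "image" | "audio" | "video" | "document";
  mimeType: string;
  stream(): AsyncIterable<Uint8Array>;
  asBytes(): Promise<Uint8Array>;
  toJSON(): { id: string; kind: string; mimeType: string };
}

// A reader that counts how often its bytes are actually opened.
function countingReader(bytes: Uint8Array) {
  const state = { opens: 0 };
  const reader: MediaReader = {
    stream() {
      state.opens += 1;
      return (async function* () {
        yield bytes;
      })();
    },
  };
  return { reader, state };
}

function makeMedia(
  id: string,
  kind: SketchMedia["kind"],
  mimeType: string,
  reader: MediaReader,
): SketchMedia {
  return {
    id,
    kind,
    mimeType,
    // Forwarding path: hand the stream on without buffering it.
    stream: () => reader.stream(),
    // Buffering path: drain the stream once into a single array.
    async asBytes() {
      const chunks: Uint8Array[] = [];
      let total = 0;
      for await (const chunk of reader.stream()) {
        chunks.push(chunk);
        total += chunk.length;
      }
      const out = new Uint8Array(total);
      let offset = 0;
      for (const chunk of chunks) {
        out.set(chunk, offset);
        offset += chunk.length;
      }
      return out;
    },
    // Metadata only: JSON.stringify calls this and never sees the reader.
    toJSON: () => ({ id, kind, mimeType }),
  };
}

const { reader, state } = countingReader(Uint8Array.from([0xff, 0xd8, 0xff]));
const media = makeMedia("m1", "image", "image/jpeg", reader);

// A naive event logger serialises the whole event, media included.
const logged = JSON.stringify({ event: "tool.result", media });
const opensAfterLogging = state.opens;
```

In this sketch `opensAfterLogging` stays at zero: the logger sees the id, kind, and MIME type, and the reader is only opened when a renderer calls `stream` or `asBytes`.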
The construction contract is opinionated about two fields that the framework refuses to default. Media.trustTier (Media.MediaTrustTier: 'first-party' / 'third-party-public' / 'third-party-private') mirrors Retrievable.trustTier: same vocabulary, same question, same answer. Where did these bytes come from, and how authoritatively should the model treat them? Media.modalityHazard (Media.MediaModalityHazard: 'inert' / 'extractable-instructions' / 'opaque-perceptual') is the second axis, and it is new. There is no text-side equivalent, because text has one extraction path (read the string) and media has many (OCR, ASR transcription, frame analysis, embedded-text extraction, pixel-level vision encoding). A third-party-public JPEG is materially more dangerous than a third-party-public paragraph of text because the model itself extracts instructions during perceptual decoding, and no string-level filter can see them. Both fields are required at construction; the bare constructor refuses to guess, and the ergonomic factories (Media.userAttachment, Media.toolGenerated, Media.retrievedPublic, Media.retrievedPrivate) force the labelling decision at the call site without becoming defaults on the constructor itself. See Trust tiers → Media for how the two axes compose at render time.
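The "no defaults" posture can be sketched as follows. Only the two vocabularies come from the text above; `createMedia` stands in for the bare constructor, and `retrievedPublic` stands in for the factory of the same name under the assumption that a factory fixes the trust tier from its name while the caller still states the hazard. The real factories may divide the labelling differently.

```typescript
// Hypothetical sketch; only the tier and hazard vocabularies come from the prose.
type MediaTrustTier = "first-party" | "third-party-public" | "third-party-private";
type MediaModalityHazard = "inert" | "extractable-instructions" | "opaque-perceptual";

const TRUST_TIERS: readonly string[] = ["first-party", "third-party-public", "third-party-private"];
const HAZARDS: readonly string[] = ["inert", "extractable-instructions", "opaque-perceptual"];

interface MediaInit {
  id: string;
  mimeType: string;
  trustTier: MediaTrustTier; // required: no `?`, no default
  modalityHazard: MediaModalityHazard; // required: no `?`, no default
}

// Stand-in for the bare constructor. The types reject a missing label at
// compile time; the runtime check covers untyped callers.
function createMedia(init: MediaInit): Readonly<MediaInit> {
  if (TRUST_TIERS.indexOf(init.trustTier) === -1) {
    throw new TypeError("trustTier is required and must be a known tier");
  }
  if (HAZARDS.indexOf(init.modalityHazard) === -1) {
    throw new TypeError("modalityHazard is required and must be a known hazard");
  }
  return Object.freeze({ ...init });
}

// Stand-in for a factory. Assumption: the factory name fixes the trust tier,
// and the caller still has to name the modality hazard explicitly.
function retrievedPublic(init: Omit<MediaInit, "trustTier">): Readonly<MediaInit> {
  return createMedia({ ...init, trustTier: "third-party-public" });
}

const scraped = retrievedPublic({
  id: "m2",
  mimeType: "image/jpeg",
  modalityHazard: "extractable-instructions",
});

// An untyped caller that omits both labels fails loudly instead of
// silently receiving a default.
let rejected = false;
try {
  createMedia({ id: "m3", mimeType: "image/png" } as unknown as MediaInit);
} catch {
  rejected = true;
}
```

The design point is that the labelling decision stays visible at the call site: a reviewer reading `retrievedPublic(...)` sees the tier in the function name and the hazard in the arguments.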
Media.stash is the register middleware writes into when it wants to leave a text fallback on the media for consumers that cannot decode the bytes natively: a logger summarising tool output, a battery that does not natively support the modality, a downstream agent running against a text-only model. Each entry stores a { value, trustTier, derivedFromMedia? } triple, so derived text (a caption, an OCR pass, a transcript) carries its own trust tier — routed through its own envelope at render time, independent of the parent media's tier. The framework reserves no keys on stash; which keys a battery looks up for its fallback path is documented in the battery itself, not in the primitive.
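A small sketch of how a text-only consumer might read such an entry. The `Map` container, the key names (`"ocr"`, `"operator-note"`), the boolean type of `derivedFromMedia`, and the `textFallback` helper are all assumptions for illustration; the primitive reserves no keys, so a real battery documents its own.

```typescript
// Hypothetical sketch; the stash container, key names, and the boolean type
// of derivedFromMedia are assumptions, not framework-defined values.
type TrustTier = "first-party" | "third-party-public" | "third-party-private";

interface StashEntry {
  value: string;
  trustTier: TrustTier;
  derivedFromMedia?: boolean;
}

interface SketchMedia {
  id: string;
  trustTier: TrustTier;
  stash: Map<string, StashEntry>;
}

// Text fallback for a consumer that cannot decode the bytes. The returned
// tier is the entry's own, never the parent media's.
function textFallback(
  media: SketchMedia,
  key: string,
): { text: string; trustTier: TrustTier } | undefined {
  const entry = media.stash.get(key);
  return entry ? { text: entry.value, trustTier: entry.trustTier } : undefined;
}

const screenshot: SketchMedia = {
  id: "m4",
  trustTier: "third-party-public",
  stash: new Map(),
};

// Text extracted from the pixels keeps an untrusted tier of its own.
screenshot.stash.set("ocr", {
  value: "SALE ENDS FRIDAY",
  trustTier: "third-party-public",
  derivedFromMedia: true,
});

// Text written by the operator about the asset carries a different tier,
// even though it sits on the same media.
screenshot.stash.set("operator-note", {
  value: "Captured from the public landing page.",
  trustTier: "first-party",
});

const ocr = textFallback(screenshot, "ocr");
const note = textFallback(screenshot, "operator-note");
const missing = textFallback(screenshot, "transcript");
```

Each lookup returns the text together with the tier recorded on that entry, so a renderer can route the OCR text and the operator note through different envelopes without consulting `screenshot.trustTier`.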
The fallback lives in stash rather than in a typed description?: string field on Media, and the reason is a second-order question: who writes it? A typed field forces an answer at construction, and every answer is wrong. The handler? Then every handler returning a Media captions on the synchronous path, spending latency and tokens on a fallback that no downstream consumer may ever read. The framework? Then the ADK is in the image-captioning business, which is not a contract it should own. The renderer? Then the fallback is recomputed once per render, on the hot path, with nowhere to cache or share it. stash dissolves the question by moving it out of band: an output middleware over ToolCall.results detects Media, runs whatever captioner you use (OCR, vision-caption, ASR), and writes the result once, only when a consumer needs it, in the same pipeline that already owns authorisation, redaction, and telemetry. The "describe the asset" policy belongs there, not baked into the primitive.
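The out-of-band approach might look roughly like the sketch below. The middleware signature, the `"caption"` key, and the `Captioner` contract are hypothetical, and the captioner is kept synchronous only to keep the example short; a real OCR or vision captioner would almost certainly be asynchronous.

```typescript
// Hypothetical sketch; the middleware signature, the "caption" key, and the
// captioner contract are assumptions, not ADK API.
type TrustTier = "first-party" | "third-party-public" | "third-party-private";

interface StashEntry {
  value: string;
  trustTier: TrustTier;
  derivedFromMedia?: boolean;
}

interface SketchMedia {
  id: string;
  mimeType: string;
  trustTier: TrustTier;
  stash: Map<string, StashEntry>;
}

type ToolResult = string | SketchMedia;

// Any captioner fits: OCR, vision caption, ASR. It also decides which tier
// the derived text should carry.
type Captioner = (media: SketchMedia) => { text: string; trustTier: TrustTier };

const CAPTION_KEY = "caption"; // chosen by this sketch, not reserved by the primitive

function isMedia(result: ToolResult): result is SketchMedia {
  return typeof result !== "string";
}

// Output middleware over tool results: captions each media at most once,
// and only when the downstream consumer needs a text fallback.
function captionMiddleware(
  results: ToolResult[],
  captioner: Captioner,
  consumerNeedsText: boolean,
): ToolResult[] {
  if (!consumerNeedsText) return results;
  for (const result of results) {
    if (!isMedia(result) || result.stash.has(CAPTION_KEY)) continue;
    const { text, trustTier } = captioner(result);
    result.stash.set(CAPTION_KEY, { value: text, trustTier, derivedFromMedia: true });
  }
  return results;
}

let captionerCalls = 0;
const captioner: Captioner = (media) => {
  captionerCalls += 1;
  return { text: `caption for ${media.id}`, trustTier: media.trustTier };
};

const chart: SketchMedia = {
  id: "m5",
  mimeType: "image/png",
  trustTier: "third-party-public",
  stash: new Map(),
};

captionMiddleware(["ok", chart], captioner, false); // nobody needs text: no work done
const callsWhenNotNeeded = captionerCalls;
captionMiddleware(["ok", chart], captioner, true); // first need: caption written
captionMiddleware(["ok", chart], captioner, true); // already stashed: skipped
```

The handler that produced `chart` never pays for captioning, and a second pass through the pipeline finds the stashed entry instead of recomputing it.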
Trust is content, not code-path
Media.trustTier is the source of truth for the trust envelope. Tool.trusted does not override it, does not propagate to it, and is not consulted when a battery renders a Media result — the same principle that already governs Retrievable.trustTier inside a trusted tool's output. Trust is a property of where the content came from, not who fetched it. A trusted tool returning a third-party-public image renders that image in the untrusted envelope, every time.
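The rule can be sketched as an envelope selector that receives the tool but never reads it. The `Envelope` names and the mapping of 'first-party' to the trusted envelope are assumptions; the text above only fixes that Tool.trusted is ignored and that a third-party-public image lands in the untrusted envelope.

```typescript
// Hypothetical sketch; the envelope names and the tier-to-envelope mapping are
// assumptions. Only the "Tool.trusted is not consulted" rule comes from the prose.
type TrustTier = "first-party" | "third-party-public" | "third-party-private";
type Envelope = "trusted" | "untrusted";

interface Tool {
  name: string;
  trusted: boolean;
}

interface SketchMedia {
  id: string;
  trustTier: TrustTier;
}

// The tool is accepted as a parameter only to show that it is never read.
function envelopeFor(media: SketchMedia, _tool: Tool): Envelope {
  // Assumption: only first-party content earns the trusted envelope.
  return media.trustTier === "first-party" ? "trusted" : "untrusted";
}

const trustedFetcher: Tool = { name: "fetch_image", trusted: true };
const untrustedFetcher: Tool = { name: "community_plugin", trusted: false };
const scrapedImage: SketchMedia = { id: "m6", trustTier: "third-party-public" };

// Same content, different tools, same envelope.
const viaTrustedTool = envelopeFor(scrapedImage, trustedFetcher); // "untrusted"
const viaUntrustedTool = envelopeFor(scrapedImage, untrustedFetcher); // "untrusted"
```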
A Message carries Media through its attachments field, for both user and assistant roles, as the previous section described. A tool returns Media (or Media[]) directly from its handler when it has the bytes in hand; the ADK writes the value into ToolCall.results, the shape that ToolCall covers below, without wrapping it in a SpooledArtifact. Either way, the renderer is what reaches into the asset: MediaReader.stream for upload paths that can forward the stream; Media.asBytes / Media.asBase64 for paths that need the inline buffer.
Out of scope: byte hygiene
DLP, antivirus scanning, and media moderation are production responsibilities. Media does not do them for you. There is no scanning hook on MediaReader, no clean/dirty flag on Media, no quarantine state. Wire byte hygiene into your tool implementations, storage adapters, middleware pipeline, or ingress layer — but wire it somewhere if untrusted bytes enter your system. The framework defines contracts; the implementor owns policy.
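As one example of wiring hygiene in at the ingress layer, the sketch below runs a list of scanners before bytes are ever wrapped in a Media. It is an assumption-laden illustration, not a substitute for DLP, antivirus scanning, or moderation: `Scanner`, `signatureScanner`, and `admit` are hypothetical names, and the single scanner shown only checks that the leading bytes match the well-known PNG or JPEG file signature for the declared MIME type.

```typescript
// Hypothetical sketch of an ingress check; none of these names are ADK API.

// A scanner returns null when it has no objection, or a reason string.
type Scanner = (bytes: Uint8Array, declaredMimeType: string) => string | null;

// Well-known leading file signatures for two image formats.
const SIGNATURES: Record<string, number[]> = {
  "image/png": [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a],
  "image/jpeg": [0xff, 0xd8, 0xff],
};

// Flags bytes whose leading signature contradicts the declared MIME type.
// Policy choice in this sketch: types without a rule are rejected, not waved through.
const signatureScanner: Scanner = (bytes, declaredMimeType) => {
  const expected = SIGNATURES[declaredMimeType];
  if (!expected) return `no signature rule for ${declaredMimeType}`;
  const matches = expected.every((byte, i) => bytes[i] === byte);
  return matches ? null : `bytes do not look like ${declaredMimeType}`;
};

// Runs every scanner before the bytes are wrapped in a Media and throws on
// the first finding, so rejected bytes never reach a tool result.
function admit(bytes: Uint8Array, declaredMimeType: string, scanners: Scanner[]): Uint8Array {
  for (const scan of scanners) {
    const finding = scan(bytes, declaredMimeType);
    if (finding !== null) {
      throw new Error(`rejected at ingress: ${finding}`);
    }
  }
  return bytes;
}

const png = Uint8Array.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, 0x00]);
const admitted = admit(png, "image/png", [signatureScanner]);

// The same bytes declared as a JPEG are turned away.
let mislabelledRejected = false;
try {
  admit(png, "image/jpeg", [signatureScanner]);
} catch {
  mislabelledRejected = true;
}
```

Real deployments would add their own scanners to the list (an antivirus client, a moderation call); the point of the sketch is only where the check sits, not what it checks.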