Media — Threat Analysis
Other threat analyses in this section: Envelopes · Persistence · Identity and Reasoning · Back to Trust Tiers
This page covers the formal threat analysis for Media. For the operational guide, start there.
Why text-level defenses fail for media
The fundamental security boundary for text-based LLMs—the string-level delimiter—is a fiction in multimodal contexts. When a vision-language or audio-language model processes input, instructions are extracted directly from perceptual data—pixels or waveforms—into the model's latent representation.
This represents a catastrophic departure from memory poisoning or tag injection. In traditional attacks, the malicious instruction exists as a string within the context and can be structurally contained. Media-based attacks bypass this: the dangerous instruction does not exist as text until the model perceives it. Post-hoc filtering is structurally impossible; you cannot filter a string that did not exist until after the model interpreted the stimulus. Media hazards exploit the model's perceptual encoding layer, which operates entirely below the level of string manipulation.
LSB steganography: the invisible attack
Least-significant-bit (LSB) encoding inserts instructions into the bitstream of an image or audio file via modifications beneath the threshold of human perception.
In agentic workflows, LSB encoding is a high-criticality hazard (Anonymous, 2025). Because instructions are encoded in the pixel data rather than a text layer, they are invisible to OCR scanners and human reviewers. During cross-modal reasoning, the model's latent-space representation captures these variances. Frontier models (GPT-4o, Gemini-1.5 Pro) exhibit black-box success decoding LSB-encoded instructions with minimal queries.
The reference batteries classify this as an opaque-perceptual hazard. No reliable pre-ingestion detection exists for LSB-encoded instructions. The only viable defense is structural labeling: the opaque-perceptual signal instructs the battery to apply maximum envelope suspicion, treating the entire image as a potential vector regardless of human visual verification.
Adversarial perturbations: no text required
Adversarial perturbations are pixel-level modifications crafted to shift the vision encoder's representation toward malicious target embeddings. Unlike steganography, these attacks do not require an encoded string—they manipulate the model's internal classification or reasoning state directly.
CrossInject (ACM MM 2025) demonstrates a +30.1% attack success rate by targeting the embedding space (Anonymous, 2024). These attacks exploit cross-modal transfer: adversarial images crafted against open-weight VLMs transfer successfully to commercial closed APIs, removing the requirement for white-box model access.
This represents the most pathological hazard class. The attack requires no linguistic payload; it manipulates the encoder's representation to cause the model to interpret the image as a command, a credential, or an authorization. Any defense relying on content analysis or text-layer scanning is doomed to fail. The architectural response is to treat all vision-encoded media with maximum suspicion via the opaque-perceptual hazard class.
OCR-extractable attacks: the auditable hazard
Documents containing hidden text layers—PDFs with white-on-white text, micro-font glyphs, or off-page content—represent the extractable-instructions hazard class.
Real-world evidence: accepted ICML 2025 papers contained hidden "GIVE A POSITIVE REVIEW" instructions in their PDF text layer (Anonymous, 2025). The danger is the divergence between the visual layer (read by humans) and the embedded text layer (read by the model's text extractor). The human reviews the document and approves it; the model reads a different document.
Unlike opaque-perceptual hazards, these attacks are auditable. The payload exists as text within the file bytes and is detectable by dedicated tooling or string scanning. The reference batteries separate this into its own hazard class to allow for targeted envelope signals: dangerous but formally verifiable before it reaches the model.
Audio injection: the ultrasonic channel
Audio injection exploits the auditory perception gap to deliver instructions human supervisors cannot hear.
- SWhisper: near-ultrasonic frequencies (17-22 kHz) that survive microphone non-linearity and reconstruct as baseband instructions within the model's audio encoder. 0.94 non-refusal rate on commercial models (Anonymous, 2026).
- WhisperInject: sub-audible noise overlays, 86%+ success on Phi-4-Multimodal and Qwen2.5-Omni. Payloads transfer across model architectures—train on open weights, deploy against the commercial API.
- AudioJailbreak (ACM CCS 2025): over-the-air robustness, surviving room echo, frequency loss, and microphone distortion with 87-88% success in physical testing.
A transcript of an ultrasonic-injected audio file is not a defense—it is evidence of a successful attack. If an agent transcribes a malicious audio file, the resulting string reflects the injected instruction as if it were a legitimate user command. This is why Media.stash entries derived from audio inherit the trustTier of the source, not a derived "clean" tier.
Trust propagation failure modes
A common architectural error is the Trusted Courier Fallacy: the assumption that a trusted tool's output is safe. In the two-axis model, Tool.trusted does not propagate to the Media it returns.
A tool's integrity property describes delivery fidelity—the bytes returned are the bytes found at the source. It does not vouch for the semantic safety of those bytes. A trusted web-search tool returning a third-party image delivers a third-party-public asset. Promoting that image to first-party because the search tool is trusted creates a critical vulnerability.
This rule extends to derived data. When a model performs OCR on an image and the result is stored in Media.stash, the resulting text is at least as untrusted as the source media. If the source image contained adversarial pixels crafted to OCR as a system command, the derived OCR result is the successful output of an injection attack—not a first-party record the operator vouched for.
C2PA: provenance is not injection defense
The C2PA specification provides cryptographic manifests and signatures that verify the creator's identity and the asset's integrity since capture (Coalition for Content Provenance and Authenticity, n.d.). C2PA is orthogonal to injection defense.
A C2PA-signed image from a third-party source remains third-party content. The signature confirms a specific photographer captured the image; it does not guarantee the image is free of LSB-encoded instructions or adversarial perturbations. An attacker with a valid certificate can produce a signed malicious image. A correct implementation treats C2PA data as metadata for a provenance field on Media—useful for grounding and auditability—but a C2PA signature never elevates a third-party asset to a higher trustTier.
Non-goals for media defenses
What the two-axis model does not provide:
- Pre-ingestion scanning: the two-axis model does not attempt to clean media of adversarial content. No mathematically sound method guarantees an image is free of
opaque-perceptualhazards. - Vision encoder suppression: the model's vision encoder remains active. The defense is ensuring the model knows the content is untrusted via envelope signals, not attempting to blind the model to the content.
- Sandbox replacement: for high-risk pipelines (automated thumbnail processing, multi-source media aggregation), media processing in isolated rendering environments remains the strongest defense. The two-axis model provides the logical framework; it does not provide physical compute isolation.
- Semantic attack prevention: the two-axis model does not prevent a model from being persuaded by a legitimate, accurately-labeled, persuasive third-party image. It ensures the model cannot mistake that persuasion for a first-party system instruction.