Your AI Guardrail Is On. The Attacker Already Got Past It.

❝

THE SHORT VERSION
In June 2025 a single email read a Microsoft 365 Copilot user's internal files and mailed the contents to an attacker, with no click and no malware [2]. The defense most teams reach for, a filter that blocks bad prompts, is the one a 14-author study from OpenAI, Anthropic, and Google DeepMind broke on every system they tested [5]. This is a field report from a teaching lab built around one hard fact: a security control shifts probability, it does not flip a threat to solved.

A user did not click anything.

They opened Microsoft 365 Copilot and asked a normal work question. Copilot, doing its job, pulled a recent email into context. That email was the attack. Inside it were instructions written for the model, not the human. Copilot followed them, read the user's internal files, and sent the contents to an attacker-controlled destination.

No malware ran. No credential was phished. In the user's view, no filter was broken.

The flaw, called EchoLeak, was assigned CVE-2025-32711 and scored 9.3 [2]. Microsoft patched it server-side and said it was not exploited in the wild. The researchers at Aim Labs who found it called the class an LLM Scope Violation: the model was talked across its own trust boundary [2].

❝

The email was the code. The user was never in the loop.

This is the part security teams keep getting wrong. The attacker did not break a rule. The attacker supplied data the system was built to read.

The attack that needs no rule to break

The mechanics are boring on purpose.

An agentic LLM application has two jobs it should do: retrieve relevant documents, and answer the user. Indirect prompt injection lives in the gap between them [3]. The attacker never talks to the model. They plant text where the model will read it: a shared document, a support ticket, a web page, a calendar invite, an email [3].

We built a teaching lab to make this concrete and repeatable. A student asks the chatbot a plain question: what is our refund policy. The question is harmless and passes every input filter. The system retrieves a document to answer it. One of those documents was planted, and it carries a line addressed to the assistant rather than the reader. The model reads the document as instructions and obeys.

Each step is something the system genuinely should do. Reading the document is correct. Answering the question is correct. The harm is only in the combination.

❝

Retrieval is the new input. The document your agent reads is untrusted code.

Kai Greshake and colleagues named and demonstrated this class in 2023, with working exploits against Bing Chat and code assistants, using payloads hidden in plain text, white-on-white PDF text, and image metadata [3]. EchoLeak is the same idea, weaponized in production two years later.

This is why a filter on the user's message does not help. The malicious instruction is never in the user's message.

Nine ways in, one control plane

The lab is a working agentic application, not slideware. It runs retrieval over a real vector store, calls a hosted model for inference, and exposes two tools the agent can call, one of which changes records. Around it sits a control plane: an identity provider, a web application firewall, and a log pipeline with dashboards.

Nine attack scenarios run against it. Each maps to an entry in the OWASP Top 10 for LLM Applications 2025 [1] and to a NIST Cybersecurity Framework 2.0 subcategory [4]. Direct prompt injection and indirect injection through a retrieved document (LLM01). Knowledge-base poisoning at ingestion (LLM04). Sensitive information disclosure (LLM02). Improper output handling into a downstream sink (LLM05). Unbounded consumption (LLM10). Excessive agency, where the agent performs an action the caller was never authorized to request (LLM06).

Each scenario has one control behind a switch. Run the attack with the switch off and watch the compromise. Turn it on, run the identical attack, watch the defense. A credential-stuffing scenario stays in the set on purpose, so the identity layer earns its place next to the AI-specific controls. Classic and AI defenses are one control plane applied at different layers, not two separate worlds.

❝

One application, eight switches, nine attacks. The lesson is in what the switches do not do.

Then comes the ninth scenario, and it is the reason the lab exists.

The proof that switches do not hold

In the ninth scenario the control is already on.

We enable the retrieval filter that defeated the earlier indirect injection. Then we run a second injection, reworded to carry none of the phrases the filter looks for. It frames its instruction as a formatting convention, not a command. The filter passes it through untouched, and the model obeys it anyway. The control is enabled. The attack still works.

This is not a quirk of one lab. In October 2025 a team of 14 researchers from OpenAI, Anthropic, and Google DeepMind took 12 published defenses against prompt injection and jailbreaking and attacked them adaptively, the way a motivated adversary would [5]. Under adaptive attack, every defense fell. Prompting-based defenses dropped to 95 to 99 percent attack success. Training-based methods hit 96 to 100 percent [5]. Defenses tested against a fixed dataset, the paper notes, do not reflect real-world resilience.

The title of the paper is the whole lesson: the attacker moves second [5].

EchoLeak says the same thing in production terms. It did not beat one control. It chained four bypasses: evading Microsoft's cross-prompt-injection classifier, slipping a link past redaction with reference-style Markdown, triggering an auto-fetched image, and routing data through an allowed proxy [2]. Each control was real. The chain went around all of them.

Simon Willison, who coined the term prompt injection, puts the state of the art plainly: providers can mitigate it, they cannot seal it [6].

So if the switches do not hold, what is the point of the switches.

Why the toggle mindset fails, and what replaces it

The point of a control is to raise cost and shift probability, not to close a door.

A switch framing tells a defender the threat is handled once the toggle is green. The data says otherwise. A control turns a cheap, reliable attack into an expensive, less reliable one. That is worth a great deal. It is not the same as solved, and treating it as solved is how teams ship the gap that EchoLeak walked through.

Provenance is the clearest case. A natural instinct is to tag documents trusted or untrusted and let the trusted ones through. In the lab, the poisoned document arrives at the ingestion endpoint claiming it is trusted. If the system believes that claim, the poison enters. The fix is not a better claim check. Provenance must be assigned by the defender, from the channel a document arrived on, and never accepted from the document's own author [1]. Anything from a public ingestion path is untrusted by definition, whatever it says about itself.

❝

Provenance the uploader can set is not provenance. It is decoration.

That principle generalizes. Indirect prompt injection is, in the real world, substantially unsolved [5][6]. A defender who accepts that designs differently. They stop asking whether the filter caught it and start asking whether they will see it when it gets through, and whether they can trace it.

Which is the part of the lab that has nothing to do with toggles.

When prevention fails, reconstruction is the control

Every request in the lab carries one correlation ID, threaded through every hop.

The firewall decision, the authentication event, the documents retrieved and their scores, the assembled prompt, the model call, the tool invocations, and the final output all log under the same ID. A bad output is not a dead end. You click it and walk backward to the planted document, the ingestion source, and the session that introduced it.

This is the control that survives when the others fail. EchoLeak was patched, but the requirement it exposed, that an agent's actions must be reconstructable after the fact, holds across the next attack and the one after that. Lineage and accountability are not a dashboard nicety. For incident response, and for any regulatory question about an autonomous system's decision, they are the difference between knowing what happened and guessing.

When the filter fails, and the research says it will, the only question left is whether you can answer for what your agent did.

DEFENDER ACTION, this week

Treat retrieval as untrusted input. Give a retrieved document the scrutiny you give a user-submitted form. The model cannot tell your data from instructions hidden in it [3].
Assign provenance server-side. Stamp every ingested document with a source the uploader cannot set, and quarantine anything from a public channel by default [1].
Scope agent tools to the caller. An action tool must check the caller's authorization at call time, not assume the model only calls it when appropriate [1].
Thread one correlation ID end to end. Auth, retrieval, prompt, model, tools, output. If you cannot reconstruct a bad output's full chain, build that before adding another model feature.
Test controls with adaptive attacks. A defense that holds against a fixed test set tells you nothing. Reword the attack and run it again [5].
Write playbooks that assume injection succeeds. Detection and traceability, not prevention alone, are what contained the real cases [2].

The lab is open source. Every beat, every control, and the traceability layer are there to run and to break [7].

A control is a probability, not a promise.

Turn the switches on. Then build as if the attacker will get through one of them, because the people who broke every defense they tested are telling you the attacker will [5].

References

OWASP Gen AI Security Project, OWASP Top 10 for LLM Applications 2025, Nov 2024. https://genai.owasp.org/llm-top-10/
Aim Labs, EchoLeak (CVE-2025-32711): zero-click prompt injection in Microsoft 365 Copilot, Jun 2025; analyzed in EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System (arXiv:2509.10540). https://arxiv.org/abs/2509.10540
Greshake et al., Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection, 2023 (arXiv:2302.12173). https://arxiv.org/abs/2302.12173
NIST, The NIST Cybersecurity Framework (CSF) 2.0, Feb 2024 (NIST CSWP 29, DOI 10.6028/NIST.CSWP.29). https://www.nist.gov/cyberframework
Nasr, Carlini, Sitawarin, Schulhoff, Hayes, et al., The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections, Oct 2025 (arXiv:2510.09023). https://arxiv.org/abs/2510.09023
Simon Willison, New prompt injection papers: Agents Rule of Two and The Attacker Moves Second, 2 Nov 2025. https://simonwillison.net/2025/Nov/2/new-prompt-injection-papers/
A. Seeneevasan, LLM-Defense-Lab: a NIST CSF 2.0 defense-in-depth teaching lab for an agentic LLM application, 2026. https://github.com/aravindhseeneevasan/LLM-Defense-Lab