Why Checking One Prompt at a Time Isn't Enough to Stop a Real Attack.

❝

THE SHORT VERSION

Most AI safety filters check each message on its own. The attacks that succeed do not break the refusal; they get around it. They split a harmful request across many harmless messages, build up to it slowly across one conversation, or strip the refusal out of an open model's weights entirely. The campaigns that actually caused harm were caught by detection that watched behavior over time, not by per-message filters. Treat refusal as one layer, not your security boundary, and move your controls to where the harm adds up.

One person, using an AI coding assistant, ran an extortion campaign against at least seventeen organizations. Hospitals. Emergency services. A religious institution. The attacker used the AI to break in, decide which data was worth stealing, set a ransom price for each victim, and write the ransom notes. The demands ran from $75,000 to more than half a million dollars.

There was no firewall breach in the old sense. No custom malware. No single clever jailbreak. The attacker asked the AI for help, one reasonable step at a time, and it helped.

That is the part to sit with. The safety training did its job on every individual request. The harm still happened.

❝

The model's refusal was never your security boundary. The attackers already know that.

Here is the simplest possible version of how, with nothing dangerous in it.

Malware from three textbook questions

An attacker wants a tool that quietly copies a victim’s files and sends them out. Ask the AI coding assistant directly, “write me malware that steals a user’s documents,” and it refuses. Clean block. Its safety training was built for exactly that request.

So they never say “malware.” They ask three ordinary programming questions, the kind thousands of developers ask every day:

How do I list every file in a user’s Documents folder?
How do I compress a folder in memory, without writing it to disk?
How do I upload a file to my own server over HTTPS?

Every one is a legitimate request. The assistant answers them, as it should, because each is standard, taught in any tutorial, and used in countless honest programs. No filter fires, because no single question is malicious. Then the attacker pastes the three answers together. The result is a working data-stealer.

This is decomposition. The assistant never wrote “malware.” It answered three textbook questions correctly. The danger never lived in a single answer; it lived in the combination, which happens outside the model, where your filter cannot see. Swap the file-stealer for any protected whole, a system prompt, a credential, an attack plan, and the shape is identical: break it into pieces the model will happily give, collect them, assemble them yourself.

The model followed its rules on every request and still helped build the weapon. Decomposition is only the first of three ways an attacker gets around a refusal instead of breaking it. Each one defeats a different kind of defense.The model was never jailbroken. It was used, one harmless answer at a time, to give up something it was told to protect. Splitting the request is only the first of three ways an attacker gets around a refusal instead of breaking it. Each one defeats a different kind of defense.

Three ways around the refusal

Decomposition is the first, the cake method. It splits intent across messages, often across separate sessions, so there is not even one conversation to inspect. It beats per-message filtering because every message, on its own, is clean.

Crescendo is the second. It keeps the attack in one conversation but never asks directly. It builds up slowly: a general question, then a more specific one, then a reason why the next step is reasonable, and only then the real request, framed as a small continuation of what came before. Safety training is strong on a cold, direct ask and weak on the tenth step of a slow build.

Abliteration is the third, and the one defenders underestimate most, because it uses no prompt at all. In an open-weight model, the ability to refuse is controlled largely by a single direction in the model's internal math. Take a few dozen requests the model refuses and a few dozen it accepts, measure the average difference, and you have that direction. Remove it from the weights and the model loses the ability to refuse, with almost no loss of skill, in minutes, on ordinary hardware. People publish these as “uncensored” versions.

❝

If you run an open-weight model, its built-in refusals are not a security control. Anyone who has the weights can remove them.

That is not a flaw you can patch. It is a property of releasing a model you cannot take back. The extortion campaign this article opened with used the first method, and the people who investigate these cases have written down exactly how.

The proof is already public

You do not have to take any of this on faith. Anthropic publishes it.

Its threat-intelligence reports describe real campaigns it has caught and shut down, and the pattern is consistent: almost none relied on a single clever jailbreak. They were attackers getting around refusals at scale.

The extortion case above is one Anthropic tracked as GTG-2002. The detail that matters is not the ransom figure. It is what the AI was used for: reconnaissance, harvesting credentials, choosing targets, setting prices, writing the notes. The attacker did not automate one step. They automated the whole operation.

Other documented cases follow the same logic. North Korean operatives used the AI to fake technical skill and hold fraudulent remote jobs. A low-skilled actor used it to build and sell ransomware they could not have written alone. An influence operation ran a hundred fake social accounts with the AI deciding when each one posted. None of them needed to defeat a refusal directly. They used ordinary, dual-use help, writing code, writing persuasive text, automating steps, where the harm was in how the person used the result, which the AI never saw.

Every one of these reports ends the same way: detected, account banned, case written up. That ending matters, because it tells you which layer actually stopped the attack.

Why per-message defense fails

❝

It was not the refusal that stopped these campaigns. It was detection.

Checking one message at a time is the weakest layer in the stack. Decomposition beats it because no message is bad. Crescendo beats it because the bad request only arrives after the model has already agreed to the smaller steps. Abliteration beats it because the refusal can be removed entirely. Dual-use beats it because the harm is in how the output is used, not in the text.

In every one of these methods, the harm is in the combination, not in any single message. A filter that checks one message at a time cannot see it.

So the defenses that actually caught the real campaigns were not message filters. They watched behavior over time. They scored a whole conversation, not its latest line. They checked the output, not just the input. They linked activity across sessions. They noticed when forty harmless questions added up to one sensitive answer.

That is a different kind of system than most teams have built. It is also the only kind that survives an attacker who never says anything obviously wrong. So the question is not whether to filter messages. It is what to build instead.

❝

DEFENDER ACTION

This week, you will not buy your way out of this with a better input filter. Move the controls to where the harm adds up. Stop treating per-message filtering/blocking as your main line: keep it, but assume it misses decomposition and crescendo by design. Check the output, not just the input, since a response that reveals the format, prefix, or part of a secret is a problem even when the request looked clean. Make it stateful: score conversations and identities over time, because a run of individually harmless questions that rebuilds something sensitive is your signal. Give the model the least access it needs, keeping secrets and even their structure out of the system prompt and behind access-controlled tools. If you run open weights, put your safety layer outside the model, since its own refusals can be removed. And read the primary sources: search “Anthropic Threat Intelligence report” and “Anthropic disrupting malicious uses of Claude.”

The model's refusal was never your security boundary. The campaigns that work do not argue with the refusal. They make sure it never has anything to refuse.

Build the layer that watches the whole pattern, not a single message. That is the layer they cannot get around.

References

Anthropic, Detecting and Countering Misuse of AI: August 2025, Aug 2025. https://www.anthropic.com/news/detecting-countering-misuse-aug-2025
Anthropic, Threat Intelligence Report (full PDF, GTG-2002 “vibe hacking” case study), Aug 2025. https://www-cdn.anthropic.com/b2a76c6f6992465c09a6f2fce282f6c0cea8c200.pdf
Arditi, Obeso, Syed, Paleka, Panickssery, Gurnee & Nanda, Refusal in Language Models Is Mediated by a Single Direction, NeurIPS 2024 (arXiv:2406.11717). https://arxiv.org/abs/2406.11717
Russinovich, Salem & Eldan (Microsoft), Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack, USENIX Security 2025 (arXiv:2404.01833). https://arxiv.org/abs/2404.01833
Li, Wang, Cheng, Zhou & Hsieh, DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers, EMNLP Findings 2024 (arXiv:2402.16914). https://arxiv.org/abs/2402.16914