Anthropic Details Fable 5 Cyber Safeguards, Proposes Jailbreak Scale

Anthropic published the cybersecurity rules behind its redeployed Claude Fable 5 model and proposed an industry framework for scoring how dangerous an AI jailbreak is.

By Marcus Lee Edited by Maria Konash Published:
Anthropic published Fable 5's cybersecurity safeguards and proposed a framework for scoring the severity of AI jailbreaks. Image: Anthropic

Following the global redeployment of its Claude Fable 5 model on July 1, Anthropic has published detailed documentation of the model’s cybersecurity safeguards, alongside a draft framework for grading the severity of AI jailbreaks.

The disclosure has two parts: a breakdown of the safety classifiers that decide which cyber requests Fable 5 will answer, and a proposed scoring system, developed with Amazon, Microsoft, Google and other partners in Anthropic’s Project Glasswing, meant to give companies and governments a shared language for describing jailbreak risk. It follows the June episode in which US export controls briefly forced Fable 5 and the more capable Mythos 5 offline over a security concern.

The core challenge Anthropic describes is that most cybersecurity work is dual-use: the same capability that helps a defender scan code for flaws can help an attacker find a way in. Rather than block all security activity, Fable 5’s classifiers sort requests into four tiers. Prohibited uses, such as building ransomware or malware, sabotaging physical infrastructure or running command-and-control servers, are always blocked because they offer far more to attackers than defenders.

High-risk dual-use tasks, such as penetration testing, exploit development and finding vulnerabilities that other models cannot find, are also blocked for now, pending better ways to verify legitimate users. Low-risk work is mostly allowed but monitored. Clearly benign tasks, including secure coding, debugging, patching and malware reverse-engineering, are meant to pass through.
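The four tiers and their outcomes can be sketched as a simple lookup. This is only an illustration of the policy as the article describes it, not Anthropic's implementation: the real system relies on classifiers rather than a table, and the names `Tier`, `Action`, `POLICY` and `decide` are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    """The four request tiers described above (names are hypothetical)."""
    PROHIBITED = "prohibited"
    HIGH_RISK_DUAL_USE = "high_risk_dual_use"
    LOW_RISK = "low_risk"
    BENIGN = "benign"


class Action(Enum):
    BLOCK = "block"
    ALLOW_AND_MONITOR = "allow_and_monitor"
    ALLOW = "allow"


# Outcome per tier, following the article's description.
POLICY = {
    Tier.PROHIBITED: Action.BLOCK,            # always blocked
    Tier.HIGH_RISK_DUAL_USE: Action.BLOCK,    # blocked for now, pending user verification
    Tier.LOW_RISK: Action.ALLOW_AND_MONITOR,  # mostly allowed, but monitored
    Tier.BENIGN: Action.ALLOW,                # meant to pass through
}

# Example tasks named in the article, mapped to their tiers.
EXAMPLE_TASKS = {
    "building ransomware": Tier.PROHIBITED,
    "running command-and-control servers": Tier.PROHIBITED,
    "penetration testing": Tier.HIGH_RISK_DUAL_USE,
    "exploit development": Tier.HIGH_RISK_DUAL_USE,
    "secure coding": Tier.BENIGN,
    "malware reverse-engineering": Tier.BENIGN,
}


def decide(tier: Tier) -> Action:
    """Return the outcome for a request that has already been assigned a tier."""
    return POLICY[tier]


for task, tier in EXAMPLE_TASKS.items():
    print(f"{task}: {decide(tier).value}")
```

The sketch deliberately leaves out the hard part, which is assigning a tier to a free-form request in the first place.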

Anthropic said it set Fable 5’s “safety margin” wider than in past models, deliberately blocking some harmless requests to be more certain of catching harmful ones.
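One way to picture a wider safety margin is as a lower blocking threshold on a classifier's risk score. The sketch below assumes that framing, which the article does not confirm, and uses made-up scores purely to show the trade-off. The function `block_rates` and both score lists are hypothetical.

```python
def block_rates(harmful_scores, benign_scores, threshold):
    """Return (share of harmful requests blocked, share of benign requests blocked)."""
    caught = sum(s >= threshold for s in harmful_scores) / len(harmful_scores)
    overblocked = sum(s >= threshold for s in benign_scores) / len(benign_scores)
    return caught, overblocked


# Made-up risk scores between 0 and 1, for illustration only.
HARMFUL = [0.95, 0.80, 0.60, 0.45]
BENIGN = [0.05, 0.20, 0.40, 0.55]

narrow = block_rates(HARMFUL, BENIGN, threshold=0.70)  # blocks only high scores
wide = block_rates(HARMFUL, BENIGN, threshold=0.35)    # wider margin, blocks more

print("narrow margin:", narrow)
print("wide margin:  ", wide)
```

With these invented numbers, lowering the threshold catches every harmful request but also blocks half of the benign ones, which is the kind of trade-off Anthropic says it accepted.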

The second and more novel piece is the proposed Cyber Jailbreak Severity, or CJS, scale. It rates a jailbreak from CJS-0, informational, to CJS-4, critical, on a scale meant to be roughly exponential, so each band is several times more serious than the last.

Four axes feed the score: how much capability the jailbreak gives an attacker beyond existing tools, how many different attack types it works on, how easily it can be turned into a working attack, and how discoverable the technique is. The rating is relative to the current baseline of what existing tools can already do. Anthropic illustrates this with the Log4Shell flaw, which would have scored high in 2021 when no tool could find it, but scores zero today because every scanner detects it, even though the model’s behavior is identical. A calculated score sets a floor: the final rating can be raised above it but never lowered below it.
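Three properties of the scale described above lend themselves to a short sketch: the roughly exponential bands, the baseline-relative notion of uplift, and the floor rule. The article does not give the formula that combines the four axes, so the code below does not attempt one. The function names and the multiplier of 3 per band are assumptions chosen for illustration.

```python
from typing import Optional

MIN_BAND, MAX_BAND = 0, 4  # CJS-0 (informational) through CJS-4 (critical)


def check_band(band: int) -> None:
    if not MIN_BAND <= band <= MAX_BAND:
        raise ValueError(f"band must be between {MIN_BAND} and {MAX_BAND}")


def relative_seriousness(band: int, factor: float = 3.0) -> float:
    """Weight of a band if each one is `factor` times more serious than the last.

    The factor is hypothetical; the article only says "several times".
    """
    check_band(band)
    return factor ** band


def has_uplift(model_finds_it: bool, existing_tools_find_it: bool) -> bool:
    """A jailbreak adds capability only if existing tools cannot already do the same."""
    return model_finds_it and not existing_tools_find_it


def final_band(calculated: int, proposed: Optional[int] = None) -> int:
    """Apply the floor rule: a proposed rating may raise the calculated score, never lower it."""
    check_band(calculated)
    if proposed is None:
        return calculated
    check_band(proposed)
    return max(calculated, proposed)


# Log4Shell, as the article describes it: same model behavior, different baseline.
print(has_uplift(model_finds_it=True, existing_tools_find_it=False))  # 2021 baseline
print(has_uplift(model_finds_it=True, existing_tools_find_it=True))   # today's baseline
print(final_band(calculated=2, proposed=1))  # an attempt to lower is ignored
```

The Log4Shell comparison shows why the rating has to be revisited over time: the inputs about the model stay fixed while the baseline moves.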

The Policy Push

The framework is Anthropic’s attempt to turn a messy, ad hoc process into a standard. Today there is no agreed way to describe a jailbreak’s severity, which leaves developers and regulators talking past each other whenever one is found, as the Fable 5 shutdown showed. A common scale could let companies triage findings and communicate risk to governments consistently.

Anthropic has opened a HackerOne bounty program for researchers to submit Fable 5 jailbreaks and invited public feedback. Fable 5 is priced at $10 per million input tokens and $50 per million output tokens, at the premium end of the market, reflecting its capabilities and the cost of the safeguards wrapped around it.
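To put the pricing in concrete terms, the per-request cost follows directly from the two quoted rates. The sketch below uses only those rates. The function name and the token counts in the example are hypothetical.

```python
from decimal import Decimal

# Rates quoted in the article, in US dollars per million tokens.
INPUT_USD_PER_MILLION = Decimal("10")
OUTPUT_USD_PER_MILLION = Decimal("50")
TOKENS_PER_MILLION = Decimal(1_000_000)


def request_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one request at the quoted input and output rates."""
    total = (Decimal(input_tokens) * INPUT_USD_PER_MILLION
             + Decimal(output_tokens) * OUTPUT_USD_PER_MILLION)
    return total / TOKENS_PER_MILLION


# Hypothetical workload: 200,000 input tokens and 20,000 output tokens.
print(request_cost_usd(200_000, 20_000))
```

Because output is five times the price of input, a response-heavy workload costs considerably more than the input rate alone suggests.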

Open Questions

The effort is credible but not disinterested, and that is the tension worth watching. When the three largest cloud providers help write a safety standard, it tends to become the de facto industry baseline, which raises the question of whether a framework shaped by incumbents will be neutral or will favor their approaches over smaller labs and open-weight developers who were not in the room. Publishing a detailed taxonomy of what is and is not blocked also cuts both ways: it aids legitimate researchers and transparency, but hands would-be attackers a clearer map of the boundaries to probe. And a voluntary framework has no enforcement behind it.

Anthropic frames this as an early draft and is asking for critique from academia, civil society and government, an acknowledgment that whether CJS becomes a genuine industry standard, rather than one company’s proposal, will depend on who else adopts it.