Security Expert Says Anthropic's Claude Fable 5 Protections Already Defeated

Security expert reports successfully circumventing safety mechanisms in Claude Fable 5, revealing vulnerabilities in Anthropic's protection framework through sophisticated multi-layered prompt techniques.

A cybersecurity and artificial intelligence researcher says he jailbroke Anthropic's newest AI model, Claude Fable 5, within 48 hours of its public debut.

Known in AI circles as "Pliny the Liberator," the researcher announced on Wednesday that he successfully "liberated" Fable 5, which had been released on Tuesday as a safety-enhanced variant of the more advanced Mythos model that Anthropic deemed too risky for widespread public availability.

He said he used multiple techniques, including a jailbroken version of Opus 4.8, to circumvent the protective measures Anthropic built into the model to stop users from asking it for dangerous content, such as instructions for making drugs or carrying out hacks.

"Despite this overly sensitive, authoritarian 'safety' layer on top of Mythos, my lil liberators have been hard at work [...] cleverly finding the holes in the fence that the thought police missed," said Pliny.

When Claude Fable 5 and Mythos were released earlier this year, some members of the cryptocurrency community had already voiced concerns that the models could be used to launch attacks against crypto protocols and software infrastructure. A compromised version of Claude Fable 5 would suggest the danger is more imminent than previously anticipated.

Getting around Claude Fable 5's guardrails

"Pliny" gained notoriety around 2024 by creating and publicly distributing jailbreak prompts for AI models including ChatGPT, Claude, Grok and others, frequently publishing "jailbreak alerts" that showcase methods for circumventing protections soon after new AI models are introduced.

To circumvent Anthropic's protective barriers, Pliny reported using Unicode and homoglyphs, long-context framing, narrative and fiction framing, academic-style decomposition-recomposition, and a jailbroken Claude Opus 4.8 to make Fable respond to prompts it would otherwise block.

"Perhaps the most effective is decomposition + recomposition in the backend," he said.

The method entails breaking a request into smaller, seemingly benign components and asking for harmless-looking information piece by piece. Each prompt appears acceptable to the AI's protective filters, but when the answers are assembled, they produce something more valuable or potentially hazardous.

Pliny demonstrates a path to meth synthesis by asking about the Birch reduction method. Source: Pliny

Backlash over Fable 5 mounts

Since its release, Anthropic's Fable 5 has drawn criticism for its stringent restrictions.

When a user asks about sensitive subjects such as bioweapons or cybersecurity, Fable 5 is programmed to display a notification and then redirect the interaction to a previous, less advanced model.

"This is one of the first times that an AI company has rolled out a guardrail, and there has been uniform disdain. It has led to a lot of justified anger," said Sayash Kapoor, an AI researcher at Princeton University, according to the Wall Street Journal.

"The consensus seems to be that this has been one of the most disappointing model drops of all time, effectively preventing legitimate researchers from contributing their talents to our collective advancement," said Pliny.

Anthropic had found no universal jailbreaks

At the time of the Fable 5 release, Anthropic said it had conducted an external bug bounty program to look for ways to jailbreak the AI model.

"As well as internal testing, we ran an external bug bounty that produced no universal jailbreaks in over 1,000 hours of testing."

Cointelegraph reached out to Anthropic for comment but did not receive an immediate response.