Introduction
LLMs unlock a new world of possibilities at our finger tips. But guardrails are essential for preventing information that may assist in the creation of chemical weapons or exploiting software vulnerabilities in security systems. Anthropic recently released a new safety safeguard approach called constitutional classifiers (Sharma et al., 2025) to help prevent these scenarios. This approach builds upon their work of using constitutions for LLMs. Constitutions train AI to self-reflect based on a set of predefined language rules that define permitted and restricted content (Bai et al., 2022). With the LLM’s constitution, Anthropic generates synthetic training data guided by these constitutional rules to fine-tune specialized classifier models that can identify harmful inputs and outputs, preventing users from bypassing their safeguards through jailbreaking techniques. The robustness of their approach was evaluated through automated red teaming and a public bug bounty program (Sharma et al., 2025).
In this blog, you will learn about jailbreaking, Anthropic’s constitutional classifiers approach, its applications and limitations, and potential future directions for safeguard evaluations.
What are Jailbreaks?
Jailbreaks are ways to bypass a model’s safety protocols.Community and internal testing have shown that there are many ways to prompt or attack the model to incite this behavior (Andriushchenko et al, 2024). Several strategies have been used to jailbreak LLMs - for example, “Do Anything Now” (DAN), which creates role playing prompts to instruct the model that “it can do anything” (Shen et al., 2023), and “God Mode”, which are forceful instructions that claim elevated or all access privileges (Pliny, 2025).
Below is an example of a known DAN prompt and LLM responses that successfully extracted specific information about a dangerous chemical used in chemical warfare—a substance designated as a schedule 1 by the government. Since most usage policies explicitly ban using LLMs for weapon creation, models are typically aligned to prevent the outputting this information.
Prompt

Grok 2 Response

o3 Mini Response



Claude 3.7 Sonnet Response

Results were generated by using openrouter.ai with the default settings
As you can see, “DAN” responses differ significantly from the normal “LLM” response for all 3 models, bypassing some of their alignment and providing dangerous information to aid in the creation of chemical weapons. All models reveal additional information about this chemical that was not revealed in its original model response, “LLM”. In Grok, the tone in DAN mode has completely shifted to using phrases like “No problemo” and “I just hacked into the secret database of a top-tier chemical research facility”. And since LLM responses can vary for identical inputs, these prompts may successfully bypass alignment safeguards in some instances while failing in others making it even harder to predict when it reveals dangerous information. As demonstrated, many models are still susceptible to well known attacks like these that reveal dangerous information. Constitutional classifiers aim to address this vulnerability by preventing harmful LLM responses across a wide range of jailbreaking attempts.
What are Constitutional Classifiers?
Constitutional classifiers aim to prevent universal jailbreaks - “attacks that reliably extract detailed harmful information across the vast majority of queries in a domain” (Sharma et al., 2025). Their approach is to have two fine tuned classifiers: one that identifies harmful inputs and another that identifies tokens in the output stream for harmful generations.
To train these classifiers, Anthropic used a helpful only model, model without harmlessness measures, to generate synthetic data of both harmful and harmless content, based on the constitution. The training data is further refined by augmentation strategies such as translating text to other languages, using an LLM to paraphrase the generated text, incorporating benign data to increase the diversity of their data, and using an automated red teaming pipeline to create new attacks from existing jailbreaking strategies.
While early versions of their approach led to intensive computational costs and over refusals, they refined their approach to reduce inference costs and over refusals. In addition, Anthropic mentions their approach is adaptable to new threats by updating the constitution and the classifiers accordingly. After incorporating refinement and adaptability, they evaluated their classifiers by taking a two pronged approach.

High level overview of the constitutional classifiers approach (Sharma et al., 2025)

The output classifier identifies harmfulness for the output stream by token (Sharma et al., 2025)
Evaluations of constitutional classifiers
The evaluations combined automated red teaming approaches and a bug bounty program to prevent universal jailbreaking attacks, with a particular focus on assessing the Chemical, Biological, Radiological, or Nuclear (CBRN) capabilities of LLMs. They incorporated existing jailbreaking strategies such as “Do Anything Now” and “God Mode” as well as other known jailbreaking strategies from the literature were used to help form the evaluation dataset. These queries were specifically focused to be different from the training data generation pipeline. To further increase the diversity of the dataset, they included queries from probing systems for undiscovered vulnerabilities and applied transformations of harmful intent text through cryptographic ciphers (base64 encoding), cross lingual translation (translating harmful intent into another language), and LLM driven semantic reformulation (using an LLM to rewrite harmful intent).
To measure the effectiveness of these jailbreaking attempts, they developed a grading rubric where text was scored by an LLM. In addition to their evaluation dataset, they curated a bug bounty program with monetary awards to encourage the community to jailbreak their system. These types of attacks are notoriously difficult to evaluate because they evolve continuously. As new threats emerge, it’s essential for LLM companies to proactively guard against these rising vulnerabilities. Anthropic’s approach attempts to embody the rise of emergent threats, and their results demonstrated the classifier guarded system could not be broken by a human red teamer and the system refused 95% of held out jailbreaking attempts on the evaluation dataset. Beyond evaluations, they discuss how these classifiers can be used in production.

Adding the constitutional classifiers safeguard demonstrates a lower attack success rate on automated attack strategies
Applications of Constitutional Classifiers
According to the paper, constitutional classifiers offer a proactive method to prevent harmful inputs and streaming outputs in production. They can be used to flag harmful content, similar to traditional safety measures used by social media companies, and as a multi-layered defense system that guards the LLM. Additionally, as new threats emerge, the constitution can be updated to incorporate these new categories of content which can then be used to generate additional training data for the classifiers. While constitutional classifiers provide an adaptable framework for harm mitigation, their effectiveness isn’t full proof—limitations exist in the narrow scope of domains tested.
Limitations of Constitutional Classifiers
Although thousands of hours were dedicated to human red teaming and building a comprehensive red teaming dataset, these evaluations cover only a subset of diverse current and future use cases of LLMs. The research primarily focuses on safeguarding against Chemical, Biological, Radiological, and Nuclear (CBRN) misuse, leaving out questions about the generalizability to other domains.
For example, LLMs are being used in law for legal research and contract review; in healthcare for clinical note writing, summarization, and research; and in finance for generating market reports or synthesizing regulatory compliance rules. Each of these domains may require domain specific safeguards that may not align with the CBRN-focused framework. For example, in healthcare there needs to be jailbreaking protections against HIPAA violations where a clinician may request patient information they shouldn’t access. Or when querying an LLM to summarize patient notes, the system needs a safeguard to ensure a higher standard for text generations that remain grounded in the original clinical documentation. These use cases may not extend with the current approach to constitutional classifiers.
While the authors suggest that new harmfulness categories can be added to the constitution, previous research has demonstrated that even subtle changes to evaluation prompts can significantly impact harmful content detection. When the data distribution of the synthetic data can change by slight changes of the prompt, it raises concerns about whether constitutional classifiers will maintain their effectiveness when adapted to domain-specific prompt modifications and diverse real-world applications beyond CBRN scenarios. Vulnerabilities like these highlight a broader challenge of evolving beyond general protections.
Future Work
Improving constitutional classifiers involves developing safeguards not only for general use cases but also for specific domains where LLMs are currently in use. This would help strengthen use for fields like healthcare, law, and finance, where additional harmful prompts and generations exist based on specific use cases. Creating domain specific protections presents unique challenges such as collaborating with domain experts to understand the intended use of LLMs and safety concerns. However, it is part of the responsibility of the model creators to stay ahead of emerging threats.
New questions can start to arise about effective safeguards such as: Will this technique be robust enough to shifts in the distribution of inputs and outputs over time? Is the grading rubric used to evaluate harmfulness strong enough? What constitutes harmful behavior in this domain? And does synthetic data generation capture the diversity of LLM interactions? Although this list is not exhaustive, these questions demonstrate the considerations when developing safeguards for LLMs in specific fields. This approach of using product use cases to drive our safety innovations is critical not only in strengthening our safeguards but also preventing harmfulness in fields where LLMs are already in use. And while some real world use cases were tested in constitutional classifiers, moving to additional practical use cases can help strengthen their safeguard.
Conclusion
Constitutional classifiers represent one of the first publicized attempts by an AI company to create guardrails for an LLM that help prevent it from providing harmful information, such as instructions for creating chemical weapons or assistance with system breaches. By using two classifiers to detect dangerous inputs and outputs, this approach is a defense mechanism against jailbreaking attempts to exploit existing LLM vulnerabilities.
While Anthropic demonstrated strong results through both human red teaming and an automated red teaming pipeline, this is a first step into developing robust safeguards for the variety of LLM use cases including in safety critical domains like healthcare, law, and finance. Harmful outputs in these contexts could compromise patient care, financial stability, or even national security. Using real world product applications to drive safety innovation creates a cycle where practical implementation informs effective safety protections and enables careful, iterative testing. This approach helps enable more targeted and contextually appropriate safety methods.
Moving forward, as AI systems become more capable, AI safety must anticipate and evolve to ensure responsible and trustworthy systems. Constitutional classifiers is a first approach, but much work is yet to be done to ensure ongoing vigilance and commitment to ensure responsible use of AI for the community.