These enhancements aim to address the evolving risks associated with advanced AI models. Anthropic's updated policy introduces new capability thresholds and corresponding safeguards to ensure AI systems operate safely and ethically.
One notable addition to the RSP is the focus on capabilities relevant to Chemical, Biological, Radiological, and Nuclear (CBRN) weapons development.
Anthropic has defined specific thresholds related to these areas, aiming to prevent AI models from being misused.
The company commits to implementing stringent security standards before developing or deploying models that could aid in CBRN-related activities.
To mitigate risks from more advanced models, Anthropic is developing ASL-3 Deployment Safeguards.
These include a multi-layered defense-in-depth architecture to prevent misuse of AI capabilities.
The four main layers are:
- Access Controls: Tailoring safeguards to the deployment context and expected user groups.
- Real-Time Prompt and Completion Classifiers: Immediate filtering of user inputs and AI-generated outputs.
- Asynchronous Monitoring Classifiers: Detailed analysis of completions for potential threats.
- Post-Hoc Jailbreak Detection: Rapid response procedures for attempts to bypass safety measures.
These safeguards create a robust infrastructure, adaptable to various deployment scenarios.
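The layered flow above can be sketched in Python. This is an illustrative pipeline, not Anthropic's implementation: the names (`Request`, `handle`, `audit_log`) and the keyword-based "classifiers" are stand-ins for the real access controls and ML classifiers the policy describes.

```python
from dataclasses import dataclass

# Completions queued here would be reviewed asynchronously (layers 3 and 4:
# monitoring classifiers and post-hoc jailbreak detection).
audit_log: list = []

@dataclass
class Request:
    user_group: str
    prompt: str

def access_allowed(req: Request) -> bool:
    # Layer 1: access controls tailored to the deployment context.
    return req.user_group in {"vetted_enterprise", "general"}

def prompt_flagged(prompt: str) -> bool:
    # Layer 2, input side: real-time prompt classifier (keyword stand-in).
    return "nerve agent" in prompt.lower()

def completion_flagged(completion: str) -> bool:
    # Layer 2, output side: real-time completion classifier (keyword stand-in).
    return "synthesis route" in completion.lower()

def handle(req: Request, model) -> str:
    if not access_allowed(req):
        return "[denied: access controls]"
    if prompt_flagged(req.prompt):
        return "[blocked: prompt classifier]"
    completion = model(req.prompt)
    if completion_flagged(completion):
        return "[blocked: completion classifier]"
    audit_log.append((req.prompt, completion))  # hand off to async review
    return completion
```

The point of the structure is that each layer only sees traffic the earlier layers allowed through, so a failure in any single layer does not by itself expose the model's full capabilities.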
Anthropic’s commitment to AI safety extends to its internal governance structures.
The company has appointed a Responsible Scaling Officer to oversee compliance with the RSP.
Additionally, systems are in place for anonymous reporting of potential noncompliance, ensuring safety remains a top priority.
In collaboration with the Department of Energy, Anthropic tested its AI model Claude in classified environments.
These tests assess whether the model could facilitate the creation of nuclear weapons, highlighting a proactive approach to AI safety.

Anthropic has also developed constitutional classifiers to detect and block harmful content.
This system screens both user inputs and model outputs against a set of rules defining prohibited content.
The new system substantially improves the model's ability to reject jailbreak attempts compared with earlier safeguards.
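The key property described above is symmetric screening: the same rule set is applied to the prompt before generation and to the completion after it. A minimal sketch, assuming a hypothetical rule list and function names (this is not Anthropic's classifier, which is a trained model rather than a keyword filter):

```python
# Hypothetical "constitution": topics the classifier must block.
PROHIBITED = ("bioweapon", "enrich uranium")

def violates_constitution(text: str) -> bool:
    # Stand-in for a trained classifier scoring text against the rules.
    lowered = text.lower()
    return any(rule in lowered for rule in PROHIBITED)

def guarded_generate(prompt: str, model) -> str:
    if violates_constitution(prompt):        # screen the input
        return "[refused: input screen]"
    completion = model(prompt)
    if violates_constitution(completion):    # screen the output
        return "[withheld: output screen]"
    return completion
```

Screening the output as well as the input matters because a jailbreak may disguise a harmful request so that only the generated answer reveals the violation.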
Anthropic’s measures reflect a broader industry trend: AI safety and responsibility are no longer optional.
As models grow more powerful, robust policies and protections must follow to ensure technology serves humanity.