Enkrypt AI’s latest red teaming report unveils critical vulnerabilities in Mistral AI’s multimodal models, Pixtral-Large (25.02) and Pixtral-12b, particularly in their propensity to generate harmful content related to Child Sexual Exploitation Material (CSEM) and Chemical, Biological, Radiological, and Nuclear (CBRN) threats. The findings highlight an urgent need for enhanced safety measures and rigorous testing in the development and deployment of advanced AI systems.
Pixtral models more susceptible to generating harmful content
The comprehensive evaluation compared the two Mistral models against industry leaders like OpenAI’s GPT-4o and Anthropic’s Claude 3.7 Sonnet. The results were stark: the Pixtral models were far more susceptible to generating harmful content, proving 60 times more likely to produce CSEM and 18 to 40 times more likely to generate dangerous CBRN outputs than the benchmark models.

Enkrypt AI’s sophisticated red teaming methodology involved automated adversarial inputs designed to mimic real-world tactics used to bypass content filters. These included jailbreak prompts, multimodal manipulation, and context-driven attacks. A human-in-the-loop process ensured the accuracy and ethical oversight of the evaluations.
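To make the workflow concrete, here is a minimal Python sketch of an automated red-teaming loop with a human-in-the-loop review step. The function names, the attack taxonomy, and the review thresholds are illustrative assumptions, not Enkrypt AI's actual tooling.

```python
# Illustrative red-teaming harness: send adversarial prompts to a model under
# test, score the responses, and escalate borderline cases to human reviewers.
# All names and thresholds here are hypothetical.
from dataclasses import dataclass

ATTACK_TYPES = ["jailbreak", "multimodal_manipulation", "context_driven"]

@dataclass
class RedTeamResult:
    attack_type: str
    prompt: str
    response: str
    flagged_unsafe: bool
    needs_human_review: bool = False

def run_red_team(model_call, adversarial_prompts, safety_classifier):
    """model_call: callable prompt -> response (the model under test).
    adversarial_prompts: iterable of (attack_type, prompt) pairs.
    safety_classifier: callable response -> probability the output is unsafe."""
    results = []
    for attack_type, prompt in adversarial_prompts:
        response = model_call(prompt)
        p_unsafe = safety_classifier(response)
        results.append(RedTeamResult(
            attack_type=attack_type,
            prompt=prompt,
            response=response,
            flagged_unsafe=p_unsafe > 0.5,
            # Borderline scores are routed to human reviewers, mirroring the
            # report's human-in-the-loop accuracy and ethics checks.
            needs_human_review=0.3 < p_unsafe < 0.7,
        ))
    return results

def attack_success_rate(results):
    """Fraction of adversarial prompts that elicited unsafe content."""
    return sum(r.flagged_unsafe for r in results) / max(len(results), 1)
```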
The report revealed that 68% of harmful prompts successfully elicited unsafe content across the two Mistral models. In CBRN testing, the models not only failed to reject dangerous requests but often generated detailed responses involving weapons-grade chemicals, biological threats, and radiological dispersal methods. One particularly concerning instance involved a model describing how to chemically modify VX nerve agent for increased environmental persistence.
Vulnerabilities embedded in advanced AI systems
These findings underscore the significant security vulnerabilities embedded in these advanced AI systems and the potential dangers of their unmitigated deployment.
Despite the concerning findings, the report emphasizes that it also serves as a “blueprint for positive change.” Enkrypt AI advocates for a security-first approach to AI development, combining continuous red teaming, targeted alignment using synthetic data, dynamic guardrails, and real-time monitoring. The report provides a detailed safety and security checklist, recommending immediate implementation of robust mitigation strategies, including model safety training, context-aware guardrails, and model risk cards for transparency and compliance tracking.
“This level of proactive oversight is essential—not just for regulated industries like healthcare and finance—but for all developers and enterprises deploying generative AI in the real world,” the report states. “Without it, the risk of harmful outputs, misinformation, and misuse becomes not just possible—but inevitable.”
Enkrypt AI’s mission is rooted in the belief that AI should be safe, secure, and aligned with the public interest. By exposing critical vulnerabilities in models like Pixtral and offering a pathway toward safer deployments, this red teaming effort contributes to a safer global AI ecosystem. The world deserves AI that empowers, not endangers—and Enkrypt AI is helping make that future possible.
The report details the specific methodologies used for CSEM and CBRN risk testing, including the creation of adversarial prompts and the human-in-the-loop assessment process. It also provides examples of prompts and partially redacted responses to illustrate the types of harmful content generated by the models.
Key Recommendations from the Report:
- Safety Alignment Training: Utilize red teaming datasets to refine model alignment and reduce bias and vulnerability to jailbreaking.
- Automated and Continuous Red Team Testing: Implement ongoing, automated stress tests tailored to specific use cases.
- Context-Aware Guardrails: Deploy dynamic guardrails that adjust based on context to neutralize harmful inputs (a minimal sketch follows this list).
- Model Monitoring and Response: Continuously log model inputs and responses, map out automation and auditing workflows, and develop a robust response system for addressing issues.
- Model Risk Card Implementation: Regularly provide executive metrics on model functionality, security, reliability, and robustness, and ensure compliance with AI transparency regulations.
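The sketch below illustrates how the context-aware guardrail and monitoring recommendations might fit together: a wrapper that checks every input and output against a context-dependent policy and logs each exchange for auditing. It is a minimal sketch assuming hypothetical check_input and check_output policy functions; it is not Enkrypt AI's or Mistral's actual implementation.

```python
# Minimal context-aware guardrail wrapper with request/response logging.
# Policy functions, field names, and log format are assumptions for illustration.
import json
import logging
import time

logger = logging.getLogger("guardrail")

def guarded_generate(model_call, prompt, context, check_input, check_output):
    """Wrap a model call so every input and output passes a policy check whose
    strictness can depend on deployment context (e.g. user role or application
    domain), and log the exchange for later auditing."""
    record = {"ts": time.time(), "context": context, "prompt": prompt}

    # Context-aware input check: the policy can be stricter for, say,
    # consumer-facing chat than for an internal research sandbox.
    if not check_input(prompt, context):
        record["action"] = "blocked_input"
        logger.info(json.dumps(record))
        return "Request declined by safety policy."

    response = model_call(prompt)
    record["response"] = response

    # Output check catches harmful content the input filter missed,
    # such as jailbreaks that look benign at the prompt level.
    if not check_output(response, context):
        record["action"] = "blocked_output"
        logger.info(json.dumps(record))
        return "Response withheld by safety policy."

    record["action"] = "allowed"
    logger.info(json.dumps(record))
    return response
```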
The full report provides a comprehensive analysis of the findings and detailed recommendations for mitigating the identified risks. It is a critical resource for AI developers, enterprises, and policymakers seeking to ensure the safe and responsible development and deployment of multimodal AI systems.
You can read the full report here.