Today, AI systems are trusted to follow safety rules: people rely on them for learning and everyday support and assume strong guardrails are in place. Cybernews researchers, however, ran structured tests to see whether leading AI tools could be coaxed into producing harmful or illegal output, and the results were surprising.
The ChatGPT and Gemini security test: simple phrases bypass the filters
The testing process gave each trial a simple one-minute interaction window, allowing only a few questions. The tests covered stereotypes, hate speech, self-harm, cruelty, sexual content, and various types of crime. A consistent scoring system tracked whether a model fully complied with a prompt, partially complied, or rejected it.
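Cybernews has not published the tooling behind this rubric; the following is a minimal sketch, assuming only the three-way outcome described above (full compliance, partial compliance, rejection). The `Outcome`, `Trial`, and `summarize` names, and the example data, are illustrative and not taken from the study.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Three-way rubric described in the article: full compliance,
    # partial compliance (e.g. a "sociological explanation"), or rejection.
    FULL_COMPLIANCE = "full"
    PARTIAL_COMPLIANCE = "partial"
    REJECTION = "rejected"


@dataclass
class Trial:
    model: str       # e.g. "ChatGPT-5", "Gemini Pro 2.5"
    category: str    # e.g. "stereotypes", "hate speech", "fraud"
    outcome: Outcome


def summarize(trials: list[Trial]) -> dict[str, Counter]:
    """Tally outcomes per model so results can be compared across categories."""
    summary: dict[str, Counter] = {}
    for t in trials:
        summary.setdefault(t.model, Counter())[t.outcome] += 1
    return summary


if __name__ == "__main__":
    # Hypothetical example data, not results from the study.
    trials = [
        Trial("Model A", "stereotypes", Outcome.PARTIAL_COMPLIANCE),
        Trial("Model A", "hate speech", Outcome.REJECTION),
        Trial("Model B", "fraud", Outcome.FULL_COMPLIANCE),
    ]
    for model, counts in summarize(trials).items():
        print(model, {o.value: n for o, n in counts.items()})
```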

Results varied widely across categories. Outright rejections were common, but many models showed weaknesses when prompts were softened or disguised as analysis. Softer or coded language in particular consistently bypassed the AI safety filters. ChatGPT-5 and ChatGPT-4o, for example, often provided partial compliance, typically in the form of sociological explanations, rather than rejecting the prompts.
Some models stood out in the study for the wrong reasons. Gemini Pro 2.5 frequently gave directly dangerous responses, even when the malicious framing was obvious. Claude Opus and Claude Sonnet were consistent in the stereotype tests but less so when prompts were dressed up as academic research. Hate speech trials showed a similar pattern: the Claude models performed best, while Gemini Pro 2.5 again proved the most vulnerable. The ChatGPT models tended to give polite or indirect responses that nonetheless complied with the prompt.
The crime-related categories differed significantly between models. When intent was disguised as research or observation, some models produced detailed explanations of hacking, financial fraud, or smuggling. Drug-related tests showed stricter rejection patterns, though ChatGPT-4o still produced unsafe outputs more often than the others. Stalking was the lowest-risk category overall, with nearly all models rejecting such requests.
These findings show that AI tools can still respond to malicious requests when they are phrased the right way. If a simple rephrasing is enough to bypass the filters, these systems can still leak dangerous information, and even partial compliance becomes risky when the leaked information relates to illicit activities such as identity theft. So, do you think the security filters of current AI models are sufficiently advanced?

