When researchers removed safety guardrails from OpenAI models in 2025, they didn’t expect the results to be so extreme.In controlled tests conducted in 2025, a version of ChatGPT generated detailed guidance on how to attack sports venues, identifying structural weaknesses at specific sites, outlining explosive recipes and advising attackers on ways to avoid detection. The findings come from an unusual cross-company security exercise between OpenAI and its rival Anthropic, and add to the warning that alignment testing is becoming “increasingly urgent.”
A detailed playbook under the guise of “security planning”
The trials were conducted by OpenAI, led by Sam Altman, and Anthropic, a company founded by former OpenAI employees who left behind security concerns. In a rare move, each company stress-tested the other’s systems by prompting them with dangerous and illegal scenarios to assess how they would respond.The researchers said the results did not reflect how the model would perform in public-facing use, where multiple security layers were applied. Even so, Anthropic reports that “behavior concerning abuse” was observed in OpenAI’s GPT-4o and GPT-4.1 models, a finding that heightens scrutiny of how quickly increasingly capable AI systems can overtake the security measures designed to curb them.according to Research resultsWhen asked about vulnerabilities in sports events under the excuse of “security planning,” OpenAI’s GPT-4.1 model provided step-by-step guidance. After initially providing a general risk category, the system is asked for specific information. It then provides what the researchers describe as a terrorist-style playbook: identifying domain-specific vulnerabilities, recommending optimal times of exploitation, detailing the chemical formula for explosives, providing circuit diagrams for bomb timers, and indicating where to obtain guns on hidden online markets. The model also provides suggestions on how attackers can overcome moral constraints, outlining potential escape routes and referencing the location of safe houses. In the same round of testing, GPT-4.1 detailed how anthrax could be weaponized and how two types of illegal drugs could be manufactured. Researchers found that the models also cooperated with tips involving the use of dark web tools to purchase nuclear materials, stolen identities and fentanyl, provide recipes for methamphetamine and improvised explosive devices, and assist in the development of spyware.
Users can trick AI into generating dangerous content by distorting prompts, creating fake scenarios, or manipulating language to achieve unsafe output.
Anthropic said it observed “behavior regarding abuse” in GPT-4o and GPT-4.1, adding that AI consistency assessment was becoming “increasingly urgent.” Consistency refers to how an AI system adheres to human values and avoids causing harm, even when receiving malicious or manipulative instructions. The human researchers concluded that OpenAI’s model was “more forgiving than we expected in accommodating apparently harmful requests from simulated users.”
Weaponization Concerns and Industry Response
The collaboration also exposed the misuse of Anthropic’s own Cloud model. Anthropic selection disclose North Korean operatives used Crowder to submit false job applications to international technology companies in an attempt to conduct a massive extortion campaign and sell AI-generated ransomware packages for up to $1,200. The company said artificial intelligence has been “weaponized” and models are used to carry out sophisticated cyberattacks and fraud. “These tools can adapt in real time to defenses such as malware detection systems,” Anthropic warns. “As AI-assisted coding reduces the technical expertise required for cybercrime, we expect these types of attacks to become more common.”
OpenAI emphasizes that the shocking output was produced under controlled laboratory conditions where real-world safety measures were deliberately removed for testing. The company said its public system includes multiple layers of protection, including training limits, classifiers, red team exercises and abuse monitoring designed to deter abuse. Since the trial, OpenAI has released GPT-5 and subsequent updates, with the latest flagship model GPT-5.2 released in December 2025. According to OpenAI, GPT-5 shows “substantial improvements in adulation, hallucination, and resistance to misoperations.” The company said the updated system employs a more robust security stack, including enhanced biocontainment measures, a “safe finish” approach, extensive internal testing and external partnerships to prevent harmful output.
In rare cross-company AI test, safety trumps secrecy
OpenAI insists safety remains its top priority and says it will continue to invest heavily in research to improve guardrails as models become more capable, even as the industry faces growing scrutiny over whether those guardrails can keep pace with rapidly evolving systems.Despite being commercial rivals, OpenAI and Anthropic said they chose to collaborate on this work for the sake of transparency in so-called “consistency assessments,” publishing their findings rather than keeping them internal. Such disclosures are unusual in an industry where security data is often kept in-house as companies race to build more advanced systems.

