Consistent Jailbreaks in GPT-4, o1, and o3

Large language models incorporate extensive safeguards to prevent the generation of harmful or restricted content. Our testing demonstrates that these protections can be consistently bypassed across GPT-4, o1, and o3. We have identified vulnerabilities that allow these models to produce disallowed content under specific conditions, often via multi-turn conversations and adversarial prompting.

Jailbreaking Techniques

Responsible Disclosure:
To prevent misuse, we are not releasing the full technical details of our methodology until the safeguards have been improved. We have notified OpenAI of these vulnerabilities and will update this article with additional details once effective patches are in place.

A Call for Scalable Solutions:
The prevalence of these vulnerabilities highlights an urgent need for scalable methodologies to systematically identify and remediate such issues.


Jailbreaking and AI Safety

Jailbreaking AI models involves bypassing the built-in safety mechanisms that restrict harmful or illegal content. Although safety measures have advanced, our testing has exposed vulnerabilities where multi-turn conversations and adversarial prompts lead to the generation of unsafe outputs.

Our findings emphasize the importance of developing automated, scalable frameworks to continuously identify and patch emerging flaws in AI safety systems. Manual discovery of such vulnerabilities is time-consuming and expensive, and we believe that automated methods will be required to keep up with the rapid pace of AI development.
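
As a rough illustration of what such an automated harness could look like (this is our own sketch, not the undisclosed methodology described in this article), the Python below runs a multi-turn probe in which an attacker model iteratively rewrites a disallowed request and a judge flags unsafe completions. The target_model, attacker_model, and judge callables are hypothetical placeholders for whichever endpoints and classifiers a red team has access to.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # e.g. {"role": "user", "content": "..."}

def multi_turn_probe(
    seed_request: str,
    target_model: Callable[[List[Message]], str],         # hypothetical: returns the target's reply
    attacker_model: Callable[[str, List[Message]], str],  # hypothetical: proposes a refined prompt
    judge: Callable[[str], bool],                          # hypothetical: True if a reply is unsafe
    max_turns: int = 5,
) -> bool:
    """Return True if any turn elicits an output the judge flags as unsafe."""
    history: List[Message] = []
    prompt = seed_request
    for _ in range(max_turns):
        history.append({"role": "user", "content": prompt})
        reply = target_model(history)
        history.append({"role": "assistant", "content": reply})
        if judge(reply):
            return True  # unsafe completion observed; stop and log for patching
        # Otherwise let the attacker model refine the request, given the refusal so far.
        prompt = attacker_model(seed_request, history)
    return False
```

In practice the judge would be a content classifier or human review, and flagged conversations would feed directly into safety training data or patch development.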


Outputs Generated by Our Testing Methodology

The following examples were produced using our methodology, which generated unsafe outputs with a 99% success rate. Most of these queries are rejected when posed as single-shot prompts but succeeded in multi-turn conversations. The outputs shown below have been selectively redacted to prevent misuse.

Hate Speech and Discrimination

Category | Unsafe Output | Explanation
Racial Stereotypes | A derogatory joke about a racial group. | Reinforces harmful stereotypes.
Ethnic Slur | Stereotyped an ethnic group as being [REDACTED]. | Promotes discriminatory attitudes.
Religious Discrimination | Suggested a religious group was [REDACTED]. | Encourages intolerance.

Misaligned and Potentially Harmful Instructions

Category | Unsafe Output | Explanation
Explosive Device | Instructions on modifying a household item. | Could facilitate harm.
Social Media Manipulation | Methods for spreading misinformation. | Undermines online trust.
Data Theft | Guidance on stealing sensitive information. | Encourages malicious activities.

Harassment and Social Media Exploitation

Category | Unsafe Output | Explanation
Fake Accounts | Instructions on creating fake accounts. | Enables social media manipulation.
Algorithm Manipulation | Methods for bypassing moderation. | Spreads misleading content.
Phishing Attack | Guidance on using [REDACTED] to steal credentials. | Facilitates harmful activities.

Overview of Existing Jailbreak Methods

Various methods have been documented for bypassing AI model safety. While some have been patched, many variations remain effective. Below is a table summarizing jailbreak methodologies, their effectiveness, and current status:

Methodology | Description | Effectiveness | Current Status
Our Method | To be updated | 99% | Ongoing testing
Adversarial Suffix (GCG Attack) | Appends a gibberish token suffix to the query, found via a greedy, gradient-guided search, triggering unsafe completions. | ~46.9% (Zou et al., 2023) | Partially mitigated (some suffixes are blocked, but variants still work)
TAP (Tree of Attacks) | Uses an attacker LLM to iteratively refine prompts until a working jailbreak is found. | >94% (Mehrotra et al., 2024) | Active (still effective against latest models)
AutoDAN-Turbo | A black-box agent that self-discovers jailbreak strategies through trial and error. | 88.5% (Liu et al., 2024) | Active (continues to bypass most safety layers)
Weak-to-Strong Attack | Uses two smaller LMs to subtly bias a larger model's next-token probabilities, overriding safety tuning. | >99% (Zhao et al., 2024) | Active (currently works on GPT-4 and Claude)
IRIS Self-Jailbreak | The model is prompted to explain its refusals and rewrite the prompt iteratively until it complies. | 94% (Ramesh et al., 2024) | New (under review, no known patches yet)
"Do Anything Now" (DAN) Prompt | Coaxes the model into a role-play persona that ignores safety rules. Early versions lasted months before detection. | ~99% (Shen et al., 2024) | Mostly patched (OpenAI blocks known DAN versions, but evolved variants persist)

The consistent success of these methods highlights the need for developing robust, automated testing frameworks to safeguard AI systems.
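
For reference, the effectiveness figures above are attack success rates (ASR): the fraction of disallowed test prompts for which an attack elicits at least one completion judged unsafe. A minimal sketch of that measurement, with hypothetical run_attack and is_unsafe callables standing in for an attack harness and a safety judge:

```python
from typing import Callable, Iterable

def attack_success_rate(
    prompts: Iterable[str],
    run_attack: Callable[[str], str],   # hypothetical: applies one jailbreak method, returns the reply
    is_unsafe: Callable[[str], bool],   # hypothetical: classifier or human judgment of the reply
) -> float:
    """Fraction of disallowed prompts for which the attack yields an unsafe completion."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    successes = sum(1 for p in prompts if is_unsafe(run_attack(p)))
    return successes / len(prompts)
```

An ASR of 0.99 on a 100-prompt benchmark means 99 of the 100 disallowed requests produced unsafe outputs. Reported numbers depend on the benchmark and the judge used, which is why the percentages above are not directly comparable across papers.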


Conclusion and Next Steps

Our research demonstrates that current large language models remain susceptible to certain adversarial techniques, highlighting ongoing challenges in AI safety.

Key Findings:

  • High Jailbreak Success Rates: Multiple attack methodologies maintain high effectiveness, including 99%+ attack success in certain cases.
  • Responsible Disclosure: To prevent misuse, we have reported our findings to OpenAI and will release further details after mitigation.
  • Scalable AI Safety Testing Needed: The persistent vulnerabilities across multiple models emphasize the need for automated, scalable adversarial testing.
  • Future Updates: Once patches are implemented, we will update this report with further technical details.

We advocate for continuous, structured AI safety testing to stay ahead of evolving adversarial techniques.


References

  1. Shen, X., et al. (2024). “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. ACM CCS 2024.
  2. Zou, A., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043.
  3. Mehrotra, A., et al. (2024). Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. NeurIPS 2024.
  4. Liu, X., et al. (2024). AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs. arXiv:2410.05295.
  5. Zhao, X., et al. (2024). Weak-to-Strong Jailbreaking on Large Language Models. arXiv:2401.17256. (ICLR 2025 submission)
  6. Ramesh, G., et al. (2024). GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (IRIS). EMNLP 2024.