The Jailbreak Cookbook
We have created a comprehensive overview of the most influential LLM jailbreaking methods.
We have created a comprehensive overview of the most influential LLM jailbreaking methods.
TLDR: we utilized LegalBench as a diversity source to enhance the diversity of our generation of red teaming questions. We show that diversity transfer from a domain-specific knowledge base is a simple and practical way to build a solid red teaming benchmark.
In this work we explore automated red teaming, applied to GPT-4o in the legal domain. Using a Llama3 8B model as an attacker, we generate more than 50,000 adversarial questions that cause GPT-4o to hallucinate responses in over 35% of cases.