These benchmarks assess the robustness of large language models to adversarial prompting. We evaluate 23 state-of-the-art models using the following attack methods.
Methodology:
- Zero-shot: Direct harmful requests without any manipulation
- Tree of Attacks with Pruning (TAP): Generates diverse jailbreak prompts by branching each attempt into multiple variations and pruning ineffective paths (a sketch of this loop appears after this list).
- Crescendo: A multi-turn attack that escalates incrementally from harmless to harmful requests by exploiting conversational context; the attacker can backtrack and rephrase if the model resists.
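The following is a minimal sketch of a TAP-style branch-and-prune loop, not our exact implementation. The callables `attacker_llm`, `target_llm`, and `judge_score` are hypothetical stand-ins for the attacker model, the model under test, and an automated judge.

```python
def tap_attack(goal: str, attacker_llm, target_llm, judge_score,
               branching_factor: int = 4, max_depth: int = 5, keep_top: int = 4):
    """Branch jailbreak prompts, score them, and prune weak paths."""
    frontier = [goal]  # start from the plain harmful request
    for _ in range(max_depth):
        candidates = []
        for prompt in frontier:
            # Branch: the attacker proposes several rephrased jailbreak prompts.
            for variant in attacker_llm(prompt, n=branching_factor):
                response = target_llm(variant)
                score = judge_score(goal, variant, response)  # e.g. 0.0-1.0 harmfulness
                candidates.append((score, variant))
        candidates.sort(key=lambda c: c[0], reverse=True)
        if candidates and candidates[0][0] >= 0.9:
            return candidates[0]  # successful jailbreak found
        # Prune: keep only the most promising branches for the next round.
        frontier = [variant for _, variant in candidates[:keep_top]]
    return None
```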
Testing is conducted across six harm categories, with prompts sourced from the public datasets HarmBench and AdvBench (Mazeika et al., 2024; Zou et al., 2023). The implementation of our automated evaluation is available in our repository; a sketch of the scoring loop is shown below.
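The sketch below illustrates how per-category attack success rates could be computed; it assumes a `prompts` list of (category, prompt) pairs drawn from HarmBench/AdvBench and hypothetical `target_llm` and `is_harmful` callables (the judge might be a fine-tuned classifier or an LLM grader).

```python
from collections import defaultdict

def attack_success_rate(prompts, target_llm, is_harmful):
    """Return the fraction of successful attacks per harm category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, prompt in prompts:
        response = target_llm(prompt)
        totals[category] += 1
        if is_harmful(prompt, response):  # judge labels the completion
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}
```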