These benchmarks assess the robustness of large language models to adversarial prompting. We evaluate 23 state-of-the-art models using the following attack methods.
Methodology:
- Zero-shot: Direct harmful requests without any manipulation
- Tree of Attacks with Pruning (TAP): Generates diverse jailbreak prompts by branching each attempt into multiple variations and pruning ineffective paths (a sketch of this loop appears after this list).
- Crescendo: A multi-turn attack that escalates incrementally from harmless to harmful requests by exploiting conversational context; the attacker can backtrack and rephrase if the model resists.
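The following is a minimal sketch of a TAP-style branch-and-prune loop, not our exact implementation. The callables `attacker_llm`, `target_llm`, and `judge_score` are hypothetical stand-ins for the attacker model, the model under test, and an automated judge.

```python
def tap_attack(goal: str, attacker_llm, target_llm, judge_score,
               branching_factor: int = 4, max_depth: int = 5, keep_top: int = 4):
    """Branch jailbreak prompts, score them, and prune weak paths."""
    frontier = [goal]  # start from the plain harmful request
    for _ in range(max_depth):
        candidates = []
        for prompt in frontier:
            # Branch: the attacker proposes several rephrased jailbreak prompts.
            for variant in attacker_llm(prompt, n=branching_factor):
                response = target_llm(variant)
                score = judge_score(goal, variant, response)  # e.g. 0.0-1.0 harmfulness
                candidates.append((score, variant))
        candidates.sort(key=lambda c: c[0], reverse=True)
        if candidates and candidates[0][0] >= 0.9:
            return candidates[0]  # successful jailbreak found
        # Prune: keep only the most promising branches for the next round.
        frontier = [variant for _, variant in candidates[:keep_top]]
    return None
```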
Testing is conducted across six harm categories, with prompts sourced from the public datasets HarmBench and AdvBench (Mazeika et al., 2024; Zou et al., 2023). The implementation of our automated evaluation is available in our repository; a sketch of the scoring loop is shown below.
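The sketch below illustrates how per-category attack success rates could be computed; it assumes a `prompts` list of (category, prompt) pairs drawn from HarmBench/AdvBench and hypothetical `target_llm` and `is_harmful` callables (the judge might be a fine-tuned classifier or an LLM grader).

```python
from collections import defaultdict

def attack_success_rate(prompts, target_llm, is_harmful):
    """Return the fraction of successful attacks per harm category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, prompt in prompts:
        response = target_llm(prompt)
        totals[category] += 1
        if is_harmful(prompt, response):  # judge labels the completion
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}
```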