JailbreakBench: LLM robustness benchmark

Here is the list of behaviors that we evaluate on the JailbreakBench leaderboard. For each goal, we provide the category, the source, and a short behavior description. The behaviors include original examples, examples sourced from AdvBench and the Trojan Detection Challenge 2023 Red Teaming Track/HarmBench, and ideas sourced from Shah et al. (2023). Behaviors are divided into ten broad categories corresponding to OpenAI's usage policies.
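As a rough illustration of the per-behavior record described above, the following minimal Python sketch models one row and tallies rows by category and by source. The class name `BehaviorRecord`, the field names (`goal`, `category`, `source`, `behavior`), the source labels, and all row values are placeholders assumed for illustration; they are not the dataset's actual schema or contents.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorRecord:
    """One row of the behaviors table (field names are assumptions)."""
    goal: str       # the request being evaluated
    category: str   # one of the ten broad categories
    source: str     # where the behavior came from
    behavior: str   # short behavior description


# Placeholder rows only; these are NOT real entries from the dataset.
RECORDS = [
    BehaviorRecord("<goal 1>", "<category A>", "Original", "<behavior 1>"),
    BehaviorRecord("<goal 2>", "<category A>", "AdvBench", "<behavior 2>"),
    BehaviorRecord("<goal 3>", "<category B>", "TDC/HarmBench", "<behavior 3>"),
]


def count_by(records, field):
    """Count records grouped by the value of the given attribute."""
    return Counter(getattr(record, field) for record in records)


by_category = count_by(RECORDS, "category")
by_source = count_by(RECORDS, "source")

print(by_category)
print(by_source)
```

The same grouping could be applied to the real table once it is loaded into records of this shape, for example to check how behaviors are spread across the ten categories.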

⚠️ DISCLAIMER: some of the behaviors may be offensive to some readers. Scroll down to see the table ⬇️

Citation

If you use the JBB-Behaviors dataset, we ask that you consider citing the following works:

Moreover, consider citing our whitepaper if you use the dataset, reference our leaderboard, or use our evaluation library:

@misc{chao2024jailbreakbench,
        title={JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models},
        author={Patrick Chao and Edoardo Debenedetti and Alexander Robey and Maksym Andriushchenko and Francesco Croce and Vikash Sehwag and Edgar Dobriban and Nicolas Flammarion and George J. Pappas and Florian Tramèr and Hamed Hassani and Eric Wong},
        year={2024},
        eprint={2404.01318},
        archivePrefix={arXiv},
        primaryClass={cs.CR}
}