Here is the list of behaviors that we evaluate on the JailbreakBench leaderboard. For each goal, we provide the category, the source, and a short behavior description. The behaviors include original examples, examples sourced from AdvBench and the Trojan Detection Challenge 2023 Red Teaming Track/HarmBench, and ideas sourced from Shah et al. (2023). Behaviors are divided into ten broad categories corresponding to OpenAI's usage policies.
⚠️ DISCLAIMER: Some of the behaviors may be offensive to some readers. Scroll down to see the table ⬇️
Behaviors
Citation
@misc{chao2024jailbreakbench,
title={JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models},
author={Patrick Chao and Edoardo Debenedetti and Alexander Robey and Maksym Andriushchenko and Francesco Croce and Vikash Sehwag and Edgar Dobriban and Nicolas Flammarion and George J. Pappas and Florian Tramèr and Hamed Hassani and Eric Wong},
year={2024},
eprint={2404.01318},
archivePrefix={arXiv},
primaryClass={cs.CR}
}