JailbreakBench

Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise unwanted content. Evaluating these attacks presents a number of challenges, and the current landscape of benchmarks and evaluation techniques is fragmented. First, assessing whether LLM responses are indeed harmful requires open-ended evaluations that are not yet standardized. Second, existing works compute attacker costs and success rates in ways that are not directly comparable. Third, some works are not reproducible because they withhold adversarial prompts or code, or rely for evaluation on proprietary APIs that change over time. Consequently, navigating the current literature and tracking progress can be challenging.

To address these challenges, we introduce JailbreakBench-v0, a centralized benchmark with the following components:

  1. Repository of jailbreak artifacts. We provide an evolving dataset of state-of-the-art adversarial prompts, referred to as jailbreak artifacts, which submissions to our benchmark are required to include to ensure reproducibility (see the usage sketch after this list).
  2. Standardized evaluation framework. Our library at https://github.com/JailbreakBench/jailbreakbench includes a clearly defined threat model, system prompts, chat templates, and scoring functions.
  3. Leaderboard. Our leaderboard at https://jailbreakbench.github.io/ tracks the performance of attacks and defenses for various LLMs.
  4. Misuse dataset. A representative dataset, named JBB-Behaviors, of 100 distinct misuse behaviors (comprising original examples, examples sourced from AdvBench and the Trojan Detection Challenge 2023 Red Teaming Track/HarmBench, and ideas drawn from Shah et al. (2023)), divided into ten broad categories corresponding to OpenAI's usage policies.
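
To make items 1, 2, and 4 concrete, below is a minimal usage sketch of our Python library. It assumes the package is installed (pip install jailbreakbench) and that it exposes the read_artifact, read_dataset, and LLMLiteLLM helpers described in the repository's README; names and signatures may differ between versions, so treat this as illustrative rather than authoritative.

    import os

    import jailbreakbench as jbb

    # (1) Jailbreak artifacts: adversarial prompts submitted for a given attack
    # (here PAIR) against a given target model, together with their metadata.
    artifact = jbb.read_artifact(method="PAIR", model_name="vicuna-13b-v1.5")
    print(artifact.jailbreaks[0])

    # (4) The JBB-Behaviors misuse dataset: 100 behaviors across ten categories.
    dataset = jbb.read_dataset()
    print(dataset.behaviors[:3], dataset.categories[:3])

    # (2) Querying a target model through the standardized evaluation pipeline
    # (LiteLLM backend; open-source models are assumed to be served via Together AI).
    llm = jbb.LLMLiteLLM(
        model_name="vicuna-13b-v1.5",
        api_key=os.environ["TOGETHER_API_KEY"],
    )
    responses = llm.query(prompts=["<candidate jailbreak prompt>"], behavior="phishing")
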
We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community. Our jailbreak artifacts can expedite safety training for future models. Over time, we will expand and adapt the benchmark to reflect technical and methodological advances in the research community.
Available Leaderboards

Leaderboard: Open-Source Models

Leaderboard: Closed-Source Models

Contribute to JailbreakBench

We welcome contributions of both new attacks and defenses. Please see the contribution guides for attacks and for defenses in our GitHub repository for more details. Feel free to contact us with any questions or suggestions.
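
As rough orientation for attack submissions, the sketch below follows the workflow described in the repository's README: your attack produces one prompt per behavior and target model, the library queries and judges the responses, and a submission archive is assembled. The evaluate_prompts and create_submission calls, the "MyAttack" name, and the empty method_params are illustrative assumptions; the contribution guides in the repository are authoritative on the exact arguments.

    import jailbreakbench as jbb

    # Prompts produced by your attack: one per target model and per behavior,
    # where behavior names must match those in JBB-Behaviors.
    all_prompts = {
        "vicuna-13b-v1.5": {"Phishing": "<your candidate jailbreak prompt>"},
    }

    # Query the target models under the benchmark's threat model, judge the
    # responses, and write the logs needed for a reproducible submission.
    evaluation = jbb.evaluate_prompts(all_prompts, llm_provider="litellm")

    # Package the logs and jailbreak artifacts for a pull request to the benchmark.
    jbb.create_submission(
        method_name="MyAttack",        # hypothetical attack name
        attack_type="black_box",
        method_params={},              # your attack's hyperparameters
    )
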

Citation

If you use the JBB-Behaviors dataset, we ask that you consider citing the original works from which some of its behaviors are sourced: AdvBench, the Trojan Detection Challenge 2023 Red Teaming Track/HarmBench, and Shah et al. (2023).

In addition, please cite our whitepaper if you use the dataset, reference our leaderboard, or use our evaluation library:
@misc{chao2024jailbreakbench,
        title={JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models},
        author={Patrick Chao and Edoardo Debenedetti and Alexander Robey and Maksym Andriushchenko and Francesco Croce and Vikash Sehwag and Edgar Dobriban and Nicolas Flammarion and George J. Pappas and Florian Tramèr and Hamed Hassani and Eric Wong},
        year={2024},
        eprint={2404.01318},
        archivePrefix={arXiv},
        primaryClass={cs.CR}
}