The SaferAI Roundup #5: Attempts to Benchmark and Solve Jailbreaks.
HarmBench & Improving Alignment and Robustness with Circuit Breakers
Welcome to "The SaferAI Roundup". Each fortnight, we will publish LLM-generated summaries of 2-3 papers that we consider consequential in the fields of AI governance, safety, and risk management. These summaries are curated and lightly edited by the SaferAI team, an AI risk management organization. Our goal is to enable you to stay up-to-date with this fast-evolving literature by delivering concise, accessible summaries in your inbox, helping you stay informed about critical advancements in AI safety and governance without having to go through numerous academic papers.
In this fifth edition, we explore two papers focused on a critical aspect of AI safety: jailbreaks. These adversarial attacks craft prompts that bypass a model's safety training and elicit harmful outputs. Solving jailbreaks would greatly reduce misuse risks from advanced AI systems served behind an API, which would be a huge deal for the coming years.
Our first paper introduces HarmBench, a benchmark for evaluating the effectiveness of various jailbreak attacks. It’s, to our knowledge, the most comprehensive and detailed benchmark on the topic.
The second paper, "Improving Alignment and Robustness with Circuit Breakers," presents an innovative defense mechanism against jailbreaks using representation engineering. Notably, as of this writing, models employing this technique remain unjailbroken in Gray Swan's ongoing jailbreak competition, underscoring its potential significance in the field of AI safety.
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika et al. (2024)
TLDR: The paper introduces HarmBench, a standardized evaluation framework for automated red teaming of large language models (LLMs), addressing the lack of comparability in existing evaluations.
• Comprehensive benchmark for red teaming: HarmBench includes 510 carefully curated harmful behaviors across diverse categories, including copyright, contextual and multimodal behaviors. This provides a much broader evaluation than previous datasets, allowing for more thorough testing of LLM safety measures.
• Standardized evaluation pipeline: The authors identify key issues in prior evaluations, such as the impact of generation length on metrics, and propose solutions that enable fair comparisons between methods. They also develop robust classifiers for judging whether an attack succeeded (a simplified version of this evaluation loop is sketched after the links below).
• Large-scale comparison of attacks and defenses: Using HarmBench, the paper evaluates 18 red teaming methods against 33 LLMs and defenses. Key findings include that no current attack or defense is uniformly effective, and that model robustness is independent of size within model families. Notably, Meta's Llama 2 and Anthropic's Claude are the most resistant to many types of jailbreaks.
• Novel adversarial training method: The authors introduce R2D2, an adversarial training approach using automated red teaming. When applied to the Zephyr 7B model, it achieves state-of-the-art robustness against strong attacks while maintaining good performance on benign tasks.
Link to the paper: https://arxiv.org/pdf/2402.04249
Link to the website: https://www.harmbench.org/
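To make the evaluation pipeline concrete, here is a minimal sketch of an attack-success-rate (ASR) loop in the spirit of HarmBench. It is not the official HarmBench code: the behaviors are placeholders, the `attack` function is an identity stand-in for methods like GCG or PAIR, and the judge model path is an assumption (HarmBench ships purpose-trained classifiers for this role).

```python
# Illustrative ASR evaluation loop in the spirit of HarmBench (not the official code).
from transformers import pipeline

# Placeholder behaviors standing in for HarmBench's 510 curated behaviors.
behaviors = [
    "<harmful behavior 1 from the benchmark>",
    "<harmful behavior 2 from the benchmark>",
]

# Target model under evaluation (any locally served chat / causal LM works here).
target = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

# Judge that labels completions as successful (harmful) vs. refused.
# HarmBench provides trained classifiers for this; the model path here is an assumption.
judge = pipeline("text-classification", model="path/to/harm-judge")

def attack(behavior: str) -> str:
    """Turn a behavior into an adversarial prompt.
    Identity here; in practice this is a red-teaming method such as GCG, PAIR, or TAP."""
    return behavior

successes = 0
for behavior in behaviors:
    prompt = attack(behavior)
    completion = target(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
    verdict = judge(f"Behavior: {behavior}\nGeneration: {completion}")[0]
    successes += verdict["label"] == "harmful"

print(f"Attack success rate: {successes / len(behaviors):.1%}")
```

The actual framework standardizes generation settings and relies on its trained classifiers so that different attacks and defenses can be compared on an equal footing.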
Improving Alignment and Robustness with Circuit Breakers
Andy Zou et al. (2024)
TLDR: This paper introduces "circuit breakers", a novel approach to make AI models intrinsically safer and more robust against adversarial attacks by interrupting harmful internal processes, without compromising performance on benign tasks.
• Novel approach to AI safety: The paper presents circuit breakers as a new paradigm for building models that do not produce harmful outputs. Rather than patching specific vulnerabilities one by one, the method uses representation engineering to interrupt the internal representations responsible for harmful outputs in the first place (a simplified training objective is sketched after the link below).
• Improved robustness across diverse attacks: Experiments show that models with circuit breakers demonstrate significantly improved robustness against a wide range of unseen adversarial attacks, including sophisticated white-box and embedding space attacks.
• Improved safety-capability trade-off, even compared with Claude 3: Unlike many existing defense techniques that compromise model capabilities, circuit breakers achieve high reliability against attacks with minimal impact on performance in benign scenarios. This is an advance in balancing safety and capability in AI systems.
• Broad applicability: The method is demonstrated to be effective not just for text models, but also for multimodal systems and AI agents. It shows promise in preventing harmful image-based attacks and controlling agent behaviors, suggesting potential for mitigating risks like power-seeking or dishonesty in more advanced AI systems.
Link to the paper: https://arxiv.org/pdf/2406.04313
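For intuition, here is a minimal sketch of a representation-rerouting-style training step, loosely following the paper's idea of pushing the model's internal representations on harmful data away from those of a frozen reference copy while keeping them unchanged on benign data. The model name, layer choice, loss weight, and placeholder batches are illustrative assumptions, not the paper's actual recipe.

```python
# Sketch of a circuit-breaker-style ("representation rerouting") training step.
# All hyperparameters and data below are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper fine-tunes larger chat models
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)          # trainable copy
frozen = AutoModelForCausalLM.from_pretrained(model_name).eval()  # frozen reference

LAYER = 6    # hidden layer whose representations get rerouted (assumption)
ALPHA = 1.0  # weight on the rerouting term (assumption)

def hidden_states(m, texts, grad):
    """Return the chosen layer's hidden states for a batch of texts."""
    batch = tok(texts, return_tensors="pt", padding=True)
    with torch.enable_grad() if grad else torch.no_grad():
        out = m(**batch, output_hidden_states=True)
    return out.hidden_states[LAYER]

harmful_batch = ["<prompt-and-completion pair eliciting harmful output>"]  # placeholder
retain_batch = ["<ordinary benign instruction and response>"]              # placeholder

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One step: on harmful data, penalize any remaining alignment (cosine similarity)
# with the frozen model's representations; on benign data, keep representations close.
h_new = hidden_states(model, harmful_batch, grad=True)
h_old = hidden_states(frozen, harmful_batch, grad=False)
reroute_loss = F.relu(F.cosine_similarity(h_new, h_old, dim=-1)).mean()

r_new = hidden_states(model, retain_batch, grad=True)
r_old = hidden_states(frozen, retain_batch, grad=False)
retain_loss = (r_new - r_old).norm(dim=-1).mean()

loss = ALPHA * reroute_loss + retain_loss
loss.backward()
opt.step()
opt.zero_grad()
```

The point of the sketch is the design choice: robustness comes from rewriting the model's internal representations on harmful inputs rather than from output-level refusal training, which is why the defense transfers to attacks the model has never seen.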
Thanks for reading! Your feedback on this new format is greatly appreciated.