The SaferAI Roundup #3: Technical Efforts to Make Safe Open Model Weights Possible
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models & Tamper-Resistant Safeguards for Open-Weight LLMs
Welcome to "The SaferAI Roundup". Each fortnight, we will publish LLM-generated summaries of 2-3 papers that we consider consequential in the fields of AI governance, safety, and risk management. These summaries are curated and lightly edited by the SaferAI team, an AI risk management organization.
Our goal is to help you stay up to date with this fast-evolving literature by delivering concise, accessible summaries to your inbox, so you can follow critical advances in AI safety and governance without having to go through numerous academic papers.
Now, let's dive into the two papers we selected for this third edition. They share one common goal: moving towards a world where model weights can be openly released safely. Currently, it is trivial to remove any safety mitigations applied to open model weights; these papers propose early methods to make that much harder. If such methods start to work, it could be a game-changer for AI governance.
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
Peter Henderson, Eric Mitchell, et al. (2023)
TLDR: The paper proposes "self-destructing models" as a technical approach to mitigate harmful dual uses of foundation models, using meta-learning to impede adaptation to undesirable tasks while preserving performance on desired tasks.
• Current mitigation strategies are insufficient: The authors review existing approaches such as export controls, access restrictions, and safety tuning, arguing that each has limitations, especially as powerful open-source models become widely available. They argue for new technical strategies that make the model parameters themselves less useful for harmful purposes.
• Task blocking paradigm introduced: The paper defines the "task blocking" problem of creating models that increase the costs of fine-tuning on harmful tasks without sacrificing performance on desirable tasks. They call the resulting models "self-destructing models" as they impede adaptation to harmful uses.
• Meta-Learned Adversarial Censoring (MLAC) algorithm: The authors present an algorithm for training self-destructing models that combines techniques from meta-learning and adversarial learning. MLAC aims to encode a blocking mechanism into the network's initialization that prevents effective adaptation to harmful tasks (a minimal sketch of the inner/outer loop follows this list).
• Proof-of-concept experiment: In a small-scale experiment using a BERT-style model, MLAC largely prevented the model from being repurposed for gender identification (harmful task) without harming its ability to perform profession classification (desired task). This demonstrates the potential of the approach, though more work is needed to scale to larger models and more complex settings.
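To make the mechanism concrete, here is a minimal sketch of what an MLAC-style inner/outer loop could look like in PyTorch. This is our illustration, not the authors' code: the toy classifier, the single simulated adversary, and all hyperparameters (inner_lr, steps, block_weight) are assumptions chosen for readability.

```python
# Hedged sketch of meta-learned task blocking in the spirit of MLAC.
# Toy model, data, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


def simulated_finetune(model, params, x, y, inner_lr=1e-2, steps=3):
    """Simulate an adversary fine-tuning the released weights on the blocked task.

    Uses functional (out-of-place) updates with create_graph=True so gradients
    can flow back to the initialization, MAML-style."""
    adapted = dict(params)
    for _ in range(steps):
        logits = torch.func.functional_call(model, adapted, (x,))
        loss = F.cross_entropy(logits, y)
        grads = torch.autograd.grad(loss, list(adapted.values()), create_graph=True)
        adapted = {name: p - inner_lr * g
                   for (name, p), g in zip(adapted.items(), grads)}
    return adapted


def mlac_step(model, desired_batch, harmful_batch, outer_opt, block_weight=1.0):
    """One outer-loop update: keep the desired task easy, make the harmful task
    hard to recover even after simulated fine-tuning."""
    xd, yd = desired_batch
    xh, yh = harmful_batch
    params = dict(model.named_parameters())

    # Desired-task loss on the current weights (minimized).
    desired_loss = F.cross_entropy(model(xd), yd)

    # Harmful-task loss *after* the simulated adversary adapts (maximized).
    adapted = simulated_finetune(model, params, xh, yh)
    harmful_loss = F.cross_entropy(
        torch.func.functional_call(model, adapted, (xh,)), yh)

    outer_loss = desired_loss - block_weight * harmful_loss
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
    return desired_loss.item(), harmful_loss.item()


# Toy usage on random data with a tiny classifier (illustrative only).
if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    desired = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
    harmful = (torch.randn(8, 16), torch.randint(0, 2, (8,)))
    for _ in range(5):
        d_loss, h_loss = mlac_step(model, desired, harmful, opt)
    print(f"desired loss {d_loss:.3f}, post-attack harmful loss {h_loss:.3f}")
```

The key design choice is the create_graph=True inner loop: it lets the outer update differentiate through the adversary's fine-tuning, so the initialization itself is pushed towards regions where adaptation to the blocked task stays expensive.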
Link to the paper: https://arxiv.org/pdf/2211.14946
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, et al. (2024)
TLDR: The paper introduces a method called TAR (Tampering Attack Resistance) for creating tamper-resistant safeguards in open-weight large language models (LLMs). This addresses the vulnerability of existing safeguards to attacks that modify model weights.
• Novel tamper-resistant training approach: The authors develop TAR, which uses adversarial training and meta-learning to make LLM safeguards robust against tampering attacks. This lets model developers add safeguards that cannot be easily removed even after thousands of fine-tuning steps, while preserving general model capabilities (a sketch of the outer training step follows this list).
• Effective across multiple safeguard types: TAR is demonstrated to be effective for both weaponization knowledge restriction and harmful request refusal safeguards. The method significantly outperforms existing approaches in resisting a wide range of tampering attacks, including those using different optimizers, learning rates, and fine-tuning strategies.
• Extensive red teaming evaluation: The authors conduct rigorous testing with up to 28 distinct adversaries, varying parameters like optimization steps, learning rates, and datasets. This stress-testing shows TAR maintains robustness against most attacks, though some vulnerabilities remain (e.g. to parameter-efficient fine-tuning).
• Implications for open-weight LLM deployment: The results suggest tamper-resistance is a tractable problem, opening new possibilities for safer release of open-weight LLMs. This could help address dual-use concerns and liability risks for model developers, though the authors note tamper-resistance alone is not a complete solution to malicious AI use.
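TAR's training loop shares the simulate-the-adversary structure of the MLAC sketch above. The sketch below, which reuses the simulated_finetune helper defined there, highlights the two ingredients the summary mentions: attacks sampled from a distribution of fine-tuning configurations, and a retain loss that preserves benign capabilities. The attack grid, the negative cross-entropy used as a stand-in for the paper's tamper-resistance loss, and the loss weights are our assumptions, not the authors' implementation.

```python
# Hedged sketch of a TAR-style outer step; reuses simulated_finetune from the
# MLAC sketch above. All configs and losses are illustrative placeholders.
import random
import torch
import torch.nn.functional as F


def tar_style_step(model, retain_batch, forget_batch, outer_opt,
                   attack_lrs=(1e-3, 1e-2), attack_steps=(1, 3), tr_weight=1.0):
    """One outer update: sample a tampering attack, push the post-attack loss on
    restricted data up, and keep a retain loss on benign data low."""
    xr, yr = retain_batch          # benign data whose performance we preserve
    xf, yf = forget_batch          # restricted data an adversary fine-tunes on
    params = dict(model.named_parameters())

    # Sample one adversary from a small distribution of fine-tuning configs.
    adapted = simulated_finetune(
        model, params, xf, yf,
        inner_lr=random.choice(attack_lrs),
        steps=random.choice(attack_steps),
    )

    # Tamper-resistance term: the restricted task should still fail after the
    # attack (negative cross-entropy is a simple stand-in for the paper's loss).
    post_attack_logits = torch.func.functional_call(model, adapted, (xf,))
    tamper_resistance_loss = -F.cross_entropy(post_attack_logits, yf)

    # Retain term: ordinary performance on benign data is preserved.
    retain_loss = F.cross_entropy(model(xr), yr)

    loss = retain_loss + tr_weight * tamper_resistance_loss
    outer_opt.zero_grad()
    loss.backward()
    outer_opt.step()
    return retain_loss.item(), -tamper_resistance_loss.item()
```

The paper applies this idea at LLM scale with stronger attack distributions and additional regularization; the sketch only conveys the shape of the objective, namely a retain loss plus a term computed after a simulated tampering attack.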
Link to the paper: https://arxiv.org/pdf/2408.00761
Thanks for reading! Your feedback on this new format is greatly appreciated.