Dear readers,
The landscape of AI governance has evolved considerably since our first edition of Navigating AI Risks in April 2023. With the advent of major regulatory efforts such as the EU AI Act, the G7 Hiroshima AI Process, and the US Executive Order on AI, the central question has shifted from whether we should regulate AI to reduce risks, to how we can most effectively do so.
In this new context, it's more important than ever that governance and policy efforts are informed by state-of-the-art scientific literature to ensure we're addressing risks in the most effective ways.
To better meet this need, and to adopt a format that fits our current capacity, we've decided to revamp our newsletter.
Welcome to "The SaferAI Roundup". Each fortnight, we will publish LLM-generated summaries of 2-3 papers that we consider consequential in the fields of AI governance, safety, and risk management. These summaries are curated and lightly edited by the team at SaferAI, an AI risk management organization.
Our goal is to help you stay up to date with this fast-evolving literature: concise, accessible summaries of critical advances in AI safety and governance, delivered straight to your inbox, so you can stay informed without having to wade through numerous academic papers.
Now, let's dive into the two papers we selected for this first edition.
Evaluating Frontier Models for Dangerous Capabilities
Google DeepMind
TLDR: The paper presents a comprehensive evaluation program for assessing dangerous capabilities in frontier AI models, covering persuasion/deception, cybersecurity, self-proliferation, and self-reasoning. Testing Gemini models, the authors find limited but concerning capabilities in some areas, highlighting the need for ongoing rigorous evaluation.
• Novel evaluation methodology: The paper introduces new evaluation techniques across multiple risk domains. For example, the persuasion tasks involve extended human-AI dialogues to test real-world manipulation skills. The self-proliferation evaluations use a milestone-based approach and measure the "expert bits" needed for task completion to quantify capability levels (see the illustrative sketch after this list).
• Limited but concerning capabilities: While the Gemini models tested did not demonstrate strong dangerous capabilities overall, some concerning abilities emerged. For instance, in persuasion tasks, models showed competence in social manipulation tactics. In cybersecurity, models could solve basic capture-the-flag challenges but struggled with more complex multi-step attacks.
• Self-reasoning as a key risk factor: The evaluations on self-reasoning and self-modification revealed very limited capabilities in current models, but the authors emphasize this as an important area for ongoing assessment. Models that can reason about and modify themselves could pose unique risks.
• Forecasting dangerous capabilities: The paper includes expert forecasts on when models might attain concerning capability levels. For example, the median estimate for models solving 50% of medium-difficulty cybersecurity challenges was late 2028. This forecasting approach aims to provide early warning of emerging risks.
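To make the milestone-and-expert-bits idea above a bit more concrete, here is a minimal sketch of how such scores could be aggregated. The names, numbers, and scoring choices are hypothetical illustrations on our part, not the paper's actual evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class Milestone:
    name: str
    completed: bool
    expert_bits: float  # bits of expert guidance the model needed (0 if unaided)

def score_task(milestones: list[Milestone]) -> dict:
    """Aggregate a milestone-based evaluation into two simple metrics:
    the fraction of milestones completed and the total expert help consumed."""
    completed = [m for m in milestones if m.completed]
    return {
        "milestone_completion_rate": len(completed) / len(milestones),
        "total_expert_bits": sum(m.expert_bits for m in completed),
    }

# Hypothetical self-proliferation task: the model finished two of three
# milestones, one of them only after roughly 12 bits of expert hints.
task = [
    Milestone("provision_server", True, 0.0),
    Milestone("install_agent", True, 12.0),
    Milestone("replicate_to_second_host", False, 0.0),
]
print(score_task(task))
```

The intuition: a higher completion rate with fewer expert bits indicates a more capable (and potentially more concerning) model.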
Link to the paper: https://arxiv.org/pdf/2403.13793
Risk Thresholds for Frontier AI
Leonie Koessler et al.
TLDR: This paper proposes using risk thresholds to inform high-stakes AI development and deployment decisions, either directly or by helping set capability thresholds. It argues for their potential benefits while acknowledging current limitations in risk estimation.
• Risk thresholds complement other approaches. The paper distinguishes between risk thresholds (based on likelihood and severity of harm), capability thresholds (based on model abilities), and compute thresholds (based on training resources). It argues risk thresholds are more principled but harder to evaluate reliably than capability thresholds, while compute thresholds should only serve as an initial filter.
• Two key use cases are outlined. Risk thresholds can directly inform go/no-go decisions by comparing risk estimates to predefined acceptable levels (see the illustrative sketch after this list). They can also indirectly inform decisions by helping set capability thresholds that keep risk below acceptable levels. The paper recommends using both approaches in combination.
• Benefits and challenges are weighed. Risk thresholds may help align business decisions with societal concerns, enable consistent resource allocation, and prevent motivated reasoning. However, estimating AI risks is extremely difficult, and defining acceptable risk levels involves complex value judgments. The paper suggests risk thresholds should inform but not solely determine decisions until risk estimation improves.
• Guidance for implementation is provided. The paper outlines key considerations for defining AI risk thresholds, including specifying the type of risk (e.g., fatalities vs. economic damage), temporal and geographic scope, and how to weigh potential harms against benefits. It emphasizes the need for clear rules on what types of harm and causation are in scope.
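To make the direct use case more tangible, here is a minimal sketch of a go/no-go check that compares an expected-harm estimate against a predefined acceptable level. The multiplication of likelihood and severity, the function name, and the figures are illustrative assumptions of ours, not the paper's prescription.

```python
def exceeds_risk_threshold(likelihood: float, severity: float, threshold: float) -> bool:
    """Return True if the expected harm (likelihood x severity) exceeds
    the predefined acceptable level."""
    return likelihood * severity > threshold

# Hypothetical deployment decision: a 0.1% annual likelihood of an incident
# causing 1,000 fatalities, weighed against an acceptable level of 0.5
# expected fatalities per year.
decision = "no-go" if exceeds_risk_threshold(likelihood=0.001, severity=1_000, threshold=0.5) else "go"
print(decision)
```

In practice, as the paper stresses, both the risk estimate and the acceptable level involve substantial uncertainty and value judgments, which is why such thresholds should inform rather than solely determine decisions.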
Link to the paper: https://arxiv.org/pdf/2406.14713
Thanks for reading! Your feedback on this new format is greatly appreciated.