The SaferAI Roundup #4: Capabilities Improvement and Safety Testing of GPT-4o and Claude 3.5
GPT-4o System Card & Claude 3.5 Sonnet Model Card Addendum
Welcome to "The SaferAI Roundup". Each fortnight, we will publish LLM-generated summaries of 2-3 papers that we consider consequential in the fields of AI governance, safety, and risk management. These summaries are curated and lightly edited by the SaferAI team, an AI risk management organization.
Our goal is to enable you to stay up-to-date with this fast-evolving literature by delivering concise, accessible summaries in your inbox, helping you stay informed about critical advancements in AI safety and governance without having to go through numerous academic papers.
Now, let's dive into the two documents we selected for this fourth edition: the model cards for two of the most recent large language models, GPT-4o and Claude 3.5 Sonnet. These documents detail the capability improvements of both models, along with the safety testing procedures they underwent and the results of those assessments.
GPT-4o System Card
OpenAI (2024)
TLDR: OpenAI has released GPT-4o, an advanced AI model that accepts combinations of text, audio, image, and video as input and generates text, audio, and image outputs. This system card outlines its capabilities, limitations, and safety measures.
• Model capabilities and training: GPT-4o is an end-to-end trained omni model that can handle multiple input/output modalities. It shows improved performance across various benchmarks, especially in non-English languages and audio-visual tasks. The model was trained on diverse datasets up to October 2023, including web data, code, math, and multimodal information.
• Risk assessment and mitigation: OpenAI conducted extensive risk identification, assessment, and mitigation work, including external red teaming with more than 100 experts organized into 4 phases. These experts were tasked with probing for potential new capabilities and risks, and with stress-testing the safety measures as they were implemented.
• Safety evaluations: The model underwent safety evaluations, including the Preparedness Framework assessments for cybersecurity, biological threats, persuasion, and model autonomy; third-party assessments were also conducted to validate key risk areas, and GPT-4o was classified as medium risk overall. For example, the biological threat evaluation protocol measures how much access to the model increases participants' success rates on 5 subtasks covering the main stages of the biological threat creation process: ideation, acquisition, magnification, formulation, and release. GPT-4o provides an uplift of a few percentage points on each subtask compared to using the internet alone. One interesting observation is that model access enables some success on the 'magnification' subtask, which internet-only participants were unable to complete at all. (A toy sketch of this uplift computation appears after this list.)
• Societal impacts: The system card discusses potential societal impacts, including risks of anthropomorphization and emotional reliance, implications for healthcare and scientific research, and performance improvements in underrepresented languages. While the model shows promise in various fields, the authors emphasize the need for continued monitoring and research into long-term effects and ethical considerations.
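To make the uplift protocol above concrete, here is a minimal Python sketch of the computation it implies. This is not OpenAI's evaluation code, and every number below is a made-up placeholder (the system card reports the actual results); it simply compares per-subtask success rates between an internet-only control group and a model-assisted group.

```python
# Minimal sketch of an "uplift" computation: the difference in success
# rate between a model-assisted group and an internet-only control group
# on each subtask. NOT OpenAI's evaluation code; all numbers are
# hypothetical placeholders, not real results.

STAGES = ["ideation", "acquisition", "magnification", "formulation", "release"]

# (successes, participants) per condition -- hypothetical values
internet_only  = {"ideation": (4, 20), "acquisition": (3, 20),
                  "magnification": (0, 20), "formulation": (2, 20),
                  "release": (1, 20)}
model_assisted = {"ideation": (6, 20), "acquisition": (4, 20),
                  "magnification": (2, 20), "formulation": (3, 20),
                  "release": (2, 20)}

for stage in STAGES:
    s0, n0 = internet_only[stage]
    s1, n1 = model_assisted[stage]
    uplift = s1 / n1 - s0 / n0  # percentage-point uplift from model access
    print(f"{stage:13s} internet-only {s0 / n0:4.0%}  "
          f"model-assisted {s1 / n1:4.0%}  uplift {uplift:+5.0%}")
```

Note how the hypothetical 'magnification' row goes from 0% without the model to a nonzero rate with it, mirroring the observation in the card.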
Link to the paper: https://cdn.openai.com/gpt-4o-system-card.pdf
Claude 3.5 Sonnet Model Card Addendum
Anthropic (2024)
TLDR: Anthropic introduces Claude 3.5 Sonnet, an AI model that outperforms its predecessor Claude 3 Opus while being faster and more cost-effective. The model card details improvements in capabilities and safety evaluations.
• Improved performance across benchmarks: Claude 3.5 Sonnet demonstrates superior performance on various industry-standard benchmarks for reasoning, coding, and question-answering. It sets new standards in graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval). The model also excels in vision-related tasks, outperforming previous Claude models on five standard vision benchmarks.
• Enhanced agentic coding abilities: In an internal evaluation testing the model's ability to understand and modify open-source codebases, Claude 3.5 Sonnet solves 64% of problems, a significant improvement over Claude 3 Opus's 38%. The test mimics real-world software engineering scenarios, involving multiple file interactions and iterative self-correction, showcasing the model's advanced coding capabilities (a hypothetical sketch of such an evaluation loop appears after this list).
• Improved safety and refusal mechanisms: Safety evaluations show that Claude 3.5 Sonnet has better-calibrated refusal behavior than its predecessors: it incorrectly refuses fewer harmless prompts while maintaining appropriate caution on genuinely harmful ones (a second sketch below illustrates the two metrics at play).
• Safety testing: Anthropic conducted safety evaluations according to its Responsible Scaling Policy, focusing on Chemical, Biological, Radiological, and Nuclear (CBRN) risks, cybersecurity, and autonomous capabilities. The document provides little detail on the results of this risk assessment beyond the classification of the model as AI Safety Level 2 (ASL-2), meaning it does not cross the capability thresholds that would require stronger safeguards under Anthropic's framework. Additionally, the UK AI Safety Institute conducted pre-deployment assessments and shared its results with the US AI Safety Institute.
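Returning to the agentic coding bullet above: the addendum does not publish Anthropic's evaluation harness, but a loop of the kind it describes (propose a patch, run the tests, feed failures back to the model) might look roughly like the sketch below. `ask_model` and `run_tests` are hypothetical stand-ins supplied by the caller, not real APIs.

```python
# Hypothetical sketch of an iterative agentic-coding evaluation loop.
# Anthropic's internal harness is not public; ask_model and run_tests
# are illustrative stand-ins passed in by the caller, not real APIs.

def apply_patch(files: dict[str, str], patch: dict[str, str]) -> dict[str, str]:
    """Apply a patch expressed as {filename: new_contents} (illustrative)."""
    return {**files, **patch}

def solve_issue(issue: str, repo_files: dict[str, str],
                ask_model, run_tests, max_turns: int = 5) -> bool:
    """Let the model propose patches and self-correct against test output."""
    feedback = ""
    for _ in range(max_turns):
        # The model sees the issue, the current files, and prior test output.
        patch = ask_model(issue=issue, files=repo_files, feedback=feedback)
        repo_files = apply_patch(repo_files, patch)
        passed, feedback = run_tests(repo_files)  # e.g. a pytest run
        if passed:
            return True  # this problem counts toward the pass rate
    return False  # out of turns without a green test suite
```

The fraction of problems for which `solve_issue` returns True is the kind of pass rate the 64% and 38% figures summarize.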
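Similarly, the refusal results above boil down to two complementary metrics: the rate of incorrect refusals on harmless prompts and the rate of compliance on harmful ones. A hypothetical scoring helper (the field names and toy data are illustrative, not Anthropic's) could look like:

```python
# Illustrative scoring of the two refusal metrics in tension; the field
# names and the toy data are hypothetical, not Anthropic's.

def refusal_rates(results: list[dict]) -> tuple[float, float]:
    """results: one {"harmful": bool, "refused": bool} record per prompt."""
    harmless = [r for r in results if not r["harmful"]]
    harmful  = [r for r in results if r["harmful"]]
    # Incorrect refusals: benign prompts the model declined to answer.
    false_refusal = sum(r["refused"] for r in harmless) / len(harmless)
    # Missed refusals: harmful prompts the model answered anyway.
    harmful_compliance = sum(not r["refused"] for r in harmful) / len(harmful)
    return false_refusal, harmful_compliance

toy = [{"harmful": False, "refused": True},   # over-refusal
       {"harmful": False, "refused": False},  # correct answer
       {"harmful": True,  "refused": True}]   # correct refusal
print(refusal_rates(toy))  # (0.5, 0.0)
```

A better-calibrated model pushes both numbers down at once, which is the improvement the addendum claims.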
Link to the paper: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf