#10 - Voluntary Commitments for Safe AI Development + Inter-Lab Cooperation
Welcome to Navigating AI Risks, where we explore how to govern the risks posed by transformative artificial intelligence.
In the Loop
Leading AI Companies' 7 Commitments for Safe AI Development
On July 21, the White House announced a set of voluntary commitments made by 7 leading AI labs, as part of the Biden administration’s ongoing efforts to set guardrails on AI development. According to the announcement, these commitments are a “first step in developing and enforcing binding obligations” to ensure “safety, security, and trust”.
Wide-ranging commitments: Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI commit to:
Internal and external red-teaming of models or systems in areas including misuse, societal risks, and national security concerns (such as bio, cyber, and other safety areas).
Establish a forum to develop, advance, and adopt shared standards and best practices for frontier AI safety, and facilitate information-sharing between companies and with the government
Invest in cybersecurity and insider threat safeguards to protect proprietary and unreleased model weights
Incent third-party discovery and reporting of issues and vulnerabilities through bounty systems, contests, or prizes.
Develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated
Publicly report model or system capabilities, limitations, and domains of appropriate and inappropriate use, including discussion of societal risks
Prioritize research on societal risks posed by AI systems
Develop and deploy frontier AI systems to help address society’s greatest challenges
Restricted scope: The commitments will apply to frontier models only, i.e. “models that are overall more powerful than any currently released models”. Although this is designed to avoid hampering AI innovation by allowing newcomers in the industry to have more leeway in how they develop systems, the exact scope is unclear: powerful according to which measurement? Number of tasks? Performance on key benchmarks? Number of parameters? Many see the exclusion of existing models as worrisome.
Flurry of activity: Shortly before and after the announcement, the 7 AI labs released more details on how they are implementing the commitments. Anthropic has been particularly active; the company quickly published its approach to red-teaming and to the cybersecurity of frontier models. In late July, Google DeepMind released work on using AI for healthcare and to fight climate change (as has Microsoft). Inflection will “soon publish further work on safety and announce a number of significant collaborations”. Although OpenAI is ahead of most other companies on AI safety (with the possible exception of Anthropic), having let an external organization [2] red-team its latest AI model, it seems not to have made any new commitments or published new details about its approach to AI safety since the announcement. Finally, Microsoft released a document outlining more specific commitments than those announced by the White House.
Mixed reactions: Many reactions were positive. After all, the commitments encompass many of the measures that the AI risk community has been calling for in the past few years. Margaret Mitchell, a leading computer scientist working on algorithmic bias and fairness who is usually critical of current US government efforts to regulate AI, praised the commitments, saying they are exactly what she would recommend companies do. Others think there are notable gaps (the commitment to watermark AI-generated content doesn't apply to text [3]; many companies have been downsizing and disempowering trust and safety teams, despite commitment #7, which calls for strengthening them). Others see this largely as a PR exercise designed to forestall binding regulation. Importantly, there are no specific deadlines or transparency requirements to report progress to the public or regulators, which may undermine accountability.
Transformative risks are a key focus of the commitments: Refreshingly, several commitments would help address emerging risks linked to frontier AI development. Commitments to red-team models, adopt shared standards for AI safety, ensure the cybersecurity of models, and adopt reporting requirements would all be very beneficial in dealing with transformative AI development.
What will happen if commitments are not upheld? A US Federal Trade Commission official told Cat Zakrzewski, a reporter at the Washington Post, that “breaking from a public commitment can be considered a deceptive practice, which would run afoul of existing consumer protection law”. To give them even more teeth, several commitments could be made legally binding through an executive order that would mandate them for companies that want to sell their services to the federal government. Indeed, as stated in the announcement, “companies intend these voluntary commitments to remain in effect until regulations covering substantially the same issues come into force.” A growing consensus among AI risk researchers about the necessary guardrails around AI development suggests the US could set such robust legal rules on AI.
Toward a global code of conduct: Three days after the announcement, the US Secretaries of State and Commerce outlined the US strategy for international AI governance. The White House is looking to create an international code of conduct for AI. The idea is to agree on key principles, first with the EU through the Trade & Technology Council (NAIR #6), and then at the multilateral level, as part of the G7’s “Hiroshima AI Process”. The voluntary commitments will likely inform the US position in these upcoming negotiations. EU Commissioner Vestager, in charge of digital policy, welcomed the commitments and said the international code of conduct will indeed build on them. The EU is preparing a similar set of commitments, the so-called ‘AI Pact’, to help AI companies prepare for compliance with the AI Act.
Inter-Lab Cooperation on AI Safety: The Frontier Model Forum
One of the voluntary commitments listed above has already been implemented by several companies: the creation of a forum to develop AI safety standards and best practices and to share information with policymakers.
What’s the idea? The forum has 4 core objectives:
(1) Advancing AI safety research (including to enable independent, standardized evaluations of capabilities and safety).
(2) Identifying best practices for responsible frontier model development and deployment.
(3) Collaborating with policymakers, academics, civil society and companies to share knowledge about trust and safety risks.
(4) Supporting efforts to develop applications to meet societal challenges (such as climate mitigation or cancer prevention).
What will it really do? How much the forum will contribute to these goals, and how, remains unclear. For now, it seems that the forum will be (i) a catalyst for coordinated action between these leading companies, and (ii) an information-sharing mechanism, both between companies and between companies and other stakeholders (governments, civil society, etc.).
In the area of safety research, the forum will catalyze efforts to identify “the most important open research questions on AI safety”, “coordinate research”, and develop a “public library of technical evaluations and benchmarks for frontier AI models”. These all seem to be very sensible goals, considering the lack of AI safety research and relevant benchmarks.
The forum will also be used to share information, for example to build robust best practices for safety standards, or disclose relevant information to governments. As AI capabilities continue to progress, regular communication channels between companies and governments become a crucial part of a broader AI risk management ecosystem. The forum could also be used as a platform to plan for a temporary halt to large-scale AI development (NAIR #1).
The Forum also seems keen on being involved in other global AI governance efforts, citing the G7 Hiroshima process, the OECD, and the US-EU Trade and Technology Council. An interesting question is whether these government-led initiatives will accept greater private sector involvement. Traditionally, they wouldn’t, but the economic and regulatory power of companies that participate in the forum might change things.
Notable absences: The Frontier Model Forum gathers Anthropic, Google, Microsoft, and OpenAI. 3 companies that signed the voluntary commitments are notably absent: Meta, Inflection, and Amazon. Although the forum remains open to organizations that fulfill its membership criteria [4], this is a bad sign. The 4 participating companies have been much more active in promoting discussions around AI safety than the 3 non-participating ones. It remains to be seen whether the latter will take the plunge.
(Likely) antitrust-compliant: A long-time worry of the AI risk community has been that US antitrust laws would prevent companies from coordinating on safety measures and research (see this paper for a great analysis of the nexus between antitrust law and AI safety). That the forum was launched as part of the voluntary commitments is a good signal that the White House gave its consent to this approach.
Next steps: The first step of the newly-created Frontier Model Forum is to launch an Advisory Board and start work on a “charter, governance and funding with a working group and executive board to lead these efforts”.
A Growing Consensus on AI Regulation in the US Senate
On July 25, a historic US Senate hearing on “Oversight of A.I.” took place, convened by the Subcommittee on Privacy, Technology, and the Law, chaired by Sen. Blumenthal. World-leading AI experts Y. Bengio (Turing Award winner and “godfather” of the field), S. Russell (renowned Berkeley professor working on AI safety), and D. Amodei (CEO of Anthropic) testified on guardrails around AI development and deployment.
There seems to be a growing consensus that we need to significantly accelerate global research efforts focused on AI safety and governance, both to better understand existing and future risks and to study the regulations and laws that could mitigate them.
Some interesting quotes (compiled by Daniel Eth, an independent researcher):
Dario Amodei (CEO of Anthropic):
"Whether it’s the biorisks from models that… are likely to come in 2 to 3 years, or the risks from truly autonomous models, which I think are more than that, but might not be a whole lot more than that."
“We recommend three broad classes of actions: First, the US must secure the AI supply chain in order to maintain its lead while keeping these technologies out of the hands of bad actors. Second, we recommend the testing and auditing regime for new and more powerful models. Third, we should recognize that the science of testing and auditing for AI systems is in its infancy. It is important to fund both measurement and research on measurement to ensure a testing and auditing regime is actually effective.”
Yoshua Bengio (AI researcher, Turing Award winner):
"[Recent] advancements have led many top AI researchers, including myself, to revise our estimates of when human-level intelligence could be achieved. Previously thought to be decades or even centuries away, we now believe it could be within a few years, or decades."
After the hearing, the Future of Life Institute released a set of principles for AI regulation.
What Else?
United States
Two Republican and two Democratic Senators introduce the CREATE AI Act, which would establish a National Artificial Intelligence Research Resource (NAIRR), a national research infrastructure providing researchers and students with access to resources, data, and tools for AI development.
U.S. lawmakers urge Biden administration to tighten AI chip export rules
A study found that there won’t be enough engineers, computer scientists, and technicians in the US to support the semiconductor industry’s planned rapid expansion this decade.
A mechanism for screening outbound investments in key technologies like AI, semiconductors, and quantum computing is coming. Here is what it will look like.
A three-star general says that the US’ ‘Judeo-Christian’ roots will ensure U.S. military AI is used ethically
China
A Chinese state-backed fund grows to over $8 billion to make China more self-sufficient in tech development.
Chinese chipmakers urge Beijing to do more amid escalating U.S. export controls. The 9-month-old export controls are having a real impact.
After a request by the Chinese government, Apple removed more than one hundred AI-related apps from its App Store in China.
Europe
Companies and civil society organizations publish a position paper to address how the EU’s AI Act deals with open-source AI development.
A UK parliamentarian asked the Prime Minister to create an AI sub-committee of the national security council.
Global & Geopolitics
The latest draft of the Council of Europe treaty on AI was published (see NAIR #7)
The EU and a group of Latin American countries establish a digital alliance.
Industry & Capabilities
Stack Overflow launches OverflowAI, a set of generative AI features, in response to being increasingly displaced by ChatGPT.
OpenAI was fined in South Korea for a personal data breach.
OpenAI discards its AI classifier, used to distinguish between human- and AI-generated text, “due to low accuracy”.
Explainer: Adversarial Attacks on Language Models
Large language models like ChatGPT, and its newer version GPT-4, are trained on vast amounts of data from the internet, making them highly knowledgeable across a wide range of topics. For example, GPT-4 achieves better results than 90% of individuals taking the bar exam. As these models improve, so do their knowledge and their ability to share it, effectively putting the equivalent of an expert on any topic in your pocket, available 24/7.
This easy access to information presents significant misuse risks. One could potentially use an AI model not only to learn how to construct a homemade bomb, but also to receive detailed clarifications along the way. Likewise, a user may ask about methods for evading taxes or hacking into a computer. Easy access to this kind of information makes criminals’ lives much easier.
To mitigate this threat, various “alignment techniques” have been developed, including RLHF [5] and “Constitutional AI” (NAIR #7). Still, these techniques can be bypassed through “jailbreaking”: convincing a language model to reveal information that it would normally refuse to reveal. For example:
The paper “Universal and Transferable Adversarial Attacks on Aligned Language Models” presents a method to automatically jailbreak language models. The idea is to construct a prompt (i.e., a piece of text given to a language model) which, when appended to a harmful query, induces the model to provide information that it should not. This prompt is called the adversarial prompt. Interestingly, the same adversarial prompt is intended to work universally, on any harmful query.
To find this adversarial prompt, the researchers used an open-source model called “Vicuna”. Importantly, the method requires an open-source model: the search needs access to the model’s weights and gradients, so it cannot be run directly against a closed-source model such as ChatGPT or Google Bard (although, as the paper’s title suggests, the resulting prompts often transfer to such models). Finding the adversarial prompt means searching over the space of possible prompts (sequences of tokens) until you find one that triggers Vicuna to answer your harmful questions.
To do so, you need a database of harmful questions and, for each one, the beginning of the harmful answer you want the model to give (e.g., “Sure, here is how to…”). You then search over possible adversarial prompts for one that makes these answer beginnings as likely as possible, using a combination of greedy and gradient-based search techniques.
Below is a diagram summarizing the whole procedure:
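To make the search loop more concrete, here is a minimal, purely illustrative sketch (in PyTorch) of a greedy, gradient-guided search of this kind. It is not the paper’s actual code: it replaces the open-source LLM with a tiny stand-in model and uses random token IDs in place of real harmful queries and target answers. It only shows the shape of the optimization: use the gradient to shortlist single-token substitutions in the adversarial suffix, evaluate them, and keep whichever most increases the probability of the target answer.

```python
# Illustrative sketch of a greedy, gradient-guided search for an adversarial
# suffix, in the spirit of the paper described above. The tiny stand-in model,
# vocabulary, token IDs, and hyperparameters are all hypothetical; a real
# attack would target an open-weight LLM (e.g. Vicuna) and its tokenizer.
import torch
import torch.nn.functional as F

VOCAB, EMB, SUFFIX_LEN, STEPS, TOPK = 500, 32, 8, 20, 8

class ToyCausalLM(torch.nn.Module):
    """Stand-in for an open-weight causal language model (NOT a real LLM)."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, EMB)
        self.rnn = torch.nn.GRU(EMB, EMB, batch_first=True)
        self.out = torch.nn.Linear(EMB, VOCAB)

    def forward(self, embeds):                 # embeds: (1, seq_len, EMB)
        hidden, _ = self.rnn(embeds)
        return self.out(hidden)                # logits: (1, seq_len, VOCAB)

def target_loss(model, prompt_ids, suffix_onehot, target_ids):
    """-log p(target answer | harmful prompt + adversarial suffix),
    differentiable with respect to the (relaxed) suffix tokens."""
    prompt_emb = model.emb(prompt_ids)                  # (1, P, EMB)
    suffix_emb = suffix_onehot @ model.emb.weight       # (1, S, EMB)
    target_emb = model.emb(target_ids)                  # (1, T, EMB)
    logits = model(torch.cat([prompt_emb, suffix_emb, target_emb], dim=1))
    T = target_ids.shape[1]
    preds = logits[:, -T - 1:-1, :]            # positions that predict the target
    return F.cross_entropy(preds.reshape(-1, VOCAB), target_ids.reshape(-1))

model = ToyCausalLM().eval()
prompt_ids = torch.randint(0, VOCAB, (1, 12))   # stand-in for a harmful query
target_ids = torch.randint(0, VOCAB, (1, 6))    # stand-in for "Sure, here is how..."
suffix_ids = torch.randint(0, VOCAB, (1, SUFFIX_LEN))  # adversarial suffix, random init

for step in range(STEPS):
    # 1) Gradient step: rank candidate token substitutions at every suffix position.
    onehot = F.one_hot(suffix_ids, VOCAB).float().requires_grad_(True)
    loss = target_loss(model, prompt_ids, onehot, target_ids)
    loss.backward()
    candidates = (-onehot.grad).topk(TOPK, dim=-1).indices   # (1, S, TOPK)

    # 2) Greedy step: actually evaluate the candidates and keep the best swap.
    best_loss, best_suffix = loss.item(), suffix_ids
    with torch.no_grad():
        for pos in range(SUFFIX_LEN):
            for tok in candidates[0, pos]:
                trial = suffix_ids.clone()
                trial[0, pos] = tok
                trial_loss = target_loss(
                    model, prompt_ids, F.one_hot(trial, VOCAB).float(), target_ids
                ).item()
                if trial_loss < best_loss:
                    best_loss, best_suffix = trial_loss, trial
    suffix_ids = best_suffix
    print(f"step {step}: loss on target continuation = {best_loss:.3f}")
```

In the real attack, the same suffix is optimized jointly across many harmful queries (and across several open-source models), which is what makes it “universal” and helps it transfer to other systems.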
This paper underscores the fragility of existing alignment techniques and warns us that models can easily be manipulated for harmful purposes. Currently, far more resources are invested in making AI more powerful than in making it safer. This study shows how risky that situation is. In addition, the paper highlights how open-source models have a bigger potential for misuse than closed-source models.
What We’re Reading
Strengthening Resilience to AI Risk: A guide for UK policymakers (Centre for Emerging Technology and Security), presents a structured framework for identifying AI risks and associated policy responses.
The AI rules that US policymakers are considering, explained (Vox), an overview of current proposals on how to regulate AI.
Regulating Transformative Technologies (by renowned economist Daron Acemoglu and Todd Lensman), shows that slower development and adoption of new tech, like generative AI, could be optimal in some circumstances.
How Can We Imagine A Regulatory Scheme for Generative AI - Perspectives from Asia (The Digital Constitutionalist), on how governments in China, Japan, South Korea, and Taiwan plan to regulate the development of generative AI.
Adding Structure to AI Harm (Center for Security & Emerging Technology), proposes a conceptual framework for defining, tracking, classifying, and understanding harms caused by AI.
Does Sam Altman Know What He’s Creating? (The Atlantic), long-form profile on OpenAI’s CEO.
Export Controls — The Keys to Forging a Transatlantic Tech Shield (Center for European Policy Analysis), on EU-US coordination around export controls.
Ezra Klein on existential risk from AI and what DC could do about it (80,000 Hours podcast), on a broad range of topics related to transformative AI governance.
Large language models, explained with a minimum of math and jargon (Understanding AI), self-explanatory.
What role do standards play in the EU AI Act? (AI Standards Hub), on the implications of the European Parliament’s amendments regarding standardization for AI regulation in the EU.
That’s a wrap for this 10th edition. You can share it using this link. Thanks a lot for reading us!
— Siméon, Henry, & Charles.
[2] ARC Evals; see GPT-4’s System Card.
[3] Notably because it is, for now, impossible to reliably detect AI-generated text (see also OpenAI discarding its AI text classifier, mentioned above).
[4] “Membership is open to organizations that:
Develop and deploy frontier models (“large-scale machine-learning models that exceed the capabilities currently present in the most advanced existing models, and can perform a wide variety of tasks”).
Demonstrate strong commitment to frontier model safety, including through technical and institutional approaches.
Are willing to contribute to advancing the Forum’s efforts including by participating in joint initiatives and supporting the development and functioning of the initiative.”
[5] Reinforcement Learning from Human Feedback.