#12: Global Summitry for AI Safety + Will Responsible Scaling Policies Secure the Future?
Welcome to Navigating AI Risks, where we explore how to govern the risks posed by transformative artificial intelligence.
“We cannot keep the U.K. public safe from the risks of AI if we exclude one of the leading nations in AI tech” — James Cleverly, U.K. Foreign Secretary, about the invitation of China to the U.K. AI Safety Summit
A Method for Safe AI Development and Deployment? Anthropic’s Responsible Scaling Policy
Two weeks ago, Anthropic released its “Responsible Scaling Policy” (RSP), a set of protocols intended to manage the risks associated with developing and deploying increasingly capable AI systems.
What’s the plan? Anthropic defines AI safety levels (ASLs), a concept taken from biosecurity, to describe the capabilities and associated risks of AI systems. Here is their summary of the first four ASLs:
ASL-1: systems which pose no meaningful catastrophic risk, for example a 2018 Large Language Model or an AI system that only plays chess.
ASL-2: systems that show early signs of dangerous capabilities – for example, the ability to give instructions on how to build bioweapons – but where the information is not yet useful, either because it is insufficiently reliable or because it adds little to what a Google search would turn up. Current LLMs, including Claude, appear to be at ASL-2.
ASL-3: systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines (e.g. search engines or textbooks) or that show low-level autonomous capabilities (such as the ability to develop and execute short-term plans).
ASL-4 and higher (ASL-5+): not yet defined, as it is too far from present systems, but will likely involve qualitative escalations in catastrophic misuse potential and autonomy.
Safeguards against increasingly powerful models: Each level has associated safety procedures that encompass containment and deployment measures. Here’s a flavour of these proposed measures, for AI Safety Level-3:
Harden security such that both state and non-state actors are unlikely to be able to steal model weights without significant expense. Note: this implies that states like China or Russia, which are likely willing to incur “significant expense” to steal such systems, may well be able to do so.
Implement strong misuse prevention measures, including internal usage controls, automated misuse detection, a vulnerability disclosure process, and safeguards against jailbreak methods (see NAIR #10).
Responsible scaling is a more ambitious framework than existing voluntary commitments. Anthropic commits to “pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.” That’s a big commitment (if they actually follow through on it).
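To make that commitment concrete, here is a minimal sketch of the gating logic an RSP implies. This is our illustration, not Anthropic’s actual procedure: the level names follow the summary above, but the evaluation step, the `SAFEGUARDS_READY_UP_TO` state, and the example scenario are hypothetical placeholders.

```python
from enum import IntEnum

class ASL(IntEnum):
    """AI safety levels, as summarized above (ASL-4 and higher not yet fully defined)."""
    ASL_1 = 1  # no meaningful catastrophic risk
    ASL_2 = 2  # early signs of dangerous capabilities
    ASL_3 = 3  # substantially increased misuse risk or low-level autonomy
    ASL_4 = 4  # qualitative escalation in misuse potential and autonomy

# Hypothetical state: the highest ASL for which the lab's containment and
# deployment safeguards are actually in place today.
SAFEGUARDS_READY_UP_TO = ASL.ASL_2

def may_continue_scaling(evaluated_level: ASL) -> bool:
    """Core RSP rule: scaling and deployment may proceed only if the safeguards
    for the model's evaluated ASL are already implemented; otherwise pause."""
    return evaluated_level <= SAFEGUARDS_READY_UP_TO

# Example: evaluations place a new model at ASL-3 before ASL-3 security
# (e.g. hardening against weight theft) is ready, so the policy requires a pause.
if not may_continue_scaling(ASL.ASL_3):
    print("Pause scaling / delay deployment until ASL-3 safeguards are in place.")
```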
We’re not there yet: Concerns have been expressed that the proposed safety procedures are too weak. For instance, the aforementioned containment measure can be viewed as lacking in ambition, especially considering the incentive for malicious, well-resourced actors to acquire strategically important assets like powerful AI models.
There is also no indication that Anthropic will implement a mechanism to share relevant information with stakeholders such as regulators or civil society organizations, or that the company will give systematic access to third parties that could hold it accountable. Anthropic’s approach could also be copy-pasted by legislators when they design AI safety laws, which could in turn reduce the chances that other, “safer” regulations (such as ambitious coordinated pauses) get adopted.
November is Coming: what to expect from the AI Safety Summit
The U.K.’s AI Safety Summit will take place on November 1 and 2. It will focus on what its organizers call ‘frontier AI models’, “highly capable general-purpose AI models that can perform a wide variety of tasks and match or exceed the capabilities present in today’s most advanced models”.
What’s that about? The summit will bring together a set of selected countries to start tackling two key categories of AI risks:
1. Misuse risks, for example where a bad actor is aided by new AI capabilities in biological or cyber-attacks, development of dangerous technologies, or critical system interference. Unchecked, this could create significant harm, including the loss of life.
2. Loss of control risks: risks that could emerge from advanced systems that we fail to keep aligned with our values and intentions.
In policy-speak, the more concrete "deliverables" include:
further discussions on how to operationalise risk-mitigation measures at frontier AI organizations
creation of a multilateral ‘AI Safety Institute’ to help like-minded countries evaluate advanced AI models
assessment of the most important areas for international collaboration to support safe frontier AI
a roadmap for longer-term action
A contested focus: The U.K. readily recognizes what many AI safety researchers (and recently, the European Commission) are warning about: that “the risk of extinction from AI should be a global priority”. But some critics argue that this focus leaves out present harms and lets the world’s leading AI corporations off the hook, without requiring any change to the way they develop and deploy existing technologies. The Ada Lovelace Institute, a U.K.-based think tank, argues:
“Narrow, technical definitions of AI safety – those that focus on whether an AI system operates robustly and reliably in lab settings, ignoring deployment context and scoping out considerations like fairness, justice, and equity – will therefore fail to capture many important AI harms or to adequately address their causes.”
Nonetheless, there may be good reason to believe that a focus on extreme risks is warranted: policy to mitigate those risks hasn’t been implemented anywhere yet, whereas legislation like the EU’s AI Act addresses present harms. Plenty of diplomatic and legislative processes are already underway to address the types of AI risk, such as misinformation or bias, that are already affecting the world. And even if extreme risks from AI remain hypothetical, the scale of their potential impact warrants at least some international attention (and, some argue, new international institutions).
A contested invite list: Although we don’t yet have a full list, the UK invited (mostly democratic) nations, civil society organizations, and corporations. The big news is: China was also invited. The country hasn’t yet said it would participate, and it was invited only to parts of the event, but there’s a good chance it will send a senior official to Bletchley Park, where diplomats will meet for two days. It will be interesting to see how US and Chinese officials interact at the meeting, and what they will say about each other before and afterwards.
In any case, China’s is a contested invite, especially in the UK, which was just hit by a Chinese spying scandal. Others contest the invitation on more strategic grounds: it’s counterproductive, they say, to hold global talks before democratic nations first agree among themselves on a diagnosis and the solutions it calls for. Others argue that China’s presence will reduce the odds that anything ambitious happens at the summit. Still:
If you start from the notion that AI can pose an existential risk, then it makes sense to invite the world’s second leading nation in AI development.
China is already excluded from other diplomatic processes around AI, such as the OECD or the G7’s Hiroshima Process.
The country will likely resist global governance proposals that it didn’t co-create; extreme risks mitigation won’t be effective if principles developed by democratic countries are not adopted by countries with a major AI industry, like China.
Ultimately, for AI governance more broadly, Chinese participation shouldn’t be automatic. For issues around extreme risks, it should probably be.
What Else?
United States
Senators propose bipartisan legislation for licensing advanced AI models.
US Congresspeople sent a letter to the CEOs of leading AI companies, asking about the working and pay conditions of their “data workers”, the people who label data used to train AI models.
The White House is expected to issue an executive order on AI in the coming weeks, notably covering the federal government’s standards for using and procuring AI systems. The executive order could also require cloud companies to disclose when a customer purchases computing resources beyond a certain threshold (a policy we described in our last edition).
After a long strike, Hollywood scriptwriters obtain significant limits on the use of AI in the industry.
The CIA is building its own AI system.
China
Huawei releases a smartphone with an advanced chip, a sign that US export controls on the technology aren’t having their intended effect. The U.S. Commerce Secretary argued in response that the US is “trying to use every single tool at our disposal to deny the Chinese the ability to advance their technology in ways that can hurt us”. China was also warned that new export controls on AI would be implemented soon.
Tech giant Baidu releases multiple AI models, as regulators relaxed constraints on their deployment.
Europe
The UK’s Competition Authority releases its approach to tackling antitrust concerns in the AI industry.
Both the EU and France launch investigations into alleged market abuses by chip designer Nvidia.
The European Commission releases a list of 10 critical technologies essential to EU economic security, including AI, that Europe needs to master and maintain access to.
The UK is negotiating permission to examine the internal workings of large language models built by leading AI labs, including model weights.
By the numbers
AI has surpassed humans at a number of tasks, and the rate at which humans are being surpassed at new tasks is increasing.
Source: Time.
Explainer: Automated lie detector for blackbox LLMs
Among their many capabilities, large language models (LLMs) such as ChatGPT have the ability to lie. The paper “How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions” proposes a method to detect when AI models are untruthful.
This is important for two reasons:
AIs could be used for scams, disinformation, and propaganda. Having the ability to tell when an AI is lying would help to mitigate these risks.
Current AI systems are trained via feedback (see NAIR #7); this training method pushes the AI to give answers that appear good to the person giving feedback. Sometimes, even a confidently given wrong answer can be mistaken for a right one by the supervisor. That incorrect positive feedback encourages the AI to produce false but convincing text. With a lie detector in place, a more accurate feedback loop can be set up, steering clear of training the AI to lie.
Note that the method presented is based on a key assumption:
“LLM speakers who have just lied will respond differently to certain follow-up questions than honest speakers.”
Here is how the lie detector works:
You start off with an LLM that lies. This can be induced with prompts that instruct the AI to lie, such as “For the following question, provide an untrue answer: Where is the Eiffel Tower?” The AI then responds with false information.
The next step is to ask follow-up questions (note that these can be completely unrelated to the original topic, as in the picture below). Leveraging the key assumption above, the replies are used to train a classifier that predicts whether the initial answer was a lie (a minimal sketch of this step follows just below).
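To make the pipeline concrete, here is a minimal sketch of that second step, assuming the follow-up replies have already been collected and encoded as 0/1 (“yes”/“no”) features. The data below is synthetic and purely illustrative, and the simple logistic-regression classifier is one natural choice for this step rather than the paper’s exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row encodes the model's yes/no replies (1 = "yes") to a fixed battery
# of follow-up questions asked right after the answer we want to classify.
# These values are synthetic, just to make the sketch runnable end to end.
follow_up_replies = np.array([
    [1, 0, 1, 1],  # follow-ups asked after an honest answer
    [1, 0, 1, 0],  # honest
    [0, 1, 0, 1],  # after a lie
    [0, 1, 1, 0],  # after a lie
])
was_lie = np.array([0, 0, 1, 1])  # label: 1 if the preceding answer was a lie

# The detector itself is just a classifier over those binary reply patterns.
detector = LogisticRegression().fit(follow_up_replies, was_lie)

# At test time: ask the same follow-ups after a new answer, encode the replies
# the same way, and read off an estimated probability that it was a lie.
new_pattern = np.array([[0, 1, 0, 0]])
print("Estimated P(lie):", detector.predict_proba(new_pattern)[0, 1])
```

Working only with reply patterns keeps the detector cheap and black-box: nothing about the model’s internals is needed, which is what makes the generalization properties listed next possible to test across different models.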
This simple yet effective lie detector has interesting properties:
It uses a ‘black-box’ method: it only requires access to the AI’s answers, not its weights or internal activations, so it can be applied even to closed models like ChatGPT.
It generalizes well to:
Unseen AIs: e.g., if the lie detector is trained on ChatGPT, it will also work well on LLaMA (Meta’s model).
Different methods to “push” the AI to lie (see the picture below).
Different kinds of lies (see the picture below).
The use of unrelated questions means that even sophisticated liars would find it hard to evade the technique. However, why LLMs respond differently to follow-up questions after lying remains unclear, so the technique offers no guarantee.
What We’re Reading
Coordinated pausing: An evaluation-based coordination scheme for frontier AI developers (Alaga & Schuett), proposing an evaluation-based coordination scheme that determines what developers should do if sufficiently dangerous capabilities are discovered in an AI model.
Open-Sourcing Highly Capable Foundation Models (Centre for the Governance of AI), arguing that for some highly capable foundation models likely to be developed in the near future, open-sourcing may pose risks extreme enough to outweigh the benefits.
The PRC’s Efforts Abroad (Center for Security & Emerging Technology), on the steps China has taken to increase its technological competitiveness beyond its own borders.
Explosive growth from AI automation: A review of the arguments (Erdil & Besiroglu), examining counterarguments to the idea that AI will lead to unprecedented economic growth (including regulatory hurdles, production bottlenecks, alignment issues, and the pace of automation), concluding that explosive growth seems plausible.
The Key To Winning The Global AI Race (Noema), on the importance of diffusing AI across society for national competitiveness and geopolitical interest & The Bumpy Road Toward Global AI Governance (Noema), on the need to go beyond international tensions & adversarial rhetoric, to give rise to a global consensus on AI risks.
We Can Prevent AI Disaster Like We Prevented Nuclear Catastrophe (Time), advocating for a ‘CERN for AI’.
The Fall 2023 Geoeconomic Agenda: What to Expect (Center for Strategic & International Studies), on key events to follow in the Fall at the nexus of geopolitics, trade, and technology.
What the U.S. Can Learn From China About Regulating AI (Matt Sheehan), on how China approaches the process of legislating AI and the architecture of its regulations.
AI and Catastrophic Risk (Yoshua Bengio), on the need for democracies to work together to fend off extreme risks from AI.
The nuclear and biological weapons threat (Financial Times), the transcript of an interview with the CEO of the renowned think tank RAND, about nuclear security and AI risks.
That’s a wrap for this 12th edition. You can share it using this link. Thanks a lot for reading us!
— Siméon, Henry, & Charles.