The UK Foundation Model Taskforce has recently been launched, backed by the UK government and led by Ian Hogarth, an AI investor and expert concerned about extreme AI risks.
I think it's the most exciting policy effort initiated so far to mitigate AI extinction risks. Hence, inspired by Jack Clark, I decided it was worth spending my Sunday night coming up with a ranked list of priorities for what the UK taskforce should do to mitigate extreme risks from AI.
TLDR: Due to its political legitimacy & global clout, I believe that the best way for the UK Taskforce to mitigate AI extinction risks is to pursue interventions that will affect other countries’ policies or frontier labs’ investment in safety. The three I’m most excited about are:
Developing a risk assessment methodology for extreme risks.
Demonstrating current risks & forecasting near-term risks via risk scaling laws.
Comprehensively assessing the state-of-the-art open source large language model (LLM).
I. Assets of the UK Taskforce
Here's a short list, in decreasing order of importance, of the most notable assets of the UK taskforce compared with other existing AI safety efforts.
Political legitimacy. By far the most valuable resource the UK taskforce has is the backing of the UK government and a substantial amount of flexibility to mitigate extreme AI risks.
Global clout. The UK has global economic & diplomatic clout that allows it to influence other countries, including the US (where most of the frontier AI capabilities are).
Money. The UK taskforce has £100 million, which, depending on where it is spent, can be a lot or not that much.
II. Leveraging Those Assets
I believe that if we look at those assets and compare them with the budgets of entities like OpenAI or Google DeepMind, it seems pretty clear that the best plan for the UK Taskforce is NOT to do alignment work:
it doesn't leverage the political legitimacy & global clout much.
it is hard to assemble the right talented engineering team with that amount of money while being based in the UK.
it is hard to jumpstart an AI safety effort on large foundation models, which involves building a code base from scratch.
With the current budget, it's not even clear the taskforce would match the productivity of a single safety team at a top lab. I would argue differently if the budget were an order of magnitude larger.
On the other hand, political legitimacy & global clout offer an opportunity for the UK taskforce to do the groundwork to:
Increase the amount of effort industry AI labs spend on safety (which is substantially more leveraged than doing the work itself).
Ensure there’s no large-scale catastrophe before there are strong guardrails & oversight.
Create more common knowledge & consensus around extreme AI risks.
To do that, I believe the best things the UK Taskforce could do, in decreasing order of priority, are to:
Develop a risk assessment methodology that makes it possible to determine whether or not it is safe to launch the next training run.
Try to develop risk scaling laws & risk demonstrations to produce evidence of risks to come, current risks, or the absence of risks.
Comprehensively assess the SOTA open-source AI system to ensure no irreversible threshold of extreme risk has been crossed.
A) would help substantially to increase the amount of effort labs spend on safety. Indeed, if a risk assessment concludes that a planned training run is unsafe, labs will need to invest heavily in safety to overcome that verdict and train their next systems. Hence, it should drastically increase their investment in safety. A) would also help create more common knowledge of the current and foreseeable levels of risk, at least for policymakers.
B) would create common knowledge and more consensus on what the risks are and where the world should focus its attention.
C) is a necessary step to ensure that there is no large-scale catastrophe in the near future due to the deployment of an open-source AI system whose capabilities and potential for misuse have been substantially underestimated.
Let me dig a bit more into the specifics of each project suggestion.
A. Risk Assessment Methodology
Right now, I think the most promising and efficient way for the UK taskforce to decrease extreme risks from AI is to develop a playbook for answering the question “Should we launch this next training run?”. This probably involves developing a risk assessment methodology and potentially a benefit analysis methodology.
There are three core reasons why.
Global political willingness to regulate: There's a willingness to regulate all over the world, including in the US where most of the frontier capabilities are, but nobody really knows how to do it. Having a clear methodology to answer the question outlined above would allow everyone to move forward.
Risk assessment is an easy policy ask: Risk assessment is the 101 of safety, so it is quite likely that if a solid risk assessment methodology exists, policymakers around the world would consider implementing it or drawing upon it substantially to build their own adapted versions.
Reducing the information asymmetry between policymakers and developers: The absence of proper evaluation of the current & foreseeable levels of risk is, I believe, one of the core reasons why there's such a discrepancy between the asks of some AI developers worried about AI risks and the measures policymakers currently find justified. While some prominent AI risk voices like Sam Altman call for an IAEA for AI safety, policymakers and global governance stakeholders don't necessarily find it warranted because they don't hold the same beliefs regarding the current and foreseeable level of risk.
My personal guess is that the most promising direction is a probabilistic risk assessment methodology aimed at answering the question “What's the likelihood that this training run causes extreme risks?”.
Probabilistic risk assessment (PRA) has been core to the nuclear industry's ability to model threats and estimate risks accurately, which in turn allowed it to reduce the likelihood of undesirable outcomes more efficiently. PRA can be split into three steps:
System modeling: a description of the system that characterizes the most important ways it can fail and describes threat models in a detailed way.
Failure probability estimates: empirical estimates of failure rates for as many subparts of the system (e.g., monitoring systems) as possible. A base rate for a particular company's ability to avoid failures can also be estimated from its track record of past incidents, near misses, and accidents.
Risk calculation & evaluation: a final estimate of the overall likelihood that a training run causes extreme risks, produced by dedicated AI forecasting experts who are provided with all the information.
Once this estimate is available, it can be used to make a decision in a regulatory context, e.g., to decide whether or not to grant a license under a licensing regime. I believe that thinking through the risk assessment methodology jointly with how a licensing regime could work could be quite productive.
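To make the three steps a bit more concrete, here is a minimal, purely illustrative sketch of the kind of calculation step 3 could boil down to. Every subsystem name and probability below is a hypothetical placeholder I made up for illustration, and the independence assumption is a strong simplification that a real assessment would have to relax.

```python
# Toy fault-tree sketch of a probabilistic risk assessment for a training run.
# All subsystem names and probabilities are hypothetical placeholders,
# not estimates of any real system.

# Step 2 of PRA: failure probability estimates for individual safeguards.
failure_probs = {
    "dangerous_capability_emerges": 0.05,  # model develops a capability of concern
    "evals_miss_it": 0.20,                 # pre-deployment evaluations fail to detect it
    "monitoring_fails": 0.10,              # post-training monitoring fails to catch misuse
}

# Step 1 (system modeling) is encoded implicitly as an AND-gate:
# an extreme outcome requires every safeguard in the chain to fail.
def and_gate(probs):
    """Probability that all independent failure events occur."""
    p = 1.0
    for x in probs:
        p *= x
    return p

# Step 3: risk calculation. Independence is a strong simplifying assumption;
# a real assessment would model correlated failures and use expert elicitation.
p_extreme = and_gate(failure_probs.values())
print(f"Estimated probability of an extreme outcome: {p_extreme:.4%}")

# The number could then be compared against an agreed-upon licensing threshold.
THRESHOLD = 1e-3  # hypothetical
print("Below threshold" if p_extreme < THRESHOLD else "At or above threshold")
```

The point of the sketch is only to show where the numbers would come from and how they would combine, not to suggest that the modeling itself is easy.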
I think that developing a risk assessment method for next training runs is one of the most impactful things the UK taskforce could be working on. It is likely that such a method would be best developed by intertwining theory and practical application of the framework to quickly improve it.
B. Risk Demonstrations & Quantified Risk Extrapolation
One core way to enhance humanity's ability to tackle the immense problem of AI safety is to ensure that we agree as much as possible on the timeline and magnitude of risks. Because a lot of the risks that have been discussed rely on some form of extrapolation, a) risk demonstrations and b) quantified risk extrapolation are two highly leveraged activities that increase the world's ability to take risks seriously and to cooperate on these topics when warranted.
Hence, I believe that the UK Taskforce should try to work on those two axes.
Risk Demonstrations
Many of the risks have remained theoretical until very recently because of the weak capabilities of AI systems. Now that models are quite capable, it seems feasible to demonstrate many of the risks that could become catastrophic in the near future.
Such risk demonstrations could involve:
Bioweapon development: Measuring in an infosecure environment how much faster the first steps of bioweapon development are executed by people assisted by LLMs compared with people of comparable skill without LLMs.
Hacking enhancement: Measuring in an infosecure environment how much large language models enhance the cyber capabilities of humans of varying skill levels, compared with humans without access to such models.
Misinformation: Measuring the efficiency gains in large-scale misinformation or manipulation campaigns associated with access to an LLM.
Deception: Studying the presence of deception in systems like LLMs. One example would be determining whether Cicero, developed by Meta, has in fact developed a form of deception.
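For the first two items, the headline result of such a demonstration would typically be an "uplift" metric comparing an LLM-assisted group with a control group. Here is a minimal sketch of how that metric might be computed; the task times and the choice of effect-size statistic are entirely made up for illustration, not drawn from any real study.

```python
# Sketch of computing an "uplift" metric for a risk demonstration:
# how much faster an LLM-assisted group completes a benign proxy task
# compared with a control group. All numbers below are invented.
from statistics import mean
from math import sqrt

control_times = [9.5, 11.2, 10.1, 12.3, 9.8]   # hours, no LLM access
assisted_times = [6.1, 7.4, 5.9, 8.0, 6.6]     # hours, with LLM access

def uplift_ratio(control, assisted):
    """Relative speed-up of the assisted group over the control group."""
    return mean(control) / mean(assisted)

def cohens_d(a, b):
    """Effect size, as a rough check that the gap isn't just noise."""
    na, nb = len(a), len(b)
    va = sum((x - mean(a)) ** 2 for x in a) / (na - 1)
    vb = sum((x - mean(b)) ** 2 for x in b) / (nb - 1)
    pooled = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

print(f"Uplift ratio: {uplift_ratio(control_times, assisted_times):.2f}x")
print(f"Effect size (Cohen's d): {cohens_d(control_times, assisted_times):.2f}")
```

A real demonstration would of course need careful task design, infosecurity, and pre-registration; the sketch only shows what the quantitative output could look like.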
Quantified Risk Extrapolation
For the past five years, scaling laws, i.e., relationships between the amount of computing power spent on a model's training run and its capabilities, have been used extensively to foresee and predict the development of deep learning capabilities. Quite strikingly, OpenAI claims to have internally developed the ability to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1000th the compute of GPT-4.
While I don't think such results would replicate as-is for risks, I suspect that not much has been tried in the realm of creating risk scaling laws, which would provide a lower bound on how dangerous an LLM trained with a given amount of computing power could be.
I don't have many ideas here, but some that could be tested involve:
Subsetting APPS (a coding benchmark) to the tests most correlated with hacking capabilities and checking whether scaling laws hold on that subset.
Replicating in a more quantified way this MIT study (https://arxiv.org/abs/2306.03809) on the ability of large language models to make bioweapon development easier, across models trained with varying amounts of computing power.
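As a rough illustration of what a risk scaling law could look like in practice, here is a minimal sketch that fits a power law to a proxy risk score against training compute and extrapolates it to a larger run. All data points, scores, and compute figures are invented; the proxy score stands in for something like the hacking-correlated APPS subset mentioned above.

```python
# Sketch of fitting a "risk scaling law": a log-log regression of a proxy risk
# score against training compute, then extrapolating to larger runs.
# All data points are invented for illustration.
import numpy as np

compute_flops = np.array([1e21, 3e21, 1e22, 3e22, 1e23])  # training compute
risk_score = np.array([0.02, 0.04, 0.07, 0.12, 0.20])     # proxy capability score

# Fit risk ~ a * compute^b via linear regression in log-log space.
b, log_a = np.polyfit(np.log10(compute_flops), np.log10(risk_score), deg=1)

def predicted_risk(compute):
    """Extrapolate the fitted power law to a given amount of compute."""
    return 10 ** (log_a + b * np.log10(compute))

# Extrapolate to a hypothetical next-generation run with 10x more compute.
next_run = 1e24
print(f"Fitted exponent b = {b:.2f}")
print(f"Predicted proxy risk score at {next_run:.0e} FLOP: {predicted_risk(next_run):.2f}")
```

Whether risk-relevant metrics actually follow smooth power laws is exactly the open empirical question; the sketch just shows that the fitting and extrapolation machinery is cheap to try.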
I believe that, given its budget and comparative advantages, any success by the UK taskforce on those fronts would be immensely valuable for the world. If I had to prioritize between the two avenues mentioned, I would prioritize risk demonstrations, given that quantified risk extrapolation is quite speculative and definitely not guaranteed to yield results.
C. Comprehensively Assess SOTA Open Source Systems
While GPT-4 is still closed behind an API and could be rolled back if it's deemed too dangerous for unforeseen reasons, LLaMa (the current state-of-the-art open-source LLM) is and will stay in the world as long as the internet or hard drives exist. Hence, there is an irreversible increase in risk (and benefit) associated with any open-source LLM. While this risk increase is probably still quite negligible at LLaMa's level, the irreversible nature of those increases makes it all the more important to ensure we don't cross a red line without noticing.
Hence, comprehensively assessing the risks from the SOTA open-source AI system and gauging how far we are from unacceptable risk thresholds is quite urgent and important.
This is why I believe that the UK taskforce should do a comprehensive risk assessment of the SOTA open-source system. This should include extreme risks but probably also every other risk, in order to track where we might want to start restricting open source once the risks begin to exceed the benefits.
This workstream would probably have substantial overlap with the first two workstreams mentioned above, but it is still quite distinct in being comprehensive and applied to a deployed system, not a training run.
Doing a comprehensive risk assessment of a deployed AI system would probably rely mostly on the development of benchmarks and evaluations for that system. Jack Clark has fleshed out well what the UK taskforce could do in that area, as has a recent paper from Shevlane et al. 2023, produced by most of the key industry actors.
Once the risk assessment methodology that I suggested as the core priority of the taskforce has been developed, it could also be applied here. But I believe that once a system already exists, most of the relevant information can be captured via standard evaluations (which is not the case when a risk assessment is performed before launching a training run, which is probably the right approach for frontier AI systems).
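To illustrate what the output of such an assessment might look like, here is a minimal sketch comparing hypothetical evaluation scores for an open-source model against hypothetical red-line thresholds. The categories, scores, and thresholds are all made up; real evaluations would draw on benchmarks of the kind discussed by Jack Clark and by Shevlane et al. 2023.

```python
# Sketch of comparing measured evaluation results for an open-source model
# against predefined "red line" thresholds. Categories, scores, and thresholds
# are hypothetical placeholders, not real evaluation results.

measured_scores = {          # hypothetical evaluation results, in [0, 1]
    "cyber_offense_uplift": 0.12,
    "bioweapon_assistance": 0.05,
    "autonomous_replication": 0.01,
    "large_scale_persuasion": 0.22,
}

red_lines = {                # hypothetical unacceptable-risk thresholds
    "cyber_offense_uplift": 0.30,
    "bioweapon_assistance": 0.10,
    "autonomous_replication": 0.05,
    "large_scale_persuasion": 0.40,
}

def assess(scores, thresholds):
    """Flag each category where the measured score crosses its red line."""
    return {k: scores[k] >= thresholds[k] for k in thresholds}

for category, crossed in assess(measured_scores, red_lines).items():
    status = "RED LINE CROSSED" if crossed else "below threshold"
    print(f"{category:28s} {measured_scores[category]:.2f} -> {status}")
```

The hard part, of course, is defining the categories and thresholds and building evaluations that actually measure them; the reporting step above is the easy bit.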
Conclusion
The stated mission of the UK Taskforce is to “carry out research on AI safety and inform broader work on the development of international guardrails, such as shared safety and security standards and infrastructure, that could be put in place to address the risks”.
To achieve that, I suggest that the UK taskforce focus its attention on what will have compound effects or substantial effects on the rest of the world, i.e.:
Incentivizing frontier AI labs to do more safety research
Producing evidence for or against different risks
Doing work that is very time-sensitive, that few actors have the resources to do, and that prevents irreversible damage from happening in the very near term.
I believe that the UK taskforce & the current context may be one of the best shots we have at ambitious AI governance at the global level, and hence I am very excited by this effort. While I hope that some of what I've written here will be picked up, I expect their work to be of excellent quality regardless, and I wish them the best of luck in spearheading global efforts to mitigate AI extinction risks.
Some Resources
Probabilistic Risk Assessment (PRA)
Short presentation of PRA as applied to nuclear safety
Book review of Safe Enough, describing the history of PRA in nuclear safety
Detailed FAQ on the PRA in nuclear safety (by the Nuclear Regulatory Commission)
The UK Taskforce
Acknowledgment
Thanks to Jack Clark, Lisa Soder, Lee Sharkey, Neil Pitman, and everyone else who provided feedback.