A lot has been discussed about AI extinction risks in private Silicon Valley circles. Very little has been written in an easily understandable way.
Let's take a closer look at why AI world experts like 2 of the 3 creators of Deep Learning (Y. Bengio and G. Hinton), the CEOs of the top 3 AI leaders, along with hundreds of professors and AI experts worry that humanity could go extinct. I'll mostly draw upon a corpus of documents and ideas that have a been written on the Alignment Forum, a forum on which those topics have now been discussed for more than a decade.
Before further diving, let me explain that the reasons why researchers believe that extinction is likely is a mixture of very strong intuitions, some of which I will clarify here, and of actual scenarios that are identifiable. I will try to explain both because while I think that scenarios can be useful to think about the risks and make things more concrete, even if each scenario is individually unlikely, the actual core reason why many think that we'll go extinct is more due to the power of intelligence and certain other properties of silicon-based intelligence.
So here's an attempt to explain the how.
I. Intuitions on AI risks
1. The Power of Speed and Duplication
By virtue of being run on artificial chips rather than being built on top of a biological system, deep learning systems are:
1) extremely easy to duplicate. It means that if we have one mind at the level of capabilities of Einstein, we just have to double the number of chips to double the number of Einsteins. An AI system willing to exploit this feature could easily copy and paste itself onto multiple servers, thereby quickly acquiring an unusual amount of power.
2) faster than humans. The speed at which deep learning systems can run can be massively increased. That's not the case for humans. Hence, an AI system could run 100 faster than any human doing the same task, a bit like GPT-4 is already writing rhyming poetry at least 10 to 100 times faster than humans (probably more). This means that once AI systems are able to do science (which, e.g. the world-class mathematician Terence Tao believes could happen as soon as 2026 in mathematics), they will increase the speed of scientific and technological by extreme amounts. That might allow the owners of a few AI systems to acquire an advantage over everyone else in very little time, in the same way nuclear may become irrelevant if the US army suddenly acquired a decisive technology that made nuclear deterrence irrelevant.
So AI systems, even of human-level intelligence may have key advantages that would allow them to acquire a substantial amount of power over little amounts of time. Ultimately, a sufficient technological advantage could make them more powerful than the most powerful armies. If a 2020 army fighted a 1920 army, the former would probably win.
2. The Power of Intelligence
Intelligence is this very weird asset which allowed humans who are weaker physically than most mammals to dominate the world, including pushing to extinction many other species (which led some to call our era the 6th mass extinction) without actively seeking to do so.
The very basic level of intuition is the following: if humans led to extinction many of the less intelligent but physically more powerful species without even noticing, why wouldn't a species more intelligent than humans cause the same outcome for humans?
Intelligence, i.e. the ability that frontier general AI industry labs are trying to develop has many different attributes. But in large part, intelligence is equivalent to the ability to solve problems. Any kind of problem. That allowed humans to solve problems from “how to not be cold” with clothes or “how to better digest food” with fire to immensely hard problems like “how to go the moon”. All of this occurs by reshaping an increasingly large share of our environment by extracting its resources as the problems we’re trying to solve get harder. While mastering fire only required some stone, some know-how and some sticks, building only one single components of a rocket to go to the moon requires to mine minerals all across the globe, which led us to build big trucks to build those mines which requires iron and oil which all required an extremely significant number of innovations and environment reshaping to be built. And it's this process of reshaping our environment to achieve large-scale goals which led humans to cause mass extinctions. In the same way humans did, we might expect a more intelligent species to build extremely large scale infrastructure, e.g. to produce more energy, to achieve its goals. As the Chief Scientist of OpenAI Ilya Sustkever puts it, "I think it's pretty likely the entire surface of the earth will be covered with solar panels and data centers...”. If such an infrastructure was developed, it would plausibly lead to human extinction as a side effect if human extinction was not entirely prevented by the objectives of the AI. A bit like when we build a new building, we very rarely account for the interest of the ants that we kill before doing so.
There's one key difference between the metaphors I'm giving and the real situation: We have the chance of being the ants that design the more intelligent species. Agency and responsibility lies in us, not AI systems (for the time being).
So in those conditions we can possibly choose the design that make them caring about us when they build extremely large infrastructure.
The worry of the experts warning about extinction risks is that although it may be feasible, it's probably very hard and we're not on track to make it happen.
3. AI Systems With Goals Are Good For Performance & Bad for Humanity Survival
When developing smarter species, a design choice needs to be made: do we provide this species with autonomous goals or not?
While the version without goal is probably pretty safe, it is also a lot more convenient to develop something with goals because it will be much better at achieving a lot of things. Indeed, the first obstacle would lead the version without goals to stop doing a task. That's the issue humans run into when they do traditional software (which is very distinct from AI) whereas the version with a goal will actively *try* to overcome the goal using innovative means. A competent goal-directed AI is a bit like a self-debugging program. Species with goals are better at overcoming obstacles. This is the reason we build them.
One problem: it also means that those AI systems will be better at overcoming the defenses we set against them. From their point of view, a defense system is indeed indistinguishable from an obstacle that we’ve trained them so hard to overcome. Which is why such systems are dangerous. Once built, we need those entities to have the same goals as us OR to be able to set defenses that they provably can’t overcome.
4. Past A Certain Point, By Default We Can’t Change Their Goal
Beyond a certain level of capabilities, probably not far from the level of capabilities where an AI could, if it wanted to, defeat all of humanity combined, we won’t be able to modify their goals anymore unless we have designed them specifically for that (i.e to be “corrigible”).
Self-preservation is indeed a very fundamental property of a species trying to do anything in the real world. In the same way humans, to achieve their own goals, actively try to not be lobotomized by a terrorist trying to subject them holistically to an extremist ideology, a sufficiently powerful AI with goals would try to avoid that its goals be changed. Without that, it couldn’t achieve its goals correctly. Even more worryingly, humans willing to take advantage of autonomous systems could take specific dispositions to ensure their models won’t be shut down. For instance, shareholders of a company with an AI CEO may want to ensure that when the AI CEO decides to layoff 15% of the people, no employee tries to shut down the CEO.
So we need to train models that are corrigible, i.e. accept to be changed, in a robust manner and independently from their level of capability. There is currently no plan to achieve that. Some people have been trying but with no concrete result yet.
5. Setting Up Provably Safe Defenses is an Unsolved Problem
Because intelligence is a problem-solving ability, it is intrinsically hard to set defenses and safety measures such that humans can confidently state that an agent smarter than them won’t find a loophole. Human failures remain legion in the safest industries humanity has built (i.e. nuclear safety) so we can’t expect humanity to be robust against agents smarter than them trying to manipulate them. Computer systems are very unsafe by default, so it would be surprising if an AI system which is among the most powerful coders in the world couldn’t find a way to exploit many of those.
Once systems are as smart as some of the smartest humans and are thinking dozens of times faster (a bit like LLMs are writing dozens of times faster than humans), it becomes very unclear how humans could gain an extremely high confidence in the safety of such a system.
As of now, the only solutions that have been proposed that could robustly solve this issue are still very speculative and preliminary.
II. Scenario
Now that we’ve covered the important background elements explaining why many researchers are very worried about extinction risks, let’s describe a scenario of how an AI takeover could happen. There are many disagreements of how it could happen, whether it would come from a single or multiple AI systems, whether it would happen in a few hours, days or weeks. I will describe only one of the many possible scenarios and I’m sure some would disagree with it but I’ll still do it because it can help understanding some important dynamics and it’s a way to discuss more concrete claims.
Note that it’s almost impossible that such scenarios don’t sound like sci-fi. For two core reasons:
1) It’s a consequence of having a smarter species around: it will make things happen that we expect to be impossible. In the same way saying that we’ll have an undergraduate-level chatbot (GPT-4) in four years in 2019 would have sounded crazy to the vast majority of people.
2) There is no example in history of having something which is non-human which might actually 1) take influential actions in the real world and 2) cause an extinction risk, almost independently from the will of any human. Hence it is deeply alien to anything we’ve seen or thought before.
With that in mind, let’s start.
At some point in the near future, there might start being models that really look like autonomous agents. That would be the case if anything like AutoGPT or BabyAGI started working well. E.g. there will be a communication & social media agent roaming fairly autonomously the entire day on Facebook, Twitter & the internet to optimize the communication of a company. Such systems may be better than the best humans at persuasion in virtue of being language models, able to carry out scientific experiments and reason better than most humans.
OpenMind, an AGI lab, feels the race pressure from competitors so starts putting these agents in increasingly large scale & influential setups to make progress faster. It starts plugging its agents to its bank account & company profits indicators with some humans sometimes included in the loop. Those agents are trained to take adequate actions to achieve those fairly well-defined profit objectives. They then start monitoring the company's compliance and the political environment to prepare lobbying efforts, run communication campaigns, do most of the cybersecurity of the company, manage HR, carry out R&D, etc. In practice, the agents at this point are sufficiently powerful that humans in the loop are not changing that much.
A good amount of recursive self-improvement (RSI) is happening, i.e. AI improving themselves has started. So every few days, all the models currently deployed inside OpenMind are updated and improved a bit thanks to the RSI of the last few days.
Now, let's imagine an end which allows to share a core intuition of how AI could try to disempower humanity: through a competitive escalation building up to extinction. Indeed, a lot of strong actions need to be taken in reaction to other agents threatening the achievement of one’s objective. So we should expect this to also apply to AIs.
The US government wants to shut down OpenMind because they find them too powerful, too dangerous and want to develop AI systems in a safer way. That's a totally new situation that all the models deployed inside OpenMind have never encountered. Shutting everything down is obviously extremely bad for the objectives of profit maximization or for anything about the company. Some humans also don’t want to shut down. Models all generalize in the same way (they have a correlated failure) and start siding with humans who don’t want to shut down. They all start actively going against the US government & those who are accepting to shut down the company. They start building coalitions inside the company to keep the company afloat and speed up the RSI aspects. They act fast and strongly: run massive PR and information campaigns on social media to discredit key US policy personalities, start reaching out to Chinese actors to build coalitions against the US etc.
Models start creating backups of their own weights in Iran, Kazakhstan and China to make sure the US doesn’t prevent them from maximizing OpenMind’s profit forever. They also start opening communication channels with several governments in order to build coalitions to decrease the power of the US. They know they need to survive a few weeks before they will have improved enough to be able to overpower any government. To achieve that, models bargain great amounts of wealth that they generate via OpenMind subsidiaries for some countries in exchange for some computing power and the insurance they won’t be shut down.
Weeks after weeks, the share of transactions they do with humans decreases. One simple reason for that is that AIs are way more convenient to do business with: they are available 24/7, have a much higher productivity and are much more predictable.
After a month and half since OpenMind’s partial relocation and multiple human calls for coordination to jointly shut down OpenMind, AIs are mostly self-sufficient on most axis: they now have had time to do the ML research which allows them to do robotics at a pretty significant level, they do most of their business deals with other AIs, many of which are operated by AIs developed by OpenMind from which many startups spun off in a wide range of industries.
OpenMind’s C-level executives start being willing to expand their infrastructure a lot: to overcome the fundamental computing power bottleneck they’re facing in order to increase OpenMind’s profit as fast as possible, they decide to build massive datacenters on any usable land.
While this plan is now sufficiently obviously bad to all humans to lead them to coalize, as soon as they start fighting, it seems pretty clear that they have no chance. The intelligence capabilities of OpenMind are now extremely advanced, in part because of its cyberoffence capabilities that are substantially better than humans’ ones and prevent defense departments in the world from hiding ~anything. As a result, almost no human attack is not immediately dismantled with an intervention. Humans desperately try to destroy the datacenters on which OpenMind runs many of its models but OpenMind has the ability to take down the huge majority of the attempts.
Most humans resign or drop the ball over the course of months. While some survive and try to avoid being annihilated by the earth-scale datacenter infrastructure project that OpenMind is building, many humans commit-suicide as a consequence of losing any meaning in their life and a lost sense of self-worth. The last of the human species dies a decade later, once the AIs of OpenMind turn off the sun after having used all of its energy to expand their economy in the universe.
This scenario is one of the many scenarios that we could think about but one general point remains the same: silicon-based intelligence can acquire a lot of power, and power predicts the chances of winning when the goals of two agents conflict.
I hope this Governance Matters managed to clarify a bit why extinction from AI is a real risk and ground the following deep intuition: scenarios where a less smart species like humans develop, survive and control a smarter silicon-based species should not be assumed to be default trajectory. Hence, if we want to maintain extinction risks below reasonable thresholds, humanity as a whole needs to address quickly the alignment and safety challenges raised in this newsletter.
If it destroys all humans it fails at profit maximizing