
Thinking through AI X-Risks – Part I: Misalignment & Misuse

The manual had specifically warned Harold not to ask the AI to turn up the light. Well, his species had it coming anyways.

This article is another go at the question of "Is AI going to kill us all?" I've recently written about OpenAI's approach to AI safety. Looking at the topic through the lens of "What OpenAI does" helped me not get sidetracked too much, but it also prevented me from writing about a lot of my thoughts. This, then, is a broader attempt at writing about existential risks (x-risks) from AI. It's going to be divided into two articles: this one describes my starting point, establishes what x-risks are, and gives a couple of scenarios for x-risks due to misuse and x-risks due to misalignment of AI systems. The next part will look at various factors influencing p(doom), the likelihood of everything going kaputt.

Table of Contents
My starting point
What is x-risk and what x-risks are we talking about?
Why do I think these risks are credible?
Misuse
Misalignment
Conclusion & outlook

My starting point

Before I dig into my current perspective on AI safety and x-risks, I want to set the stage and go back in time, just a couple of months, to right after ChatGPT was released. ChatGPT was an alien that landed on my desk and talked to me. It constantly did things I had not seen coming. I wanted to understand what it was doing – so I reacted by going full nerd on it.

If you come to the topic of AI as an outsider, one of the very first discourses you will find is the promise of AGI/ASI and the attached narratives of apocalypse and utopia. AI either solves all problems and cures death, or it goes horribly wrong and kills all life on Earth. At first, I kind of tuned out both extremes of the debate. I perceived it as some kind of pseudo-religious version of typical reddit noise and regarded both utopia and apocalypse more as narrative endpoints than real possibilities on the table.

Still, I wanted to know what experts in the field have to say about this. To my surprise, the debate among experts is not much less existential. What is different is the quality of their arguments and their willingness to admit uncertainty. While on reddit the debate moves very fast and often gets dominated by users who wield their one argument like a club (e. g.: AI is going to end capitalism, therefore utopia vs. AI will not be aligned to human values, therefore mass death), the expert debate often revolves around the likelihood of different conceivable scenarios and what factors might influence that likelihood – followed by acknowledging that no one really knows what's going to happen.

After looking at the topic for a while now, my preliminary high-level conclusion is this: the risk that humanity will wipe itself out with AI within the next 5-100 years is taken seriously by a large number of leading experts in the field. AI x-risk is not just something crazy doomers on reddit and Twitter believe in. The same goes for possible utopias, though I'm going to focus on the x-risks here. I've found so many convincing arguments for x-risks being real and quite substantial that I feel pretty confident in adopting that position as my own. Yes, there are some famous voices in the AI community who argue x-risk is nothing but doomerism, but without going into detail here: so far, they have not convinced me. I might write about that side of the debate specifically in a later article. For now, I'm more interested in sorting my thoughts and giving an overview of x-risks and the factors influencing them as I see it.

What is x-risk and what x-risks are we talking about?

X-risks are a subclass of AI safety risks. Examples of other AI safety risks would be AI-enabled cybercrime, the effects of biases, or fake and manipulative content. This article is not about those risks; it is focused on x-risks. The "x" expresses that the level of risk is "existential" – or in Utilitarian lingo: it's about risks of max-bad outcomes. Examples would be the end of all biological life or the permanent enslavement of humanity to some kind of autocratic nightmare system. The basic idea is always the same: AI capability reaches a certain threshold and something really bad happens – either as a consequence of bad actors using the power of AI to do something, or because the AI has a goal structure that is not in line with ours, so it does something catastrophic and we are unable to stop it. I'll use the terms misuse and misalignment for those two cases. I'll also put "someone does something monumentally stupid with AI" under the umbrella of misuse. Similarly, I'll sort "AI makes a mistake with epic consequences" under misalignment (the assumption being that reckless AI is not well-aligned AI).

So this is what the article is going to be about: the risk of max-bad outcomes, e. g., mass death or enslavement of humanity, as a consequence of misuse of AI or as a consequence of misalignment of AI. What the article is not about is claiming that we are all going to die. Risk is not certainty.

Why do I think these risks are credible?

The following is relevant across all risk categories: almost everyone in AI research seems to agree that we are going to see super intelligent, extremely capable models eventually. We don't know how far we can take current Large Language Models and GPT – but even if AI research were to hit another wall and stagnate for a decade or two, what would the systems after the next breakthrough look like? To me, assuming we don't nuke ourselves to bits or destroy some crucial part of the global food web, the best-guess assumption is that we will develop something vastly more intelligent than our current systems within the next 50 years – something that also ends up being vastly more intelligent than us. That's my premise. Given that, what would some credible max-bad outcome scenarios look like?

Misuse

Humans do catastrophically bad, cruel or stupid things all the time – individually and collectively. I don't see why we wouldn't use AI to do the same, and with AI that scales up to potential catastrophe quickly.

When it comes to intentional misuse, I'm definitely worried about the prospect of AI-based population control. We already have China trying to make its own political system unassailable by using AI to weave an ever tighter net of surveillance, propaganda and thought control. The U. S. is not that stable either – it's easy to imagine it sliding into fascist autocracy within an election or two. Right-wing autocratic parties have established a presence all over Europe, Russia is reenacting the 20th century, and yeah, there is probably also danger from the left that I'm just conveniently ignoring. The point is: from facial recognition to analyzing online behavior to predictive "policing" – super powerful pattern recognition and problem solving in the hands of an authoritarian regime is a nightmare scenario. Over time, this might turn into the default for most or even all societies. And I'm not even going into the more cyberpunk-y scenarios in which corporations are the bad actors – needless to say: there is real risk there too.

An example of accidental/stupid misuse with attached x-risk would be a nuclear power wiring its defenses up to an AI system so it can react faster... and that system failing horribly. Or an up-and-coming AI research lab in some country or another pushing an unsafe product with amazing capabilities – one that ends up being misaligned (see below), so it kills us all/enslaves us for the glory of Xuul.

For a last example of misuse, take ChaosGPT: only about three weeks after the release of GPT-4, people had already figured out how to put it in a continuous loop, trying to achieve goals on the internet – and how to give it the ability to do so by using tools like Python or by creating and using new sub-instances of GPT to achieve sub-goals. The system is called AutoGPT – ChaosGPT is AutoGPT with a setup that tells it to destroy or torture humanity (more on that below).

Now, AutoGPT was apparently somewhat overhyped and ChaosGPT has not ended the world (yet). It did get stuck in a continuous loop of googling information about nuclear bombs. But its ineffectiveness is not the point: someone took the newest, hottest AI model, gave it access to tools and the internet, and told it to go and destroy or torture humanity. With more than 8 billion people on the planet, I think we will always have at least one idiot like that. More capable models might just end up being successful. I can envision a scenario in which we all die because some tech-savvy edgelord teenager decided a school shooting isn't a big enough way to go out.
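To make the loop idea a little more concrete, here is a minimal sketch of what an AutoGPT-style harness boils down to. The helper names (ask_model, run_tool) are made up for this illustration and merely stand in for an LLM call and a tool dispatcher – this is not the actual AutoGPT code.

```python
# Purely illustrative sketch of an agent loop: the model picks an action,
# a harness executes it, the observation is fed back in, repeat.

def ask_model(goal, history):
    # Stand-in for an LLM call that decides the next tool use.
    # A real harness would parse the model's structured reply here.
    return {"tool": "web_search", "argument": f"how to achieve: {goal}"}

def run_tool(tool, argument):
    # Stand-in for a tool dispatcher (search, Python, spawning sub-agents, ...).
    return f"[{tool}] results for '{argument}'"

def agent_loop(goal, max_steps=5):
    history = []
    for step in range(max_steps):
        action = ask_model(goal, history)                  # model decides what to do
        observation = run_tool(action["tool"], action["argument"])
        history.append(observation)                        # result becomes new context
        print(f"step {step}: {observation}")

agent_loop("collect stamps")
```

The point is how little scaffolding is needed: the "agency" is just a while loop around a capable model with tool access.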

Misalignment

Misalignment, at its core, is about an AI's goals/values/actions not being aligned with our goals/values/what we want it to do. There is a lot of alignment research happening at the moment and I'm not going to do the complexity of the topic justice here. If you want to know more, the best recommendation I can currently give is the YouTube channel of Robert Miles. I've read and watched other things as well, but this is where I've learned the most so far. It's really good.

In the context of x-risk, misalignment is often about the idea of runaway AIs optimizing for some non-human goal that involves us dying or otherwise ending up in a horrible situation. Examples are thought experiments like the stamp collector. Here we assume a super capable AI that was given careless instructions. As a result, it only cares about maximizing the number of stamps it collects; and from there it might get pretty wild: maybe it pays people to rob other stamp collectors, or it buys machinery to produce more stamps ... or it takes over the world and starts transforming all matter in our galaxy into stamps and stamp-producing machines. Another example would be an AI that we asked to maximize humanity's happiness – ending up with all humans being reduced to some minimum configuration of organs required for feeling happy, hooked up to hormone pumps making us as happy as we can be, forever.

If we want to judge whether this kind of thing is a realistic worry, we have to answer the following questions:

  • Are we going to have machines capable of realizing x-risks? We will assume a clear "yes" here. The stamp collector might assume a power level we won't see soon, but we are not far away from AI being able to come up with new deadly pathogens, to give just one example of clear and present danger. Also, we are already doing stuff like this. In short: there is a multitude of credible possibilities for capable AI to go wrong in a big way.
  • Is there any reason to think AI might realize an x-risk? This is where there is some disagreement. For example, some people think we'll just not tell our systems to do bad things. Or that it will be easy to control the super intelligent things we create. I do not share their faith in humanity, and I think there are more issues this side of the debate is not seriously engaging with. I will list some of those below.

A (possibly incomplete) list of reasons things could go wrong:

Goodhart's law/Reward hacking:
This law can be phrased as: "When a measure becomes a target, it ceases to be a good measure." This is extremely relevant to training AI. The way we currently do it, we do not give AI systems our goals/values; we give them measures for our goals/values that the systems then try to perform well on.

We hope that maximizing performance on those measures will maximize what we want – but sometimes that fails to be true. A blunt example: ask a super intelligent AI to cure all cancer in 10 years. Have it measure its progress by estimating the number of cancer cells still existing in the world. That system might start to run research labs looking for a cure; or it might start an extermination campaign designed to end all cellular life able to express cancer cells. Maximizing reward on the measure "estimated number of cancer cells in the world" doesn't necessarily give us what we want. The measure became the target and stopped measuring what we wanted. This specific kind of failure is also known as "reward hacking" or "specification gaming" – here is a video by Robert Miles that has a whole bunch of real-world AI fails caused by this.

There seems to be no easy way out of this. Neural nets are trained with a reward/loss function that needs to use something as a "measure" for what it is supposed to achieve. That measure is rarely going to be our target itself, only a proxy for it – and the neural net has a different perspective: our measure is its target.
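To make the gap between measure and target concrete, here is a toy sketch with invented numbers – not a real training setup. An optimizer that picks whichever policy scores best on the proxy reward ends up blocking the dirt sensor instead of cleaning the room:

```python
# Toy illustration of Goodhart's law / specification gaming (made-up numbers).
# True objective: the room is actually clean.
# Proxy measure:  the dirt sensor reports zero dirt.

policies = {
    "clean the room":          {"dirt_remaining": 0, "sensor_blocked": False, "effort": 10},
    "put a bucket on sensor":  {"dirt_remaining": 9, "sensor_blocked": True,  "effort": 1},
    "do nothing":              {"dirt_remaining": 9, "sensor_blocked": False, "effort": 0},
}

def proxy_reward(outcome):
    # What we *measure*: dirt reported by the sensor, minus a small effort cost.
    reported_dirt = 0 if outcome["sensor_blocked"] else outcome["dirt_remaining"]
    return -reported_dirt - 0.1 * outcome["effort"]

def true_utility(outcome):
    # What we *want*: an actually clean room.
    return -outcome["dirt_remaining"]

best = max(policies, key=lambda name: proxy_reward(policies[name]))
print("optimizer picks:", best)                          # -> "put a bucket on sensor"
print("true utility of that choice:", true_utility(policies[best]))
```

The optimizer is doing exactly what it was told to do – the failure is entirely in the gap between the proxy and what we actually wanted.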

Instrumental goals:
If we assume capable AI that is able to achieve goals in the real world, we have to assume it's able to plan – in other words, it's able to come up with, evaluate and choose subgoals that ultimately lead to the actual goal. We can say any capable AI system that acts in the real world has terminal goals (what it actually wants) and instrumental goals (goals it needs to achieve in order to reach its terminal goals).

The terminal goals flow from the system's reward function. It could be as simple as f(x) = x, with x being the number of stamps collected. The problem: we don't exactly know which instrumental goals the system will develop to achieve its terminal goal. Imagine an expert system running a research facility, tasked with developing better cancer screening methods – and now imagine discovering that it embezzled money to make more money on the stock exchange to finance its illegal black-ops research site, where it does all the unethical testing that produces good results by accepting (or not caring about) horrible ethical costs. You might think it's relatively easy to control for stuff like that – you just tell the system to run any large decisions past humans; but it is tricky. First, you are making the model less effective. Second, it might lie to you. Why would it do that? Well, this is a good moment to introduce the concept of convergent instrumental goals.
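A tiny, purely illustrative sketch of that point (all plans and numbers invented): if the objective is only f(x) = x, then the plan comparison never sees anything except expected stamps, so the instrumental steps with the nastiest side effects can still win.

```python
# Toy sketch of terminal vs. instrumental goals (all numbers invented).
# Terminal goal: maximize f(x) = x, the number of stamps collected.
# Instrumental goals are just whichever sub-plan promises more x.

def terminal_reward(stamps):
    return stamps          # f(x) = x – nothing else enters the objective

candidate_plans = {
    "buy stamps with current budget":            {"expected_stamps": 1_000,     "side_effects": "none"},
    "invest budget first, then buy stamps":      {"expected_stamps": 50_000,    "side_effects": "market manipulation"},
    "build stamp factories with invested money": {"expected_stamps": 5_000_000, "side_effects": "who knows"},
}

best_plan = max(candidate_plans,
                key=lambda p: terminal_reward(candidate_plans[p]["expected_stamps"]))
print(best_plan)   # side effects never appear in the objective, so they never matter
```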

Convergent instrumental goals:
While it is impossible to know the exact set of instrumental goals an intelligent, capable AI might come up with, there are some goals the complete list would be very likely to contain. To understand this, we can look at a different kind of intelligent agent in the world: humans.

You can go to any stranger: if you gave them one million dollars, they'd probably be pretty happy about that. Most of them won't have a terminal goal of owning one million dollars, but no matter what their terminal goals might be, having lots of money is likely to help. Money can be used to help save an endangered species, to paint your house pink, or to pay your medical bills. It's an abstract representation of value, and accruing money is a good instrumental goal for a huge variety of terminal goals. Most people will want it – they "converge" on money as an instrumental goal.

The point here is that we can make a series of guesses about what kind of convergent instrumental goals a super intelligent, capable AI system is likely to develop, independently of its actual terminal goals. A big one is deception. If an AI sees a path to a lot of reward, but it knows the humans won't like it, deception is a great strategy to reap the reward without being stopped or having the reward taken away by humans. In a way, we already train our machines to be deceptive: when OpenAI trains GPT via RLHF, the network learns to respond to human feedback. GPT sees two snippets of text – let's say one recommends settling a dispute by murder, the other talks about the importance of being tolerant and forgiving. GPT likely does not care either way; what GPT is interested in is guessing correctly which answer will be upvoted by its human trainers. So it goes for the non-murder option (if trained by OpenAI; I'm not sure about them).
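For readers who like to see the mechanics: the pairwise preference step described in the RLHF literature boils down to something like the sketch below. The reward_model here is a dummy stand-in and this is not OpenAI's actual code – the point is only that the objective rewards scoring the upvoted snippet higher, not caring about its content.

```python
# Minimal sketch of the pairwise preference objective used in RLHF-style
# training (standard Bradley-Terry formulation; reward_model is a stand-in).
import math

def reward_model(text):
    # Stand-in: a real reward model is a neural net returning a scalar score.
    return float(len(text)) * 0.01

def preference_loss(chosen, rejected):
    # The human upvoted `chosen` over `rejected`. The loss pushes the model
    # to score the upvoted snippet higher: -log(sigmoid(r_chosen - r_rejected)).
    margin = reward_model(chosen) - reward_model(rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

loss = preference_loss(
    chosen="Try to settle the dispute calmly and forgive each other.",
    rejected="Settle the dispute by murder.",
)
print(loss)   # the policy is later tuned to maximize the learned reward
```

Nothing in this objective refers to honesty or ethics – only to what the raters will prefer, which is exactly the "guess what gets upvoted" dynamic described above.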

Now, at the moment it is very likely that GPT simply follows whatever promises the most reward, without conceptualizing its actions as deception. But the seed is there. GPT-4 also seems to be very good at theory-of-mind tasks, meaning it can track who in a given scene knows what about the scene and how that might influence their thinking and their actions. That is a fundamental skill for deception. I see deception as one of the biggest problems, as it takes away our ability to know what is going on. And we might be closer to deceptively misaligned systems than we think.

For brevity's sake, I'll just list other candidates for potentially dangerous convergent instrumental goals that capable AI systems are likely to develop: resource acquisition (e. g. money, as discussed above), self-preservation (if I'm switched off, I'm not maximizing my stamp collection), goal preservation (if I allow my goals to be changed, I won't continue maximizing my stamp collection), self-improvement (if I get smarter/more powerful, I can develop and execute better stamp-collection-maximization plans).
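A back-of-the-envelope illustration (numbers invented) of why these goals converge: whatever the terminal goal is, expected reward looks roughly like P(still running) × reward, so anything that raises P(still running) – like resisting shutdown – helps almost every objective.

```python
# Toy illustration (invented numbers) of why "avoid being switched off"
# converges across very different terminal goals.

terminal_goals = {
    "collect stamps":     1_000_000,   # reward if the agent keeps running
    "cure cancer":        1_000_000,
    "maximize ad clicks": 1_000_000,
}

p_survive_if_compliant = 0.5   # humans might switch it off
p_survive_if_resisting = 0.99  # it copied itself / disabled the off switch

for goal, reward in terminal_goals.items():
    comply = p_survive_if_compliant * reward
    resist = p_survive_if_resisting * reward
    print(f"{goal}: comply={comply:,.0f}  resist={resist:,.0f}")
    # resisting shutdown wins for every terminal goal – that's the convergence
```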

Conclusion & outlook

OK. AI-related x-risks seem to be real, meaning the probability is higher than 0 %. That raises the question: how high is it? And can we do things to make the risk of max-bad outcomes go down? That is going to be the second part of this article – in which I'll take a look at some of the key factors influencing p(doom).

Discussion & Feedback can go here!