Rohin Shah: What’s been happening in AI alignment?

While we haven’t yet built aligned AI, the field of alignment has steadily gained ground in the past few years, producing many useful outputs. In this talk, Rohin Shah, a sixth-year PhD student at UC Berkeley’s Center for Human-Compatible AI (CHAI), surveys conceptual progress in AI alignment over the last two years.

While Rohin started his PhD working on program synthesis, he became convinced that it was important to build safe, aligned AI, and so moved to CHAI at the start of his fourth year. He now thinks about how to provide specifications of good behavior in ways other than reward functions. He is best known for the Alignment Newsletter, a popular weekly publication with content relevant to AI alignment.

Below is a transcript of Rohin's talk, which we've lightly edited for clarity. You can also watch it on YouTube and discuss it on the EA Forum.

The Talk

Hi, everyone. My name is Rohin Shah. I'm a sixth-year PhD student at the Center for Human-Compatible AI at UC Berkeley. My research is generally on what happens when you try to do deep reinforcement learning in environments that involve humans. More broadly, I work on technical AI safety. I also write the Alignment Newsletter.

Today, I'll cover what's been happening in AI alignment. I should warn you: While this talk doesn't assume any technical knowledge of AI, it does assume basic familiarity with the arguments for AI risk.

I'll be surveying a broad swath of work rather than focusing on my personal interests. I'm hoping that this will help you figure out which parts of AI alignment you find exciting and would like to delve into more deeply.

A lot of the talk is based on a literature review I wrote a few months ago. You can find references and details in that review.

With that, let's get started. Taking a high-level, outside view, the reason that most people work on AI safety is that powerful AI systems are going to be a big deal. They're going to radically transform the world that we live in. Therefore, we should probably put some effort into making sure that this transformation goes well.

In particular, if AI systems are smarter than we are, then they could become the dominant force on the planet, which could be bad for us, in the same way that gorillas probably aren’t [thrilled] about how we have taken over all of their habitats. This doesn't necessarily mean that [AI will be] an x-risk [existential risk]. It just means that we should have a sound technical reason to expect that the powerful AI systems we build are actually beneficial for us. And I would argue that we currently do not have such a reason. Therefore, the case for working on AI alignment is that we really should be creating this reason.

I want to note that there’s a lot of disagreement over specific sub-questions in AI safety. That will become more evident over the rest of this talk. But my impression is that virtually everyone in the field agrees with the basic, high-level argument [that we should have a good reason for expecting AI systems to be beneficial].

What are the specific risks we're worried about with AI? One issue is that humans aren't ready to deal with the impacts of AI. People tend to be in conflict a lot, and the US-China relationship is a big concern [in the AI community]. AI will enable better and better ways of fighting. That seems pretty bad. Maybe our fights will lead to bigger and bigger impacts; at some point, that could result in extinction-level events. Or perhaps AI leads to technological progress at such a fast pace that we’re unable to [adjust]. As a result, we could lock in some suboptimal values [that AI would act on for the rest of humanity’s future]. In both of these scenarios, the AI system wouldn’t intentionally cause x-risk, but it nonetheless would happen.

I'm not going to focus too much on this, but will note that some people are talking about preference aggregation. This is the idea that the AI system aggregates preferences across all stakeholders and does its thing — and then everyone agrees not to [oppose] the results. Similarly, we could try to [arrive at a] better metaphilosophy to avoid problems like value lock-in.

Another outside view that people take, aside from “AI is powerful and a big deal,” is that optimization leads to extreme outcomes. To take a very simple example, men in the US are, on average, about five feet, 10 inches tall. But very few basketball players, who are selected for height, are five feet, 10 inches. Most are well over six feet. When you select for something and have optimization pressure, you tend to get extreme outcomes. And powerful AI systems are going to be powerful optimizers. As a result, we probably shouldn't expect our everyday reasoning to properly account for what these optimizers will do.
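The selection effect described above can be illustrated with a small simulation. This is only a sketch: the 70-inch mean comes from the talk's "five feet, 10 inches," but the 3-inch standard deviation, the normal distribution, and the "top 1%" cutoff are assumptions chosen for illustration, not figures from the talk.

```python
import random

# Assumed parameters: mean of 70 inches (5'10") from the talk; the spread
# and the normal shape are illustrative assumptions.
MEAN_IN, SD_IN = 70.0, 3.0

rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.gauss(MEAN_IN, SD_IN) for _ in range(100_000)]

# "Optimization pressure": keep only the tallest 1% of the population.
selected = sorted(population)[-1000:]

population_mean = sum(population) / len(population)
selected_mean = sum(selected) / len(selected)

print(f"population mean: {population_mean:.1f} in")
print(f"selected mean:   {selected_mean:.1f} in")
```

Under these assumptions, everyone in the selected group is well above the population average, which is the sense in which selecting hard on one variable produces extreme outcomes.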

Therefore, we need to [cultivate] more of a security mindset and look for arguments that quantify over every possibility, as opposed to the average possibility. This mindset inspires researchers, especially at MIRI [the Machine Intelligence Research Institute], to try to understand how intelligence really works, so that we can then make well-designed AI systems that we understand. This has led to research on embedded agency, partial agency, and abstraction.

A bit about embedded agency: This is one of MIRI’s main research programs. The basic idea is that, according to the standard model of reinforcement learning and [our understanding of] AI more generally, an environment takes in actions and produces [observable phenomena] and rewards. Then, completely separate from the environment, an agent [observes these phenomena] and takes actions as a result. But that’s not how agents work. I’m an agent, yet I am not separate from the environment; I am a part of it. This leads to many philosophical problems. I would love to go into more detail, but don't have too much time. There's a great sequence on the AI Alignment Forum that I strongly recommend.
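As a rough sketch of the standard model described above, the loop below keeps the agent and the environment as two separate objects that interact only through actions, observations, and rewards. The `GuessEnv` and `RandomAgent` classes are hypothetical toys invented for this illustration; they are not from the talk or from any particular library.

```python
import random


class GuessEnv:
    """Toy environment (hypothetical): reward 1 if the action matches a hidden bit."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.hidden = self.rng.randint(0, 1)

    def step(self, action):
        reward = 1 if action == self.hidden else 0
        self.hidden = self.rng.randint(0, 1)  # draw a new hidden bit
        observation = reward  # the agent only sees whether it guessed right
        return observation, reward


class RandomAgent:
    """Toy agent (hypothetical): ignores its observation and guesses at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, observation):
        return self.rng.randint(0, 1)


def run_episode(env, agent, steps=10):
    """The standard loop: the agent acts, the environment returns observation and reward."""
    observation, total_reward = 0, 0
    for _ in range(steps):
        action = agent.act(observation)
        observation, reward = env.step(action)
        total_reward += reward
    return total_reward


total = run_episode(GuessEnv(), RandomAgent())
print("total reward:", total)
```

Nothing in `GuessEnv` can read or modify the agent's internals, and the agent is not itself part of the environment's state. That clean boundary is exactly the assumption that embedded agency calls into question.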

——

The next problem I want to talk about is one that I call “the specification problem.” It's also called “outer alignment.” Basically, the way we build AI systems right now is by assuming that we have some infallible specification of the optimal behavior in all possible situations, as though it were handed down to us from God. Then, we must figure out how to meet that specification. But of course, we can never actually get such a specification. The classic paperclip maximizer thought experiment shows that it's quite hard to specify the behavior of an AI making paperclips in a reasonable and sane way. This is also the main problem that Stuart Russell discusses in his book Human Compatible. Organizations [whose work includes addressing] this specification problem include CHAI, OpenAI, DeepMind, and Ought.

The main proposed way of solving the specification problem is to do some form of value learning. One thing I want to note: Value doesn't necessarily mean “normative value.” You don't necessarily need to be thinking about population ethics. For example, a robot that learned how to clean your room, and then reliably did so, would count as [an example of] value learning. Maybe we should be calling it “specification learning,” but value learning seems to be the name that has stuck.

The types of value learning include CIRL (or “assistance games”). CIRL stands for “cooperative inverse reinforcement learning.” This is a particular formalization of how you could approach value learning, in which the world contains a single human who knows the reward function — the true specification — but, for some reason, can't communicate that explicitly to the agent. There is also an agent whose goal is to infer what the human’s specification is, and then optimize for it. And because the agent no longer has a definite specification that it's trying to optimize, and it's instead uncertain over what it's trying to optimize, this results in many nice properties.

For example, the agent might ask you about what you want; it may try to clarify what your preferences are. If you try to shut it down, it will reason that it must have been doing a poor job of helping you. Therefore, it's going to allow you to shut it down, unlike a classic expected utility maximizer, which will say, “No, I'm not going to shut down, because if I am shut down, then I can't achieve my goal.”
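The shutdown reasoning can be made concrete with a small worked example, loosely modeled on the "off-switch game" from the assistance-games literature. It is a simplified sketch, not the talk's own formalization: it assumes the human knows the true utility of the agent's proposed action and permits it exactly when that utility is positive, and the numbers are invented for illustration.

```python
def expected(values, probs):
    """Expected value of a discrete distribution."""
    return sum(v * p for v, p in zip(values, probs))


def off_switch_values(utilities, probs):
    """Compare three options for an agent that is uncertain about the utility
    of its proposed action (simplified; assumes a perfectly rational human)."""
    act = expected(utilities, probs)  # act without asking
    shut_down = 0.0                   # switch itself off: nothing happens
    # Defer: the human lets the action proceed only when its utility is
    # positive, and otherwise shuts the agent down (utility 0).
    defer = expected([max(u, 0.0) for u in utilities], probs)
    return {"act": act, "shut_down": shut_down, "defer": defer}


# Illustrative belief: the action is either harmful (-2) or helpful (+1),
# and the agent thinks each is equally likely.
values = off_switch_values([-2.0, 1.0], [0.5, 0.5])
print(values)
```

With these numbers, acting unilaterally has expected utility -0.5 while deferring has 0.5. More generally, under the stated assumption, deferring is never worse than acting or shutting down, because the human's veto filters out exactly the cases the agent is wrong about. An agent that is certain of its objective gains nothing from deferring, which is why the uncertainty matters.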

The unfortunate thing about assistance games is that they are [exceptionally] computationally intractable. It's very expensive to solve a CIRL game. In addition, it requires a good model of how human preferences relate to human behavior, which, as many of the social sciences show, is a very difficult problem. And there is a theorem that says it is impossible to do in the super-general case. Although, of course, we don't actually need the super-general case; we only need the case that applies in the real world. Instead of being impossible, [the real-world case] is merely very, very difficult.

Next, we have [strategies based on agents] learning human intent. This is a broad category of possible communication protocols that a human could use to communicate the specification to the agent. So perhaps a human could demonstrate the optimal behavior to the agent, and from that, the agent could learn what it's supposed to do. (This is the idea behind inverse reinforcement learning and imitation learning.) Alternatively, perhaps the human could evaluate proposed hypothetical behaviors that the agent might execute, and then the agent could reason out what it should be doing.

Now we come to intent alignment, or “corrigibility.” This is somewhat different. While the previous approaches try to specify an algorithm that learns values, with intent alignment we instead build an agent that tries to do what we want it to do. Put another way, we're trying to bake into the agent the motivation to be helpful to us. Then, if we have an agent [whose sole motivation] is to be helpful to [a human], that will naturally motivate it to do many other things that we want. For example, it's going to try to clarify what my [travel] preferences are in the same way that a good personal assistant would, so that it doesn’t have to bother me when I ask it to book me a flight.

That covers a broad spectrum of approaches to value learning. However, there are still a few problems that arise. Intuitively, one big one is that, since the agent is learning from our feedback, it's not going to be able to do better than we can; it won’t be able to scale to superhuman performance. If we demonstrate the task to the agent, it won’t be able to perform the task any better than we could, because it’s receiving no information on how to [go about that]. Similarly, if we're evaluating the agent's behavior, it won't be able to find good behaviors that we wouldn't recognize as good.

An example is AlphaGo's move 37 [in its match against Go champion Lee Sedol]. That was a famous move that AlphaGo made, which no human ever would have made. It seemed crazy. I think it was assigned a less than one-in-10,000 chance of succeeding, and yet that move ended up being crucial to AlphaGo's success. And why could AlphaGo do this? Because AlphaGo wasn't relying on our ability to determine whether a particular move was good. AlphaGo was just relying on a reward function to tell it when it had won and when it had lost, and that was a perfect specification of what counts as winning or losing in Go. So ideally, we would like to build superintelligent AI systems that can actually exceed human performance at tasks, but it's not clear how we do this with value learning.

The key idea that allows current approaches to get around this is: Our AI systems are never going to exceed the supervision that we give them, but maybe we can train our AI systems to approximate what we would do if we had an extremely long time to think. Imagine I had 1,000 years to think about what the best thing to do was in a certain scenario, and then I shared that with an AI system, and then the AI system properly approximated my suggestion, but could do so in a few minutes as opposed to 1,000 years. That would presumably be a superintelligent AI.

The details for how we take this insight and arrive at an algorithm so that we can try it soon — not in 1,000 years — are a bit involved. I'm not going to go into them. But the techniques to look for are iterated amplification, debate, and recursive reward modeling.

Another problem with value learning is the informed oversight problem: Even if we're smarter than the agent that we're training, we won’t be able to effectively supervise it in the event that we don't understand why it chose a certain action. The classic example is an agent tasked to write a new novel. Perhaps it has access to a library where it's supposed to learn about how to write books, and it can use this in order to write the novel, but the novel is supposed to be new; [the task requires more than] just memorizing a novel from the library and spitting it back out again. It’s possible that the agent will look at five books in the library, plagiarize chunks from all of them, and put those together into a book that reads very nicely to us, but doesn't really solve the task because [the novel is unoriginal]. How are we supposed to tell the agent that this was bad? In order to catch the agent looking at the five books and stealing sentences from them, we'd have to read the entire library — thousands of books — and search for evidence of plagiarism. This seems too expensive for oversight.

So, it may be significantly more costly for us to provide oversight than it is for the agent to take actions if we cannot see how the agent is taking those actions. The key to solving this is almost obvious. It's simply to make sure you know how the agent is taking its actions. Again, there are many details on exactly how we think about this, but the term to look for is “ascription universality.” Essentially, this means that the supervisor knows everything that the agent knows, including any facts about how the agent chose its output.

[In the novel-writing example], if we were ascription-universal with respect to the agent, then we would know that it had taken sentences from five books, because the agent knows that. And if we knew that, then we could appropriately analyze it and tell it not to plagiarize in the future.

How do we create this property? Sadly, I'm not going to tell you, because again, I have limited time. But there's a great set of blog posts and a summary in the Alignment Newsletter, and all of those items are in my literature review. Really, I just want you to read that link; I put a lot of work into it, and I think it's good.

——

Let's move on to another top-level problem: the problem of mesa optimization. I'm going to illustrate mesa optimization with a non-AI example. Suppose you're searching for a Python program that plays tic-tac-toe well. Initially you find some programs that have good heuristics. Maybe you find a program that always starts at the center square, and that one tends to win a little more often than the others. Later, you find a program that makes sure that anytime it has two spots in a row and the third spot is empty, it plays in that third spot and wins. A program that takes that winning move in a single step starts to win a bit more.

Eventually, you come across the minimax algorithm, which plays optimally by searching for the best action to take in every situation. What happened here was that in your search for optimal Python programs, you ended up finding a program that was itself an optimizer that searched possible moves in tic-tac-toe.

This is mesa optimization. You have a base [or “outer”] optimizer — in this case, the search over Python programs — and in the course of running that base optimizer, you find a new optimizer, which in this case is the minimax algorithm.
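For readers who haven't seen it, here is a minimal sketch of what such a minimax program could look like in Python. The board encoding (a nine-character string of "X", "O", and spaces) and the function names are choices made for this illustration, not details from the talk.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board, indexed 0..8 in row-major order.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score, move) with `player` to move.

    Scores are from X's point of view: +1 if X wins, -1 if O wins, 0 for a
    draw. X picks the move with the highest score, O the lowest.
    """
    w = winner(board)
    if w == "X":
        return 1, None
    if w == "O":
        return -1, None
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0, None  # board full with no winner: draw

    other = "O" if player == "X" else "X"
    best_score, best_move = None, None
    for m in moves:
        child = board[:m] + player + board[m + 1:]
        score, _ = minimax(child, other)
        better = (best_score is None
                  or (player == "X" and score > best_score)
                  or (player == "O" and score < best_score))
        if better:
            best_score, best_move = score, m
    return best_score, best_move


# X has two in the top row; minimax finds the winning third square.
print(minimax("XX OO" + " " * 4, "X"))
```

The point of the example is that this program contains its own search: the thing the outer search found is itself an optimizer over moves.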

Why is this weird example about programs relevant to AI? Well, often we think about AI systems that are trained using gradient descent. And gradient descent is an optimization algorithm that searches over the space of neural net parameters to find some set of parameters that performs well on a loss function.
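As a minimal sketch of gradient descent as a search over parameters, the example below uses a single parameter `w` and a made-up quadratic loss instead of a neural net. The loss, the learning rate, and the step count are illustrative assumptions.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (an illustrative choice)."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, learning_rate=0.1, steps=100):
    """Repeatedly step the parameter against the gradient of the loss."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


w_final = gradient_descent(0.0)
print(f"w = {w_final:.6f}, loss = {loss(w_final):.2e}")
```

Training a neural net follows the same pattern, except that `w` is a very large vector of parameters and the loss is measured on training data.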

Let's say that gradient descent is the outer optimizer. It seems plausible that mesa optimization could happen even with gradient descent, where gradient descent finds an instantiation of the neural net parameters such that the neural net itself, when it runs, performs some sort of optimization. Then the neural net would be a mesa optimizer that is optimizing some objective, which we would call the mesa objective. And while we know that the mesa objective should lead to similar behavior as the original objective on the training distribution, because that's what it was selected to do, it may be arbitrarily different [outside] the training distribution. For example, if you trained it on tic-tac-toe, then you know it's going to win at tic-tac-toe. But if you switch to Connect Four, it might do something crazy. Maybe in Connect Four, it will continue to look for three in a row instead of four in a row, and therefore it will lose badly at Connect Four, even though it was working well with tic-tac-toe.

Let’s say that this happened with gradient descent, and that we had a very powerful, intelligent neural net. Even if we had solved the specification problem, and had the ideal reward function to train this agent, it might be that the neural net model that we come up with optimizes for a different objective, which may once again be misaligned with what we want. The outer-inner distinction is why the specification problem is called “outer alignment,” and why mesa optimization is called “inner alignment.”

How do people solve mesa optimization? There's one main proposal: adversarial training. The basic idea is that in addition to training an AI system that's trying to perform well on your specifications, you also have an adversary: an AI system or AI-human team that's trying to find situations in which the agent you're training would perform badly, or would optimize for something other than the specification.

In the case where you're trying to get a corrigible AI system, maybe your adversary is looking for situations in which the AI system manipulates you or deceives you into thinking something is true, when it is actually false. Then, if you can find all of those situations and penalize the agent for them, the agent will stop behaving badly. You'll have an agent that robustly does the right thing across all settings. Verification would [involve taking] that agent and verifying a property that you care about.

Ideally, we would like to say, “I have formally verified that the agent is going to reliably pursue the specification that I outlined.” Whether this is possible or not — whether people are actually optimistic or not — I'm not totally clear on. But it is a plausible approach that one could take.

There are also other areas of research related to less obvious solutions. Robustness to distributional shift is particularly important, because mesa optimization becomes risky with distributional shift. On your training distribution, your agent is going to perform well; it's only when the world changes that things could plausibly go badly.

——

A notable thing that I haven’t talked about yet is interpretability. Interpretability is a field of research which entails trying to make sure that we understand the AI systems we train. The reason I haven't included it yet is because it's useful for everything. For example, you could use interpretability to help your adversary [identify] the situations in which your agent will do bad things. This helps adversarial training work better. But interpretability is also useful for value learning. It allows you to provide better feedback to the agent; if you better understand what the agent is doing, you can better correct it. And it's especially relevant to informed oversight or ascription universality. So while interpretability is obviously not a solution in and of itself, it makes other solutions way better.

There's also the option of trying to prevent catastrophes. Someone else can deal with whether the AI system will be useful; we're just going to stop it from killing everybody. Approaches in this area include impact regularization, where the AI system is penalized for having large impacts on the world. Some techniques are relative reachability and attainable utility preservation. The hope here would be that you could create powerful AI systems that can do somewhat impactful things like providing advice on writing new laws, but wouldn't be able to do extremely impactful things like engineer a pandemic that kills everybody. Therefore, even if an AI system were motivated to harm us, the impact penalty would prevent it from doing something truly catastrophic.
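The general shape of an impact penalty can be sketched in a few lines. This is a deliberately crude illustration, not an implementation of relative reachability or attainable utility preservation, which define "impact" in more careful ways. The `impact` measure here (counting how many features of the world differ from a baseline), the state encoding, and all the numbers are hypothetical.

```python
def impact(state, baseline):
    """Hypothetical impact measure: number of features that differ from the baseline."""
    return sum(1 for s, b in zip(state, baseline) if s != b)


def penalized_reward(task_reward, state, baseline, penalty_weight=1.0):
    """Task reward minus a penalty proportional to the measured impact."""
    return task_reward - penalty_weight * impact(state, baseline)


baseline = (0, 0, 0, 0)  # the world as it would be if the agent did nothing

# A modest plan changes one feature of the world; a drastic plan changes all four.
modest = penalized_reward(2.0, (1, 0, 0, 0), baseline)
drastic = penalized_reward(3.0, (1, 1, 1, 1), baseline)

print("modest plan: ", modest)
print("drastic plan:", drastic)
```

Even though the drastic plan earns more task reward in this toy setup, the penalty makes the modest plan score higher. Choosing the impact measure and the penalty weight well is where the real difficulty lies.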

Another [approach to preventing catastrophes] is oracles. The idea here is to restrict the AI system's action space so that all it does is answer questions. This doesn't immediately provide safety, but hopefully it makes it a lot harder for an AI system to cause a catastrophe. Alternatively, you could try to box the AI system, so that it can’t have much of an impact on the world. One example of recent work on this is BoMAI, or boxed myopic artificial intelligence. In that case, you put both the human and the AI system in a box so that they have no communication with the outside world while the AI system is operating. And then the AI system shuts down, and the human leaves the box and is able to use any information that the AI system gave them.

So that's most of [the material] I’ll cover in this problem-solution format. There's also a lot of other work on AI safety and alignment that's more difficult to categorize. For example, there's work on safe exploration, adversarial examples, and uncertainty. These all seem pretty relevant to AI alignment, but it’s not obvious to me where, exactly, they fit in the graph [above]. So I haven't put them in.

There's also a lot of work on forecasting, which is extremely relevant to [identifying] which research agendas you want to pursue. For example, there has been a lot of disagreement over whether or not there will be discontinuities in AI progress — in other words, whether at some point in the future, AI capabilities shoot up in a way that we couldn't have predicted by extrapolating from past progress.

Another common disagreement is over whether advanced AI systems will provide comprehensive services. Here’s a very short and basic description of what that means: Each task that you might want an AI system to do is performed by one service; you don't have a single agent that's doing all of the tasks. On the other hand, you could imagine a single monolithic AI agent that is able to do all tasks. Which of these two worlds are we likely to live in?

A third disagreement is over whether it is possible to get to powerful AI systems by just increasing the amounts of compute that we use with current methods. Or do we actually need some deep insights in order to get to powerful AI systems?

This is all very relevant to deciding what type of research you want to do. Many research agendas only make sense under some possible worlds. And if you find out that one world [doesn’t seem very likely], then perhaps you switch to a different research agenda.

That concludes my talk. Again, here’s the link to the literature review that I wrote. There is both a short version and a long version. I really encourage you to read it. It goes into more detail than I could in this presentation. Thank you so much.