My last post was a summary of Toby Ord's case in The Precipice for taking AI risk seriously as an existential risk. The main existential risk that he focuses on is the 'alignment problem'.
I'd like to sketch out my initial thoughts and reactions to the alignment problem. My suggestion is that law could be a useful tool for reducing the surface area of the problem. I would need to read further to understand how feasible it is to tie an AI's reward function to compliance with law, but my sense is that, if it were possible, it would be worthwhile.
This post will be in three parts. The first part sets out the alignment problem, based on chapter 9 of Ord's book; the second is my reaction to the scenario he contemplates; the third discusses some advantages of the rule of law as a value-simplification system.
What is the alignment problem?
Should we develop AGI surpassing human intelligence, we would cede our position as the most intelligent and therefore most powerful entities on the planet. If we create such an AI, it would be in our interest to ensure either that it would obey our commands or that its goals were aligned with ours. Misalignment could be disastrous.
How might the alignment problem play out?
Ord sketches out a number of ways that misalignment could lead to catastrophe on an existential scale, but focuses on one thought experiment in particular.
The leading paradigm for developing AI is a combination of deep learning and 'reinforcement learning'. This involves programming positive and negative reinforcements into AIs. The AI is then 'incentivised' to maximise the positive reinforcement, or reward. AI researchers developed an agent that learned to excel at games on the Atari console purely by tying its reward function to the in-game score: the AI learned everything it needed to know simply by watching the screen and trying to maximise its points.
The specification of which acts produce reward is known as the 'reward function'. The reward function can be stipulated or learned.
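To make the idea concrete, here is a minimal, illustrative sketch of an agent whose entire objective is a hand-stipulated reward function. This is not the Atari system Ord describes; the five-state world, the actions, and the reward values are invented purely for illustration. The point is that the agent optimises whatever number the reward function returns, not any intention behind it.

```python
# A toy tabular Q-learning agent. Its only goal is to maximise the scalar
# returned by reward_function(); nothing else about human intent is encoded.

import random

N_STATES = 5          # toy one-dimensional world: states 0..4
ACTIONS = [-1, +1]    # move left or right
GOAL = 4              # reaching state 4 "scores points"

def reward_function(state: int) -> float:
    """Stipulated reward: the agent is paid only for reaching the goal.
    Nothing here says *why* the goal matters."""
    return 1.0 if state == GOAL else 0.0

# Q-table: estimated future reward for each (state, action) pair
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration

for episode in range(500):
    state = 0
    while state != GOAL:
        # epsilon-greedy: mostly exploit the current estimates, sometimes explore
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        r = reward_function(next_state)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
        state = next_state

# After training, the greedy policy simply marches toward the goal state.
print([max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES)])
```

Everything the agent "cares about" lives in that one small function; the difficulty Ord highlights is that human values do not compress into anything so simple.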
Encoding human values in the reward function is very difficult. This is because human values are not only very complex and subtle, but also diverse. We can see from the real world that the values of different groups regularly come into conflict. Attempts to align reward functions with some universal picture of human values are therefore likely to be flawed.
The most lurid and compelling of these scenarios involves an AI seizing control over the world. We run the risk of an AI concluding that humans stand in the way of maximising its reward and therefore turning against us.
This could happen, ironically, because it recognised that its reward function was not aligned with human goals, values, or interests. It would then anticipate that humans would try to stop it from taking actions to maximise its reward. Humans could be expected to do this if they recognised that the AI would harm them by pursuing its reward. Even the possibility of human intervention to change its reward function could be perceived as a threat to the AI. Changing the reward function would, after all, prevent it from maximising the reward that it currently prioritises.
So, rather than correcting its reward function to align with human goals and values, the AI could intervene in human affairs (and Ord charts out a plausible pathway for such an AI to obtain wealth, influence, and ultimately real power).
Ord's summation is chilling: "We do not yet know how to create a system that, on noticing misalignment with humanity's goals, updates its values to better align with humanity's interests, rather than updating its instrumental goals to overcome us."
So that's (one version of) the alignment problem in a nutshell. In my next post, I'll give you my intuitions about the role of law in solving it.