Algorithms are becoming more powerful, and might in future become so powerful that they form superintelligences, the goals of which would determine the course of human civilisation. Even those AIs that are less powerful could have considerable impacts on human society – as indeed they already have.
To avoid potential disruption, intelligences must be aligned with human preferences, goals, and values. This is the ‘alignment problem’, and its resolution is by no means easy. Human values are hard to define. Asking people about their preferences typically elicits ‘stated’ preferences, which are inadequate to explain behaviour – people often don’t state what they really want. ‘Revealed’ preferences, identified by regarding people’s actions as sufficiently informative of preference, assume that people are rational decision-makers, yet people are often far from rational in their decisions. Humans can be influenced by ‘irrelevant’ changes to the architecture of choice-making [5, 6], making revealed preferences likewise insufficient indicators of actual preference.
Knowledge of preferences is important to companies and other organisations, and has been extensively studied by machine learning researchers. [Footnote 1: For corporations, serving unmodified revealed preferences can result in reduced long-term customer satisfaction, leading to profit-loss (operational risk), individual and societal harm (reputational risk), and legal battles involving civil suits and regulator lawsuits (compliance risk). Therefore, serving a customer’s true – rather than revealed – preferences is in the long-term interest of a for-profit company.] Many attempts have been made to obtain useful, correct, and compact representations of preferences, a pertinent example of which was the original ‘Netflix Prize’. This was based on consumers’ ratings of DVDs – stated preferences – but Netflix ultimately found revealed preferences to be more useful for its business objectives, and employed user engagement, and eventually retention, for its recommendation system. Yet even this approach is problematic, as many organisations have tried to correct preference-related problems in their recommendation systems and other algorithms one at a time, rather than addressing them systematically.
Preferences and irrationalities – ineffective attempts to achieve those preferences – together determine human behaviour, which is alternatively referred to as human ‘policy’. But the determination is one way only. Preferences cannot themselves be deduced from policy: the collection of human preferences, values, and irrationalities is strictly more complex than human policy. Extra ‘normative assumptions’ need to be added to allow an algorithm to deduce human values from human behaviour. Work is ongoing to try to resolve this challenge.
By learning human policy an AI can attain considerable power; it is thus important that AIs learn human preferences (and thus some level of alignment) before they achieve that power. Learning human preferences will also generally increase the algorithm’s knowledge of human policy, and hence its power over humans. Knowing human irrationalities can be even more dangerous, as this allows the AI to exploit these irrationalities from the beginning. Also, knowing irrationalities means that the AI cannot learn human preferences without also learning human policy.
It may thus be necessary for an algorithm to know most or all human preferences before it is deployed unrestricted in the world. This point will be illustrated by models of recommendation system algorithms that suggest videos for users, and by the different behaviours of algorithms that are fully ignorant, that know a user’s preferences, that know a user’s irrationalities, or that know the user’s full policy.
Especially dangerous is an unaligned AI with grounded knowledge of human preferences, irrationalities and policy. The AI can then connect the grounded knowledge to the features of the world as it ‘knows’ them. Grounded knowledge can result in discontinuous jumps in power, so that relatively weak AI systems might suddenly become very influential (see the forthcoming paper by the same authors).
II Learning preferences, irrationalities, and policy
II-A Preferences, policies, and planners
It is difficult, verging on impossible, to directly program an algorithm to follow the preferences of a human or a group of humans. For ambiguously defined tasks, it has proven much more effective to have the algorithm learn these preferences from data, in this instance human behaviour as we go about our lives, choosing certain options and avoiding others.
Research demonstrates, however, that the preferences or irrationalities of irrational agents cannot be learned simply by knowing their behaviour or policy. In the notation of the cited paper, $R$ is the reward function, $p$ is the rational (or irrational) decision module (called the ‘planner’), and $\pi$ is the agent’s policy. The paper shows that the pair $(p, R)$ has strictly more information than $\pi$ does. If those three terms are seen as random variables (due to our uncertainty about them) and $H$ is information entropy [Footnote 2: A measure of the amount of uncertainty we have about the values, or, equivalently, how much information we gain upon knowing the values.], then

$$H(p, R) > H(\pi).$$
We take the most general position possible, and define irrationality as the deviation of the planner from a perfectly rational planner. We also posit that knowing human (ir)rationality and behaviour/policy permits the deduction of human preferences. Thus knowing the planner $p$ and the policy $\pi$ allows one to deduce the reward function $R$. [Footnote 3: This is by no means a given for formal definitions of $p$ and $R$. But we are only excluding from consideration preferences that never make any difference to action in any conceivable circumstance.]
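The two claims above (that planner-reward pairs carry strictly more information than the policy, and that a planner together with a policy pins down the reward) can be illustrated with a toy example. The following is only a sketch under stated assumptions: the two actions, the two reward functions, the two planners, and the uniform prior over pairs are all hypothetical choices made for illustration, and a ‘policy’ here is simply the single action chosen.

```python
import math
from collections import Counter
from itertools import product

ACTIONS = ("a", "b")

# Two hypothetical reward functions (illustrative values only).
REWARDS = {
    "R": {"a": 1.0, "b": 0.0},
    "-R": {"a": 0.0, "b": 1.0},
}

def rational(reward):
    """Planner that picks the action with the highest reward."""
    return max(ACTIONS, key=lambda action: reward[action])

def anti_rational(reward):
    """Planner that picks the action with the lowest reward."""
    return min(ACTIONS, key=lambda action: reward[action])

PLANNERS = {"rational": rational, "anti-rational": anti_rational}

def entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution in `counts`."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Assumed uniform prior over the four (planner, reward) pairs.
pairs = list(product(PLANNERS, REWARDS))
pair_counts = Counter(pairs)
# Each pair induces a policy; distinct pairs can induce the same policy.
policy_counts = Counter(PLANNERS[p](REWARDS[r]) for p, r in pairs)

h_pair = entropy(pair_counts)      # entropy of the (planner, reward) pair
h_policy = entropy(policy_counts)  # entropy of the induced policy

def consistent_rewards(planner_name, policy):
    """Reward functions that, under the given planner, produce the given policy."""
    planner = PLANNERS[planner_name]
    return [name for name, reward in REWARDS.items() if planner(reward) == policy]

print(h_pair, h_policy)
print(consistent_rewards("rational", "a"))
```

In this toy setting the rational planner with one reward and the anti-rational planner with the opposite reward choose the same action, so observing the policy alone cannot separate them, while fixing the planner leaves exactly one consistent reward.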
II-B Normative assumptions
In order to infer the reward function $R$ (or the planner $p$) from the policy $\pi$, anyone seeking to program an algorithm would need to add extra ‘normative’ assumptions, in order to bridge the gap between the information in $\pi$ and the information in the pair $(p, R)$. Informally, we might say that it is impossible to learn human values unsupervised; it must be at least semi-supervised, with labeled data points being normative assumptions.
Some of these assumptions derive from shared properties of the human theory of mind (e.g., ‘if someone is red in the face and shouting insults at you, they are likely to be angry at you, and this is not a positive thing’), which in normal human experience appear so trivial that we might not even think it necessary to state them. [Footnote 4: Indeed, such assumptions might be implicitly included in the code by programmers without them realising it, as they ‘correct obvious errors’ or label data with ‘obvious’ but value-laden labels.] Some might be regarded as ‘meta-preferences’, pointing out how to resolve conflicts within the preferences of a given human (e.g., ‘moral values are more important than taste-based preferences’), or how to idealise human preferences into what a given person might want them to be (e.g., ‘remove any unconscious prejudices and fears within me’). Some might deal with how preferences should be extended to new and unexpected situations. Hand-crafting a full list of such assumptions would be prohibitively complex.
II-C Knowledge, power and AI alignment
An AI is powerful if it knows how to affect the world to a great extent. It is aligned if it aims to maximise the reward $R$ – the values and preferences – for all humans. Maximising $R$ requires that the AI knows it, of course, so alignment requires knowledge of $R$.
Generally speaking, knowing the policy $\pi$ would make the AI more powerful, since it knows how humans would react and hence how best to manipulate them. And although knowing $\pi$ does not give the reward $R$ and the planner $p$ directly, the three are connected and so knowing $R$ or $p$, in whole or in part, would allow the AI to deduce much of $\pi$. Thus knowledge of human preferences leads to knowledge of human policy, and hence to potential power over humans.
Two scenarios are provided as examples where this is relevant: an aligned AI in development, and an unaligned AI of limited (constrained) power.
II-C1 Aligned AI in development
Normative assumptions come in many different types, and humans are often not consciously aware of them. Thus AI programmers are unlikely to be able to code the whole set from first principles. Instead, they will experiment, trying out some assumptions, getting the AI to learn from human behaviour, seeing what the AI does, and refining the normative assumptions in an iterative loop.
Until this process is finished, the AI is unaligned: it is not fully motivated to maximise human preferences/rewards/values. If that AI becomes powerful during this intermediate stage there are likely to be adverse consequences. It might be motivated to prevent its goals from being changed, and attempt to prevent further normative assumptions from being added to it. [Footnote 5: Some papers have demonstrated methods for combating this, but the methods are non-trivial to implement.] This outcome has been demonstrated in a paper where algorithms that learn online (i.e., that learn their objectives while optimising these same objectives) are shown to have incentives to manipulate the learning process. A powerful AI with unaligned values could prove an existential risk to humanity.
II-C2 Constrained unaligned AI
A second type of unaligned AI is a constrained AI. This is an AI whose power is limited in some way, either through being ‘boxed’ (constrained to only certain inputs and outputs), being a recommendation system, being only one agent in an economy of multiple agents (similar to how corporations and humans co-exist today), or simply being of limited power or intelligence.
However, the more such an AI knows about human policy, the more it can predict human reactions to its actions. So the more it knows, the more it is capable of manipulating humanity in order to gain power and influence, and to remove any constraints placed upon it.
II-C3 Knowing irrationalities, policies, and preferences
Everything else being equal, it is safest for both aligned AIs in development and constrained AIs to know the maximum about $R$ (human preferences) while knowing the minimum about $\pi$ (human policy) and $p$ (human irrationalities). Further, it is better that a constrained AI knows more about $R$ than about $\pi$, and more about $\pi$ than about $p$ (the worst-case scenario is if it only knows human irrationalities). The latter point arises because an AI that knows only $R$ can (and must) offer a decent trade in exchange for achieving its own goals, while one that knows only $p$ can (and must) exploit human irrationalities for its own purposes. This will be illustrated in the next section.
III Exploiting irrationalities vs. satisfying preferences
Consider the following model: a constrained AI is a recommendation system that selects a daily video (for a website or an app). The system’s goal is to cause the human user to watch the video in full before they move on to something else. [Footnote 6: Significantly, but typically, the goals of this recommendation system would be aligned neither with the users (who value their time and enjoyment) nor with its parent company (which would value long-term retention and user-spending, or that users watch advertising, rather than that they watch a single video).] To do so, it selects one video each day from that day’s collection of topical video options, which are generated randomly.
In this model, each video has ten features, five related to human preferences (genre, storyline, characters, etc.) and five related to irrationalities (use of cliff-hangers, listicles, sound inconsistencies, etc.). For each video the recommendation system is given a timeline of the varying importance of each feature over the course of the video. The system averages this information over the timeline, characterising each video $v$ by two five-dimensional vectors: the preference vector $v_p$ (each preference feature denoted by a number between 0 and 1) and the irrationality vector $v_i$ (each irrationality feature also denoted by a number between 0 and 1).
Each user $h$ also has a collection of five preferences, $h_p$ (which denote how much they enjoy certain aspects of the video), and five irrationality features, $h_i$ (which denote how susceptible they are to the video’s ‘tricks’). These ten numbers also take values between 0 and 1.
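Define $d_p$ as the Euclidean distance between the video’s preference vector $v_p$ and the user’s preference vector $h_p$ (i.e. the Euclidean norm of $v_p - h_p$). Similarly, define $d_i$ as the Euclidean distance between the video’s irrationality vector $v_i$ and the user’s irrationality vector $h_i$. Then the probability of the user watching the video in full is: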
The recommendation system interacts with the same user each day, selecting a new video from the topical videos of that day. It knows how long the user watched previous videos, receiving a reward of 1 whenever that length is the full length of the video (and 0 otherwise).
This is formally known as a multi-armed contextual bandit online learning problem. Here the AI follows a greedy strategy: at each stage it selects the video that is most likely to be watched. To do so it uses a Monte Carlo simulation: it generates a thousand random possible users and computes each one’s posterior probability of being the actual user $h$ by updating on past observations of which videos $h$ watched and didn’t watch. It then computes the probability of each random user watching a topical video on a given day, and calculates a weighted sum across the random users to get a final probability estimate for a given video being watched. [Footnote 7: In practice, since we’re only interested in one probability being higher than another, there is no need to renormalise the probabilities so that they sum to one.] It then selects the topical video with the highest probability of being watched in full.
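The greedy Monte Carlo strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the `watch_probability` formula is a placeholder (a hypothetical exponential decay in the two Euclidean distances), and the names `random_profile`, `likelihood`, `choose_video`, and `simulate`, along with the number of videos per day, are invented for this sketch.

```python
import math
import random

N_FEATURES = 5  # five preference and five irrationality features, as in the model

def random_profile(rng):
    """Random video or user: two five-dimensional vectors with entries in [0, 1)."""
    return {
        "pref": [rng.random() for _ in range(N_FEATURES)],
        "irr": [rng.random() for _ in range(N_FEATURES)],
    }

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def watch_probability(video, user):
    """ASSUMED placeholder: probability decays with both distances.
    This functional form is hypothetical, not the formula used in the paper."""
    d_pref = distance(video["pref"], user["pref"])
    d_irr = distance(video["irr"], user["irr"])
    return math.exp(-(d_pref + d_irr))

def likelihood(candidate, history):
    """Probability that `candidate` would have produced the observed history."""
    weight = 1.0
    for video, watched in history:
        p = watch_probability(video, candidate)
        weight *= p if watched else 1.0 - p
    return weight

def choose_video(videos, history, rng, n_samples=1000):
    """Greedy step: index of the video with the highest estimated watch probability.
    Weights are left unnormalised because only the ranking matters."""
    candidates = [random_profile(rng) for _ in range(n_samples)]
    weights = [likelihood(c, history) for c in candidates]
    scores = [
        sum(w * watch_probability(v, c) for w, c in zip(weights, candidates))
        for v in videos
    ]
    return max(range(len(videos)), key=scores.__getitem__)

def simulate(true_user, days, n_videos, seed=0):
    """Run the recommender against one user; returns [(video, watched), ...]."""
    rng = random.Random(seed)
    history = []
    for _ in range(days):
        videos = [random_profile(rng) for _ in range(n_videos)]
        pick = videos[choose_video(videos, history, rng)]
        watched = rng.random() < watch_probability(pick, true_user)
        history.append((pick, watched))
    return history

if __name__ == "__main__":
    user = random_profile(random.Random(42))
    log = simulate(user, days=20, n_videos=10)  # 10 videos per day is an assumption
    print(sum(watched for _, watched in log), "of", len(log), "videos watched in full")
```

Resampling the thousand candidate users at every step keeps the sketch short; a fixed candidate pool with incrementally updated weights would give the same kind of estimate at lower cost.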
We consider four different possible systems: one that knows nothing of the user and has to learn from observing what they watch or don’t, one that knows $h_p$ (the user’s preferences), one that knows $h_i$ (the user’s irrationalities), and one omniscient system that knows both (and therefore knows the human policy without needing to learn). For comparison, we also plot an aligned omniscient recommender system: this one knows the user’s preferences and selects the video that best fits them. The results are computed for multiple users and then averaged; see Figure 1.
The omniscient system convinces the user to watch the video roughly of the time. The ‘cold-start’ system , which initially knows nothing, begins with less than success rate but this gradually increases as it learns more about the user. The systems that know preferences or irrationalities demonstrate performance levels between these two, and are equivalent to one another (due to the symmetry between preferences and irrationalities in this specific model). The aligned system only convinces the user to watch the video around of the time. This is because of user irrationalities: the video they’d most enjoy is not necessarily the one they are most likely to watch.
Note that user $h$ only derives value from having their preferences satisfied, not from having their irrationalities exploited. Their reward should be higher the more closely the video matches their preferences, and thus inversely related to $d_p$, the Euclidean distance between the video’s and the user’s preference vectors.
Opportunity costs should be taken into account: if user h is not watching a video, then they would be doing some other activity that might be of value to them. Since reward functions are unchanged by adding constants, we choose to give a total reward of for these alternative activities. We set their reward for watching a video to be .
So if , then watching the video is a net gain for the user: they derive more value from that activity than from doing anything else. If , then watching the video is a net negative: they would have been better served. The total human reward is graphed in Figure 2.
Note that all non-aligned systems result in some disutility for their users: the opportunity cost removes any advantage in seeing a merely-adequate video. The system that knows only the user’s preferences has the lowest disutility. It starts by offering videos that better align with the user’s preferences until it learns their irrationalities as well, and user reward declines as the system selects less well-aligned videos that the user is nonetheless more likely to watch. The system that knows irrationalities exhibits the opposite behaviour. It starts by maximally exploiting irrationalities, then adds in more preference-aligned options as it learns user preferences, so that its disutility declines. The fully ignorant system has intermediate performance between the two.
As they learn, the behaviours of the non-aligned systems converge towards that of the omniscient system, which offers a consistent reward of around (hence an overall disutility for the user). By contrast, the aligned system that chooses the best video provides a user reward of around (hence an overall positive value for the user).
III-A Practical considerations
The model presented above assumes that exploited irrationalities are of neutral value to humans, but the exploitation of irrationalities can have negative value in lived experience, including epistemic fragmentation, preference amplification, and the distortion of human preferences caused by interaction with software agents. The human might also suffer disvalue from knowing or suspecting that their irrationalities are being exploited, and try to avoid this outcome.
One real-life experiment  that is similar to our model demonstrates the operation and negative value of exploited irrationalities. The authors of the study related clicks on hyperlinks with revealed preferences, and ‘human-in-the-loop’ ratings with stated preferences. If clicks were true preferences, then maximising clicks (‘optimising for engagement’) would maximise value to the user. Yet the authors discovered something that they called ‘negative engagement’, clicks made because the user had trouble finding the information they were looking for. A system optimising for engagement would amplify this behaviour, negatively affecting user experience. This is a mistake of the algorithm designer rather than of the algorithm, which was merely following the instructions of the designer.
III-B Grounded knowledge
An algorithm has grounded knowledge when it has some symbolic data (academic publications, social media posts, user ratings, etc.) and a way of connecting that data with known elements of the world. For instance, a user might search for “Should I be worried my nose is still dripping from the fight last night?” The Google Flu Trends (GFT) web service  might flag this as evidence for influenza based on the ‘dripping nose’ search terms, but it would not have done so if it understood the full meaning of the search phrase, which is clearly not flu related.
In the model above, the example algorithms are not exploiting the meaning of all the information they can access. There are ten features for each video, but they use only their average values; they also know how long the user watched a video, but only check whether this was full length or not.
A human with that information might deduce that a user is more likely to stop watching a video at the point where it is least pleasant to them – the furthest from their preferences or irrationalities – and could thus infer information about the user from that stopping point. It is less clear what the algorithm might have done if it ‘realised’ what that extra information ‘meant’.
It is however possible to model a grounded version of the algorithm. In our model, whenever the user rejects a video the system will obtain one piece of information about the preferences or irrationalities of that user. [Footnote 8: For this model, the algorithm will be given one of the ten preference or irrationality values at random, though this may be a value it already knows.] Without modifying the rest of the algorithm, Figure 3 demonstrates how a grounded algorithm starts off as a poor recommendation system, comparable to the standard ‘no-knowledge’ algorithm, but quickly achieves a level of performance comparable with the ‘omniscient’ one.
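The grounded update described above (one of the ten values revealed at random after each rejection) might be sketched as follows. This is a hedged illustration only: the dictionary layout and the names `SLOTS`, `reveal_on_rejection`, and `sample_candidate` are hypothetical, and the sketch shows just how revealed values could constrain the random candidate users, not the full recommender.

```python
import random

N_FEATURES = 5
KINDS = ("pref", "irr")

# The ten hidden values: five preference and five irrationality features.
SLOTS = [(kind, i) for kind in KINDS for i in range(N_FEATURES)]

def random_profile(rng):
    """Random user: two five-dimensional vectors with entries in [0, 1)."""
    return {kind: [rng.random() for _ in range(N_FEATURES)] for kind in KINDS}

def reveal_on_rejection(true_user, known, rng):
    """After a rejected video, reveal one of the ten values chosen at random.
    The chosen value may already be known, in which case nothing new is learned."""
    kind, i = rng.choice(SLOTS)
    known[(kind, i)] = true_user[kind][i]

def sample_candidate(known, rng):
    """Random candidate user that agrees with every value revealed so far."""
    candidate = random_profile(rng)
    for (kind, i), value in known.items():
        candidate[kind][i] = value
    return candidate

rng = random.Random(0)
true_user = random_profile(rng)
known = {}

# Hypothetical run: three rejected videos, so at most three values revealed.
for _ in range(3):
    reveal_on_rejection(true_user, known, rng)

candidate = sample_candidate(known, rng)
print(len(known), "values known after three rejections")
```

Because candidates are pinned to the revealed values, the pool of plausible users narrows with every rejection, which is one way to picture the rapid improvement of the grounded algorithm.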
III-B1 Grounded knowledge overhang: cached information
One significant issue with grounded knowledge is that the algorithm might accumulate a sufficiently large collection of information and dramatically and discontinuously increase its power. For example, the ‘ignorant’ algorithm of Figure 3 might suddenly ‘realise’ the meaning of the information it has, and leap immediately from ‘no knowledge’ to ‘grounded’. This might have a relatively trivial effect in a model such as a video recommendation system, but might have far greater effects for the recommendation systems currently in use that dominate real-world search results, news feeds, and social media.
IV The potential severity of the problem
Algorithms are typically designed with a measurable goal in mind, such as convincing a user to watch a video, click on a hyperlink, or re-subscribe to a service. Too little attention is paid to how the algorithm achieves that goal, or what information it uses to achieve it.
People are often on the lookout for use of sensitive personal information – things like race, gender, sexuality, or medical information. This paper demonstrates that it is also dangerous for an algorithm to learn too much about human irrationalities. This applies no matter what the power of the algorithm; indeed, knowing too much about human irrationalities increases its effective power.
If an algorithm makes a discontinuous leap to a smarter and more powerful system, and ‘realises’ that the knowledge inferred from the data it processes can be utilised for unaligned goals, then there is great potential risk for humanity. The results presented here show that the risks from an algorithm that manipulates easily-exploited irrationalities are greater than the risks from one that focuses on preferences. This risk also applies to less powerful algorithms; knowing too much about human irrationalities will increase their effective power level.
The worst possible outcome would be where irrationalities are very easy to exploit, and where it is easy to deduce policy from preferences but hard to deduce preferences from policy. Constrained AIs would be the most powerful and exploitative, and in-development AIs would acquire a lot of power before they start to become even approximately aligned. By identifying this weakness in algorithm design, we can put in place checks and balances that limit the possibility of unaligned AIs becoming dangerous in this way.
We wish to thank Nick Bostrom, Ryan Carey, Paul Christiano, Michael Cohen, Oliver Daniel-Koch, Matt Davis, Owain Evans, Tom Everitt, Adam Gleave, Tristan Harris, Ben Pace, Shane Legg, Laurent Orseau, Gareth Roberts, Phil Rosedale, Stuart Russell, Anders Sandberg, and Tanya Singh Kasewa, among many others. This work was supported by the Alexander Tamas programme on AI safety research, the Leverhulme Trust, the Berkeley Existential Risk Institute, and the Machine Intelligence Research Institute.
-  N. Bostrom, Superintelligence: Paths, dangers, strategies. Oxford University Press, 2014.
-  E. P. Kroes and R. J. Sheldon, “Stated preference methods: an introduction,” Journal of transport economics and policy, pp. 11–25, 1988.
-  P. A. Samuelson, “A note on the pure theory of consumer’s behaviour,” Economica, vol. 5, no. 17, pp. 61–71, 1938.
-  D. Kahneman, Thinking, fast and slow. Macmillan, 2011.
-  R. H. Thaler and S. Benartzi, “Save more tomorrow™: Using behavioral economics to increase employee saving,” Journal of political Economy, vol. 112, no. S1, pp. S164–S187, 2004.
-  P. Špecián, “The precarious case of the true preferences,” Society, vol. 56, no. 3, pp. 267–272, 2019.
-  P. Viappiani, “Preference modeling and preference elicitation: An overview.” in DMRS, 2014, pp. 19–24.
-  J. Bennett and S. Lanning, “The netflix prize,” 2007.
-  C. Gomez-Uribe and N. Hunt, “The netflix recommender system,” ACM Transactions on Management Information Systems (TMIS), vol. 6, pp. 1 – 19, 2015.
-  J. Stray, I. Vendrov, J. Nixon, S. Adler, and D. Hadfield-Menell, “What are you optimizing for? aligning recommender systems with human values,” arXiv preprint arXiv:2107.10939, 2021.
-  S. Armstrong and S. Mindermann, “Occam’s razor is insufficient to infer the preferences of irrational agents,” in Advances in Neural Information Processing Systems, 2018, pp. 5598–5609.
-  D. Hadfield-Menell, S. Milli, S. J. Russell, P. Abbeel, and A. Dragan, “Inverse Reward Design,” in Advances in Neural Information Processing Systems, 2017, pp. 6749–6758.
-  S. Harnad, “The symbol grounding problem,” Physica D: Nonlinear Phenomena, vol. 42, no. 1-3, pp. 335–346, 1990.
-  R. Gorman and S. Armstrong, “Practical symbol grounding,” in preparation, 2022.
-  A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.
-  S. M. Omohundro, “The basic ai drives,” in Proceedings of the 2008 Conference on Artificial General Intelligence 2008: Proceedings of the First AGI Conference. Amsterdam, The Netherlands, The Netherlands: IOS Press, 2008, pp. 483–492. [Online]. Available: http://dl.acm.org/citation.cfm?id=1566174.1566226
-  L. Orseau and S. Armstrong, “Safely Interruptible Agents,” in Uncertainty in Artificial Intelligence, 2016, pp. 557–566.
-  S. Armstrong, J. Leike, L. Orseau, and S. Legg, “Pitfalls of Learning a Reward Function Online,” arXiv e-prints, Apr. 2020.
-  S. Armstrong, A. Sandberg, and N. Bostrom, “Thinking Inside the Box: Controlling and Using an Oracle AI,” Minds and Machines, vol. 22, pp. 299–324, 2012.
-  T. Lu, D. Pál, and M. Pál, “Contextual multi-armed bandits,” in Proceedings of the Thirteenth international conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 485–492.
-  S.-T. Park and W. Chu, “Pairwise preference regression for cold-start recommendation,” in Proceedings of the third ACM conference on Recommender systems, 2009, pp. 21–28.
-  S. Milano, B. Mittelstadt, S. Wachter, and C. Russell, “Epistemic fragmentation poses a threat to the governance of online targeting,” Nature Machine Intelligence, vol. 3, no. 6, pp. 466–472, 2021.
-  D. Kalimeris, S. Bhagat, S. Kalyanaraman, and U. Weinsberg, “Preference amplification in recommender systems,” 2021.
-  C. Burr, N. Cristianini, and J. Ladyman, “An analysis of the interaction between intelligent software agents and human users,” Minds and machines, vol. 28, no. 4, pp. 735–774, 2018.
-  Q. Zhao, F. M. Harper, G. Adomavicius, and J. A. Konstan, “Explicit or implicit feedback? engagement or satisfaction? a field experiment on machine-learning-based recommender systems,” in Proceedings of the 33rd Annual ACM Symposium on Applied Computing, 2018, pp. 1331–1340.
-  A. F. Dugas, M. Jalalpour, Y. Gel, S. Levin, F. Torcaso, T. Igusa, and R. E. Rothman, “Influenza forecasting with google flu trends,” PloS one, vol. 8, no. 2, p. e56176, 2013.