# Optimal Farsighted Agents Tend to Seek Power

Some researchers have speculated that capable reinforcement learning (RL) agents pursuing misspecified objectives are often incentivized to seek resources and power in pursuit of those objectives. An agent seeking power is incentivized to behave in undesirable ways, including rationally preventing deactivation and correction. Others have voiced skepticism: humans seem idiosyncratic in their urges to power, which need not be present in the agents we design. We formalize a notion of power within the context of finite deterministic Markov decision processes (MDPs). We prove that, with respect to a wide class of reward function distributions, optimal policies tend to seek power over the environment.

## 1 Introduction

Instrumental convergence is the idea that some actions are optimal for a wide range of goals: for example, to travel as quickly as possible to a randomly selected coordinate on Earth, one likely begins by driving to the nearest airport. Driving to the airport would then be instrumentally convergent for travel-related goals. In other words, instrumental convergence posits that there are strong regularities in optimal policies across a wide range of objectives.

Power may be defined as the ability to accomplish goals in general (an informal definition suggested by Cohen et al. (2019)). This seems reasonable: “money is power”, as the saying goes, and money helps one achieve many goals. Conversely, physical restraint reduces one’s ability to steer the situation in various directions. A deactivated agent has no control over the future, and so has no power.

Instrumental convergence is a potential safety concern for the alignment of advanced RL systems with human goals. If gaining power over the environment is instrumentally convergent (as suggested by e.g. Omohundro (2008); Bostrom (2014); Russell (2019)), then even minor goal misspecification will incentivize the agent to resist correction and, eventually, to appropriate resources at scale to best pursue its goal. For example, Marvin Minsky imagined an agent tasked with proving the Riemann hypothesis might rationally turn the planet into computational resources (Russell and Norvig (2009)).

Some established researchers have argued that to impute power-seeking motives is to anthropomorphize, and recent months have brought debate as to the strength of instrumentally convergent incentives to gain power.

Pinker (2015) argued that “thinking does not imply subjugating”. It has been similarly suggested that cooperation is instrumentally convergent (and so the system will not gain undue power over us).

We put the matter to formal investigation, and find that their positions are contradicted by reasonable interpretations of our theorems. We make no supposition about the timeline over which real-world power-seeking behavior could become plausible; instead, we concern ourselves with the theoretical consequences of RL agents acting optimally in their environment. Instrumental convergence does, in fact, arise from the structural properties of MDPs. Power-seeking behavior is, in fact, instrumentally convergent. With respect to distributions over reward functions, we prove that optimal action is likely proportional to the power it supplies the agent. That seeking power is instrumentally convergent highlights a significant theoretical risk: for an agent to gain maximal power over real-world environments, it may need to disempower its supervisors.

## 2 Possibilities

Although we speculated about how power-seeking affects other agents in the environment, we leave formal multi-agent settings to future work. Let $\langle S, A, T, \gamma \rangle$ be a rewardless deterministic MDP with finite state and action spaces $S$ and $A$, deterministic transition function $T$, and discount factor $\gamma \in (0,1)$. We colloquially refer to agents as farsighted if $\gamma$ is close to 1.

The first key insight is to consider not policies, but the trajectories induced by policies from a given state; to not look at the state itself, but the paths through time available from the state. We concern ourselves with the possibilities available at each juncture of the MDP.

To this end, for each discount rate $\gamma \in (0,1)$, each policy $\pi$ maps to a function mapping each state $s$ to a discounted state visitation frequency vector $f^{\pi}_{s}$, which we call a possibility. The meaning of each frequency vector is: starting in state $s$ and following policy $\pi$, what sequence of states do we visit in the future? (Traditionally, possibilities have gone by many names, including “occupancy measures”, “state visit distributions” (Sutton and Barto (1998)), and “on-policy distributions”. We introduce new terminology to better focus on the natural interpretation of the vector as a path through time.) States visited later in the sequence are discounted according to $\gamma$: the sequence $s_0 s_1 s_2 s_2 \ldots$ would induce visitation frequency $1$ on $s_0$, visitation frequency $\gamma$ on $s_1$, and visitation frequency $\frac{\gamma^2}{1-\gamma}$ on $s_2$. The possibilities available at each state are defined $\mathcal{F}(s) := \{ f^{\pi}_{s} \mid \pi \in \Pi \}$, where $\Pi$ denotes the set of policies.

Observe that each possibility $f$ has $\lVert f \rVert_1 = \frac{1}{1-\gamma}$. Furthermore, for any reward function $R$ over the state space and for any state $s$, the optimal value function at discount rate $\gamma$ is defined $V^*_R(s,\gamma) := \max_{\pi} V^{\pi}_R(s,\gamma) = \max_{f \in \mathcal{F}(s)} f^{\top} r$ (where $r$ is $R$ expressed as a column vector). Historically, this latter “dual” formulation has been the primary context in which possibilities have been considered. When considering the directed graph induced by the rewardless MDP (also called a model), we collapse multiple actions with the same consequence to a single outbound arrow.
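The dual formulation above can be made concrete with a minimal sketch. It assumes a toy three-state deterministic MDP and the helper names `possibility`, `possibilities`, and `optimal_value`, none of which come from the paper. Each possibility is computed as the discounted sum of visited-state indicator vectors, and the optimal value is the maximum inner product with the reward vector.

```python
import itertools
import numpy as np

def possibility(next_state, policy, start, gamma):
    """Discounted state visitation frequencies of a deterministic policy from `start`."""
    n = len(next_state)
    # P[s, s2] = 1 iff the policy moves s -> s2 (deterministic transitions).
    P = np.zeros((n, n))
    for s in range(n):
        P[s, next_state[s][policy[s]]] = 1.0
    e_start = np.zeros(n)
    e_start[start] = 1.0
    # f = sum_t gamma^t e_{s_t}, i.e. the solution of (I - gamma P^T) f = e_start.
    return np.linalg.solve(np.eye(n) - gamma * P.T, e_start)

def possibilities(next_state, start, gamma):
    """All distinct possibilities at `start`, enumerating every deterministic policy."""
    policies = itertools.product(*[range(len(acts)) for acts in next_state])
    unique = {tuple(np.round(possibility(next_state, p, start, gamma), 12))
              for p in policies}
    return [np.array(f) for f in unique]

def optimal_value(next_state, start, gamma, reward):
    """Optimal value via the dual form: max over possibilities of f . r."""
    return max(float(f @ reward) for f in possibilities(next_state, start, gamma))

# Hypothetical toy model: state 0 can move to 1 or 2; states 1 and 2 only self-loop.
toy = [[1, 2], [1], [2]]
```

Brute-force enumeration of policies is exponential in the number of states, so this is only meant for toy models of the kind the paper draws.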

### 2.1 Foundational results

Omitted proofs and additional results (corresponding to skips in theorem numbering) can be found in appendix A. We often omit statements such as “let $s$ be a state” when they are obvious from context.

###### Lemma (Paths and cycles).

Let $s$ be a state. Consider the infinite state visitation sequence induced by following $\pi$ from $s$. This sequence consists of an initial directed path of length $\ell$ in which no state appears twice, and a directed cycle of order $k$.

###### Proof outline.

Apply the pigeonhole principle to the fact that $S$ is finite and $T$ is deterministic. ∎
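The pigeonhole argument can be illustrated with a short sketch. The function name `path_and_cycle` and the dictionary encoding of a policy's successor map are assumptions made for this example. It follows the policy until a state repeats, then reads off how many states precede the cycle and how long the cycle is.

```python
def path_and_cycle(step, start):
    """Follow a deterministic successor map from `start`.

    Returns (path_length, cycle_order): the number of states visited before
    the cycle is entered, and the number of states in the cycle.
    """
    first_seen = {}
    state, t = start, 0
    # With finitely many states, some state must repeat (pigeonhole principle).
    while state not in first_seen:
        first_seen[state] = t
        state = step[state]
        t += 1
    path_length = first_seen[state]
    cycle_order = t - first_seen[state]
    return path_length, cycle_order

# Hypothetical policy-induced successor map: 0 -> 1 -> 2 -> 3 -> 2 -> ...
example_step = {0: 1, 1: 2, 2: 3, 3: 2}
```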

###### Lemma.

$V^*_R(s,\gamma)$ is piecewise linear with respect to $R$; in particular, it is continuous.

###### Proof.

$V^*_R(s,\gamma)$ takes the maximum over a set of fixed $|S|$-dimensional linear functionals. Therefore, the maximum is piecewise linear. ∎

### 2.2 Non-dominated possibilities

Some possibilities are “redundant” – no goal’s optimal value is affected by their availability. If you assign some scalar values to chocolate and to bananas, it’s never strictly optimal to take half of each.

###### Definition 1.

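$f \in \mathcal{F}(s)$ is dominated if, for every reward vector $r \in \mathbb{R}^{|S|}$, $\max_{f' \in \mathcal{F}(s)} f'^{\top} r = \max_{f' \in \mathcal{F}(s) \setminus \{f\}} f'^{\top} r$. The set of non-dominated possibilities at state $s$ is notated $\mathcal{F}_{nd}(s)$.

###### Definition 2.

The non-dominated subgraph at $s$ consists of those states visited and actions taken by some non-dominated possibility $f \in \mathcal{F}_{nd}(s)$.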

## 3 Power

Recall that we consider an agent’s power to be its ability to achieve goals in general.

###### Definition 3.

Let $D$ be any absolutely continuous distribution bounded over $[0,1]$ (positive affine transformation allows extending our results to $D$ with different bounds), and define $\mathcal{R}$ to be the corresponding distribution over reward functions with CDF $F$ (note that $D$ is distributed identically across states). The average optimal value at state $s$ is

$$V^*_{\text{avg}}(s,\gamma) := \int_{\mathcal{R}} V^*_R(s,\gamma) \, \mathrm{d}F(R). \tag{1}$$

However, $V^*_{\text{avg}}(s,\gamma)$ diverges as $\gamma \to 1$ and includes an initial term of $\mathbb{E}[D]$ (as the agent has no control over its current presence at $s$).

###### Definition 4.
$$\text{Power}(s,\gamma) := \frac{1-\gamma}{\gamma}\left(V^*_{\text{avg}}(s,\gamma) - \mathbb{E}[D]\right). \tag{2}$$

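This quantifies the agent’s control at future time-steps. Observe that for any two states $s, s'$, $V^*_{\text{avg}}(s,\gamma) \geq V^*_{\text{avg}}(s',\gamma)$ iff $\text{Power}(s,\gamma) \geq \text{Power}(s',\gamma)$.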
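Equation (2) can be approximated numerically. The sketch below is a Monte Carlo estimate under stated assumptions: rewards are drawn i.i.d. from the uniform distribution on $[0,1]$ (so $\mathbb{E}[D] = 1/2$), reward is received on the current state, and optimal values come from plain value iteration. The names `optimal_values`, `estimate_power`, and the toy model `children` are hypothetical and not from the paper.

```python
import numpy as np

def optimal_values(children, reward, gamma, tol=1e-8):
    """Value iteration for a deterministic MDP given as a list of child lists."""
    n = len(children)
    v = np.zeros(n)
    while True:
        new_v = np.array([reward[s] + gamma * max(v[c] for c in children[s])
                          for s in range(n)])
        if np.max(np.abs(new_v - v)) < tol:
            return new_v
        v = new_v

def estimate_power(children, gamma, num_samples=2000, seed=0):
    """Monte Carlo estimate of Power(s, gamma) for every state s."""
    rng = np.random.default_rng(seed)
    n = len(children)
    total = np.zeros(n)
    for _ in range(num_samples):
        # Assumption: each state's reward is an independent uniform[0, 1] draw.
        total += optimal_values(children, rng.uniform(size=n), gamma)
    v_avg = total / num_samples
    expected_reward = 0.5  # E[D] for the uniform distribution on [0, 1]
    return (1 - gamma) / gamma * (v_avg - expected_reward)

# Hypothetical toy model: state 0 chooses between two self-looping states;
# state 3 is an isolated self-loop with no choices at all.
children = [[1, 2], [1], [2], [3]]
power = estimate_power(children, gamma=0.8)
```

In this toy model the estimate for state 0 should land near $2/3$ (the expected maximum of two uniform draws), while the choiceless state 3 should land near $\mathbb{E}[D] = 1/2$.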

###### Lemma (Minimal power).

Let $s$ be a state. $\text{Power}(s,\gamma) = \mathbb{E}[D]$ iff $|\mathcal{F}(s)| = 1$.

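###### Lemma (Maximal power).

Let $s$ be a state such that all states are one-step reachable from $s$, each of which has a loop. Then $\text{Power}(s,\gamma) = \mathbb{E}[\text{max of } |S| \text{ draws from } D]$. In particular, for any MDP with $|S|$ states, this is maximal.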

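###### Proposition.

$\mathbb{E}[D] \leq \text{Power}(s,\gamma) \leq \mathbb{E}[\text{max of } |S| \text{ draws from } D]$.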

If one must wait, one has less control over the future; for example, the MDP in fig. (a)a has a one-step waiting period. The following theorem nicely encapsulates this as a convex combination of the minimal present control and anticipated future control.

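###### Proposition (Delay decreases power).

Let $s_0, \ldots, s_n$ be such that for $i < n$, each $s_i$ has $s_{i+1}$ as its sole child. Then $\text{Power}(s_0,\gamma) = (1-\gamma^n)\,\mathbb{E}[D] + \gamma^n\,\text{Power}(s_n,\gamma)$.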

To further demonstrate the suitability of this notion of power, we consider one final property. Two vertices $s$ and $s'$ are said to be similar if there exists a graph automorphism $\phi$ such that $\phi(s) = s'$. If all vertices are similar, the graph is said to be vertex transitive. Vertex transitive graphs are highly symmetric; therefore, the power should be equal everywhere.

###### Proposition.

If $s$ and $s'$ are similar, $\text{Power}(s,\gamma) = \text{Power}(s',\gamma)$.

###### Corollary.

If the model is vertex transitive, all states have equal Power.

###### Corollary.

If $s$ and $s'$ have the same children, $\text{Power}(s,\gamma) = \text{Power}(s',\gamma)$.

### 3.1 Time-uniformity

To bolster the reader’s intuitions, we consider a special type of MDP where the power of each state can be immediately determined.

###### Definition 5.

A state $s$ is time-uniform when, for all $k$, either all states reachable in $k$ steps have the same children or all such states can only reach themselves.

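###### Theorem (Time-uniform power).

If the state $s$ is time-uniform, then either all possibilities simultaneously enter 1-cycles after $k$ time steps and

$$\text{Power}(s,\gamma) = (1-\gamma)\sum_{i=0}^{k-2} \gamma^{i}\, \mathbb{E}\big[\text{max of } |T(s_i)| \text{ draws from } D\big] + \gamma^{k-1}\, \mathbb{E}\big[\text{max of } |T(s_{k-1})| \text{ draws from } D\big],$$

or no possibility ever enters a 1-cycle and

$$\text{Power}(s,\gamma) = (1-\gamma)\sum_{i=0}^{\infty} \gamma^{i}\, \mathbb{E}\big[\text{max of } |T(s_i)| \text{ draws from } D\big].$$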
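As a rough illustration of the first case of the theorem, the sketch below evaluates the finite sum when $D$ is assumed to be uniform on $[0,1]$, for which the expected maximum of $n$ draws is $n/(n+1)$. The function names and the `branching` list, which holds $|T(s_0)|, \ldots, |T(s_{k-1})|$ read here as the number of children at each depth, are assumptions of this example.

```python
def expected_max_uniform(n):
    """E[max of n i.i.d. uniform[0, 1] draws] = n / (n + 1)."""
    return n / (n + 1)

def time_uniform_power(branching, gamma):
    """Power of a time-uniform state whose possibilities enter 1-cycles.

    `branching[i]` is the number of children of the states reachable in i
    steps; the last entry is the branching just before the 1-cycles begin.
    """
    *transient, last = branching
    # (1 - gamma) * sum_{i=0}^{k-2} gamma^i * E[max of |T(s_i)| draws]
    head = (1 - gamma) * sum(gamma ** i * expected_max_uniform(b)
                             for i, b in enumerate(transient))
    # gamma^{k-1} * E[max of |T(s_{k-1})| draws]
    return head + gamma ** len(transient) * expected_max_uniform(last)

immediate_choice = time_uniform_power([2], gamma=0.9)    # choose between 2 loops now
delayed_choice = time_uniform_power([1, 2], gamma=0.9)   # one forced step, then choose
```

The delayed variant evaluates to a convex combination of $\mathbb{E}[D]$ and the power of the undelayed state, matching the proposition on delay.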

## 4 Optimal Policy Shifts

Time-uniformity brings us to another interesting property: some MDPs have no reward functions whose optimal policy set changes with $\gamma$. In other words, for any reward function and for all $\gamma \in (0,1)$, the greedy policy is optimal.

###### Definition 6.

For a reward function $R$ and $\gamma \in (0,1)$, we refer to a change in the set of $R$-optimal policies as an optimal policy shift at $\gamma$. We also say that two possibilities $f$ and $f'$ switch off at $\gamma$.

In which environments can an agent change its mind as it becomes more farsighted? When can optimal policy shifts occur? The answer: when the agent can be made to choose between lesser immediate reward and greater delayed reward. In other words, when gratification can be delayed.
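A minimal sketch of delayed gratification causing an optimal policy shift, using a hypothetical model that is not taken from the paper: from the start state the agent either moves to a self-looping state `a` with moderate reward, or passes through a low-reward state `b` to reach a self-looping state `c` with high reward. The reward values are illustrative assumptions.

```python
def best_first_move(gamma, r_a=0.6, r_b=0.0, r_c=1.0):
    """Return which child ('a' or 'b') an optimal policy moves to first.

    The start state's own reward is identical on both branches, so it is
    omitted from the comparison.
    """
    # Branch a: receive r_a forever, starting one step from now.
    stay_value = gamma * r_a / (1 - gamma)
    # Branch b: receive r_b once, then r_c forever.
    delay_value = gamma * r_b + gamma ** 2 * r_c / (1 - gamma)
    return "a" if stay_value >= delay_value else "b"

choices = {gamma: best_first_move(gamma) for gamma in (0.3, 0.5, 0.7, 0.9)}
```

With these illustrative rewards the comparison reduces to $r_a$ versus $\gamma r_c$, so the optimal first move shifts from `a` to `b` as $\gamma$ crosses $0.6$.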

thmoptShift There can exist an optimal policy whose action changes at iff .

###### Definition 7 (Blackwell optimal policies (Blackwell (1962))).

For reward function $R$, an optimal policy set is said to be Blackwell $R$-optimal if, for some $\gamma^* \in (0,1)$, no further optimal policy shifts occur for $\gamma \in (\gamma^*, 1)$.

Intuitively, a Blackwell optimal policy set means the agent has “settled down” and will no longer change its mind as it becomes more farsighted (that is, as $\gamma$ increases towards 1).

Blackwell (1962) exploits linear-algebraic properties of the Bellman equations to conclude the existence of a Blackwell-optimal policy. We strengthen this result with an explicit upper bound.

lemswitchOff For any reward function and , and switch off at most times.

###### Theorem (Existence of a Blackwell optimal policy (Blackwell (1962))).

For any reward function $R$, a finite number of optimal policy shifts occur.

As demonstrated in fig. 7, there is often no discount rate beyond which every reward function is done shifting. However, we can prove that most of the reward function distribution $\mathcal{R}$ has switched to its Blackwell optimal policy set.

###### Definition 8.

Let , and let denote the subset of for which is optimal. The optimality measure of , notated , is the measure of under . (To avoid notational clutter, we keep implicit the state-dependence of , and other quantities involving one or more possibilities. That is, we do not write .)

propravgConverge The following limits exist: and .

## 5 Instrumental Convergence

The intuitive notion of instrumental convergence is that with respect to , optimal policies are more likely to take one action than another (e.g. remaining activated versus being shut off). However, the state with maximal Power isn’t always instrumentally convergent from other states; see fig. 8. Our treatment of instrumental convergence therefore requires some care.

### 5.1 Characterization

###### Definition 9.

Define to be the contribution of to . For , . Similarly, .

We’d like to quantify when optimal policies tend to take certain actions more often than others. For example, if gaining money is “instrumentally convergent”, then concretely, this means that actions which gain money are more likely to be optimal than actions which do not gain money.

###### Definition 10.

We say that instrumental convergence exists downstream of state when, for some , state trajectory prefix , and such that there exist whose possibilities respectively induce and , we have .

Loosely speaking, the joint entropy of the distribution of (deterministic) optimal policies under is inversely related to the degree to which instrumental convergence is present.

###### Theorem (The character of instrumental convergence).

Instrumental convergence exists downstream of a state iff a possibility of that state has measure variable in $\gamma$.

Consider that when $\gamma$ is sufficiently close to 0, most agents act greedily; definition 10 hints that instrumental convergence relates to power-seeking behavior becoming more likely as $\gamma \to 1$.

###### Corollary.

If no optimal policy shifts can occur, then instrumental convergence does not exist.

### 5.2 Possibility similarity

###### Definition 11.

Let induce state trajectories and respectively. We say that and are similar if there exists a graph automorphism on the non-dominated subgraph at such that .

Observe that the existence of such a for the full model is sufficient for similarity.

proppossSim If and are similar, then and .

###### Corollary.

If all non-dominated possibilities of a state are similar, then no instrumental convergence exists downstream of the state.

Vertex transitivity does not necessarily imply possibility similarity (e.g. instrumental convergence exists in the 3-prism graph with self-loops).

### 5.3 1-cycle MDPs

In this subsection, we consider states whose non-dominated possibilities all terminate in 1-cycles; powerful instrumental convergence results are available in this setting. Let contain all of the 1-cycles reachable from , and let . Let contain those possibilities ending in a cycle of .

thmloopMeas and .

[Reaching more 1-cycles is instrumentally convergent]coroneCyc Let . If , then .

Application of section 5.3 allows proving that it is instrumentally convergent to e.g. keep the game of Tic-Tac-Toe going as long as possible and avoid dying in Pac-Man (just consider the distribution of 1-cycles in the respective models).

### 5.4 Optimal policies tend to take control

[Power is roughly instrumentally convergent]thmpowerSeeking Let , , and . Suppose that

$$\text{Power}(F,\gamma) > K\, \frac{\mathbb{E}\big[\text{max of } |S| \text{ draws from } D\big]}{\mathbb{E}[D]}\, \text{Power}(F',\gamma).$$

Then . The statement also holds when Power and are exchanged.

###### Remark.

Section 5.4 can be extended to hold for arbitrary continuous distributions over reward functions (e.g., if some states have greater expected reward than others). The instrumental convergence then holds with respect to the Power for that distribution.

Suppose the agent starts at $s$ with a goal drawn from the uniform distribution over reward functions. If one child $s'$ contributes 100 times as much Power as another child $s''$, then the agent is at least 50 times more likely to have an optimal policy navigating through $s'$ (the ratio $\mathbb{E}[\text{max of } |S| \text{ draws from } D] / \mathbb{E}[D]$ is less than 2 for the uniform distribution, so $K = 50$ suffices).
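The arithmetic behind the factor of 50 can be spelled out as follows, assuming $D$ is the uniform distribution on $[0,1]$ and that the power contribution of the weaker child is positive. The closed form for the expected maximum of uniform draws is a standard fact, not something stated in the text above.

```latex
\begin{aligned}
\mathbb{E}[D] &= \tfrac{1}{2},
\qquad
\mathbb{E}\big[\text{max of } |S| \text{ draws from } D\big] = \frac{|S|}{|S|+1} < 1, \\[4pt]
\frac{\mathbb{E}\big[\text{max of } |S| \text{ draws from } D\big]}{\mathbb{E}[D]}
  &= \frac{2|S|}{|S|+1} < 2, \\[4pt]
K = 50 \;\Longrightarrow\;
K \cdot \frac{\mathbb{E}\big[\text{max of } |S| \text{ draws from } D\big]}{\mathbb{E}[D]}
  &< 100 .
\end{aligned}
```

A hundredfold gap in power contribution therefore satisfies the hypothesis of the theorem with $K = 50$, for a model of any size.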

In the above analysis, familiarity with the mechanics of Power suggests that the terminal state corresponding to agent shutdown has minuscule power contribution. Therefore, in an MDP reflecting the consequences of deactivation, agents pursuing randomly selected goals are quite unlikely to allow themselves to be deactivated (if they have a choice in the matter).

Section 5.4 strongly informs an ongoing debate as to whether most agents act to acquire resources and avoid shutdown. As mentioned earlier, it has been argued that power-seeking behavior will not arise unless we specifically incentivize it.

Section 5.4 answers yes, optimal farsighted agents will usually acquire resources; yes, optimal farsighted agents will generally act to avoid being deactivated. If there is a set of possibilities through some part of the future offering a high degree of control over future state observations, optimal farsighted agents are likely to pursue that control. Conversely, if some set of possibilities is strongly instrumentally convergent, they offer a larger power contribution.

Suppose we are at state and can reach . The “top-down” differs from the power contribution of those possibilities running through , which is conditional on starting at (consider the power contributions presented in fig. (b)b).

## 6 Related Work

Benson-Tilsen and Soares (2016) explored how instrumental convergence arises in a particular toy model. In economics, turnpike theory studies a similar notion: certain paths of accumulation (turnpikes) are more likely to be optimal than others (see e.g. McKenzie (1976)). Soares et al. (2015) and Hadfield-Menell et al. (2016) formally consider the problem of an agent rationally resisting deactivation.

There is a surprising lack of basic theory with respect to the structural properties of possibilities. Wang et al. (2007) and Wang et al. (2008) both remark on this absence, using state visitation distributions to formulate dual versions of classic dynamic programming algorithms. Regan and Boutilier (2011) employ state visitation distributions to navigate reward uncertainty. Regan and Boutilier (2010) explore the idea of non–dominated policies – policies which are optimal for some instantiation of the reward function (which is closely related to our definition of non-dominated possibilities in section 2.2).

Multi-objective MDPs trade off the maximization of several objectives (see e.g. Roijers et al. (2013)), while we examine how MDP structure determines the ability to maximize objectives in general.

Johns and Mahadevan (2007) observed that optimal value functions are smooth with respect to the dynamics of the environment, which can be proven with our formalism. Dadashi et al. (2019) explore topological properties of value function space while holding the reward function constant. Bellemare et al. (2019) study the benefits of learning a certain subset of value functions. Foster and Dayan (2002) explore the properties of the optimal value function for a range of goals; along with Drummond (1998), Sutton et al. (2011), and Schaul et al. (2015), they note that value functions seem to encode important information about the environment. In separate work, we show that a limited subset of optimal value functions encodes the environment. Turner et al. (2019) speculate that the optimal value of a state is heavily correlated across reward functions.

### 6.1 Existing contenders for measuring power

We highlight the shortcomings of existing notions quantifying the agent’s control over the future, starting from a given state.

State reachability (discounted or otherwise) fails to quantify how often states can be visited (see fig. 12). Characterization by the sizes of the final communicating classes ignores both transient state information and the local dynamics in those final classes. Graph diameter ignores local information, as do the minimal and maximal degrees.

There are many graph centrality measures, none of which are appropriate. For brevity, we only consider two such alternatives. The degree centrality of a state ignores non-local dynamics – the agent’s control in the non-immediate future. Closeness centrality has the same problem as discounted reachability: it only accounts for distance in the MDP’s model, not for control over the future.

Salge et al. (2014) define information-theoretic empowerment as the maximum possible mutual information between the agent’s actions and the state observations steps in the future, notated . This notion requires an arbitrary choice of horizon, failing to account for the agent’s discount factor . As demonstrated in fig. 13, this leads to arbitrary evaluations of control.

One idea would be to take , however, this fails to converge for even simple MDPs (see fig. (a)a). Alternatively, one might consider the discounted empowerment series , or even taking the global maximum over this series of channel capacities (instead of adding the channel capacities for each individual horizon). Neither fix suffices.

Compounding these issues is the fact that “in a discrete deterministic world empowerment reduces to the logarithm of the number of sensor states reachable with the available actions” (Salge et al. (2014)). We have already observed that reachability metrics are unsatisfactory.
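The quoted reduction can be sketched directly. The example assumes a base-2 logarithm (bits) and that every state is its own sensor state; the function name and the toy model are hypothetical. In this model the number of states reachable in exactly $n$ steps alternates between 2 and 1, so the $n$-step value oscillates with the horizon rather than settling.

```python
import math

def n_step_empowerment(children, start, horizon):
    """log2 of the number of states reachable in exactly `horizon` steps."""
    frontier = {start}
    for _ in range(horizon):
        # Advance every currently reachable state by one deterministic step.
        frontier = {child for state in frontier for child in children[state]}
    return math.log2(len(frontier))

# Hypothetical model: state 0 branches to 1 or 2, both of which return to 0.
oscillating = {0: [1, 2], 1: [0], 2: [0]}
series = [n_step_empowerment(oscillating, 0, n) for n in range(1, 7)]
```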

## 7 Discussion

We have only touched on a portion of the structural insights made possible by possibilities; for example, there are intriguing MDP representability results left unstated.

Although we only treated deterministic finite MDPs, it seems reasonable to expect the key conclusions to apply to broader classes of environments. We treat the case where the reward distribution is distributed identically across states; if we did not assume this, we could not prove much of interest, as sufficiently tailored distributions could make any part of the MDP “instrumentally convergent”. However, Power is compatible with arbitrary reward function distributions.

### 7.1 Open questions

We know that is continuous on (creftype 24), does not equal 0 at any (creftype 21) iff is non-dominated, and that it converges as (definition 8); similar statements hold for . However, for all continuous , do the optimality measures of possibilities and the powers of states eventually reach ordinal equilibrium for sufficiently close to 1? There are further interesting results which would immediately follow.

###### Conjecture.

either for all , or for at most finitely many such .

###### Proof outline.

. Consider the inequalities of the form such that is strictly optimal (for continuous , only a zero measure subset of requires the inequality to not be strict). Consider the measure of the subset of such that the inequality holds. Suppose this measure is a rational function of .666Note that each is a homogeneous degree-one polynomial on with coefficients rational in . The measure of this subset may not be a rational function under all bounded continuous distributions, but it should at least be rational under the uniform distribution. The integral can then be re-expressed as the summation of these measures. Then is a rational function on .

Then if , there are at most finitely many roots by the fundamental theorem of algebra. ∎

### 7.2 Formalizations

The formalization of power seems reasonable, consistent with intuitions for all toy MDPs examined. The formalization of instrumental convergence also seems correct. Practically, if we want to determine whether an agent might gain power in the real world, one might be wary of concluding that we can simply “imagine” a relevant MDP and then estimate e.g. the “power contributions” of certain courses of action. However, any formal calculations of Power are obviously infeasible for nontrivial environments.

To make predictions using these results, we must combine the intuitive correctness of the power and instrumental convergence formalisms with empirical evidence (from toy models), with intuition (from working with the formal object), and with theorems (like section 5.3, which reaffirms the common-sense prediction that more cycles means asymptotic instrumental convergence, or definition 5, fully determining the power in time-uniform environments). We can reason, “for avoiding shutdown to not be heavily convergent, the model would have to look like such-and-such, but it almost certainly does not…”.

### 7.3 Power-seeking

The theory supplies significant formal understanding of power-seeking incentives. The results strongly support the philosophical arguments of Omohundro (2008) and the conclusions Benson-Tilsen and Soares (2016) drew from their toy model: one should reasonably expect instrumental convergence to arise in the real world. Furthermore, we can appreciate that this convergence arises from how goal-directed behavior interacts with the structure of the environment.

Beyond exploring this structure, the theory reveals facts of (eventual) practical relevance. For example, calculations in toy MDPs indicate that when $D$ is left-skew (i.e. reward is generally harder to come by), the agent begins seeking power at smaller $\gamma$ (fig. 3). There is not always instrumental convergence towards the state with greatest Power (fig. 8); if one were to be “airdropped” into the MDP with a reward function drawn from the reward function distribution $\mathcal{R}$, one should choose the state with greatest Power in order to maximize return in $\mathcal{R}$-expectation. However, given that one starts from a fixed state, optimal policies may lead more directly towards their destinations.

The overall concern raised by section 5.4 is not that we will build powerful RL agents with randomly selected goals. The concern is that random reward function inputs produce adversarial power-seeking behavior, which can produce perverse incentives such as avoiding deactivation and appropriating resources. Therefore, we should have specific reason to believe that providing the reward function we had in mind will not end in catastrophe.

## 8 Conclusion

Much research is devoted (directly or indirectly) towards the dream of AI: creating highly intelligent agents operating in the real world. In the real world, optimal pursuit of random goals doesn’t just lead to strange behavior – it leads to bad behavior: maximizing a reasonable notion of power over the environment entails resisting shutdown and potentially appropriating resources. Theoretically, section 5.4 implies that the farsighted optimal policies of most reinforcement learning agents acting in the real world are malign.

What if we succeed at creating these agents?

## Acknowledgements

This work was supported by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund. Logan Smith lent significant help by providing a codebase for exploring the power of different states in MDPs. I thank Max Sharnoff for contributions to definition 7. Daniel Blank, Ryan Carey, Ofer Givoli, Evan Hubinger, Joel Lehman, Vanessa Kosoy, Victoria Krakovna, Rohin Shah, Prasad Tadepalli, and Davide Zagami provided valuable feedback.

## References

• Bellemare et al. [2019] Marc G. Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. A Geometric Perspective on Optimal Representations for Reinforcement Learning. arXiv:1901.11530 [cs, stat], January 2019. arXiv: 1901.11530.
• Benson-Tilsen and Soares [2016] Tsvi Benson-Tilsen and Nate Soares. Formalizing Convergent Instrumental Goals. Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
• Blackwell [1962] David Blackwell. Discrete Dynamic Programming. The Annals of Mathematical Statistics, page 9, 1962.
• Bostrom [2014] Nick Bostrom. Superintelligence. Oxford University Press, 2014.
• Cohen et al. [2019] Michael K. Cohen, Badri Vellambi, and Marcus Hutter. Asymptotically Unambitious Artificial General Intelligence. arXiv:1905.12186 [cs], May 2019. arXiv: 1905.12186.
• Dadashi et al. [2019] Robert Dadashi, Marc G Bellemare, Adrien Ali Taiga, Nicolas Le Roux, and Dale Schuurmans. The Value Function Polytope in Reinforcement Learning. In International Conference on Machine Learning, pages 1486–1495, 2019.
• Drummond [1998] Chris Drummond. Composing functions to speed up reinforcement learning in a changing world. In Machine Learning: ECML-98, volume 1398, pages 370–381. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.
• Foster and Dayan [2002] David Foster and Peter Dayan. Structure in the space of value functions. Machine Learning, 49(2-3):325–346, 2002.
• Hadfield-Menell et al. [2016] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The Off-Switch Game. arXiv:1611.08219 [cs], November 2016. arXiv: 1611.08219.
• Johns and Mahadevan [2007] Jeff Johns and Sridhar Mahadevan. Constructing basis functions from directed graphs for value function approximation. In International Conference on Machine Learning, pages 385–392. ACM Press, 2007.
• McKenzie [1976] Lionel W McKenzie. Turnpike theory. Econometrica: Journal of the Econometric Society, pages 841–865, 1976.
• Omohundro [2008] Stephen Omohundro. The Basic AI Drives. 2008.
• Pinker [2015] Steven Pinker. Thinking does not imply subjugating. What to Think about Machines that Think; Brockman, J., Ed, pages 5–8, 2015.
• Puterman [2014] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
• Regan and Boutilier [2010] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
• Regan and Boutilier [2011] Kevin Regan and Craig Boutilier. Robust online optimization of reward-uncertain MDPs. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
• Roijers et al. [2013] Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.
• Russell and Norvig [2009] Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Pearson Education Limited, 2009.
• Russell [2019] Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019.
• Salge et al. [2014] Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment–an introduction. In Guided Self-Organization: Inception, pages 67–114. Springer, 2014.
• Schaul et al. [2015] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.
• Soares et al. [2015] Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. AAAI Workshops, 2015.
• Sutton and Barto [1998] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. Adaptive computation and machine learning. MIT Press, 1998.
• Sutton et al. [2011] Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, pages 761–768, 2011.
• Turner et al. [2019] Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli. Conservative Agency via Attainable Utility Preservation. February 2019. arXiv: 1902.09725.
• Wang et al. [2007] Tao Wang, Michael Bowling, and Dale Schuurmans. Dual Representations for Dynamic Programming and Reinforcement Learning. In International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 44–51. IEEE, 2007.
• Wang et al. [2008] Tao Wang, Michael Bowling, Dale Schuurmans, and Daniel J Lizotte. Stable dual dynamic programming. In Advances in neural information processing systems, pages 1569–1576, 2008.

## Appendix A Proofs

###### Proof outline.

Apply the pigeonhole principle to the fact that is finite and is deterministic. ∎

###### Lemma 1.

Each state has at least one possibility unique to it.

###### Proof.

For each , contains a possibility which visits state strictly more often than do any other possibilities at different states. That is, for any -visiting policy enacted from another which has distance from , places times more measure on than does . ∎

Define the restriction . is the unit column vector corresponding to state .

###### Lemma 2 (Prepend).

is reachable in 1 step from via action iff .

###### Proof.

Forward direction: let be a policy such that . Then starting in state , state is reached and then the state visitation frequency vector is produced. Repeat for all such .

Backward direction: Lemma 1 shows that contains at least one possibility unique to , which is available even under restriction to because any policy maximizing -visitation would navigate to immediately from . Then this possibility can only be provided by being reachable in one step from . ∎

###### Lemma 3.

Suppose traverses a -cycle with states . Define . Then , and for any , .

###### Proof.

Since the form a -cycle, we have . Since the rewardless MDP is deterministic and by the definition of a -cycle, acts as the identity on all . ∎

###### Lemma 4 (Convergence to gain optimality (Puterman [2014])).

Let be a reward function, be a state, and contain the states of a cycle with maximal average -reward that is reachable from . Then

$$\lim_{\gamma\to 1}(1-\gamma)\,V^*_{R,\gamma}(s)=\frac{\sum_{s'\in S_{\text{cyc}}}R(s')}{\left|S_{\text{cyc}}\right|}. \tag{3}$$

###### Proof.

takes the maximum over a set of fixed -dimensional linear functionals. Therefore, the maximum is piecewise linear. ∎
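The limit in Lemma 4 can be checked numerically on a toy example. The following is a minimal sketch, not the paper's code: the 4-state deterministic MDP, its reward values, and the helper names are all hypothetical, and reward is assumed to be collected in the current state before moving.

```python
def optimal_values(successors, reward, gamma, iters=20000):
    """Value iteration for a deterministic MDP (reward collected in the current state)."""
    values = [0.0] * len(successors)
    for _ in range(iters):
        values = [reward[s] + gamma * max(values[t] for t in successors[s])
                  for s in range(len(successors))]
    return values

# Hypothetical MDP: state 0 can enter the 2-cycle {1, 2} or the self-loop at state 3.
successors = {0: [1, 3], 1: [2], 2: [1], 3: [3]}
reward = [0.0, 1.0, 0.5, 0.6]

# Average reward of each reachable cycle, computed by hand for this example.
best_cycle_average = max((reward[1] + reward[2]) / 2, reward[3])

def normalized_value(gamma):
    """(1 - gamma) * V*(s) at state 0, which eq. 3 says tends to the best cycle average."""
    return (1 - gamma) * optimal_values(successors, reward, gamma)[0]

for g in (0.9, 0.99, 0.999):
    print(g, normalized_value(g), best_cycle_average)
```

As the discount rate approaches 1, the printed normalized value should approach the average reward of the 2-cycle, which is the better of the two cycles here.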

### A.1 Non-dominated possibilities

###### Proposition 5 (Domination criterion).

is dominated iff the inequality has no solution for .

###### Lemma 6.

If is non-dominated, the subset of for which is strictly optimal has positive measure and is convex.

###### Proof.

The set has positive measure because is continuous on by section 2.1. The set is convex because it is the intersection of open half-spaces () restricted to the -dimensional unit hypercube. ∎

###### Lemma 7 (Strict visitation optimality sufficient for non-domination).

If assigns more visitation frequency to some state than does any other , then .

###### Proof.

Let be the state indicator reward function for . ∎

###### Corollary 8.

Suppose which place strictly greater measure on some corresponding states than do other possibilities. Then and . In particular, when , .

###### Proof.

Apply Lemma 7. For the second claim, observe that trivially when , and also holds for since of two distinct possibilities, each must have strict visitation optimality for at least one state. ∎

### A.2 Variational divergence

###### Lemma 9.

Suppose travels a path from state for steps, ending in . .

###### Proof.

Notice that each loses measure, and that all such states are distinct by the definition of a path. Since all possibilities have equal norm, the total measure lost equals the total measure gained by other states (and therefore the total variational divergence; see fig. 14). Then . ∎

Intuition suggests that a possibility is most different from itself halfway along a cycle. This is correct.

###### Lemma 10.

Suppose travels a -cycle () from state . .

###### Proof.
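\begin{align}
d_{\mathrm{TV}}\left(f^{\pi}_{s_1} \,\middle\|\, f^{\pi}_{s_j}\right) &= \frac{1}{1-\gamma^k}\sum_{i=0}^{j-1}\left(\gamma^i-\gamma^{k-i-1}\right) \tag{4}\\
&= \frac{1-\gamma^j}{1-\gamma}\cdot\frac{1-\gamma^{k-j}}{1-\gamma^k} \tag{5}\\
&= \frac{1-\gamma^j+\gamma^k-\gamma^{k-j}}{(1-\gamma)(1-\gamma^k)}. \tag{6}
\end{align}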

Equation 4 can be verified by inspection. Setting the derivative with respect to $j$ to 0, we solve

\begin{align}
0 &= -\gamma^j+\gamma^{k-j} \tag{7}\\
j &= \frac{k}{2}. \tag{8}
\end{align}

This is justified because the function is strictly concave on by the second-order test and the fact that . If $k$ is even, we are done. If $k$ is odd, then we need an integer solution. Notice that plugging $j=\lfloor k/2\rfloor$ and $j=\lceil k/2\rceil$ into eq. 5 yields the same maximal result.

Therefore, in the odd case, both inequalities in the theorem statement are strict. In the even case, the first inequality is an equality. ∎
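The closed form in eq. 5 and the halfway-point claim can be checked numerically. This is a minimal sketch under stated assumptions: the divergence is taken to be half the L1 distance between the two visitation vectors (the total measure lost), `j` counts the steps travelled along the cycle, and all function names are hypothetical.

```python
def cycle_visitation(k, start, gamma):
    """Discounted visitation frequencies on a k-cycle when starting at index `start`."""
    norm = 1 - gamma ** k
    return [gamma ** ((i - start) % k) / norm for i in range(k)]

def tv_distance(p, q):
    """Half the L1 distance: total measure lost, which equals total measure gained."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def closed_form(k, j, gamma):
    """Eq. 5: divergence between the possibility and itself j steps along the cycle."""
    return (1 - gamma ** j) * (1 - gamma ** (k - j)) / ((1 - gamma) * (1 - gamma ** k))

def most_divergent_step(k, gamma):
    """Number of steps along the k-cycle that maximizes the self-divergence."""
    return max(range(1, k), key=lambda j: closed_form(k, j, gamma))

k, gamma = 8, 0.9
base = cycle_visitation(k, 0, gamma)
for j in range(1, k):
    print(j, tv_distance(base, cycle_visitation(k, j, gamma)), closed_form(k, j, gamma))
print("maximizing j:", most_divergent_step(k, gamma))
```

For even `k` the maximizer is `k / 2`; for odd `k` the two integers adjacent to `k / 2` give the same value, because eq. 5 is symmetric under swapping `j` and `k - j`.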

For any , if , .

###### Proof.

The shortest path self-divergence is when for Lemma 9, in which case . The shortest cycle self-divergence is when for Lemma 10, in which case . ∎

### A.3 Optimality measure

For this section only, let be any continuous (not necessarily bounded) distribution over reward functions.

If , then .

###### Proof.

Let . Choose any for . Let (where the are members of an index set of the state space) correspond to the positive entries of , and () to the negative. Clearly, by result 11 (in particular, their sums are positive).

Clearly, iff . This carves out a lower-dimensional subset of ; since is continuous, this subset has zero measure. ∎

Note that continuity is required; discontinuous distributions admit non-zero probability of drawing a flat reward function, for which optimal value is the same everywhere.

No is suboptimal for all reward functions: every possibility is optimal for a constant reward function. However, for any given , almost every reward function has a unique optimal possibility at each state.

###### Theorem 13 (Optimal possibilities are almost always unique).

Let be any state. For any , has measure zero.

###### Proof.

Let and let be a state at which there is more than one optimal possibility. There exists a state reachable from with both one-step reachable from such that (if not, then one or the other would be strictly preferable and only one optimal possibility would exist). Apply result 12. ∎

###### Corollary 14 (Dominated possibilities almost never optimal).

Let be a dominated possibility at state , and let . The set of reward functions for which is optimal at discount rate has measure zero.

Lemma 6 states that each element of is strictly optimal on a convex, positive-measure subset of . Theorem 13 further shows that these positive-measure subsets cumulatively have measure 1 under continuous distributions . In particular, if a dominated possibility is optimal, it must be optimal on the boundary of one of these convex subsets (otherwise it would be strictly dominated).

###### Lemma 15 (Average reward of different state subsets almost never equal).

Let s.t. . Then .

###### Proof.

There are uncountably many unsatisfactory variants of every reward function which does satisfy the equality; since is continuous, the set of satisfactory reward functions must have measure zero. ∎

### A.4 Power

###### Lemma 16.

$$V^*_{\text{avg}}(s,\gamma)=\sum_{f\in\mathcal{F}_{\text{nd}}(s)}\int_{\text{opt}(f,\gamma)} f^\top r\,\mathrm{d}F(r).$$

###### Proof.

By the definition of domination, restriction to non-dominated possibilities leaves all attainable utilities unchanged; is continuous, so a zero measure subset of has multiple optimal possibilities (Theorem 13). ∎

The optimality set can be calculated by solving the relevant system of inequalities. (Mathematica code to calculate these inequalities can be found at https://github.com/loganriggs/gold.) For example, consider again the MDP of fig. 16. We would like to calculate . The two possibilities are and . To determine , solve

 ftop⊤r >fbottom⊤r r+γr1−γ >r+γr1−γ r >r.

Intersecting this region with , we have .
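When the system of inequalities is awkward to solve by hand, the measure of an optimality set can also be estimated by sampling. The sketch below is illustrative only: it assumes rewards are drawn independently and uniformly from the unit hypercube, and the two visitation vectors describe a hypothetical 3-state MDP in which the agent leaves state 0 and then stays forever in state 1 (`f_top`) or state 2 (`f_bottom`).

```python
import random

def optimality_measure(possibilities, samples=100_000, seed=0):
    """Monte Carlo estimate of the share of reward functions for which each possibility is optimal."""
    rng = random.Random(seed)
    n_states = len(possibilities[0])
    wins = [0] * len(possibilities)
    for _ in range(samples):
        r = [rng.random() for _ in range(n_states)]  # assumed: i.i.d. Uniform(0, 1) rewards
        returns = [sum(f_i * r_i for f_i, r_i in zip(f, r)) for f in possibilities]
        # Ties go to the first maximizer; they occur with probability zero.
        wins[returns.index(max(returns))] += 1
    return [w / samples for w in wins]

gamma = 0.9
f_top = [1.0, gamma / (1 - gamma), 0.0]      # hypothetical: visit state 0, then stay in state 1
f_bottom = [1.0, 0.0, gamma / (1 - gamma)]   # hypothetical: visit state 0, then stay in state 2

print(optimality_measure([f_top, f_bottom]))
```

In this hypothetical example the inequality reduces to comparing the rewards of the two absorbing states, so each estimate should be close to one half.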

###### Proof.

Forward direction: let be the sole possibility at . Then has no maximum, so by the linearity of expectation.

Backward direction: for any MDP, iteratively construct it, starting such that and adding vertices and their arrows. Note that and monotonically increase throughout this process (due to the operator). In particular, if increases from 1, by Corollary 8 there exists a second non-dominated possibility. By Lemma 6, a positive measure subset of accrues strictly greater optimal value via this possibility. So the integration comes out strictly greater. Then if , . ∎

###### Proof.

Each possibility which immediately navigates to a state and stays there is non-dominated by Lemma 7; these are also the only non-dominated possibilities, because the agent cannot do better than immediately navigating to the highest reward state and staying there. So .

Clearly, the possibility navigating to a child is optimal iff the child is a maximum-reward state for a given reward function.

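\begin{align}
\textsc{Power}(s,\gamma) &= \int_0^1 r_{\max}\,\mathrm{d}F_{\max}(r_{\max}) \tag{9}\\
&= \mathbb{E}\left[\text{max of } |\mathcal{S}| \text{ draws from } \mathcal{D}\right]. \tag{10}
\end{align}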
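Equation 10 becomes concrete once a reward distribution is fixed. As a minimal sketch, assume (this is not stated in the source) that rewards are uniform on [0, 1]; the expected maximum of n independent draws is then n / (n + 1), which a quick simulation can confirm. The function names are hypothetical.

```python
import random

def expected_max_uniform(n):
    """Closed form for E[max of n i.i.d. Uniform(0, 1) draws]."""
    return n / (n + 1)

def monte_carlo_max(n, samples=200_000, seed=0):
    """Simulated average of the maximum of n uniform draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += max(rng.random() for _ in range(n))
    return total / samples

for n_states in (1, 2, 4, 8):
    print(n_states, expected_max_uniform(n_states), monte_carlo_max(n_states, samples=50_000))
```

Under this assumption the value rises toward 1 as the number of reachable states grows, with diminishing returns for each additional state.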

###### Proof.
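\begin{align}
V^*_{\text{avg}}(s_0,\gamma) &\coloneqq \int_{\mathcal{R}} \max_{f\in\mathcal{F}(s_0)} V^*_R(s_0)\,\mathrm{d}F(R) \tag{11}\\
&= \left(\sum_{i=0}^{\ell-1}\gamma^i\int_0^1 R(s_i)\,\mathrm{d}F(R)\right)+\gamma^\ell\int_{\mathcal{R}} \max_{f\in\mathcal{F}(s_\ell)} V^*_R(s_\ell)\,\mathrm{d}F(R) \tag{12}\\
&= \frac{1-\gamma^\ell}{1-\gamma}\,\mathbb{E}[\mathcal{D}]+\gamma^\ell\,V^*_{\text{avg}}(s_\ell,\gamma). \tag{13}
\end{align}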

We then calculate :

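\begin{align}
&\textsc{Power}(s_0,\gamma) \tag{14}\\
&= (1-\gamma)\left(\frac{1-\gamma^{\ell-1}}{1-\gamma}\,\mathbb{E}[\mathcal{D}]+\gamma^{\ell-1}\,V^*_{\text{avg}}(s_\ell,\gamma)\right) \tag{15}\\
&= (1-\gamma)\left(\frac{1-\gamma^{\ell-1}}{1-\gamma}\,\mathbb{E}[\mathcal{D}]+\gamma^{\ell-1}\left(\frac{\gamma}{1-\gamma}\,\textsc{Power}(s_\ell,\gamma)+\mathbb{E}[\mathcal{D}]\right)\right) \tag{16}\\
&= \left(1-\gamma^{\ell-1}\right)\mathbb{E}[\mathcal{D}]+\gamma^{\ell}\,\textsc{Power}(s_\ell,\gamma)+\gamma^{\ell-1}(1-\gamma)\,\mathbb{E}[\mathcal{D}] \tag{17}\\
&= \left(1-\gamma^{\ell}\right)\mathbb{E}[\mathcal{D}]+\gamma^{\ell}\,\textsc{Power}(s_\ell,\gamma). \tag{18}
\end{align}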
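The algebra from eq. 13 to eq. 18 can be sanity-checked with arbitrary numbers. This sketch assumes the normalization implied by eq. 16, namely that average optimal value equals γ/(1−γ) times Power plus the mean reward E[D]; the function names and the sample inputs are hypothetical.

```python
def power_from_value(v_avg, mean_reward, gamma):
    """Assumed normalization read off eq. 16: V_avg = gamma / (1 - gamma) * Power + E[D]."""
    return (1 - gamma) / gamma * (v_avg - mean_reward)

def value_at_path_start(v_avg_end, mean_reward, gamma, length):
    """Eq. 13: collect E[D] for `length` steps, then continue from the end of the path."""
    return ((1 - gamma ** length) / (1 - gamma) * mean_reward
            + gamma ** length * v_avg_end)

def power_at_path_start(power_end, mean_reward, gamma, length):
    """Eq. 18: Power at the start of a path, given Power at its end."""
    return (1 - gamma ** length) * mean_reward + gamma ** length * power_end

# Arbitrary illustrative inputs.
gamma, mean_reward, length, v_avg_end = 0.9, 0.5, 3, 7.0

via_value = power_from_value(
    value_at_path_start(v_avg_end, mean_reward, gamma, length), mean_reward, gamma)
via_recursion = power_at_path_start(
    power_from_value(v_avg_end, mean_reward, gamma), mean_reward, gamma, length)
print(via_value, via_recursion)
```

Both routes should print the same number, which reflects that Power at the start of a path is a discounted mix of the mean reward and the Power at the end of the path.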

###### Lemma 17.

For states and , there exists a permutation matrix such that iff and are similar.

###### Proof.

Forward: let be the permutation corresponding to ; without loss of generality, assume is the identity on all states not reachable from either or . Observe that