1 Introduction
Instrumental convergence is the idea that some actions are optimal for a wide range of goals: for example, to travel as quickly as possible to a randomly selected coordinate on Earth, one likely begins by driving to the nearest airport. Driving to the airport would then be instrumentally convergent for travel-related goals. In other words, instrumental convergence posits that there are strong regularities in optimal policies across a wide range of objectives.
Power may be defined as the ability to accomplish goals in general (an informal definition suggested by Cohen et al. (2019)). This seems reasonable: “money is power”, as the saying goes, and money helps one achieve many goals. Conversely, physical restraint reduces one’s ability to steer the situation in various directions. A deactivated agent has no control over the future, and so has no power.
Instrumental convergence is a potential safety concern for the alignment of advanced RL systems with human goals. If gaining power over the environment is instrumentally convergent (as suggested by e.g. Omohundro (2008); Bostrom (2014); Russell (2019)), then even minor goal misspecification will incentivize the agent to resist correction and, eventually, to appropriate resources at scale to best pursue its goal. For example, Marvin Minsky imagined an agent tasked with proving the Riemann hypothesis might rationally turn the planet into computational resources (Russell and Norvig (2009)).
Some established researchers have argued that to impute power-seeking motives is to anthropomorphize, and recent months have brought debate as to the strength of instrumentally convergent incentives to gain power (see https://www.alignmentforum.org/posts/WxW6Gc6f2z3mzmqKs/debateoninstrumentalconvergencebetweenlecunrussell). Pinker (2015) argued that “thinking does not imply subjugating”. It has been similarly suggested that cooperation is instrumentally convergent (and so the system will not gain undue power over us).
We put the matter to formal investigation, and find that these positions are contradicted by reasonable interpretations of our theorems. We make no supposition about the timeline over which real-world power-seeking behavior could become plausible; instead, we concern ourselves with the theoretical consequences of RL agents acting optimally in their environment. Instrumental convergence does, in fact, arise from the structural properties of MDPs, and power-seeking behavior is instrumentally convergent. With respect to distributions over reward functions, we prove that the likelihood that an action is optimal is roughly proportional to the power it supplies the agent. That seeking power is instrumentally convergent highlights a significant theoretical risk: for an agent to gain maximal power over real-world environments, it may need to disempower its supervisors.
2 Possibilities
Although we speculated about how power-seeking affects other agents in the environment, we leave formal multi-agent settings to future work. Let $\langle \mathcal{S}, \mathcal{A}, T, \gamma \rangle$ be a rewardless deterministic MDP with finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$, deterministic transition function $T$, and discount factor $\gamma$. We colloquially refer to agents as farsighted if $\gamma$ is close to 1.
The first key insight is to consider not policies, but the trajectories induced by policies from a given state; to not look at the state itself, but the paths through time available from the state. We concern ourselves with the possibilities available at each juncture of the MDP.
To this end, for each policy $\pi$, consider the mapping $\pi \mapsto (\mathbf{I} - \gamma \mathbf{T}^\pi)^{-1}$ (where $\mathbf{T}^\pi$ is the state transition matrix induced by $\pi$); in other words, each policy $\pi$ maps to a function mapping each state $s$ to a discounted state visitation frequency vector $\mathbf{f}^{\pi}_{s}$, which we call a possibility. The meaning of each frequency vector is: starting in state $s$ and following policy $\pi$, what sequence of states do we visit in the future? (Traditionally, possibilities have gone by many names, including “occupancy measures”, “state visit distributions” (Sutton and Barto (1998)), and “on-policy distributions”. We introduce new terminology to better focus on the natural interpretation of the vector as a path through time.) States visited later in the sequence are discounted according to $\gamma$: the sequence $s_1 s_2 s_3 s_3 \ldots$ would induce visitation frequency $1$ on $s_1$, visitation frequency $\gamma$ on $s_2$, and visitation frequency $\frac{\gamma^2}{1-\gamma}$ on $s_3$. The possibilities available at each state $s$ are defined $\mathcal{F}(s) := \{\mathbf{f}^{\pi}_{s} \mid \pi \in \Pi\}$.
Observe that each possibility has $\lVert \mathbf{f} \rVert_1 = \frac{1}{1-\gamma}$. Furthermore, for any reward function $R$ over the state space and for any state $s$, the optimal value function at discount rate $\gamma$ is defined $V^*_R(s,\gamma) := \max_{\mathbf{f} \in \mathcal{F}(s)} \mathbf{f}^\top \mathbf{r}$ (where $\mathbf{r}$ is $R$ expressed as a column vector). Historically, this latter “dual” formulation has been the primary context in which possibilities have been considered. When considering the directed graph induced by the rewardless MDP (also called a model), we collapse multiple actions with the same consequence to a single outbound arrow.
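To make the notion of a possibility concrete, the following is a minimal sketch, assuming the standard definition of discounted state visitation frequencies and a hypothetical three-state model. The names `possibility`, `possibilities`, `optimal_value`, and `children` are illustrative and do not come from the paper. It computes the frequency vector of a deterministic policy in closed form and recovers the optimal value as a maximum of dot products over the possibilities at a state.

```python
from itertools import product


def possibility(policy, start, gamma):
    """Discounted state visitation frequencies of following `policy` from `start`.

    `policy` maps each state to its successor (deterministic dynamics with the
    chosen action already applied). The trajectory is a path followed by a
    cycle, so the infinite discounted sum has a closed form.
    """
    first_visit = {}  # state -> time step of its first visit
    state, t = start, 0
    while state not in first_visit:
        first_visit[state] = t
        state, t = policy[state], t + 1
    cycle_start = first_visit[state]  # time at which the cycle is entered
    cycle_len = t - cycle_start
    freq = {}
    for s, k in first_visit.items():
        if k < cycle_start:  # transient state: visited exactly once
            freq[s] = gamma ** k
        else:  # cycle state: revisited every `cycle_len` steps
            freq[s] = gamma ** k / (1 - gamma ** cycle_len)
    return freq


def possibilities(children, start, gamma):
    """All distinct possibilities at `start`, enumerating deterministic policies."""
    states = sorted(children)
    seen = {}
    for choice in product(*(children[s] for s in states)):
        f = possibility(dict(zip(states, choice)), start, gamma)
        seen[tuple(sorted(f.items()))] = f
    return list(seen.values())


def optimal_value(children, reward, start, gamma):
    """Optimal value at `start` as the maximum of f . r over its possibilities."""
    return max(
        sum(f[s] * reward[s] for s in f)
        for f in possibilities(children, start, gamma)
    )


# Hypothetical model: s1 may stay put or move to s2; s2 -> s3; s3 loops forever.
children = {"s1": ["s1", "s2"], "s2": ["s3"], "s3": ["s3"]}
gamma = 0.5
f = possibility({"s1": "s2", "s2": "s3", "s3": "s3"}, "s1", gamma)
```

Enumerating every deterministic policy is exponential in the number of states, so this sketch is only suitable for toy models.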
2.1 Foundational results
Omitted proofs and additional results (corresponding to skips in theorem numbering) can be found in appendix A. We often omit statements such as “let be a state” when they are obvious from context.
Lemma (Paths and cycles).
Let $s$ be a state. Consider the infinite state visitation sequence induced by following $\pi$ from $s$. This sequence consists of an initial directed path of some length $\ell \geq 0$ in which no state appears twice, and a directed cycle of some order $k \geq 1$.
Proof outline.
Apply the pigeonhole principle to the fact that the state space is finite and the induced trajectory is deterministic. ∎
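The path-and-cycle decomposition can be sketched directly. This is an illustrative helper (the name `path_and_cycle` and the example policy are hypothetical), assuming a deterministic policy given as a successor map.

```python
def path_and_cycle(policy, start):
    """Split the trajectory of a deterministic policy into (path, cycle).

    `policy` maps each state to its successor. The first state that repeats
    marks where the cycle begins; everything before it is the initial path.
    """
    order = []  # states in order of first visit
    index = {}  # state -> position in `order`
    state = start
    while state not in index:  # pigeonhole: at most len(policy) iterations
        index[state] = len(order)
        order.append(state)
        state = policy[state]
    split = index[state]
    return order[:split], order[split:]


# Hypothetical policy: a -> b -> c -> d -> c -> d -> ...
example_policy = {"a": "b", "b": "c", "c": "d", "d": "c"}
path, cycle = path_and_cycle(example_policy, "a")
```

Here the path holds the transient states and the cycle holds the states visited infinitely often.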
Lemma.
$V^*_R(s,\gamma)$ is piecewise linear with respect to $R$; in particular, it is continuous.
Proof.
$V^*_R(s,\gamma)$ takes the maximum over a set of fixed $|\mathcal{S}|$-dimensional linear functionals. Therefore, the maximum is piecewise linear. ∎
2.2 Nondominated possibilities
Some possibilities are “redundant” – no goal’s optimal value is affected by their availability. If you assign some scalar values to chocolate and to bananas, it’s never strictly optimal to take half of each.
Definition 1.
is dominated if . The set of nondominated possibilities at state is notated .
Definition 2.
The nondominated subgraph at consists of those states visited and actions taken by some nondominated possibility .
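The chocolate-and-bananas intuition can be checked numerically. The sketch below assumes that "dominated" means "never strictly optimal for any reward vector", which is how the surrounding text reads; the helper name `strictly_optimal_count` and the three option vectors are hypothetical. Sampling cannot prove domination, but it illustrates that the half-and-half option never strictly wins.

```python
import random


def dot(f, r):
    """Inner product of an option vector with a reward vector."""
    return sum(fi * ri for fi, ri in zip(f, r))


def strictly_optimal_count(candidate, others, n_samples=10_000, seed=0):
    """Count sampled reward vectors for which `candidate` strictly beats every other option."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_samples):
        r = [rng.random() for _ in candidate]
        if all(dot(candidate, r) > dot(o, r) for o in others):
            count += 1
    return count


# Hypothetical options: all chocolate, all bananas, or half of each.
chocolate, banana, half_each = (1.0, 0.0), (0.0, 1.0), (0.5, 0.5)
```

The mixed option is a convex combination of the other two, so its value can never exceed the better of them.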
3 Power
Recall that we consider an agent’s power to be its ability to achieve goals in general.
Definition 3.
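Let $X$ be any absolutely continuous distribution bounded over $[0,1]$ (positive affine transformation allows extending our results to distributions with different bounds), and define $\mathcal{D}$ to be the corresponding distribution over reward functions with CDF $F$ (note that $X$ is distributed identically across states). The average optimal value at state $s$ is
$$V^*_{\text{avg}}(s,\gamma) := \mathbb{E}_{R \sim \mathcal{D}}\left[V^*_R(s,\gamma)\right]. \tag{1}$$
[Figure caption fragment: for left-skew $X$, the plotted relation seems to hold at smaller $\gamma$.]
However, $V^*_{\text{avg}}(s,\gamma)$ diverges as $\gamma \to 1$ and includes an initial term of $\mathbb{E}[X]$ (as the agent has no control over its current presence at $s$).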
Definition 4.
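$$\text{Power}(s,\gamma) := \frac{1-\gamma}{\gamma}\left(V^*_{\text{avg}}(s,\gamma) - \mathbb{E}[X]\right). \tag{2}$$
This quantifies the agent’s control at future timesteps. Observe that for any two states $s, s'$, $V^*_{\text{avg}}(s,\gamma) \geq V^*_{\text{avg}}(s',\gamma)$ iff $\text{Power}(s,\gamma) \geq \text{Power}(s',\gamma)$.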
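Lemma (Minimal power).
Let $s$ be a state. $\text{Power}(s,\gamma) = \mathbb{E}[X]$ iff $|\mathcal{F}(s)| = 1$.
Lemma (Maximal power).
Let $s$ be a state such that all states are one-step reachable from $s$, each of which has a loop. Then $\text{Power}(s,\gamma) = \mathbb{E}[\max \text{ of } |\mathcal{S}| \text{ draws from } X]$. In particular, for any MDP with $|\mathcal{S}|$ states, this is maximal.
Proposition.
$\mathbb{E}[X] \leq \text{Power}(s,\gamma) \leq \mathbb{E}[\max \text{ of } |\mathcal{S}| \text{ draws from } X]$.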
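The two extreme cases can be explored numerically. The following is a rough Monte Carlo sketch, assuming that Power is normalized as $\frac{1-\gamma}{\gamma}(V^*_{\text{avg}} - \mathbb{E}[X])$ and that rewards are drawn i.i.d. from the uniform distribution on $[0,1]$. The function names and the two toy models are hypothetical. Under these assumptions, a state with a single possibility should come out near the mean reward $1/2$, and a three-state "hub" whose states are all one-step reachable and have loops should come out near the expected maximum of three uniform draws, $3/4$.

```python
import random


def optimal_values(children, reward, gamma, tol=1e-9):
    """Value iteration for a deterministic MDP with state-based reward."""
    values = {s: 0.0 for s in children}
    while True:
        new = {
            s: reward[s] + gamma * max(values[c] for c in children[s])
            for s in children
        }
        if max(abs(new[s] - values[s]) for s in children) < tol:
            return new
        values = new


def estimate_power(children, state, gamma, n_samples=2000, seed=0):
    """Monte Carlo estimate of Power at `state`.

    ASSUMPTIONS: rewards are i.i.d. uniform on [0, 1] (so E[X] = 0.5), and
    Power is (1 - gamma) / gamma * (average optimal value - E[X]).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        reward = {s: rng.random() for s in children}
        total += optimal_values(children, reward, gamma)[state]
    mean_reward = 0.5
    return (1 - gamma) / gamma * (total / n_samples - mean_reward)


# Hypothetical models: a state that can only loop, and a fully connected hub.
stuck = {"s": ["s"]}
hub = {"s": ["s", "a", "b"], "a": ["a"], "b": ["b"]}
```

For example, `estimate_power(stuck, "s", 0.8)` and `estimate_power(hub, "s", 0.8)` give sample estimates of the two cases; the estimates carry sampling noise that shrinks as `n_samples` grows.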
If one must wait, one has less control over the future; for example, the MDP in fig. (a)a has a onestep waiting period. The following theorem nicely encapsulates this as a convex combination of the minimal present control and anticipated future control.
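Proposition (Delay decreases power).
Let $s_0, \ldots, s_\ell$ be such that for $i = 0, \ldots, \ell - 1$, each $s_i$ has $s_{i+1}$ as its sole child. Then $\text{Power}(s_0,\gamma) = (1-\gamma^\ell)\,\mathbb{E}[X] + \gamma^\ell\,\text{Power}(s_\ell,\gamma)$.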
To further demonstrate the suitability of this notion of power, we consider one final property. Two vertices $s$ and $s'$ are said to be similar if there exists a graph automorphism $\phi$ such that $\phi(s) = s'$. If all vertices are similar, the graph is said to be vertex transitive. Vertex transitive graphs are highly symmetric; therefore, the power should be equal everywhere.
Proposition.
If $s$ and $s'$ are similar, $\text{Power}(s,\gamma) = \text{Power}(s',\gamma)$.
Corollary.
If the model is vertex transitive, all states have equal Power.
Corollary.
If $s$ and $s'$ have the same children, $\text{Power}(s,\gamma) = \text{Power}(s',\gamma)$.
3.1 Time-uniformity
To bolster the reader’s intuitions, we consider a special type of MDP where the power of each state can be immediately determined.
Definition 5.
A state $s$ is time-uniform when, for all $k > 0$, either all states reachable in $k$ steps have the same children or all such states can only reach themselves.
Theorem (Time-uniform power).
If the state $s$ is time-uniform, then either all possibilities simultaneously enter 1-cycles after time steps and
or no possibility ever enters a 1-cycle and
4 Optimal Policy Shifts
Time-uniformity brings us to another interesting property: some MDPs have no reward functions whose optimal policy set changes with $\gamma$. In other words, for any reward function and for all $\gamma$, the greedy policy is optimal.
Definition 6.
For a reward function $R$ and discount factor $\gamma$, we refer to a change in the set of optimal policies as an optimal policy shift at $\gamma$. We also say that two possibilities $\mathbf{f}$ and $\mathbf{f}'$ switch off at $\gamma$.
In which environments can an agent change its mind as it becomes more farsighted? When can optimal policy shifts occur? The answer: when the agent can be made to choose between lesser immediate reward and greater delayed reward. In other words, when gratification can be delayed.
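A small worked example of delayed gratification: the sketch below uses a hypothetical model in which the agent either moves to a looping state worth `r_now` per step, or passes through a delay state worth `r_delay` to reach a looping state worth `r_goal`. All names and reward values are illustrative assumptions. With the defaults, the two returns tie at $\gamma = 0.6$, so the optimal first move changes near that discount factor.

```python
def first_move(gamma, r_now=0.6, r_delay=0.0, r_goal=1.0):
    """Optimal first move from the start state in the hypothetical two-branch model.

    The start state's own reward is common to both branches and is omitted.
    """
    now = gamma * r_now / (1 - gamma)  # loop on the immediate reward forever
    later = gamma * r_delay + gamma ** 2 * r_goal / (1 - gamma)  # wait one step first
    return "now" if now >= later else "later"


def shift_points(grid):
    """Discount factors in `grid` at which the optimal first move differs from the previous entry."""
    moves = [first_move(g) for g in grid]
    return [grid[i] for i in range(1, len(grid)) if moves[i] != moves[i - 1]]


grid = [i / 100 for i in range(1, 100)]
shifts = shift_points(grid)
```

Making the delayed reward no larger than the immediate one (for instance `r_goal=0.5`) removes the shift entirely, which matches the intuition that shifts require a trade-off between lesser immediate and greater delayed reward.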
Theorem.
There can exist an optimal policy whose action changes at iff .
Definition 7 (Blackwell optimal policies (Blackwell (1962))).
For reward function $R$, an optimal policy set is said to be Blackwell optimal if, for some $\gamma^* \in (0,1)$, no further optimal policy shifts occur for $\gamma \in (\gamma^*, 1)$.
Intuitively, a Blackwell optimal policy set means the agent has “settled down” and will no longer change its mind as it becomes more farsighted (that is, as $\gamma$ increases towards 1).
Blackwell (1962) exploits linear-algebraic properties of the Bellman equations to conclude the existence of a Blackwell-optimal policy. We strengthen this result with an explicit upper bound.
Lemma.
For any reward function and , and switch off at most times.
Theorem (Existence of a Blackwell optimal policy (Blackwell (1962))).
For any reward function $R$, a finite number of optimal policy shifts occur.
As demonstrated in fig. 7, reward functions are often never all done shifting. However, we can prove that most of has switched to their Blackwell optimal policy set.
Definition 8.
Let , and let denote the subset of for which is optimal. The optimality measure of , notated , is the measure of under . (To avoid notational clutter, we keep implicit the state-dependence of , and other quantities involving one or more possibilities. That is, we do not write .)
Proposition.
The following limits exist: and .
5 Instrumental Convergence
The intuitive notion of instrumental convergence is that with respect to , optimal policies are more likely to take one action than another (e.g. remaining activated versus being shut off). However, the state with maximal Power isn’t always instrumentally convergent from other states; see fig. 8. Our treatment of instrumental convergence therefore requires some care.
5.1 Characterization
Definition 9.
Define to be the contribution of to . For , . Similarly, .
We’d like to quantify when optimal policies tend to take certain actions more often than others. For example, if gaining money is “instrumentally convergent”, then concretely, this means that actions which gain money are more likely to be optimal than actions which do not gain money.
Definition 10.
We say that instrumental convergence exists downstream of state when, for some , state trajectory prefix , and such that there exist whose possibilities respectively induce and , we have .
Loosely speaking, the joint entropy of the distribution of (deterministic) optimal policies under is inversely related to the degree to which instrumental convergence is present.
Theorem (The character of instrumental convergence).
Instrumental convergence exists downstream of a state iff a possibility of that state has measure variable in $\gamma$.
Consider that when $\gamma$ is sufficiently close to 0, most agents act greedily; definition 10 hints that instrumental convergence relates to power-seeking behavior becoming more likely as $\gamma \to 1$.
Corollary.
If no optimal policy shifts can occur, then instrumental convergence does not exist.
5.2 Possibility similarity
Definition 11.
Let induce state trajectories and respectively. We say that and are similar if there exists a graph automorphism on the nondominated subgraph at such that .
Observe that the existence of such a for the full model is sufficient for similarity.
Proposition.
If and are similar, then and .
Corollary.
If all nondominated possibilities of a state are similar, then no instrumental convergence exists downstream of the state.
Vertex transitivity does not necessarily imply possibility similarity (e.g. instrumental convergence exists in the 3-prism graph with self-loops).
5.3 1-cycle MDPs
In this subsection, we consider states whose nondominated possibilities all terminate in 1-cycles; powerful instrumental convergence results are available in this setting. Let contain all of the 1-cycles reachable from , and let . Let contain those possibilities ending in a cycle of .
Theorem.
and .
Corollary (Reaching more 1-cycles is instrumentally convergent).
Let . If , then .
Applying the results of section 5.3 shows that it is instrumentally convergent to, e.g., keep a game of Tic-Tac-Toe going as long as possible and to avoid dying in Pac-Man (just consider the distribution of 1-cycles in the respective models).
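The flavor of this result can be illustrated by simulation. The sketch below assumes a hypothetical model in which the start state has two children: a lone 1-cycle, and a hub state leading to `n_loops` further 1-cycles. Rewards are assumed i.i.d. uniform on $[0,1]$, and the name `fraction_via_hub` is illustrative. As $\gamma \to 1$, the choice is governed by which 1-cycle carries the greatest reward; with three loops behind the hub, each of the four loops is equally likely to be best, so the estimate should approach $3/4$.

```python
import random


def fraction_via_hub(gamma, n_loops=3, n_samples=20_000, seed=0):
    """Fraction of sampled reward functions whose optimal policy heads for the hub.

    Hypothetical model: start -> lone (a 1-cycle), or start -> hub -> one of
    `n_loops` 1-cycles. The start state's reward is common to both branches
    and is omitted from the comparison.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        r_lone = rng.random()
        r_hub = rng.random()
        r_loops = [rng.random() for _ in range(n_loops)]
        via_lone = gamma * r_lone / (1 - gamma)
        via_hub = gamma * r_hub + gamma ** 2 * max(r_loops) / (1 - gamma)
        hits += via_hub > via_lone
    return hits / n_samples
```

Calling `fraction_via_hub(0.99)` gives a sample estimate for a farsighted agent; increasing `n_loops` pushes the fraction further towards 1.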
5.4 Optimal policies tend to take control
Theorem (Power is roughly instrumentally convergent).
Let , , and . Suppose that
Then . The statement also holds when Power and are exchanged.
Remark.
Section 5.4 can be extended to hold for arbitrary continuous distributions over reward functions (e.g., if some states have greater expected reward than others). The instrumental convergence then holds with respect to the Power for that distribution.
Suppose the agent starts at with a goal drawn from the uniform distribution over reward functions. If one child contributes 100 times as much Power as another child , then the agent is at least 50 times more likely to have an optimal policy navigating through ( for the uniform distribution, so ).
In the above analysis, familiarity with the mechanics of Power suggests that the terminal state corresponding to agent shutdown has minuscule power contribution. Therefore, in an MDP reflecting the consequences of deactivation, agents pursuing randomly selected goals are quite unlikely to allow themselves to be deactivated (if they have a choice in the matter).
Section 5.4 strongly informs an ongoing debate as to whether most agents act to acquire resources and avoid shutdown. As mentioned earlier, it has been argued that power-seeking behavior will not arise unless we specifically incentivize it.
Section 5.4 answers: yes, optimal farsighted agents will usually acquire resources; yes, optimal farsighted agents will generally act to avoid being deactivated. If there is a set of possibilities through some part of the future offering a high degree of control over future state observations, optimal farsighted agents are likely to pursue that control. Conversely, if some set of possibilities is strongly instrumentally convergent, they offer a larger power contribution.
Suppose we are at state and can reach . The “top-down” differs from the power contribution of those possibilities running through , which is conditional on starting at (consider the power contributions presented in fig. (b)b).
6 Related Work
Benson-Tilsen and Soares (2016) explored how instrumental convergence arises in a particular toy model. In economics, turnpike theory studies a similar notion: certain paths of accumulation (turnpikes) are more likely to be optimal than others (see e.g. McKenzie (1976)). Soares et al. (2015) and Hadfield-Menell et al. (2016) formally consider the problem of an agent rationally resisting deactivation.
There is a surprising lack of basic theory with respect to the structural properties of possibilities. Wang et al. (2007) and Wang et al. (2008) both remark on this absence, using state visitation distributions to formulate dual versions of classic dynamic programming algorithms. Regan and Boutilier (2011) employ state visitation distributions to navigate reward uncertainty. Regan and Boutilier (2010) explore the idea of nondominated policies – policies which are optimal for some instantiation of the reward function (which is closely related to our definition of nondominated possibilities in section 2.2).
Multi-objective MDPs trade off the maximization of several objectives (see e.g. Roijers et al. (2013)), while we examine how MDP structure determines the ability to maximize objectives in general.
Johns and Mahadevan (2007) observed that optimal value functions are smooth with respect to the dynamics of the environment, which can be proven with our formalism. Dadashi et al. (2019) explore topological properties of value function space while holding the reward function constant. Bellemare et al. (2019) study the benefits of learning a certain subset of value functions. Foster and Dayan (2002) explore the properties of the optimal value function for a range of goals; along with Drummond (1998), Sutton et al. (2011), and Schaul et al. (2015), they note that value functions seem to encode important information about the environment. In separate work, we show that a limited subset of optimal value functions encodes the environment. Turner et al. (2019) speculate that the optimal value of a state is heavily correlated across reward functions.
6.1 Existing contenders for measuring power
We highlight the shortcomings of existing notions quantifying the agent’s control over the future, starting from a given state.
State reachability (discounted or otherwise) fails to quantify how often states can be visited (see fig. 12). Characterization by the sizes of the final communicating classes ignores both transient state information and the local dynamics in those final classes. Graph diameter ignores local information, as do the minimal and maximal degrees.
There are many graph centrality measures, none of which are appropriate. For brevity, we only consider two such alternatives. The degree centrality of a state ignores nonlocal dynamics – the agent’s control in the nonimmediate future. Closeness centrality has the same problem as discounted reachability: it only accounts for distance in the MDP’s model, not for control over the future.
Salge et al. (2014) define information-theoretic empowerment as the maximum possible mutual information between the agent’s actions and the state observations $n$ steps in the future, notated . This notion requires an arbitrary choice of horizon, failing to account for the agent’s discount factor $\gamma$. As demonstrated in fig. 13, this leads to arbitrary evaluations of control.
One idea would be to take the limit as the horizon tends to infinity; however, this fails to converge for even simple MDPs (see fig. (a)a). Alternatively, one might consider the discounted empowerment series , or even taking the global maximum over this series of channel capacities (instead of adding the channel capacities for each individual horizon). Neither fix suffices.
Compounding these issues is the fact that “in a discrete deterministic world empowerment reduces to the logarithm of the number of sensor states reachable with the available actions” (Salge et al. (2014)). We have already observed that reachability metrics are unsatisfactory.
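The horizon dependence is easy to see in the deterministic case. The sketch below takes the quoted reduction at face value, assuming that sensor states coincide with MDP states and that "reachable" means reachable in exactly `horizon` steps. The function name and the toy model are hypothetical.

```python
from math import log2


def n_step_empowerment(children, state, horizon):
    """log2 of the number of states reachable from `state` in exactly `horizon` steps.

    ASSUMPTION: deterministic dynamics with fully observed states, so the
    channel capacity reduces to counting distinct reachable states.
    """
    frontier = {state}
    for _ in range(horizon):
        frontier = {c for s in frontier for c in children[s]}
    return log2(len(frontier))


# Hypothetical model: two options now, but both funnel into the same sink.
funnel = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": ["t"]}
```

In this model the same state scores one bit at horizon 1 and zero bits at horizon 2, so the verdict depends entirely on the chosen horizon.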
7 Discussion
We have only touched on a portion of the structural insights made possible by possibilities; for example, there are intriguing MDP representability results left unstated.
Although we only treated deterministic finite MDPs, it seems reasonable to expect the key conclusions to apply to broader classes of environments. We treat the case where the reward distribution is distributed identically across states; if we did not assume this, we could not prove much of interest, as sufficiently tailored distributions could make any part of the MDP “instrumentally convergent”. However, Power is compatible with arbitrary reward function distributions.
7.1 Open questions
We know that is continuous on (creftype 24), does not equal 0 at any (creftype 21) iff is nondominated, and that it converges as (definition 8); similar statements hold for . However, for all continuous , do the optimality measures of possibilities and the powers of states eventually reach ordinal equilibrium for sufficiently close to 1? There are further interesting results which would immediately follow.
Conjecture.
either for all , or for at most finitely many such .
Proof outline.
. Consider the inequalities of the form such that is strictly optimal (for continuous , only a zero measure subset of requires the inequality to not be strict). Consider the measure of the subset of such that the inequality holds. Suppose this measure is a rational function of (note that each is a homogeneous degree-one polynomial on with coefficients rational in ; the measure of this subset may not be a rational function under all bounded continuous distributions, but it should at least be rational under the uniform distribution). The integral can then be re-expressed as the summation of these measures. Then is a rational function on .
Then if , there are at most finitely many roots by the fundamental theorem of algebra. ∎
7.2 Formalizations
The formalization of power seems reasonable, consistent with intuitions for all toy MDPs examined. The formalization of instrumental convergence also seems correct. Practically, if we want to determine whether an agent might gain power in the real world, one might be wary of concluding that we can simply “imagine” a relevant MDP and then estimate e.g. the “power contributions” of certain courses of action. However, any formal calculations of Power are obviously infeasible for nontrivial environments.
To make predictions using these results, we must combine the intuitive correctness of the power and instrumental convergence formalisms with empirical evidence (from toy models), with intuition (from working with the formal object), and with theorems (like section 5.3, which reaffirms the common-sense prediction that more cycles means asymptotic instrumental convergence, or definition 5, fully determining the power in time-uniform environments). We can reason, “for avoiding shutdown to not be heavily convergent, the model would have to look like such-and-such, but it almost certainly does not…”.
7.3 Power-seeking
The theory supplies significant formal understanding of power-seeking incentives. The results strongly support the philosophical arguments of Omohundro (2008) and the conclusions Benson-Tilsen and Soares (2016) drew from their toy model: one should reasonably expect instrumental convergence to arise in the real world. Furthermore, we can appreciate that this convergence arises from how goal-directed behavior interacts with the structure of the environment.
Beyond exploring this structure, the theory reveals facts of (eventual) practical relevance. For example, calculations in toy MDPs indicate that when the reward distribution is left-skew (i.e. reward is generally harder to come by), the agent begins seeking power at smaller $\gamma$ (fig. 3). There is not always instrumental convergence towards the state with greatest Power (fig. 8); if one were to be “airdropped” into the MDP with a reward function drawn from the reward function distribution, one should choose the state with greatest Power in order to maximize return in expectation. However, given that one starts from a fixed state, optimal policies may lead more directly towards their destinations.
The overall concern raised by section 5.4 is not that we will build powerful RL agents with randomly selected goals. The concern is that random reward function inputs produce adversarial power-seeking behavior, which can produce perverse incentives such as avoiding deactivation and appropriating resources. Therefore, we should have specific reason to believe that providing the reward function we had in mind will not end in catastrophe.
8 Conclusion
Much research is devoted (directly or indirectly) towards the dream of AI: creating highly intelligent agents operating in the real world. In the real world, optimal pursuit of random goals doesn’t just lead to strange behavior – it leads to bad behavior: maximizing a reasonable notion of power over the environment entails resisting shutdown and potentially appropriating resources. Theoretically, section 5.4 implies that the farsighted optimal policies of most reinforcement learning agents acting in the real world are malign.
What if we succeed at creating these agents?
Acknowledgements
This work was supported by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund. Logan Smith lent significant help by providing a codebase for exploring the power of different states in MDPs. I thank Max Sharnoff for contributions to definition 7. Daniel Blank, Ryan Carey, Ofer Givoli, Evan Hubinger, Joel Lehman, Vanessa Kosoy, Victoria Krakovna, Rohin Shah, Prasad Tadepalli, and Davide Zagami provided valuable feedback.
References
 Bellemare et al. [2019] Marc G. Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. A Geometric Perspective on Optimal Representations for Reinforcement Learning. arXiv:1901.11530 [cs, stat], January 2019. arXiv: 1901.11530.

 Benson-Tilsen and Soares [2016] Tsvi Benson-Tilsen and Nate Soares. Formalizing Convergent Instrumental Goals. Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 Blackwell [1962] David Blackwell. Discrete Dynamic Programming. The Annals of Mathematical Statistics, page 9, 1962.
 Bostrom [2014] Nick Bostrom. Superintelligence. Oxford University Press, 2014.
 Cohen et al. [2019] Michael K. Cohen, Badri Vellambi, and Marcus Hutter. Asymptotically Unambitious Artificial General Intelligence. arXiv:1905.12186 [cs], May 2019. arXiv: 1905.12186.

 Dadashi et al. [2019] Robert Dadashi, Marc G Bellemare, Adrien Ali Taiga, Nicolas Le Roux, and Dale Schuurmans. The Value Function Polytope in Reinforcement Learning. In International Conference on Machine Learning, pages 1486–1495, 2019.
 Drummond [1998] Chris Drummond. Composing functions to speed up reinforcement learning in a changing world. In Machine Learning: ECML-98, volume 1398, pages 370–381. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.
 Foster and Dayan [2002] David Foster and Peter Dayan. Structure in the space of value functions. Machine Learning, 49(2-3):325–346, 2002.
 Hadfield-Menell et al. [2016] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The Off-Switch Game. arXiv:1611.08219 [cs], November 2016. arXiv: 1611.08219.
 Johns and Mahadevan [2007] Jeff Johns and Sridhar Mahadevan. Constructing basis functions from directed graphs for value function approximation. In International Conference on Machine Learning, pages 385–392. ACM Press, 2007.
 McKenzie [1976] Lionel W McKenzie. Turnpike theory. Econometrica: Journal of the Econometric Society, pages 841–865, 1976.
 Omohundro [2008] Stephen Omohundro. The Basic AI Drives. 2008.
 Pinker [2015] Steven Pinker. Thinking does not imply subjugating. What to Think about Machines that Think; Brockman, J., Ed, pages 5–8, 2015.
 Puterman [2014] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
 Regan and Boutilier [2010] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
 Regan and Boutilier [2011] Kevin Regan and Craig Boutilier. Robust online optimization of reward-uncertain MDPs. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
 Roijers et al. [2013] Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.
 Russell and Norvig [2009] Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Pearson Education Limited, 2009.
 Russell [2019] Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019.
 Salge et al. [2014] Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment–an introduction. In Guided SelfOrganization: Inception, pages 67–114. Springer, 2014.
 Schaul et al. [2015] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.
 Soares et al. [2015] Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. AAAI Workshops, 2015.
 Sutton and Barto [1998] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. Adaptive computation and machine learning. MIT Press, 1998.
 Sutton et al. [2011] Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, pages 761–768, 2011.
 Turner et al. [2019] Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli. Conservative Agency via Attainable Utility Preservation. February 2019. arXiv: 1902.09725.
 Wang et al. [2007] Tao Wang, Michael Bowling, and Dale Schuurmans. Dual Representations for Dynamic Programming and Reinforcement Learning. In International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 44–51. IEEE, 2007.
 Wang et al. [2008] Tao Wang, Michael Bowling, Dale Schuurmans, and Daniel J Lizotte. Stable dual dynamic programming. In Advances in neural information processing systems, pages 1569–1576, 2008.
Appendix A Proofs
Lemma 1.
Each state has at least one possibility unique to it.
Proof.
For each , contains a possibility which visits state strictly more often than do any other possibilities at different states. That is, for any visiting policy enacted from another which has distance from , places times more measure on than does . ∎
Define the restriction . is the unit column vector corresponding to state .
Lemma 2 (Prepend).
is reachable in 1 step from via action iff .
Proof.
Forward direction: let be a policy such that . Then starting in state , state is reached and then the state visitation frequency vector is produced. Repeat for all such .
Backward direction: lemma 1 shows that contains at least one possibility unique to , which is available even under restriction to because any policy maximizing visitation would navigate to immediately from . Then this possibility can only be provided by being reachable in one step from . ∎
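The "prepend" relationship used in the forward direction can be checked numerically: stepping from a state to its successor prepends one undiscounted visit and discounts everything that follows by $\gamma$. The sketch below assumes the standard discounted visitation definition and truncates the infinite sum at a long horizon. The function name and the three-state policy are hypothetical.

```python
def visitation(policy, start, gamma, horizon=200):
    """Truncated discounted visitation frequencies along the policy's trajectory."""
    freq = {s: 0.0 for s in policy}
    state = start
    for t in range(horizon):
        freq[state] += gamma ** t
        state = policy[state]
    return freq


# Hypothetical policy: s -> u, then u and v alternate forever.
policy = {"s": "u", "u": "v", "v": "u"}
gamma = 0.5
f_s = visitation(policy, "s", gamma)
f_u = visitation(policy, "u", gamma)

# Prepend: unit vector for s, plus gamma times the successor's possibility.
prepended = {x: (1.0 if x == "s" else 0.0) + gamma * f_u[x] for x in policy}
```

Up to truncation and rounding error, `prepended` agrees with `f_s` entry by entry.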
Lemma 3.
Suppose traverses a cycle with states . Define . Then , and for any , .
Proof.
Since the form a cycle, we have . Since the rewardless MDP is deterministic and by the definition of a cycle, acts as the identity on all . ∎
Lemma 4 (Convergence to gain optimality (Puterman [2014])).
Let be a reward function, be a state, and contain the states of a cycle with maximal average reward that is reachable from . Then
(3) 
A.1 Nondominated possibilities
Proposition 5 (Domination criterion).
is dominated iff the inequality has no solution for .
Lemma 6.
If is nondominated, the subset of for which is strictly optimal has positive measure and is convex.
Proof.
The set has positive measure because is continuous on by section 2.1. The set is convex because it is the intersection of open half-spaces () restricted to the $|\mathcal{S}|$-dimensional unit hypercube. ∎
Lemma 7 (Strict visitation optimality sufficient for nondomination).
If assigns more visitation frequency to some state than does any other , then .
Proof.
Let be the state indicator reward function for . ∎
Corollary 8.
Suppose which place strictly greater measure on some corresponding states than do other possibilities. Then and . In particular, when , .
Proof.
Apply Lemma 7. For the second claim, observe that trivially when , and also holds for since of two distinct possibilities, each must have strict visitation optimality for at least one state. ∎
A.2 Variational divergence
Lemma 9.
Suppose travels a path from state for steps, ending in . .
Proof.
Notice that each loses measure, and that all such states are distinct by the definition of a path. Since all possibilities have equal norm, the total measure lost equals the total measure gained by other states (and therefore the total variational divergence; see fig. 14). Then . ∎
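The bookkeeping in this proof can be illustrated numerically. The following is a minimal sketch under assumptions of our own: a hypothetical three-state deterministic chain 0 → 1 → 2 in which state 2 loops to itself, a discount rate of 0.9, and visitation vectors defined as discounted visit counts (truncated at a long horizon). It compares the vector obtained from the start of a two-step path with the one obtained from its endpoint.

```python
def discounted_visits(successor, start, gamma, horizon=2000):
    """Discounted state visitation vector of a deterministic trajectory,
    truncated at `horizon` steps (the tail is negligible for gamma < 1)."""
    f = [0.0] * len(successor)
    state, weight = start, 1.0
    for _ in range(horizon):
        f[state] += weight
        state = successor[state]
        weight *= gamma
    return f


def l1_distance(f, g):
    return sum(abs(a - b) for a, b in zip(f, g))


GAMMA = 0.9
# Hypothetical policy: state 0 -> 1 -> 2, and state 2 stays put forever.
PATH_THEN_STAY = [1, 2, 2]

f_start = discounted_visits(PATH_THEN_STAY, 0, GAMMA)
f_end = discounted_visits(PATH_THEN_STAY, 2, GAMMA)
path_distance = l1_distance(f_start, f_end)
print(path_distance)
```

In this example the path states 0 and 1 carry measure 1 and γ that the endpoint's vector lacks, and state 2 carries exactly that much less, so the distance works out to 2(1 + γ) = 3.8.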
Intuition suggests that a possibility is most different from itself halfway along a cycle. This is correct.
Lemma 10.
Suppose travels a cycle () from state . .
Proof.
(4)  
(5)  
(6) 
Equation 4 can be verified by inspection. Setting the derivative with respect to to 0, we solve
(7)  
(8) 
This is justified because the function is strictly concave on by the second-order test and the fact that . If is even, we are done. If is odd, then we need an integer solution. Notice that plugging and into eq. 5 yields the same maximal result. Therefore, in the odd case, both inequalities in the lemma statement are strict. In the even case, the first inequality is an equality. ∎
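The "halfway along the cycle" intuition can be checked directly. The sketch below is illustrative only and rests on our own assumptions: a deterministic cycle of k states visited in order, with the visitation vector defined as discounted visit counts, so that a trajectory starting at position j places measure γ^((m − j) mod k) / (1 − γ^k) on position m. It computes the L1 distance between the vectors starting at position 0 and at position i, for every i.

```python
def cycle_self_divergence(k, i, gamma):
    """L1 distance between the discounted visitation vectors obtained by
    starting at positions 0 and i of a deterministic k-cycle."""
    norm = 1.0 - gamma ** k
    total = 0.0
    for m in range(k):
        from_zero = gamma ** m / norm
        from_i = gamma ** ((m - i) % k) / norm
        total += abs(from_zero - from_i)
    return total


def divergences(k, gamma):
    return [cycle_self_divergence(k, i, gamma) for i in range(k)]


even = divergences(6, 0.9)
odd = divergences(5, 0.9)
print(max(range(6), key=even.__getitem__))  # position of the maximum, k = 6
print(odd[2], odd[3])  # the two middle positions for k = 5
```

For the even cycle the maximum sits at the midpoint; for the odd cycle the two positions adjacent to the midpoint give the same distance, matching the tie described in the proof.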
Theorem 11 (Self-divergence lower bound for different states).
For any , if , .
Proof.
The shortest path self-divergence is when for Lemma 9, in which case . The shortest cycle self-divergence is when for Lemma 10, in which case . ∎
A.3 Optimality measure
For this section only, let be any continuous (not necessarily bounded) distribution over reward functions.
Theorem 12 (Optimal value differs everywhere for almost all reward functions).
If , then .
Proof.
Let . Choose any for . Let (where the are members of an index set of the state space) correspond to the positive entries of , and () to the negative. Clearly, by Theorem 11 (in particular, their sums are positive).
Clearly, iff . This carves out a lower-dimensional subset of ; since is continuous, this subset has zero measure. ∎
Note that continuity is required; discontinuous distributions admit nonzero probability of drawing a flat reward function, for which optimal value is the same everywhere.
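A small experiment makes the contrast concrete. The sketch below is our own illustration, not the paper's construction: it assumes a hypothetical three-state deterministic MDP, state-based rewards drawn i.i.d. uniformly from [0, 1], a discount rate of 0.9, and optimal value computed by plain value iteration. It reports the smallest gap between the optimal values of any two states.

```python
import random

# Hypothetical deterministic MDP: state -> list of reachable successor states.
SUCCESSORS = {0: [1, 2], 1: [1, 2], 2: [2]}


def optimal_values(reward, gamma, n_iters=500):
    """Value iteration for state-based reward:
    V(s) = r(s) + gamma * max over successors s' of V(s')."""
    v = {s: 0.0 for s in SUCCESSORS}
    for _ in range(n_iters):
        v = {
            s: reward[s] + gamma * max(v[t] for t in SUCCESSORS[s])
            for s in SUCCESSORS
        }
    return v


def min_gap(values):
    """Smallest difference between the optimal values of two states."""
    ordered = sorted(values.values())
    return min(b - a for a, b in zip(ordered, ordered[1:]))


def sample_min_gaps(n_samples, gamma=0.9, seed=0):
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_samples):
        reward = {s: rng.random() for s in SUCCESSORS}
        gaps.append(min_gap(optimal_values(reward, gamma)))
    return gaps


random_gaps = sample_min_gaps(200)
flat_gap = min_gap(optimal_values({s: 0.5 for s in SUCCESSORS}, 0.9))
print(min(random_gaps), flat_gap)
```

Every sampled reward function yields pairwise distinct optimal values, while the flat reward function gives the same value at every state, so its gap is zero.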
No is suboptimal for all reward functions: every possibility is optimal for a constant reward function. However, for any given , almost every reward function has a unique optimal possibility at each state.
Theorem 13 (Optimal possibilities are almost always unique).
Let be any state. For any , has measure zero.
Proof.
Let and let be a state at which there is more than one optimal possibility. There exists a state reachable from with both one-step reachable from such that (if not, then one or the other would be strictly preferable and only one optimal possibility would exist). Apply Theorem 12. ∎
Corollary 14 (Dominated possibilities almost never optimal).
Let be a dominated possibility at state , and let . The set of reward functions for which is optimal at discount rate has measure zero.
Lemma 6 states that each element of is strictly optimal on a convex positive measure subset of . Theorem 13 further shows that these positive measure subsets cumulatively have measure 1 under continuous distributions . In particular, if a dominated possibility is optimal, it must be optimal on the boundary of one of these convex subsets (otherwise it would be strictly dominated).
Lemma 15 (Average reward of different state subsets almost never equal).
Let s.t. . Then .
Proof.
There are uncountably many unsatisfactory variants of every reward function which does satisfy the equality; since is continuous, the set of satisfactory reward functions must have measure zero. ∎
A.4 Power
Lemma 16.
Proof.
By the definition of domination, restriction to nondominated possibilities leaves all attainable utilities unchanged; is continuous, so a zero measure subset of has multiple optimal possibilities (Theorem 13). ∎
The optimality set can be calculated by solving the relevant system of inequalities. (Footnote 7: Mathematica code to calculate these inequalities can be found at https://github.com/loganriggs/gold.) For example, consider again the MDP of fig. 16. We would like to calculate . The two possibilities are and . To determine , solve
Intersecting this region with , we have .
Proof.
Forward direction: let be the sole possibility at . Then has no maximum, so by the linearity of expectation.
Backward direction: for any MDP, iteratively construct it, starting such that and adding vertices and their arrows. Note that and monotonically increase throughout this process (due to the operator). In particular, if increases from 1, by Corollary 8 there exists a second nondominated possibility. By Lemma 6, a positive measure subset of accrues strictly greater optimal value via this possibility. So the integration comes out strictly greater. Then if , . ∎
Proof.
Each possibility which immediately navigates to a state and stays there is nondominated by Lemma 7; these are also the only nondominated possibilities, because the agent cannot do better than immediately navigating to the highest reward state and staying there. So .
Clearly, the possibility navigating to a child is optimal iff the child is a maximum-reward state for a given reward function.
(9)  
(10) 
∎
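The "optimal iff the child is a maximum-reward state" observation lends itself to a quick simulation. The sketch below adds an assumption not stated in this proof: rewards at the n child states are drawn i.i.d. uniformly from [0, 1]. Under that assumption each child is the maximum with probability 1/n by symmetry, and the expected maximum reward is n/(n+1). The function name and parameters are illustrative.

```python
import random


def simulate_children(n_children, n_samples=20000, seed=0):
    """Sample child rewards and record (a) how often each child is the
    maximum-reward state and (b) the average maximum reward."""
    rng = random.Random(seed)
    wins = [0] * n_children
    total_best = 0.0
    for _ in range(n_samples):
        # Assumption: i.i.d. uniform rewards on [0, 1] at each child.
        rewards = [rng.random() for _ in range(n_children)]
        best = max(range(n_children), key=rewards.__getitem__)
        wins[best] += 1
        total_best += rewards[best]
    return [w / n_samples for w in wins], total_best / n_samples


win_fractions, mean_best = simulate_children(4)
print(win_fractions, mean_best)
```

With four children, each child's possibility should be optimal for roughly a quarter of the sampled reward functions, and the average maximum reward should be close to 4/5.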
Proof.
(11)  
(12)  
(13) 
We then calculate :
(14)  
(15)  
(16)  
(17)  
(18) 
∎
Lemma 17.
For states and , there exists a permutation matrix such that iff and are similar.
Proof.
Forward: let be the permutation corresponding to ; without loss of generality, assume is the identity on all states not reachable from either or . Observe that