As AI systems take on an increasingly pivotal decision-making role in human society, an important question arises: Whose values should a powerful decision-making machine be built to serve? (Bostrom, 2014)
Consider, informally, a scenario wherein two or more principals—perhaps individuals, companies, or states—are considering cooperating to build or otherwise obtain an “agent” that will then interact with an environment on their behalf. The “agent” here could be anything that follows a policy, such as a robot, a corporation, or a web-based AI system. In such a scenario, the principals will be concerned with the question of “how much” the agent will prioritize each principal’s interests, a question which this paper addresses quantitatively.
One might be tempted to model the agent as maximizing the expected value, given its observations, of some utility function $U$ of the environment that equals a weighted sum $w^{(1)} U^{(1)} + w^{(2)} U^{(2)}$ of the principals’ individual utility functions $U^{(1)}$ and $U^{(2)}$, as Harsanyi’s social aggregation theorem (Harsanyi, 1980) recommends. Then the question of prioritization could be reduced to that of choosing values for the weights $w^{(j)}$.
However, this turns out to be a suboptimal approach, from the perspective of the principals. As we shall see in Proposition 3, this solution form is not generally compatible with Pareto-optimality when agents have different beliefs. Harsanyi’s setting does not account for agents having different priors, nor for decisions being made sequentially, after future observations.
In such a setting, we need a new form of solution, exhibited in this paper. The solution is presented along with a recursion (Theorem 2) that characterizes solutions by a process algebraically similar to, but meaningfully different from, Bayesian updating. The updating process resembles a kind of bet-settling between the principals, which allows them each to expect to benefit from the veracity of their own beliefs.
Qualitatively, this phenomenon can be seen in isolation whenever two people make a bet on a piece of decision-irrelevant trivia. If neither Alice nor Bob would base any important decision on whether Michael Jackson was born in 1958 or 1959, they might still make a bet for $100 on the answer. For a person chosen to arbitrate the bet (their “agent”), Michael Jackson’s birth year now becomes a decision-relevant observation: it determines which of Alice and Bob gets the money!
Even in scenarios where differences in belief are not decision-irrelevant, one might expect some “degree” of bet-settling to arise from the disagreement. The main result of this paper (Theorem 2) is a precise formulation of exactly how and how much a Pareto-optimal agent will tend to prioritize each of its principals over time, as a result of differences in their implicit predictions about the agent’s observations.
This paper may be viewed as extending or complementing results in several areas:
Value alignment theory.
The “single principal” value alignment problem (that of aligning the value function of an agent with the values of a single human, or a team of humans in close agreement with one another) is already a very difficult one and should not be swept under the rug; approaches like inverse reinforcement learning (IRL) (Russell, 1998; Ng et al., 2000; Abbeel and Ng, 2004) and cooperative inverse reinforcement learning (CIRL) (Hadfield-Menell et al., 2016) have only begun to address it.
Social choice theory.
The whole of social choice theory and voting theory may be viewed as an attempt to specify an agreeable formal policy to enact on behalf of a group. Harsanyi’s utility aggregation theorem (Harsanyi, 1980) suggests one form of solution: maximizing a linear combination of group members’ utility functions. The present work shows that this solution is inappropriate when principals have different beliefs, and Theorem 2 may be viewed as an extension of Harsanyi’s form that accounts simultaneously for differing priors and the prospect of future observations. Indeed, Harsanyi’s form follows as a direct corollary of Theorem 2 when principals do share the same beliefs (Corollary 3).
The formal theory of bargaining, as pioneered by (Nash, 1950) and carried on by (Myerson, 1979), (Myerson, 2013), and (Myerson and Satterthwaite, 1983), is also topical. Future investigation in this area might be aimed at generalizing their work to sequential decision-making settings, and this author recommends a focus on research specifically targeted at resolving conflicts.
There is ample literature examining multi-agent systems using sequential decision-making models. Shoham and Leyton-Brown (2008) survey various models of multiplayer games using an MDP to model each agent’s objectives. Chapter 9 of the same text surveys social choice theory, but does not account for sequential decision-making.
Zhang and Shah (2014) may be considered a sequential decision-making approach to social choice: they use MDPs to represent the decisions of players in a competitive game, and exhibit an algorithm for the players that, if followed, arrives at a Pareto-optimal Nash equilibrium satisfying a certain fairness criterion. Among the literature surveyed here, that paper is the closest to the present work in terms of its intended application: roughly speaking, achieving mutually desirable outcomes via sequential decision-making. However, that work is concerned with an ongoing interaction between the players, rather than selecting a policy for a single agent to follow as in this paper.
Multi-objective sequential decision-making.
There is also a good deal of work on Multi-Objective Optimization (MOO) (Tzeng and Huang, 2011), including for sequential decision-making, where solution methods have been called Multi-Objective Reinforcement Learning (MORL). For instance, Gábor et al. (1998) introduce a MORL method called Pareto Q-learning for learning a set of Pareto-optimal policies for a Multi-Objective MDP (MOMDP). Soh and Demiris (2011) evolve policies for Multi-Reward POMDPs (MR-POMDPs); Roijers et al. (2015) refer to the same problems as Multi-Objective POMDPs (MOPOMDPs), and provide a bounded approximation method for the optimal solution set for all possible weightings of the objectives. Wang (2014) surveys MORL methods, and contributes Multi-Objective Monte-Carlo Tree Search (MOMCTS) for discovering multiple Pareto-optimal solutions to a multi-objective optimization problem. Wray and Zilberstein (2015) introduce Lexicographic Partially Observable Markov Decision Processes (LPOMDPs), along with two accompanying solution methods.
However, none of these or related works addresses scenarios where the objectives are derived from principals with differing beliefs, from which the priority-shifting phenomenon of Theorem 2 arises. Differing beliefs are likely to play a key role in negotiations, so for that purpose, the formulation of multi-objective decision-making adopted here is preferable.
Random variables are denoted by uppercase letters, e.g., $S$, and lowercase letters, e.g., $s$, are used as indices ranging over the values of a variable, as in the equation
$$\mathbb{E}[S] = \sum_{s} \mathbb{P}(s) \cdot s.$$
Given a set $A$, the set of probability distributions on $A$ is denoted $\Delta A$.
Sequences are denoted by overbars, e.g., given a sequence $(s_1, \ldots, s_n)$, $\bar{s}$ stands for the whole sequence. Subsequences are denoted by subscripted inequalities, so e.g., $s_{\le 4}$ stands for $(s_1, s_2, s_3, s_4)$, and $s_{<4}$ stands for $(s_1, s_2, s_3)$.
N.B.: All results in this paper generalize directly from agents with two principals to agents with several, but for clarity of exposition, the case of two principals will be prioritized.
Consider a scenario wherein Alice and Bob will share some cake, and have different predictions of the cake’s color. Even if the color would be decision-irrelevant for either Alice or Bob on their own (they don’t care what color the cake is), we will show that the difference between their predictions will tend to make the cake color a decision-relevant observation for a Pareto-optimal cake-splitting policy that is adopted before they see the cake. Specifically, we will show that Pareto-optimal policies tend to incorporate some degree of bet-settling between Alice and Bob, where the person who was more right about the color of the cake will end up getting more of it.
Serving multiple principals as a single POMDP
To formalize such scenarios, where a single agent acts on behalf of multiple principals, we need some definitions.
We encode each principal ’s view of the agent’s decision problem as a finite horizon POMDP, , which simultaneously represents that principal’s beliefs about the environment, and the principal’s utility function (see Russell et al. (2003) for an introduction to POMDPs). These symbols take on their usual meaning:
represents a set of possible states of the environment,
represents the set of possible actions available to the agent,
represents the conditional probabilities principal believes will govern the environment state transitions, i.e., ,
represents principal ’s utility function from sequences of environmental states to ; for the sake of generality, is not assumed to be additive over time, as reward functions often are,
represents the set of possible observations of the agent,
represents the conditional probabilities principal believes will govern the agent’s observations, i.e., , and
is the horizon (number of time steps)
This POMDP structure is depicted by the Bayesian network in Figure 1. (See Darwiche (2009) for an introduction to Bayesian networks.) At each point in time $i$, the agent has a time-specific policy $\pi_i$, which receives the agent’s history,
and returns a distribution on actions , which will then be used to generate an action with probability . Thus, principal ’s subjective probability of an outcome is given by a probability distribution that takes as a parameter:
Full-memory assumption. Every policy in this paper will be assumed to employ a “full memory”, so it decomposes into a sequence of policies for each time step. In Figure 1, the part of the Bayes net governed by the full-memory policy is highlighted in green.
Common knowledge assumptions.
It is assumed that the principals will have common knowledge of the (full-memory) policy they select for the agent to implement, but that the principals may have different beliefs about how the environment works, and of course different utility functions. It is also assumed that the principals have common knowledge of one another’s current beliefs at the time of the agent’s creation, which we refer to as their priors.
This last assumption is critical. During the agent’s creation, one should expect each principal’s beliefs to have updated somewhat in response to disagreements from the other. Assuming common knowledge of their priors means assuming the principals to have reached an equilibrium where, each knowing what the other believes, they do not wish to further update their own beliefs. [Footnote 1: It is enough to assume the principals have reached a “persistent disagreement” that cannot be mediated by the agent in some way.] Future work should design solutions for facilitating the process of attaining common knowledge, or to obviate the need to assume it.
A policy will be considered Pareto-optimal relative to a set of POMDPs it could be deployed to solve.
[Compatible POMDPs] We say that two POMDPs, $D^{(1)}$ and $D^{(2)}$, are compatible if any policy for one may be viewed as a policy for the other, i.e., they have the same set of actions and observations, and the same number of time steps.
In this context, where a single policy $\pi$ may be evaluated relative to more than one POMDP, we use superscripts to represent which POMDP is governing the probabilities and expectations, e.g., $\mathbb{E}^{(j)}[U^{(j)}; \pi]$ represents the expectation in $D^{(j)}$ of the utility function $U^{(j)}$, assuming policy $\pi$ is followed.
[Pareto-optimal policies] A policy $\pi$ is Pareto-optimal for a set of compatible POMDPs $(D^{(1)}, \ldots, D^{(k)})$ if for any other policy $\pi'$ and any $j$, $\mathbb{E}^{(j)}[U^{(j)}; \pi'] > \mathbb{E}^{(j)}[U^{(j)}; \pi]$ implies that $\mathbb{E}^{(i)}[U^{(i)}; \pi'] < \mathbb{E}^{(i)}[U^{(i)}; \pi]$ for some $i$.
It is assumed that, before the agent’s creation, the principals will be seeking a Pareto-optimal (full-memory) policy for the agent to follow, relative to the POMDPs describing each principal’s view of the agent’s task.
Example: cake betting
A quantitative model of a cake betting scenario is laid out in Table 1, and described as follows.
Alice (Principal 1) and Bob (Principal 2) are about to be presented with a cake which they can choose to split in half to share, or give entirely to one of them. They have (built or purchased) a robot that will make the cake-splitting decision on their behalf. Alice’s utility function returns 0 if she gets no cake, 20 if she gets half a cake, or 30 if she gets a whole cake. Bob’s utility function values Bob getting cake in the same way.
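Table 1: Utilities in the cake betting scenario.
| Cake color | Action (Alice’s share, Bob’s share) | Alice’s utility | Bob’s utility |
| red cake | (all, none) | 30 | 0 |
| green cake | (all, none) | 30 | 0 |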
However, Alice and Bob have different beliefs about the color of the cake. Alice is 90% sure that the cake is red, versus 10% sure it will be green, whereas Bob’s probabilities are reversed.
Upon seeing the cake, the robot must decide to either give Alice the entire cake (all, none), split the cake half-and-half (half, half), or give Bob the entire cake (none, all). Moreover, Alice and Bob have common knowledge of all these facts.
Now, consider the following Pareto-optimal full-memory policy $\hat{\pi}$ that favors Alice (Principal 1) when the cake is red, and Bob (Principal 2) when the cake is green: choose (all, none) if the cake is red, and (none, all) if the cake is green.
This policy can be viewed intuitively as a bet between Alice and Bob about the color of the cake, and is highly appealing to both principals: under their own beliefs, each expects a utility of $90\% \cdot 30 + 10\% \cdot 0 = 27$.
In particular, $\hat{\pi}$ is more appealing to both Alice and Bob than an agreement to deterministically split the cake (half, half), which would yield them each an expected utility of 20. However,
The Pareto-optimal strategy above cannot be implemented by any agent that naïvely maximizes a fixed-over-time linear combination of the conditionally expected utilities of the two principals. That is, it cannot be implemented by any policy satisfying
for some fixed . Moreover, every such policy is strictly worse than in expectation to one of the principals.
See appendix. ∎
This proposition is relatively unsurprising when one considers the full-memory policy intuitively as a bet-settling mechanism, because the nature of betting is to favor different preferences based on future observations. However, to be sure of this impossibility claim, one must rule out the possibility that the policy could be implemented by having the agent choose which element of the argmax in Equation 3 to use based on whether the cake appears red or green. (See appendix.)
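The appeal of the bet over the even split can be checked numerically. The following is a minimal sketch rather than the paper's own code: it assumes utilities of 0, 20 and 30 for no cake, half a cake and a whole cake, and 90%/10% beliefs about the cake color (reversed for Bob). The names `UTILITY`, `BELIEFS` and `expected_utilities` are hypothetical.

```python
# Assumed example parameters: utility of each share of the cake.
UTILITY = {"none": 0.0, "half": 20.0, "all": 30.0}

# Assumed subjective probabilities of the cake color for each principal.
BELIEFS = {
    "alice": {"red": 0.9, "green": 0.1},
    "bob": {"red": 0.1, "green": 0.9},
}


def expected_utilities(policy):
    """Return (Alice's, Bob's) expected utility, each under their own beliefs.

    `policy` maps a color to a dict {(alice_share, bob_share): probability}.
    """
    results = []
    for index, name in enumerate(("alice", "bob")):
        total = 0.0
        for color, p_color in BELIEFS[name].items():
            for action, p_action in policy[color].items():
                # action[index] is the share received by this principal.
                total += p_color * p_action * UTILITY[action[index]]
        results.append(total)
    return tuple(results)


# The bet: Alice gets everything if red, Bob gets everything if green.
bet_policy = {
    "red": {("all", "none"): 1.0},
    "green": {("none", "all"): 1.0},
}
# The deterministic even split, regardless of color.
split_policy = {
    "red": {("half", "half"): 1.0},
    "green": {("half", "half"): 1.0},
}

bet = expected_utilities(bet_policy)
split = expected_utilities(split_policy)
print("bet:", bet, "split:", split)
```

Under these assumed numbers, each principal expects more from the bet than from the split, because each evaluates the bet with their own probabilities.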
Characterizing Pareto-optimality geometrically
With the definitions above, we can characterize Pareto-optimality as a geometric condition.
Policy mixing assumption.
Given policies and a distribution , we assume that the agent may construct a new policy by choosing at time 0 between the with probability , and then executing the chosen policy for the rest of time. We write this policy as whence we derive:
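[Polytope Lemma] A full-memory policy $\pi$ is Pareto-optimal to principals 1 and 2 if and only if there exist weights $w^{(1)}, w^{(2)} \ge 0$ with $w^{(1)} + w^{(2)} = 1$ such that
$$\pi \in \arg\max_{\pi^*} \left( w^{(1)}\, \mathbb{E}^{(1)}[U^{(1)}; \pi^*] + w^{(2)}\, \mathbb{E}^{(2)}[U^{(2)}; \pi^*] \right).$$
The mixing assumption gives the set of policies the structure of a convex space that the maps $\pi \mapsto \mathbb{E}^{(j)}[U^{(j)}; \pi]$ respect by Equation 4. This ensures that the image of the map $f$ given by
$$f(\pi) := \left( \mathbb{E}^{(1)}[U^{(1)}; \pi],\ \mathbb{E}^{(2)}[U^{(2)}; \pi] \right)$$
is a closed, convex polytope. As such, a point $(x, y)$ lies on the Pareto boundary of $\mathrm{image}(f)$ if and only if there exist nonnegative weights $w^{(1)}, w^{(2)}$, not both zero, such that
$$(x, y) \in \arg\max_{(x^*, y^*) \in \mathrm{image}(f)} \left( w^{(1)} x^* + w^{(2)} y^* \right).$$
After normalizing $w^{(1)} + w^{(2)}$ to equal 1, this implies the result. ∎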
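The weighted-sum characterization can be illustrated on the cake example by brute force. This sketch assumes utilities 0/20/30 and 90%/10% beliefs (illustrative parameters), and restricts attention to the nine deterministic policies mapping a cake color to an action; since the weighted objective is linear in policy mixtures, a maximizer can always be found among these. The names `payoff` and `weighted_optimum` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumed utilities for each share of the cake.
U = {"none": 0, "half": 20, "all": 30}
# Actions are (Alice's share, Bob's share).
ACTIONS = [("all", "none"), ("half", "half"), ("none", "all")]
# Assumed probability of a red cake according to (Alice, Bob).
P_RED = (Fraction(9, 10), Fraction(1, 10))


def payoff(policy):
    """policy = (action if red, action if green) -> (E_alice, E_bob)."""
    red, green = policy
    return tuple(
        P_RED[j] * U[red[j]] + (1 - P_RED[j]) * U[green[j]]
        for j in (0, 1)
    )


# All deterministic policies from colors to actions.
POLICIES = list(product(ACTIONS, repeat=2))


def weighted_optimum(w):
    """Deterministic policies maximizing w * E_alice + (1 - w) * E_bob."""
    def score(policy):
        e_alice, e_bob = payoff(policy)
        return w * e_alice + (1 - w) * e_bob

    best = max(score(p) for p in POLICIES)
    return [p for p in POLICIES if score(p) == best]


# Sweep the weight on Alice to trace out Pareto-optimal policies.
frontier = {w: weighted_optimum(w) for w in (Fraction(k, 10) for k in range(11))}
bet = (("all", "none"), ("none", "all"))
for w, optima in frontier.items():
    print(w, optima)
```

Note that the weights here are applied to whole-policy expectations, each taken under the corresponding principal's own beliefs. With equal weights, the unique maximizer among the deterministic policies is the bet.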
Characterizing Pareto-optimality probabilistically
To help us apply the Polytope Lemma, we will adopt an interpretation wherein the weights are subjective probabilities for the agent, as follows.
For any , we define a new POMDP, , that works by flipping a -weighted coin, and then running or thereafter, according to the coin flip. We denote this by
and call a POMDP mixture. A formal definition of is given in the appendix. It can be depicted by a Bayes net by adding an additional environmental node for in the diagram of and (see Figure 2).
Given any full-memory policy , the expected payoff of in is exactly
Therefore, using the above definitions, Lemma 3 may be restated in the following equivalent form:
[Mixture Lemma] Given a pair of compatible POMDPs, a full-memory policy is Pareto-optimal for that pair if and only if there exists such that is an optimal full-memory policy for the single POMDP given by .
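As a sanity check on the mixture construction, the expected payoff in a POMDP mixture can be computed by enumerating the hidden coin flip. The sketch below uses the cake example with assumed utilities 0/20/30 and 90%/10% beliefs; `payoff_in` and `mixture_payoff` are hypothetical names, and the one-step setting is a simplification of the general definition.

```python
from fractions import Fraction

# Assumed utilities and beliefs for the cake example.
U = {"none": 0, "half": 20, "all": 30}
P_RED = (Fraction(9, 10), Fraction(1, 10))  # P(red) for principal 0 and 1


def payoff_in(j, policy):
    """Expected utility of principal j under principal j's own beliefs."""
    red, green = policy
    return P_RED[j] * U[red[j]] + (1 - P_RED[j]) * U[green[j]]


def mixture_payoff(w, policy):
    """Expected payoff in the mixture: flip a w-weighted coin, then run
    the selected principal's model and pay out that principal's utility."""
    total = Fraction(0)
    for j, p_coin in ((0, w), (1, 1 - w)):
        for color, p_color in (("red", P_RED[j]), ("green", 1 - P_RED[j])):
            action = policy[0] if color == "red" else policy[1]
            total += p_coin * p_color * U[action[j]]
    return total


bet = (("all", "none"), ("none", "all"))
w = Fraction(1, 3)
lhs = mixture_payoff(w, bet)
rhs = w * payoff_in(0, bet) + (1 - w) * payoff_in(1, bet)
print(lhs, rhs)
```

The two quantities agree for any weight and policy, which is why optimizing the single mixture is the same as optimizing the weighted sum of the principals' expectations.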
Expressed in the form of Equation 5, it might not be clear how a Pareto-optimal full-memory policy makes use of its observations over time, aside from storing them in memory. For example, is there any sense in which the agent carries “beliefs” about the environment that it “updates” at each time step? Lemma 2 allows us to reduce some such questions about Pareto-optimal policies to questions about single POMDPs.
If is an optimal full-memory policy for a single POMDP, the optimality of each action distribution can be characterized without reference to the previous policy components , nor to for any alternate history . This can be expressed using Pearl’s “” notation (Pearl, 2009): [“do” notation] The probability of causally conditioned on is defined as
[Expected utility abbreviation] For brevity, given any POMDP and policy , we write
i.e., the total expected utility in that would result from replacing by . This quantity does not depend on .
Proposition (Classical separability).
If is a POMDP described by conditional probabilities and utility function (as in Equation 2), then a full-memory policy is optimal for if and only if for each time step and each observation/action history , the action distribution satisfies the following backward recursion:
This characterization of does not refer to , nor to for any alternate history .
This is just Bellman’s Principle of Optimality. See (Bellman, 1957), Chap. III. 3. ∎
N.B.: Unlike Bellman’s “backup” equation, the above proposition requires no assumption whatsoever on the form of the utility function. Note also that when the probability term in the recursion is non-zero, it may be removed from the argmax without changing the theorem statement. But when the term is zero, its presence is essential, and implies that the action distribution for that history can be anything.
It turns out that Pareto-optimality can be characterized in a similar way by backward recursion from the final time step. The resulting recursion reveals a pattern in how the weights on the principals’ conditionally expected utilities must change over time, which is the main result of this paper:
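[Pareto-optimal control theorem] Given a pair of compatible POMDPs $(D^{(1)}, D^{(2)})$ with horizon $n$, a full-memory policy $\pi$ is Pareto-optimal if and only if its components $\pi_i$ for $i \le n$ satisfy the following backward recursion for some weights $w^{(1)}, w^{(2)} \ge 0$ with $w^{(1)} + w^{(2)} = 1$:
$$\pi_i(o_{\le i}, a_{<i}) \in \arg\max_{\alpha \in \Delta A} \sum_{j \in \{1, 2\}} w^{(j)}\, \mathbb{P}^{(j)}\!\left(o_{\le i} \mid \mathrm{do}(a_{<i})\right)\, \mathbb{E}^{(j)}\!\left[U^{(j)} \mid o_{\le i},\ \mathrm{do}(a_{<i}),\ a_i \sim \alpha;\ \pi_{>i}\right].$$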
In words, to achieve Pareto-optimality, the agent must
use each principal’s own world-model when estimating the degree to which a decision favors that principal’s utility function, and
shift the relative priority of each principal’s expected utility in the agent’s maximization target over time, by a factor proportional to how well that principal’s prior predicts the agent’s observations.
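The two requirements above can be sketched as a one-step decision rule. This is an illustrative sketch under stated assumptions, not a general POMDP solver: it takes each principal's conditional expected utilities as given inputs, and the cake numbers (utilities 0/20/30, beliefs 90%/10%) are assumed. The names `shifted_weights` and `pareto_action` are hypothetical.

```python
def shifted_weights(weights, likelihoods):
    """Scale each principal's weight by the probability that principal's
    prior assigned to the observations seen so far."""
    return [w * p for w, p in zip(weights, likelihoods)]


def pareto_action(weights, likelihoods, cond_expected_utils):
    """Pick an action maximizing the priority-shifted weighted sum.

    cond_expected_utils[action] lists, per principal, the expected utility
    conditional on the observations, computed with that principal's own
    world-model.
    """
    shifted = shifted_weights(weights, likelihoods)

    def score(action):
        return sum(s * u for s, u in zip(shifted, cond_expected_utils[action]))

    return max(cond_expected_utils, key=score)


# Cake example: utilities depend only on the split, not on the color.
utils = {
    ("all", "none"): (30.0, 0.0),
    ("half", "half"): (20.0, 20.0),
    ("none", "all"): (0.0, 30.0),
}
weights = [0.5, 0.5]

# Alice gave "red" probability 0.9, Bob 0.1; reversed for "green".
choice_if_red = pareto_action(weights, [0.9, 0.1], utils)
choice_if_green = pareto_action(weights, [0.1, 0.9], utils)
print(choice_if_red, choice_if_green)
```

With equal initial weights, the rule gives the whole cake to whichever principal better predicted the color, reproducing the bet-settling behavior. If both principals had assigned the same probability to the observation, the rule would pick the even split.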
N.B.: The analogous result for more than two POMDPs holds as well, with essentially the same proof.
Proof of Theorem 2.
By Lemma 2, the Pareto-optimality of for is equivalent to its classical optimality for for some . Writing for probabilities in D, Proposition 3 says this is equivalent to maximizing the following expression for each :
The expectation factor on the right equals
and applying Bayes’ rule yields that
hence the result. ∎
To see the necessity of the terms that shift the expectation weights in Theorem 2 over time, recall from Proposition 3 that, without these, some Pareto-optimal policies cannot be implemented. These terms are responsible for the “bet-settling” phenomena discussed in the introduction.
However, when the principals have the same beliefs, they always assign the same probability to the agent’s observations, so the weights on their respective valuations do not change over time. Hence, as a special instance, we derive:
Corollary (Harsanyi’s utility aggregation formula).
Suppose that principals 1 and 2 share the same beliefs about the environment, i.e., the pair of compatible POMDPs agree on all parameters except the principals’ utility functions . Then a full-memory policy is Pareto-optimal if and only if there exists such that for , satisfies
where denotes the shared expectations of both principals.
Setting in Theorem 2, factoring out the common coefficient , and applying linearity of expectation yields the result. ∎
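The factoring step in this proof can be illustrated with a few lines of arithmetic. The sketch below assumes the observation has positive probability under at least one principal's beliefs; `normalized_priorities` is a hypothetical helper, and the numbers are illustrative only.

```python
from fractions import Fraction


def normalized_priorities(weights, likelihoods):
    """Relative priorities after scaling each weight by the probability the
    corresponding principal assigned to the observations so far."""
    shifted = [w * p for w, p in zip(weights, likelihoods)]
    total = sum(shifted)  # assumed non-zero
    return [s / total for s in shifted]


initial = [Fraction(3, 4), Fraction(1, 4)]

# Shared beliefs: both principals assign the same probability, so the
# common factor cancels and the priorities are unchanged.
shared = normalized_priorities(initial, [Fraction(2, 5), Fraction(2, 5)])

# Differing beliefs: the better predictor gains relative priority.
differing = normalized_priorities(initial, [Fraction(9, 10), Fraction(1, 10)])
print(shared, differing)
```

Because an argmax is unaffected by a common positive factor, unchanged relative priorities correspond to maximizing a fixed linear combination of utilities.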
Theorem 2 exhibits a novel form for the objective of a sequential decision-making policy that is Pareto-optimal according to principals with differing beliefs.
This form represents two departures from naïve utility aggregation: to achieve Pareto-optimality for principals with differing beliefs, an agent must (1) use each principal’s own beliefs (updated on the agent’s observations) when evaluating how well an action will serve that principal’s utility function, and (2) shift the relative priority it assigns to each principal’s expected utilities over time, by a factor proportional to how well that principal’s prior predicts the agent’s observations.
Implications for contract design
Theorem 2 has implications for modeling and structuring the process of contract design. If a contract is being created between principals with different beliefs, then to the extent that the principals will target Pareto-optimality among them as an objective, there will be a tendency for the contract to end up implicitly settling bets between the principals. Perhaps making the bet-settling nature of Pareto-optimal contract design more explicit might help to design contracts that are more attractive to both principals, along the lines illustrated by Proposition 3. This could potentially lead to more successful negotiations, provided the principals remained willing to uphold the contract after its implicit bets have been settled.
Implications for shareable AI systems
As Proposition 3 illustrates, a Pareto-optimal policy of the form given in Theorem 2 is more attractive, from the perspective of the principals, than policies that do not account for differences in their beliefs. The relative attractiveness of shared ownership versus individual ownership of AI systems may be essential to the technological adoption of shared systems. Consider the following product substitutions that might be enabled by the development of shareable machine learning systems:
Office assistant software jointly controlled by a team, as an improvement over personal assistant software for each member of the team.
A team of domestic robots controlled by a family, as an improvement over individual robots each controlled by a separate family member.
A web-based security system shared by several interested companies or nations, as an improvement over individual security systems deployed by each group.
It may represent a significant technical challenge for any of these substitutions to become viable. However, machine learning systems that are able to approximate Pareto-optimality as an objective are more likely to be sufficiently appealing to motivate the switch from individual control to sharing.
Implications for bargaining versus racing
Consider two nations—allies or adversaries—who must decide whether to cooperate in the deployment of a very powerful and autonomous AI system.
If the nations cannot reach agreement as to what policy a jointly owned AI system should follow, joint ownership may be less attractive than building separate AI systems, one for each party. This could lead to an arms race between nations competing under time pressure to develop ever more powerful militarized AI systems. Under such race conditions, everyone loses, as each nation is afforded less time to ensure the safety and value alignment of its own system.
The first author’s primary motivation for this paper is to initiate a research program with the mission of averting such scenarios. Beginning work today on AI architectures that are more amenable to joint ownership could help lead to futures wherein powerful entities are more likely to share and less likely to compete for the ownership of such systems.
Insofar as Theorem 2 is not particularly mathematically sophisticated—it employs only basic facts about convexity and linear algebra—this suggests there may be more low-hanging fruit to be found in the domain of “machine implementable social choice theory”. Future work should address methods for helping the principals to share information—perhaps in exchange for adjustments to the weights in Theorem 2—to reach either a state of agreement or a persistent disagreement that allows the theorem to be applied. More ambitiously, bargaining models that account for a degree of transparency between the principals should be employed, as individual humans and institutions have some capacity for detecting one another’s intentions.
As well, scenarios where the principals continue to exhibit some active control over the system after its creation should be modeled in detail. In real life, principals usually continue to exist in their agents’ environments, and accounting for this will be a separate technical challenge.
As a final motivating remark, consider that social choice theory and bargaining theory were both pioneered during the Cold War, when it was particularly compelling to understand the potential for cooperation between human institutions that might behave competitively. In the coming decades, machine intelligence will likely bring many new challenges for cooperation, as well as new means to cooperate, and new reasons to do so. As such, new technical aspects of social choice and bargaining will likely continue to emerge.
Here we make available the technical details for defining POMDP mixtures, and proving that certain Pareto-optimal expectations cannot be obtained without priority-shifting.
[POMDP mixtures] Suppose that and are compatible POMDPs, with parameters . Define a new POMDP compatible with both, denoted , with parameters , as follows:
Environmental transition probabilities given by
for any initial state , and thereafter,
Hence, the value of will be constant over time, so a full history for the environment may be represented by a pair
Let denote the boolean random variable that equals whichever constant value of obtains, so then
The utility function is given by
The observation probabilities are given by
In particular, the agent does not observe directly whether or .
Proof of Proposition 3.
Suppose is any policy satisfying Equation 3 for some fixed , and consider the following cases for :
If , then must satisfy
Here, , so is strictly worse than in expectation to Alice.
If , then must satisfy
for some depending on . Here, (with equality when ), so is strictly worse than in expectation to Alice.
If , then must satisfy
Here, , so is strictly worse than in expectation to both Alice and Bob.
The remaining cases, and , are symmetric to the first two, with Bob in place of Alice and (none, all) in place of (all, none).
Hence, no fixed linear combination of the principals’ utility functions can be maximized to simultaneously achieve an expected utility of 27 for both players. ∎
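The case analysis can also be checked mechanically. The following sketch assumes utilities 0/20/30 and 90%/10% beliefs, and uses the fact that utilities in this example depend only on the action, so conditioning on the cake color does not change the fixed-weight objective. For each fixed weight it finds the maximizing actions, allows tie-breaking to depend on the color, and records the best achievable total of the two expectations. Reaching 27 for both principals would require a total of at least 54, and the total is linear in any randomization over tie-breaks, so checking deterministic tie-breaks suffices. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

U = {"none": 0, "half": 20, "all": 30}  # assumed utilities
ACTIONS = [("all", "none"), ("half", "half"), ("none", "all")]
P_RED = {"alice": Fraction(9, 10), "bob": Fraction(1, 10)}  # assumed beliefs


def maximizers(w):
    """Actions maximizing w * U_alice + (1 - w) * U_bob."""
    score = {a: w * U[a[0]] + (1 - w) * U[a[1]] for a in ACTIONS}
    best = max(score.values())
    return [a for a in ACTIONS if score[a] == best]


def expected(red_action, green_action):
    """(E_alice, E_bob), each under that principal's own beliefs."""
    e_alice = (P_RED["alice"] * U[red_action[0]]
               + (1 - P_RED["alice"]) * U[green_action[0]])
    e_bob = (P_RED["bob"] * U[red_action[1]]
             + (1 - P_RED["bob"]) * U[green_action[1]])
    return e_alice, e_bob


def best_total(w):
    """Largest E_alice + E_bob over color-dependent tie-breaking rules."""
    options = maximizers(w)
    return max(sum(expected(r, g)) for r, g in product(options, repeat=2))


# The grid includes the tie points 1/3 and 2/3; between them the set of
# maximizing actions does not change.
weights = [Fraction(k, 12) for k in range(13)]
worst_gap = min(54 - best_total(w) for w in weights)
print(worst_gap)
```

A strictly positive gap for every weight on the grid confirms, under these assumed numbers, that no fixed-weight maximizer gives both principals an expected utility of 27.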
- Abbeel and Ng  Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
- Bellman  Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ., 1957.
- Bostrom  Nick Bostrom. Superintelligence: Paths, dangers, strategies. OUP Oxford, 2014.
- Darwiche  Adnan Darwiche. Modeling and reasoning with Bayesian networks (Chapter 4). Cambridge University Press, 2009.
- Gábor et al.  Zoltán Gábor, Zsolt Kalmár, and Csaba Szepesvári. Multi-criteria reinforcement learning. In ICML, volume 98, pages 197–205, 1998.
- Hadfield-Menell et al.  Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning, 2016.
- Harsanyi  John C Harsanyi. Cardinal welfare, individualistic ethics, and interpersonal comparisons of utility. In Essays on Ethics, Social Behavior, and Scientific Explanation, pages 6–23. Springer, 1980.
- Myerson and Satterthwaite  Roger B Myerson and Mark A Satterthwaite. Efficient mechanisms for bilateral trading. Journal of economic theory, 29(2):265–281, 1983.
- Myerson  Roger B Myerson. Incentive compatibility and the bargaining problem. Econometrica: journal of the Econometric Society, pages 61–73, 1979.
- Myerson  Roger B Myerson. Game theory. Harvard university press, 2013.
- Nash  John F Nash. The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155–162, 1950.
- Ng et al.  Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Icml, pages 663–670, 2000.
- Pearl  Judea Pearl. Causality. Cambridge university press, 2009.
- Roijers et al.  Diederik M Roijers, Shimon Whiteson, and Frans A Oliehoek. Point-based planning for multi-objective pomdps. In IJCAI 2015: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, pages 1666–1672, 2015.
- Russell et al.  Stuart Russell, Peter Norvig, John F Canny, Jitendra M Malik, and Douglas D Edwards. Artificial intelligence: a modern approach (Chapter 17.1), volume 2. Prentice hall Upper Saddle River, 2003.
- Russell  Stuart Russell. Learning agents for uncertain environments. In Proceedings of the eleventh annual conference on Computational learning theory, pages 101–103. ACM, 1998.
- Shoham and Leyton-Brown  Yoav Shoham and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008.
- Soh and Demiris  Harold Soh and Yiannis Demiris. Evolving policies for multi-reward partially observable markov decision processes (mr-pomdps). In Proceedings of the 13th annual conference on Genetic and evolutionary computation, pages 713–720. ACM, 2011.
- Tzeng and Huang  Gwo-Hshiung Tzeng and Jih-Jeng Huang. Multiple attribute decision making: methods and applications. CRC press, 2011.
- Wang  Weijia Wang. Multi-objective sequential decision making. PhD thesis, Université Paris Sud-Paris XI, 2014.
- Wray and Zilberstein  Kyle Hollins Wray and Shlomo Zilberstein. Multi-objective pomdps with lexicographic reward preferences. In Proceedings of the 24th International Joint Conference of Artificial Intelligence (IJCAI), pages 1719–1725, 2015.
- Zhang and Shah  Chongjie Zhang and Julie A Shah. Fairness in multi-agent sequential decision-making. In Advances in Neural Information Processing Systems, pages 2636–2644, 2014.