1 Introduction
Reinforcement learning (RL) [27] has proved to be a powerful framework for solving sequential decision-making problems under uncertainty. For instance, RL has been used to build an expert backgammon player [28], an acrobatic helicopter pilot [1], and a human-level video game player [15]. RL is based on the Markov decision process (MDP) model [21]. In the standard setting, both MDP and RL rely on scalar numeric evaluations of actions (and thus of histories and policies). However, in practice, those evaluations may not be scalar or may not even be available.
Often actions are rather valued on several generally conflicting dimensions. For instance, in a navigation problem, these dimensions may represent duration, cost and length. This observation has led to the extension of MDP and RL to multiobjective MDP (MOMDP) and RL (MORL) [24]. In multiobjective optimization, it is possible to distinguish three interpretations for objectives. The first one corresponds to single-agent decision-making problems where actions are evaluated on different criteria, like in the navigation example. The second comes up when the effects of actions are uncertain and one then also wants to optimize objectives that correspond to probability of success or risk, for instance. The last interpretation is in multiagent settings where each objective represents the payoff received by a different agent. Of course, in one particular multiobjective problem, one may encounter objectives with different interpretations.
More generally, sometimes no numerical evaluation of actions is available at all. In this case, inverse reinforcement learning (IRL) [16] has been proposed as an approach to learn a reward function from demonstrations provided by a human expert who is assumed to use an optimal policy. However, this assumption may be problematic in practice as humans are known not to act optimally. A different approach, qualified as preference-based, takes as initial preferential information the comparisons of actions or histories instead of a reward function. This direction has been explored in the MDP setting [12] and the RL setting where it is called preference-based RL (PBRL) [2, 9].
This theoretical paper presents a short overview of some recent work on multiobjective and preference-based sequential decision-making with the goal of relating those two research strands. The contribution of this paper is threefold. We build a bridge between preference-based RL and multiobjective RL, and highlight new possible approaches for both settings. In particular, our observation offers a new interpretation of an objective, which yields a new source of multiobjective problems.
The paper is organized as follows. In Section 2, we recall the definition of standard MDP/RL, their extensions to the multiobjective setting and their generalizations to the preference-based setting. In Section 3, we show how MORL can be viewed as a PBRL problem. This then allows the methods developed for PBRL to be imported to the MORL setting. Conversely, in Section 4, we show how some structured PBRL problems can be viewed as MORL problems, which then justifies the application of MORL techniques to those PBRL problems. Finally, we conclude in Section 5.
2 Background and Related Work
In this section, we recall the definitions needed in the following sections while presenting a short review of related work. We start with the reinforcement learning setting (Section 2.1) and then present its extension to the multiobjective setting (Section 2.2) and to the preference-based setting (Section 2.3).
2.1 Reinforcement Learning
A reinforcement learning problem is usually defined using the Markov decision process (MDP) model. A standard MDP [21] is defined as a tuple $\langle S, A, T, R \rangle$ where:

$S$ is a finite set of states,

$A$ is a finite set of actions,

$T : S \times A \times S \to [0, 1]$ is a transition function with $T(s, a, s')$ being the probability of reaching state $s'$ after executing action $a$ in state $s$,

$R : S \times A \to \mathbb{R}$ is a reward function with $R(s, a)$ being the immediate numerical environmental feedback received by the agent after performing action $a$ in state $s$.
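In this framework, a $t$-step history $h_t$ is a sequence of state-action pairs:
$$h_t = (s_0, a_0, s_1, \ldots, s_t)$$
where $s_i \in S$ for $i = 0, \ldots, t$ and $a_i \in A$ for $i = 0, \ldots, t-1$. The value of such a history $h_t$ is defined as:
$$r(h_t) = \sum_{i=0}^{t-1} \gamma^i R(s_i, a_i)$$
where $\gamma \in [0, 1)$ is a discount factor. A policy specifies how to choose an action in every state. A deterministic policy $\pi : S \to A$ is a function from the set of states to the set of actions, while a randomized policy $\pi(a \mid s)$ states the probability of choosing an action $a$ in a state $s$.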
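As a small illustration of how a history is valued by its discounted sum of rewards, the following sketch computes that sum for a made-up reward sequence. The function name `history_value`, the rewards and the discount factor are assumptions chosen for the example, not values from the paper.

```python
def history_value(rewards, gamma):
    """Discounted sum of the rewards collected along one history."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))


# Hypothetical 3-step history whose actions yielded rewards 1, 0 and 2.
rewards = [1.0, 0.0, 2.0]
value = history_value(rewards, gamma=0.5)  # 1 + 0.5 * 0 + 0.25 * 2
print(value)  # 1.5
```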
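The value function of a policy $\pi$ in a state $s$ is defined as:
$$v^\pi(s) = \mathbb{E}^\pi_s\Big[\sum_{t=0}^{\infty} \gamma^t R_t\Big] \tag{1}$$
where $R_t$ is a random variable defining the reward received at time $t$ under policy $\pi$ and starting in state $s$. Equation (1) can be computed iteratively as the limit of the following sequence: for all $s \in S$,
$$v^\pi_0(s) = 0 \tag{2}$$
$$v^\pi_{t+1}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, v^\pi_t(s') \tag{3}$$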
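The iterative computation of a policy's value function can be sketched as follows for a deterministic policy. The two-state MDP, the nested-dictionary encoding of the transition and reward functions, and the fixed number of iterations are all assumptions made for this example.

```python
def evaluate_policy(states, T, R, policy, gamma, n_iter=200):
    """Iterate v_{t+1}(s) = R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') v_t(s')."""
    v = {s: 0.0 for s in states}  # v_0 = 0
    for _ in range(n_iter):
        v = {
            s: R[s][policy[s]]
            + gamma * sum(p * v[s2] for s2, p in T[s][policy[s]].items())
            for s in states
        }
    return v


# Hypothetical two-state MDP: T[s][a] maps next states to probabilities.
states = ["s0", "s1"]
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
policy = {"s0": "go", "s1": "stay"}

v = evaluate_policy(states, T, R, policy, gamma=0.5)
print(v)  # close to {'s0': 3.0, 's1': 4.0}
```

With a discount factor of 0.5, staying in `s1` forever is worth 2 / (1 - 0.5) = 4, and going there from `s0` is worth 1 + 0.5 * 4 = 3, which is what the iteration converges to.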
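In a standard MDP, an optimal policy can be obtained by solving Bellman's optimality equations: for all $s \in S$,
$$v^*(s) = \max_{a \in A} \Big( R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, v^*(s') \Big) \tag{4}$$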
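One standard way to solve Bellman's optimality equations is value iteration. The sketch below applies it to a made-up two-state MDP; the encoding of the model as nested dictionaries and the stopping tolerance are assumptions of this example rather than choices made in the paper.

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-10):
    """Repeat the Bellman optimality backup until the values stop changing."""
    v = {s: 0.0 for s in states}
    while True:
        q = {
            s: {
                a: R[s][a] + gamma * sum(p * v[s2] for s2, p in T[s][a].items())
                for a in actions
            }
            for s in states
        }
        new_v = {s: max(q[s].values()) for s in states}
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            greedy = {s: max(q[s], key=q[s].get) for s in states}
            return new_v, greedy
        v = new_v


# Hypothetical two-state MDP: T[s][a] maps next states to probabilities.
states = ["s0", "s1"]
actions = ["stay", "go"]
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}

v_star, greedy = value_iteration(states, actions, T, R, gamma=0.5)
print(greedy)  # {'s0': 'go', 's1': 'stay'}
```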
Many solution methods can be used [21] to solve this problem exactly: for instance, value iteration, policy iteration, and linear programming. Approaches based on approximating the value function for solving problems with large state spaces have also been proposed [27].

Classically, in reinforcement learning (RL), it is assumed that the agent does not know the transition and reward functions. In that case, an optimal policy has to be learned by interacting with the environment. Two main approaches can be distinguished here [27]. The first (called the indirect or model-based method) tries to first estimate the transition and reward functions and then uses an MDP solving method on the learned environment model (e.g., [26]). The second (called the direct or model-free method) searches for an optimal policy without trying to learn a model of the environment.

The preference model that describes how policies are compared in standard MDP/RL is defined as follows. A history is valued by the discounted sum of rewards obtained along that history. Then, as a policy in a state induces a probability distribution over histories, it also induces a probability distribution over discounted sums of rewards. The decision criterion used to compare policies in standard MDP is then based on the expectation criterion.
Both MDP and RL assume that the environmental feedback from which the agent plans/learns a (near) optimal policy is a scalar numeric reward value. In many settings, this assumption does not hold. The value of an action may be determined over several often conflicting dimensions. For instance, in the autonomous navigation problem, an action lasts a certain duration, has an energy consumption cost and travels a certain length. To tackle those situations, MDP and RL have been extended to deal with vectorial rewards.
2.2 Multiobjective RL
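A multiobjective MDP (MOMDP) [24] is an MDP where the reward function is redefined as $\vec{R} : S \times A \to \mathbb{R}^d$ with $d$ being the number of objectives. The value function $\vec{v}^\pi$ of a policy $\pi$ is now vectorial and can be computed as the limit of the vectorial version of (2) and (3): for all $s \in S$,
$$\vec{v}^\pi_0(s) = (0, \ldots, 0) \in \mathbb{R}^d \tag{5}$$
$$\vec{v}^\pi_{t+1}(s) = \vec{R}(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, \vec{v}^\pi_t(s') \tag{6}$$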
In MOMDP, the value function $\vec{v}^\pi$ of policy $\pi$ Pareto-dominates that of another policy $\pi'$ if in every state $s$, $\vec{v}^\pi(s)$ is not smaller than $\vec{v}^{\pi'}(s)$ on every objective and is greater than $\vec{v}^{\pi'}(s)$ on at least one objective. By extension, we say that $\pi$ Pareto-dominates $\pi'$ if value function $\vec{v}^\pi$ Pareto-dominates value function $\vec{v}^{\pi'}$. A value function (resp. policy) is Pareto-optimal if it is not Pareto-dominated by any other value function (resp. policy). Due to incomparability of vectorial value functions, there are generally many Pareto-optimal value functions (and therefore policies), which constitutes the main difficulty of the multiobjective setting.
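To make Pareto dominance concrete, the sketch below compares value vectors and filters out the dominated ones. As a simplifying assumption, each policy is summarized by its value vector in a single state (for instance a fixed initial state), whereas the definition above quantifies over all states; the example vectors are made up.

```python
def pareto_dominates(u, v):
    """True if u is at least as good as v on every objective and better on one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(vectors):
    """Keep the vectors that are not Pareto-dominated by any other vector."""
    return [u for u in vectors if not any(pareto_dominates(w, u) for w in vectors)]


# Hypothetical value vectors of five policies on two objectives.
values = [(3, 1), (2, 2), (1, 3), (1, 1), (2, 1)]
print(pareto_front(values))  # [(3, 1), (2, 2), (1, 3)]
```

The three surviving vectors are mutually incomparable, which illustrates why several Pareto-optimal policies generally coexist.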
Similarly to standard MDP, MOMDP can be extended to multiobjective reinforcement learning (MORL), in which case the agent is assumed to know neither the transition function nor the vectorial reward function.
In multiobjective optimization, four main families of approaches can be distinguished. One first natural approach is to determine the set of all Pareto-optimal solutions (e.g., [33, 14]). However, in practice, searching for all the Pareto-optimal solutions may not be feasible. Indeed, it is known [20] that the size of this set can be exponential in the size of the state and action spaces. A more practical approach is then to determine an $\epsilon$-cover of it [7, 20], which is an approximation of the set of Pareto-optimal solutions.
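Definition 2.1
A set $C \subseteq \mathbb{R}^d$ is an $\epsilon$-cover of a set $P \subseteq \mathbb{R}^d$ if
$$\forall v \in P, \ \exists v' \in C, \ (1 + \epsilon)\, v' \geq v$$
where $\epsilon > 0$ and $\geq$ is understood componentwise.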
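The following sketch checks whether a candidate set is an approximate cover of a set of value vectors. It assumes the common multiplicative reading of the definition, in which every vector of the set must be dominated componentwise by some vector of the cover scaled by (1 + epsilon), and it assumes nonnegative vectors; the function name and the data are hypothetical.

```python
def is_epsilon_cover(cover, points, eps):
    """Check that each point is componentwise below (1 + eps) * c for some c in cover."""
    return all(
        any(all((1 + eps) * ci >= pi for ci, pi in zip(c, p)) for c in cover)
        for p in points
    )


# Hypothetical Pareto-optimal value vectors and a smaller candidate cover.
points = [(10, 1), (9, 2), (1, 10)]
cover = [(10, 1), (1, 10)]

print(is_epsilon_cover(cover, points, eps=1.0))  # True
print(is_epsilon_cover(cover, points, eps=0.1))  # False
```

With a loose tolerance the two extreme vectors are enough to represent the whole set, while a tighter tolerance requires keeping the intermediate vector as well.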
Another approach related to the first one is to consider refinements of Pareto dominance, such as Lorenz dominance (which models a certain notion of fairness) or lexicographic order [34, 10]. Even with Lorenz dominance, the size of the set of optimal value functions may still be exponential in the size of the state and action spaces. Again, one may therefore prefer to determine its $\epsilon$-cover [20] in practice.
Still another approach to solving multiobjective problems is to assume the existence of a scalarizing function $f : \mathbb{R}^d \to \mathbb{R}$, which, given a vector $x \in \mathbb{R}^d$, returns a scalar valuation. Two cases can be considered: $f$ can be either linear [3] or nonlinear [19, 18, 17].

The scalarizing function can be used at three different levels:

It can be directly applied on the vectorial reward function leading to the definition of a scalarized reward function. This boils down to defining a standard MDP/RL from a MOMDP/MORL, which can then be tackled with standard solving methods.

It can also aggregate the different objectives of the vector values of histories and then a policy in a state can be valued by taking the expectation of those scalarized evaluations of histories.

It can be applied on the vectorial value functions of policies in order to obtain scalar value functions.
For linear scalarizing functions, those three levels lead to the same solutions. However, for nonlinear scalarizing functions, they generally lead to different solutions. In practice, it generally only makes sense to use a nonlinear scalarizing function on expected discounted sum of vector rewards (i.e., vector value functions), as the scalarizing function is normally defined to aggregate over the final vector values. To the best of our knowledge, most previous work has applied a scalarizing function in this fashion. In Section 3, we describe a setting where applying a nonlinear scalarizing function on vector values of histories could be justified.
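The difference between applying a scalarizing function to the vector values of histories and applying it to the vector value function can be seen on a tiny example. In the sketch below, a policy is summarized by a made-up distribution over the vector values of its histories, the linear function uses equal weights, and `min` stands in for a nonlinear scalarizing function; all of these are assumptions chosen for illustration.

```python
def expectation(outcomes):
    """Componentwise expectation of a list of (probability, vector) pairs."""
    d = len(outcomes[0][1])
    return tuple(sum(p * x[i] for p, x in outcomes) for i in range(d))


def linear(weights):
    """Build a linear scalarizing function from a weight vector."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


# Hypothetical policy: two equally likely histories with opposite strengths.
outcomes = [(0.5, (2.0, 0.0)), (0.5, (0.0, 2.0))]

f_lin = linear((0.5, 0.5))
lin_on_histories = sum(p * f_lin(x) for p, x in outcomes)
lin_on_value = f_lin(expectation(outcomes))

min_on_histories = sum(p * min(x) for p, x in outcomes)
min_on_value = min(expectation(outcomes))

print(lin_on_histories, lin_on_value)  # 1.0 1.0
print(min_on_histories, min_on_value)  # 0.0 1.0
```

The linear function commutes with the expectation, so both levels agree, while the nonlinear one gives 0 on histories and 1 on the expected vector.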
A final approach to multiobjective problems assumes an interactive setting where a human expert is present and can provide additional preferential information (i.e., how to trade off between different objectives). This approach loops between the following two steps until a certain criterion is satisfied (e.g., the expert is satisfied with a proposed solution or there is only one solution left):

show potential solutions or ask a query to the expert,

receive feedback or an answer from the expert.

The feedback or answer from the expert makes it possible to guide the search for a preferred solution among all Pareto-optimal ones [25], or to elicit unknown parameters of the user preference model [22].
In both standard MDP/RL and MOMDP/MORL, it is assumed that numeric environmental feedback is available. However, this may not be the case in some situations. For instance, in the medical domain, it may be difficult or even impossible to value a treatment of a life-threatening illness in terms of patient well-being or death with a single numeric value. Preference-based approaches have been proposed to handle these situations.
2.3 Preference-Based RL
A preference-based MDP (PBMDP) is an MDP where possibly no reward function is given. Instead, one assumes that a preference relation $\succsim$ is defined over histories. In the case where the dynamics of the system is not known, this setting is referred to as preference-based reinforcement learning (PBRL) [9, 2, 4, 6]. Because this preferential information is ordinal, it is not possible to directly use the same decision criterion based on expectation as in the standard or multiobjective cases. Most approaches in PBRL [9, 4, 6] rely on comparing policies with probabilistic dominance, which is defined as follows:
$$\pi \succsim \pi' \iff \mathbb{P}[\pi \succsim \pi'] \geq \mathbb{P}[\pi' \succsim \pi] \tag{7}$$
where $\mathbb{P}[\pi \succsim \pi']$ denotes the probability that policy $\pi$ generates a history preferred or equivalent to that generated by policy $\pi'$. Probabilistic dominance is related to Condorcet methods (where a candidate is preferred to another if more voters prefer the former to the latter) in social choice theory. This is why the optimal policy for probabilistic dominance is often called a Condorcet winner.
The difficulty with this decision model is that it may lead to preference cycles (i.e., $\pi \succ \pi' \succ \pi'' \succ \pi$) [12]. To tackle this issue, three approaches have been considered. The first approach simply consists in assuming some consistency conditions that forbid the occurrence of preference cycles. This is the case in the seminal paper [35] that proposed the framework of dueling bandits. This setting is the preference-based version of the multiarmed bandit, which is itself a special case of reinforcement learning. The second approach consists in considering stronger versions of (7). Drawing from voting rules studied in social choice theory, refinements such as Copeland's rule or Borda's rule have been considered [5, 6]. The last approach, which was proposed recently [12, 8], consists in searching for an optimal mixed policy (here the randomization is over policies, not over actions as in randomized policies) instead of an optimal deterministic policy, which may not exist. Drawing from the minimax theorem in game theory, it can be shown that an optimal mixed policy is guaranteed to exist.
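Probabilistic dominance and the preference cycles it can produce are easy to reproduce on a toy example. In the sketch below, each policy is summarized by a distribution over histories, and histories are encoded by integer ranks where a higher rank means a more preferred history. This encoding and the three distributions (borrowed from the classic nontransitive dice construction) are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def prob_at_least_as_good(pi, pi_prime):
    """P[pi >= pi']: a history drawn from pi is preferred or equivalent
    to an independent history drawn from pi_prime."""
    total = Fraction(0)
    for (h, p), (h2, q) in product(pi.items(), pi_prime.items()):
        if h >= h2:
            total += p * q
    return total


def strictly_dominates(pi, pi_prime):
    """Strict version of probabilistic dominance."""
    return prob_at_least_as_good(pi, pi_prime) > prob_at_least_as_good(pi_prime, pi)


# Hypothetical policies: history rank -> probability.
third = Fraction(1, 3)
pi_a = {2: third, 4: third, 9: third}
pi_b = {1: third, 6: third, 8: third}
pi_c = {3: third, 5: third, 7: third}

print(prob_at_least_as_good(pi_a, pi_b))  # 5/9
cycle = (
    strictly_dominates(pi_a, pi_b),
    strictly_dominates(pi_b, pi_c),
    strictly_dominates(pi_c, pi_a),
)
print(cycle)  # (True, True, True)
```

Each policy beats the next one with probability 5/9, so no deterministic choice among the three is a Condorcet winner, which is the situation that motivates mixed policies.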
3 MORL as PBRL
An MOMDP/MORL problem can obviously be seen as a PBMDP/PBRL problem. Indeed, the preference relation over histories can simply be taken as the preference relation induced over histories by Pareto dominance. Then probabilistic dominance (7) in this setting can be interpreted as follows. A policy $\pi$ is preferred to another policy $\pi'$ if the probability that $\pi$ generates a history that Pareto-dominates a history generated by $\pi'$ is higher than the probability of the opposite event. A minor issue in this formulation is that incomparability is treated in the same way as equivalence.
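More interestingly, when a scalarizing function $f$ is given, scalarized values of histories can then be used and compared in (7), leading to:
$$\pi \succsim \pi' \iff \mathbb{P}\big[f(\vec{r}(H^\pi)) \geq f(\vec{r}(H^{\pi'}))\big] \geq \mathbb{P}\big[f(\vec{r}(H^{\pi'})) \geq f(\vec{r}(H^\pi))\big]$$
where $H^\pi$ (resp. $H^{\pi'}$) is a random history generated by policy $\pi$ (resp. $\pi'$) and $\vec{r}(H^\pi)$ (resp. $\vec{r}(H^{\pi'})$) is its vectorial value. Notably, this setting motivates the application of a nonlinear scalarizing function on vector values of histories, which has not been investigated before [24].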
More generally, viewing MOMDP/MORL as a PBMDP/PBRL, one can import all the techniques and solving methods that have been developed in the preference-based settings [9, 6, 12]. As far as we know, neither of the two cases above (with Pareto dominance or with a scalarizing function) has been investigated. We expect that efficient solving algorithms exploiting the additively decomposable vector rewards could be designed by adapting PBMDP/PBRL algorithms.
When transforming a multiobjective problem into a preference-based problem, the decision criterion generally has to be changed from one based on expectation to one based on probabilistic dominance. This change may be justified for different reasons. For instance, when it is known in advance that an agent is going to face the decision problem only a limited number of times, the expectation criterion may not be suitable because it does not take into account notions of variability and risk attitudes. Besides, when the decision problem really corresponds to a competitive setting, probabilistic dominance is particularly well-suited.
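The comparisons discussed in this section can be sketched on a toy example where each policy is summarized by a distribution over the vector values of its histories. The distributions, the helper `prob`, and the use of `min` as a nonlinear scalarizing function are assumptions made for illustration; the example is built so that the expectation criterion and probabilistic dominance disagree.

```python
from itertools import product


def pareto_dominates(u, v):
    """True if u is at least as good as v everywhere and better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def prob(event, pi, pi_prime):
    """Probability of event(h, h') for independent h ~ pi and h' ~ pi_prime."""
    pairs = product(pi.items(), pi_prime.items())
    return sum(p * q for (h, p), (h2, q) in pairs if event(h, h2))


# Hypothetical policies: vector value of a history -> probability.
pi = {(3.0, 3.0): 0.6, (0.0, 0.0): 0.4}
pi_prime = {(2.0, 2.0): 1.0}

# Probabilistic dominance with Pareto dominance over histories.
p_win = prob(pareto_dominates, pi, pi_prime)
p_lose = prob(lambda h, h2: pareto_dominates(h2, h), pi, pi_prime)

# Probabilistic dominance with a nonlinear scalarizing function on histories.
f = min
p_f = prob(lambda h, h2: f(h) >= f(h2), pi, pi_prime)
p_f_rev = prob(lambda h, h2: f(h2) >= f(h), pi, pi_prime)

# Expectation criterion on the same data.
expected = tuple(sum(p * h[i] for h, p in pi.items()) for i in range(2))

print(p_win > p_lose, p_f > p_f_rev)  # True True
print(expected[0] < 2.0)              # True
```

Here `pi` is preferred under probabilistic dominance because it produces the better history 60% of the time, whereas `pi_prime` has the higher expected vector value.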
4 PBRL as MORL
While viewing MOMDP/MORL as PBMDP/PBRL is quite natural, the other way around may be less obvious and more interesting. We therefore develop this direction in more detail by focusing on one particular case of PBMDP/PBRL where the preference relation over histories is assumed to be representable by an additively decomposable utility function and the decision criterion is based on expectation (e.g., as assumed in inverse reinforcement learning [16]). This amounts to assuming the existence of a reward function $R : S \times A \to \{r_1, \ldots, r_k\}$ where the $r_i$'s are unknown scalar numeric reward values. Exploiting this assumption, we present two cases where PBMDP/PBRL can be transformed into MOMDP/MORL, and justify the use of one scalarizing function, the Chebyshev norm, on the MOMDP/MORL model obtained from a PBMDP/PBRL model.
4.1 From Unknown Rewards to Vectorial Rewards
The first transformation assumes that an order over unknown rewards is known, while the second assumes more generally that an order over some histories is known.
4.1.1 Ordered Rewards
In the first case, we assume that we know the order over the $r_i$'s. Without loss of generality, we assume that $r_1 < r_2 < \cdots < r_k$.
Following previous work [29, 30], it is possible to transform a PBMDP into an MDP with vector rewards by defining the following vectorial reward function $\hat{R}$ from $R$:
$$\hat{R}(s, a) = e_i \quad \text{if } R(s, a) = r_i \tag{8}$$
where $e_i$ is the $i$th canonical vector of $\mathbb{R}^k$. Using $\hat{R}$, one can compute the vector value function $\hat{v}^\pi$ of a policy $\pi$ by adapting (5) and (6). The $i$th component of a vector value function of a policy $\pi$ in a state $s$ can be interpreted as the expected discounted count of reward $r_i$ obtained when applying policy $\pi$. However, note that because of the preferential order over components, two vectors cannot be directly compared with Pareto dominance. Another transformation is needed to obtain a usual MOMDP.
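Given a vector $x \in \mathbb{R}^k$, we define its decumulative $x^{\downarrow} \in \mathbb{R}^k$ as follows:
$$x^{\downarrow}_i = \sum_{j=i}^{k} x_j, \quad i = 1, \ldots, k$$
A PBMDP/PBRL can be reformulated as the following MOMDP/MORL where the reward function $\vec{R}$ is defined by:
$$\vec{R}(s, a) = \hat{R}(s, a)^{\downarrow} \tag{9}$$
Using this reward function, the vector value function $\vec{v}^\pi$ of a policy $\pi$ can be computed by adapting (5) and (6). One may notice that $\vec{v}^\pi(s)$ is the decumulative vector computed from $\hat{v}^\pi(s)$.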
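The relations between the standard value function $v^\pi$ and the vectorial value functions $\hat{v}^\pi$ and $\vec{v}^\pi$ are stated in the following lemma.
Lemma 4.1
We have:
$$v^\pi(s) = \hat{v}^\pi(s) \cdot r = \vec{v}^\pi(s) \cdot r^{\Delta}$$
where $r = (r_1, \ldots, r_k)$, $r^{\Delta} = (r_1, r_2 - r_1, \ldots, r_k - r_{k-1})$ and $x \cdot y$ denotes the inner product of vector $x$ and vector $y$.
It is then easy to see that if $\vec{v}^\pi$ Pareto-dominates $\vec{v}^{\pi'}$ then $v^\pi \geq v^{\pi'}$ thanks to the order over the $r_i$'s.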
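The ordered-rewards transformation can be checked numerically on a small example. The sketch below assumes three reward values that are hidden from the agent but satisfy the known order, takes a made-up vector of expected discounted counts of each reward, and verifies that the scalar value can be recovered either from the counts and the rewards or from the decumulative vector and the differences of consecutive rewards. All names and numbers are hypothetical.

```python
def decumulative(x):
    """Return the vector whose i-th entry is x[i] + x[i+1] + ... + x[-1]."""
    out, total = [], 0.0
    for xi in reversed(x):
        total += xi
        out.append(total)
    return out[::-1]


def dot(x, y):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(x, y))


# Hypothetical hidden rewards with r_1 < r_2 < r_3 (unknown to the agent).
r = [1.0, 4.0, 6.0]
r_delta = [r[0]] + [r[i] - r[i - 1] for i in range(1, len(r))]  # [1, 3, 2]

# Hypothetical expected discounted counts of each reward under some policy.
v_hat = [0.5, 1.0, 0.5]
v_vec = decumulative(v_hat)

print(v_vec)                # [2.0, 1.5, 0.5]
print(dot(v_hat, r))        # 7.5
print(dot(v_vec, r_delta))  # 7.5
```

Since every difference of consecutive rewards is positive, increasing any component of the decumulative vector can only increase the scalar value, which is why Pareto dominance on these vectors is meaningful even though the rewards are unknown.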
4.1.2 Ordered Histories
In some situations, the order over unknown rewards may not be known and may not be easily determined. For instance, in a navigation problem, it may not be obvious how to compare each action locally. However, comparing trajectories may be more natural and easier to perform for the system designer. Note that although the vectorial reward function $\hat{R}$ in (8) can be defined, without the order over the rewards $r_i$'s, the vectorial reward function $\vec{R}$ in (9) (and thus the corresponding MOMDP/MORL) cannot be defined anymore.
In those cases, if sufficient preferential information over histories is given, the previous trick can be adapted using simple linear algebra. We now present this new transformation from PBMDP/PBRL to MOMDP/MORL. We assume that the following comparisons are available:
$$h_1 \prec h_2 \prec \cdots \prec h_k \tag{10}$$
where the $h_i$'s are histories. Using the vector reward $\hat{R}$, one can compute the vector value of each history, i.e., if $h = (s_0, a_0, s_1, \ldots, s_t)$ then its value is defined by:
$$\hat{r}(h) = \sum_{i=0}^{t-1} \gamma^i \hat{R}(s_i, a_i)$$
We assume that $\hat{r}(h_1), \ldots, \hat{r}(h_k)$ form an independent set, which implies that the matrix $H$ whose columns are composed of $\hat{r}(h_1), \ldots, \hat{r}(h_k)$ is invertible. Recall that $H$ is the basis change matrix from the basis $(\hat{r}(h_1), \ldots, \hat{r}(h_k))$ to the canonical basis and its inverse matrix $H^{-1}$ is the basis change matrix in the other direction. Rewards $r_i$'s represented by the canonical basis can then be expressed in the basis formed by the independent vectors using the basis change matrix $H^{-1}$. Now, let us define a new vector reward function $\vec{R}_H$ by:
$$\vec{R}_H(s, a) = (H^{-1}_i)^{\downarrow} \quad \text{if } R(s, a) = r_i \tag{11}$$
where $(H^{-1}_i)^{\downarrow}$ is the decumulative of the $i$th column of matrix $H^{-1}$. Using this new reward function, one can define the vector value function $\vec{v}^\pi_H$ of a policy $\pi$ by adapting (5) and (6).
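Lemma 4.2
We have:
$$v^\pi(s) = \vec{v}^\pi_H(s) \cdot \rho^{\Delta}$$
where $\rho_i$ is the value of history $h_i$, i.e., $\rho_i = \hat{r}(h_i) \cdot r$, and $\rho^{\Delta} = (\rho_1, \rho_2 - \rho_1, \ldots, \rho_k - \rho_{k-1})$.
As the value of the $\rho_i$'s is increasing with $i$, if $\vec{v}^\pi_H$ Pareto-dominates $\vec{v}^{\pi'}_H$, then $\pi$ should be preferred.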
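The ordered-histories transformation can also be checked on a small instance. The sketch below follows one reading of the construction: the discounted reward counts of the compared histories form the columns of a matrix, the new vector reward attached to each unknown reward is the decumulative of the corresponding column of the inverse matrix, and the scalar value is recovered as an inner product with the differences of consecutive history values. The two histories, the hidden rewards and the helper names are assumptions made for illustration; exact fractions avoid rounding issues.

```python
from fractions import Fraction


def decumulative(x):
    """Return the vector whose i-th entry is x[i] + x[i+1] + ... + x[-1]."""
    out, total = [], Fraction(0)
    for xi in reversed(x):
        total += xi
        out.append(total)
    return out[::-1]


def dot(x, y):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(x, y))


def inverse(M):
    """Gauss-Jordan inverse of a square matrix of Fractions."""
    n = len(M)
    A = [list(row) + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c] != 0)  # fails if singular
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [x / pivot for x in A[c]]
        for r in range(n):
            if r != c:
                factor = A[r][c]
                A[r] = [x - factor * y for x, y in zip(A[r], A[c])]
    return [row[n:] for row in A]


gamma = Fraction(1, 2)
k = 2  # number of distinct unknown rewards


def hat_r(reward_indices):
    """Discounted count of each reward along a history given as reward indices."""
    counts = [Fraction(0)] * k
    for t, i in enumerate(reward_indices):
        counts[i] += gamma ** t
    return counts


# Hypothetical compared histories, described by the indices of their rewards.
h1, h2 = (0, 0), (0, 1)
cols = [hat_r(h1), hat_r(h2)]
H = [[cols[j][i] for j in range(k)] for i in range(k)]
H_inv = inverse(H)

# New vector reward for each unknown reward r_i.
new_reward = [decumulative([H_inv[row][i] for row in range(k)]) for i in range(k)]

# Hidden rewards, used here only to verify the identity.
r = [Fraction(1), Fraction(3)]
rho = [dot(c, r) for c in cols]
rho_delta = [rho[0]] + [rho[i] - rho[i - 1] for i in range(1, k)]

# Hypothetical expected discounted reward counts of some policy.
v_hat = [Fraction(1, 2), Fraction(3, 2)]
v_vec = [sum(v_hat[i] * new_reward[i][j] for i in range(k)) for j in range(k)]

print(dot(v_hat, r), dot(v_vec, rho_delta))  # 5 5
```

In this instance the second history is indeed the better one (its value is 5/2 against 3/2), so the differences of consecutive history values are positive, as the argument requires.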
4.1.3 Applying MORL techniques to PBRL
We have seen two cases where a PBMDP/PBRL problem can be transformed into an MOMDP/MORL problem. As a side note, one may notice that the second case is a generalization of the first one. Thanks to this transformation, the multiobjective approaches that we recalled in Section 2.2 can be applied in the preference-based setting. We now mention a few cases that would be interesting to investigate in our opinion.
Here, a Pareto-optimal solution corresponds to a policy that is optimal for admissible reward values that respect the order known over rewards or histories. Like in MOMDP/MORL, it may not be feasible to determine the set of all Pareto-optimal solutions. A natural approach [20] is then to compute its $\epsilon$-cover to obtain a representative set of solutions that are approximately optimal.
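Another approach is to use a nonlinear scalarizing function like the Chebyshev distance to an ideal point. A policy $\pi$ is Chebyshev-optimal if it minimizes:
$$\max_{i = 1, \ldots, k} \Big( I_i - \sum_{s \in S} \mu(s)\, \vec{v}^\pi_i(s) \Big) \tag{12}$$
where $I_i = \max_{\pi'} \sum_{s \in S} \mu(s)\, \vec{v}^{\pi'}_i(s)$ defines the $i$th component of the ideal point $I$, $\mu$ is a positive probability distribution over initial states and $\vec{v}^\pi_i$ is the $i$th component of the vector value function of an MOMDP/MORL obtained from a PBMDP/PBRL. It is possible to show that a Chebyshev-optimal policy is a minimax-regret-optimal policy [23], whose definition can be written as follows:
$$\arg\min_{\pi} \max_{\pi'} \max_{r^{\Delta} \in \Delta} \sum_{s \in S} \mu(s) \big( \vec{v}^{\pi'}(s) - \vec{v}^{\pi}(s) \big) \cdot r^{\Delta} \tag{13}$$
where $\Delta$ is the set of nonnegative values representing differences of consecutive reward values.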
Lemma 4.3
A policy is Chebyshev-optimal if and only if it is minimax-regret-optimal.
It is easy to see that the maximum (over $r^{\Delta}$) in (13) is attained by choosing $r^{\Delta}$ as a canonical vector and is equal to the maximum (over $i$) in (12). This simple property justifies the application of one simple nonlinear scalarizing function used in multiobjective optimization in the preference-based setting.
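The link between the Chebyshev criterion and minimax regret can be illustrated on a finite set of candidate policies. The sketch below makes several assumptions for the sake of the example: policies are summarized by vector values already averaged over initial states, the candidate set is a made-up list of three policies, and the admissible reward differences are taken to be nonnegative and normalized to sum to one, so that the worst case of a linear regret is attained at a canonical vector.

```python
def chebyshev_distance(v, ideal):
    """Largest componentwise gap between the ideal point and v."""
    return max(top - x for top, x in zip(ideal, v))


def max_regret(v, candidates):
    """Worst-case regret of v over the adversary's policy w and canonical
    reward-difference vector e_i (vertices of the assumed simplex)."""
    k = len(v)
    return max(w[i] - v[i] for w in candidates for i in range(k))


# Hypothetical vector values of three candidate policies.
values = {
    "pi_a": (2.0, 1.5, 0.2),
    "pi_b": (2.0, 1.0, 0.9),
    "pi_c": (2.0, 0.5, 1.0),
}
candidates = list(values.values())
ideal = tuple(max(v[i] for v in candidates) for i in range(3))

cheb = {name: chebyshev_distance(v, ideal) for name, v in values.items()}
regret = {name: max_regret(v, candidates) for name, v in values.items()}

best_cheb = min(cheb, key=cheb.get)
best_regret = min(regret, key=regret.get)
print(best_cheb, best_regret)  # pi_b pi_b
```

Both criteria assign the same score to every candidate, so they select the same policy, here the balanced one.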
The interactive approach mentioned in Section 2.2 has already been exploited for eliciting the unknown rewards in settings where comparison queries can be issued to an expert, by interleaving optimization/learning phases with elicitation phases, in PBMDP with value iteration [31, 11] and in PBRL with Q-learning [32]. It would be interesting to use an interactive approach to elicit the reward values by comparing the elements of an $\epsilon$-cover of the Pareto-optimal solutions. This technique may help reduce the number of queries.
5 Conclusion
In this paper, we highlighted the relation between two sequential decision-making settings: preference-based MDP/RL and multiobjective MDP/RL. In particular, we showed that multiobjective problems can also arise in situations of unknown reward values. Based on the link between both formalisms, one can possibly import techniques designed for one setting to solve the other. To illustrate our points, we also listed a few interesting cases.
Besides, in our translation of a PBMDP/PBRL to an MOMDP/MORL, we assumed that rewards were Markovian, which may not always be true in practice. It would be interesting to extend our translation to the non-Markovian case [13].
References
 [1] Abbeel, P., Coates, A., Ng, A.Y.: Autonomous helicopter aerobatics through apprenticeship learning. International Journal of Robotics Research 29(13), 1608–1639 (2010)
 [2] Akrour, R., Schoenauer, M., Sébag, M.: APRIL: Active preference-learning based reinforcement learning. In: ECML PKDD, Lecture Notes in Computer Science. vol. 7524, pp. 116–131 (2012)
 [3] Barrett, L., Narayanan, S.: Learning all optimal policies with multiple criteria. In: ICML (2008)
 [4] Busa-Fekete, R., Szörényi, B., Weng, P., Cheng, W., Hüllermeier, E.: Preference-based reinforcement learning. In: European Workshop on Reinforcement Learning, Dagstuhl Seminar (2013)
 [5] Busa-Fekete, R., Szörényi, B., Weng, P., Cheng, W., Hüllermeier, E.: Top-k selection based on adaptive sampling of noisy preferences. In: International Conference on Machine Learning (ICML) (2013)
 [6] Busa-Fekete, R., Szörényi, B., Weng, P., Cheng, W., Hüllermeier, E.: Preference-based Reinforcement Learning: Evolutionary Direct Policy Search using a Preference-based Racing Algorithm. Machine Learning 97(3), 327–351 (2014)
 [7] Chatterjee, K., Majumdar, R., Henzinger, T.: Markov decision processes with multiple objectives. In: STACS (2006)
 [8] Dudík, M., Hofmann, K., Schapire, R.E., Slivkins, A., Zoghi, M.: Contextual dueling bandits. In: COLT (2015)
 [9] Fürnkranz, J., Hüllermeier, E., Cheng, W., Park, S.: Preference-based reinforcement learning: A formal framework and a policy iteration algorithm. Machine Learning 89(1), 123–156 (2012)
 [10] Gábor, Z., Kalmár, Z., Szepesvári, C.: Multicriteria reinforcement learning. Proceedings of International Conference of Machine Learning (1998)
 [11] Gilbert, H., Spanjaard, O., Viappiani, P., Weng, P.: Reducing the number of queries in interactive value iteration. In: International Conference on Algorithmic Decision Theory (ADT). pp. 139–152 (2015)
 [12] Gilbert, H., Spanjaard, O., Viappiani, P., Weng, P.: Solving MDPs with skew symmetric bilinear utility functions. In: IJCAI. pp. 1989–1995 (2015)
 [13] Gretton, C., Price, D., Thiebaux, S.: Implementation and comparison of solution methods for decision processes with non-Markovian rewards. In: UAI. vol. 19, pp. 289–296 (2003)
 [14] Lizotte, D.J., Bowling, M., Murphy, S.A.: Efficient reinforcement learning with multiple reward functions for randomized controlled trial analysis. In: ICML (2010)
 [15] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
 [16] Ng, A., Russell, S.: Algorithms for inverse reinforcement learning. In: ICML. Morgan Kaufmann (2000)
 [17] Ogryczak, W., Perny, P., Weng, P.: On minimizing ordered weighted regrets in multiobjective Markov decision processes. In: International Conference on Algorithmic Decision Theory (ADT). Lecture Notes in Artificial Intelligence, Springer (2011)
 [18] Ogryczak, W., Perny, P., Weng, P.: A compromise programming approach to multiobjective Markov decision processes. International Journal of Information Technology & Decision Making 12, 1021–1053 (2013)
 [19] Perny, P., Weng, P.: On finding compromise solutions in multiobjective Markov decision processes. In: Multidisciplinary Workshop on Advances in Preference Handling (MPREF) @ European Conference on Artificial Intelligence (ECAI) (2010)
 [20] Perny, P., Weng, P., Goldsmith, J., Hanna, J.: Approximation of Lorenz-optimal solutions in multiobjective Markov decision processes. In: International Conference on Uncertainty in Artificial Intelligence (UAI) (2013)
 [21] Puterman, M.: Markov decision processes: discrete stochastic dynamic programming. Wiley (1994)
 [22] Regan, K., Boutilier, C.: Eliciting additive reward functions for Markov decision processes. In: IJCAI. pp. 2159–2164 (2011)
 [23] Regan, K., Boutilier, C.: Robust online optimization of reward-uncertain MDPs. In: IJCAI. pp. 2165–2171 (2011)
 [24] Roijers, D., Vamplew, P., Whiteson, S., Dazeley, R.: A survey of multiobjective sequential decision-making. Journal of Artificial Intelligence Research 48, 67–113 (2013)
 [25] Steuer, R., Choo, E.U.: An interactive weighted Tchebycheff procedure for multiple objective programming. Mathematical Programming 26, 326–344 (1983)
 [26] Strehl, A.L., Littman, M.L.: Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research (2009)
 [27] Sutton, R., Barto, A.: Reinforcement learning: An introduction. MIT Press (1998)
 [28] Tesauro, G.: Temporal difference learning and TD-Gammon. Communications of the ACM 38(3), 58–68 (1995)
 [29] Weng, P.: Markov decision processes with ordinal rewards: Reference point-based preferences. In: International Conference on Automated Planning and Scheduling (ICAPS). vol. 21, pp. 282–289 (2011)
 [30] Weng, P.: Ordinal decision models for Markov decision processes. In: European Conference on Artificial Intelligence (ECAI). vol. 20, pp. 828–833 (2012)
 [31] Weng, P., Zanuttini, B.: Interactive value iteration for Markov decision processes with unknown rewards. In: IJCAI (2013)
 [32] Weng, P., Busa-Fekete, R., Hüllermeier, E.: Interactive Q-Learning with Ordinal Rewards and Unreliable Tutor. In: ECML/PKDD Workshop Reinforcement Learning with Generalized Feedback (Sep 2013)
 [33] White, D.: Multiobjective infinite-horizon discounted Markov decision processes. J. Math. Anal. Appls. 89, 639–647 (1982)
 [34] Wray, K.H., Zilberstein, S., Mouaddib, A.I.: Multiobjective MDPs with conditional lexicographic reward preferences. In: AAAI (2015)
 [35] Yue, Y., Broder, J., Kleinberg, R., Joachims, T.: The k-armed dueling bandits problem. Journal of Computer and System Sciences 78(5), 1538–1556 (2012)