1. Introduction
Multiobjective reinforcement learning (MORL) aims to extend the capabilities of reinforcement learning (RL) methods to enable them to work for problems with multiple, conflicting objectives (roijers2013survey). RL algorithms generally assume that the environment is a Markov Decision Process (MDP) in which the agent is provided with a scalar reward after each action, and must aim to learn the policy which maximises the long-term return based on those rewards. In contrast, MORL algorithms operate within multiobjective MDPs (MOMDPs), in which the rewards are vectors, with each element in the vector corresponding to a different objective. This creates a number of new issues to be addressed by the MORL agent. Most notably, there may be multiple policies which are optimal (in terms of Pareto optimality), and which of these the agent should learn is not immediately obvious.
In the utility-based paradigm of MORL (roijers2013survey; zintgraf2015quality) it is assumed that the preferences of the user can be defined in terms of a parameterised utility function f, and that the aim of the agent should be to learn the policy which produces vector returns which maximise the utility to the user as defined by f.
Various approaches have been explored for the form of the utility function – some may be better suited to express the preferences of the user within a particular problem domain, while others offer benefits from an algorithmic perspective. Weighted linear scalarisation has been widely used because of its simplicity (for example, barrett2008learning; castelletti2010tree; perez2009responsive). Linear scalarisation transforms an MOMDP into an equivalent single-objective MDP, and enables existing RL approaches to be directly applied (roijers2013survey). However, for many tasks a linear function may not be able to accurately represent the utility of the user, and so may fail to discover the policy which is optimal with regards to their true utility. As a result, numerous nonlinear scalarisation functions have been explored in the literature (for example, gabor1998multi; van2013scalarized; van2013hypervolume) – these tend to produce algorithmic complications, but also may better represent the true preferences of the user.
As well as the choice of scalarisation function and parameters, a second factor must be considered within this utility-based paradigm – the timeframe over which the utility is being maximised. roijers2013survey identified two distinct possibilities. The agent may aim to maximise the expected scalarised return (ESR). That is, it is assumed the returns are first scalarised, and then the agent aims for the policy which maximises the expected value of that scalar. This ESR approach is suited to problems where the aim is to maximise the expected outcome within any individual episode. For example, when producing a treatment plan for a patient which trades off the likelihood of a cure versus the extent of negative side-effects – any individual patient will only undergo this treatment once, and so they care about the utility obtained within that specific episode.
In other contexts we may be concerned about the mean utility received over multiple episodes. In this situation the agent should aim to maximise the scalarised expected return (SER) – that is, it estimates the expected vector return per episode, and then maximises the scalarisation of that expected return. As demonstrated in roijers2018multi, the optimal policy for a particular MOMDP under the ESR and SER settings may differ considerably, even if the same scalarisation function and parameters are used in both cases. As noted by roijers2018multi and ruadulescu2019equilibria, much of the existing work in MORL has considered SER optimisation, although this has often been implicit rather than explicitly stated. Much of this SER-focused work has been based on benchmark environments such as those of vamplew2011empirical, the majority of which are deterministic MOMDPs.
In this paper we demonstrate by example that the model-free value-based methods previously widely used in MORL research may fail to maximise the SER utility when applied to MOMDPs with stochastic state transitions.
2. Space Traders: An Example Stochastic MOMDP
As shown in Figure 1, the Space Traders MOMDP is a finite-horizon task with a horizon of 2 timesteps. It consists of two non-terminal states, with three actions available in each state. The agent starts at its home planet (state A) and must travel to another planet (state B) to deliver a shipment, and then return to state A with the payment. The agent receives a reward with two elements – the first is 0 on all actions, except that a reward of 1 is received when the agent successfully returns to state A, while the second element is a negative value reflecting the time taken to execute the action.
There are three possible pathways between the two planets. The direct path (actions shown by solid black lines in Figure 1) is fairly short, but there is a risk of the agent being waylaid by space pirates and failing to complete the task. The indirect path (grey lines) avoids the pirates and so always leads to successful completion of the mission, but takes longer. Finally, the recently developed teleportation system (dashed lines) allows instantaneous transportation, but has a higher risk of failure. The figure also details the probability of success, and the reward for the mission-success and time objectives for each action – due to variations in local conditions such as solar winds and the location of the space pirates, the time values for the outward and return journeys on a particular path may vary.
Table 1 summarises the transition probabilities and rewards of the MOMDP, and also shows the mean immediate reward for each action from each state, weighted by the probability of success.
State | Action | P(success) | Reward (success) | Reward (failure) | Mean reward
A | Indirect | 1.0 | (0, -12) | n/a | (0, -12)
A | Direct | 0.9 | (0, -6) | (0, -1) | (0, -5.5)
A | Teleport | 0.85 | (0, 0) | (0, 0) | (0, 0)
B | Indirect | 1.0 | (1, -10) | n/a | (1, -10)
B | Direct | 0.9 | (1, -8) | (0, -7) | (0.9, -7.9)
B | Teleport | 0.85 | (1, 0) | (0, 0) | (0.85, 0)
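The mean immediate rewards in the final column of Table 1 can be verified directly from the success probabilities and the per-outcome rewards. A minimal illustrative sketch in Python (the dictionary layout and function name are our own, with values taken from Table 1):

```python
# Mean immediate reward = p * reward_on_success + (1 - p) * reward_on_failure.
# Entries follow Table 1; a failure reward of None means the action cannot fail.
actions = {
    ("A", "Indirect"): (1.0,  (0, -12), None),
    ("A", "Direct"):   (0.9,  (0, -6),  (0, -1)),
    ("A", "Teleport"): (0.85, (0, 0),   (0, 0)),
    ("B", "Indirect"): (1.0,  (1, -10), None),
    ("B", "Direct"):   (0.9,  (1, -8),  (0, -7)),
    ("B", "Teleport"): (0.85, (1, 0),   (0, 0)),
}

def mean_reward(p, success, failure):
    if failure is None:
        return success
    return tuple(p * s + (1 - p) * f for s, f in zip(success, failure))

means = {k: mean_reward(*v) for k, v in actions.items()}
# e.g. means[("B", "Direct")] is approximately (0.9, -7.9)
```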
As there are three actions from each state, there are a total of nine deterministic policies available to the agent. The mean return per episode for each of these policies is shown in Table 2 and illustrated in Figure 2. The solid points in the figure highlight the policies which belong to the Pareto front, and the dashed grey line indicates the convex hull (only those policies lying on the convex hull can be located via methods using linear scalarisation – this set of policies is referred to as the convex coverage set (roijers2013computing)).
Policy | Action in state A | Action in state B | Mean return
II | Indirect | Indirect | (1, -22)
ID | Indirect | Direct | (0.9, -19.9)
IT | Indirect | Teleport | (0.85, -12)
DI | Direct | Indirect | (0.9, -14.5)
DD | Direct | Direct | (0.81, -12.61)
DT | Direct | Teleport | (0.765, -5.5)
TI | Teleport | Indirect | (0.85, -8.5)
TD | Teleport | Direct | (0.765, -6.715)
TT | Teleport | Teleport | (0.7225, 0)
For the remainder of the paper we will assume that the agent's aim is to minimise the time taken to complete the delivery and return to A, subject to having at least an 88% probability of successful completion. That is, if P is the expected value of the mission-success element of the return and T is the expected (negative) time element, the scalarisation function is f(P, T) = T if P ≥ 0.88, and f(P, T) = -∞ otherwise. The optimal policy for this aim is to follow the direct path to B and then the indirect path back to A (policy DI).
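Both the mean returns in Table 2 and the optimality of DI under this thresholded utility can be checked by enumerating the nine policies over the episode structure. An illustrative sketch (function names are our own; probabilities and rewards are taken from Table 1, and a failed action is assumed to end the episode):

```python
from itertools import product

# (p_success, reward_on_success, reward_on_failure) per action at each state.
A = {"I": (1.0, (0, -12), (0, 0)), "D": (0.9, (0, -6), (0, -1)),
     "T": (0.85, (0, 0), (0, 0))}
B = {"I": (1.0, (1, -10), (0, 0)), "D": (0.9, (1, -8), (0, -7)),
     "T": (0.85, (1, 0), (0, 0))}

def mean_return(a, b):
    pa, sa, fa = A[a]
    pb, sb, fb = B[b]
    # If the action at A succeeds, the agent acts at B; failure ends the episode.
    succ = tuple(sa[i] + pb * sb[i] + (1 - pb) * fb[i] for i in range(2))
    return tuple(pa * succ[i] + (1 - pa) * fa[i] for i in range(2))

def utility(ret):
    # Minimise time, subject to at least 0.88 probability of mission success.
    return ret[1] if ret[0] >= 0.88 else float("-inf")

returns = {a + b: mean_return(a, b) for a, b in product("IDT", repeat=2)}
best = max(returns, key=lambda pi: utility(returns[pi]))
# best is "DI", with mean return approximately (0.9, -14.5)
```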
3. Applying Model-Free Value-Based MORL Methods to Space Traders
In this section we will discuss how some of the value-based MORL methods previously used in the literature would perform on the Space Traders MOMDP. All the methods discussed are assumed to be based on a multiobjective extension of model-free value-based RL algorithms such as Q-learning or SARSA – for example see (van2014multi, p. 3668). For the purposes of this section we will restrict discussion to single-policy methods, in which the scalarisation function f is used to filter the multiple Pareto-optimal policies which may be available so as to obtain a single policy which is optimal with regards to f. Multiple-policy MORL methods will be discussed in Section 5.
All methods learn vector-valued estimated Q-values, but differ in terms of the scalarisation or ordering method used to perform action-selection, and the characteristics on which the Q-values and policy are conditioned.
3.1. Linear scalarisation
A simple approach to MORL is to apply a linear weighted scalarisation to the elements of the Q-value vector prior to selecting the greedy action. As mentioned earlier, this converts the MOMDP into an equivalent MDP, and so the Q-values and action-selection need only be conditioned on the current state of the MDP. However, it is well-known that methods using linear scalarisation are unable to identify solutions which do not lie on the convex hull of the Pareto front (vamplew2008limitations). Clearly from Figure 2 this is the case for policy DI, and so linear methods will not be able to converge to this policy. This result is not surprising, and we mention it here simply for the sake of completeness.
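This can also be checked numerically: sweeping the weight of a linear scalarisation over the mean returns of Table 2 never selects DI for any weighting of the two objectives. A small sketch (the grid resolution and function name are our own choices):

```python
# Mean returns per policy from Table 2: (P(mission success), negated time).
returns = {"II": (1, -22), "ID": (0.9, -19.9), "IT": (0.85, -12),
           "DI": (0.9, -14.5), "DD": (0.81, -12.61), "DT": (0.765, -5.5),
           "TI": (0.85, -8.5), "TD": (0.765, -6.715), "TT": (0.7225, 0)}

def greedy_policy(w):
    # Linear scalarisation: weight w on mission success, (1 - w) on time.
    return max(returns, key=lambda p: w * returns[p][0] + (1 - w) * returns[p][1])

winners = {greedy_policy(w / 1000) for w in range(1001)}
# DI never wins for any weight, as it lies inside the convex hull.
assert "DI" not in winners
```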
3.2. Nonlinear scalarisation
A variety of nonlinear scalarisation methods have been explored in the MORL literature (gabor1998multi; van2013scalarized; van2013hypervolume). The nonlinear nature of the scalarisation function means that the assumption of additivity underlying the Bellman equation no longer applies. In order to deal with this, both the choice of action and the Q-values must be conditioned not only on the current state of the environment, but also on the rewards received so far by the agent during this episode (geibel2006reinforcement; roijers2018multi). That is, if the scalarisation function is f, then at time t the agent will select the action a which maximises the value of f(P_t + Q(s_t, a)), where P_t is the vector sum of the rewards received so far in the episode.
For the purposes of the following discussion we will assume that f is the thresholded lexicographic ordering (TLO) operator (gabor1998multi; issabekov2012empirical), and that a thresholding parameter of 0.88 is applied to the first element of the Q-value vector. The intention here is to maximise the value of the second objective (i.e. minimise time), subject to achieving the threshold level for the first objective. If this operator could be applied directly to the mean returns of each policy from Table 2, then clearly policy DI would be selected.
However, if we consider how the TLO operator selects actions during the execution of a policy, then a different result will emerge. Regardless of the path selected at state A, if state B is successfully reached then a zero reward will have been received by the agent. Therefore the choice of action at state B is independent of the previous action. Looking at the mean action values reported in Table 1, it can be seen that action T will be eliminated as it fails to meet the threshold for the first objective, and that action D will be preferred over I, as both meet the threshold and D has a superior value for the time objective. So it can already be seen that this agent will not converge to the desired policy DI.
Knowing that action D will be selected at state B, we can calculate the Q-values for each action at state A, as shown in Table 3. The TLO action selector will eliminate actions D and T from consideration, as neither meets the threshold of 0.88 for the probability of success. Action I will be selected, giving rise to the overall policy ID. Not only is this not the desired DI policy, but, as is evident from Figure 2, its average outcome is in fact Pareto-dominated by DI.
Action in state A | Policy | Q(A, a)
Indirect | ID | (0.9, -19.9)
Direct | DD | (0.81, -12.61)
Teleport | TD | (0.765, -6.715)
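The action-elimination argument above can be made concrete with a small sketch of TLO action selection (the function name is our own; the action values are the mean values from Table 1 and the Q-values from Table 3):

```python
THRESHOLD = 0.88

def tlo_select(q_values):
    # Thresholded lexicographic ordering: keep actions meeting the threshold
    # on P(success) if any do, then maximise the (negated) time objective.
    ok = {a: q for a, q in q_values.items() if q[0] >= THRESHOLD}
    candidates = ok if ok else q_values
    return max(candidates, key=lambda a: candidates[a][1])

# Mean action values at state B (Table 1): Teleport fails the threshold,
# and Direct beats Indirect on time, so Direct is chosen at B.
q_B = {"Indirect": (1.0, -10), "Direct": (0.9, -7.9), "Teleport": (0.85, 0)}
assert tlo_select(q_B) == "Direct"

# Q-values at state A given Direct is chosen at B (Table 3): only Indirect
# meets the threshold, so the agent converges to policy ID, not DI.
q_A = {"Indirect": (0.9, -19.9), "Direct": (0.81, -12.61),
       "Teleport": (0.765, -6.715)}
assert tlo_select(q_A) == "Indirect"
```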
4. The Interaction of Local Decision-Making and Stochastic State Transitions
The failure of the nonlinear value-based MORL algorithms on the Space Traders MOMDP can be explained by the analysis of stochastic-transition MOMDPs previously carried out by bryce2007probabilistic in the context of probabilistic planning. This analysis has so far been largely overlooked by MORL researchers, and so one of the contributions of this paper is to bring this work to the attention of the MORL research community.
Figure 3 illustrates a simple MDP reproduced from bryce2007probabilistic, with a stochastic branch occurring on the transition from the initial state. The table in the lower half of this figure specifies the mean return for the four possible deterministic policies. Keeping in mind that this MOMDP is phrased in terms of minimising cost (rather than maximising the inverse of the cost), it can be seen that, unlike Space Traders, there are no Pareto-dominated policies for this MOMDP.[1]

[1] While clearly illustrating the problem, this MOMDP also lacks the narrative drama of Space Traders!
The aim of the agent is to minimise the cost, subject to satisfying at least a 0.6 probability of success. Within an ESR formulation of the problem (i.e. ensuring the probability-of-success threshold is achieved in each episode), the optimal policy is to select at each branch the subplan which individually satisfies the probability threshold. However, if considered from the SER perspective, the optimal plan instead executes at one branch a cheaper subplan which by itself fails to achieve the probability threshold – because that branch is executed with low probability, the mean outcome over the two subplans will achieve the threshold while also producing a significant cost saving.
As identified by bryce2007probabilistic, whether the overall policy meets the constraints depends on the probability with which each branch is executed as well as the mean outcome of each branch. Determining the correct subplan to follow at each branch requires consideration of the subplan options available at each other branch in combination with the probability of branch execution.
This requirement is fundamentally incompatible with the localised decision-making at the heart of model-free value-based RL methods like Q-learning, where it is assumed that the correct choice of action can be determined purely based on information available to the agent at the current state. The provision of additional information such as the sum of rewards received so far in the episode, as discussed in Section 3.2, is insufficient, as it still only provides information about the branch which has been followed in this episode, rather than all possible branches which might have been executed.
The conclusion to be drawn from both this example and Space Traders is that value-based model-free MORL methods are inherently limited when applied in the context of SER optimisation of nonlinear utility on MOMDPs with non-deterministic state transitions. These methods may fail to discover the policy which maximises the SER (i.e. the mean utility over multiple episodes). To the best of our knowledge this limitation has not previously been identified in the MORL literature. It is particularly important, as the combination of SER, stochastic state transitions and nonlinear utility may well arise in important areas of application such as AI safety (vamplew2018human).
5. Potential Solutions
In this section we will briefly review and critique various options which may address the issue identified above.
5.1. ESR Optimisation
As noted earlier, the issue described arises from the fact that an agent aiming to find a policy optimal with regards to SER must take into account the value which will be received on average by its policy across multiple episodes. Framing the problem in terms of ESR optimisation would eliminate this issue. However, ESR is clearly inappropriate for the context of the Space Traders MOMDP. The agent will aim to ensure every episode meets the threshold for the mission-success objective. This can only be achieved by following the strictly safe II policy, which produces results far worse for the user's true utility than the DI policy.
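This can be illustrated with a quick calculation over per-episode outcomes (an illustrative sketch; the outcome distributions follow Table 1, ESR is evaluated here as the expectation of the thresholded per-episode utility, and failure at A is assumed to end the episode):

```python
# Per-episode outcomes (probability, vector return) for two policies.
outcomes = {
    "II": [(1.0, (1, -22))],
    "DI": [(0.9, (1, -16)), (0.1, (0, -1))],  # 10% chance of failing at A
}

def episode_utility(ret):
    # The 0.88 threshold applied to the episode's own mission-success element.
    return ret[1] if ret[0] >= 0.88 else float("-inf")

esr = {pi: sum(p * episode_utility(r) for p, r in outcomes[pi])
       for pi in outcomes}
# esr["II"] is -22.0, while esr["DI"] is -inf: under ESR any chance of a
# failed episode is unacceptable, so the strictly safe II policy is preferred.
```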
5.2. Nonstationary or nondeterministic policies
Previous work has demonstrated that for the SER formulation, or for non-episodic tasks, policies formed from a nonstationary or nondeterministic mixture of deterministic policies can Pareto-dominate deterministic policies (vamplew2009constructing; vamplew2017steering). For example, a mixture which randomly selects between policies TI and II with appropriate probabilities at the start of each episode can produce a mean outcome which exceeds that of policy DI, as shown in Figure 4 – the mixture policy which selects TI with probability 0.65 and II with probability 0.35 achieves a mean return of (0.9025, -13.225), which is superior to the deterministic DI policy with regards to both objectives.
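The mixture outcome quoted above can be verified arithmetically (a minimal sketch, using the mean returns from Table 2):

```python
# Mean returns of the base policies (Table 2) and of DI for comparison.
TI, II, DI = (0.85, -8.5), (1.0, -22.0), (0.9, -14.5)

p = 0.65  # probability of selecting TI at the start of each episode
mix = tuple(p * ti + (1 - p) * ii for ti, ii in zip(TI, II))
# mix is approximately (0.9025, -13.225): higher success probability and
# less time than the deterministic DI policy.
assert mix[0] > DI[0] and mix[1] > DI[1]
```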
However the use of policies which vary so widely may not be appropriate in all contexts – for many problems the more consistent outcome produced by a deterministic policy may be preferable, and so methods to find SERoptimal deterministic policies for stochastic MOMDPs are still required.
5.3. Multi-policy value-based MORL
As well as the single-policy value-based MORL methods examined in this paper, several authors have proposed multi-policy methods. These operate by retaining multiple value vectors at each state. These can correspond either to all Pareto-optimal values obtainable from that state, or (for purposes of efficiency) be constrained to store only those values which can help construct the optimal value function under some assumptions about the nature of the overall utility function (roijers2013computing). Multi-policy algorithms were first proposed for variants of dynamic programming (white1982multi; wiering2007computing) and more recently have been extended to MORL (van2014multi; ruiz2017temporal).
By propagating back the coverage set of values available at each successor state, these algorithms would correctly identify all potentially optimal policies available at the starting state, and the optimal policy could then be selected at that point – in the context of Space Traders this would allow the desired DI policy to be selected. However, two issues still need to be addressed. The first is ensuring that the agent has a means of determining which action should be performed in each encountered state so as to align with the initial choice of policy; existing algorithms do not necessarily provide such a means in the context of stochastic transitions. The second is that the existing multi-policy MORL algorithms do not have an obvious extension to complex state-spaces where tabular methods are infeasible. Conventional function-approximation methods cannot be applied, as the cardinality of the vectors to be stored can vary between states. vamplew2018non provides preliminary work addressing this problem, but further work is still required to make this approach practical.
5.4. Model-based methods
As well as describing the difficulties faced by probabilistic planning, bryce2007probabilistic also propose a search algorithm known as Multiobjective Looping AO* (MOLAO*) to solve such tasks. As a planning method, this assumes an MOMDP with known state-transition probabilities and a finite and tractable number of discrete states. It may be possible to extend this approach by integrating it within model-based RL algorithms which can learn to estimate the transition probabilities and to generalise across states. We are not aware of any prior work which has attempted to do so. However, the model-based MORL approach proposed in wiering2014model may provide a suitable basis for implementing a reinforcement learning equivalent of MOLAO*.
5.5. Policy-search methods
An alternative to value-based approaches is to use policy-search approaches to RL. As these directly optimise the policy as a whole, as defined by a set of policy parameters, they do not face the local decision-making issue of model-free value-based methods.
Multiple researchers have proposed and evaluated policy-search methods for multiobjective problems (shelton2001importance; uchibe2007constrained; pirotta2015multi; parisi2017manifold). One issue to be addressed, however, is that these methods most naturally produce stochastic policies, and as such may face the same problems as the mixture or nonstationary approaches discussed in Section 5.2, unless they are modified or constrained so as to ensure convergence to a deterministic policy.
6. Conclusion
We have described a stochastic MOMDP and utility function which, despite their seeming simplicity, are not amenable to solution by the widely used model-free value-based approaches to MORL. While this issue with MOMDPs with stochastic state transitions has previously been described in the context of probabilistic planning (bryce2007probabilistic), this is the first work to identify the implications for MORL. Our example also demonstrates that under stochastic state transitions, it is in fact possible for such MORL methods to converge to a Pareto-dominated policy.
The combination of SER optimisation, stochastic state transitions and the need for a deterministic policy is likely to arise in a range of applications (particularly for risk-aware agents), and so awareness of the limitations of some MORL methods under these conditions is important in order to avoid the use of inappropriate methods.