Predicting optimal value functions by interpolating reward functions in scalarized multi-objective reinforcement learning

by   Arpan Kusari, et al.
Ford Motor Company

A common approach for defining a reward function for multi-objective reinforcement learning (MORL) problems is the weighted sum of the multiple objectives. The weights are then treated as design parameters that depend on the expertise (and preference) of the person performing the learning, with the typical result that a new solution is required for any change in these settings. This paper investigates the relationship between the reward function and the optimal value function for MORL. Specifically, it addresses the question of how to approximate the optimal value function well beyond the set of weights for which the optimization problem was actually solved, thereby avoiding the need to recompute for any particular choice. We prove that the value function transforms smoothly given a transformation of the weights of the reward function (and thus permits a smooth interpolation in the policy space). A Gaussian process is used to obtain a smooth interpolation of the optimal value function over the reward function weights for three well-known examples: Gridworld, Objectworld and Pendulum. The results show that the interpolation can provide robust values for sample states and actions in both discrete and continuous domain problems. Significant advantages arise from utilizing this interpolation technique in the domain of autonomous vehicles: easy, instant adaptation of user preferences while driving and true randomization of obstacle vehicle behavior preferences during training.



I Introduction

Reinforcement learning (RL) is a machine learning technique that provides the basis for decision-making, where a reward provided by the environment leads the agent to behave so as to maximize the cumulative sum of rewards. The reward function of RL problems often requires optimization of multiple, often conflicting objectives [8]. For example, in the domain of autonomous vehicles, driving preferences have to be balanced between time to goal, comfort and safety [12], which are correlated, and it is unclear how they influence each other. These conflicting objectives do not yield a single optimal solution, but rather a set of trade-off solutions which balance the objectives [17]. The easiest way to solve the multi-objective problem is to use a linear scalarization function [6] that transforms the given problem into a standard single-objective problem using a weighted sum of the objectives.

Sutton’s reward hypothesis states that all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward). The inference is that any given multi-objective problem can always be transformed into a single-objective reward function. The most obvious problem in this case is that the weights used during training are a design parameter and depend on the preference of the person designing the RL problem. Thus, the trained RL agent has a fixed optimal policy (and optimal value function) which depends on the weights provided. Having a fixed set of weights hinders adaptation to different user preferences, since for every change of weights the process of training (which is tedious and time intensive) needs to be repeated.

A question which arises is: given a small, sparse group of optimal value functions under different reward functions defined by different weights, is it possible to interpolate through the entire space of reward functions to provide accurate estimates of the optimal value functions at all possible states and actions?

To the best of our understanding, prior research on value function interpolation has been used to show convergence of RL algorithms for countable and uncountable spaces. Ref. [5] proposed multilinear interpolation techniques on a coarse grid to solve various RL paradigms. Ref. [16] proved convergence of RL algorithms combined with value function interpolation, including convergence of Q-learning [14] for uncountable spaces. Although it is fairly obvious that changing the reward function affects the value function directly, we have not found any research work which investigates this relationship and predicts the value function for weights not previously seen during training.

The majority of MORL approaches consist of single-policy algorithms that learn Pareto optimal solutions [9]. Ref. [1] provides a modification of RL to learn all the optimal policies for all linear preference assignments by incorporating the convex hull of the value function. Ref. [18] uses Monte-Carlo Tree Search (MCTS) along with a hypervolume indicator as the multi-objective action-selection criterion. Ref. [17] uses multi-objective optimization techniques within an RL framework to create a multi-policy algorithm, called Pareto Q-learning, that learns a set of Pareto dominating policies in a single run. While our proposed approach is useful for MORL problems, we do not aim to create a different MORL approach in this paper. Rather, our research formulation differs from the existing MORL approaches in that we seek to derive value functions at reward weights unseen in the training phase by interpolating from the neighboring weights.

Through this research, we aim to interpolate through the space of value functions that results from changing the weights of the reward function, using a Gaussian process (GP). The change in weights may be non-uniform, which makes the process highly nonlinear. It thus becomes a supervised learning problem in which, as the number of objectives increases, the weight space grows and the data points become extremely sparse. Finding accurate value function values across the problem space would be extremely beneficial for machine learning in general and autonomous vehicles in particular. GPs provide flexible function approximators, capable of learning intricate structure through their covariance kernels [19]. Utilizing the predictive power of GPs to interpolate through the high-dimensional input space should yield accurate value functions at all points of the large state space.

This paper is organized as follows: Section II provides a preliminary background of RL and GP, Section III provides the claim along with the mathematical reasoning, Section IV gives the results of the methodology on various standard RL examples, and Section V gives the discussions and conclusions.

II Background

II-A Reinforcement learning

In the RL task, at time $t$, the agent observes a state $S_t \in \mathcal{S}$, which represents the environmental model of the system. It takes an action $A_t \in \mathcal{A}$. The agent receives an immediate scalar reward $r_{t+1}$ and moves to a new state $S_{t+1}$. The environment’s dynamics are characterized by state transition probabilities $P(s' \mid s, a)$. This can be formally stated as a Markov Decision Process (MDP), in which the next state depends only on the current state and action (the Markov property) and the agent receives a scalar reward for executing the action.

The goal of the agent is to maximize the cumulative reward (discounted sum of rewards), or value function:

$$V(s) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, S_t = s \right]$$

where $\gamma$ is the discount factor and $r_t$ is the reward at time-step $t$. In terms of a policy $\pi$, the value function can be given by the Bellman equation as:

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ r(s, a) + \gamma V^{\pi}(s') \right]$$

where $r(s, a)$ is the expected immediate reward for taking action $a$ in state $s$. Using Bellman’s optimality equation, we can define a policy $\pi$ to be greater than or equal to any other policy $\pi'$ if $V^{\pi}(s) \geq V^{\pi'}(s)$ for all $s \in \mathcal{S}$. This policy is known as an optimal policy ($\pi^*$) and its value function is known as the optimal value function ($V^*$).

For continuous state space problems, such as those arising in the control of nonlinear dynamical systems, a common approach is the value-function approach [15]. The value-function approach estimates a value for each action and chooses the “greedy” policy (the policy with the highest value) at each time-step. The value function is then updated until it converges to the optimal value function.
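As an illustration of the value-function approach, the following is a minimal sketch of tabular value iteration with a greedy policy. It is not the solver used in this paper; the function name `value_iteration` and the two-state MDP are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Tabular value iteration (illustrative sketch).

    P: array (S, A, S), P[s, a, s2] = probability of moving from s to s2 under a.
    R: array (S, A), immediate reward for taking action a in state s.
    Returns optimal state values V (S,), action values Q (S, A),
    and the greedy policy (S,).
    """
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Q = R + gamma * P @ V            # Bellman backup for every (s, a)
        V_new = Q.max(axis=1)            # greedy choice over actions
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * P @ V
    return V, Q, Q.argmax(axis=1)

# Hypothetical two-state, two-action MDP (assumption, for illustration only):
# action 0 always leads to state 0, action 1 always leads to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
V, Q, policy = value_iteration(P, R, gamma=0.5)
```

The greedy policy is read off the converged action values with `argmax`, which mirrors the description of choosing the action with the highest value at each step.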

II-B Gaussian process regression

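A stochastic process is a collection of random variables, $\{f(x) : x \in \mathcal{X}\}$, where the variables are indexed by elements of a set $\mathcal{X}$. A GP is a special form of stochastic process in which any finite subset of the random variables has a multivariate Gaussian distribution [13]. In particular, a collection of random variables $\{f(x) : x \in \mathcal{X}\}$ is said to be drawn from a GP with mean function $m(\cdot)$ and covariance function $k(\cdot, \cdot)$ if, for any finite set of elements $x_1, \dots, x_m \in \mathcal{X}$, the associated finite set of random variables $f(x_1), \dots, f(x_m)$ has the distribution

$$\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_m) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_m) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{bmatrix} \right)$$

The resulting GP is then denoted as

$$f(\cdot) \sim \mathcal{GP}\left( m(\cdot), k(\cdot, \cdot) \right)$$

While any real-valued function is suitable as the mean function $m(\cdot)$, the kernel function $k(\cdot, \cdot)$ needs to guarantee positive-semidefiniteness.

Let $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ be a training set of i.i.d. examples from some unknown distribution. In the Gaussian process regression model,

$$y^{(i)} = f(x^{(i)}) + \varepsilon^{(i)}, \quad i = 1, \dots, m$$

where the $\varepsilon^{(i)}$ are i.i.d. “noise” variables with independent $\mathcal{N}(0, \sigma^2)$ distributions. We assume a zero-mean Gaussian process prior, $f(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot))$, with a covariance function $k(\cdot, \cdot)$. The marginal distribution over any set of input points belonging to $\mathcal{X}$ must have a joint multivariate Gaussian distribution. Therefore, for testing points $X_*$, the marginal distribution is given as

$$\begin{bmatrix} \vec{f} \\ \vec{f}_* \end{bmatrix} \Big|\, X, X_* \sim \mathcal{N}\left( \vec{0}, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)$$

where $X$ is the matrix formulation of the training input vectors, $X_*$ is the matrix formulation of the test input vectors and $\vec{f}$ is the compactly written vector formulation of $[f(x^{(1)}), \dots, f(x^{(m)})]^{\top}$. The outputs can therefore be written as:

$$\begin{bmatrix} \vec{y} \\ \vec{y}_* \end{bmatrix} \Big|\, X, X_* \sim \mathcal{N}\left( \vec{0}, \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) + \sigma^2 I \end{bmatrix} \right) \tag{9}$$

where the noise terms added to $\vec{f}$ and $\vec{f}_*$ are i.i.d. “noise” variables with independent $\mathcal{N}(0, \sigma^2)$ distributions. We derive the test outputs from Equation 9 as:

$$\vec{y}_* \mid \vec{y}, X, X_* \sim \mathcal{N}(\mu^*, \Sigma^*)$$

$$\mu^* = K(X_*, X) \left( K(X, X) + \sigma^2 I \right)^{-1} \vec{y}$$

$$\Sigma^* = K(X_*, X_*) + \sigma^2 I - K(X_*, X) \left( K(X, X) + \sigma^2 I \right)^{-1} K(X, X_*)$$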
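The predictive mean and covariance of GP regression can be computed directly with a few lines of linear algebra. The sketch below assumes a squared-exponential kernel and a hypothetical one-dimensional data set; the helper names `rbf_kernel` and `gp_predict` are ours and are not part of any library.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between the rows of A and the rows of B."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_predict(X, y, X_star, noise_var=1e-6, length_scale=1.0):
    """Posterior mean and covariance of the noisy test outputs given (X, y)."""
    K = rbf_kernel(X, X, length_scale) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X_star, X, length_scale)          # K(X*, X)
    K_ss = rbf_kernel(X_star, X_star, length_scale)    # K(X*, X*)
    L = np.linalg.cholesky(K)                          # stable inverse of K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss + noise_var * np.eye(len(X_star)) - v.T @ v
    return mean, cov

# Hypothetical training data (assumption): samples of a sine wave on [0, 1].
X = np.linspace(0.0, 1.0, 8)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])

mean_train, _ = gp_predict(X, y, X, length_scale=0.1)
mean_star, cov_star = gp_predict(X, y, np.array([[0.5]]), length_scale=0.1)
```

A Cholesky factorization is used instead of an explicit matrix inverse, which is the usual choice for numerical stability. The diagonal of the returned covariance gives the predictive variance at each query point.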
III Methodology

III-A Value function interpolation

In this section, we focus on providing mathematical justifications for the interpolation of value function based on the weights of the objectives of reward function.

For initial analysis, we wish to prove that given a simple, linear transformation of weights, the value function can be interpolated in an accurate manner. Intuitively, we are trying to derive the intermediate optimal value function giving the optimal policy for some MDP, where the reward is the weighted combination of various different objectives.

Theorem 1

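For a reward function composed of $n$ different objectives, each associated with a weight $w_i$, with the full set given by $\mathbf{w} = \{w_1, \dots, w_n\}$, such that for a given state $s$ and a given action $a$, the reward function is

$$R(s, a) = \sum_{i=1}^{n} w_i\, r_i(s, a) \tag{11}$$

where $r_i(s, a)$ are the normalized reward functions of the individual objectives at the given state $s$ and action $a$, the gradient of the state-value function with respect to the weights exists.

The optimal value at a state $s$ is given by the state-value function

$$V^*(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big] \tag{12}$$

where $\gamma$ is the discount factor and $P(s' \mid s, a)$ is the probability of moving to state $s'$ from state $s$ under action $a$. Given a particular set of weights $\mathbf{w}$, we substitute (11) into (12) to obtain

$$V^*_{\mathbf{w}}(s) = \max_{a} \Big[ \sum_{i=1}^{n} w_i\, r_i(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*_{\mathbf{w}}(s') \Big] \tag{13}$$

However, note that for a different set of weights $\mathbf{w}' = \{w'_1, \dots, w'_n\}$, the optimal state-value function is

$$V^*_{\mathbf{w}'}(s) = \max_{a} \Big[ \sum_{i=1}^{n} w'_i\, r_i(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*_{\mathbf{w}'}(s') \Big] \tag{14}$$

Subtracting (13) from (14) yields

$$V^*_{\mathbf{w}'}(s) - V^*_{\mathbf{w}}(s) = \max_{a} \Big[ \sum_{i=1}^{n} w'_i\, r_i(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*_{\mathbf{w}'}(s') \Big] - \max_{a} \Big[ \sum_{i=1}^{n} w_i\, r_i(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*_{\mathbf{w}}(s') \Big] \tag{15}$$
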

Using the property yields


Equation (16) can be written in a matrix form as


where and

Since, is constant for all states and actions, (17) can be rearranged as


which gives the approximate gradient of the value function with respect to the weights. If all the rewards at the current state and action are finite, then the gradient exists for that given state of the MDP. Thus, a linear interpolation of the weights in the reward function leads to a smooth interpolation of the state-value function.
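The smooth dependence of the optimal value function on the weights can be checked numerically. The sketch below does not reproduce the derivation above; it verifies a related, standard consequence of the Bellman operator being a $\gamma$-contraction, namely $\max_s |V^*_{\mathbf{w}'}(s) - V^*_{\mathbf{w}}(s)| \leq \max_{s,a} |\sum_i (w'_i - w_i)\, r_i(s, a)| / (1 - \gamma)$. The random MDP, the two objectives and all names are hypothetical.

```python
import numpy as np

def optimal_values(P, R, gamma, iters=2000):
    """Optimal state values by repeated Bellman optimality backups."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (R + gamma * P @ V).max(axis=1)
    return V

# Hypothetical random MDP (assumption): 6 states, 3 actions, 2 objectives.
rng = np.random.default_rng(0)
n_states, n_actions, n_obj, gamma = 6, 3, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)              # rows become probability vectors
r = rng.random((n_obj, n_states, n_actions))   # per-objective rewards in [0, 1)

def scalarized_values(w):
    """V* for the weighted-sum reward R(s, a) = sum_i w_i r_i(s, a)."""
    R = np.tensordot(w, r, axes=1)
    return optimal_values(P, R, gamma)

w = np.array([0.50, 0.50])
w_prime = np.array([0.55, 0.45])

dV = scalarized_values(w_prime) - scalarized_values(w)
dR = np.tensordot(w_prime - w, r, axes=1)      # change in scalarized reward
bound = np.max(np.abs(dR)) / (1.0 - gamma)
max_change = np.max(np.abs(dV))
```

Comparing `max_change` with `bound` shows that a small change in the weights can only produce a proportionally small change in the optimal values, which is what makes interpolation over the weights reasonable.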

Corollary 1.1

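Under a linear transformation of the weights in the reward function, the gradient of the action-value function with respect to the weights exists.

For an optimal state-value function $V^*$ that gives the best value at a particular state, the optimal action-value function (the optimal value of a state and action combination) is

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s')$$

where $R(s, a)$ is the weighted reward function of Theorem 1, with per-objective rewards $r_i(s, a)$ and weights $w_i$. Given two different sets of weights, $\mathbf{w}$ and $\mathbf{w}'$, the difference in q-value functions can be written as:

$$Q^*_{\mathbf{w}'}(s, a) - Q^*_{\mathbf{w}}(s, a) = \sum_{i=1}^{n} (w'_i - w_i)\, r_i(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \left[ V^*_{\mathbf{w}'}(s') - V^*_{\mathbf{w}}(s') \right]$$
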
Replacing from equation 17, we get


Therefore the gradient of with respect to the weight is given as:


The shaped reward function is a specific case of the MORL reward function, whereby, the reward function is augmented using an indicator function where a positive reward is given if the next state is closer to the goal and can be presented as:


where is the goal state. Assuming that the goal state is constant across the different weights of the reward function, the added shaped reward remains constant for the given state across weights. Thus, the reward shaping does not pose any problems for interpolating reward functions.

IV Results

We use three example tasks of varying complexity to test the validity of our approach. We use the GP regression implementation in scikit-learn [11] to determine the interpolated value function, where the input vector is the state vector augmented with the discrete action and the weights of the reward function, and the scalar target is the value function. The Matern kernel with default parameters is used to train the GP in all cases. We also tried other kernels and found no significant difference between them.
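The input layout described above (state, action and reward weight as features, value as target) can be sketched as follows. This is a minimal illustration under stated assumptions, not the experimental code: the deterministic chain MDP, the helper names `q_values` and `make_rows`, and the choice of training weights are hypothetical. Only `GaussianProcessRegressor` and `Matern` are actual scikit-learn classes, used here with their default parameters.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def q_values(living_reward, n_states=5, gamma=0.9, iters=500):
    """Optimal Q-values of a hypothetical deterministic chain (illustration only).

    Actions: 0 = left, 1 = right. The last state is terminal; entering it
    gives reward +1, every other step gives the living reward.
    """
    Q = np.zeros((n_states, 2))
    for _ in range(iters):
        V = Q.max(axis=1)
        V[-1] = 0.0                                   # no value after termination
        for s in range(n_states - 1):
            for a, s2 in enumerate((max(s - 1, 0), s + 1)):
                r = 1.0 if s2 == n_states - 1 else living_reward
                Q[s, a] = r + gamma * V[s2]
    return Q[:-1]                                     # drop the terminal row

def make_rows(living_reward):
    """Feature rows [state, action, weight] and the matching Q-value targets."""
    Q = q_values(living_reward)
    X = [(s, a, living_reward) for s in range(Q.shape[0]) for a in range(2)]
    return np.array(X, dtype=float), Q.ravel()

train_weights = [0.0, -0.1, -0.2, -0.3, -0.4]
X_train = np.vstack([make_rows(w)[0] for w in train_weights])
y_train = np.concatenate([make_rows(w)[1] for w in train_weights])

gp = GaussianProcessRegressor(kernel=Matern())        # default kernel parameters
gp.fit(X_train, y_train)

# Query at a weight that was not part of the training set.
X_query, y_true = make_rows(-0.23)
y_pred, y_std = gp.predict(X_query, return_std=True)
mse = float(np.mean((y_pred - y_true) ** 2))
median_sigma = float(np.median(y_std))
```

The two summary numbers, `mse` and `median_sigma`, correspond to the kind of metrics reported in this section: the mean squared error against the actual values and the median predictive standard deviation at the query points.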

IV-A Gridworld

The gridworld [14] is a discrete grid with four actions per state, corresponding to steps in each direction; each action has a 10% chance of randomly changing direction. If the agent hits a wall, it stays in the same position. The goal state carries a large terminal reward, and there is a living cost (negative reward) for each of the other states, which encourages the agent to reach the goal as quickly as possible. There is a walled state at position (2,2). The default terminal rewards are +1 and -1 in the two terminal states and the default living reward is -0.02. For the different experiments, we vary the living cost and the terminal rewards. Two metrics are reported: the mean squared error between the actual and predicted value functions over all states and actions, and the median of the standard deviation at the query points. Two query points are reported as representative samples, one interpolated and one extrapolated.

Fig. 1: Visualization of optimal value functions for two extreme living rewards: (a) default living reward of -0.02 and (b) living reward = -0.4. For the interpolation of living reward = -0.23, we show the two neighboring points (c) living reward = -0.2 and (d) living reward = -0.3. (e) shows the predicted and (f) shows the actual values for the case with living reward = -0.23.

IV-A1 Changing the living reward

We vary the living reward of all states except the terminal states in order to vary the optimal policy (and thereby the optimal state-value function), such that the variability is non-linear. The living reward is varied from 0 to -0.4 in steps of -0.1. Two evaluation living rewards are used, as given in Table I below. The interpolation result is accurate to the fourth decimal place, while the extrapolation is within a feasible error bound. The state-value functions for different living rewards are shown in Figure 1 to provide a sense of the variability of the value functions as a function of the living reward and of the accuracy of the interpolation over the entire state space.

Living reward | Mean squared error | Median sigma
-0.23 | 2.913e-08 | 4.24e-04
-0.5 | 0.0278 | 0.1

TABLE I: Predicting value functions for living rewards

IV-A2 Changing the negative terminal reward

The negative terminal reward is varied from -1 to -5 in steps of -0.5. The evaluations are given in Table II below. Increasing the magnitude of the negative terminal reward does not influence the value function in the other states, and thus the mean squared error is minimal.

Negative terminal reward | Mean squared error | Median sigma
-2.2 | 2.678e-07 | 8.535e-04
-6 | 7.018e-06 | 2.304e-03

TABLE II: Predicting value functions for negative terminal rewards

IV-A3 Changing the positive terminal reward

The positive terminal reward is varied from 1 to 5 in steps of 0.5 and evaluated at two points, 2.2 and 6, as given in Table III below. In both interpolation and extrapolation, the GP tracks the value functions closely, with a mean squared error below 1e-05 in both cases.

Positive terminal reward | Mean squared error | Median sigma
2.2 | 3.118e-09 | 1.339e-03
6 | 2.018e-06 | 6.993e-04

TABLE III: Predicting value functions for positive terminal rewards

IV-B Objectworld

Objectworld is an extension of gridworld described by Ref. [7], featuring random objects placed in the grid (Figure 2(a)). Each object is assigned a random outer and inner color out of C colors, and the state vector is composed of the Euclidean distances to the nearest object with a specific inner or outer color. The true reward is positive in states that are both within 3 cells of outer color 1 and within 2 cells of outer color 2, negative in the remaining states within 3 cells of outer color 1, and zero otherwise. Inner colors and all other outer colors are distractors. In the given example, we use two colors, blue and red. Fifteen objects are placed randomly within the 10x10 grid with randomly chosen inner and outer colors. The positive reward is varied from 0.5 to 1, with the 0.6 point being predicted. Figure 2(b) shows the actual value function while Figure 2(c) shows the predicted value function. Table IV provides the statistics for the given prediction. The interpolation is not as accurate as in gridworld due to the non-linearity of the reward with respect to the states, but the GP can still provide values relatively close to the actual values, especially in the positive reward region.

Positive reward | Mean squared error | Median sigma
0.6 | 0.5207 | 0.0068

TABLE IV: Predicting value functions for rewards in objectworld
Fig. 2: (a) Objectworld with 15 randomly placed objects in blue and red inner and outer colors chosen randomly; white represents positive reward, black negative reward and grey zero reward (b) Actual value function for positive reward = 0.6 (c) Predicted value function
Fig. 3: Pendulum boxplot

IV-C Pendulum

The pendulum environment [3] is a well-known problem in the control literature in which a pendulum starts from a random position and the goal is to keep it upright by applying the minimum amount of force. The state vector is composed of the cosine of the angle of the pendulum, the sine of the angle and the derivative of the angle. The action is the joint effort, discretized into 5 values linearly spaced within the [-2, 2] range. The reward is given as:

$$r = -\left( w_{\theta}\, \theta^2 + w_{\dot{\theta}}\, \dot{\theta}^2 + w_{a}\, a^2 \right)$$

where $w_{\theta}$, $w_{\dot{\theta}}$ and $w_{a}$ are the reward weights for the angle $\theta$, the derivative of the angle $\dot{\theta}$ and the action $a$ respectively. The default reward weights given by OpenAI are [1, 0.1, 0.001] respectively. An episode is limited to 1000 timesteps.
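A minimal sketch of the weighted pendulum reward and the discretized action set is given below. The function names are hypothetical, and wrapping the angle to $[-\pi, \pi)$ before squaring is an assumption borrowed from the usual Gym implementation rather than something stated in this paper.

```python
import numpy as np

# Five discrete torques linearly spaced in [-2, 2], as described in the text.
ACTIONS = np.linspace(-2.0, 2.0, 5)

def angle_normalize(theta):
    """Wrap an angle to [-pi, pi) (assumed preprocessing step)."""
    return ((theta + np.pi) % (2.0 * np.pi)) - np.pi

def pendulum_reward(theta, theta_dot, action, weights=(1.0, 0.1, 0.001)):
    """Negative weighted sum of squared angle, angular velocity and torque."""
    w_theta, w_theta_dot, w_action = weights
    return -(w_theta * angle_normalize(theta) ** 2
             + w_theta_dot * theta_dot ** 2
             + w_action * action ** 2)

def observation(theta, theta_dot):
    """State vector: cosine of the angle, sine of the angle, angular velocity."""
    return np.array([np.cos(theta), np.sin(theta), theta_dot])

# Example: reward for the largest torque under a modified action weight.
r_example = pendulum_reward(np.pi / 2, 1.0, ACTIONS[-1], weights=(1.0, 0.1, 0.01))
```

Passing a different `weights` tuple is all that is needed to generate the family of reward functions over which the value function is interpolated.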

The deep Q-network (DQN) proposed by [10] combines deep neural networks with RL to solve continuous-state, discrete-action problems. DQN uses a neural network which gives the q-values for every action and uses a buffer that stores old states and actions to sample from, which helps to stabilize training. The pendulum environment is solved using the DQN approach for a varying reward weight, with the evaluation performed at 0.001. Since this is a continuous state problem, we utilize the trained evaluation model to transition to the next state. The boxplot of the difference in values for 10 example episodes is provided in Figure 3. Utilizing a DQN provides no guarantee that the states seen during testing have been visited during training, which can lead to the outliers.

V Conclusions

In this paper, we showed a direct relationship between the weights of the reward function and the optimal value function for scalarized MORL. This helped us in interpolating through a space of optimal value functions generated using the sparse set of reward functions to estimate the value functions at sample states. Utilizing this relationship would be very beneficial in high-dimensional problems where the instant adaptation of optimal value functions (and thus optimal policies) would save time and cost required for retraining.

The scalarization approach of MORL is restrictive in that it cannot work with objectives where Pareto fronts are non-convex or have discontinuities [4]. It is an area of active research which uses algorithms borrowed from the multi-objective optimization literature. However, our paper deals with problems which have a defined convex Pareto front and provides a very simple technique in determining optimal value functions at different weights.

Future work will focus on developing transfer learning of specific behaviors in multi-agent environments with different reward functions based on different weights.


  • [1] L. Barrett and S. Narayanan (2008) Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning, pp. 41–47.
  • [2] R. Bellman (1957) A Markovian decision process. Journal of Mathematics and Mechanics 6 (5), pp. 679–684.
  • [3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
  • [4] I. Das and J. E. Dennis (1997) A closer look at drawbacks of minimizing weighted sums of objectives for Pareto set generation in multicriteria optimization problems. Structural Optimization 14 (1), pp. 63–69.
  • [5] S. Davies (1997) Multidimensional triangulation and interpolation for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1005–1011.
  • [6] C. Hwang and A. S. M. Masud (2012) Multiple objective decision making - methods and applications: a state-of-the-art survey. Vol. 164, Springer Science & Business Media.
  • [7] S. Levine, Z. Popovic, and V. Koltun (2011) Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pp. 19–27.
  • [8] C. Liu, X. Xu, and D. Hu (2015) Multiobjective reinforcement learning: a comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems 45 (3), pp. 385–398.
  • [9] P. Mannion, S. Devlin, K. Mason, J. Duggan, and E. Howley (2017) Policy invariance under reward transformations for multi-objective reinforcement learning. Neurocomputing 263, pp. 60–73.
  • [10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
  • [11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
  • [12] G. Prokop (2001) Modeling human vehicle driving by model predictive online optimization. Vehicle System Dynamics 35 (1), pp. 19–53.
  • [13] C. E. Rasmussen (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71.
  • [14] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press.
  • [15] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
  • [16] C. Szepesvári (2001) Convergent reinforcement learning with value function interpolation. Technical Report TR-2001-02, Mindmaker Ltd., Budapest 1121, Konkoly Th. M. u ….
  • [17] K. Van Moffaert and A. Nowé (2014) Multi-objective reinforcement learning using sets of Pareto dominating policies. The Journal of Machine Learning Research 15 (1), pp. 3483–3512.
  • [18] W. Wang and M. Sebag (2012) Multi-objective Monte-Carlo tree search.
  • [19] C. Williams and C. E. Rasmussen (1996) Gaussian processes for regression. In Advances in Neural Information Processing Systems, pp. 514–520.