I Introduction
Reinforcement learning (RL) is a machine learning technique for sequential decision-making in which a reward provided by the environment leads the agent to behave so as to maximize the cumulative sum of rewards. The reward function of an RL problem often requires the optimization of multiple, often conflicting, objectives [8]. For example, in the domain of autonomous vehicles, driving preferences have to be balanced between time to goal, comfort, and safety [12], which are correlated, and it is unclear how they influence each other. These conflicting objectives do not yield a single optimal solution, but rather a set of trade-off solutions that balance the objectives [17]. The simplest way to solve a multi-objective problem is to use a linear scalarization function [6], which transforms the given problem into a standard single-objective one using a weighted sum of the objectives. Sutton's reward hypothesis states that all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward). The inference is that any given multi-objective problem can always be transformed into a single-objective reward function. The most obvious problem in this case is that the weights used during training are design parameters that depend on the preferences of the person formulating the RL problem. Thus, the trained agent has an optimal policy (and optimal value function) that is tied to the weights provided. A fixed set of weights limits adaptation to different user preferences, since every change of weights requires repeating the tedious and time-intensive training process.
A question which arises is: given a small, sparse set of optimal value functions obtained under reward functions with different weights, is it possible to interpolate through the entire space of reward functions to provide accurate estimates of the optimal value functions at all possible states and actions?
To the best of our understanding, prior research on value function interpolation has been used to show convergence of RL algorithms for countable and uncountable spaces. Ref. [5] proposed multilinear interpolation techniques on coarse grids to solve various RL paradigms. Ref. [16] provided convergence of RL algorithms combined with value function interpolation, including convergence of Q-learning [14] for uncountable spaces. Although it is fairly obvious that changing the reward function directly affects the value function, we have not found any work that investigates this relationship and predicts value functions for weights not seen during training.
The majority of multi-objective reinforcement learning (MORL) approaches consist of single-policy algorithms for learning Pareto optimal solutions [9]. Ref. [1] provides a modification of RL to learn all the optimal policies for all linear preference assignments by incorporating the convex hull of the value function. Ref. [18] uses Monte-Carlo Tree Search (MCTS) along with a multi-objective indicator, the hypervolume indicator, to define the action-selection criterion. Ref. [17], which uses multi-objective optimization techniques within an RL framework, creates a multi-policy algorithm, called Pareto Q-learning, that learns a set of Pareto dominating policies in a single run. While our proposed approach is useful for MORL problems, we do not aim to create another MORL algorithm in this paper. Rather, our formulation differs from existing MORL approaches in that we seek to derive value functions at reward weights unseen during training by interpolating from neighboring weights.
Through this research, we aim to interpolate through the space of value functions that results from changing the weights of the reward function, using a Gaussian process (GP). The change in weights may be non-uniform, which makes the mapping highly nonlinear. It therefore becomes a supervised learning problem in which, as the number of objectives increases, the weight space grows and the data points become extremely sparse. Finding accurate value functions across the problem space would be extremely beneficial for machine learning in general and autonomous vehicles in particular. GPs provide flexible function approximators, capable of learning intricate structure through their covariance kernels [19]. Utilizing the predictive power of GPs to interpolate through the high-dimensional input space should yield accurate value functions at all points of the large state space.

II Background
II-A Reinforcement learning
In the RL task, at time $t$, the agent observes a state $s_t \in S$, which represents the environmental model of the system, and takes an action $a_t \in A$. The agent receives an immediate scalar reward $r_t$ and moves to a new state $s_{t+1}$. The environment's dynamics are characterized by the state transition probabilities $p(s_{t+1} \mid s_t, a_t)$. This can be formally stated as a Markov Decision Process (MDP), in which the next state is completely determined by the previous state and action (Markov property), and a scalar reward is received for executing the action [2].

The goal of the agent is to maximize the cumulative reward (discounted sum of rewards), or value function:
$$V(s_t) = \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}\right] \tag{1}$$

where $\gamma \in [0, 1)$ is the discount factor and $r_t$ is the reward at time-step $t$. In terms of a policy $\pi$, the value function can be given by the Bellman equation as:

$$\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right] &\quad (2)\\
&= \mathbb{E}_{\pi}\!\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_t = s\right] &\quad (3)\\
&= \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\left[r(s, a) + \gamma\, \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]\right] &\quad (4)\\
&= \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\left[r(s, a) + \gamma V^{\pi}(s')\right] &\quad (5)
\end{aligned}$$
Using Bellman's optimality equation, we can define a policy $\pi$ that is better than or equal to any other policy $\pi'$ if $V^{\pi}(s) \geq V^{\pi'}(s)$ for all $s \in S$. This policy is known as the optimal policy ($\pi^*$) and its value function is known as the optimal value function ($V^*$).
For continuous state-space problems, such as those arising in the control of nonlinear dynamical systems, a common solution is the value-function approach [15], which estimates a value function for each action and chooses the "greedy" policy (the action with the highest value) at each time-step. The value function is updated until it converges to the optimal value function.
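To make the value-function approach concrete, the following minimal sketch performs tabular value iteration with the Bellman optimality backup; the transition and reward arrays are generic placeholders rather than any specific task from this paper.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration.

    P: transition probabilities, shape (S, A, S), P[s, a, s'] = p(s' | s, a)
    R: expected immediate reward, shape (S, A)
    Returns the optimal state-value function V* and a greedy policy.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' p(s' | s, a) V(s')
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)              # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1) # converged V* and greedy policy
        V = V_new
```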
II-B Gaussian process regression
A stochastic process is a collection of random variables $\{f(x) : x \in \mathcal{X}\}$ indexed by elements of a set $\mathcal{X}$. A GP is a special form of stochastic process in which any finite subset of the random variables has a multivariate Gaussian distribution [13]. In particular, a collection of random variables $\{f(x) : x \in \mathcal{X}\}$ is said to be drawn from a GP with mean function $m(\cdot)$ and covariance function $k(\cdot, \cdot)$ if, for any finite set of elements $x_1, \ldots, x_n \in \mathcal{X}$, the associated finite set of random variables $f(x_1), \ldots, f(x_n)$ has the distribution

$$\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{bmatrix},\; \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}\right). \tag{6}$$

The resulting GP is then denoted as $f(\cdot) \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$.

While any real-valued function is suitable as the mean function $m(\cdot)$, the kernel function $k(\cdot, \cdot)$ needs to guarantee positive-semidefiniteness.
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ be a training set of i.i.d. examples from some unknown distribution. In the Gaussian process regression model,

$$y_i = f(x_i) + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{7}$$

where the $\varepsilon_i$ are i.i.d. "noise" variables with distribution $\mathcal{N}(0, \sigma_n^2)$. We assume a zero-mean Gaussian process prior, $f(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot))$, with covariance function $k(\cdot, \cdot)$. The marginal distribution over any set of input points belonging to $\mathcal{X}$ must have a joint multivariate Gaussian distribution. Therefore, for test points $X_*$, the marginal distribution is given as

$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \,\Big|\, X, X_* \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right), \tag{8}$$

where $X$ is the matrix formulation of the training inputs, $X_*$ is the matrix formulation of the test inputs, and $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^{\top}$ is the compactly written vector of function values (with $\mathbf{f}_*$ defined analogously for the test inputs). The outputs can therefore be written as

$$\begin{bmatrix} \mathbf{y} \\ \mathbf{y}_* \end{bmatrix} \,\Big|\, X, X_* \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) + \sigma_n^2 I \end{bmatrix}\right), \tag{9}$$

where the noise terms are again i.i.d. $\mathcal{N}(0, \sigma_n^2)$ variables. From Equation (9), the predictive distribution of the test outputs is

$$\mathbf{y}_* \mid \mathbf{y}, X, X_* \sim \mathcal{N}(\mu_*, \Sigma_*), \tag{10}$$

where

$$\mu_* = K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}\mathbf{y}$$

and

$$\Sigma_* = K(X_*, X_*) + \sigma_n^2 I - K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} K(X, X_*).$$
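For reference, a minimal NumPy sketch of the predictive equations in (10), assuming the kernel matrices have already been computed (the function and argument names are ours):

```python
import numpy as np

def gp_posterior(K_train, K_cross, K_test, y_train, noise_var=1e-4):
    """GP predictive mean and covariance, following Equation (10).

    K_train: K(X, X), shape (n, n)
    K_cross: K(X_*, X), shape (m, n)
    K_test:  K(X_*, X_*), shape (m, m)
    y_train: training targets, shape (n,)
    """
    # Cholesky factorization of the noisy training covariance for stability
    L = np.linalg.cholesky(K_train + noise_var * np.eye(len(y_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_cross @ alpha                       # mu_* of Eq. (10)
    v = np.linalg.solve(L, K_cross.T)
    cov = K_test + noise_var * np.eye(K_test.shape[0]) - v.T @ v  # Sigma_*
    return mean, cov
```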
III Methodology
III-A Value function interpolation
In this section, we provide mathematical justification for interpolating the value function based on the weights of the objectives in the reward function.
For an initial analysis, we wish to prove that, given a simple linear transformation of the weights, the value function can be interpolated accurately. Intuitively, we are trying to derive the intermediate optimal value function, giving the optimal policy, for an MDP whose reward is a weighted combination of several different objectives.
Theorem 1
For a reward function composed of $n$ different objectives, each associated with a weight $w_i$, with the full set given by $\mathbf{w} = [w_1, \ldots, w_n]^{\top}$, such that for a given state $s$ and a given action $a$ the reward function is

$$R_{\mathbf{w}}(s, a) = \sum_{i=1}^{n} w_i R_i(s, a), \tag{11}$$

where $R_1, \ldots, R_n$ are normalized reward functions at the given state $s$ and action $a$, the gradient of the state-value function with respect to the weights exists.
The optimal value at a state $s$ is given by the state-value function

$$V^*_{\mathbf{w}}(s) = \max_{a}\left[R_{\mathbf{w}}(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*_{\mathbf{w}}(s')\right], \tag{12}$$

where $\gamma$ is the discount factor and $p(s' \mid s, a)$ are the state transition probabilities. Given a particular set of weights $\mathbf{w}$, we substitute (11) into (12) to obtain

$$V^*_{\mathbf{w}}(s) = \max_{a}\left[\sum_{i=1}^{n} w_i R_i(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*_{\mathbf{w}}(s')\right]. \tag{13}$$
However, note that for a different set of weights $\mathbf{w}'$, the optimal state-value function is

$$V^*_{\mathbf{w}'}(s) = \max_{a}\left[\sum_{i=1}^{n} w'_i R_i(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*_{\mathbf{w}'}(s')\right]. \tag{14}$$

Subtracting (13) from (14) yields

$$V^*_{\mathbf{w}'}(s) - V^*_{\mathbf{w}}(s) = \max_{a}\left[\sum_{i=1}^{n} w'_i R_i(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*_{\mathbf{w}'}(s')\right] - \max_{a}\left[\sum_{i=1}^{n} w_i R_i(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*_{\mathbf{w}}(s')\right]. \tag{15}$$
Using the property $\max_{a} f(a) - \max_{a} g(a) \le \max_{a}\left[f(a) - g(a)\right]$ yields

$$V^*_{\mathbf{w}'}(s) - V^*_{\mathbf{w}}(s) \le \max_{a}\left[\sum_{i=1}^{n} (w'_i - w_i)\, R_i(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\left(V^*_{\mathbf{w}'}(s') - V^*_{\mathbf{w}}(s')\right)\right]. \tag{16}$$

Equation (16) can be written in matrix form as

$$\Delta V^*(s) \le \max_{a}\left[\Delta\mathbf{w}^{\top}\mathbf{R}(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, \Delta V^*(s')\right], \tag{17}$$

where $\Delta\mathbf{w} = \mathbf{w}' - \mathbf{w}$, $\mathbf{R}(s, a) = [R_1(s, a), \ldots, R_n(s, a)]^{\top}$ and $\Delta V^*(s) = V^*_{\mathbf{w}'}(s) - V^*_{\mathbf{w}}(s)$.
Since $\Delta\mathbf{w}$ is constant for all states and actions, (17) can be rearranged as

$$\frac{\Delta V^*(s)}{\Delta\mathbf{w}} \le \max_{a}\left[\mathbf{R}(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, \frac{\Delta V^*(s')}{\Delta\mathbf{w}}\right], \tag{18}$$

where the division by $\Delta\mathbf{w}$ is understood componentwise. Equation (18) gives the approximate gradient of the value function with respect to the weights: as $\mathbf{w}' \to \mathbf{w}$, the quotient approaches $\partial V^*_{\mathbf{w}}(s) / \partial \mathbf{w}$. If all the rewards at the current state and action are finite, then the gradient exists for that state of the MDP. Thus, a linear interpolation of the weights in the reward function leads to a smooth interpolation of the state-value function.
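The conclusion of Theorem 1 can be probed numerically. The sketch below builds a small, randomly generated two-objective MDP of our own (not one of the paper's benchmarks), solves it by value iteration at nearby weights, and checks that the finite-difference slope of $V^*$ with respect to one weight is nearly constant, as Equation (18) suggests, wherever the optimal policy does not change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny random MDP with two reward components (illustrative only)
S, A, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, :] = p(. | s, a)
R1 = rng.uniform(-1, 1, size=(S, A))          # normalized objective 1
R2 = rng.uniform(-1, 1, size=(S, A))          # normalized objective 2

def optimal_V(w1, w2, tol=1e-10):
    """Value iteration for the scalarized reward w1*R1 + w2*R2 (Eqs. 11-12)."""
    R = w1 * R1 + w2 * R2
    V = np.zeros(S)
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Finite-difference slope of V* w.r.t. w2 at two nearby base weights (Eq. 18)
eps = 1e-3
g_a = (optimal_V(1.0, 0.50 + eps) - optimal_V(1.0, 0.50)) / eps
g_b = (optimal_V(1.0, 0.51 + eps) - optimal_V(1.0, 0.51)) / eps
print(np.max(np.abs(g_a - g_b)))  # small when the optimal policy is unchanged
```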
Corollary 1.1
Under a linear transformation of the weights in the reward function, the gradient of the action-value function with respect to the weights exists.
For an optimal state-value function that gives the best value at a particular state, the optimal action-value function (the optimal value of a state-action combination) is

$$Q^*_{\mathbf{w}}(s, a) = R_{\mathbf{w}}(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*_{\mathbf{w}}(s'). \tag{19}$$

Given two different sets of weights, the difference in Q-value functions can be written as

$$Q^*_{\mathbf{w}'}(s, a) - Q^*_{\mathbf{w}}(s, a) = \Delta\mathbf{w}^{\top}\mathbf{R}(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\left(V^*_{\mathbf{w}'}(s') - V^*_{\mathbf{w}}(s')\right). \tag{20}$$

Replacing the state-value difference using Equation (17), we get

$$\Delta Q^*(s, a) \le \Delta\mathbf{w}^{\top}\mathbf{R}(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, \max_{a'}\left[\Delta\mathbf{w}^{\top}\mathbf{R}(s', a') + \gamma \sum_{s''} p(s'' \mid s', a')\, \Delta V^*(s'')\right]. \tag{21}$$

Therefore, the gradient of $Q^*$ with respect to the weights is given as

$$\frac{\Delta Q^*(s, a)}{\Delta\mathbf{w}} \le \mathbf{R}(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, \max_{a'}\left[\mathbf{R}(s', a') + \gamma \sum_{s''} p(s'' \mid s', a')\, \frac{\Delta V^*(s'')}{\Delta\mathbf{w}}\right], \tag{22}$$

with the division by $\Delta\mathbf{w}$ again understood componentwise, as in (18).
The shaped reward function is a specific case of the MORL reward function, in which the reward is augmented by an indicator term giving a positive reward whenever the next state is closer to the goal. It can be presented as

$$R'_{\mathbf{w}}(s, a, s') = R_{\mathbf{w}}(s, a) + \mathbb{1}\!\left[d(s', s_g) < d(s, s_g)\right], \tag{23}$$

where $s_g$ is the goal state and $d(\cdot, \cdot)$ measures the distance to the goal. Assuming that the goal state is constant across the different weights of the reward function, the added shaped reward remains constant for a given state across weights. Thus, reward shaping does not pose any problems for the interpolation.
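As a minimal sketch of the shaped reward in Equation (23), assuming a Euclidean distance to a fixed goal state (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def shaped_reward(base_reward, s, s_next, s_goal):
    """Augment the scalarized reward with the indicator term of Eq. (23).

    A unit bonus is added whenever the next state is strictly closer to the
    goal than the current state (Euclidean distance used as an example).
    """
    closer = np.linalg.norm(np.subtract(s_next, s_goal)) < \
             np.linalg.norm(np.subtract(s, s_goal))
    return base_reward + float(closer)
```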
IV Results
We use three example tasks of varying complexity to test the validity of our approach. We use the GP regression implementation from scikit-learn in Python [11] to determine the interpolated value function, where the input vector is the state vector augmented with the discrete action and the weights of the reward function, and the scalar target is the corresponding value. The Matern kernel with default parameters is used for training the GP in all cases; we experimented with other kernels and did not find significant differences between them.
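A minimal sketch of this setup, assuming the training inputs and targets have already been generated by solving the underlying MDPs at a sparse set of weights (the file names below are placeholders of our own):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical training data: each row is [state features..., action id, reward weights...]
# and y holds the corresponding value, computed at a sparse set of weights.
X_train = np.load("train_inputs.npy")   # shape (N, d) -- placeholder file names
y_train = np.load("train_values.npy")   # shape (N,)

gp = GaussianProcessRegressor(kernel=Matern())  # Matern kernel, default parameters
gp.fit(X_train, y_train)

# Query the value function at a reward weight unseen during training
X_query = np.load("query_inputs.npy")
mean, std = gp.predict(X_query, return_std=True)   # predictive mean and sigma
mse = np.mean((np.load("query_values.npy") - mean) ** 2)
print("MSE:", mse, "median sigma:", np.median(std))
```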
IV-A Gridworld
The gridworld [14] is a discrete grid with four actions per state, corresponding to steps in each direction, and each action has a 10% chance of randomly changing direction. If the agent hits a wall, it stays in the same position. The goal states correspond to large terminal rewards, and there is a living cost (negative reward) in every other state, which encourages the agent to reach the goal as quickly as possible. There is a walled cell at position (2,2). The default terminal rewards are +1 and -1 in the two goal states, and the default living reward is -0.02. For the various experiments, we vary the living cost and the terminal rewards. Two metrics are reported: the mean squared error between the actual and predicted value functions over all states and actions, and the median standard deviation at the query points. Results are reported for two query points, one interpolated and one extrapolated, as representative samples.
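For concreteness, a sketch of the stochastic gridworld transition described above; the grid dimensions and the slip model (a uniform choice among the other directions) are our assumptions, since the exact layout is not restated here.

```python
import numpy as np

# Hypothetical grid layout: wall location follows the description above;
# the grid dimensions are placeholder assumptions.
WIDTH, HEIGHT = 4, 3
WALLS = {(2, 2)}
ACTIONS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}   # N, S, W, E
NOISE = 0.10   # 10% chance the action slips to another direction

def step(state, action, rng):
    """One stochastic gridworld transition: slip with probability NOISE,
    and stay in place when the move would hit a wall or leave the grid."""
    if rng.random() < NOISE:
        action = rng.choice([a for a in ACTIONS if a != action])
    dx, dy = ACTIONS[action]
    nx, ny = state[0] + dx, state[1] + dy
    if (nx, ny) in WALLS or not (0 <= nx < WIDTH and 0 <= ny < HEIGHT):
        return state
    return (nx, ny)
```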
[Figure 1: State-value functions of the gridworld for different living rewards, comparing the actual and interpolated values over the state space.]
IV-A1 Changing the living reward
We vary the living reward in all states except the terminal states in order to vary the optimal policy (and, by extension, the optimal state-value function) such that the variability is non-linear. The living reward is varied from 0 to -0.4 in steps of 0.1. Two evaluation living rewards are used, as given in Table I below. The interpolation result is accurate to the fourth decimal place, while the extrapolation is within a reasonable error bound. The state-value functions for different living rewards are shown in Figure 1 to give a sense of the variability of the value functions as a function of the living reward and the accuracy of the interpolation over the entire state space.
Table I

| Living reward | Mean squared error | Median sigma |
|---|---|---|
| -0.23 | 2.913e-08 | 4.24e-04 |
| -0.5 | 0.0278 | 0.1 |
IV-A2 Changing the negative terminal reward
The negative terminal reward is varied from -1 to -5 in steps of -0.5. The evaluations are given in Table II below. With the increase in the negative terminal reward, the value function in the other states is not influenced, and thus the mean squared error is minimal.
Table II

| Negative reward | Mean squared error | Median sigma |
|---|---|---|
| -2.2 | 2.678e-07 | 8.535e-04 |
| -6 | 7.018e-06 | 2.304e-03 |
IV-A3 Changing the positive terminal reward
The positive terminal reward is varied from 1 to 5 in steps of 0.5 and evaluated at two points, 2.2 and 6, given in Table III below. In both interpolation and extrapolation, the GP is able to track the value functions almost perfectly.
Table III

| Positive reward | Mean squared error | Median sigma |
|---|---|---|
| 2.2 | 3.118e-09 | 1.339e-03 |
| 6 | 2.018e-06 | 6.993e-04 |
IV-B Objectworld
Objectworld is an extension of gridworld, described in Ref. [7], featuring random objects placed in the grid (Figure 2(a)). Each object is assigned a random outer and inner color out of C colors, and the state vector is composed of the Euclidean distances to the nearest object with each specific inner or outer color. The true reward is positive in states that are both within 3 cells of outer color 1 and within 2 cells of outer color 2, negative within 3 cells of outer color 1, and zero otherwise. Inner colors and all other outer colors are distractors. In the given example, we use two colors, blue and red. Fifteen objects are placed randomly within the 10x10 grid with randomly chosen inner and outer colors. The positive reward is varied from 0.5 to 1, with the value at 0.6 being predicted. Figure 2(b) shows the actual value function, while Figure 2(c) shows the predicted value function. Table IV provides the statistics for this prediction. The interpolation is not as accurate as in gridworld, owing to the non-linearity of the reward with respect to the states, but the GP can still provide values relatively close to the actual ones, especially in the positive-reward region. A sketch of the feature construction is given after Figure 2.
Table IV

| Reward | Mean squared error | Median sigma |
|---|---|---|
| 0.6 | 0.5207 | 0.0068 |


IV-C Pendulum
The pendulum environment [3] is a well-known problem in the control literature in which a pendulum starts from a random position and the goal is to keep it upright by applying the minimum amount of force. The state vector is composed of the cosine of the pendulum angle, the sine of the angle, and the angular velocity. The action is the joint effort, discretized into 5 actions linearly spaced within the [-2, 2] range. The reward is given as

$$r = -\left(w_1 \theta^2 + w_2 \dot{\theta}^2 + w_3 a^2\right), \tag{24}$$

where $w_1$, $w_2$ and $w_3$ are the reward weights for the angle $\theta$, the angular velocity $\dot{\theta}$ and the action $a$, respectively. The optimal reward weights given by OpenAI are [1, 0.1, 0.001], respectively. An episode is limited to 1000 timesteps.
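A small sketch of the weighted reward of Equation (24) and the discretized action set; the function names are ours and no particular gym wrapper is assumed.

```python
import numpy as np

# Discretized torque set: 5 actions linearly spaced in [-2, 2]
ACTIONS = np.linspace(-2.0, 2.0, 5)

def weighted_pendulum_reward(theta, theta_dot, torque, w=(1.0, 0.1, 0.001)):
    """Scalarized pendulum reward of Eq. (24); w = (w1, w2, w3)."""
    w1, w2, w3 = w
    return -(w1 * theta ** 2 + w2 * theta_dot ** 2 + w3 * torque ** 2)

def state_vector(theta, theta_dot):
    """Observation used by the environment: [cos(theta), sin(theta), theta_dot]."""
    return np.array([np.cos(theta), np.sin(theta), theta_dot])
```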
The Deep Q-Network (DQN) was proposed in [10]; it combines deep neural networks with RL to solve continuous-state, discrete-action problems. DQN uses a neural network that outputs the Q-values for every action and uses a replay buffer storing past states and actions to sample from, which helps stabilize training. The pendulum environment is solved using the DQN approach for varying values of the action weight $w_3$, with the evaluation performed at $w_3 = 0.001$. Since this is a continuous-state problem, we utilize the trained evaluation model to transition to the next state. A boxplot of the difference in values for 10 example episodes is provided in Figure 3. Utilizing a DQN provides no guarantee that the states seen during testing were visited during training, which can lead to the outliers.
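As an illustration of how the per-episode value gaps behind Figure 3 could be assembled, the following sketch compares DQN Q-values with GP-interpolated values along a recorded trajectory; the array layout and function name are our assumptions, and the GP input ordering follows the setup sketched in Section IV.

```python
import numpy as np

def q_value_gaps(states, dqn_q_values, gp, action, weights):
    """Per-state gap between DQN Q-values and GP-interpolated Q-values.

    `states` is a (T, 3) array of pendulum observations visited in an episode,
    `dqn_q_values` the (T,) Q-values the trained DQN assigns to `action` in
    those states, and `gp` the fitted GaussianProcessRegressor of Section IV.
    """
    # GP input: state features, then the discrete action id, then reward weights
    X = np.hstack([states,
                   np.full((len(states), 1), action),
                   np.tile(weights, (len(states), 1))])
    return dqn_q_values - gp.predict(X)
```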
V Conclusions
In this paper, we showed a direct relationship between the weights of the reward function and the optimal value function for scalarized MORL. This allows us to interpolate through a space of optimal value functions, generated from a sparse set of reward weights, to estimate the value functions at sample states. Utilizing this relationship would be very beneficial in high-dimensional problems, where the instant adaptation of optimal value functions (and thus optimal policies) would save the time and cost required for retraining.

The scalarization approach to MORL is restrictive in that it cannot handle objectives whose Pareto fronts are non-convex or have discontinuities [4]. This is an area of active research that draws on algorithms from the multi-objective optimization literature. Our paper, however, deals with problems that have a convex Pareto front and provides a very simple technique for determining optimal value functions at different weights.

Future work will focus on transfer learning of specific behaviors in multi-agent environments with reward functions based on different weights.
References
- [1] (2008) Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning, pp. 41–47.
- [2] (1957) A Markovian decision process. Journal of Mathematics and Mechanics 6 (5), pp. 679–684.
- [3] (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
- [4] (1997) A closer look at drawbacks of minimizing weighted sums of objectives for Pareto set generation in multicriteria optimization problems. Structural Optimization 14 (1), pp. 63–69.
- [5] (1997) Multidimensional triangulation and interpolation for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1005–1011.
- [6] (2012) Multiple Objective Decision Making—Methods and Applications: A State-of-the-Art Survey. Vol. 164, Springer Science & Business Media.
- [7] (2011) Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pp. 19–27.
- [8] (2015) Multiobjective reinforcement learning: a comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems 45 (3), pp. 385–398.
- [9] (2017) Policy invariance under reward transformations for multi-objective reinforcement learning. Neurocomputing 263, pp. 60–73.
- [10] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
- [11] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- [12] (2001) Modeling human vehicle driving by model predictive online optimization. Vehicle System Dynamics 35 (1), pp. 19–53.
- [13] (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71.
- [14] (2018) Reinforcement Learning: An Introduction. MIT Press.
- [15] (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
- [16] (2001) Convergent reinforcement learning with value function interpolation. Technical Report TR-2001-02, Mindmaker Ltd., Budapest.
- [17] (2014) Multi-objective reinforcement learning using sets of Pareto dominating policies. The Journal of Machine Learning Research 15 (1), pp. 3483–3512.
- [18] (2012) Multi-objective Monte-Carlo tree search.
- [19] (1996) Gaussian processes for regression. In Advances in Neural Information Processing Systems, pp. 514–520.