Human Preference Scaling with Demonstrations For Deep Reinforcement Learning

07/25/2020 ∙ by Zehong Cao, et al. ∙ University of Tasmania

Current reward learning from human preferences can be used to solve complex reinforcement learning (RL) tasks without access to the reward function by defining a single fixed preference between pairs of trajectory segments. However, the judgement of preferences between trajectories is not dynamic and still requires over 1,000 human inputs. In this study, we propose a human preference scaling model that naturally reflects the human perception of the degree of choice between trajectories and then develop a human-demonstration preference model via supervised learning to reduce the number of human inputs. The proposed human preference scaling model with demonstrations can effectively solve complex RL tasks and achieve higher cumulative rewards in simulated robot locomotion (MuJoCo games) relative to single fixed human preferences. Furthermore, our human-demonstration preference model only needs human feedback for less than 0.01% of the agent's interactions with the environment and significantly reduces the cost of human inputs by up to 30% compared to the existing approaches. To present the flexibility of our approach, we released a video showing comparisons of the behaviours of agents trained with different types of human inputs. We believe that our naturally inspired human preference scaling with demonstrations is beneficial for precise reward learning and can potentially be applied to state-of-the-art RL systems, such as autonomy-level driving systems.




I Introduction

Reinforcement learning (RL) [13] has intensively used the reward function to train agent’s behaviours for a specified task. Nevertheless, constructing an effective reward function in complicated scenarios can sometimes be challenging. If the design of a reward function is too simple, then the behaviours of the trained agent may not be similar to our expectations, suggesting that the results may have a misalignment between our expectations and the actual testing [15]. To generate more effective RL, having a communication pathway between the RL agent and our expectations during the training process is valuable [3].

Some recent work showed challenges in training an intelligent robot to complete task objectives [16] [2] [11] and multi-agent interactions [5] [6], addressing the difficulties of alignment between human expectations and the final training outcome. Although approaches such as inverse RL [14] and imitation learning [10] were suggested to extract the reward function and mimic the actions of human experts to ensure an expected outcome, these approaches are not direct enough to train the desired behaviour. Moreover, the degree of movement of a robot could be larger than that of a human, so a human demonstration for imitation learning may not be available in some cases [17].

A novel study by the DeepMind group [7] proposed deep RL from human preferences, in which a comparison measurement between pairs of trajectory segments was designed to replace the reward function and preference inputs from an expert were allowed, as shown in Figure 1. This approach asks for advice from an expert to ensure that the RL training is on the correct track, implying that agents could be helped when tackling problems in some highly complicated scenarios. However, the judgement of preferences between trajectories is not dynamic. For example, the candidates of preferences only include the fixed left, right, or equal options, presented by the preference values 1, 0, and 0.5, respectively, so we assume that this approach cannot reflect the natural human intentions. In addition, the current approach still requires human inputs over 1,000 times, which consumes a very high time cost for humans.

Fig. 1: Illustration of Deep RL from Human Preferences

Inspired by the above DeepMind study and uncertainties in decision making [7] [25], in this study, we propose a new method - human preference scaling with demonstrations for deep RL. In particular, we developed a scaling model to support dynamic human preferences, instead of single fixed preferences, for RL. Moreover, we used the database of scaling values of human preference to develop a human-demonstration preference model via supervised learning to predict the preference scaling values based on the initial human inputs to reduce human efforts.

Of note, our contributions of human preference scaling with demonstrations for RL are as follows:

  1. Our proposed human preference scaling with demonstrations for deep RL allows humans to input dynamic preference levels via our developed human preference scaling model to reflect human behaviour and decrease the number of human inputs for our developed human-demonstration model.

  2. Based on an experiment with the robotic physical simulator MuJoCo [21], our developed human preference scaling model for RL could achieve higher cumulative reward values than the current fixed human preference model, and our developed human-demonstration model for RL could reduce the number of human inputs of dynamic preferences by up to 30% without significantly sacrificing the reward values.

The rest of this paper is organised as follows. The related work is briefly introduced in Section II. Then, the preliminaries and our proposed method are illustrated in Section III. Afterwards, the experiment involving MuJoCo games and relevant settings are addressed in Section IV. Finally, we present our findings and comparison results in Section V and conclude this work in Section VI.

II Related Work

II-A Preference Learning

Sometimes an agent may not be able to learn expected actions from the rewards when using the traditional strategy of RL, but preference learning can potentially minimise the gap between the missing information of the agent and the desired behaviour suggested by a human. A recent study [24] addressed that preference learning could assist in adapting robotic movement and noted that the human demonstration is particularly easy for task-dependent goals. An earlier study [8] also presented that human preferences are more effective in acquiring better actions than agent rewards in RL. They implied that the feedback from a human could leverage RL in qualitative policy models, and manipulation from preference learning could be a new strategy to assist the agent in achieving desired human behaviour in robotic control.

II-B Learning by Pairwise Comparison

Learning by pairwise comparison allows the application of human preferences to RL. Fürnkranz’s work [8] noted that preferences could help the agent perform label ranking in RL when carrying out a task, so that a human has the chance to provide feedback to the agent. To distinguish between preference learning and RL, a survey study [23] clarified that the agent receives feedback on two options in preference learning, while in RL, only one feedback option is accepted.

The notation $a \succ b$ represents that $a$ is the preferred choice over $b$ between the two options $a$ and $b$, where $a, b \in A$ and $A$ represents a set of options. Kreps [12] noted some notations that could be defined from preference learning while comparing a pair of options, such as

  • $a \succ b$: option $a$ is absolutely preferred.

  • $a \prec b$: option $b$ is absolutely preferred.

  • $a \sim b$: options $a$ and $b$ are indifferent; it cannot be distinguished which option is preferred.

  • $a \succeq b$: option $a$ is weakly preferred.

  • $a \preceq b$: option $b$ is weakly preferred.

II-C RL from Human Preferences

To link from preference learning to RL, according to the communication pathway proposed by Christiano et al. [7], the basic flow of RL from human preferences is mainly built with three main modules: agent, preference interface, and reward predictor. The agent keeps training and exploring the environment as RL progresses. The novelty of the communication pathway is adding a preference interface, which randomly generates some episodes to ask for a human’s judgement. An expert therefore inputs which options the expert prefers by inputting the preference database into the preference interface. These preferences are sent to a reward predictor to perform further training to predict the relative rewards from these preferences. These data are then used to transfer back the trained preferences as the agent’s observations so that policy training can be performed.

II-D Motivations

The settings of Christiano’s study [7] only cover three conditions among the definitions of preference learning from Kreps [12], in which the conditions $\succ$, $\prec$ and $\sim$ (absolute preference for either option, and indifference) were investigated. Christiano’s study did not set up the weak-preference conditions $\succeq$ and $\preceq$, which are two important remaining conditions we encounter in reality for decision making based on preferences. This motivates us to develop a new RL-based human preference setup to cover the conditions $\succeq$ and $\preceq$.

Furthermore, we noted that Christiano’s study [7] requires a large number of labels of human preferences, which may cause a long period of time to be spent on human input. In particular, it requires at least 5,500 human inputs in the training setup. We assume that some labels of human preferences may not be accurate, as a human may endure a high fatigue level, so the error rate may increase after a long period of time when performing the preference judgement. This motivates us to develop a human-demonstration model for human-agent learning to reduce the number of human inputs and improve the learning performance.

III Preliminaries and Methods

III-A Settings and Goal

An RL agent interacts with the environment for a number of steps, in which the agent inspects the environment according to observation $o_t$, receives instant reward $r_t$ at each timestep $t$ and performs action $a_t$ based on the observation. The agent aims to maximise the cumulative rewards during the training and receives the predicted instant reward values during each timestep via the reward predictor and preference interface modules. When the agent takes an action in the environment, a video clip from a trajectory segment is pushed into the list in the preference interface. The preference interface module randomly draws two video clips from the list and asks a human which clip the human would prefer to choose. Once the human inputs the preference, this information is passed to the reward predictor to generate predicted rewards for the agent for training.

In this study, we followed the preference interface design from Christiano’s work [7], which accepts the generated segments from the RL agent and puts these segments into a queue, as shown in Figure 1. Two segments are randomly selected from the queue, and the preference interface asks a human for the preference between the two candidate options. After the preference interface collects the preferred option from the human, this option will be saved in the preference queue in the reward predictor for training of RL policies. As the currently used approach only accepts a fixed-based preference, which is left, right or equal, in the input, it lacks dynamic input of the preference range. In this study, we modified the preference interface design and proposed a synthetic preference scaling model, as shown in the following section.

III-B Synthetic Preference Scaling

In this study, we assume that the human always prefers to choose the trajectory segment that has the potential to return a high reward value, so we used the synthetic oracle, a Bayesian approach for policy learning from trajectory preference queries, to mimic the preference of the human, whose preference over trajectories precisely reflects the reward [22]. For the synthetic human preference, when the agent queries for comparisons, the synthetic human could immediately reply by indicating a preference for whichever trajectory segment receives a higher reward in the underlying task.

Based on the synthetic human preference [22], we developed a synthetic preference scaling model, as presented in Algorithm 1.

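Ensure:  synthetic preference scaling represented as $\mu$.
Require:  $\mu$ from a given reward set ($r_1$, $r_2$) in the range [0.0, 1.0], where equal preference is equivalent to the value of 0.5.
Require:  $r_1$, the reward of the left trajectory segment.
Require:  $r_2$, the reward of the right trajectory segment.
Require:  $R$, the reward list composed of all reward sets.
2:  sort($R$), sort the list, ascending by default.
3:  $n \leftarrow$ number of elements in $R$.
4:  lower bound $\leftarrow$ element from $R$ below which the lowest 10% of reward values lie.
5:  upper bound $\leftarrow$ element from $R$ above which the highest 10% of reward values lie.
6:  if ($r_1 > r_2$) then
7:     normalise $r_1$ and $r_2$ to the range [0.0, 1.0]
9:  else if ($r_1 < r_2$) then
10:     normalise $r_1$ and $r_2$ to the range [0.0, 1.0]
12:  else
13:     $\mu \leftarrow 0.5$
14:  end if
15:  return  $\mu$
Algorithm 1 Synthetic Preference Scaling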

To elaborate on the proposed Algorithm 1, we address the details in the following. As the range of $\mu$ is set as [0.0, 1.0], the value 1.0 means that the human absolutely prefers the left trajectory segment, and the value 0.0 means that the human absolutely prefers the right trajectory segment. The value 0.5 indicates that the human cannot judge between the two trajectory segments. To start with preference scaling, the reward list $R$ is collected in the memory in each iteration, and the synthetic preferences are calculated based on the reward values with a normalisation measurement. In particular, based on the 90% confidence interval level, from the sets in the reward list $R$, we first remove the lowest 10% and highest 10% of reward values to ensure that outliers or anomalous reward values will not affect the preference scaling calculation. Then, all collected reward values are normalised to values between 0.0 and 1.0, and all sets in the reward list $R$ are ranked in ascending order.
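The scaling procedure described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name `synthetic_preference_scale`, the `trim` parameter, and in particular the mapping from the two normalised rewards to a value in [0.0, 1.0] (half of their difference, centred at 0.5) are assumptions made for illustration, since the exact normalisation expression is not stated in the text.

```python
def synthetic_preference_scale(r_left, r_right, reward_list, trim=0.10):
    """Return a preference value in [0.0, 1.0]; 1.0 = left absolutely preferred.

    reward_list holds (left_reward, right_reward) sets collected so far.
    The mapping to mu below is an ASSUMPTION, not the authors' exact formula.
    """
    # Flatten and sort all collected rewards in ascending order.
    flat = sorted(r for pair in reward_list for r in pair)
    n = len(flat)
    # Discard the lowest and highest `trim` share of values (outlier removal).
    k = int(n * trim)
    low, high = flat[k], flat[n - 1 - k]

    if high == low:
        # No spread to normalise against: fall back to a plain comparison.
        if r_left > r_right:
            return 1.0
        if r_left < r_right:
            return 0.0
        return 0.5

    def normalise(r):
        # Min-max normalisation, clipped so trimmed outliers stay in [0, 1].
        return min(1.0, max(0.0, (r - low) / (high - low)))

    # Assumed mapping: equal rewards give 0.5, larger left reward pushes to 1.0.
    return 0.5 + 0.5 * (normalise(r_left) - normalise(r_right))


if __name__ == "__main__":
    rewards = [(float(i), float(i + 1)) for i in range(0, 20, 2)]
    print(synthetic_preference_scale(12.0, 7.0, rewards))
```

With this assumed mapping, equal rewards always yield 0.5, and the value approaches 1.0 or 0.0 only when the two segments sit at opposite ends of the trimmed reward range.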

III-C Human Preference Scaling for Deep RL

Following the basic preference settings for RL as addressed in Section III-A, at each time $t$, we maintain the policy $\pi$: $O \rightarrow A$, where the agent interacts with the environment according to observation $o_t$ and then performs particular action $a_t$ based on instant observation $o_t$. During the training, the agent tries to estimate the reward function $\hat{r}$: $O \times A \rightarrow \mathbb{R}$ from a deep neural network, which is updated as follows:

  1. A set of trajectories $\{\tau^1, \ldots, \tau^i\}$ is generated by policy $\pi$. The parameters of policy $\pi$ are updated by the traditional RL to ensure that the maximum sum of predicted rewards $r_t = \hat{r}(o_t, a_t)$ that could be achieved from observation $o_t$ and action $a_t$ is obtained.

  2. A pair of segments $(\sigma^1, \sigma^2)$ is randomly selected from a set of trajectories $\{\tau^1, \ldots, \tau^i\}$. This pair of segments is sent to the preference interface and allows the human to perform the comparison.

  3. Based on Algorithm 1, the preference scaling $\mu$ is collected from the human and linked to the pair of segments $(\sigma^1, \sigma^2)$.

Please note that the above updating processes are in the asynchronous mode: process (1) passes the trajectories to process (2), process (2) passes the human preferences to process (3), and process (3) passes the parameters of $\hat{r}$ back to process (1).

Fig. 2: Proposed (A) Human Preference Scaling Model and (B) Human-Demonstration Model for Deep RL

Preference Elicitation

In Figure 2-A, we show a human preference scaling structure for the RL agent to make these processes easily understandable. To reflect natural human intentions, we modify the format of human preferences from fixed-based preferences to scale-based preferences in process (3). The current fixed-based preference approach for a pair of segments [7] (as shown in Figure 1) only allows the human to input the left, right or equal option, and the judgement is saved into the preference database with the data format $(\sigma^1, \sigma^2, \mu)$, where $\sigma^1$ and $\sigma^2$ are the extracted paired segments and $\mu$ is the fixed-based preference input from the human. Thus, it does not provide information regarding how much the human prefers the segment. For example, the left segment could be better than the right segment, but the left one is not perfect.

Our proposed scale-based preferences provide a scaling model (as shown in Algorithm 1) for the human to input a dynamic score for the preferred segment by assigning any value between 0.0 and 1.0. The value of $\mu$ lies in the range [0.0, 1.0] to specify the dynamic judgement of the human. If $\mu$ is input as 0.0, the right segment $\sigma^2$ is absolutely preferred ($\sigma^1 \prec \sigma^2$), while if $\mu$ is input as 1.0, the left segment $\sigma^1$ is absolutely preferred ($\sigma^1 \succ \sigma^2$). The value 0.5 represents that $\sigma^1$ and $\sigma^2$ are not different ($\sigma^1 \sim \sigma^2$), and the above conditions do not hold. Then, we can further address the additional conditions, which are in the weakly preferred categories. To be more specific, any value between 0.0 and 0.5 (excluding the margin values 0.0 and 0.5) represents that the right segment $\sigma^2$ is weakly preferred, while any value between 0.5 and 1.0 (excluding the margin values 0.5 and 1.0) represents that the left segment $\sigma^1$ is weakly preferred. For example, a human could input a preference value of 0.87 to specify that they weakly prefer the left segment, or any other value in the range between 0.0 and 1.0 that matches their judgement. Our contribution aims to supply more accurate information from the human preferences to the RL agent. This is not simply advice that the chosen option is better than the other one but also provides the degree of how much better it is than the other one.
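As a small illustration of how a scaling value maps onto the five preference conditions, consider the following Python sketch. The function name `preference_category` and the returned label strings are hypothetical and chosen only for this example; the convention that 1.0 favours the left segment follows the description of Algorithm 1.

```python
def preference_category(mu):
    """Map a preference scaling value in [0.0, 1.0] to one of five conditions.

    Convention (from the description of Algorithm 1): 1.0 means the left
    segment is absolutely preferred, 0.0 the right one, 0.5 indifference.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("preference scaling value must lie in [0.0, 1.0]")
    if mu == 1.0:
        return "left absolutely preferred"
    if mu == 0.0:
        return "right absolutely preferred"
    if mu == 0.5:
        return "indifferent"
    # Everything strictly between the margin values is a weak preference.
    return "left weakly preferred" if mu > 0.5 else "right weakly preferred"


if __name__ == "__main__":
    for value in (0.0, 0.2, 0.5, 0.87, 1.0):
        print(value, "->", preference_category(value))
```

A fixed-based interface can only produce the three margin values, whereas the scale-based interface also covers the two weak-preference bands.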

Fitting the Reward Function

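If the reward function estimate $\hat{r}$ is the predicted reward from the reward predictor, as shown in Figure 2-A, then we consider $\hat{r}$ as a latent factor explaining the human’s judgements and assume that the human’s probability of preferring a segment $\sigma^i$ depends exponentially on the value of the latent reward summed over the length of the clip, which follows Christiano’s design process [7] as follows.

$$\hat{P}[\sigma^1 \succ \sigma^2] = \frac{\exp \sum_t \hat{r}(o_t^1, a_t^1)}{\exp \sum_t \hat{r}(o_t^1, a_t^1) + \exp \sum_t \hat{r}(o_t^2, a_t^2)}$$

$\hat{r}$ minimises the cross-entropy loss between these predictions and the actual labels of human inputs, summed over the preference database $D$:

$$\mathrm{loss}(\hat{r}) = -\sum_{(\sigma^1, \sigma^2, \mu) \in D} \left[ \mu \log \hat{P}[\sigma^1 \succ \sigma^2] + (1 - \mu) \log \hat{P}[\sigma^2 \succ \sigma^1] \right]$$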
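The reward-fitting step can be illustrated with a short Python sketch. It assumes that the per-timestep predicted rewards of each segment are already available as plain lists and that the label is a single scalar in [0.0, 1.0] (1.0 favouring the left segment); the function names `preference_probability` and `scaled_cross_entropy` are hypothetical, and no neural network or optimiser is included.

```python
import math


def preference_probability(rewards_left, rewards_right):
    """Probability that the left segment is preferred, given predicted rewards.

    Depends exponentially on the predicted reward summed over each clip.
    """
    s_left, s_right = sum(rewards_left), sum(rewards_right)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(s_left, s_right)
    e_left, e_right = math.exp(s_left - m), math.exp(s_right - m)
    return e_left / (e_left + e_right)


def scaled_cross_entropy(database, eps=1e-12):
    """Cross-entropy between predicted preferences and scalar labels mu.

    database: iterable of (rewards_left, rewards_right, mu) triples.
    """
    loss = 0.0
    for rewards_left, rewards_right, mu in database:
        p_left = preference_probability(rewards_left, rewards_right)
        # Clip to avoid log(0) when the prediction saturates.
        p_left = min(max(p_left, eps), 1.0 - eps)
        loss -= mu * math.log(p_left) + (1.0 - mu) * math.log(1.0 - p_left)
    return loss


if __name__ == "__main__":
    db = [([2.0, 0.0], [0.0, 0.0], 0.87)]
    print(scaled_cross_entropy(db))
```

With a scalar label, a value such as 0.87 penalises the predictor less for a moderately confident prediction than a hard label of 1.0 or 0.0 would, which is how the degree of preference enters the training signal in this sketch.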

Optimising the Policy

After the reward function $\hat{r}$ computes rewards, we are left with a traditional RL problem. As the reward function $\hat{r}$ may be non-stationary, we prefer RL algorithms that are robust to changes in the reward function, such as policy gradient methods [9]. In this study, we use proximal policy optimisation (PPO) [18] to perform simulated robotics tasks and apply the same parameter settings as in Christiano’s work [7].

III-D Human-Demonstration for Deep RL

As a large number of human preference scaling values (generally over 1,000 preference inputs) are required to be stored in the preference database to train the fitting of predicted rewards, we assume that reducing the number of human preference inputs by training a preference estimator is worthwhile. As shown in Figure 2-B, we developed a preference estimator based on the previous human inputs and used a regression model with supervised learning that we call the human-demonstration model to predict some of the human preferences. We expect this human-demonstration model to not only reduce the number of human inputs but also maintain good performance without sacrificing the cumulative rewards.

The human-demonstration model is an extended version of the human preference scaling for deep RL presented in the previous Section III-C. In particular, the collected database of preference scaling values has the data format $(\sigma^1, \sigma^2, \mu)$, as we addressed. To construct a prediction model and fit the parameters of the human preference scaling estimator, the database is separated into two parts, the training dataset and testing dataset, and we implemented two types of data splitting: 50% of data for training and 50% of data for testing as well as 70% of data for training and 30% of data for testing. This indicates that 30-50% of human inputs will be replaced by the agent’s estimation.

In particular, we applied two prediction models for the human preference scaling estimator: linear regression and support vector regression (SVR) with a radial basis function (RBF) kernel. By using linear regression, the objective function for ordinary least squares with one preferred segment $x$ in the set is as follows:

$$\min_{w} \lVert x w - \hat{\mu} \rVert_2^2$$

where $\hat{\mu}$ is the estimated preference scaling value and $w$ is the coefficient.

SVR gives us the flexibility to define errors and finds an appropriate hyperplane in higher dimensions to fit the data. The objective function of SVR is to minimise the coefficients, i.e., the L2-norm of the coefficient vector, with constraints, as shown below:

$$\min \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad \lvert \hat{\mu}_i - w x_i \rvert \le \varepsilon$$

where $\hat{\mu}_i$ is the estimated preference scaling value, $w$ and $x_i$ are the coefficients and the preferred segment, respectively, and $\varepsilon$ is the maximum error called epsilon.

To evaluate the performance of the estimator, the mean squared error (MSE) is applied to measure the average of the squares of the errors, i.e., the average squared difference between the estimated preference scaling values $\hat{\mu}$ and the ground truth of preference scaling $\mu$.
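The split-fit-evaluate procedure for the preference estimator can be sketched in plain Python. This sketch rests on several assumptions that are not specified in the text: each segment pair is summarised by one hypothetical scalar feature, only the linear-regression variant is shown (with an intercept, fitted in closed form), and the names `fit_ols`, `mse`, `split_and_evaluate` and `human_input_count` are invented for illustration. An SVR with an RBF kernel from a standard library could be swapped in for `fit_ols`.

```python
def fit_ols(xs, ys):
    """Closed-form ordinary least squares for one feature plus an intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx if sxx else 0.0
    return w, mean_y - w * mean_x


def mse(y_true, y_pred):
    """Average squared difference between ground truth and estimates."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def split_and_evaluate(features, labels, train_fraction=0.7):
    """Fit on the first part of the database, predict the held-out part."""
    cut = round(len(features) * train_fraction)
    w, b = fit_ols(features[:cut], labels[:cut])
    # Clip predictions to the valid preference range [0.0, 1.0].
    preds = [min(1.0, max(0.0, w * x + b)) for x in features[cut:]]
    return preds, mse(labels[cut:], preds)


def human_input_count(total_labels, predicted_fraction):
    """Number of labels the human still provides when a share is predicted."""
    return total_labels - round(total_labels * predicted_fraction)


if __name__ == "__main__":
    feats = [i / 10 for i in range(10)]          # hypothetical features
    mus = [0.5 + 0.4 * x for x in feats]         # hypothetical labels
    predictions, error = split_and_evaluate(feats, mus, train_fraction=0.7)
    print(predictions, error)
    print(human_input_count(1400, 0.3), human_input_count(1400, 0.5))
```

For the 1,400-label setting, predicting 30% of the labels leaves 980 human inputs and predicting 50% leaves 700, which matches the two splits described above.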

IV Experiment

We implemented the existing models and our proposed models for deep RL and performed experiments in 5 scenarios from MuJoCo [20] with TensorFlow [1] under the OpenAI Gym platform [4]. The collected results were consolidated under the TensorBoard package from TensorFlow.

IV-A Robotic Control Scenarios

OpenAI Gym provides baseline environments to train the agent with RL algorithms [4]. MuJoCo is one of the popular continuous control tasks in OpenAI Gym with a physics engine to simulate the model-based control [21]. MuJoCo [20] contains diverse scenarios with robot control, where the agent moves different joints with continuous control instead of intermittent control to achieve the goal [19]. The agent will try to perform different types of actions to achieve the maximum cumulative reward value to reach the target goal. This could be challenging, as the MuJoCo environment involves high exploration dimensions for the agent.

| MuJoCo Scenario | Observation | Task Summary |
|---|---|---|
| Walker | (18, 6, 24) | A planar walker tries to roll forward and walk as fast as possible. The reward depends on the velocity and the torso height. |
| Hopper | (14, 4, 15) | A one-legged robot is required to move forward and attain a torso height as high as possible. The reward depends on the velocity and the torso height. |
| Swimmer | (10, 2, 13) | A robot tries to reach a random target by swimming. The reward will be given when the nose of the robot touches the random target. |
| Ant | (29, 8, 67) | A robot has 4 legs and aims to learn to walk as fast as possible. The reward is based on the velocity and the body height. |
| Cheetah | (18, 6, 17) | A robot has to learn to move forward as fast as possible. The reward is based on the velocity. |

TABLE I: List of 5 MuJoCo Scenarios

As shown in Table I, in this study, our testing environments include 5 scenarios, Walker, Hopper, Swimmer, Cheetah and Ant. The existing approaches, such as traditional RL (PPO) and RL from human preferences (RLHP), and our proposed models, such as RL from human preference scaling (RLHPS) and RL from human preference scaling with demonstrations (RLHPS with Demo), are applied to compare their performance in these 5 scenarios.

IV-B Settings of the Parameters

Generally, in a policy gradient strategy such as PPO, the agent starts with the initial policy, interacts with the environment, obtains a predicted reward from human feedback or from the pre-defined reward function, and then uses the reward to improve the policy. Here, we need to know how many transitions (sequences of states, rewards, and actions) the agent should gather before updating the policy, and how to use these transitions to update to a new policy.

Firstly, we need to deal with experience collection (horizon, minibatches, epochs) before updating the policy. For example, PPO collects trajectories up to the time horizon ($T$) limit and then performs minibatch stochastic gradient descent (SGD) updates on all collected trajectories for a specified number of epochs. Secondly, to update the new policy from the old policy, PPO uses a surrogate loss function to keep the step from the old policy to the new policy within a safe range, where we need to consider the discount factor Gamma ($\gamma$) and the GAE parameter ($\lambda$). In addition, the remaining parameters are general hyperparameters that can be used in many deep learning experiments, such as the learning rate, number of steps, and number of hidden units. In this study, the experiments in MuJoCo scenarios are trained for a fixed number of timesteps for 10 iterations. All MuJoCo scenarios are trained under the PPO strategy with or without human feedback with the hyperparameters stated in Table II.

| Hyperparameter | Value |
|---|---|
| Horizon ($T$) | 2048 |
| Minibatch size | 64 |
| Number of epochs | 10 |
| Gamma ($\gamma$) | 0.99 |
| GAE parameter ($\lambda$) | 0.95 |
| Adam step size | |
| Learning rate | |
| Number of steps | |
| Number of hidden units | 64 |

TABLE II: The Hyperparameters of PPO used in MuJoCo Scenarios
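To make the role of the horizon, minibatch size, epochs, $\gamma$ and $\lambda$ concrete, the following Python sketch counts the SGD updates performed per rollout and computes generalised advantage estimates. It uses only the values listed in Table II; the dictionary keys and the function names `sgd_updates_per_rollout` and `gae_advantages` are hypothetical, and episode terminations inside a rollout are ignored for simplicity.

```python
# Values taken from Table II; the key names are illustrative only.
PPO_CONFIG = {
    "horizon": 2048,
    "minibatch_size": 64,
    "epochs": 10,
    "gamma": 0.99,
    "gae_lambda": 0.95,
}


def sgd_updates_per_rollout(cfg):
    """Minibatches per pass over the rollout, times the number of epochs."""
    return (cfg["horizon"] // cfg["minibatch_size"]) * cfg["epochs"]


def gae_advantages(rewards, values, last_value, gamma, lam):
    """Generalised advantage estimation over one rollout (no terminations)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    # Work backwards so each step can reuse the advantage of its successor.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    print(sgd_updates_per_rollout(PPO_CONFIG))
    print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                         PPO_CONFIG["gamma"], PPO_CONFIG["gae_lambda"]))
```

With a horizon of 2048 and minibatches of 64, each rollout is split into 32 minibatches, so 10 epochs correspond to 320 gradient updates before new experience is collected. In the preference-based setups, the `rewards` passed to the advantage computation would be the predicted rewards rather than the environment rewards.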

IV-C Baselines

Traditional RL

The baseline of each scenario is training with the traditional RL (PPO) without any human involvement. The agent has to learn based on the scenario goal only from the rewards they received. The learning performance is the same as the traditional RL process and relies on the design of the reward function of each scenario. The details of the reward design of each scenario are specified in Table I. Our goal for this setup is to set the baseline algorithm and evaluate the performance in each scenario to check what reward values could be achieved without human input.

RL from Human Preferences (RLHP)

We consider RLHP as another baseline that contains the basic human preferences in the experiment. This is used to replicate the results obtained under advice from a human [7]. The experimental setup relies on a preference interface to ask for human preferences, and the user interface shows the rewards and the video clip information to let the human input a large number of preferences. The interface allows the user to input either a left, right or equal option.

V Results

In this study, 5 scenarios (Walker, Hopper, Swimmer, Cheetah and Ant) are our experimental environments to evaluate the training performance among 4 types of RL models by comparing the two baselines, traditional RL without human preferences (PPO) and RLHP, with our proposed RLHPS and RLHPS with Demo. The RL algorithms involved in this study are summarised in Table III, where human preferences require input of 700 to 1,400 labels, which corresponds to less than 0.01% of the training timesteps.

| RL Type | Algorithm | No. of Inputs | No. of Inputs | Description |
|---|---|---|---|---|
| RL without Human Inputs | Traditional RL (PPO) | N/A | N/A | |
| RL with Human Inputs (Preferences) | RLHP | 1,400 | 700 | RL from human preferences. |
| | RLHPS (ours) | 1,400 | 700 | RL from human preference scaling. |
| | RLHPS with Demo (ours) | 980 | 700 | RL from human preference scaling with demonstrations; 30-50% of 1,400 inputs (420-700 inputs) are generated from the estimated preferences. |

TABLE III: Summary of RL Algorithms

V-A RL from Human Preference Scaling (RLHPS)

Based on the experiments in the 5 MuJoCo scenarios, the cumulative reward values of the two baselines, the traditional RL (PPO) and RLHP, and of our proposed approach (RLHPS) are compared, as shown in Figure 3. This figure shows the training of our agent by learning with two amounts of human preference inputs (700 and 1,400 labels, amounting to less than 0.01% of the training timesteps) for RLHP and our proposed RLHPS. The cumulative reward values obtained with our proposed RLHPS are higher than those from RLHP or PPO in every scenario except Hopper, where all the RL algorithms achieve similar reward values by the end of training. From another perspective, the use of 1,400 human input labels could generally achieve higher rewards than 700 human input labels, suggesting that more human effort may be beneficial to training a robust RL agent.

Cumulative Reward Values

Particularly in the Walker scenario, our proposed RLHPS can achieve an approximately 1,500 higher reward value than the traditional RL PPO setup and an approximately 1,000 higher reward value than RLHP for the case of 1,400 human preference inputs. For the Swimmer scenario, the reward learning from RLHPS with the 1,400-label setup can achieve an approximately 350 reward value at the end of the experiment, while the RLHP with the 1,400-label setup could only achieve an approximately 300 reward value at the end of the experiment. Both human preference setups (RLHP and RLHPS) are much better than the traditional RL setup (PPO), as PPO only achieves a 150 reward value. In terms of the Cheetah scenario, our proposed RLHPS can also acquire higher rewards, an approximately 4,000 reward value, compared to the range of 3,000-3,500 when trained by RLHP and PPO.

Regarding the special case of the Ant scenario, it requires complete three-dimensional movement and observation for the robot to learn. The continuous movement of the agent is very difficult, as the ant robot has to find a way to balance the body and walk. From our observation, during the beginning period of training, the robot does not know how to walk and balance and always flips over, which causes the reward values in the beginning period to always be negative. The experimental results show that the scale-based preferences, shown in red, could yield a good performance compared to the fixed preferences. This outcome is good evidence that our proposed RLHPS can obtain a significant improvement in attaining higher rewards (either with 700 or 1,400 labels) and quickly achieve positive reward values compared to RLHP or PPO.

Fig. 3: Experimental Results of PPO, RLHP, and RLHPS in 5 MuJoCo Scenarios

Instant Reward Distributions

For our proposed RLHPS, we also investigated the instant reward distributions between the starting and finishing training periods in the 5 MuJoCo scenarios. The starting training period is defined as the period where the human has not input a preference label, generally within the initial 300 timesteps. The finishing training period is the final 2,500-3,000 timesteps when the human has input 1,400 preference labels and the RL agent is approaching completion of the training process.

As shown in Figure 4, the instant reward distributions are sampled for these two training periods (starting vs. finishing) in the 5 MuJoCo scenarios. The blue data distribution represents the beginning time of the training before acquiring any preferences, so most of the instant rewards are still negative. The grey data distribution shows the situation of the ending period of the training, where the RL agent can achieve more positive rewards from human preferences. Our findings from the instant reward distributions confirm that our proposed RLHPS has a positive effect on RL, and the agent could use self-management to learn values of human preference scaling.

Fig. 4: RLHPS: Instant Reward Distributions Between Starting and Finishing Training Periods in the 5 MuJoCo Scenarios

In summary, our findings show that the setting with human preference scaling outperforms the baseline with fixed-based human preferences in four of the five scenarios (all except Hopper, where the results are similar), which supports that our scaling model is beneficial for the agent to learn in higher dimensional environments, as our setting did reflect the natural human intentions. Additionally, the comparison of the instant reward distributions indicates that RL from human preference scaling trains well, as instant reward values are shifted to be more favourable than those in the initial stages without any preferences received by the agent.

V-B RL from Human Preference Scaling with Demonstrations (RLHPS with Demo)

Fig. 5: Experimental Results of PPO, RLHPS, and RLHPS with Demo in the 5 MuJoCo Scenarios

After the experiments on RLHPS, we implemented the estimator interface so that it linked the preference interface and the human based on our proposed approach - RLHPS with Demo. As addressed in the previous Section III-D, we performed two types of data splitting: 30% and 50% of data from the preference database for testing. As 1,400 human preference inputs could achieve better performance across the different scenarios according to the previous experiment, we keep the same amount of labels - 1,400 - to test the performance of RLHPS with Demo. As shown in Figure 5, the experiment employing RLHPS with Demo (30%) indicates that we will reduce 30% of human preference inputs, so humans only need to input 70% of 1,400 (980) preference labels, and the remaining 30% of the 1,400 preferences will be predicted by the regression model. The experiment employing RLHPS with Demo (50%) aims to provide further observations of the influence of reducing the number of human preferences on the cumulative reward values. Similarly, the experiment employing RLHPS with Demo (50%) indicates that 50% of the 1,400 (700) preference labels are input by the human, while 50% of the 1,400 preferences are predicted by linear regression or the SVR model with the smallest MSE, as shown in Table IV, which presents the prediction accuracies using the averaged MSE trained by linear regression or the SVR (with RBF kernel) model in the 5 scenarios.

Cumulative Reward Values

Figure 5 generally indicates that RLHPS with Demo including 30% estimated human preferences achieves similar or superior cumulative reward values compared to the other approaches: RLHPS with Demo including 50% estimated human preferences, and RLHPS (without Demo), which excludes the human-demonstration stage. RLHPS with Demo including 50% estimated human preferences, where we halve the number of human preference inputs, only achieves a performance similar to that of PPO, far below that of RLHPS.

In the Walker and Hopper scenarios, RLHPS with Demo including 30% estimated human preferences shows the highest cumulative reward values of all the approaches. In the Swimmer and Ant scenarios, RLHPS with Demo and RLHPS without Demo perform similarly, suggesting that we can use 30% fewer human inputs to achieve similar outcomes. Only in the Cheetah scenario was RLHPS with Demo unable to achieve a performance comparable to that of RLHPS, suggesting that the estimated preferences may not help to guide walking well in the Cheetah scenario.

            Linear Regression                SVR (RBF kernel)
Scenario    Mean      Standard Deviation     Mean      Standard Deviation
Walker      0.0514    0.0752                 0.0099    0.0184
Hopper      0.0688    0.0383                 0.0122    0.0716
Swimmer     0.0239    0.0397                 0.0737    0.0665
Ant         0.0356    0.0846                 0.0637    0.1210
Cheetah     0.0607    0.0769                 0.0725    0.0870
TABLE IV: MSEs of Preference Estimations
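To make the "smallest MSE" criterion concrete, the following minimal sketch picks, for each scenario, the model with the lower mean MSE in Table IV. The numbers are copied from the table. The helper name `select_model` and the choice to compare mean MSE only (ignoring the standard deviation) are assumptions made for illustration.

```python
# Mean MSE per scenario from Table IV: (linear regression, SVR with RBF kernel).
MEAN_MSE = {
    "Walker": (0.0514, 0.0099),
    "Hopper": (0.0688, 0.0122),
    "Swimmer": (0.0239, 0.0737),
    "Ant": (0.0356, 0.0637),
    "Cheetah": (0.0607, 0.0725),
}


def select_model(linear_mse: float, svr_mse: float) -> str:
    """Return the name of the model with the smaller mean MSE (hypothetical helper)."""
    return "linear regression" if linear_mse <= svr_mse else "SVR (RBF kernel)"


selected = {scenario: select_model(*mse) for scenario, mse in MEAN_MSE.items()}

for scenario, model in selected.items():
    print(f"{scenario}: {model}")
```

Under this criterion, the SVR model would be selected for Walker and Hopper, and linear regression for Swimmer, Ant, and Cheetah.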

V-C Observation of Testing Behaviours

We released a video to demonstrate the behaviours of the trained agent in all MuJoCo scenarios, including walker, hopper, swimmer, ant, and cheetah. Generally, we can see that our proposed methods, RLHPS and RLHPS with Demo, produce more reasonable behaviours that fit the expected goal than RLHP or PPO. There are some exceptions. In the walker scenario, the goal is for the agent to roll forward and walk, but by observing the behaviour of the trained agent, we found that it was very slow to achieve this goal. In the ant scenario, our RLHPS with Demo seems to take more time to learn to walk around than RLHP or PPO.

V-D Limitations

This study still has some limitations. Our findings show that some human preference labels can be generated by the prediction model, which reduces the number of human inputs without sacrificing the training performance or the reward return. However, there is a limit to this reduction. In our experiments, when half of the human preference labels are predicted, the training performance degrades dramatically. Therefore, in the current settings, replacing up to 30% of the human inputs with estimated preference inputs avoids this negative impact and maintains the training performance.

Furthermore, the frame selections are made on a random basis, which may influence the human preferences for some cases close to the equal option. We suggest that it could be refined by having an intelligent selection model select segments that have diverse rewards or at time points when the prediction model is unable to judge between two segments. In addition, the normalisation step of our human preference scaling model could be further investigated since the distribution of rewards could affect preference levels for the prediction model.
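As an illustration of the first suggestion, the following minimal sketch chooses the pair of segments whose total rewards differ the most, so that the human is not asked about near-equal cases. It is one possible heuristic under stated assumptions and not a method evaluated in this study: the per-step rewards are assumed to be available for each candidate segment (for example, from the learned reward estimator), and the function name `most_diverse_pair` and the segment identifiers are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def most_diverse_pair(segment_rewards: Dict[str, List[float]]) -> Tuple[str, str]:
    """Return the two segment ids whose summed rewards differ the most."""
    if len(segment_rewards) < 2:
        raise ValueError("at least two segments are required")
    # Total reward of each candidate segment.
    totals = {seg_id: sum(rewards) for seg_id, rewards in segment_rewards.items()}
    # Compare every unordered pair and keep the one with the largest gap.
    return max(
        combinations(sorted(totals), 2),
        key=lambda pair: abs(totals[pair[0]] - totals[pair[1]]),
    )


# Hypothetical per-step rewards for three candidate segments.
segments = {
    "a": [0.1, 0.2, 0.1],
    "b": [0.9, 1.0, 0.8],
    "c": [0.4, 0.5, 0.4],
}
print(most_diverse_pair(segments))
```

A selection model that targets the points where the prediction model cannot judge between two segments would instead rank pairs by prediction uncertainty, which this sketch does not cover.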

VI Conclusion

Our study proposed a human preference scaling model with demonstrations that aims to effectively solve complex RL tasks and achieve higher cumulative rewards in simulated robot locomotion - MuJoCo games. We attempted to optimise RL from human preferences in two ways: enhancement of the human preference details by scaling preference levels and reduction of the number of human preference inputs by replacing some inputs with preference labels estimated using a prediction model.

Our two developed models, RLHPS and RLHPS with Demo, achieve higher cumulative reward values than the existing approaches PPO and RLHP, and RLHPS with Demo reduces the cost of human inputs by up to 30%. To present the flexibility of our approach, we released a video showing comparisons of the behaviours of agents trained with different types of human inputs.

Given the high scalability of deep RL, we believe that our proposed approaches, RLHPS and RLHPS with Demo, could help the agent to learn natural human preferences with fewer inputs and thereby enhance the training performance. Our contribution to RL-based robotic movement may bring agent behaviour closer to human thinking in more complex situations.


The code of this paper can be found at GitHub


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp. 265–283. Cited by: §IV.
  • [2] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in ai safety. arXiv preprint arXiv:1606.06565. Cited by: §I.
  • [3] K. Bogert, J. F. Lin, P. Doshi, and D. Kulic (2016) Expectation-maximization for inverse reinforcement learning with hidden data. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 1034–1042. Cited by: §I.
  • [4] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §IV-A, §IV.
  • [5] Z. Cao and C. Lin (2019) Reinforcement learning from hierarchical critics. arXiv preprint arXiv:1902.03079. Cited by: §I.
  • [6] Z. Cao, K. Wong, Q. Bai, and C. Lin (2020) Hierarchical and non-hierarchical multi-agent interactions based on unity reinforcement learning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2095–2097. Cited by: §I.
  • [7] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei (2017) Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4299–4307. Cited by: §I, §I, §II-C, §II-D, §II-D, §III-A, §III-C, §III-C, §III-C, §IV-C.
  • [8] J. Fürnkranz, E. Hüllermeier, W. Cheng, and S. Park (2012) Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning 89 (1-2), pp. 123–156. Cited by: §II-A, §II-B.
  • [9] J. Ho, J. Gupta, and S. Ermon (2016) Model-free imitation learning with policy optimization. In International Conference on Machine Learning, pp. 2760–2769. Cited by: §III-C.
  • [10] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne (2017) Imitation learning: a survey of learning methods. ACM Computing Surveys (CSUR) 50 (2), pp. 1–35. External Links: ISSN 0360-0300 Cited by: §I.
  • [11] Z. Ke, Z. Li, Z. Cao, and P. Liu (2020) Enhancing transferability of deep reinforcement learning-based variable speed limit control using transfer learning. IEEE Transactions on Intelligent Transportation Systems. Cited by: §I.
  • [12] D. Kreps (1988) Notes on the theory of choice. Westview press. Cited by: §II-B, §II-D.
  • [13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §I.
  • [14] A. Y. Ng and S. J. Russell (2000) Algorithms for inverse reinforcement learning. In Icml, Vol. 1, pp. 663–670. Cited by: §I.
  • [15] S. Nikolaidis, S. Nath, A. D. Procaccia, and S. Srinivasa (2017) Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the 2017 ACM/IEEE international conference on human-robot interaction, pp. 323–331. Cited by: §I.
  • [16] S. Russell (2016) Should we fear supersmart robots?. Scientific American 314 (6), pp. 58–59. Cited by: §I.
  • [17] Y. Schroecker and C. L. Isbell (2017) State aware imitation learning. In Advances in Neural Information Processing Systems, pp. 2911–2920. Cited by: §I.
  • [18] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §III-C.
  • [19] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, and A. Lefrancq (2018) Deepmind control suite. arXiv preprint arXiv:1801.00690. Cited by: §IV-A.
  • [20] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §IV-A, §IV.
  • [21] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. External Links: ISBN 1467317365 Cited by: item 2, §IV-A.
  • [22] A. Wilson, A. Fern, and P. Tadepalli (2012) A bayesian approach for policy learning from trajectory preference queries. In Advances in neural information processing systems, pp. 1133–1141. Cited by: §III-B, §III-B.
  • [23] C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz (2017) A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research 18 (1), pp. 4945–4990. Cited by: §II-B.
  • [24] B. Woodworth, F. Ferrari, T. E. Zosa, and L. D. Riek (2018) Preference learning in assistive robotics: observational repeated inverse reinforcement learning. In Machine Learning for Healthcare Conference, pp. 420–439. Cited by: §II-A.
  • [25] F. Xiao, Z. Cao, and A. Jolfaei (2020) A novel conflict measurement in decision making and its application in fault diagnosis. IEEE Transactions on Fuzzy Systems. Cited by: §I.