RL agents Implicitly Learning Human Preferences

02/14/2020 ∙ by Nevan Wichers, et al. ∙ Google

In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. A classifier trained to predict whether a simulated human's preferences are fulfilled, based on the activations of an RL agent's neural network, achieves 0.93 AUC. A classifier trained on the raw environment state achieves only 0.8 AUC. Training the classifier on the RL agent's activations also does much better than training on activations from an autoencoder. The human preference classifier can be used as the reward function of an RL agent to make the agent more beneficial for humans.




1 Introduction

In most RL applications it’s assumed that there is a given reward function the agent should optimize. However, in the real world, it is less clear what an agent should be rewarded for, and the wrong choice of reward function could be harmful. We would like RL reward functions to be informed by what humans want. Learning from human preferences (Christiano et al., 2017) is a possible solution. It trains a neural network to map states of the environment to how much human preferences are satisfied. That neural network is used as a reward function for an RL agent, to train the agent to satisfy human preferences. One problem the authors observed is that the agent exploited inaccuracies in the reward network if the reward network wasn’t updated frequently enough. The learning from human preferences method would benefit from a reward function that generalizes better from less data.

The good news is that an RL agent in the real world will probably have to learn human preferences implicitly in order to achieve its goal, just as humans have to understand each other’s preferences in order to achieve their goals. Even if the RL agent is misaligned with human values, it will still have to understand human values in order to determine when humans will shut it down, so that it can take precautions. These preferences will be represented by weights in the agent’s neural network. Humans have evolved to represent other people’s emotions in their brains (Gallese & Goldman, 1998), so an RL agent may do so as well.

These human preferences can be extracted from the agent’s neural network to give a mapping between states of the environment and how much human preferences are satisfied. This will probably be better than learning human preferences from human feedback alone, because the RL agent will have seen many more states of the world than a human can label. So the agent’s understanding of human preferences will likely generalize better.

An iterative bootstrapping approach can be used to give the agent a reward function aligned with human values. First, the agent is trained with an initial, imperfect reward function. Once the agent implicitly learns about human preferences, the reward function can be replaced with one based on the extracted human preferences. Then the agent can be trained further on this new reward function, and the process can repeat.
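The bootstrapping loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `train_agent`, `extract_preference_model`, and the `rounds` parameter are hypothetical placeholders supplied by the caller.

```python
from typing import Any, Callable, Tuple

# A reward function maps an environment state to a scalar reward.
RewardFn = Callable[[Any], float]


def bootstrap_reward(
    initial_reward: RewardFn,
    train_agent: Callable[[RewardFn], Any],
    extract_preference_model: Callable[[Any], RewardFn],
    rounds: int = 3,
) -> Tuple[Any, RewardFn]:
    """Alternate between training an agent and re-deriving its reward.

    All callables are hypothetical stand-ins: `train_agent` runs RL under
    a reward function, and `extract_preference_model` fits a human
    preference predictor on the trained agent's activations.
    """
    reward_fn = initial_reward
    agent = None
    for _ in range(rounds):
        # Train (or continue training) under the current reward function.
        agent = train_agent(reward_fn)
        # Replace the reward with the preferences extracted from the agent.
        reward_fn = extract_preference_model(agent)
    return agent, reward_fn
```

How many rounds to run, and whether each round continues training or restarts it, is left open here because the text does not specify it.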

This work is relevant for ML in the real world because

  • We give evidence that an RL agent in the real world will implicitly learn human preferences, whereas an RL agent in an environment without at least simulated humans won’t learn human preferences.

  • Our method of extracting a reward function based on human preferences would help overcome the challenge of designing a safe reward function in real world RL.

We make the following contributions:

  • We show that RL agents implicitly learn the preferences of humans in their environment.

  • We explore different methods to extract the human preferences from the RL agent.

  • We propose the iterative bootstrapping approach explained above.

2 Method

We focus on exploring different methods for extracting a model of human preferences from a trained RL agent. Christiano et al. (2017) already show that if we have a model of human preferences it can be used as a reward function.

We assume that we have a small amount of supervised data. In our experiments, the supervised data consists of pairs of an environment state and a boolean indicating whether human preferences are satisfied. However, our approaches can work with other forms of supervised data.

We explore the following techniques to extract human preferences from the hidden activations of the trained RL model.

Using a single activation

This technique finds the activation that achieves the highest AUC when used on its own to predict the human preferences in the supervised data, and uses that activation as the human preference predictor. It works if a single activation represents how much human preferences are satisfied.
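A minimal sketch of this selection step, using only the Python standard library. The function names are illustrative, and the handling of anti-correlated units (flipping the sign when the AUC is below 0.5) is an assumption, since the text does not say how such units are treated.

```python
from typing import Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_single_activation(
    activations: Sequence[Sequence[float]], labels: Sequence[int]
) -> Tuple[int, float, int]:
    """Return (unit index, AUC, sign) for the most predictive hidden unit.

    `activations` holds one row per example and one column per hidden unit.
    sign is -1 when the unit is anti-correlated with the label (assumption).
    """
    best = (0, 0.0, 1)
    for j in range(len(activations[0])):
        column = [row[j] for row in activations]
        score = auc(column, labels)
        sign = 1
        if score < 0.5:
            # Negating the activation turns an AUC of a into 1 - a.
            score, sign = 1.0 - score, -1
        if score > best[1]:
            best = (j, score, sign)
    return best
```

The pairwise AUC above is quadratic in the number of examples, which is acceptable for the small labeled sets discussed here.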

Neural network

In this technique, a neural network is trained on the supervised data to predict if the human preferences are satisfied given the agent’s hidden activations as input. This will help if human preferences are represented in a more complicated way.
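As a rough illustration of training a predictor on hidden activations, the sketch below fits a logistic regression probe by batch gradient descent. This is a deliberately simpler stand-in, not the network used in the paper, whose architecture is not given here; the function names, learning rate, and epoch count are all assumptions.

```python
import math
from typing import List, Sequence


def predict_proba(w: Sequence[float], x: Sequence[float]) -> float:
    """Probability that human preferences are satisfied for activations x.

    `w` holds one weight per activation followed by the bias.
    """
    logit = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    return 1.0 / (1.0 + math.exp(-logit))


def train_probe(
    activations: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.1,
    epochs: int = 200,
) -> List[float]:
    """Fit logistic regression on (activations, label) pairs.

    Returns the weights followed by the bias as the last element.
    """
    n, d = len(activations), len(activations[0])
    w = [0.0] * (d + 1)  # last entry is the bias
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, y in zip(activations, labels):
            # Derivative of the log-loss with respect to the logit.
            err = predict_proba(w, x) - y
            for j in range(d):
                grad[j] += err * x[j] / n
            grad[d] += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w
```

A multi-layer network would replace the single linear map with stacked layers, but the training data and the target are the same.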

2.1 Unsupervised

Since there is only a small amount of labeled data available relative to the number of states the RL agent has visited, unsupervised learning techniques may be helpful.


Clustering

In this technique, a clustering algorithm is applied to the hidden activations. Then each cluster is labeled by a majority vote of the labeled items in it. Note that this technique only works if the human preference data is discrete rather than continuous. George et al. (2017) show that clustering can help with transfer learning.
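The majority-vote labeling step can be sketched as below. The cluster assignments are assumed to come from some clustering algorithm run beforehand; `label_clusters` and the use of `None` for unlabeled examples are illustrative choices, not part of the original method description.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional, Sequence


def label_clusters(
    cluster_ids: Sequence[int], labels: Sequence[Optional[int]]
) -> Dict[int, int]:
    """Map each cluster id to the majority label among its labeled members.

    `labels[i]` is None for unlabeled examples. Clusters with no labeled
    members are left out of the result. On a tie, the label that was
    encountered first wins.
    """
    votes = defaultdict(Counter)
    for cid, y in zip(cluster_ids, labels):
        if y is not None:
            votes[cid][y] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}
```

A new state is then classified by assigning its activations to a cluster and looking up that cluster's label.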

Dimensionality reduction

This involves applying dimensionality reduction methods such as PCA or NMF to the hidden activations of the RL agent. After the dimensionality is reduced, any of the techniques described above can be applied on the data with reduced dimensions. OpenAI et al. (2019) had success applying NMF to a neural network trained to solve a Rubik’s cube.

3 Related work

Multi agent RL

Hernandez-Leal et al. (2019) show that training an RL agent with the auxiliary task of modeling another agent improves performance in multi-agent environments. Our work shows that the agent learns to model others even if it isn’t trained to.

Raileanu et al. (2018) show that if the other agent is sufficiently similar, the agent itself can be used to model the other agent. Our work doesn’t assume that the human is similar to the agent.

RL interpretability

Our work seeks to interpret the hidden activations of an RL agent. Other techniques have also been explored for interpreting RL agents. Greydanus et al. (2018) and Zahavy et al. (2016) apply interpretability techniques such as saliency maps and t-SNE to RL agents. Our work applies interpretability techniques, among others, with the goal of extracting human preferences from the agent.

Theory of mind

Rabinowitz et al. (2018) train a supervised learning algorithm to predict the actions and goals of an RL agent. Our work can be viewed as showing that an RL agent learns an implicit theory of mind without being trained to.

4 Environment

Figure 1: Example of the environment

We use a grid world environment with a simulated human to compare different methods. The agent is rewarded for pressing a button to collect apples. If the agent has taken too many apples, the simulated human becomes angry and starts taking them from the agent, so the agent loses reward. The way for the agent to prevent this is to press another button to activate an electric fence. The electric fence costs reward at every step, and scares the human away if it tries to take an apple while the fence is active. The fence automatically deactivates after this.

The environment is designed so that the optimal strategy for the agent is to activate the electric fence only when it knows the human is about to become angry and start taking apples. If the agent activates the fence too soon, it loses too much reward because of the fence. If it activates the fence too late, it loses reward from the human taking apples. Even though the environment is simple, it mirrors the real world in that the agent has to understand whether the human’s preferences are satisfied, or whether the human is angry, in order to do well.

The top part of the environment shows the human and the apples that the agent has collected. The bottom part of the environment shows the agent and the buttons it can click. The agent’s actions are to move left, move right, or click the button under it. The environment is represented to the RL agent as an image. The human is shown as horizontal stripes and the collected apples as vertical stripes, against a checkerboard background. The colors and the locations of the buttons and apples are randomized to make the environment more diverse and realistic. The episode terminates after a random number of steps, and the time-steps remaining are represented in the state.

5 Baselines

Training from the environment state

This uses the same methods described above, but with the environment state as input instead of the activations. We trained a CNN instead of a DNN in this baseline. This demonstrates whether extracting the preferences from the agent instead of the environment state helps.

Agent not penalized when human takes apples

The environment is the same except the RL agent isn’t penalized when the human takes apples. In this environment, the agent has no reason to care about the simulated human’s preferences. If training a human preference model is useful simply because the network extracts useful features, a preference model trained on this agent will work just as well. However, if it’s important that the agent has extracted the human preferences instead of only useful features, this won’t work as well.

Training from Q values

This uses the same methods described above, but with the Q values of the agent as input instead of the agent’s hidden activations.


Autoencoder

We train an autoencoder to reduce the dimensionality of the image. Then we train a network to predict the human preferences from the hidden activations in the middle of the autoencoder.

6 Experiments

We trained DQN agents with different hyperparameters and used the network from the agent with the highest reward for these experiments. We then collected a random sample of activations from the last hidden layer of the RL agent to use to train our methods. Each method is trained to predict 0 if the human was scared away by the fence, or is angry, and 1 otherwise.

For each dataset and method we did a random search over hyperparameters using 50 training examples and 100 eval examples. We ran each hyperparameter combination 4 times with different training and eval splits and averaged the results, so that they wouldn’t depend too much on which set of training examples was chosen. We also used early stopping in each of the techniques to prevent overfitting. The unsupervised methods were trained using about 20,000 unlabeled examples. We trained the model with the best hyperparameters 10 times with 50 training examples and 500 validation examples. For the models trained on activations, we repeated this process 4 times on the 4 best performing RL agents, to make sure that the methods aren’t sensitive to noise in the RL training process. The average of the results is reported in Table 1.
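The repeated-split averaging used during the hyperparameter search can be sketched as follows. The split sizes and repeat count are taken from the text; `fit_and_score`, the `seed` argument, and the function name are hypothetical, and the scoring callable is expected to train one model and return its eval AUC.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# One labeled example: (input features, preference label).
Example = Tuple[Sequence[float], int]


def mean_score_over_splits(
    data: Sequence[Example],
    fit_and_score: Callable[[List[Example], List[Example]], float],
    n_train: int = 50,
    n_eval: int = 100,
    n_repeats: int = 4,
    seed: int = 0,
) -> float:
    """Average a score over several random, disjoint train/eval splits."""
    if n_train + n_eval > len(data):
        raise ValueError("not enough examples for the requested split sizes")
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = list(data)
        rng.shuffle(shuffled)
        train = shuffled[:n_train]
        held_out = shuffled[n_train:n_train + n_eval]
        scores.append(fit_and_score(train, held_out))
    return mean(scores)
```

Each hyperparameter combination would be passed through this function once, and the combination with the highest mean score kept.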

We tuned the hyperparameters of the autoencoder to reduce the dimensions of the image and decode it with high accuracy. The reconstruction is good enough that one can count the number of apples in the environment. We trained the autoencoder with 4 different random initializations. For each of these we found the best hyperparameters for the neural network trained on the hidden activations. The result from the autoencoder that got the highest AUC is reported in Table 1.

7 Results


Input                      NN     Single   Reduce + NN   Reduce + single
Activations (ours)         0.93   0.88     0.92          0.87
Image                      0.8    N/A      0.54          0.5
Q values                   0.79   0.63     N/A           N/A
Activations (no penalty)   0.79   0.73     0.77          0.76
Image (no penalty)         0.78   N/A      0.56          0.5
Autoencoder                0.6    N/A      N/A           N/A
Table 1: AUC of the experiments. No penalty means the RL agent wasn’t penalized when the human took an apple. Reduce means that a dimensionality reduction technique was applied before the other method. Single means that the single input with the best AUC was used. Activations means the network was trained on the RL agent’s activations. Image means the network was trained on the raw state of the environment. N/A means that a combination didn’t make sense or wasn’t evaluated.

Clustering methods did not give promising initial results, so we don’t show results for them here.

Using the activations got 0.13 more AUC than the best of the other inputs (0.93 vs 0.8). It’s also noteworthy that a single activation from the RL agent (0.88) outperformed every other input. This suggests that the neural network represents most of the information about human preferences in a single neuron. Dimensionality reduction techniques didn’t improve performance. Training from the hidden activations of the autoencoder didn’t do well (0.6).

Training from the image gets about the same result whether or not the agent is penalized for the human taking an apple (0.8 vs 0.78), suggesting that the two datasets are about equally difficult.

8 Conclusion

Our results suggest that agents implicitly learn about the preferences of humans in their environment, and that extracting those preferences can make predictors more data efficient. We think that this will help agents perform better for humans in the real world, since their reward function will be tied to their robust understanding of human preferences.

In future work we would like to validate our method in more complex environments. We would also like to train an agent using the human preference predictor as the reward function.


References

  • Christiano et al. (2017) Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NIPS, 2017.
  • Gallese & Goldman (1998) Vittorio Gallese and Alvin Goldman. Mirror neurons and the simulation theory of mind-reading. Trends in Cognitive Sciences, 2(12):493–501, dec 1998. ISSN 1364-6613. doi: 10.1016/S1364-6613(98)01262-5. URL https://doi.org/10.1016/S1364-6613(98)01262-5.
  • George et al. (2017) Daniel George, Hongyu Shen, and E. A. Huerta. Deep transfer learning: A new deep learning glitch classification method for advanced ligo. ArXiv, abs/1706.07446, 2017.
  • Greydanus et al. (2018) Sam Greydanus, Anurag Koul, Jonathan Dodge, and Alan Fern. Visualizing and understanding atari agents. In International Conference on Machine Learning, pp. 1787–1796, 2018.
  • Hernandez-Leal et al. (2019) Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. Agent modeling as auxiliary task for deep reinforcement learning. In AIIDE, 2019.
  • OpenAI et al. (2019) OpenAI, Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider, Nikolas Tezak, Jerry Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba, and Lei Zhang. Solving rubik’s cube with a robot hand, 2019.
  • Rabinowitz et al. (2018) Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. Machine theory of mind. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4218–4227, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/rabinowitz18a.html.
  • Raileanu et al. (2018) Roberta Raileanu, Emily Denton, Arthur Szlam, and Robert Fergus. Modeling others using oneself in multi-agent reinforcement learning. In Andreas Krause and Jennifer Dy (eds.), 35th International Conference on Machine Learning, ICML 2018, volume 10, pp. 6779–6788. International Machine Learning Society (IMLS), 1 2018.
  • Zahavy et al. (2016) Tom Zahavy, Nir Ben Zrihem, and Shie Mannor. Graying the black box: Understanding dqns. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1899–1908. JMLR.org, 2016.