Affect-based Intrinsic Rewards for Learning General Representations

12/01/2019 · by Dean Zadok, et al. · Technion, Microsoft

Positive affect has been linked to increased interest, curiosity and satisfaction in human learning. In reinforcement learning, extrinsic rewards are often sparse and difficult to define; intrinsically motivated learning can help address these challenges. We argue that positive affect is an important intrinsic reward that effectively helps drive exploration, which is useful for gathering the experiences critical to learning general representations. We present a novel approach leveraging a task-independent intrinsic reward function trained on spontaneous smile behavior that captures positive affect. To evaluate our approach, we trained several downstream computer vision tasks on data collected with our policy and several baseline methods. We show that the policy based on intrinsic affective rewards successfully increases the duration of episodes and the area explored, and reduces collisions. The result is faster learning for several downstream computer vision tasks.


Code repository: affectbased — Affect-based Intrinsic Rewards for Learning General Representations (https://arxiv.org/abs/1912.00403)

1 Introduction

Figure 1: We present a novel approach leveraging a positive affect-based intrinsic reward to motivate exploration. We use this policy to collect data for self-supervised pre-training and then use the learned representations for multiple downstream computer vision tasks. The red regions highlight the parts of the architecture trained at each stage.

Reinforcement learning (RL) is most commonly achieved via policy-specific rewards that are designed for a predefined task or goal. Such extrinsic rewards can be sparse, difficult to define and/or applicable only to the task at hand. We are interested in exploring the hypothesis that RL frameworks can be designed in a task-agnostic fashion, and that this will enable us to efficiently learn general representations which are useful in solving several machine intelligence tasks. In particular, we consider intrinsic rewards that are akin to affect mechanisms in humans and encourage efficient and safe exploration. These rewards are task independent; thus, the experiences they gather are not specific to any particular activity and can be harnessed to build general representations. Furthermore, intrinsically motivated learning can have advantages over extrinsic rewards, as it can reduce sample complexity by producing reward signals that indicate success or failure before the end of an episode [mcduff2018visceral].

A key question we seek to answer is how to define such an intrinsic policy. We propose a framework that comprises mechanisms motivated by human affect. The core insight is that learning agents motivated by drives such as delight, fear, curiosity, hunger etc. can garner rich experiences that are useful in solving multiple types of tasks. For instance, in reinforcement learning contexts there is a need for an agent to adequately explore its environment [frank2014curiosity]. This can be done by randomly selecting actions, or by employing a more intelligent strategy, directed exploration, that incentivizes exploration of unexplored regions [chen2018learning]. Curiosity is often defined by using the prediction error as the reward signal [pathak18largescale, pathakICMl17curiosity]. As such, the uncertainty, or mistakes, made by the system are assumed to represent what the system should want to learn more about. However, this is a simplistic view, as it fails to take into account that new stimuli are not always very informative or useful [savinov2018episodic]. Savinov et al. use the analogy of becoming glued to a TV, endlessly channel surfing, while the rest of the world waits outside the window. Their work proposed a new novelty bonus that features episodic memory. McDuff and Kapoor [mcduff2018visceral] took another approach focusing on safe exploration, proposing intrinsic rewards mimicking the responses of a human's sympathetic nervous system (SNS) to avoid catastrophic mistakes during the learning phase, but not necessarily promoting exploration.

In this paper, we specifically focus on the role of positive emotions and study how such intrinsic motivations can enable learning agents to explore efficiently and build useful representations. In research on education, positive affect has been shown to be related to increased interest, involvement and arousal in learning contexts [masters1979affective]. Kort et al.’s [kort2001affective] model of emotions in learning posits that the states of curiosity and satisfaction are associated with positive affect in the context of a constructive learning experience. Human physiology is informative about underlying affective states. Smile behavior [jaques2018learning] and physiological signals [mcduff2018visceral] have been effectively used as feedback in learning systems but not in the context of intrinsic motivation or curiosity. We leverage facial expressions as an unobtrusive measure of expressed positive affect. The key challenges here entail both designing a computer vision system that can first model the intrinsic reward appropriately, and then building a learning framework that can efficiently use the data to solve multiple downstream computer vision tasks.

The core contributions of this paper (summarized in Fig. 1) are to: (1) present a novel learning framework in which reward mechanisms motivated by positive affect mechanisms in humans are used to carry out exploration while remaining agnostic to any specific machine intelligence task, (2) show how the data collected in such an unsupervised manner can be used to build general representations useful for solving downstream tasks with minimal task-specific fine-tuning, and (3) report experimental results showing that the framework improves safe exploration and enables efficient learning for multiple machine intelligence tasks. In summary, we argue that such an intrinsically motivated learning framework inspired by affective mechanisms can be effective in increasing coverage during exploration and decreasing the number of catastrophic failures, and that the garnered experiences help learn general representations for solving tasks including depth estimation, scene segmentation and sketch-to-image translation.

2 Related Work

Our work is inspired by intrinsically motivated learning [chentanez2005intrinsically, haber2018learning, zheng2018learning]. One key property of intrinsic rewards is that they are non-sparse [savinov2018episodic], which aids learning even when the signal is weak. Much of the work in this domain uses a combination of intrinsic and extrinsic rewards in learning. Curiosity is one example of an intrinsic reward that grants a bonus when an agent discovers something new and is vital for discovering successful behavioral strategies. For example, [pathak18largescale, pathakICMl17curiosity] model curiosity via the prediction error as a surrogate and show that such an intrinsic reward mechanism performs similarly to hand-designed extrinsic rewards in many environments. Similarly, Savinov et al. [savinov2018episodic] defined a different curiosity metric based on how many steps it takes to reach the current observation from those in memory, thus capturing environment dynamics. Their motivation was that previous approaches were too simplistic in assuming that all changes in the environment should be considered equal. McDuff and Kapoor [mcduff2018visceral] provided an example of how an intrinsic reward mechanism could motivate safer learning, utilizing human physiological responses to shorten the training time and avoid critical states. Our work is inspired by this prior art, but with the key distinction that we specifically aim to build intrinsic reward mechanisms that are visceral and trained on signals correlated with human affective responses.

Imitation learning (IL) is a popular method for deriving policies. In IL, the model is trained on previously generated examples to imitate the recorded behavior. It has been successfully applied using data collected from many domains [billard2001learning, blukis2018following, bojarski2016end, cardamone2009learning, ross2011reduction, ross2013learning]. Simulated environments have been successfully used for training and evaluating IL systems [codevilla2018end, zadok2019explorations]. We use IL as a baseline in our work and perform experiments to show how a combination of IL and positive affect-based rewards can lead to greater and safer exploration.

One of our goals is to explore whether our intrinsic motivation policy can help us learn general representations. Fortunately, the rise of unsupervised generative models, such as generative adversarial networks (GANs) [goodfellow2014generative] and variational auto-encoders (VAEs) [kingma2013auto], has led to progress across many interesting challenges in computer vision that involve image-to-image translation [isola2017image, zhu2017unpaired]. In our work, we use three tasks in our data evaluation process: scene segmentation [DEEPLAB], depth estimation [godard2017unsupervised] and sketch-to-image translation [chen2018sketchygan]. The first two tasks are common in driving scenarios and augmented reality [pan2017virtual, swan2007egocentric, wang2019monocular], while the third is known for helping people render visual content [eitz2012humans] or synthesize imaginary images [chen2009sketch2photo]. In this paper, we show that by first pre-training a VAE in a self-supervised way using our exploration policy, we can obtain better results on all three tasks.

3 Our Framework

Fig. 1 describes the overall framework. The core idea behind the proposed methodology is that the agents have intrinsic motivations that lead to extensive exploration. Consequently, an agent on its own is able to gather experiences and data in a wide variety of conditions. Note that, unlike traditional machine intelligence approaches, the agent is not fixated on a given task; all it is encouraged to do is explore as extensively as possible without getting into perilous situations. The rich data that is being gathered then needs to be harnessed into building representations that will eventually be useful in solving many perception tasks. Thus, the framework consists of three core components: (1) a positive affect-based exploration policy, (2) a self-supervised representation learning component and (3) mechanisms that utilize these representations efficiently to solve various vision tasks.

3.1 Affect-Based Exploration Policy

The approach here is to create a model that encourages the agent to explore the environment. We want a reward mechanism that positively reinforces behaviors that mimic a human's affective responses and lead to discovery and joy.

Positive Intrinsic Reward Mechanism:

In our work we use a convolutional neural network (CNN) to model the affective responses of a human, as if we were in the same scenario as the agent. We train the CNN model to predict human smile responses as the exploration evolves. Based on the fact that positive affect plays a central role in curiosity and learning [kort2001affective], we chose smiles as an approximate measure of positive affect. Smiles are consistently linked with positive emotional valence [brown1980relationships, kassam2010assessment] and have a long history of study [lafrance2011lip] using electromyography [brown1980relationships] and automated facial coding [martinez2017automatic]. We must emphasize that in this work we are not attempting to explicitly model the psychological processes that cause people to smile. We only use smiles as an outward indicator of situations that are correlated with positive affect as people explore new environments. In particular, the network was trained to infer the reward $r$ directly, given that an action $a$ was taken at state $s$. We defer the details of the network architecture, the data collection process and the training procedure to Section 4.

Choosing Actions with Intrinsic Rewards: Given the intrinsic reward mechanism, we can use any off-the-shelf sequential decision-making framework, such as RL [mcduff2018visceral], to learn a policy. It is also feasible to modify an existing policy that is trained to explore or collect data. While the former approach is theoretically desirable, it requires a very large number of training episodes to return a useful policy. We focus on the latter, where we assume there exists a function $f$ which can predict a vector of action probabilities $f(s)$ when the agent observes state $s$. Formally, given an observation $s$ and a model $f$, the next action, $a^*$, is selected as $a^* = \arg\max_a\, [f(s)]_a$. Such a function can be trained on human demonstrations while they explore the environment.

We then use the intrinsic positive affect model to bias the action selection toward actions that promise better intrinsic rewards. Intuitively, instead of simply using the output of the pre-trained policy to decide on the next action, we consider the impact of the intrinsic motivation for every action under consideration. Formally, given the positive affect model $r$, a pre-trained exploration policy $f$ and observation $s$, the next action $a^*$ is selected as:

$a^* = \arg\max_a\, \big[\, [f(s)]_a + \lambda\, r(s, a) \,\big]$   (1)

The above equation adds a weighted intrinsic motivation component to the action probabilities from the original model $f$. The weighting parameter $\lambda$ defines the trade-off between the original policy and the effect of the intrinsic reward, and in this paper it is determined empirically.
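To make the action-selection rule concrete, the following is a minimal NumPy sketch of Eq. (1), assuming the policy output $f(s)$ and the per-action affect rewards $r(s, a)$ are already available as vectors; the function name and the default value of the trade-off weight are illustrative, not part of the released implementation.

```python
import numpy as np

def select_action(policy_probs, affect_rewards, lam=0.5):
    """Affect-weighted action selection, a sketch of Eq. (1).

    policy_probs   : np.ndarray, shape (n_actions,), output of the
                     pre-trained exploration policy f(s).
    affect_rewards : np.ndarray, shape (n_actions,), predicted positive-affect
                     reward r(s, a) for each candidate action.
    lam            : trade-off between the original policy and the intrinsic
                     reward (chosen empirically; 0.5 here is illustrative).
    """
    scores = policy_probs + lam * affect_rewards
    return int(np.argmax(scores))
```

The index returned is that of the action maximizing the combined score, so the base policy is followed unless the affect model predicts a sufficiently larger reward for a different action.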

3.2 Self-Supervised Learning

Given the exploration policy, the agent has the ability to explore and collect rich data. The next component aims to use this data to build rich representations that eventually could be used for various visual recognition and understanding tasks.

The challenge here is that, since the collected data is task agnostic, there are no clear labels that could be used for supervised learning. We consequently use the task of jointly learning an encoder and decoder through a low-dimensional latent representation. Formally, we use a variational autoencoder (VAE) to build such representations. For example, a VAE can be trained to restore just the input image, with the loss constructed as the combination of negative log likelihood and KL divergence, as follows:

$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$   (2)

where the encoder is denoted by $q_\phi$, the decoder is denoted by $p_\theta$, and $z$ denotes the low-dimensional projection of the input $x$. Note that such an architecture can be trained not only on the task of reconstructing one frame, but on a sequence of frames, including predicting future or past frames (or a combination of those). The key intuition here is that if the VAE can successfully encode and decode sequences of frames, then implicitly it is considering aspects such as depth, motion, segmentation and dynamics that are critical to making successful predictions. Thus, it should be possible to tweak and fine-tune these VAE networks to solve a host of visual recognition and perception tasks with minimal effort.
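As an illustration, here is a minimal TensorFlow 2 sketch of a VAE loosely following the layout described later (three convolutional layers feeding fully connected layers in the encoder; two fully connected and three transposed convolutional layers in the decoder), trained with the loss of Eq. (2). The input resolution, latent dimensionality and filter sizes are assumptions made for the example, not the paper's exact settings.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 64          # illustrative latent size
IMG_SHAPE = (64, 64, 3)  # illustrative input resolution

def build_encoder():
    # Three convolutions followed by fully connected layers, producing the
    # mean and log-variance of the latent Gaussian q_phi(z|x).
    x_in = tf.keras.Input(shape=IMG_SHAPE)
    h = layers.Conv2D(32, 4, strides=2, padding="same", activation="relu")(x_in)
    h = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(h)
    h = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(h)
    h = layers.Flatten()(h)
    h = layers.Dense(256, activation="relu")(h)
    z_mean = layers.Dense(LATENT_DIM)(h)
    z_log_var = layers.Dense(LATENT_DIM)(h)
    return tf.keras.Model(x_in, [z_mean, z_log_var], name="encoder")

def build_decoder():
    # Two fully connected layers followed by three transposed convolutions,
    # mapping a latent code back to an image (p_theta(x|z)).
    z_in = tf.keras.Input(shape=(LATENT_DIM,))
    h = layers.Dense(256, activation="relu")(z_in)
    h = layers.Dense(8 * 8 * 128, activation="relu")(h)
    h = layers.Reshape((8, 8, 128))(h)
    h = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(h)
    h = layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu")(h)
    x_out = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="sigmoid")(h)
    return tf.keras.Model(z_in, x_out, name="decoder")

def sample_z(z_mean, z_log_var):
    # Reparameterization trick: z = mu + sigma * eps.
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def vae_loss(x, x_recon, z_mean, z_log_var):
    # Negative log likelihood (per-pixel Bernoulli) plus KL divergence
    # to a unit Gaussian prior, as in Eq. (2).
    nll = tf.reduce_sum(tf.keras.losses.binary_crossentropy(x, x_recon), axis=[1, 2])
    kl = -0.5 * tf.reduce_sum(1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1)
    return tf.reduce_mean(nll + kl)
```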


Figure 2: An example of the smile response for a six minute (360s) period during one of the driving sessions. Frames from the environment and from the webcam video are shown as reference.

3.3 Fine-tuning for Vision Tasks

Given the VAE representation, our goal now is to reuse the learned weights to solve standard machine perception tasks. We accomplish this by fine-tuning a minimal number of decoder weights, while the rest of the network stays unchanged. Formally, given some labeled data corresponding to a vision task, similar to supervised learning, we optimize the negative log likelihood:

$\mathcal{L}_{\text{task}} = -\sum_{(x,\, y)} \log p_\theta\big(y \mid z = q_\phi(x)\big)$   (3)

Note that the goal is to modify the network minimally. In our experiments, we show how we can solve depth map estimation and scene segmentation by tweaking only the weights of a few layers just before the decoder output. We also show how we can use those weights for sketch-to-image translation, even with a small number of annotated samples.
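A minimal sketch of this fine-tuning step, assuming the Keras-style encoder and decoder from the previous sketch: all pre-trained weights are frozen except the last few decoder layers, which are retrained on the labeled task data under a per-pixel negative log likelihood in the spirit of Eq. (3). Layer counts, the loss choice and optimizer settings are illustrative.

```python
import tensorflow as tf

def build_finetune_model(encoder, decoder, n_trainable_layers=2, lr=1e-4):
    # Freeze the self-supervised weights; only the last few decoder layers
    # are updated on the labeled task data (depth, segmentation, ...).
    encoder.trainable = False
    for layer in decoder.layers[:-n_trainable_layers]:
        layer.trainable = False
    for layer in decoder.layers[-n_trainable_layers:]:
        layer.trainable = True

    x_in = tf.keras.Input(shape=encoder.input_shape[1:])
    z_mean, _ = encoder(x_in)
    y_out = decoder(z_mean)            # deterministic: use the latent mean
    model = tf.keras.Model(x_in, y_out)
    # Per-pixel negative log likelihood (Eq. 3); binary cross-entropy is one choice
    # for targets normalized to [0, 1].
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="binary_crossentropy")
    return model

# Hypothetical usage with labeled arrays task_images / task_targets:
# model = build_finetune_model(build_encoder(), build_decoder())
# model.fit(task_images, task_targets, epochs=10, batch_size=32)
```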

4 Experiments

We conducted experiments to analyze (1) the potential advantages of the affect-based exploration policy over other heuristics, (2) the ability to learn general representations, and (3) whether it is possible to solve diverse computer vision tasks by building upon and minimally modifying the general model.

We conducted our experiments on a high-fidelity simulation environment for autonomous systems [shah2018airsim]. Specifically, we used a customized 3D maze (dimensions: 2,490 meters by 1,500 meters); a top-down view can be seen in Fig. 3. The maze is composed of walls and ramps; frames from the environment are shown in Fig. 2. The agent was a vehicle capable of maneuvering comfortably within the maze. To generate random starting points for deploying the agent into the maze, we constructed a navigable area according to the vehicle dimensions and surroundings (green region in Fig. 3).

4.1 Data and Model Training

Affect-based Intrinsic Reward:

We collected data from five subjects (four males, one female; ages 23-40 years) exploring the simulated environment, while simultaneously recording synchronized video of their faces. All participants were qualified drivers with multiple years of driving experience. The participants drove for an average of 11 minutes each, providing a total of over 64,000 frames. The protocol was approved by our institutional review board. The participants were told to explore the environment but were given no additional instruction about other objectives. We used a well-validated, open-source algorithm to calculate the smile response of the drivers from the webcam videos [baltrusaitis2018openface]. An example of the smile response from one subject can be seen in Fig. 2. Using these data, we trained our affect-based intrinsic motivation model. The image frames from the camera sensor in the environment served as input to the network, and the smile probability in the corresponding webcam frame served as the output. The input frames were downsampled and normalized to the range [0, 1]. The model architecture was composed of three convolutional and two fully connected layers, and was trained to infer the reward directly given an observation taken from the environment.
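A sketch of what such a reward network could look like, following the stated layout of three convolutional and two fully connected layers and regressing a smile probability in [0, 1]; the input resolution, filter sizes and the choice of a mean-squared-error objective are assumptions, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_affect_reward_model(input_shape=(84, 84, 3)):
    # Three convolutional and two fully connected layers mapping a single
    # observation to a predicted smile probability (the intrinsic reward).
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(32, 8, strides=4, activation="relu"),
        layers.Conv2D(64, 4, strides=2, activation="relu"),
        layers.Conv2D(64, 3, strides=1, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # predicted smile probability in [0, 1]
    ])
    # Regression against the OpenFace smile response; MSE is one reasonable choice.
    model.compile(optimizer="adam", loss="mse")
    return model
```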

Figure 3: Visualization of the experiment from Table 1 using heat maps. From this visualization we can observe that the better the policy, the longer the paths recorded during the trials.
#   | Method       | Duration (s) | Coverage (m²) | Coverage/sec (m²/s) | Collisions
(1) | Random       | 7.57         | 107.79        | 14.23               | 230
(2) | Straight     | 8.32         | 115.33        | 13.86               | 206
(3) | IL           | 52.87        | 727.46        | 13.75               | 38
(4) | Affect-based | 79.76        | 1059.29       | 13.28               | 27
Table 1: Evaluation of the driving policies. Given a random starting point, duration is the average time the car drove before a collision and coverage is the average area the car covered.

Exploration Policy:

The model architecture was composed of three convolutional and two fully connected layers, and was trained to classify the desired steering angle. The input to the model consists of four consecutive images, downsampled as in many DQN applications [mnih2013playing]. The action space is discrete and composed of five possible steering angles: 40°, 20°, 0°, -20° and -40°. We first train the base policy via imitation learning, on data recorded while a single human driver drove the vehicle in the simulation. The dataset contains 50,000 images, normalized to [0, 1], together with the corresponding human actions. To increase the variation in the collected data and cope better with sharp turns, we implemented Shifted Driving [zadok2019explorations], in which we shift the observed frame and post-process the steering angle accordingly. The final exploration policy embeds this base policy as described in Section 3.1 and takes the affective rewards into account. Specifically, the reward mechanism was computed for each of the steering angles, so the positive intrinsic values represent the values inferred when looking directly toward the respective driving directions. The weighting parameter $\lambda$ was determined via cross-validation.
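Putting the pieces together, one decision step of the exploration policy could look like the sketch below: the base policy scores the five steering angles, the affect model is evaluated once per candidate driving direction, and the two terms are combined as in Eq. (1). The helper observation_towards, which returns the frame seen when looking toward a candidate angle, is hypothetical and would be provided by the simulator.

```python
import numpy as np

STEERING_ANGLES = [40, 20, 0, -20, -40]  # degrees, as listed above

def policy_step(frame_stack, current_frame, base_policy, affect_model,
                observation_towards, lam=0.5):
    # frame_stack: the four stacked frames fed to the IL base policy f.
    # observation_towards(frame, angle): hypothetical simulator helper that
    # returns the camera view when looking toward the candidate steering angle.
    probs = base_policy.predict(frame_stack[None], verbose=0)[0]      # f(s)
    rewards = np.array([
        affect_model.predict(observation_towards(current_frame, a)[None],
                             verbose=0)[0, 0]                         # r(s, a)
        for a in STEERING_ANGLES
    ])
    # Eq. (1): bias the imitation policy toward high positive-affect actions.
    return STEERING_ANGLES[int(np.argmax(probs + lam * rewards))]
```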

Representation Learning: As the vehicle explores the environment, the data it sees is used to train the VAE. Each episode was initiated by placing the vehicle at a random starting point and letting it drive until collision. Here we use the task of frame restoration to train the VAE model. The model architecture was composed of three convolutional and two fully connected layers for the encoder, and two fully connected and three transposed convolutional layers for the decoder. The input and generated images share the same resolution, and the latent representation is low-dimensional.

For evaluating performance on depth map estimation and scene segmentation, we collected 2000 images with ground truth, captured by placing simulated cameras randomly in the environment. For sketch-to-image translation, we used the same method except that the sketches were computed by finding the image contours.
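As an example of how such sketches can be derived, the snippet below extracts contours from a rendered frame with Canny edge detection; this is one simple way to implement a contour-based conversion and the thresholds are illustrative, not the exact procedure used in the paper.

```python
import cv2

def image_to_sketch(image_bgr, low=50, high=150):
    """Derive a sketch-like input from a rendered frame by extracting contours.

    Canny edge detection is one straightforward choice; the thresholds
    (low, high) are placeholders.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)
    # White background with dark contours, resembling a hand-drawn sketch.
    return 255 - edges
```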


Figure 4: Test loss as a function of the number of episodes for depth map estimation, scene segmentation and sketch-to-image translation. Results are averaged over 30 trials. The error bars reflect the standard error.

Figure 5: Samples generated using VAEs trained on each of the three tasks. Notice that there are fewer distortions in the depth map estimations, better classification of structures in the scene segmentation, and better generation of the images from sketches when using our proposed policy.

4.2 Results

How good is affect-driven exploration?

We compare the proposed method to three additional methods: random, straight and IL. For the random policy, we simply draw a random action at each timestep according to a uniform distribution, regardless of the input. For the straight policy, the model drives straight, without changing course. The IL policy is simply the base policy $f$, without the intrinsic affect motivation.

For this experiment, we select a starting point randomly and let the policy drive until a collision is detected. Then we reset the vehicle to a new random starting position. We continue this process for 2,000 seconds, with the vehicle driving at a constant speed. We then consider the mean duration and the mean total area covered during exploration per episode. Longer episodes reflect that the policy is able to reason about free space together with the vehicle dynamics, whereas higher coverage suggests that the policy indeed encourages novel experiences and that the vehicle is not simply going in circles (or standing still). Coverage is defined as the union of circles of fixed radius (in meters) centered on the car.
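For reference, this coverage metric can be approximated by rasterizing the union of circles onto a grid, as in the following sketch; the radius and cell size here are placeholders, since the exact radius is not restated in this text.

```python
import numpy as np

def coverage_area(positions, radius=1.0, cell=0.25):
    """Approximate the covered area (union of circles around visited positions).

    positions : iterable of (x, y) car positions in meters.
    radius    : circle radius in meters (placeholder value).
    cell      : grid cell size in meters used for the approximation.
    Returns the covered area in square meters.
    """
    covered = set()
    r_cells = int(np.ceil(radius / cell))
    for x, y in positions:
        cx, cy = int(round(x / cell)), int(round(y / cell))
        for i in range(cx - r_cells, cx + r_cells + 1):
            for j in range(cy - r_cells, cy + r_cells + 1):
                # Keep the cell if its center lies inside the circle around (x, y).
                if (i * cell - x) ** 2 + (j * cell - y) ** 2 <= radius ** 2:
                    covered.add((i, j))
    return len(covered) * cell * cell
```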

Fig. 3 shows the map of the environment and how the different policies explore the space. The heat signature (yellow indicating more time) shows the amount of time the vehicle spends at each location. We observe that our policy, driven by the intrinsic reward, is able to go further and cover a significantly larger area in the allocated 2,000 seconds. We present the numerical results in Table 1, which shows that, compared to the baseline policies, the proposed model drives longer episodes with far fewer collisions and covers a larger area.

How well can we solve other tasks? In this experiment, we explore how well the VAE trained on the task-agnostic data collected via exploration (as described in Section 3.2) can help solve depth-map estimation, scene segmentation and sketch-to-image translation. First, we explore how much we really need to perturb the original VAE model to get reasonable performance on these tasks. Specifically, we study the performance as we retrain only the top few layers of the decoder. Fig. 6 plots the log loss, where loss is the task-specific L2 reconstruction loss, as training evolves over several epochs. The figure shows curves for different numbers of retrained layers. We find that the biggest gains in performance come from retraining just the top two layers. Note that since the loss is on a log scale, the difference in performance between retraining the top two layers and retraining more layers is small. For sketch-to-image translation it was necessary to retrain the encoder as well, since the input sketches differ from images.

Next, we study the effect of the exploration policies on the three tasks. For these experiments, as the vehicle explores the environment, we train the VAE in parallel on the data gathered so far and measure the performance on the three tasks. We report results averaged over 30 trials. Fig. 4 shows the mean test L2 loss as a function of the number of episodes for the various exploration policies. The shaded regions represent the standard error. While for all policies the error decreases with the number of episodes, the affect-based policy achieves lower errors with significantly fewer episodes. For example, on the task of scene segmentation, the proposed method requires approximately half as many episodes as the IL policy to reach a loss of 0.006.

Finally, besides the L2 loss, we also examine the realism of the output using the Frechet Inception Distance (FID) [heusel2017gans], a metric frequently used to evaluate the realism of images generated by GANs. The results are presented in Table 2 and show that better (lower) FID scores are obtained using the proposed framework.


Figure 6: Test loss as we vary the number of layers tuned for scene segmentation. Fine-tuning just a couple of layers is enough for this task.

How efficient is the framework?

Given that our reward mechanism requires computation via a CNN, we also explore what performance penalty we might be paying. We conducted timed runs for each policy and logged the average frame rate. Our code is implemented using TensorFlow 2.0, with CUDA 10.0 and cuDNN 7.6. These experiments were performed using a mobile GTX 1060 GPU, and the results are averaged over 100 seconds of driving. We observed an average of 16.3 fps using our framework, only slightly lower than the 19.1 fps of the IL policy.

5 Discussion and Conclusion

This paper explores how using positive affect as an intrinsic motivation can successfully spur an agent's exploration. Greater exploration by the agent leads to a better representation of the environment, which in turn improves performance on a range of downstream tasks. We argued that positive affect is an important drive that spurs safer and more curious behavior. Modeling positive affect as an intrinsic reward led to an exploration policy with 51% longer episode duration, 46% greater coverage and 29% fewer collisions. Fig. 3 illustrates this performance as a heat map.

Central to our argument is that affective responses to stimuli are intrinsic sources of feedback that lead to exploration and the discovery of examples that generalize across contexts. We used our general representations to perform experiments across multiple machine intelligence tasks. Comparing performance with and without our affective reward, we found a large benefit in using the policy with intrinsic motivation based on the positive affect signal. Qualitative examples (see Fig. 5) show that this led to better reconstruction of the respective outputs across the tasks.

Method       | Frame Rest. | Sketch-to-Img.
Random       | 276.1       | 273.6
Straight     | 260.9       | 271.4
IL           | 275.4       | 276.3
Affect-based | 246.5       | 259.3
Table 2: FID scores (lower is better) calculated for the image generation tasks. The FID was calculated on 2000 test images, computed between the reconstruction and the ground truth. The results are averaged over 30 runs.

Here we were not attempting to mimic affective processes, but rather to show that functions trained on affect-like signals can lead to improved performance. Smiles are complex nonverbal behaviors that are both common and nuanced; while smiles are interpreted as expressing positive emotion, they communicate a variety of interpersonal states [ekmanFeltFalseMiserable1982, rychlowskaFunctionalSmilesTools2017]. This work establishes the potential for affect-like mechanisms in machine intelligence. Extension to other physiological signals presents a further opportunity worth exploring.

References