Attend Before you Act: Leveraging human visual attention for continual learning

07/25/2018 ∙ by Khimya Khetarpal, et al.

When humans perform a task, such as playing a game, they selectively pay attention to certain parts of the visual input, gathering relevant information and sequentially combining it to build a representation from the sensory data. In this work, we explore leveraging where humans look in an image as an implicit indication of what is salient for decision making. We build on top of the UNREAL architecture in DeepMind Lab's 3D navigation maze environment. We train the agent both with original images and foveated images, which were generated by overlaying the original images with saliency maps generated using a real-time spectral residual technique. We investigate the effectiveness of this approach in transfer learning by measuring performance in the context of noise in the environment.




1 Introduction

Knowing where to look plays an important role in people’s ability to learn and solve new tasks quickly. While some cues in an image are naturally attractive and lead to bottom-up saliency (Harel et al., 2007; Walther & Koch, 2006), others need voluntary effort and are more task-dependent, leading to top-down saliency (Sprague & Ballard, 2004; Borji et al., 2011). When humans perform a specific task, a combined model of attentional selection and object recognition is usually at work. Bottom-up feature extraction coupled with a hierarchical representation of object classes and motor commands governs subsequent eye movements in order to maximize information gain (Itti & Koch, 2001).

The human attention mechanism is very complicated and depends on many factors: the complexity and nature of the task, external factors such as rewards or distractors, and internal factors such as curiosity. Triesch et al. (2003) concluded that what we see is highly dependent on what we need. Human visual attention can also be seen as relying on a hierarchical approach (Baylis & Driver, 1993). In particular, when performing a complex task that involves various subgoals, humans attend selectively to parts of the visual scene and deploy their gaze sequentially over a temporal sequence of frames before performing motor actions. More importantly, human attention reuses past understanding of concepts, relations, and world models.

Intuitively, the fact that people focus only on specific parts of an image before acting should lead both to robustness in the presence of noise and deliberate distractors and to the ability to generalize knowledge over different tasks. For example, we can navigate through any building regardless of the color of the walls or the interior decor. Hence, we would like to investigate whether this mechanism also provides robustness and the ability to transfer knowledge for reinforcement learning (RL) agents.

Our goal is to explore how foveating around the regions where humans look impacts the reinforcement learning process, especially focusing on robustness and continual learning. Because of this goal, we build on top of the UNREAL agent (Jaderberg et al., 2016), which aims to construct a better representation for continual learning, by focusing not only on learning the optimal value function for the given task, but also on optimizing several pseudo-rewards or auxiliary tasks. We investigate the impact of overlaying the real image with a mask that is determined by a model of human attention. We use the spectral residual saliency method (Hou & Zhang, 2007) to foveate around salient regions and train the UNREAL agents on a maze navigation task from DeepMind Lab (Beattie et al., 2016). We use varying degrees of foveation, in order to evaluate the impact on the learning process. Our hypothesis was that more foveation should lead to more robustness to distractors and noise, but also to worse final task performance. We also empirically explore if knowing where to look facilitates continual learning and leads learnt policies to be robust to variations in the data distribution.

Figure 1: Sample image from the navigation maze environment in DeepMind Lab overlaid with a saliency heat map generated from the pre-trained model of Cornia et al.

2 Algorithmic approach

We started by investigating saliency maps generated from the Saliency Attentive Model (SAM) (Cornia et al., 2016), the state of the art according to the MIT Saliency Benchmark (Bylinskii et al., 2015). Figure 1 shows a sample input image from a static maze navigation task overlaid with a heat map generated from SAM. SAM uses a Convolutional LSTM to focus on specific parts of the image and iteratively refines the visual attention. Once a grayscale saliency map is generated from SAM, we overlay it on the original image using the jet color map. More salient regions in the image are indicated by the hotness of the map, i.e. the red color, whereas relatively insignificant regions are indicated by the coolness of the map, i.e. the blue color. While the saliency maps generated by SAM look very intuitive, using a SAM model pre-trained on the VGG dataset is computationally very expensive and slows down training. For faster training, instead of SAM, we decided to use a real-time saliency computation technique called the Spectral Residual method (Hou & Zhang, 2007). The key idea of this method is to compute a locally averaged log amplitude spectrum in the frequency domain and subtract it from the log spectrum of the given image to obtain the spectral residual. The residual is then transformed back to the spatial domain, where it indicates the locations of proto-objects. Proto-objects are pre-attentive structures with limited spatial and temporal coherence within a visual stimulus, which generate the perception of an object when attended to.
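The spectral residual computation described above can be sketched in a few lines of NumPy. This is a minimal illustration following the description in Hou & Zhang (2007), not the implementation used in our experiments; the function names, the 3x3 averaging window, and the Gaussian smoothing width are assumptions chosen for the example.

```python
import numpy as np


def box_filter(a, size=3):
    """Local mean over a size x size window (edge-padded)."""
    pad = size // 2
    padded = np.pad(a, pad, mode="edge")
    out = np.zeros(a.shape, dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (size * size)


def gaussian_blur(a, sigma=2.5):
    """Separable Gaussian smoothing built from 1-D convolutions."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()

    def blur_1d(v):
        return np.convolve(np.pad(v, radius, mode="edge"), kernel, mode="valid")

    a = np.apply_along_axis(blur_1d, 0, a)
    return np.apply_along_axis(blur_1d, 1, a)


def spectral_residual_saliency(gray, avg_size=3, sigma=2.5, eps=1e-8):
    """Return a saliency map in [0, 1] for a 2-D grayscale image."""
    spectrum = np.fft.fft2(gray.astype(np.float64))
    log_amplitude = np.log(np.abs(spectrum) + eps)
    phase = np.angle(spectrum)
    # Spectral residual: log spectrum minus its local average.
    residual = log_amplitude - box_filter(log_amplitude, avg_size)
    # Back to the spatial domain, keeping the original phase.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    saliency = gaussian_blur(saliency, sigma)
    saliency -= saliency.min()
    return saliency / (saliency.max() + eps)
```

Because the method needs only two Fourier transforms and two small filters per frame, it is cheap enough to run inside the training loop, which is the reason it was preferred over SAM here.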

Figure 2: Leveraging different degrees of foveation around where humans look in an image. Different degrees of attention, set by the foveation parameter, determine the importance of salient regions relative to the original image. We consider a full range, from looking only at salient regions in Figure 2(a) to focusing relatively more on the whole image in Figure 2(d). The Saliency Attentive Model (SAM) is used to compute the saliency maps in this figure.

We first explore if foveating around the salient locations in the image helps the agent to learn faster. It is natural for humans to look at an entire visual scene, yet, automatically focus around salient regions while eliminating others which are not so important. With this intuition, instead of explicitly providing the attention map along with the original image, we blend the attention map with the original image, as follows:


Here, the saliency map is normalized over all pixels, and the foveation parameter controls the amount of blending with the original image. This is also depicted in Figure 2 for a range of foveation values. For instance, one end of the range removes all distractors and focuses on salient regions alone (Figure 2(a)), whereas the other end implies looking largely at the original image.
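As an illustration of this kind of blending, the sketch below assumes a simple convex rule in which each pixel is scaled by `alpha + (1 - alpha) * S`, where `S` is the normalized saliency map. The rule, the name `foveate`, and the convention that `alpha = 0` keeps only salient regions while `alpha = 1` returns the original image are all assumptions made for the example, and may differ from the exact blending used in our experiments.

```python
import numpy as np


def foveate(image, saliency, alpha):
    """Blend an H x W x 3 image with an H x W saliency map.

    Assumed rule (hypothetical): weight = alpha + (1 - alpha) * S,
    so alpha = 0 keeps salient regions only and alpha = 1 keeps the image.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    image = np.asarray(image, dtype=np.float64)
    saliency = np.asarray(saliency, dtype=np.float64)
    # Normalize the saliency map to [0, 1] over all pixels.
    span = saliency.max() - saliency.min()
    if span > 0:
        norm = (saliency - saliency.min()) / span
    else:
        norm = np.ones_like(saliency)
    weight = alpha + (1.0 - alpha) * norm
    # Broadcast the per-pixel weight over the color channels.
    return image * weight[..., None]
```

Under this assumed rule, intermediate values of `alpha` dim non-salient regions without removing them, which keeps some of the surrounding context visible to the agent.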

We train the UNREAL agent on DeepMind Lab’s (Beattie et al., 2016) static navigation maze task (nav maze static) with all auxiliary tasks enabled as our baseline. We keep the network architecture consistent with Jaderberg et al. (2016), with a CNN-LSTM base agent trained on-policy with A3C (Mnih et al., 2016). The input to the agent at each timestep is an RGB image. The network consists of two convolutional layers followed by a fully connected layer, with ReLU activations on all three layers. An LSTM takes as input the concatenation of the fully connected layer's output, the previous action taken, and the previous reward. The three auxiliary tasks are pixel control, value-function replay, and reward prediction, as described in Jaderberg et al. (2016). The base process uses fixed-length rollouts, and the auxiliary tasks are performed after each rollout, corresponding to every update of the base A3C agent. We used the open-source online implementation of UNREAL as our baseline.
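To make the LSTM input concrete, the following sketch builds the concatenated vector from the fully connected features, the previous action, and the previous reward. The one-hot encoding of the action and the sizes used in the usage note (256 features, 6 actions) are assumptions for illustration only; `lstm_input` is a hypothetical helper, not part of the UNREAL code base.

```python
import numpy as np


def lstm_input(fc_features, prev_action, prev_reward, num_actions):
    """Concatenate [fc features, one-hot previous action, previous reward]."""
    # Assumption: the previous action is encoded as a one-hot vector.
    one_hot = np.zeros(num_actions, dtype=np.float32)
    one_hot[prev_action] = 1.0
    reward = np.array([prev_reward], dtype=np.float32)
    return np.concatenate([fc_features.astype(np.float32), one_hot, reward])
```

For example, with a hypothetical 256-dimensional feature vector and 6 discrete actions, `lstm_input(features, 2, 1.0, 6)` yields a vector of length 263.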

Next, we introduce the Visually-Attentive UNREAL agent, which foveates around the salient regions in each image. This is done in the base process of online A3C, as shown in the pseudocode in Algorithm 1.

Algorithm 1 Visually Attentive UNREAL Agent
  Input: foveation factor, controlling the amount of foveation
  Image ← original input image obtained from the Lab environment
  Saliency ← SpectralSaliencyMethod (Image)
  Foveated Image ← SaliencyOverlay (Image, Saliency, foveation factor)
  Process Base A3C CNN-LSTM (Foveated Image)
  Process Auxiliary Tasks (Foveated Image)
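The preprocessing step of Algorithm 1 can be sketched as a small generator that foveates each frame before it is passed to the learner. The saliency model is passed in as a function; the `center_prior_saliency` stand-in below is purely hypothetical and is used only to keep the example self-contained, and the overlay rule `alpha + (1 - alpha) * S` is an assumed blending, not necessarily the one used in our experiments.

```python
import numpy as np


def to_gray(frame):
    """Average the color channels of an H x W x 3 frame."""
    return frame.astype(np.float64).mean(axis=-1)


def center_prior_saliency(gray):
    """Hypothetical stand-in saliency model: a Gaussian bump at the center."""
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = ((ys - (h - 1) / 2) / h) ** 2 + ((xs - (w - 1) / 2) / w) ** 2
    return np.exp(-d2 / 0.05)  # values in (0, 1]


def saliency_overlay(frame, saliency, alpha):
    """Assumed overlay: scale each pixel by alpha + (1 - alpha) * saliency."""
    weight = alpha + (1.0 - alpha) * saliency
    blended = frame.astype(np.float64) * weight[..., None]
    return blended.astype(frame.dtype)


def foveated_stream(frames, saliency_fn, alpha):
    """Yield one foveated frame per input frame, as in Algorithm 1."""
    for frame in frames:
        saliency = saliency_fn(to_gray(frame))
        yield saliency_overlay(frame, saliency, alpha)
```

Each yielded frame would then be consumed by both the base A3C CNN-LSTM and the auxiliary tasks. Passing the saliency model as an argument means a different model can be swapped in without touching the rest of the loop.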

Training used parallel threads in all our experiments. In our preliminary experiments, we explored different values of the foveation parameter. On one hand, foveating on the salient regions alone removes much of the context surrounding the important aspects of an image and results in little to no learning, as seen in Figure 3. This is also intuitive from the visualization in Figure 2(a). On the other hand, a specific range of foveation values shows a boost in performance in the preliminary learning curves, as shown in Figure 3.

Figure 3: Learning with varying degrees of visual attention to navigate the maze environment. Specific degrees of visual attention help the agent learn better than the baseline UNREAL agent. One setting of the foveation parameter speeds up learning compared to the other settings for this instance of runs.

3 Experiments

Agent Testing Continual Learning
Easy Moderate Difficult
Visually-Attentive UNREAL
Table 1: Transfer learning: average performance over games once training is completed. The UNREAL agent and the Visually-Attentive UNREAL agent are evaluated once training is stopped (Testing) and also for transfer learning (Continual Learning). Transfer is evaluated on three variations of the training setting: Easy, where simple Gaussian noise is added to the original frames; Moderate, where frames are tinted at random, by flipping a coin, with the same hue of 0.25; and Difficult, where some frames are tinted at random with varying amounts of hue. Scores are averaged over 25 games, with the standard deviation across these games in brackets.

Based on our preliminary results, we further trained the Visually-Attentive UNREAL agents using only the foveation value that showed a boost in performance in Figure 3. We performed multiple runs for both the baseline and the visually-attentive agent. Figure 4 shows the learning curves averaged across runs. On average, the Visually-Attentive UNREAL agent learns marginally slower than the baseline. Moreover, the amount of foveation determines the impact on learning. However, the learning curves only show how these agents perform in the same environment over time. Next, we explore how visually-attentive agents compare to the baseline in transfer of learning. In other words, does visual attention facilitate continual learning?

Figure 4: Learning curves in the Navigation Maze Static task. On average, the UNREAL agent learns better than the Visually-Attentive UNREAL agent during the training phase. However, both agents suffer from high variance during the training phase.

To evaluate the trained models for continual learning, we introduce three types of perturbations in the input frames and record the average performance over k=25 games for both agents (Table 1). The perturbations are the addition of Gaussian noise, tinting of images at random with the same hue, and tinting of images at random with different hues, categorized as three levels of difficulty: easy, moderate and difficult. To tint the frames, we generate a flickering effect in the sequence of frames by scaling RGB values and by adjusting colors in the HSV color space. From the mean scores in Table 1, one can note that the performance of both the baseline and the visually-attentive UNREAL agent remains unaffected by relatively small amounts of Gaussian noise. Upon encountering flickering in frames at random, the visually-attentive UNREAL agent is still able to perform as well as the baseline and is relatively more robust to distractors in both the easy and moderate categories of evaluation. However, both agents struggle to transfer when the amount of distraction is larger than what they have seen during training. For a qualitative analysis, we present visualizations of both agents in all three test scenarios as additional results in the supplementary material.
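The three perturbation levels can be sketched as follows. This is a simplified illustration rather than our evaluation code: tinting is approximated by blending each frame toward a fully saturated color of the given hue, and the noise level `sigma`, the blend `strength`, and the hue range used for the difficult level are assumptions. Only the fixed hue of 0.25 for the moderate level comes from the setup above.

```python
import colorsys

import numpy as np

LEVELS = ("easy", "moderate", "difficult")


def add_gaussian_noise(frame, sigma, rng):
    """Add zero-mean Gaussian noise to a uint8 RGB frame."""
    noisy = frame.astype(np.float64) + rng.normal(0.0, sigma, size=frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def tint(frame, hue, strength=0.5):
    """Blend a uint8 RGB frame toward the saturated color of a given hue."""
    tint_rgb = np.array(colorsys.hsv_to_rgb(hue, 1.0, 1.0)) * 255.0
    blended = (1.0 - strength) * frame.astype(np.float64) + strength * tint_rgb
    return np.clip(blended, 0, 255).astype(np.uint8)


def perturb(frame, level, rng):
    """Apply the perturbation for one difficulty level to a single frame."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if level == "easy":
        return add_gaussian_noise(frame, sigma=10.0, rng=rng)  # sigma assumed
    # Coin flip per frame: tinting only some frames produces flicker.
    if rng.random() < 0.5:
        return frame
    if level == "moderate":
        return tint(frame, hue=0.25)
    # Difficult: a different hue each time (range assumed to be [0, 1)).
    return tint(frame, hue=rng.uniform(0.0, 1.0))
```

Applying `perturb` independently to every frame of an episode gives the flickering effect, since consecutive frames may be tinted or left unchanged.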

4 Discussion and Future Work

We present an exploratory study to understand the role of visual attention in learning to perform a task and to evaluate its effect on continual learning. Our key hypothesis is that knowing where to look in an image helps in learning a task, because this knowledge could be transferred to new tasks. We train the visually-attentive UNREAL agent, which foveates around regions of an image salient to the human eye. The performance evaluation on perturbations of the training setting demonstrates promising results for further analysis of continual learning with visual attention.

In this work, we employed a fundamental spectral residual saliency method, which is based on the log spectrum representation of images. However, this technique does not take motion features into account, which could limit the performance of the visually-attentive agent. This was further confirmed by qualitative analysis of the attention maps generated by the spectral residual saliency method, as shown in Figure 5. It is interesting to note that this model focuses a lot more on the score region of the frame than on the objects in the maze. One potential reason for the limited performance is that the computed attention maps focus on the single most important object in the frame as opposed to all salient regions. We note that our approach can be used as a wrapper around any saliency model, so it would be easy to try better approaches.

Figure 5: Qualitative analysis of the Spectral Residual method. Attention maps computed with the spectral residual method are used to generate different degrees of foveation, set by the foveation parameter. This model attends a lot more to the score than to the other objects in the image.

A possible future direction in understanding the role of attention could involve training saliency models explicitly on images encountered in game playing. Even using a pre-trained SAM model in an optimized fashion could potentially improve performance. One could employ a better saliency model to help the agent foveate on regions which capture the dynamics of the rewards and the feature representation. More importantly, it would be interesting to study a setting where the agent can actively learn to control where to attend, rather than using a static attention model.

Jayaraman & Grauman’s work on learning object representations in a dynamic interactive setting follows a similar line of thought. Thus, an open question remains: how can we ensure that an agent controls the visit to the most visually attended states?