TensorFlow Implementation of Crossmodal Attentive Skill Learner (CASL)
This paper presents the Crossmodal Attentive Skill Learner (CASL), integrated with the recently-introduced Asynchronous Advantage Option-Critic (A2OC) architecture [Harb et al., 2017] to enable hierarchical reinforcement learning across multiple sensory inputs. We provide concrete examples where the approach not only improves performance in a single task, but also accelerates transfer to new tasks. We demonstrate that the attention mechanism anticipates and identifies useful latent features, while filtering irrelevant sensor modalities during execution. We modify the Arcade Learning Environment [Bellemare et al., 2013] to support audio queries, and conduct evaluations of crossmodal learning in the Atari 2600 game Amidar. Finally, building on the recent work of Babaeizadeh et al. [2017], we open-source a fast hybrid CPU-GPU implementation of CASL.
Intelligent agents should be capable of disambiguating local sensory streams to realize long-term goals. In recent years, the combined progress of computational capabilities and algorithmic innovations has afforded reinforcement learning (RL) (Sutton and Barto, 1998) approaches the ability to achieve this desideratum in impressive domains, exceeding expert-level human performance in tasks such as Atari and Go (Mnih et al., 2015; Silver et al., 2017). Nonetheless, many of these algorithms thrive primarily in well-defined mission scenarios learned in isolation from one another; such monolithic approaches are not sufficiently scalable for missions where goals may be less clearly defined, and sensory inputs found salient in one domain may be less relevant in another.
How should agents learn effectively in domains of high dimensionality, where tasks are durative, agents receive sparse feedback, and sensors compete for limited computational resources? One promising avenue is hierarchical reinforcement learning (HRL), focusing on problem decomposition for learning transferable skills. Temporal abstraction enables exploitation of domain regularities to provide the agent hierarchical guidance in the form of options or sub-goals (Sutton et al., 1999; Kulkarni et al., 2016). Options help agents improve learning by mitigating scalability issues in long-duration missions, by reducing the effective number of decision epochs. In the parallel field of supervised learning, temporal dependencies have been captured proficiently using attention mechanisms applied to encoder-decoder based sequence-to-sequence models (Bahdanau et al., 2014; Luong et al., 2015). Attention empowers the learner to focus on the most pertinent stimuli and capture longer-term correlations in its encoded state, for instance to conduct neural machine translation or video captioning (Yeung et al., 2015; Yang et al., 2016). Recent works also show benefits of spatio-temporal attention in RL (Mnih et al., 2014; Sorokin et al., 2015).
One can interpret the above approaches as conducting dimensionality reduction, where the target dimension is time. In view of this insight, this paper proposes an RL paradigm exploiting hierarchies in the dimensions of time and sensor modalities. Our aim is to learn rich skills that attend to and exploit pertinent crossmodal (multi-sensor) signals at the appropriate moments. The introduced crossmodal skill learning approach largely benefits an agent learning in a high-dimensional domain (e.g., a robot equipped with many sensors). Instead of the expensive operation of processing and/or storing data from all sensors, we demonstrate that our approach enables such an agent to focus on important sensors; this, in turn, leads to more efficient use of the agent’s limited computational and storage resources (e.g., its finite-sized memory).
In this paper, we focus on combining two sensor modalities: audio and video. While these modalities have been previously used for supervised learning (Ngiam et al., 2011), to our knowledge they have yet to be exploited for crossmodal skill learning. We provide concrete examples where the proposed HRL approach not only improves performance in a single task, but accelerates transfer to new tasks. We demonstrate the attention mechanism anticipates and identifies useful latent features, while filtering irrelevant sensor modalities during execution. We also show preliminary results in the Arcade Learning Environment (Bellemare et al., 2013), which we modified to support audio queries. In addition, we provide insight into how our model functions internally by analyzing the interactions of attention and memory. Building on the recent work of Babaeizadeh et al. (2017), we open-source a fast hybrid CPU-GPU implementation of our framework. Finally, note that despite this paper’s focus on audio-video sensors, the framework presented is general and readily applicable to additional sensory inputs.
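This work considers an agent operating in a partially-observable stochastic environment, modeled as a POMDP (Kaelbling et al., 1998). $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{O}$ are, respectively, the state, action, and observation spaces. At timestep $t$, the agent executes action $a_t \in \mathcal{A}$ in state $s_t \in \mathcal{S}$, transitions to state $s_{t+1}$, receives observation $o_{t+1} \in \mathcal{O}$, and reward $r_t$. The value of state $s$ under policy $\pi$ is the expected return $V_\pi(s) = \mathbb{E}_\pi\big[\sum_{i=t}^{H} \gamma^{i-t} r_i \mid s_t = s\big]$, given horizon $H$ and discount factor $\gamma \in [0, 1)$. The objective is to learn an optimal policy $\pi^*$, which maximizes the value.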
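The expected return in the paragraph above can be made concrete with a short sketch. The snippet below computes the discounted return of one finite-horizon episode; the function name and the toy reward sequence are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i] for one finite-horizon episode."""
    g = 0.0
    # Accumulate backwards so each step needs a single multiply by gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical sparse-reward episode: the only reward arrives at the last step.
rewards = [0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 0.25
```

Averaging this quantity over many episodes that start in the same state and follow the same policy gives a Monte Carlo estimate of that state's value.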
The framework of options provides an RL agent the ability to plan using temporally-extended actions (Sutton et al., 1999). Option $\omega$ is defined by initiation set $\mathcal{I}_\omega \subseteq \mathcal{S}$, intra-option policy $\pi_\omega$, and termination condition $\beta_\omega$. Initially, a policy over options chooses an option among those that satisfy the initiation set. The selected option executes its intra-option policy until termination, upon which a new option is chosen. This process iterates until the goal state is reached. Recently, the Asynchronous Advantage Actor-Critic framework (A3C) (Mnih et al., 2016) has been applied to POMDP learning in a computationally-efficient manner by combining parallel actor-learners and Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997). Asynchronous Advantage Option-Critic (A2OC) extends A3C and enables learning option-value functions, intra-option policies, and termination conditions in an end-to-end fashion (Harb et al., 2017). The option-value function models the value of state $s$ in option $\omega$,

$$Q(s, \omega) = \sum_{a} \pi_\omega(a \mid s) \Big( r(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, U(s', \omega) \Big), \tag{1}$$

where $a \in \mathcal{A}$ is a primitive action and $U(s', \omega)$ represents the option utility function,

$$U(s', \omega) = \big(1 - \beta_\omega(s')\big)\, Q(s', \omega) + \beta_\omega(s') \big( V(s') - \eta \big). \tag{2}$$

A2OC introduces deliberation cost, $\eta$, in the utility function to address the issue of options terminating too frequently. Intuitively, the role of $\eta$ is to impose an added penalty when options terminate, leading them to terminate less frequently. The value function over options, $V$, is defined,

$$V(s') = \sum_{\omega} \pi_\Omega(\omega \mid s')\, Q(s', \omega), \tag{3}$$

where $\pi_\Omega$ is the policy over options (e.g., an epsilon-greedy policy over $Q$). Assuming use of a differentiable representation, option parameters are learned using gradient descent.
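As a numeric illustration of how a deliberation cost enters the option utility, the following NumPy sketch evaluates the utility and the value over options at a single next state. It assumes the standard option-critic form, in which the cost is subtracted from the value obtained upon termination; the function names, the symbol `eta`, and the toy numbers are hypothetical and not taken from the released implementation.

```python
import numpy as np


def epsilon_greedy_over_options(q, epsilon):
    """Probabilities of an epsilon-greedy policy over option values q."""
    n = q.shape[0]
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q))] += 1.0 - epsilon
    return probs


def option_utility(q_next, beta_next, pi_over_options, eta):
    """Utility of arriving in next state s' while executing each option.

    q_next:          (n_options,) option values Q(s', w)
    beta_next:       (n_options,) termination probabilities beta_w(s')
    pi_over_options: (n_options,) policy over options at s'
    eta:             deliberation cost charged when an option terminates
    """
    # Value over options: expectation of Q under the policy over options.
    v_next = float(np.dot(pi_over_options, q_next))
    # Continue with probability (1 - beta); otherwise terminate and pay eta.
    u = (1.0 - beta_next) * q_next + beta_next * (v_next - eta)
    return u, v_next


q_next = np.array([1.0, 3.0])
beta_next = np.array([0.0, 1.0])  # option 0 never ends here, option 1 always does
pi = epsilon_greedy_over_options(q_next, epsilon=0.1)
u, v = option_utility(q_next, beta_next, pi, eta=0.5)
print(pi, v, u)
```

With these numbers the terminating option's utility falls from the value over options (2.9) to 2.4 once the cost is charged, while the non-terminating option is unaffected, which is the mechanism by which the cost discourages frequent termination.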
Our goal is to design a mechanism that enables the learner to modulate high-dimensional sensory inputs, focusing on pertinent stimuli that may lead to more efficient skill learning. This section presents motivations behind attentive skill learning, then introduces the proposed framework.
Before presenting the proposed architecture, let us first motivate our interests towards attentive skill learning. One might argue that the combination of deep learning and RL already affords agents the representation learning capabilities necessary for proficient decision-making in high-dimensional domains; i.e., why the need for crossmodal attention?
Our ideas are motivated by the studies in behavioral neuroscience that suggest the interplay of attention and choice bias humans’ value of information during learning, playing a key factor in solving tasks with high-dimensional information streams (Leong et al., 2017). Works studying learning in the brain also suggest a natural pairing of attention and hierarchical learning, where domain regularities are embedded as priors into skills and combined with attention to alleviate the curse of dimensionality (Niv et al., 2015). Works also suggest attention plays a role in the intrinsic curiosity of agents during learning, through direction of focus to regions predicted to have high reward (Mackintosh, 1975), high uncertainty (Pearce and Hall, 1980), or both (Pearce and Mackintosh, 2010).
In view of these studies, we conjecture that crossmodal attention, in combination with HRL, improves representations of relevant environmental features that lead to superior learning and decision-making. Specifically, using crossmodal attention, agents combine internal beliefs with external stimuli to more effectively exploit multiple modes of input features for learning. As we later demonstrate, our approach captures temporal crossmodal dependencies, and enables faster and more proficient learning of skills in the domains examined.
We propose Crossmodal Attentive Skill Learner (CASL), a novel end-to-end framework for HRL. One may consider many blueprints for integration of multi-sensory attention into the options framework. Our proposed architecture is primarily motivated by the literature that taxonomizes attention into two classes: exogenous and endogenous. The former is an involuntary mechanism triggered automatically by the inherent saliency of the sensory inputs, whereas the latter is driven by the intrinsic and possibly long-term goals, intents, and beliefs of the agent (Carrasco, 2011). Previous attention-based neural architectures take advantage of both classes, for instance, to solve natural language processing problems (Vinyals et al., 2015). Our approach follows this schema.
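The CASL network architecture is visualized in Fig. 1. Let $M$ be the number of sensor modalities (e.g., vision, audio, etc.) and $x_m^t$ denote extracted features from the $m$-th sensor, where $m \in \{1, \ldots, M\}$. For instance, $x_m^t$ may correspond to feature outputs of a convolutional neural network given an image input. Given extracted features for all $M$ sensors at timestep $t$, as well as hidden state $h^{t-1}$, the proposed crossmodal attention layer learns the relative importance of each modality $\alpha^t \in \Delta^{M-1}$, where $\Delta^{M-1}$ is the $(M-1)$-simplex:

$$z^t = \tanh\Big( \sum_{m=1}^{M} \big( W_m^\top x_m^t + b_m \big) + W_h^\top h^{t-1} + b_h \Big), \tag{4}$$

$$\alpha^t = \mathrm{softmax}\big( W_z^\top z^t + b_z \big), \tag{5}$$

$$x_\oplus^t = \begin{cases} \sum_{m=1}^{M} \alpha_m^t x_m^t & \text{(summed attention)} \\ \big[ (\alpha_1^t x_1^t)^\top, \ldots, (\alpha_M^t x_M^t)^\top \big]^\top & \text{(concatenated attention)} \end{cases} \tag{6}$$

Weight matrices $W_m$, $W_h$, $W_z$ and bias vectors $b_m$, $b_h$, $b_z$ are trainable parameters and nonlinearities are applied element-wise.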
Both exogenous attention over sensory features and endogenous attention over LSTM hidden state are captured in (4). The sensory feature extractor used in experiments consists of convolutional layers with ReLU activations. Attended features $\alpha_m^t x_m^t$ may be combined via summation or concatenation (per (6)), then fed to an LSTM cell. The LSTM output captures temporal dependencies used to estimate option values, intra-option policies, and termination conditions ($Q$, $\pi_\omega$, $\beta_\omega$ in Fig. 1, respectively),

$$Q(s^t, \cdot) = W_Q h^t + b_Q, \tag{7}$$

$$\pi_\omega(\cdot \mid s^t) = \mathrm{softmax}\big( W_{\pi,\omega} h^t + b_{\pi,\omega} \big), \tag{8}$$

$$\beta_\omega(s^t) = \sigma\big( W_{\beta,\omega} h^t + b_{\beta,\omega} \big), \tag{9}$$

where weight matrices $W_Q$, $W_{\pi,\omega}$, $W_{\beta,\omega}$ and bias vectors $b_Q$, $b_{\pi,\omega}$, $b_{\beta,\omega}$ are trainable parameters for the current option $\omega$, and $\sigma(\cdot)$ is the sigmoid function. Network parameters are updated using gradient descent. Entropy regularization of attention outputs $\alpha^t$ was found to encourage exploration of crossmodal attention behaviors during training.
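To make the data flow of the attention layer concrete, here is a minimal NumPy sketch of one forward pass. It is an illustration under stated assumptions rather than the released TensorFlow code: it assumes a tanh projection of the per-modality features plus the previous hidden state, followed by a softmax over modalities, and every name, shape, and random parameter below is hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def crossmodal_attention(features, h_prev, params):
    """Return one attention weight per modality (a point on the simplex).

    features: list of M feature vectors, one per sensor
    h_prev:   previous LSTM hidden state
    params:   dict of (hypothetical) trainable weights and biases
    """
    # Endogenous term: driven by the agent's internal (hidden) state.
    pre = params["W_h"] @ h_prev + params["b"]
    # Exogenous term: driven by the current sensory features.
    for W_m, x_m in zip(params["W_x"], features):
        pre = pre + W_m @ x_m
    z = np.tanh(pre)
    return softmax(params["W_a"] @ z + params["b_a"])


def combine_features(features, alpha, mode="sum"):
    """Weight each modality by its attention, then sum or concatenate."""
    weighted = [a * x for a, x in zip(alpha, features)]
    if mode == "sum":  # requires equal feature dimensions across modalities
        return np.sum(weighted, axis=0)
    if mode == "concat":
        return np.concatenate(weighted)
    raise ValueError(f"unknown mode: {mode}")


def attention_entropy(alpha, eps=1e-8):
    """Entropy of the attention weights, usable as a regularization bonus."""
    return float(-np.sum(alpha * np.log(alpha + eps)))


rng = np.random.default_rng(0)
M, d, n, k = 2, 4, 3, 5  # modalities, feature dim, hidden dim, attention dim
features = [rng.normal(size=d) for _ in range(M)]
h_prev = rng.normal(size=n)
params = {
    "W_x": [rng.normal(size=(k, d)) for _ in range(M)],
    "W_h": rng.normal(size=(k, n)),
    "b": np.zeros(k),
    "W_a": rng.normal(size=(M, k)),
    "b_a": np.zeros(M),
}
alpha = crossmodal_attention(features, h_prev, params)
summed = combine_features(features, alpha, mode="sum")
concat = combine_features(features, alpha, mode="concat")
print(alpha, attention_entropy(alpha))
```

Summation keeps the LSTM input size independent of the number of sensors, whereas concatenation preserves which modality each feature came from; the entropy term is largest when attention is spread evenly, so adding it to the objective discourages premature collapse onto one sensor.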
The proposed framework is evaluated on a variety of learning tasks with inherent reward sparsity and transition noise. We evaluate our approach in three domains: a door puzzle domain, a 2D Minecraft-like domain, and the Arcade Learning Environment (Bellemare et al., 2013). These environments include challenging combinations of reward sparsity and/or complex audio-video sensory input modalities that may not always be useful to the agent. The first objective of our experiments is to analyze performance of CASL in terms of learning rate and transfer learning. The second objective is to understand relationships between attention and memory mechanisms (as captured in the LSTM cell state). Finally, we modify the Arcade Learning Environment to support audio queries, and evaluate crossmodal learning in the Atari 2600 game Amidar.
We first evaluate crossmodal attention in a sequential door puzzle game, where the agent spawns in a 2D world with two locked doors and a key at fixed positions. The key type is randomly generated, and its observable color indicates the associated door. The agent hears a fixed sound (but receives no reward) when adjacent to the key, and hears noise otherwise. The agent must find and pick up the key (which then disappears), then find and open the correct door to receive a (discounted) reward. The game terminates upon opening of either door. The agent’s sensory inputs are vision (grayscale image) and audio spectrogram. This task was designed in such a way that audio is not necessary to achieve the task; the agent can focus solely on learning a policy that maps visual features to opening the correct door. However, audio provides potentially useful signals that may accelerate learning, making this a domain of interest for analyzing the interplay of attention and sensor modalities.
Figure 1(a) shows ablative training results for several network architectures. The three LSTM-based skill learners (including CASL) converge to the optimal value. Interestingly, the network that ignores audio inputs (V-O-LSTM) converges faster than its audio-enabled counterpart (V-A-O-LSTM), indicating the latter is overwhelmed by the extra sensory modality. Introduction of crossmodal attention enables CASL to converge faster than all other networks, using roughly half the training data of the others. The feedforward networks all fail to attain optimal value, with the non-option cases (V-A-FF and V-FF) repeatedly opening one door due to lack of memory of key color. Notably, the option-based feedforward nets exploit the option index to implicitly remember the key color, leading to higher value. Interplay between explicit memory mechanisms and use of options as pseudo-memory may be an interesting line of future work.
We also evaluate crossmodal attention for transfer learning (Fig. 1(b)), using the more promising option-based networks. The door puzzle domain is modified to randomize the key position, with pre-trained options from the fixed-position variant used for initialization. All networks benefit from an empirical return jumpstart of 0.2 at the beginning of training, due to skill transfer. Once again, CASL converges fastest, indicating more effective use of the available audio-video data. While the asymptotic performance of CASL is only slightly higher than that of the V-A-O-LSTM network, the reduction in the number of samples needed to achieve a high score (e.g., after 100K episodes) makes it advantageous for domains with high sampling cost.
Temporal behaviors of the attention mechanism are also evaluated in a 2D Minecraft-like domain, where the agent must pick an appropriate tool (pickaxe or shovel) to mine either gold or iron ore (Figs. 2(a), 2(b) and 2(c)). Critically, the agent observes identical images for both ore types, but unique audio features when near the ore, making long-term audio storage necessary for selection of the correct tool. The agent receives a reward for correct tool selection, a different reward for incorrect selection, and a step cost. Compared to the door puzzle game, the mining domain is posed in such a way that the interplay between audio-video features is emphasized. Specifically, an optimal policy for this task must utilize both audio and video features: visual inputs enable detection of the locations of the ore, agent, and tools, whereas audio is used to identify the ore type.
Visual occlusion of the ore type, interplay of audio-video features, and sparse positive rewards cause the non-attentive network to fail to learn in the mining domain, as opposed to the attentive case (Fig. 2(d)). Figure 3(a) plots a sequence of frames where the agent anticipates salient audio features as it nears the ore, gradually increasing audio attention, then sharply reducing it to 0 after hearing the signal.
While the anticipatory nature of crossmodal attention in the mining domain is interesting, it also points to additional lines of investigation regarding interactions of attention and updates of the agent’s internal belief (as encoded in its LSTM cell state). Specifically, one might wonder whether it is necessary for the agent to place any attention on the non-useful audio signals prior to the timestep at which it nears the ore in Fig. 3(a), and also whether this behavior implies inefficient usage of its finite-size memory state.
Motivated by the above concerns, we conduct more detailed analysis of the interplay between the agent’s attention and memory mechanisms as used in the CASL architecture (Fig. 1). Readers are referred to the appendix (Section 6.1) for details on how this analysis was conducted, as well as a brief overview of LSTM units. Given the sequence of audio-video inputs in Fig. 3(a), we plot overall activations of the forget and input LSTM gates (averaged across all cell state elements), in Fig. 3(b) and Fig. 3(c), respectively. Critically, these plots also indicate the relative influence of the forget and input LSTM gates’ contributing variables (audio input, video input, hidden state, and bias term) to the overall activation.
Interestingly, prior to the timestep at which the agent nears the ore, the contribution of audio to the forget and input gates is essentially zero, despite the positive attention on audio (in Fig. 3(a)). Recall that a low forget gate activation corresponds to complete forgetting of the previous LSTM cell state element, whereas a high input gate activation corresponds to complete throughput of the corresponding input element. At that timestep, the forget gate activation drops, while the input gate experiences a sudden increase, indicating major overwriting of previous memory states with new information. Critically, the plots indicate that the attended audio input is the key contributing factor of both behaviors. In Fig. 3(a), after the agent hears the necessary audio signal, it moves attention entirely to video; the contribution of audio to the forget and input activations also drops to zero. These behaviors indicate that the agent attends to audio in anticipation of an upcoming pertinent signal, but chooses not to embed it into memory until the appropriate moment. Attention filters irrelevant sensor modalities, given the contextual clues provided by exogenous and endogenous input features; it therefore enables the LSTM gates to focus on learning when and how to update the agent’s internal state.
Preliminary evaluation of crossmodal attention was conducted in the Arcade Learning Environment (ALE) (Bellemare et al., 2013). We modified ALE to support audio queries, as it previously did not have this feature; we plan to add this code to the ALE repository.
Table 1: Amidar scores.
|Algorithm|Hierarchical?|Sensory inputs|Score|
|---|---|---|---|
|Mnih et al. (2015)|✗|Video|739.5|
|Mnih et al. (2016)|✗|Video|283.9|
|Babaeizadeh et al. (2017)|✗|Video|218|
|Ours (without options)|✗|Audio & Video|900|
|Harb et al. (2017)|✓|Video|880.0|
|Vezhnevets et al. (2017)|✓|Video|2500|
Our current line of investigation in ALE considers impacts of crossmodal attention on agent behavior, focusing on the primitive action case prior to moving to option learning. Experiments were conducted in the Atari 2600 game Amidar (Fig. 5), one of the games in which deep Q-networks failed to exceed human-level performance (Mnih et al., 2015). The objective in Amidar is to collect rewards in a rectilinear maze while avoiding patrolling enemies. Rewards are collected by painting segments of the maze, killing enemies at opportune moments, or collecting bonuses. Background audio plays throughout the game, and specific audio signals play when the agent crosses previously-unseen segment vertices. Figure 5 reveals that the agent anticipates and increases audio attention when near these critical vertices, which are especially difficult to observe when the agent sprite is overlapping them (e.g., zoom into RGB sequences of Fig. 5).
Our crossmodal attentive agent achieves a mean score of 900 in Amidar, over 30 test runs. To the best of our knowledge, this is the state-of-the-art score for non-hierarchical methods (Table 1). Note that we also beat the score of the hierarchical approach of Harb et al. (2017). We emphasize these are not direct comparisons due to our method leveraging additional sensory inputs, but mainly meant to highlight the performance benefits of crossmodal learning. We are currently conducting experiments with CASL in audio-enabled ALE, to evaluate it against the state-of-the-art performance of the hierarchical FeUdal Networks (Vezhnevets et al., 2017).
This work introduced the Crossmodal Attentive Skill Learner (CASL), integrated with the recently-introduced Asynchronous Advantage Option-Critic (A2OC) architecture (Harb et al., 2017) to enable hierarchical reinforcement learning across multiple sensory inputs. We provided concrete examples where CASL not only improves performance in a single task, but accelerates transfer to new tasks. We demonstrated the learned attention mechanism anticipates and identifies useful sensory features, while filtering irrelevant sensor modalities during execution. We modified the Arcade Learning Environment (Bellemare et al., 2013) to support audio queries, and evaluations of crossmodal learning were conducted in the Atari 2600 game Amidar. Finally, building on the recent work of Babaeizadeh et al. (2017), we open-source a fast hybrid CPU-GPU implementation of CASL. This investigation indicates crossmodal skill learning as a promising avenue for future works in HRL that target domains with high-dimensional, multimodal inputs.
This work was supported by Boeing Research & Technology and ONR BRC Grant N000141712072.
Bellemare et al. (2013). Journal of Artificial Intelligence Research, 47:253–279, 2013.
Mnih et al. (2016). International Conference on Machine Learning, pages 1928–1937, 2016.
Pearce and Mackintosh (2010). Attention and associative learning: From brain to behaviour, pages 11–39, 2010.
Yeung et al. (2015). International Journal of Computer Vision, pages 1–15, 2015.
We provide a brief overview of LSTM networks to enable more rigorous discussion of attention-memory interactions. At timestep $t$, LSTM cell state $C^t$ encodes the agent’s memory given its previous stream of inputs. The cell state is updated as follows,

$$f^t = \sigma\big( W_f [x_\oplus^t, h^{t-1}] + b_f \big), \tag{10}$$

$$i^t = \sigma\big( W_i [x_\oplus^t, h^{t-1}] + b_i \big), \tag{11}$$

$$C^t = f^t \odot C^{t-1} + i^t \odot \tanh\big( W_C [x_\oplus^t, h^{t-1}] + b_C \big), \tag{12}$$

where $f^t$ is the forget gate activation vector, $i^t$ is the input gate activation vector, $h^{t-1}$ is the previous hidden state vector, $x_\oplus^t$ is the attended feature vector, $[\cdot, \cdot]$ denotes concatenation, and $\odot$ refers to the Hadamard product. Weights $W_f$, $W_i$, $W_C$ and biases $b_f$, $b_i$, $b_C$ are trainable parameters. The cell state update in (12) first forgets certain elements ($f^t$ term), and then adds contributions from new inputs ($i^t$ term). Note that a forget gate activation of $0$ corresponds to complete forgetting of the previous cell state element, and that an input gate activation of $1$ corresponds to complete throughput of the corresponding input element.
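The cell-state update described above can be sketched in a few lines of NumPy. This follows the standard LSTM formulation (sigmoid forget and input gates, tanh candidate) and is only illustrative; the function and argument names are hypothetical, and the output gate is omitted because it does not affect the cell state.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_cell_state_update(x, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c):
    """One cell-state update; returns the new state and both gate activations."""
    v = np.concatenate([x, h_prev])   # attended features joined with hidden state
    f = sigmoid(W_f @ v + b_f)        # forget gate: 0 erases, 1 keeps
    i = sigmoid(W_i @ v + b_i)        # input gate: 0 blocks, 1 passes through
    c_tilde = np.tanh(W_c @ v + b_c)  # candidate content from the new input
    c = f * c_prev + i * c_tilde      # element-wise (Hadamard) products
    return c, f, i


# Hypothetical extreme case: forget everything, write the candidate fully.
d, n = 3, 2
zeros = np.zeros((n, d + n))
c, f, i = lstm_cell_state_update(
    x=np.ones(d), h_prev=np.zeros(n), c_prev=np.full(n, 5.0),
    W_f=zeros, b_f=np.full(n, -50.0),
    W_i=zeros, b_i=np.full(n, 50.0),
    W_c=zeros, b_c=np.ones(n),
)
print(f, i, c)
```

In this extreme setting the previous state (5.0) is discarded and the new state is essentially the candidate value tanh(1), which mirrors the "major overwriting" behavior that a simultaneous forget-gate drop and input-gate rise produce.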
Our goal is to not only analyze the overall forget/input activations throughout the gameplay episode, but also to quantify the relative impact of each contributing variable (audio input, video input, hidden state, and bias term) to the overall activations. Many methods may be used for analysis of the contribution of explanatory variables in nonlinear models (i.e., (10) to (12)). We introduce a means of quantifying the correlation of each variable with respect to the corresponding activation function. In the following, we focus on the forget gate activation, but the same analysis applies to the input gate. First, expanding the definition of forget gate activation in (10), assuming use of concatenated attention (per (6)), yields,

$$f^t = \sigma\big( W_{fa}\, \alpha_a^t x_a^t + W_{fv}\, \alpha_v^t x_v^t + W_{fh}\, h^{t-1} + I\, b_f \big), \tag{13}$$

where $x_a^t$ and $x_v^t$ are, respectively, the audio and video input features, $W_{fa}$, $W_{fv}$, and $W_{fh}$ are the blocks of $W_f$ acting on the audio features, video features, and hidden state, and $I$ is the identity matrix. Define $f_{\neg k}^t$ as the forget gate activation if the $k$-th contributing variable were removed. For example, if audio input were to be removed, then,

$$f_{\neg a}^t = \sigma\big( W_{fv}\, \alpha_v^t x_v^t + W_{fh}\, h^{t-1} + I\, b_f \big). \tag{14}$$

Define the forget gate activation residual as $\tilde{f}_k^t = |f^t - f_{\neg k}^t|$ (i.e., the difference in output resulting from removal of the $k$-th contributing variable). Then, one can define a ‘pseudo correlation’ of the $k$-th contributing variable with respect to the true activation,

$$\rho(f_k^t) = \frac{\tilde{f}_k^t}{\sum_{k'} \tilde{f}_{k'}^t}. \tag{15}$$

This provides an approximate quantification of the relative contribution of the $k$-th variable (audio input, video input, hidden unit, or bias) to the overall activation of the forget and input gates. Armed with this toolset, we can analyze the interplay between attention and LSTM memory.
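The leave-one-out analysis above can be sketched as follows. The snippet recomputes a gate activation with each contributing term removed, takes the absolute change as that term's residual, and normalizes the residuals so they sum to one. The normalization and the averaging across gate elements are assumptions made for this illustration, and all names and numbers are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gate_activation(contribs, exclude=None):
    """Gate activation from additive pre-activation terms, optionally omitting one.

    contribs maps a variable name (e.g. "audio") to its pre-activation
    contribution, i.e. the relevant weight block applied to that variable.
    """
    total = sum(v for name, v in contribs.items() if name != exclude)
    return sigmoid(total)


def pseudo_correlation(contribs, eps=1e-12):
    """Relative contribution of each variable, averaged over gate elements."""
    full = gate_activation(contribs)
    residuals = {
        name: np.abs(full - gate_activation(contribs, exclude=name))
        for name in contribs
    }
    denom = sum(residuals.values()) + eps  # eps guards against an all-zero case
    return {name: float(np.mean(r / denom)) for name, r in residuals.items()}


# Hypothetical timestep where only the video term moves the gate.
contribs = {
    "audio": np.zeros(2),
    "video": np.array([2.0, 2.0]),
    "hidden": np.zeros(2),
    "bias": np.zeros(2),
}
scores = pseudo_correlation(contribs)
print(scores)
```

A score near zero for a variable, as for audio here, means that removing it leaves the gate unchanged even if the attention weight on that modality is positive, which is the situation reported before the agent hears the ore's signal.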