Recent works have shown how low-dimensional representations captured by generative models can be successfully exploited in reinforcement learning (RL) settings. Among others, these generative models have been used to learn low-dimensional latent representations of the state space to improve the learning efficiency of RL algorithms zhang2018arxiv; gelada2019arxiv, or to allow the generalization of policies learned on a source domain to other target domains finn2016icra; higgins2017icml. The DisentAngled Representation Learning Agent (DARLA) approach higgins2017icml, in particular, builds such latent representations using variational autoencoder (VAE) methods kingma2013arxiv; rezende2014arxiv, and shows how learning disentangled features of the observed environment can allow an RL agent to learn a policy robust to some shifts in the original domain.
In this work, we explore the application of these latent representations in capturing different input sensory modalities to be considered in the context of RL tasks. We build upon recent work that extends VAE methods to learn joint distributions of multiple modalities, by forcing the individual latent representations of each modality to be similar suzuki2016arxiv; yin2017aaai. These multimodal VAEs allow for cross-modality inference, replicating more closely what seems to be the nature of the multimodal representation learning performed by humans damasio1989cognition; meyer2009tns. Inspired by these advances, we explore the impact of such multimodal latent representations in allowing a reinforcement learning agent to learn and exploit policies over different input modalities. We envision, for example, scenarios where reinforcement learning agents are given the ability to learn a visual policy (a policy learned over image inputs), and then reuse that policy at test time when only sound inputs are available. Figure 1 instantiates this example in the case of video games: a policy is learned over images and then reused when only the game sounds are available, i.e., when playing “in the dark”.
To achieve this, we contribute an approach for multimodal transfer reinforcement learning, which effectively allows an RL agent to learn robust policies over input modalities, achieving better out-of-the-box performance when compared to different baselines. We start by first learning a generalized latent space over the different input modalities that the agent has access to. This latent space is constructed using a multimodal generative model, allowing the agent to establish mappings between the different modalities, for example, “which sounds do I typically associate with this visual sensory information”. Then, in the second step, the RL agent learns a policy directly on top of this latent space. Importantly, during this training step, the agent may only have access to a subset of the input modalities (say, images but not sound). In practice, this translates into the RL agent learning a policy over a latent space constructed relying only on some modalities. Finally, the transfer occurs in the third step, where, at test time, the agent may have access to a different subset of modalities, but still performs the task using the same policy. These results hold consistently across different OpenAI Gym brockman2016arxiv and Atari-like bellemare2013jair environments. This is the case even when using different multimodal generative models yin2017aaai and reinforcement learning algorithms mnih2015nature; lillicrap2015arxiv.
The third and last step unveils what sets our work apart from the existing literature. By using (single-modality) VAE methods, the current state-of-the-art approaches implicitly assume that the source and target domains are characterized by similar inputs, such as raw observations of a camera. In these approaches, the latent space is used to capture isolated properties (such as colors or shapes) that may vary throughout the tasks. This is in contrast with our approach, where the latent space is seen as a mechanism to create a mapping between different input modalities.
The remainder of the paper is structured as follows. We start in Section 2 by introducing relevant background and related work on generative models and reinforcement learning. Then, in Section 3 we introduce our approach to multimodal transfer reinforcement learning, and evaluate it in Section 4. We finish with some final considerations in Section 5.
This section introduces required background on deep generative models and deep reinforcement learning.
2.1. Deep Generative Models
2.1.1. Variational Autoencoders
Deep generative models have shown great promise in learning generalized representations of data. For single-modality data, the variational autoencoder (VAE) model is widely used. The VAE model kingma2013arxiv learns a joint distribution $p_\theta(x, z)$ of data $x$, which is generated by a latent variable $z$. Figure 1(a) depicts this model. The latent variable is often of lower dimensionality in comparison with the modality itself, and acts as the representation vector in which data is encoded.
The joint distribution takes the form $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$, where $p(z)$ (the prior distribution) is often a unitary Gaussian ($z \sim \mathcal{N}(0, I)$). The generative distribution $p_\theta(x \mid z)$, parameterized by $\theta$, is usually composed with a simple likelihood term (e.g. Bernoulli or Gaussian).
The training procedure of the VAE model involves the maximization of the evidence likelihood $p(x)$, by marginalizing over the latent variable and resorting to an inference network $q_\phi(z \mid x)$ to approximate the posterior distribution. We obtain a lower-bound on the log-likelihood of the evidence (ELBO), $\log p(x) \geq \mathcal{L}(x)$, with
$$\mathcal{L}(x) = \lambda\, \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \beta\, \mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p(z)\right],$$
where the Kullback-Leibler divergence term $\mathrm{KL}[q_\phi(z \mid x) \,\|\, p(z)]$ promotes a balance between the latent channel’s capacity and the encoding process of data. Moreover, in the model’s training procedure, the hyperparameters $\lambda$ and $\beta$ weight the importance of reconstruction quality and latent space independence, respectively. The optimization of the ELBO is performed resorting to gradient-based methods.
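As a rough numerical illustration of the objective described above, the sketch below evaluates a weighted ELBO for a single data point. It assumes a diagonal Gaussian inference network and a standard normal prior, for which the KL term has a closed form. The function names, the `log_var` parameterization, and the default weights are illustrative assumptions, not the implementation used in this work.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def weighted_elbo(recon_log_lik, mu, log_var, lam=1.0, beta=1.0):
    """Reconstruction term weighted by lam minus KL term weighted by beta.

    recon_log_lik is an estimate of the expected reconstruction
    log-likelihood, supplied by the caller (assumption: computed elsewhere,
    e.g. from a decoder and a Monte Carlo sample of the latent variable).
    """
    return lam * recon_log_lik - beta * kl_to_standard_normal(mu, log_var)
```

In this form, raising `beta` penalizes encodings that stray from the prior, while raising `lam` favors reconstruction quality.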
2.1.2. Multimodal Variational Autoencoders
VAE models have been extended in order to perform inference across different modalities. The Associative Variational Autoencoder (AVAE) model yin2017aaai, depicted in Figure 1(b), is able to learn a common latent representation of two modalities ($x$, $w$). It does so by imposing a similarity restriction on the separate single-modality latent representations ($z_x$, $z_w$), employing a KL divergence term on the ELBO of the model:
$$\mathcal{L}(x, w) = \mathcal{L}(x) + \mathcal{L}(w) - \alpha\, \mathrm{KL}^{\star}\left[q_\phi(z_x \mid x) \,\|\, q_\phi(z_w \mid w)\right],$$
where $\mathrm{KL}^{\star}[p \,\|\, q]$ is the symmetrical Kullback-Leibler divergence between two distributions $p$ and $q$, and $\alpha$ is a constant that weights the importance of keeping similar latent spaces in the training procedure yin2017aaai. We note that each modality is associated with a different encoder-decoder pair. Moreover, the encoder and the decoder can be implemented as neural networks with different architectures.
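The similarity restriction between the two single-modality latent representations can be sketched for diagonal Gaussian encoders as follows. Two points are assumptions made for illustration: the symmetrical KL is taken as the sum of the two directed divergences, and the combined objective simply subtracts the weighted divergence from the two single-modality ELBOs. All function names are hypothetical.

```python
import math


def kl_diag_gaussians(mu_p, log_var_p, mu_q, log_var_q):
    """KL[p || q] for two diagonal Gaussians given means and log-variances."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, log_var_p, mu_q, log_var_q):
        total += 0.5 * (
            lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0
        )
    return total


def symmetric_kl(mu_p, log_var_p, mu_q, log_var_q):
    """Assumed definition: KL[p || q] + KL[q || p]."""
    return kl_diag_gaussians(mu_p, log_var_p, mu_q, log_var_q) + kl_diag_gaussians(
        mu_q, log_var_q, mu_p, log_var_p
    )


def associative_objective(elbo_x, elbo_w, sym_kl, alpha=1.0):
    """Assumed combination: both ELBOs minus the alpha-weighted similarity penalty."""
    return elbo_x + elbo_w - alpha * sym_kl
```

When both encoders output the same distribution for paired inputs, the penalty vanishes, which is what makes encoding one modality and decoding the other possible.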
Other models aim at learning a joint distribution of both modalities, $p_\theta(x, w)$. Examples include the Joint Multimodal Variational Autoencoder (JMVAE) suzuki2016arxiv or the Multi-Modal Variational Autoencoder (MVAE) korthals2019arxiv. These generative models are able to build a representation space of both modalities simultaneously while maintaining similarity restrictions with the single-modality representations, as shown in the JMVAE model presented in Figure 1(c).
However, a fundamental feature of all multimodal generative models is the ability to perform cross-modality inference, that is, the ability to input modality-specific data, encode the corresponding latent representation, and, from that representation, generate data of a different modality. This is possible due to the forced approximation of the latent representations of each modality, and the process follows the orange and green arrows in Figure 2.
2.2. Reinforcement Learning
Reinforcement learning (RL) is a framework for optimizing the behaviour of an agent operating in a given environment. This framework is formalized as a Markov decision process (MDP), a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, P, r, \gamma)$ that describes a sequential decision problem under uncertainty. $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces, respectively, and both are known by the agent. When the agent takes an action $a \in \mathcal{A}$ while in state $x \in \mathcal{X}$, the world transitions to state $y \in \mathcal{X}$ with probability $P(y \mid x, a)$ and the agent receives an immediate reward $r(x, a)$. Typically, functions $P$ and $r$ are unknown to the agent. Finally, the discount factor $\gamma \in [0, 1)$ sets the relative importance of present and future rewards.
Solving the MDP consists in finding an optimal policy $\pi^*$, a mapping from states to actions, which ensures that the agent collects as much reward as possible. Such a policy can be found from the optimal $Q$-function, which is defined recursively for every state-action pair $(x, a) \in \mathcal{X} \times \mathcal{A}$ as
$$Q^*(x, a) = r(x, a) + \gamma \sum_{y \in \mathcal{X}} P(y \mid x, a) \max_{a' \in \mathcal{A}} Q^*(y, a').$$
Multiple methods can be used in computing this function sutton1998reinforcement, for example $Q$-learning watkins89phd.
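To make the recursive definition concrete, the sketch below computes the optimal Q-function of a toy two-state MDP by repeatedly applying the recursion until it stops changing. Note that this is model-based value iteration, which requires known transition probabilities and rewards; it is not Q-learning, which estimates the same function from sampled transitions. The MDP and all names are hypothetical.

```python
def q_value_iteration(P, r, gamma, n_sweeps=200):
    """Apply the recursive Q-function definition for a fixed number of sweeps.

    P[x][a] is a dict {next_state: probability}; r[x][a] is the reward.
    """
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_sweeps):
        Q = [
            [
                r[x][a] + gamma * sum(p * max(Q[y]) for y, p in P[x][a].items())
                for a in range(n_actions)
            ]
            for x in range(n_states)
        ]
    return Q


def greedy_policy(Q):
    """Pick the highest-valued action in every state."""
    return [max(range(len(q)), key=q.__getitem__) for q in Q]


# Toy MDP: in state 0, action 0 stays put (reward 0) and action 1 moves to
# the absorbing state 1 (reward 1). State 1 yields no further reward.
P = [
    [{0: 1.0}, {1: 1.0}],
    [{1: 1.0}, {1: 1.0}],
]
r = [
    [0.0, 1.0],
    [0.0, 0.0],
]
Q = q_value_iteration(P, r, gamma=0.9)
policy = greedy_policy(Q)
```

Here the greedy policy takes action 1 in state 0, since collecting the reward immediately is worth more than the discounted value of waiting.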
More recently, research has turned towards applying deep learning methods in RL problems, leading to new methods. For example, Deep $Q$-Network (DQN) is a variant of the $Q$-learning algorithm that uses a deep neural network to parameterize an approximation of the $Q$-function, $Q_\theta$, with parameters $\theta$. DQN assumes discrete action spaces $\mathcal{A}$, and has proven suitable for learning policies that beat Atari games mnih2015nature. Continuous action spaces require specialized algorithms. For example, Deep Deterministic Policy Gradient (DDPG) is an actor-critic, policy gradient algorithm that can deal with continuous action spaces, and has been shown to perform well in complex control tasks lillicrap2015arxiv.
3. Multimodal Transfer Reinforcement Learning
Consider an agent facing a sequential decision problem described as an MDP $\mathcal{M} = (\mathcal{X}, \mathcal{A}, P, r, \gamma)$. This agent is endowed with a set $\{I_1, \ldots, I_N\}$ of $N$ different input modalities, which can be used in perceiving the world and building a possibly partial observation of the current state $x \in \mathcal{X}$. Different modalities may provide more, or less, perceptual information than others. Some modalities may be redundant (i.e., provide the same perceptual information) or complement each other (i.e., jointly provide more information). Figure 3 provides an abstract illustration of the connection between different input modalities, and the corresponding impact on the state space that can be perceived by the agent.
Our goal is for the agent to learn a policy while observing only a subset of input modalities, $I_{\text{train}}$, and then use that same policy when observing a possibly different subset of modalities, $I_{\text{test}}$, with as little performance degradation as possible.
Our approach consists of a three-stage pipeline:
Learn a perceptual model of the world.
Learn to act in the world.
Transfer policy.
We now discuss each step in further detail.
3.1. Learn a perceptual model of the world
Let $\mathcal{I}$ denote the Cartesian product of input modalities, $\mathcal{I} = I_1 \times \cdots \times I_N$. Intuitively, we can think of $\mathcal{I}$ as the complete perceptual space of the agent. We write $i$ to denote an element of $\mathcal{I}$. Figure 3(a) depicts an example on a game, where the agent can have access to two modalities, $I_{\text{image}}$ and $I_{\text{sound}}$, corresponding to visual and sound information.
At each moment, the agent may not have access to the complete perception $i \in \mathcal{I}$, but only to a partial view thereof. Following our discussion in Section 1, we are interested in learning a multimodal latent representation of the perceptions in $\mathcal{I}$. Such representation amounts to a set of latent mappings $\mathcal{F} = \{F_1, \ldots, F_L\}$. Each map $F_\ell$ takes the form $F_\ell : \mathrm{proj}_\ell(\mathcal{I}) \to \mathcal{Z}$, where $\mathcal{Z}$ is a common latent space and $\mathrm{proj}_\ell$ projects $\mathcal{I}$ to some subspace of modalities, $\mathcal{I}_\ell$. In Figure 3(a) the set of mappings $\mathcal{F}$ is used to compute a latent representation $z$ from sound and image data.
To learn such mappings, we start by collecting a dataset of $M$ examples of simultaneous sensorial information:
$$\mathcal{D} = \{i^{1}, \ldots, i^{M}\}, \qquad i^{m} \in \mathcal{I}.$$
We then follow an unsupervised learning approach, and train a multimodal VAE on dataset $\mathcal{D}$ to learn a generalized latent space over the agent’s input modalities. The latent mappings in $\mathcal{F}$ correspond to the encoders of the VAE model, while the decoders can be seen as a set of inverse latent mappings that allow for modality reconstruction and cross-modality inference. Figure 3(b) depicts an example of how the multimodal latent space can be used for performing cross-modality inference of sound data given an image input using the modality-specific maps.
The collection of the initial data needed to generate $\mathcal{D}$ can be easier or harder depending on the complexity of the task. In Section 4 we discuss mechanisms to perform this.
3.2. Learn to act in the world
After learning a perceptual model of the world, the agent then learns how to perform the task. We follow a reinforcement learning approach to learn an optimal policy for the task described by MDP $\mathcal{M}$. During this learning phase, we assume the agent may only have access to a subset of input modalities $I_{\text{train}}$. As a result, during its interaction with the environment, the agent collects a sequence of triplets
$$\{(i^{0}_{\text{train}}, a^{0}, r^{0}), (i^{1}_{\text{train}}, a^{1}, r^{1}), \ldots\},$$
where $i^{t}_{\text{train}}$, $a^{t}$, $r^{t}$ correspond to the perceptual observations, action executed, and rewards obtained at timestep $t$, respectively.
However, our reinforcement learning agent does not use this sequence of triplets directly. Instead, it pre-processes the perceptual observations using the previously learned latent maps $\mathcal{F}$ to encode the multimodal latent state at each time step as $z^{t} = F_{\text{train}}(i^{t}_{\text{train}})$, where $F_{\text{train}}$ maps $\mathrm{proj}_{\text{train}}(\mathcal{I})$ into $\mathcal{Z}$. In practice, the RL agent uses a sequence of triplets
$$\{(z^{0}, a^{0}, r^{0}), (z^{1}, a^{1}, r^{1}), \ldots\}$$
to learn a policy $\pi : \mathcal{Z} \to \mathcal{A}$ that maps the latent states to actions. Any continuous-state space reinforcement learning algorithm can be used to learn this policy over the latent states. These latent states are encoded using the generative model trained in the previous section, and as such, the weights of this model should remain frozen during the RL training.
3.3. Transfer policy
The transfer of policies happens once the agent has learned how to perceive and act in the world. At this time, we assume the agent may now have access to a subset of input modalities $I_{\text{test}}$, potentially different from $I_{\text{train}}$, i.e., the set of modalities it used in learning the task policy $\pi$. As a result, during its interaction with the environment, at each time step $t$, the agent will now observe perceptual information $i^{t}_{\text{test}}$.
In order to reuse the policy $\pi$, the agent starts by pre-processing this perceptual observation, again using the set of maps $\mathcal{F}$ previously trained, but now generating a latent state $z^{t} = F_{\text{test}}(i^{t}_{\text{test}})$, where $F_{\text{test}}$ now maps $\mathrm{proj}_{\text{test}}(\mathcal{I})$ into $\mathcal{Z}$. Since policy $\pi$ maps the latent space $\mathcal{Z}$ to the action space $\mathcal{A}$, it can now be used directly to select the optimal action at the new state $z^{t}$.
Effectively, the agent is reusing a policy that was learned over a (possibly) different set of input modalities, with no additional training. This corresponds to a zero-shot transfer of policies. Figure 5 summarizes the three-step pipeline described here.
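The structure of the pipeline can be sketched in a few lines: a set of frozen, modality-specific encoders into a shared latent space, and a single policy defined over that space. The encoders and the policy below are hypothetical toy stand-ins (simple arithmetic functions rather than trained networks), chosen only to show that the policy never needs to know which modality produced the latent state.

```python
from typing import Callable, Dict, Sequence

Latent = Sequence[float]
Encoder = Callable[[object], Latent]
Policy = Callable[[Latent], int]


class MultimodalAgent:
    """Frozen per-modality encoders plus one policy over the shared latent space."""

    def __init__(self, encoders: Dict[str, Encoder], policy: Policy):
        self.encoders = encoders  # stage 1: latent maps, kept frozen afterwards
        self.policy = policy      # stage 2: learned over latents of the training modality

    def act(self, modality: str, observation) -> int:
        z = self.encoders[modality](observation)  # encode whatever modality is available
        return self.policy(z)                     # stage 3: reuse the same policy


# Hypothetical stand-ins for trained encoders: both produce a 1-D latent that
# is meant to agree across modalities for the same underlying state.
def encode_image(pixels):
    return [sum(pixels) / len(pixels)]


def encode_sound(amplitudes):
    return [max(amplitudes)]


def threshold_policy(z):
    return 1 if z[0] > 0.5 else 0


agent = MultimodalAgent(
    {"image": encode_image, "sound": encode_sound}, threshold_policy
)
action_train = agent.act("image", [0.9, 0.7, 0.8])  # modality seen during RL training
action_test = agent.act("sound", [0.2, 0.8])        # modality available at test time
```

The quality of the transfer then depends entirely on how closely the encoders agree in the latent space, which is what the multimodal generative model is trained to enforce.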
4. Experimental Evaluation
We evaluate and analyze the performance of our approach on different scenarios of increasing complexity, not only on the task but also on the input modalities. We start by considering a modified version of the pendulum environment from OpenAI gym, with a simple sound source. Then, we consider hyperhot, a space invaders-like game that assesses the performance of our approach in scenarios with more complex and realistic generation of sounds.
We consider a modified version of the pendulum environment from OpenAI gym, a classic control problem where the goal is to swing the pendulum up so it stays upright. We modify this environment so that the observations include both an image and a sound component. For the sound component, we assume that the tip of the pendulum emits a constant frequency $f_0$, which is received by a set of $S$ sound receivers $\{\rho_1, \ldots, \rho_S\}$. Figure 6 depicts this scenario, where the pendulum and its sound are in red, and the sound receivers correspond to the circles.
Formally, we let $\mathcal{I} = I_{\text{image}} \times I_{\text{sound}}$ denote the complete perceptual space of the agent. The visual input modality of the agent, $I_{\text{image}}$, consists of the raw image observation of the environment. On the other hand, the sound input modality, $I_{\text{sound}}$, consists of the frequency and amplitude received by each of the $S$ microphones of the agent. Moreover, both image and sound observations may be stacked to account for the dynamics of the scenario.
In this scenario, we assume a simple model for the sound generation. Specifically, we assume that, at each timestep, the frequency $f_i$ heard by each sound receiver $\rho_i$ follows the Doppler effect. The Doppler effect measures the change in frequency heard by an observer as it moves towards or away from the source. Slightly abusing our notation, we let $\rho_i$ denote the position of sound receiver $\rho_i$ and $e$ the position of the sound emitter. Formally,
$$f_i = \left(\frac{c + \dot{\rho}_i \cdot \dfrac{e - \rho_i}{\lVert e - \rho_i \rVert}}{c - \dot{e} \cdot \dfrac{\rho_i - e}{\lVert \rho_i - e \rVert}}\right) f_0,$$
where $c$ is the speed of sound and we use the dot notation to represent velocities. Figure 6(a) depicts the Doppler effect in the pendulum scenario.
We then let the amplitude $a_i$ heard by receiver $\rho_i$ follow the inverse square law
$$a_i = \frac{K}{\lVert e - \rho_i \rVert^{2}},$$
where $K$ is a scaling constant. Figure 6(b) depicts the inverse square law applied to the pendulum scenario, showing how the amplitude of the sound generated decreases with the distance to the source.
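The two sound models described above can be sketched in two dimensions as follows. This is a minimal illustration using the textbook Doppler formula with velocities projected onto the line between emitter and receiver; the default speed of sound (343 m/s) and the scaling constant are assumed values, and the function names are hypothetical. The code also assumes the emitter and receiver never occupy the same position.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def doppler_frequency(f0, emitter_pos, emitter_vel, receiver_pos, receiver_vel, c=343.0):
    """Frequency heard by the receiver, for speeds well below c."""
    offset = _sub(emitter_pos, receiver_pos)       # points from receiver to emitter
    dist = math.hypot(*offset)
    unit = (offset[0] / dist, offset[1] / dist)
    v_receiver = _dot(receiver_vel, unit)          # positive when approaching the emitter
    v_emitter = -_dot(emitter_vel, unit)           # positive when approaching the receiver
    return f0 * (c + v_receiver) / (c - v_emitter)


def inverse_square_amplitude(emitter_pos, receiver_pos, k=1.0):
    """Amplitude decaying with the squared emitter-receiver distance."""
    offset = _sub(emitter_pos, receiver_pos)
    return k / _dot(offset, offset)
```

For example, an emitter moving towards a static receiver is heard at a frequency above `f0`, and one moving away is heard below it, while doubling the distance divides the amplitude by four.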
We now provide details on how our approach was set up. All constants and training hyper-parameters used are summarized in Appendix A.1.
4.1.1. Learn a perceptual model of the world
For this task, we adopted the Associative Variational AutoEncoder (AVAE) to learn the family of latent mappings . The AVAE was trained over a dataset with observations of images and sounds , collected using a random controller. The random controller proved to be enough to cover the state space. Before training, the images were preprocessed to black and white and resized to pixels. The sounds were normalized to the range , assuming the minimum and maximum values found in the samples.
For the image-specific encoder we adopted an architecture with two convolutional layers and two fully connected layers. The two convolutional layers learned and filters, respectively, each with kernel size and padding. The two fully connected layers had neurons each. Swish activations ramachandran2017arxiv were used. For the sound-specific encoder, we adopted an architecture with two fully connected layers, each with neurons. One-dimensional batch normalization was used between the two layers. The decoders followed similar architectures. The optimization used the Adam optimizer with PyTorch’s default parameters, and learning rate .
The AVAE loss function penalized poor reconstruction of the image and sound. Image reconstruction loss was measured by binary cross entropy scaled by constant, and sound reconstruction loss was measured by mean squared error scaled by constant . The prior divergence loss terms were scaled by , and the symmetrical KL divergence term by .
4.1.2. Learn to act in the world
The agent learned how to perform the task using the DDPG algorithm, while only having access to the image input modality, that is, $I_{\text{train}} = I_{\text{image}}$. These image observations are encoded into the latent space using $F_{\text{image}}$, the image-specific encoder of the AVAE trained in 4.1.1. Thus, the agent learns a policy $\pi$ that maps latent states to actions.
The actor and critic networks consisted of two fully connected layers of neurons each. The replay buffer was initially filled with samples obtained using a controller based on the Ornstein-Uhlenbeck process, with the parameters suggested by lillicrap2015arxiv. The Adam gradient was used for optimizing both networks, with learning rates and .
4.1.3. Transfer policy
We evaluated the performance of the policy trained in 4.1.2, when the agent only has access to the sound input modality, i.e., $I_{\text{test}} = I_{\text{sound}}$.
Given a sound observation, the agent first preprocesses it using the latent map $F_{\text{sound}}$, generating a multimodal latent state $z$; we denote this process as avaes. The agent then uses the policy $\pi$ to select the optimal action in this latent state.
As a result, we are measuring the zero-shot transfer performance of policy $\pi$, that is, the ability of the agent to perform its task while being provided perceptual information that is completely different from what it saw during the reinforcement learning step, without any further training. Table 1 summarizes the transfer performance in terms of average reward observations throughout an episode of frames. Our approach avaes + ddpg is compared with two baselines:
random baseline, which depicts the performance of an untrained agent. This effectively simulates the performance one would expect from a non-transferable policy trained over image inputs, and later tested over sound inputs.
sound ddpg baseline, a DDPG agent trained directly over sound inputs (i.e., the sounds correspond to the states). It provides an estimate of the performance that an agent trained directly over the test input modality may achieve.
From Table 1, we conclude our approach provides the agent with an out-of-the-box performance improvement of over , when compared to the untrained agent (non-transferable policy). It is also interesting to observe that the difference in performance between our agent and sound ddpg seems small, supporting our empirical observation that the transfer policy succeeds very often in the task: swinging the pole up. (We also note that the performance achieved by the sound ddpg agent is similar to that reported in the OpenAI gym leaderboard for the pendulum scenario with state observations as the position and velocity of the pendulum.)
We consider the hyperhot scenario, a novel top-down shooter game scenario inspired by the space invaders Atari game, where the goal of the agent is to shoot the enemies above, while avoiding their bullets by moving left and right. (We opted to use a custom environment implemented in pygame, since the space invaders environment in OpenAI gym does not provide access to game state, making it hard to generate simulated sounds.)
Similarly to the pendulum, in this scenario, the observations of the environment include both image and sound components. In hyperhot, however, the environmental sound is generated by multiple entities emitting a predefined frequency :
Left-side enemy units, , and right-side enemy units, , emit sounds with frequencies and , respectively.
Enemy bullets, , emit sounds with frequency .
The agent’s bullets, , emit sounds with frequency .
The sounds produced by these entities are received by a set of sound receivers . Figure 8 depicts the scenario, where the yellow circles are the enemies; the green and blue bullets are friendly and enemy fire, respectively; the agent is in red; and the sound receivers correspond to the white circles. The agent is rewarded for shooting the enemies, with the following reward function:
The environment resets whenever the agent collects a non-zero reward, be it due to winning or losing the game.
We assume the perceptual space of the agent as $\mathcal{I} = I_{\text{image}} \times I_{\text{sound}}$, with the visual input modality of the agent, $I_{\text{image}}$, consisting of the raw image observation of the environment. The sound, however, is generated in a more complex and realistic way. We model the sinusoidal wave of each sound-emitter considering its specific frequency and amplitude . At every frame, we consider the sound waves of every emitter present in the screen, according to their distance to each sound receiver in . The sound wave generated by emitter is observed by receiver as
where is a scaling constant, and denote the positions of sound emitter and sound receiver , respectively. We generate each sinusoidal sound wave for a total of discrete time steps, considering an audio sample rate of Hz and a video frame-rate of 30 fps. As such, each sinusoidal sound wave represents the sound heard for the duration of a single video-frame of the game. (This is similar to what is performed in Atari videogames.) Finally, for each sound receiver, we sum all emitted waves and encode the amplitude values in 16-bit audio depth, considering a maximum amplitude value of and a minimum value of .
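The per-frame sound generation can be sketched as below: one sinusoid per emitter lasting a single video frame, summed at the receiver and clipped to a 16-bit range. Several values are assumptions made for illustration only: the 44100 Hz sample rate, the signed 16-bit bounds, and the fact that the distance attenuation is left to the caller, who passes in the already attenuated amplitude. The function names are hypothetical.

```python
import math

# Assumed signed 16-bit bounds; the values used in this work may differ.
INT16_MIN, INT16_MAX = -32768, 32767


def sine_frame(freq_hz, amplitude, sample_rate=44100, fps=30):
    """Samples of one sinusoid covering the duration of a single video frame."""
    n_samples = sample_rate // fps
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * k / sample_rate)
        for k in range(n_samples)
    ]


def mix_and_quantize(waves):
    """Sum the waves heard by one receiver and clip them to the 16-bit range."""
    mixed = [sum(samples) for samples in zip(*waves)]
    return [max(INT16_MIN, min(INT16_MAX, int(round(s)))) for s in mixed]


# Two hypothetical emitters heard by the same receiver during one frame.
frame = mix_and_quantize([sine_frame(440.0, 20000.0), sine_frame(880.0, 20000.0)])
```

With these assumed settings each frame holds 1470 samples per receiver, and loud overlapping emitters saturate at the clipping bounds instead of overflowing.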
We now provide details on how our approach was set up. All constants and training hyper-parameters used are again summarized in Appendix A.1.
4.2.1. Learn a perceptual model of the world
We trained an AVAE model to learn the family of latent mappings $\mathcal{F}$, with a dataset with observations of images and sounds collected using a random controller. Before training, the images were preprocessed to black and white and resized to pixels, and the sounds normalized to the range .
For the image-specific encoder we adopted an architecture with three convolutional layers and two fully connected layers. The three convolutional layers learned and filters, respectively. The filters were parameterized by kernel sizes and ; strides and ; and paddings and . ReLU activations were used throughout. For the sound-specific component, we used two fully connected layers of neurons each, with one-dimensional batch normalization between the layers. The decoders followed similar architectures. The increase in size of these layers when compared to the pendulum task is due to the more complex nature of the sounds considered in this scenario. The optimizer and loss function were configured in the same way as in the previous scenario.
4.2.2. Learn to act in the world
The agent learned how to play the game using the DQN algorithm, while having access only to image observations, $I_{\text{train}} = I_{\text{image}}$, corresponding to the video game frames. The image observations are encoded into the latent space using $F_{\text{image}}$, the image-specific encoder of the AVAE model trained in the previous step. As such, the policy learned to play the game maps these latent states to actions.
The policy and target networks consisted of two fully connected layers of neurons each, and we adopted a decaying $\epsilon$-greedy policy.
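A decaying epsilon-greedy policy can be sketched as follows. The linear schedule and its default start, end, and duration values are assumptions for illustration; the schedule actually used in the experiments is not specified here, and the function names are hypothetical.

```python
import random


def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Explore uniformly with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Early in training the agent acts almost at random, and as `step` grows it increasingly follows the action with the highest estimated Q-value.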
4.2.3. Transfer policy
We then evaluated the performance of the policy learned with image inputs, when the agent only has access to the sound modality, i.e., $I_{\text{test}} = I_{\text{sound}}$. Given a sound observation, the agent preprocesses it using the latent map $F_{\text{sound}}$, thus generating a multimodal latent state $z$; this process is denoted as avaes. The agent then uses the policy $\pi$ to select the optimal action in this latent state.
Table 2 summarizes the transfer performance of the policy produced by our approach avaes + dqn, in terms of average discounted rewards and game win rates over episodes. We compare the performance of our approach with additional baselines:
avaev + dqn, an agent similar to ours that encodes the latent state from visual observations (as opposed to sounds).
image dqn, a DQN agent trained directly over the visual inputs.
Considering the results in Table 2, we observe:
A considerable performance improvement of our approach over the untrained agent. The average discounted reward of the random baseline is negative, meaning this agent tends to get shot often, and rather quickly. This is in contrast with the positive rewards achieved by our approach. Moreover, the win rates achieved by our approach surpass those of the untrained agent by -fold.
A performance comparable to that of the agent trained directly on the sound, sound dqn. In fact, the average discounted rewards achieved by our approach are slightly higher. However, we note that the sound dqn agent followed the same DQN architecture and number of training steps used in our approach. It is plausible that with further parameter tuning, the sound dqn agent could achieve better performances.
The approach that could fine-tune to the most informative perceptual modality, image dqn, achieved the highest performance. Our approach, while achieving lower rewards, is the only one able to perform cross-modality policy transfer, that is, to reuse a policy trained on a different modality. One may argue that this trade-off is worthwhile.
The DQN networks of all approaches followed similar architectures and were trained for the same number of iterations.
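Table 2: Transfer performance in the hyperhot scenario, reported per approach as average discounted reward and win rate (avg ± std). Rows include avaes + dqn and avaev + dqn.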
The experimental evaluation performed shows the efficacy and applicability of our approach. The results show that this approach effectively enables an agent to learn and exploit policies over different subsets of input modalities. This sets our work apart from existing ideas in the literature. For example, DARLA follows a similar three-stage architecture to allow RL agents to learn policies that are robust to some shifts in the original domains higgins2017icml. However, that approach implicitly assumes that the source and target domains are characterized by similar inputs, such as raw observations of a camera. This is in contrast with our work, which allows agents to transfer policies across different input modalities.
Our approach achieves this by first learning a shared latent representation that captures the different input modalities. In our experimental evaluation, for this first step, we used the AVAE model, which approximates modality-specific latent representations, as discussed in Section 3.1. This model is well-suited to the scenarios considered, since these focused on the transfer of policies trained and reused over distinct input modalities. We envision other scenarios where training could potentially take into account multiple input modalities at the same time. Our approach supports these scenarios as well, when considering a generative model such as JMVAE suzuki2016arxiv, which can learn joint modality distributions and encode/decode both modalities simultaneously.
Furthermore, our approach also supports scenarios where the agent has access to more than two input modalities. The AVAE model can be extended to approximate additional modalities, by introducing extra loss terms that compute the divergence of the new modality-specific latent spaces. However, it may be beneficial to employ generative models specialized on a larger number of modalities, such as the MVAE korthals2019arxiv.
In this paper we explored the use of multimodal latent representations for capturing multiple input modalities, in order to allow agents to learn and reuse policies over different modalities. We were particularly motivated by scenarios of RL agents that learn visual policies to perform their tasks, and which afterwards, at test time, may only have access to sound inputs.
To this end, we formalized the multimodal transfer reinforcement learning problem, and contributed a three-stage approach that effectively allows RL agents to learn robust policies over input modalities. The first step builds upon recent advances in multimodal variational autoencoders, to create a generalized latent space that captures the dependencies between the different input modalities of the agent, and allows for cross-modality inference. In the second step, the agent learns how to perform its task over this latent space. During this training step, the agent may only have access to a subset of input modalities, with the latent space being encoded accordingly. Finally, at test time, the agent may execute its task while having access to a possibly different subset of modalities.
We assessed the applicability and efficacy of our approach in different domains of increasing complexity. We extended well-known scenarios in the reinforcement learning literature to include both the typical raw image observations and novel sound components. The results show that the policies learned by our approach were robust to these different input modalities, effectively enabling reinforcement learning agents to play games in the dark.
This work was partially supported by national funds through the Portuguese Fundação para a Ciência e a Tecnologia under project UID/CEC/50021/2019 (INESC-ID multi annual funding) and the Carnegie Mellon Portugal Program and its Information and Communications Technologies Institute, under project CMUP-ERI/HCI/0051/2013. Rui Silva acknowledges the PhD grant SFRH/BD/113695/2015. Miguel Vasco acknowledges the PhD grant SFRH/BD/139362/2018.
Appendix A
A.1. Constants and hyper-parameters
|max episode length||300 frames|
|max episode length||450 frames|