Noisy Agents: Self-supervised Exploration by Predicting Auditory Events

07/27/2020 ∙ by Chuang Gan, et al. ∙ 1

Humans integrate multiple sensory modalities (e.g. visual and audio) to build a causal understanding of the physical world. In this work, we propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions through auditory event prediction. First, we allow the agent to collect a small amount of acoustic data and use K-means to discover underlying auditory event clusters. We then train a neural network to predict the auditory events and use the prediction errors as intrinsic rewards to guide RL exploration. Experimental results on Atari games show that our new intrinsic motivation significantly outperforms several state-of-the-art baselines. We further visualize our noisy agents' behavior in a physics environment and demonstrate that our newly designed intrinsic reward leads to the emergence of physical interaction behaviors (e.g. contact with objects).



1 Introduction

Deep Reinforcement Learning algorithms aim to learn a policy that maximizes an agent's cumulative reward through interaction with an environment, and have demonstrated substantial success in a wide range of application domains, such as video games Mnih et al. (2015), board games Silver et al. (2016), and visual navigation Zhu et al. (2017). While these results are remarkable, one critical constraint is the need for carefully engineered dense reward signals, which are not always accessible. To overcome this constraint, researchers have proposed a range of intrinsic reward functions. For example, curiosity-driven intrinsic rewards based on the prediction error of the current state Burda et al. (2018b) or the future state Pathak et al. (2017) in a latent feature space have shown promising results. Nevertheless, visual state prediction is a non-trivial problem, as visual state is high-dimensional and tends to be highly stochastic in real-world environments.

The occurrence of physical events (e.g. objects coming into contact with each other, or changing state) often correlates with both visual and auditory signals. Both sensory modalities should thus offer useful cues to agents learning how to act in the world. Indeed, classic experiments in cognitive and developmental psychology show that humans naturally attend to both visual and auditory cues, and their temporal coincidence, to arrive at a rich understanding of physical events and human activity such as speech Spelke (1976); McGurk and MacDonald (1976). In artificial intelligence, however, much more attention has been paid to the ways visual signals (e.g., patterns in pixels) can drive learning. We believe this misses important structure learners could exploit. Compared to visual cues, sounds are often more directly or easily observable causal effects of actions and interactions. This is clearly true when agents interact: most communication uses speech or other nonverbal but audible signals. However, it is just as true in physics. Almost any time two objects collide, rub or slide against each other, or touch in any way, they make a sound. That sound is often clearly distinct from background auditory textures and localized in both time and spectral properties, hence relatively easy to detect and identify; in contrast, specific visual events can be much harder to separate from all the ways high-dimensional pixel inputs change over the course of a scene. The sounds that result from object interactions also allow us to estimate underlying causally relevant variables, such as material properties (e.g., whether objects are hard or soft, solid or hollow, smooth or rough), which can be critical for planning actions.

These facts raise a natural question: how can audio signals benefit policy learning in RL? In this paper, our main idea is to use sound prediction as an intrinsic reward to guide RL exploration. Intuitively, we want to exploit the fact that sounds are frequently made when objects interact, or when other causally significant events occur, and can therefore serve as cues to causal structure or as candidate subgoals an agent could discover and aim for. A naïve strategy would be to directly regress feature embeddings of audio clips and use feature prediction errors as intrinsic rewards. However, prediction errors in feature space do not accurately reflect how well the agent understands the underlying causal structure of events and goals. It also remains an open problem how to normalize such errors appropriately to counter diminishing intrinsic rewards. To bypass these limitations, we formulate the sound-prediction task as a classification problem, in which we train a neural network to predict the auditory events that occur after applying an action to a visual scene. We use classification errors as an exploration bonus for deep reinforcement learning. Concretely, our pipeline consists of two exploration phases. In the first phase, the agent receives an incentive to actively collect a small amount of auditory data by interacting with the environment. We then cluster the sound data into auditory events using K-means. In the second phase, we train a neural network to predict the auditory events conditioned on the embedding of visual observations and actions. States where the prediction is wrong are rewarded, so the agent is encouraged to visit them more often. We demonstrate the effectiveness of our intrinsic motivation module on 25 Atari games and on a rolling-robot multi-modal physics simulation platform built on top of TDW Gan et al. (2020c). In summary, our work makes the following contributions:

  • We introduce a novel and effective auditory event prediction (AEP) framework to make use of the auditory signals as intrinsic rewards for RL exploration.

  • Our system outperforms previous state-of-the-art vision-only curiosity-driven exploration agents on most of the Atari games.

  • We show that our new intrinsic module is more stable in the 3D multi-modal physical world environment and can encourage interesting actions that involve physical interactions.

2 Related Work

Audio-Visual Learning. In recent years, audio-visual learning has been studied extensively. Leveraging audio-visual correspondences in videos helps to learn powerful audio and visual representations through self-supervised learning Owens et al. (2016b); Aytar et al. (2016); Arandjelovic and Zisserman (2017); Korbar et al. (2018); Owens and Efros (2018). Other interesting applications using audio-visual knowledge transfer include sounding object localization Senocak et al. (2018); Arandjelovic and Zisserman (2018), sound source separation Gao et al. (2018); Gan et al. (2020b); Zhao et al. (2018); Ephrat et al. (2018); Zhao et al. (2019); Afouras et al. (2018), biometric matching Nagrani et al. (2018), sound generation for videos Owens et al. (2016a); Zhou et al. (2017); Gao and Grauman (2019); Morgado et al. (2018); Gan et al. (2020a), audio-visual co-segmentation Rouditchenko et al. (2019), auditory vehicle tracking Gan et al. (2019), and action recognition Long et al. (2018b, a); Nagrani et al. (2020). In contrast to the widely used correspondences between these two modalities, we take a step further by considering sound as a causal effect of actions.

RL Explorations. The problem of exploration in Reinforcement Learning (RL) has been an active research topic for decades. Various solutions have been investigated for encouraging the agent to explore novel states, including rewarding information gain Little and Sommer (2013), surprise Schmidhuber (1991, 2010), state visitation counts Tang et al. (2017); Bellemare et al. (2016), empowerment Klyubin et al. (2005), curiosity Pathak et al. (2017); Burda et al. (2018a), and disagreement Pathak et al. (2019). A separate line of work Osband ; Osband et al. (2016) adopts parameter noise and Thompson sampling heuristics for exploration. For example, Osband et al. train multiple value functions and make use of bootstrapping for deep exploration. Here, we mainly focus on the problem of using intrinsic rewards to drive exploration. The most widely used intrinsic motivations can be roughly divided into two families. The first is count-based approaches Strehl and Littman (2008); Bellemare et al. (2016); Tang et al. (2017); Ostrovski et al. (2017); Martin et al. (2017); Burda et al. (2018b), which encourage the agent to visit novel states. For example, Burda et al. (2018b) employ the prediction errors of state features extracted from a fixed, randomly initialized network as exploration bonuses and encourage the agent to visit previously unseen states. The second is curiosity-based approaches Stadie et al. (2015); Pathak et al. (2017); Haber et al. (2018); Burda et al. (2018a), in which curiosity is formulated as the uncertainty in predicting the consequences of the agent's actions. For instance, Pathak et al. (2017); Burda et al. (2018a) use the errors of predicting the next state in the latent feature space as rewards. The agent is then encouraged to improve its knowledge about the environment dynamics. In contrast to previous work that relies purely on visual observations, we make use of sound signals as rewards for RL exploration.

Sounds and Actions. Numerous works explore the associations between sounds and actions. For example, Owens et al. (2016a) made the first attempt to collect an audio-video dataset through physical interaction with objects and trained an RNN model to generate sounds for silent videos. Shlizerman et al. (2018); Ginosar et al. (2019) explore the problem of predicting body dynamics from music and body gestures from speech. Gan et al. (2020d) and Chen et al. (2019) introduce an interesting audio-visual embodied task in 3D simulation environments. More recently, Dhiraj et al. (2020) collected a large sound-action-vision dataset using Tilt-Bot and demonstrated that sound signals could provide valuable information for fine-grained object recognition, inverse model learning, and forward dynamics model prediction. More related to our work are the papers from Aytar et al. (2018) and Omidshafiei et al. (2018), which have shown that sound signals could provide useful supervision for imitation learning and reinforcement learning in Atari games. Concurrent to our work, Dean et al. (2020) use novel associations of audio and visual signals as intrinsic rewards to guide RL exploration. In contrast, we mainly study whether sound signals alone can be used as intrinsic rewards for RL exploration.

3 Method

In this section, we first introduce some background knowledge of reinforcement learning and intrinsic rewards. Then we will present the representations of auditory events. Finally, we elaborate on the pipeline of self-supervised exploration through auditory event predictions. The pipeline of our system is outlined in Figure 1.

Figure 1: An overview of our framework. Our model consists of two stages: sound clustering and auditory event prediction. In the first stage, the agent collects a diverse set of sounds through a limited number of environment interactions (e.g. 10K) and then clusters them into auditory event classes. In the second stage, the agent uses the errors of auditory event predictions as intrinsic rewards to explore the environment.

3.1 Background


We formalize the decision procedure in our context as a standard Markov Decision Process (MDP), defined as $(\mathcal{S}, \mathcal{A}, r, \mathcal{T}, \mu, \gamma)$. $\mathcal{S}$, $\mathcal{A}$, and $\mu(s)$ denote the set of states, the set of actions, and the distribution of the initial state, respectively. The transition function $\mathcal{T}(s' \mid s, a)$ defines the probability of transitioning to the next-step state $s'$ if the agent takes action $a$ at the current state $s$. The agent receives a reward $r$ after taking an action $a$ according to the reward function $r(s, a)$, discounted by $\gamma \in (0, 1)$. The goal of reinforcement learning is to learn an optimal policy $\pi^*$ that maximizes the expected rewards under the discount factor $\gamma$:

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\zeta \sim \pi} \left[ \sum_{t} \gamma^t r(s_t, a_t) \right],$$

where $\zeta$ represents the agent's trajectory, namely $\{(s_0, a_0), (s_1, a_1), \dots\}$. The agent chooses an action $a$ from a policy $\pi(a \mid s)$ that specifies the probability of taking action $a$ under state $s$. In this paper, we concentrate on MDPs whose states are raw image-based observations as well as audio clips, whose actions are discrete, and whose transition function $\mathcal{T}$ is provided by the game engine.
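As a small illustration of the discounted objective described above, the following sketch computes the discounted sum of rewards for a single trajectory. The function name `discounted_return` and the example values are assumptions made for illustration and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one trajectory.

    `rewards` is the list of per-step rewards r(s_t, a_t) and `gamma`
    is the discount factor; both are illustrative inputs.
    """
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

An optimal policy in the sense of this section is one whose trajectories maximize the expectation of this quantity.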

Intrinsic Rewards for Exploration. Designing intrinsic rewards for exploration has been widely used to resolve the sparse reward issues in the deep RL communities. One effective approach is to use the errors of a predictive model as exploration bonuses  Pathak et al. (2017); Haber et al. (2018); Burda et al. (2018a). The intrinsic rewards will encourage the agent to explore those states with less familiarity. In this paper, we aim to train a policy that can maximize the errors of auditory event predictions.

3.2 Representations of Auditory Events

Consider an agent that sees a visual observation $s_{v,t}$, takes an action $a_t$, and transits to the next state with visual observation $s_{v,t+1}$ and sound effect $s_{s,t+1}$. The main objective of our intrinsic module is to predict the auditory events of the next state, given feature representations of the current visual observation $s_{v,t}$ and the agent's action $a_t$. We hypothesize that the agent, through this process, could learn the underlying causal structure of the physical world and use it to make predictions about what will happen next, as well as to plan actions to achieve its goals.

To better capture the statistics of the raw auditory data, we extract sound textures McDermott and Simoncelli (2011) to represent each audio clip. For the task of auditory event prediction, perhaps the most straightforward option is to directly regress the sound features given the feature embeddings of the image observation and the agent's action. Nevertheless, we find this is not very effective. The reasons are twofold: 1) the sound textures do not explicitly capture high-level event information; 2) the distances between sound textures do not accurately reflect how well the agent grasps the underlying causal structure of these auditory events. For example, from the position of an aircraft and the shooting action, we hope the agent can infer that a critical event like an explosion will happen, rather than the intricate rhythm of the bang. Therefore, we choose instead to define explicit auditory event categories and formulate the auditory event prediction problem as a classification task, similar to Owens et al. (2016b).

Figure 2: The first and second rows show the auditory events that we discovered by the K-means algorithm in Frostbite and Assault.

3.3 Auditory Events Prediction Based Intrinsic Reward

Our AEP framework consists of two stages: sound clustering and auditory event prediction. In the first stage, we need to collect a small set of diverse auditory data and use it to define the underlying auditory event classes. To achieve this goal, we first train an RL policy that rewards the agent based on sound novelty. We then run the K-means algorithm to group the data into several auditory events. In the second stage, we train a forward dynamics network that takes the embedding of the visual observation and the action as input and predicts which auditory event will happen next. The prediction error is then used as an intrinsic reward to encourage the agent to explore auditory events with more uncertainty, thus improving its ability to understand the consequences of its actions. We elaborate on the details of these two phases below.

Sound clustering. The agent starts by collecting audio data through interaction with the environment. The goal of this phase is to gather diverse data that can be used to define auditory events. For this purpose, we train an RL policy to maximize the occurrences of novel sound effects. In particular, we design an online clustering-based intrinsic motivation module to guide exploration. Assume we have a series of sound embeddings $e_1, \dots, e_T$ that are temporarily grouped into $K$ clusters. Given a newly arriving sound embedding $e_{t+1}$, we compute its distance to the closest cluster center and use that as an exploration bonus. Formally,

$$r_t = \min_{i \in \{1, \dots, K\}} \left\| e_{t+1} - c_i \right\|_2,$$

where $r_t$ denotes the intrinsic reward at time $t$, and $c_i$ represents a cluster center. During this exploration, the number of clusters grows, and each cluster's center is also updated. Through this process, the agent is encouraged to collect novel auditory data that enriches cluster diversity. After the number of clusters saturates, we run the K-means clustering algorithm Lloyd (1982) on the collected data to define the auditory event classes and use the center of each cluster for the subsequent auditory event prediction task. We visualize the visual states in two games (Frostbite and Assault) that belong to the same sound clusters, and observe that each cluster consistently contains identical or similar auditory events (see Figure 2).
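The online clustering reward described above can be sketched roughly as follows. The paper does not say how new clusters are opened or how centers are updated, so the distance threshold `new_cluster_threshold`, the running-mean update, and the class name `OnlineSoundClusters` are all assumptions made for this illustration.

```python
import math


class OnlineSoundClusters:
    """Hypothetical online clustering used to score sound novelty."""

    def __init__(self, new_cluster_threshold):
        # Assumed hyper-parameter: open a new cluster beyond this distance.
        self.threshold = new_cluster_threshold
        self.centers = []  # cluster centers, each a list of floats
        self.counts = []   # number of embeddings assigned to each center

    def reward_and_update(self, embedding):
        """Return the distance to the closest center, then update clusters."""
        if not self.centers:
            # First sound ever observed: open a cluster. The reward is
            # defined as 0 here, which is an arbitrary choice.
            self.centers.append(list(embedding))
            self.counts.append(1)
            return 0.0

        dists = [math.dist(embedding, c) for c in self.centers]
        nearest = min(range(len(dists)), key=dists.__getitem__)
        reward = dists[nearest]  # distance to the closest cluster center

        if reward > self.threshold:
            # Far from every known cluster: treat it as a new cluster.
            self.centers.append(list(embedding))
            self.counts.append(1)
        else:
            # Otherwise move the nearest center toward the new embedding
            # with an incremental running mean.
            self.counts[nearest] += 1
            n = self.counts[nearest]
            old = self.centers[nearest]
            self.centers[nearest] = [
                c + (x - c) / n for c, x in zip(old, embedding)
            ]
        return reward


if __name__ == "__main__":
    clusters = OnlineSoundClusters(new_cluster_threshold=1.0)
    for emb in ([0.0, 0.0], [3.0, 4.0], [0.0, 0.5]):
        print(clusters.reward_and_update(emb), len(clusters.centers))
```

Under this scheme, sounds unlike anything heard so far yield large rewards, while repeated sounds yield small ones.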

Auditory event predictions. Since we have already explicitly defined the auditory event categories, the prediction problem can be formulated as a classification task. We label each sound texture with the index of the closest center, and then train a forward dynamics network that takes the embeddings of the visual observation and action as input to predict which auditory event cluster the incurred sound belongs to. The forward dynamics model is trained on the collected data using gradient descent to minimize the cross-entropy loss between the estimated class probabilities and the ground-truth distribution:

$$\mathcal{L} = -\sum_{i=1}^{K} y_i \log \hat{p}_i,$$

where $K$ is the number of auditory event classes, $\hat{p}_i$ is the estimated probability of class $i$, and $y_i$ is the corresponding entry of the one-hot ground-truth label.

The prediction is expected to fail for novel associations of visual and audio data. We reward the agent at that stage and encourage it to visit such states more often, since it is uncertain about this scenario. In practice, we find that the agent can learn to avoid dying in the games, since dying produces a similar sound effect that the agent has already encountered many times and can predict very well. By avoiding potential dangers and continually seeking novel events, agents can learn causal knowledge of the world for planning their actions to achieve the goal.
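A minimal sketch of how a classification error could be turned into an intrinsic reward is given below. It assumes that the cross-entropy of the predicted class scores against the nearest-center label is used directly as the bonus; the function names are hypothetical, and `logits` stands in for the output of the forward dynamics network, which is not implemented here.

```python
import math


def nearest_center_label(sound_feature, centers):
    """Index of the cluster center closest to the sound feature."""
    return min(
        range(len(centers)),
        key=lambda k: math.dist(sound_feature, centers[k]),
    )


def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against a one-hot label."""
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def intrinsic_reward(logits, sound_feature, centers):
    """Assumed bonus: classification loss on the observed auditory event."""
    label = nearest_center_label(sound_feature, centers)
    return cross_entropy(logits, label)


if __name__ == "__main__":
    centers = [[0.0, 0.0], [10.0, 10.0]]  # illustrative K-means centers
    heard = [0.1, 0.0]                    # sound feature of the next state
    print(intrinsic_reward([5.0, 0.0], heard, centers))  # confident, correct
    print(intrinsic_reward([0.0, 5.0], heard, centers))  # confident, wrong
```

A confident correct prediction yields a reward near zero, while a confident wrong prediction yields a large one, which matches the intuition that poorly predicted events should attract exploration.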

4 Experiments

In the experiments, we investigate the following questions:

  • Does the proposed audio-driven exploration outperform other intrinsic reward modules at learning skills without extrinsic rewards?

  • Can our approach be combined with extrinsic rewards to improve policy learning on hard exploration Atari games?

  • What is the agent’s behavior using different methods in a 3D multi-modality physical environment?

  • Is each component of our method necessary?

4.1 Setup

Atari Game Environment. Our primary goal is to investigate whether we can use auditory event prediction as an intrinsic reward to help RL exploration. For this purpose, we use Gym Retro Nichol et al. (2018) as a testbed to measure agents' competency quantitatively. Gym Retro consists of a diverse set of Atari games and also supports an audio API that provides the sound effects of each state. We first use 20 familiar video games that contain sound effects to compare intrinsic-reward-only exploration against several previous state-of-the-art intrinsic modules. Then, we use five hard exploration games to investigate whether the newly designed motivation can be combined with extrinsic rewards to improve the results.

Rolling Robot Multi-Modality Simulation Platform. We also test our module on a 3D multi-modal physics simulation platform. We build this platform on top of the Unity game engine and the physics-based impact sound simulation toolbox from TDW Gan et al. (2020c). As shown in Figure 5, we place a rolling robot agent (i.e., the red sphere) on a billiard table. The agent can execute actions to interact with objects of different materials and shapes. When two objects collide, the environment generates a collision sound based on the physical properties of the objects. We would like to compare the agent's behaviors in this 3D physical environment under different intrinsic rewards.

Implementation details. For all the experiments, we choose the PPO algorithm Schulman et al. (2017), based on a PyTorch implementation, to train our RL agent since it is robust and requires little hyper-parameter tuning. For experiments on Atari games, we use gray-scale image observations of size 84×84 and 60ms audio clips. We set the frame skip to S = 4 for all the experiments. We use a 4-layer CNN as the encoder of the policy network. For the auditory prediction network, we use a 3-layer CNN to encode the image observation and a 2-layer MLP to predict the auditory events.

Figure 3: Average extrinsic rewards of our model against baselines in 20 Atari games.

4.2 Explorations without Extrinsic Rewards

We first compare how agents explore the environment under different intrinsic rewards when no extrinsic reward is available. To quantify how well an exploration strategy performs, we use the extrinsic rewards it achieves as the evaluation metric. It is important to note that the extrinsic reward is only used for evaluation, not for training. We consider five state-of-the-art intrinsic motivation modules for comparison, including the Intrinsic Curiosity Module (ICM) Pathak et al. (2017), Random Feature Networks (RFN) Burda et al. (2018a), Random Network Distillation (RND) Burda et al. (2018b), and Model Disagreement (DIS) Pathak et al. (2019). We run all the experiments for 10 million steps with 8 parallel environments.

Figure 3 summarizes the evaluation curves of mean extrinsic reward in 20 Atari games. For each method, we experiment with three different random seeds and report the standard deviation as the shaded region. As the figure shows, our module achieves significantly better results than previous vision-only intrinsic motivation modules in 15 out of 20 games and is comparable in the other 5 games. Another interesting observation is that the earned score keeps going up, even when using only intrinsic rewards. We observe that the agent learns to avoid dangerous situations associated with a death sound, which might occur frequently at the beginning of exploration. There are also some failure cases in the above Atari games. For example, our method falls short on games with background music or noise that does not reflect any auditory events (e.g. Freeway and Time Pilot). Advanced audio processing algorithms might help, and we leave this to future work. More model analysis and demo videos can be found in the supplementary materials.

4.3 Combining Extrinsic and Intrinsic Rewards for Hard Explorations

The intrinsic rewards can serve as incentives that allow the agent to distinguish novel and fruitful states, but without extrinsic rewards the agent cannot tell which auditory events lead to higher returns and are worth revisiting. In this section, we investigate whether the audio-driven intrinsic rewards can be further used to improve policy learning in hard exploration scenarios, where extrinsic rewards exist but are very sparse. We use five hard exploration environments in Atari games for our experiments: Venture, Solaris, Private Eye, Pitfall!, and Gravitar. Following the strategy proposed by RND Burda et al. (2018b), we use two separate value heads for the intrinsic and extrinsic rewards and then combine their returns. We also normalize the intrinsic rewards to compensate for their variance across different environments. We compare our model against plain PPO using purely extrinsic rewards. All experiments are run for 4 million steps with 32 parallel environments. We use the max episodic return to measure exploration ability.
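To make the normalization and combination steps above more concrete, here is a rough sketch. It assumes intrinsic rewards are divided by a running estimate of their standard deviation and that the two advantage streams are mixed with fixed weights; this is a simplification rather than the paper's exact procedure, and `RunningStd`, `normalize_intrinsic`, `combine_advantages`, `ext_coef`, and `int_coef` are hypothetical names.

```python
import math


class RunningStd:
    """Welford's online estimate of the (population) standard deviation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until at least two samples have been seen.
        if self.n < 2:
            return 1.0
        return math.sqrt(self.m2 / self.n)


def normalize_intrinsic(rewards, stats, eps=1e-8):
    """Update the running statistics, then rescale the batch of rewards."""
    for r in rewards:
        stats.update(r)
    return [r / (stats.std + eps) for r in rewards]


def combine_advantages(adv_ext, adv_int, ext_coef=1.0, int_coef=1.0):
    """Weighted sum of advantages from the extrinsic and intrinsic heads."""
    return [ext_coef * e + int_coef * i for e, i in zip(adv_ext, adv_int)]


if __name__ == "__main__":
    stats = RunningStd()
    print(normalize_intrinsic([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], stats))
    print(combine_advantages([1.0, 2.0], [0.5, 0.5], ext_coef=2.0))
```

Keeping separate statistics per environment is one way to make a single set of mixing weights usable across games whose raw prediction errors differ in scale.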

The comparison results are shown in Figure 4. We find that our audio-driven exploration can significantly improve policy learning on hard exploration games in Atari. For example, when the intrinsic reward is added, the agent earns about four times the reward on Venture and almost 30 times the reward on Private Eye.

Figure 4: Comparison of combining intrinsic and extrinsic reward against using extrinsic reward only on 5 hard exploration Atari Games.

4.4 Understanding Explorations Behaviors in 3D Physical World

In this section, we would like to see the agent’s behaviors in a near photo-realistic 3D physical world. A curious rolling robot is required to use interactions to build causal models of the physical world.

Figure 5: Explorations on a multi-modal physics environment. From left to right: physical scene, collision events, and intrinsic reward changes

Setup. For this experiment, we take an image observation of size 84×84 and a 50ms audio clip as input. We use a three-layer convolutional network to encode the image and extract sound textures from the audio clip. As in the previous experiment, we train the policy using the PPO algorithm. The action space consists of moving in eight directions and stopping. Each action is repeated for 4 frames. We run all the experiments for 200K steps with 8 parallel environments.

Result Analysis. To understand and quantify the agent's behaviors in this environment, we show the number of collision events and the intrinsic rewards in Figure 5. We noticed two major issues with the previous vision-based curiosity models (e.g. RND and ICM). First, the prediction errors in the latent feature space could not accurately reflect the subtle object state changes that occur in the 3D photo-realistic world when a physical event happens. Second, the intrinsic reward can diminish quickly during training, since the learned predictive model usually converges to a stable state representation of the environment. In contrast, our auditory-event-prediction-driven exploration leads agents to interact more with objects in the physical world, which is critical to learning the dynamics of the environment.

4.5 Ablation Study

In this section, we perform in-depth ablation studies to evaluate each component of our model using 3 Atari games with apparent sound effects: Amidar, Battle Zone, and Carnival. All experiments are run for 2 million steps with 8 parallel environments.

Predict auditory events or sound features? One main contribution of our paper is to use auditory event prediction as an intrinsic reward. To further understand this module's ability, we conduct an ablation study by replacing it with a sound feature prediction module. In particular, we train a neural network that takes the embedding of the visual state and action as input and predicts the sound textures. The comparison curves are plotted in Figure 6. We observe that the auditory event prediction module indeed earns more rewards. This result demonstrates the advantage of using auditory event classes over latent sound feature embeddings for RL exploration. We speculate that auditory events provide more structured knowledge of the world and thus lead to better policy learning.

Sound clustering or auditory event prediction? We adopt a two-stage exploration strategy. A natural question is whether this is necessary. We show the curve of using sound clustering only as an intrinsic reward in Figure 7. We notice that the resulting extrinsic reward is similar to that of sound feature prediction, but worse than that of auditory event prediction.

Active exploration or random exploration? We propose an online clustering-based intrinsic module for active audio data collection. To verify its efficacy, we replace this module with random exploration. For a fair comparison, we allow both models to use 10K interaction samples to define the event classes with the same K-means clustering algorithm. The comparison results are shown in Figure 7. We find that the proposed active exploration indeed achieves better results. We also compute the cluster distances of both models and find that the sound clusters discovered by active exploration are much more diverse, which helps the agents perform in-depth exploration.
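The text does not specify how cluster distances are measured. One simple possibility, shown here purely as an assumption, is the mean pairwise Euclidean distance between cluster centers; `mean_pairwise_center_distance` is a hypothetical helper and the centers below are made-up values.

```python
import math
from itertools import combinations


def mean_pairwise_center_distance(centers):
    """Average Euclidean distance over all pairs of cluster centers.

    Larger values indicate that the discovered clusters are more spread
    out. Returns 0.0 when fewer than two centers are given.
    """
    pairs = list(combinations(centers, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative centers only; these are not results from the paper.
    spread_out = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
    bunched_up = [[0.0, 0.0], [0.3, 0.4], [0.6, 0.8]]
    print(mean_pairwise_center_distance(spread_out))
    print(mean_pairwise_center_distance(bunched_up))
```

With such a score, a set of centers obtained from active exploration could be compared directly with one obtained from random exploration, provided both use the same number of clusters and the same feature space.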

Figure 6: Comparisons on earned extrinsic rewards between our auditory event prediction module and sound feature prediction module.
Figure 7: Comparisons on earned extrinsic rewards between our active exploration and a random exploration strategy.

5 Conclusions

In this work, we introduce an intrinsic reward function based on predicting sound for RL exploration. Our model employs the errors of auditory event prediction as an exploration bonus, which allows the RL agent to explore novel physical interactions of objects. We demonstrate our proposed methodology and compare it against a number of baselines on Atari games. Based on the experimental results above, we conclude that sound conveys rich information and is a powerful signal for agents to build a causal model of the physical world for efficient RL exploration. We hope our work will inspire more work on using multi-modal cues for planning and control.

Broader Impact

Our work is on the basic science of multimodal learning and exploration in RL. Since auditory signals are prevalent in real-world scenarios, we believe that combining them with visual signals could help guide exploration in many robotic applications Dhiraj et al. (2020); Gan et al. (2020d); Chen et al. (2019). For example, the honk of a car may be a useful signal that a self-driving agent has entered an unexpected situation. This example also raises a potential negative use of the ideas in this paper: if a self-driving car explores by seeking out honks it would likely put humans in danger. Future work should therefore consider how to combine the ideas in this paper with the notion of safe exploration.

Studying the role of auditory and visual signals could also be especially relevant for sight and hearing impaired populations. For example, if we better understand the role of audition in exploration, perhaps we can develop applications that better serve deaf users, who lack this signal.

A limitation of our work is that it only experiments on synthetic environments, which may not reflect realistic scenarios. For example, Atari games have sound effects that often correlate with game achievements, whereas the correlation between sound and reward in nature is likely much more complex. The findings in our paper can therefore be considered to be biased by the design of the synthetic environments. Future work will be necessary to validate our methods on real world applications.


  • [1] T. Afouras, J. S. Chung, and A. Zisserman (2018) The conversation: deep audio-visual speech enhancement. ICASSP. Cited by: §2.
  • [2] R. Arandjelovic and A. Zisserman (2017) Look, listen and learn. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 609–617. Cited by: §2.
  • [3] R. Arandjelovic and A. Zisserman (2018) Objects that sound. In ECCV, pp. 435–451. Cited by: §2.
  • [4] Y. Aytar, T. Pfaff, D. Budden, T. Paine, Z. Wang, and N. de Freitas (2018) Playing hard exploration games by watching youtube. In NIPS, pp. 2930–2941. Cited by: §2.
  • [5] Y. Aytar, C. Vondrick, and A. Torralba (2016) Soundnet: learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pp. 892–900. Cited by: §2.
  • [6] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In NIPS, pp. 1471–1479. Cited by: §2.
  • [7] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros (2018) Large-scale study of curiosity-driven learning. ICLR. Cited by: §2, §3.1, §4.2, §4.3.
  • [8] Y. Burda, H. Edwards, A. Storkey, and O. Klimov (2018) Exploration by random network distillation. ICLR. Cited by: §1, §2.
  • [9] C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman (2019) Audio-visual embodied navigation. arXiv preprint arXiv:1912.11474. Cited by: §2, Broader Impact.
  • [10] V. Dean, S. Tulsiani, and A. Gupta (2020) See, hear, explore: curiosity via audio-visual association. arXiv preprint arXiv:2007.03669. Cited by: §2.
  • [11] G. Dhiraj, G. Abhinav, and P. Lerrel (2020) Swoosh! rattle! thump! - actions that sound. Cited by: §2, Broader Impact.
  • [12] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein (2018) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. SIGGRAPH. Cited by: §2.
  • [13] C. Gan, D. Huang, P. Chen, J. B. Tenenbaum, and A. Torralba (2020) Foley music: learning to generate music from videos. In CVPR, pp. 10478–10487. Cited by: §2.
  • [14] C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba (2020) Music gesture for visual sound separation. In CVPR, pp. 10478–10487. Cited by: §2.
  • [15] C. Gan, J. Schwartz, S. Alter, M. Schrimpf, J. Traer, J. De Freitas, J. Kubilius, A. Bhandwaldar, N. Haber, M. Sano, et al. (2020) ThreeDWorld: a platform for interactive multi-modal physical simulation. arXiv preprint arXiv:2007.04954. Cited by: §1, §4.1.
  • [16] C. Gan, Y. Zhang, J. Wu, B. Gong, and J. B. Tenenbaum (2020) Look, listen, and act: towards audio-visual embodied navigation. ICRA. Cited by: §2, Broader Impact.
  • [17] C. Gan, H. Zhao, P. Chen, D. Cox, and A. Torralba (2019) Self-supervised moving vehicle tracking with stereo sound. In ICCV, pp. 7053–7062. Cited by: §2.
  • [18] R. Gao, R. Feris, and K. Grauman (2018) Learning to separate object sounds by watching unlabeled video. In ECCV, Cited by: §2.
  • [19] R. Gao and K. Grauman (2019) 2.5D visual sound. CVPR. Cited by: §2.
  • [20] S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik (2019) Learning individual styles of conversational gesture. In CVPR, pp. 3497–3506. Cited by: §2.
  • [21] N. Haber, D. Mrowca, S. Wang, L. F. Fei-Fei, and D. L. Yamins (2018) Learning to play with intrinsically-motivated, self-aware agents. In NIPS, pp. 8388–8399. Cited by: §2, §3.1.
  • [22] A. S. Klyubin, D. Polani, and C. L. Nehaniv (2005) Empowerment: a universal agent-centric measure of control. In 2005 IEEE Congress on Evolutionary Computation, Vol. 1, pp. 128–135. Cited by: §2.
  • [23] B. Korbar, D. Tran, and L. Torresani (2018) Co-training of audio and video representations from self-supervised temporal synchronization. arXiv preprint arXiv:1807.00230. Cited by: §2.
  • [24] D. Y. Little and F. T. Sommer (2013) Learning and exploration in action-perception loops. Frontiers in neural circuits 7, pp. 37. Cited by: §2.
  • [25] S. P. Lloyd (1982) Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, pp. 129–136. Cited by: §3.3.
  • [26] X. Long, C. Gan, G. De Melo, X. Liu, Y. Li, F. Li, and S. Wen (2018) Multimodal keyless attention fusion for video classification. In AAAI, Cited by: §2.
  • [27] X. Long, C. Gan, G. De Melo, J. Wu, X. Liu, and S. Wen (2018) Attention clusters: purely attention based local feature integration for video classification. In CVPR, pp. 7834–7843. Cited by: §2.
  • [28] J. Martin, S. N. Sasikumar, T. Everitt, and M. Hutter (2017) Count-based exploration in feature space for reinforcement learning. IJCAI. Cited by: §2.
  • [29] J. H. McDermott and E. P. Simoncelli (2011) Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron 71 (5), pp. 926–940. Cited by: §3.2.
  • [30] H. McGurk and J. MacDonald (1976) Hearing lips and seeing voices. Nature 264 (5588), pp. 746. Cited by: §1.
  • [31] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.
  • [32] P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang (2018) Self-supervised generation of spatial audio for 360 video. In NIPS, Cited by: §2.
  • [33] A. Nagrani, S. Albanie, and A. Zisserman (2018) Seeing voices and hearing faces: cross-modal biometric matching. arXiv preprint arXiv:1804.00326. Cited by: §2.
  • [34] A. Nagrani, C. Sun, D. Ross, R. Sukthankar, C. Schmid, and A. Zisserman (2020) Speech2Action: cross-modal supervision for action recognition. CVPR. Cited by: §2.
  • [35] A. Nichol, V. Pfau, C. Hesse, O. Klimov, and J. Schulman (2018) Gotta learn fast: a new benchmark for generalization in rl. arXiv preprint arXiv:1804.03720. Cited by: §4.1.
  • [36] S. Omidshafiei, D. Kim, J. Pazis, and J. P. How (2018) Crossmodal attentive skill learner. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 139–146. Cited by: §2.
  • [37] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped dqn. In NIPS, pp. 4026–4034. Cited by: §2.
  • [38] I. Osband Deep exploration via randomized value functions. Journal of Machine Learning Research. Cited by: §2.
  • [39] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos (2017) Count-based exploration with neural density models. In ICML, pp. 2721–2730. Cited by: §2.
  • [40] A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. ECCV. Cited by: §2.
  • [41] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman (2016) Visually indicated sounds. In CVPR, pp. 2405–2413. Cited by: §2, §2.
  • [42] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba (2016) Ambient sound provides supervision for visual learning. In European Conference on Computer Vision, pp. 801–816. Cited by: §2, §3.2.
  • [43] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In ICML, Cited by: §1, §2, §3.1, §4.2.
  • [44] D. Pathak, D. Gandhi, and A. Gupta (2019) Self-supervised exploration via disagreement. ICML. Cited by: §2, §4.2.
  • [45] A. Rouditchenko, H. Zhao, C. Gan, J. McDermott, and A. Torralba (2019) Self-supervised audio-visual co-segmentation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2357–2361. Cited by: §2.
  • [46] J. Schmidhuber (1991) A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pp. 222–227. Cited by: §2.
  • [47] J. Schmidhuber (2010) Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2 (3), pp. 230–247. Cited by: §2.
  • [48] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.1.
  • [49] A. Senocak, T. Oh, J. Kim, M. Yang, and I. So Kweon (2018) Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4358–4366. Cited by: §2.
  • [50] E. Shlizerman, L. Dery, H. Schoen, and I. Kemelmacher-Shlizerman (2018) Audio to body dynamics. In CVPR, pp. 7574–7583. Cited by: §2.
  • [51] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484. Cited by: §1.
  • [52] E. Spelke (1976) Infants’ intermodal perception of events. Cognitive psychology 8 (4), pp. 553–560. Cited by: §1.
  • [53] B. C. Stadie, S. Levine, and P. Abbeel (2015) Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814. Cited by: §2.
  • [54] A. L. Strehl and M. L. Littman (2008) An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences 74 (8), pp. 1309–1331. Cited by: §2.
  • [55] H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. DeTurck, and P. Abbeel (2017) # exploration: a study of count-based exploration for deep reinforcement learning. In NIPS, pp. 2753–2762. Cited by: §2.
  • [56] H. Zhao, C. Gan, W. Ma, and A. Torralba (2019) The sound of motions. ICCV. Cited by: §2.
  • [57] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba (2018-09) The sound of pixels. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [58] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg (2017) Visual to sound: generating natural sound for videos in the wild. arXiv preprint arXiv:1712.01393. Cited by: §2.
  • [59] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, pp. 3357–3364. Cited by: §1.

Appendix A Result Analysis

In this section, we analyze the circumstances under which our algorithm works well. The sound effects in Atari games fall into three categories: 1) event-driven sounds, which are emitted when the agent reaches a specific condition (e.g., picking up a coin or the explosion of an aircraft); 2) action-driven sounds, which are emitted when the agent performs a specific action (e.g., shooting or jumping); and 3) background noise or music. We summarize the 20 Atari games in Table 1 according to the dominant sound effects in each game.

Dominant sound effects | Atari games
Event-driven sounds | Amidar, Carnival, NameThisGame, Frostbite, FishingDerby, MsPacman
Action-driven sounds | AirRaid, Assault, Jamesbond, ChopperCommand, StarGunner, Tutankham, WizardOfWor, Gopher, DemonAttack
Background sounds | Asteroids, Freeway, TimePilot, BattleZone, CrazyClimber
Table 1: Categorization of the 20 Atari games according to the dominant type of sound effects. We label the games in which our method performs the best in bold font.

Based on the categories defined in Table 1 and the performance shown in Figure 3 in the main paper, we can draw three conclusions. First, both event-driven and action-driven sounds boost the performance of our algorithm. Since sounds are readily observable effects of actions and events, understanding these causal effects is essential to learning a better exploration policy. Second, our algorithm performs better on games dominated by event-driven sounds than on those dominated by action-driven sounds. We believe that event-driven sounds carry higher-level information, such as the explosion of an aircraft or the collection of coins, which can be more beneficial for the agent in understanding the physical world. Third, our algorithm falls short of the baselines when the sound effects mainly consist of meaningless background noise or background music (i.e. CrazyClimber, MsPacman, Gopher, Tutankham, WizardOfWor, and BattleZone). These sounds have little relevance to visual cues and cannot provide useful information or rewards to agents.

Appendix B Training Details

Table 2 shows the hyper-parameters used in our algorithm.

Hyperparameter | Value
Rollout length | 128
Number of minibatches | 4
Learning rate | 2.5e-4
Clip parameter | 0.1
Entropy coefficient | 0.01
Table 2: Hyper-parameters used in our algorithm.
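
As an illustration of how the values in Table 2 might be wired into a training script, the sketch below collects them in a small configuration object. It assumes that they parameterize PPO (cited in §4.1), so the clip parameter is read as the clipping range of the PPO surrogate objective. The names `PPOConfig`, `clipped_surrogate`, `steps_per_minibatch`, and the `num_envs` argument are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Values copied from Table 2; the field names are illustrative."""
    rollout_length: int = 128
    num_minibatches: int = 4
    learning_rate: float = 2.5e-4
    clip_param: float = 0.1
    entropy_coef: float = 0.01


def clipped_surrogate(ratio: float, advantage: float, clip_param: float) -> float:
    """Per-sample PPO clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    return min(ratio * advantage, clipped_ratio * advantage)


def steps_per_minibatch(cfg: PPOConfig, num_envs: int) -> int:
    """Transitions per minibatch; num_envs is an assumed input, not given in Table 2."""
    return (cfg.rollout_length * num_envs) // cfg.num_minibatches


cfg = PPOConfig()
# A probability ratio of 1.5 is clipped to 1.1 when the advantage is positive.
objective = clipped_surrogate(1.5, 2.0, cfg.clip_param)
```

With a single environment, this configuration would give 32 transitions per minibatch; the number of parallel environments actually used is not stated in Table 2.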

Appendix C Ablation Study

In this section, we carry out ablation experiments to demonstrate that the gains of our method come from audio-event prediction rather than from the mere use of multi-modal information. The four baselines (i.e. RND, RFN, ICM, and DIS) do not predict audio events; instead, they incorporate sound by concatenating visual and sound features to predict the image embedding at the next time step. As shown in Figure 8, our algorithm significantly outperforms these baselines in five Atari games. This indicates that exploiting sound information for RL is non-trivial, and that our algorithm benefits from the carefully designed audio-event prediction as an intrinsic reward.

Figure 8: Average extrinsic rewards of our model against baselines combined with sound in 5 Atari games
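
The difference between the two ways of using sound in this ablation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the baselines use the mean squared error of the predicted next image embedding as the intrinsic reward, and that our method uses the cross-entropy of the predicted auditory event class. The function names and the `predict` callable are hypothetical.

```python
import math
from typing import Callable, Sequence


def multimodal_baseline_reward(
    visual_feat: Sequence[float],
    audio_feat: Sequence[float],
    next_embedding: Sequence[float],
    predict: Callable[[Sequence[float]], Sequence[float]],
) -> float:
    """Ablation baseline: concatenate both modalities, regress the next image embedding."""
    inputs = list(visual_feat) + list(audio_feat)  # feature concatenation
    pred = predict(inputs)
    # Mean squared prediction error serves as the (assumed) intrinsic reward.
    return sum((p - t) ** 2 for p, t in zip(pred, next_embedding)) / len(next_embedding)


def audio_event_reward(logits: Sequence[float], event_id: int) -> float:
    """Audio-event prediction: cross-entropy of the observed event cluster (assumed loss)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
    return log_z - logits[event_id]


# A predictor that is uncertain between two event clusters yields reward log(2).
reward = audio_event_reward([0.0, 0.0], event_id=0)
```

In the first function sound only enlarges the input of a visual prediction task, whereas in the second the sound itself, discretized into event clusters, is the prediction target.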