Music Gesture for Visual Sound Separation

04/20/2020 · by Chuang Gan, et al.

Recent deep learning approaches have achieved impressive performance on visual sound separation tasks. However, these approaches are mostly built on appearance and optical-flow-like motion feature representations, which exhibit limited ability to find the correlations between audio signals and visual points, especially when separating multiple instruments of the same type, such as multiple violins in a scene. To address this, we propose "Music Gesture," a keypoint-based structured representation that explicitly models the body and finger movements of musicians as they perform music. We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals. Experimental results on three music performance datasets show: 1) strong improvements upon benchmark metrics for hetero-musical separation tasks (i.e., different instruments); 2) a new ability to perform effective homo-musical separation for piano, flute, and trumpet duets, which to the best of our knowledge has not been achieved by alternative methods. Project page:




1 Introduction

Music performance is a profoundly physical activity. The interactions between the body and the instrument, in nuanced gestures, produce unique sounds [21]. When performing, pianists may strike the keys at a lower register or “tickle the ivories” up high; violin players may move vigorously through a progression while another sways gently with a melodic bass; flautists press combinations of keys to produce specific notes. As humans, we have the remarkable ability to distinguish different sounds from one another and to associate the sound we hear with the corresponding visual perception of the musician’s bodily gestures.

Inspired by this human ability, we propose “Music Gesture” (shown in Figure 1), a structured keypoint-based visual representation that makes use of body motion cues for sound source separation. Our model is built on the mix-and-separate self-supervised training procedure initially proposed by Zhao et al. [57]. Instead of relying purely on visual semantic cues [57, 17, 19, 53] or low-level optical-flow-like motion representations [56], we exploit the explicit human body and hand movements in the videos. To this end, we design a new framework consisting of a video analysis network and an audio-visual separation network. The video analysis network extracts body dynamics and the semantic context of musical instruments from video frames. The audio-visual separation network is then responsible for separating each sound source based on the visual context. To better leverage body motion for sound separation, we further design a new audio-visual fusion module in the middle of the audio-visual separation network that adjusts sound features conditioned on visual features.

We demonstrate the effectiveness of our model on three musical instrument datasets: URMP [31], MUSIC [57], and AtinPiano [36]. Experimental results show that by explicitly modeling body dynamics through keypoint-based structured visual representations, our approach performs favorably against state-of-the-art methods on both hetero-musical and homo-musical separation tasks. In summary, our work makes the following contributions:

  • We open a new research direction: exploiting body dynamics, via structured keypoint-based video representations, to guide sound source separation.

  • We propose a novel audio-visual fusion module to associate human body motion cues with the sound signals.

  • Our system outperforms previous state-of-the-art approaches on hetero-musical separation tasks by a large margin.

  • We show that keypoint-based structured representations open up new opportunities for solving the harder homo-musical separation problem for piano, flute, and trumpet duets.

2 Related Work

Sound separation.

Sound separation is a central problem in audio signal processing [34, 22]. Classic solutions are based on Non-negative Matrix Factorization (NMF) [52, 11, 48], which is not very effective as it relies on low-level correlations in the signals. Deep learning based methods have taken over in recent years. Simpson et al. [47] and Chandna et al. [9] proposed CNN models to predict time-frequency masks for music source separation and enhancement. Another challenging problem in speech separation is identity permutation: a spectrogram classification model cannot handle an arbitrary number of speakers talking simultaneously. To address this, Hershey et al. [24] proposed Deep Clustering and Yu et al. [55] proposed a speaker-independent training framework.

Visual sound separation.

Our work falls into the category of visual sound separation. Early works [5] leveraged the tight associations between audio and visual onset signals to perform audio-visual sound attribution. Recently, Zhao et al. [57] proposed a framework that learns from unlabeled videos to separate and localize sounds with the help of visual semantic cues. Gao et al. [17] combined deep networks with NMF for sound separation. Ephrat et al. [12] and Owens et al. [38] proposed using vision to improve the quality of speech separation. Xu et al. [53] and Gao et al. [19] further improved these models with recursive models and a co-separation loss. These works all demonstrated how semantic appearance can help with sound separation. However, they have limited capability to capture motion cues, which restricts their applicability to harder sound source separation problems.

Most recently, Zhao et al. [56] proposed leveraging temporal motion information to improve visual sound separation. However, this algorithm has not yet seen wide applicability to sound separation on real mixtures, primarily because the trajectory and optical-flow-like motion features it uses are limited in modeling human-object interactions, and thus cannot provide strong visual conditioning for sound separation. Our work overcomes these limitations by studying explicit body movement cues with structured keypoint-based representations for audio-visual learning, which has not previously been explored for audio-visual sound separation.

Audio-visual learning.

With the emergence of deep neural networks, bridging signals of different modalities has become easier, and a series of works on audio-visual learning have been published in the past few years. By learning audio and image models jointly, or separately via distillation, good audio/visual representations can be achieved [39, 4, 2, 33, 32, 14, 30]. Another interesting problem is sounding object localization, where the goal is to spatially associate sounds with the visual input [26, 25, 3, 44, 57]. Other interesting directions include biometric matching [37], sound generation for videos [58], auditory vehicle tracking [16], emotion recognition [1], audio-visual co-segmentation [43], audio-visual navigation [15], and 360/stereo sound from videos [18, 35].

Audio and body dynamics.

Figure 2: An overview of our model architecture. It consists of two components: a video analysis network and an audio-visual separation network. The video analysis network takes video frames and extracts global context and keypoint coordinates; a GCN then integrates the body dynamics with the semantic context and outputs a latent representation. Finally, the audio-visual separation network separates sources from the mixture audio conditioned on the visual features.

There are numerous works exploring the associations between speech and facial movements [7, 6]. Multi-modal signals extracted from faces and speech have been used for speech-driven facial animation [28, 50], generating high-quality talking faces from audio [49, 27], separating mixed speech of multiple speakers [12], on/off-screen audio source separation [38], and lip reading from raw videos [10]. In contrast, the correlations between body pose and sound are less explored. Most relevant to us are recent works on predicting body dynamics from music [45] and body rhythms from speech [20], which tackle the inverse of our goal of separating sound sources using body dynamic cues.

3 Approach

We first formalize the visual sound separation task and summarize our system pipeline in Section 3.1. Then we present the video analysis network for learning structured representations (Section 3.2) and the audio-visual separation model (Section 3.3). Finally, we introduce our training objective and inference procedure in Section 3.4.

3.1 Pipeline Overview

Our goal is to associate the body dynamics with the audio signals for sound source separation. We adopt the commonly used “mix-and-separate” self-supervised training procedure introduced in [57]. The main idea of this training procedure is to create synthetic training data by mixing arbitrary sound sources from different video clips. Then the learning objective is to separate each sound from mixtures conditioned on its associated visual context.

Concretely, our framework consists of two major components: a video analysis network and an audio-visual separation network (see Figure 2). During training, we randomly select video clips with paired video frames and audio signals, and mix their audios by a linear combination to form a synthetic mixture. Given a video clip, the video analysis network extracts global context and body dynamic features. The audio-visual separation network then separates that clip's audio from the mixture, conditioned on the corresponding visual context. Note that although the network is trained with a supervised objective, the supervision comes entirely from unlabeled video data; we therefore regard the training pipeline as self-supervised.
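The mix-and-separate data construction above can be sketched in a few lines of numpy; the sample rate, clip length, and sinusoidal stand-ins for real audio are illustrative assumptions:

```python
import numpy as np

def make_mixture(audio_a, audio_b):
    """Mix-and-separate: linearly combine two solo clips into a synthetic
    mixture; each solo then serves as the separation target for its own
    video. (Sketch of the training-data construction only.)"""
    mix = audio_a + audio_b
    return mix, [audio_a, audio_b]

# two hypothetical 6-second clips at 11 kHz (sinusoids stand in for audio)
sr = 11000
t = np.arange(6 * sr) / sr
solo_a = np.sin(2 * np.pi * 440 * t)
solo_b = np.sin(2 * np.pi * 660 * t)
mix, targets = make_mixture(solo_a, solo_b)
```

Because the targets are known by construction, no human annotation is needed to supervise the separation loss.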

3.2 Video Analysis Network

Our proposed video analysis network integrates keypoint-based structured visual representations, together with global semantic context features.

Visual semantic and keypoint representations.

To extract global semantic features from video frames, we use ResNet-50 [23] to extract the features after the last spatial average pooling layer from the first frame of each video clip, yielding a 2048-dimensional context feature vector per clip. We also aim to capture the explicit movement of body parts and fingers through keypoint representations. To this end, we adopt the AlphaPose toolbox [13] to estimate the 2D locations of human body joints. For hand pose estimation, we first apply a pre-trained hand detection model and then use the OpenPose [8] hand API [46] to estimate the coordinates of hand keypoints. As a result, we extract 18 keypoints for the human body and 21 keypoints for each hand. Since keypoint estimation in in-the-wild videos is challenging and noisy, we keep both the 2D coordinates and the confidence score of each estimated keypoint.
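Assembling the per-frame detector outputs into one node-feature array might look as follows; the exact array layouts handed over by the detectors are assumptions:

```python
import numpy as np

N_BODY, N_HAND = 18, 21          # keypoint counts stated in the paper
N_KPT = N_BODY + 2 * N_HAND      # 60 keypoints per frame

def frame_features(body_xy, body_conf, hands_xy, hands_conf):
    """Stack 2D coordinates and per-keypoint confidence into a single
    (60, 3) per-frame feature. Noisy, low-confidence detections are kept
    rather than dropped, so the downstream GCN can learn to down-weight
    them via the confidence channel."""
    xy = np.concatenate([body_xy, hands_xy.reshape(-1, 2)], axis=0)
    conf = np.concatenate([body_conf, hands_conf.reshape(-1)], axis=0)
    return np.concatenate([xy, conf[:, None]], axis=1)

# hypothetical detector outputs for one frame
body_xy, body_conf = np.random.rand(N_BODY, 2), np.random.rand(N_BODY)
hands_xy, hands_conf = np.random.rand(2, N_HAND, 2), np.random.rand(2, N_HAND)
feat = frame_features(body_xy, body_conf, hands_xy, hands_conf)  # (60, 3)
```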

Context-Aware Graph CNN.

Once the visual semantic features and keypoints are extracted from the raw video, we adopt a context-aware graph CNN (CT-GCN) to fuse the semantic context of the instruments with the human body dynamics. This architecture is designed for non-grid data and is well suited to explicitly modeling the spatial-temporal relationships among keypoints on the body and hands.

The network architecture is inspired by previous work on action recognition [54] and human shape reconstruction [29]. Similar to [54], we start by constructing an undirected spatial-temporal graph on a human skeleton sequence. In this graph, each node corresponds to a keypoint of the human body; edges reflect the natural connectivity of body keypoints.

The input feature of each node is the 2D coordinates and confidence score of a detected keypoint over time. To model the spatial-temporal body dynamics, we first apply a graph convolution network to encode the pose at each time step independently, and then perform a standard temporal convolution on the resulting tensor to fuse the temporal information. The encoded pose feature Z is defined as follows:

Z = W2 ∗ (Ā X W1),

where X ∈ R^{T×K×D} is the input feature tensor, W1 and W2 are the weight matrices of the spatial graph convolution and the temporal 2D convolution, and Ā is the row-normalized adjacency matrix of the graph; K is the number of keypoints and D is the feature dimension of each input node. Inspired by previous work [54], we define the adjacency matrix based on the joint connections of the body and fingers. The output of the GCN is an updated feature for each keypoint node.
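One such encoding step can be sketched in numpy: a per-frame spatial graph convolution followed by a temporal convolution. The scalar 3-tap temporal filter is a deliberate simplification of the 2D temporal convolution, and the toy chain skeleton is an assumption:

```python
import numpy as np

def ct_gcn_layer(X, A, W1, W2):
    """One pose-encoding step: a spatial graph convolution (row-normalized
    adjacency, weights W1) applied independently to every frame, then a
    3-tap temporal convolution with scalar taps W2 over the time axis.
    X: (T, K, D) keypoint features; A: (K, K) adjacency with self-loops."""
    A_hat = A / A.sum(axis=1, keepdims=True)            # row-normalize
    spatial = np.einsum('kj,tjd,de->tke', A_hat, X, W1)  # (T, K, E)
    padded = np.pad(spatial, ((1, 1), (0, 0), (0, 0)), mode='edge')
    T = spatial.shape[0]
    return sum(W2[i] * padded[i:i + T] for i in range(3))

# toy skeleton: 4 keypoints in a chain, 10 frames, 3-d input features
A = np.eye(4) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
X = np.random.rand(10, 4, 3)                   # (T, K, D)
Z = ct_gcn_layer(X, A, np.random.rand(3, 8), np.random.rand(3))
```

Encoding each frame spatially before mixing across time mirrors the two-stage design described above.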

To further incorporate the visual semantic cues, we concatenate the visual appearance context feature to each node feature as the final output of the video analysis network. The context-aware graph CNN thus models both semantic context and body dynamics, providing strong visual cues to guide sound separation. Other design options are possible; we leave them to future work.

3.3 Audio-Visual Separation Network

Finally, we have an audio-visual separation network, which takes the spectrogram of the mixture audio, together with the visual representation produced by the video analysis network, as input to predict a spectrogram mask and generate the audio signal for the selected video.

Audio Network.

We adopt a U-Net style architecture [42], i.e., an encoder-decoder network with skip connections, for the audio network. It consists of 4 dilated convolution layers and 4 dilated up-convolution layers, all using 3×3 spatial filters with stride 2 and dilation 1, each followed by a BatchNorm layer and a Leaky ReLU. The input of the audio network is the 2D time-frequency spectrogram of the mixture sound, and the output is a same-size binary spectrogram mask. We infuse the visual features into the middle of the U-Net to guide the sound separation.

Figure 3: Audio-visual fusion module of the model in Figure 2.

Audio-visual fusion.

To better leverage body dynamic cues to guide the sound separation, we adopt a self-attention [51] based cross-modal early-fusion module to capture the correlations between body movements and sound signals. As shown in Figure 3, the fused feature f_t at each time step t is defined as follows:

a_t = softmax(W_q s_t ⊙ W_k v_t),    f_t = [s_t; a_t ⊙ v_t] + MLP([s_t; a_t ⊙ v_t]),

where v_t and s_t represent the visual and sound features at time step t; F, D_v, and D_s denote the number of frequency bases of the sound spectrogram, the dimension of the visual features, and the dimension of the sound features, respectively. The softmax is computed along the dimension of the visual feature channels. The visual feature is weighted by the attention matrix and concatenated with the sound feature, and a multi-layer perceptron (MLP) with a residual connection produces the output features. The MLP is implemented as two fully-connected layers with a ReLU activation function. This attention mechanism encourages the model to focus on the discriminative body keypoints and associate them with the corresponding sound components on the spectrogram.
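The fusion at a single time step can be sketched as follows; the projection matrices, their shapes, and the random inputs are illustrative assumptions rather than the paper's exact parameterization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_step(v_t, s_t, Wq, Wk, W1, W2):
    """Cross-modal fusion at one time step: attention weights over visual
    channels come from an interaction of projected sound and visual
    features; the attended visual vector is concatenated with the sound
    feature; a 2-layer MLP with a residual connection gives the output."""
    attn = softmax((Wq @ s_t) * (Wk @ v_t))   # softmax over visual channels
    fused = np.concatenate([s_t, attn * v_t])
    hidden = np.maximum(W1 @ fused, 0.0)      # ReLU
    return fused + W2 @ hidden                # residual connection

Dv, Ds, H = 16, 32, 64  # assumed visual, sound, and hidden dimensions
out = fuse_step(np.random.rand(Dv), np.random.rand(Ds),
                np.random.rand(Dv, Ds), np.random.rand(Dv, Dv),
                np.random.rand(H, Dv + Ds), np.random.rand(Dv + Ds, H))
```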

3.4 Training and Inferences

The learning objective of our model is to estimate a binary mask M. The ground-truth mask of the n-th video indicates whether the target sound is the dominant component of the input mixture on the magnitude spectrogram S, i.e.,

M_n(u, v) = [S_n(u, v) ≥ S_m(u, v), ∀m ∈ {1, …, N}],

where (u, v) are the time-frequency coordinates in the sound spectrogram. The network is trained by minimizing the per-pixel sigmoid cross-entropy loss between the estimated masks and the ground-truth binary masks. The predicted mask is then thresholded and multiplied with the input complex STFT coefficients to obtain a predicted sound spectrogram. Finally, we apply an inverse short-time Fourier transform (iSTFT) with the same transformation parameters to the predicted spectrogram to reconstruct the waveform of the separated sound.
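The "dominant component" criterion can be sketched directly; the spectrogram shapes are assumptions, and ties are broken by argmax rather than ≥ for simplicity:

```python
import numpy as np

def ground_truth_masks(source_mags):
    """Per-source binary masks: source n is assigned a 1 at a
    time-frequency bin iff its magnitude dominates all sources at that
    bin. Input: (N, F, T) stack of magnitude spectrograms."""
    dominant = np.argmax(source_mags, axis=0)   # (F, T) index of winner
    return np.stack([(dominant == n).astype(np.float32)
                     for n in range(len(source_mags))])

# two hypothetical 512x256 magnitude spectrograms
mags = np.random.rand(2, 512, 256)
masks = ground_truth_masks(mags)
```

With argmax tie-breaking the masks partition the spectrogram: exactly one source "owns" each bin.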

During testing, our model takes a single realistic multi-source video and performs sound source separation. We first localize the humans in the video frames. For each detected person, we use the video analysis network to extract visual features, which condition the separation network to isolate the portion of the sound belonging to that musician from the mixed audio.

4 Experiments

In this section, we discuss our experiments, implementation details, comparisons and evaluations.

4.1 Dataset

We perform experiments on three music performance video datasets: MUSIC-21 [56], URMP [31], and AtinPiano [36]. MUSIC-21 is an untrimmed video dataset crawled from YouTube by keyword query. It contains music performances in 21 categories; the dataset is relatively clean and was collected for training and evaluating visual sound source separation models. URMP [31] is a high-quality multi-instrument video dataset recorded in a studio that provides ground-truth labels for each sound source. AtinPiano [36] is a dataset of piano recordings filmed with the camera looking down on the keyboard and hands.

Methods               | 2-Mix SDR | 2-Mix SIR | 3-Mix SDR | 3-Mix SIR
NMF [52]              | 2.78      | 6.70      | 2.01      | 2.08
Deep Separation [9]   | 4.75      | 7.00      | -         | -
MIML [17]             | 4.25      | 6.23      | -         | -
Sound of Pixels [57]  | 7.52      | 13.01     | 3.65      | 8.77
Co-Separation [19]    | 7.64      | 13.8      | 3.94      | 8.93
Sound of Motions [56] | 8.31      | 14.82     | 4.87      | 9.48
Ours                  | 10.12     | 15.81     | 5.41      | 11.47

Table 1: Sound source separation performance on 2-mix and 3-mix mixtures of different instruments. Compared to previous approaches, our model with body dynamic motion information performs better in sound separation.

4.2 Hetero-musical Separation

We first evaluate the model performance in the task of separating sounds from different kinds of instruments on the MUSIC dataset.

Baselines and evaluation metrics

We compare against six state-of-the-art systems:

  • NMF [52] is a well established pipeline for audio-only source separation based on matrix factorization;

  • Deep Separation [9] is a CNN-based audio-only source separation approach;

  • MIML [17] is a model that combines NMF decomposition and multi-instance multi-label learning;

  • Sound of Pixels [57] is a pioneering work that uses vision for sound source separation;

  • Co-Separation [19] devises a new model that incorporates an object-level co-separation loss into the mix-and-separate framework [57];

  • Sound of Motions [56] is a recently proposed self-supervised model which leverages trajectory motion cues.

We adopt blind separation metrics, including signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR), to quantitatively compare the quality of the sound separation. The results reported in this paper were obtained using the open-source mir_eval [41] library.
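mir_eval computes the full BSS-eval decomposition; as a rough intuition, SDR compares the energy of the reference to the energy of the residual. A simplified numpy sketch (not the permutation-aware BSS-eval, which further splits the residual into interference and artifact terms):

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Simplified signal-to-distortion ratio in dB: reference energy
    over residual energy. Only an intuition for the BSS-eval SDR that
    mir_eval.separation.bss_eval_sources computes."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

ref = np.sin(np.linspace(0, 100, 11000))
good, bad = simple_sdr(ref, 0.9 * ref), simple_sdr(ref, 0.5 * ref)
```

A closer estimate yields a larger (better) SDR, so `good` exceeds `bad` here.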

Experimental Setup

Following the experimental protocol of Zhao et al. [56], we split all videos in the MUSIC dataset into a training set and a test set. We train and evaluate our model on mix-2 and mix-3 samples, which contain 2 and 3 sound sources of different instruments, respectively. Since real multi-source videos in the MUSIC dataset do not have ground-truth labels for quantitative evaluation, we construct a synthetic test set by mixing solo videos. Model performance is reported on a validation set with 256 pairs of sound mixtures, the same as [56]. We also perform a human study on real mixtures from the MUSIC and URMP datasets to measure perceptual quality.

Implementation Details

We implement our framework in PyTorch. We first extract a global context feature from each video clip using ResNet-50 [23], and the coordinates of body and hand keypoints for each frame using OpenPose [8] and AlphaPose [13]. Our GCN model consists of 11 layers with residual connections. When training the graph CNN, we first pass the keypoint coordinates through a batch normalization layer to keep the input scales consistent. During training, we also randomly perturb the coordinates as data augmentation to avoid overfitting.

For audio pre-processing, we first re-sample the audio to 11 kHz. During training, we randomly take a 6-second video clip from the dataset. The audio-visual separation network takes the 6-second mixed audio clip as input and transforms it into a spectrogram via a Short-Time Fourier Transform (STFT), with frame size 1022 and hop size 256. The spectrogram is then fed into a U-Net with 4 dilated convolution and 4 deconvolution layers, whose output is an estimated mask. We threshold the mask at 0.7 to obtain a binary mask and multiply it with the input mixture spectrogram. An iSTFT with the same parameters as the STFT is applied to obtain the final separated audio waveforms.
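The shape bookkeeping implied by these settings can be checked directly; the no-padding frame count is an assumption about the STFT configuration:

```python
SR, DUR = 11000, 6           # sampling rate (Hz) and clip length (s)
N_FFT, HOP = 1022, 256       # STFT frame size and hop size from the text

n_samples = SR * DUR                         # 66000 samples per clip
n_freq = N_FFT // 2 + 1                      # 512 frequency bins
n_frames = 1 + (n_samples - N_FFT) // HOP    # time frames without padding
```

A frame size of 1022 is chosen so the one-sided spectrum has 512 bins, a convenient power-of-two height for the U-Net.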

We train our model using the SGD optimizer with momentum 0.9. The audio separation network and the fusion module use a learning rate of 1e-2; the ST-GCN and appearance networks use a learning rate of 1e-3.

Quantitative Evaluation.

Table 1 summarizes the comparison against state-of-the-art methods on MUSIC. Our method consistently outperforms all baselines in separation accuracy across metrics. Remarkably, our system outperforms the previous state-of-the-art algorithm [56] by 1.8 dB on 2-mix and 0.6 dB on 3-mix source separation in terms of SDR. These quantitative results suggest that our model successfully exploits explicit body motion to improve sound separation quality.

Qualitative evaluation on real mixtures.

Our quantitative results demonstrate that our model achieves better results than the baselines. However, these metrics are limited in their ability to reflect the actual perceptual quality of sound separation on real-world videos. Therefore, we further conduct a subjective human study on real mixture videos from the MUSIC and URMP datasets on Amazon Mechanical Turk (AMT).

Specifically, we compare the sound separation results of our model with those of the best baseline system [56] (its results on real mixtures were provided by the authors). The AMT workers are asked to compare the two systems and answer the question: “Which sound separation result is better?” We randomly shuffle the order of the two models to avoid shortcut solutions, and each job is performed independently by 3 AMT workers. Results under majority voting are shown in Table 2: workers favor our system for both 2-mix and 3-mix sound separation.

Method                | 2-Mix | 3-Mix
Sound of Motions [56] | 24%   | 16%
Ours                  | 76%   | 84%

Table 2: Human evaluation results for sound source separation on mixtures of different instruments.

4.3 Ablation Study

In this section, we perform in-depth ablation studies to evaluate the impact of each component of our model.

Keypoint-based representation.

The main contribution of our paper is the use of explicit body motion, via keypoint-based structured representations, for source separation. To further understand the ability of these representations, we conduct an ablation study using the keypoint-based representation only, without the RGB context features. Interestingly, keypoint-based representations alone already achieve very strong results (see Table 3). We hope these findings inspire more work using structured keypoint-based representations for audio-visual scene analysis tasks.

Method          | SDR
Ours w/o fusion | 9.64
Ours w/o RGB    | 10.22
Ours            | 10.12

Table 3: Ablation study under the SDR metric on mixtures of 2 different instruments.

Visual-Audio Fusion Module.

We propose a novel attention-based audio-visual fusion module. To verify its efficacy, we replace this module with the Feature-wise Linear Modulation (FiLM) [40] used in [56]. The comparison is shown in Table 3: the proposed audio-visual fusion module brings a 0.5 dB improvement in terms of SDR on 2-mix sound source separation.

4.4 Homo-musical Separation

In this section, we conduct experiments on a more challenging task: separating sounds generated by instruments of the same kind.

Experiment Setup

We select 5 kinds of musical instruments whose sounds are closely tied to body dynamics for evaluation: trumpet, flute, piano, violin, and cello.

Inspired by previous work [56, 38], we adopt a 2-stage curriculum learning strategy to train the same-instrument sound separation model. In particular, we first pre-train the model on multi-instrument separation, and then learn to separate the same instrument. We compare our model against SoM [56], since previous appearance-based models fail to produce meaningful results in this challenging setting. The results are measured by both automatic SDR scores and human evaluations on AMT.

Results Analysis.

Results are shown in Table 4 and Table 5, from which we make three key observations: 1) our model consistently outperforms the SoM system [56] on all five instruments under both automatic and human evaluation metrics; 2) the quantitative results on separating violin and cello duets are close (see Table 4), but the SoM system is quite brittle on real mixtures, where people vote for our system far more often, as shown in Table 5; 3) SoM gives markedly inferior results on trumpet, piano, and flute duets, with a gap larger than 3 dB. This is not surprising, since separating duets of these three instruments relies mainly on hand movements, which trajectory and optical-flow features can hardly capture at such a fine grain. Our approach overcomes this challenge by explicitly modeling body motion through the coordinate changes of hand keypoints. These results further validate the efficacy of body dynamics for solving harder visual sound separation problems.

Instrument | SoM [56] | Ours
trumpet    | 1.8      | 4.9
flute      | 1.5      | 5.3
piano      | 0.8      | 3.8
violin     | 6.3      | 6.7
cello      | 5.4      | 6.1

Table 4: Sound source separation performance on duets of the same instruments under the SDR metric.
Instrument | SoM [56] | Ours
trumpet    | 18%      | 82%
flute      | 14%      | 86%
piano      | 30%      | 70%
violin     | 26%      | 74%
cello      | 28%      | 72%

Table 5: Human evaluation results for sound source separation on mixtures of the same instruments.
Figure 4: The attention map of body keypoints. Brighter color means higher attention score.
Figure 5: Qualitative results on visual sound separation compared with Sound of Motions (SoM) [56].

4.5 Visualizations

As a further analysis, we examine how body keypoints matter for sound source separation. Figure 4 visualizes the learned attention maps over keypoints in the audio-visual fusion module. We observe that our model tends to focus on hand keypoints when separating guitar and flute sounds, while paying more attention to elbows when separating cello and violin.

Figure 5 shows a qualitative comparison between our model and the previous state of the art, SoM [56], on separating 3 different instruments and 2 same-type instruments. The first row shows example video frames; the second row shows the spectrogram of the audio mixture. The third to fifth rows show the ground-truth masks, the masks predicted by SoM, and the masks predicted by our method. The sixth to eighth rows show the ground-truth spectrogram and the predicted spectrograms obtained by applying the masks to the input spectrogram. Our system produces cleaner sound separation outputs.

Though the results are encouraging and constitute a noticeable step toward more challenging visual sound separation, our system is still far from perfect. We observed that our method is not resilient to camera viewpoint changes or to occlusions of the musician's body parts. We conjecture that unsupervised learning of keypoints from raw images might be a promising direction for future visual sound separation work.

5 Conclusions and Future Work

In this paper, we show that keypoint-based structured visual representations are powerful for visual sound separation. Extensive evaluations show that, compared to previous appearance- and low-level-motion-based models, we perform better on audio-visual source separation of different instruments; we also achieve remarkable results on separating sounds of the same instruments (e.g., piano, flute, and trumpet), which was not possible before. We hope our work will open up avenues for using structured visual representations in audio-visual scene analysis. In the future, we plan to extend our approach to more general audio-visual data with more complex human-object interactions.


This work is supported by ONR MURI N00014-16-1-2007, the Center for Brain, Minds, and Machines (CBMM, NSF STC award CCF-1231216), and IBM Research.


  • [1] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman (2018) Emotion recognition in speech using cross-modal transfer in the wild. ACM Multimedia. Cited by: §2.
  • [2] R. Arandjelovic and A. Zisserman (2017) Look, listen and learn. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 609–617. Cited by: §2.
  • [3] R. Arandjelović and A. Zisserman (2017) Objects that sound. arXiv preprint arXiv:1712.06651. Cited by: §2.
  • [4] Y. Aytar, C. Vondrick, and A. Torralba (2016) Soundnet: learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pp. 892–900. Cited by: §2.
  • [5] Z. Barzelay and Y. Y. Schechner (2007) Harmony in motion. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pp. 1–8. Cited by: §2.
  • [6] M. Brand (1999) Voice puppetry. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 21–28. Cited by: §2.
  • [7] C. Bregler, M. Covell, and M. Slaney (1997) Video rewrite: driving visual speech with audio.. Cited by: §2.
  • [8] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh (2018) OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. In arXiv preprint arXiv:1812.08008, Cited by: §3.2, §4.2.
  • [9] P. Chandna, M. Miron, J. Janer, and E. Gómez (2017) Monoaural audio source separation using deep convolutional neural networks. In ICLVASS, pp. 258–266. Cited by: §2, 2nd item, Table 1.
  • [10] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman (2017) Lip reading sentences in the wild.. In CVPR, pp. 3444–3453. Cited by: §2.
  • [11] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari (2009) Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons. Cited by: §2.
  • [12] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein (2018) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG) 37 (4), pp. 112. Cited by: §2, §2.
  • [13] H. Fang, S. Xie, Y. Tai, and C. Lu (2017) RMPE: regional multi-person pose estimation. In ICCV, Cited by: §3.2, §4.2.
  • [14] C. Gan, N. Wang, Y. Yang, D. Yeung, and A. G. Hauptmann (2015) Devnet: a deep event network for multimedia event detection and evidence recounting. In CVPR, pp. 2568–2577. Cited by: §2.
  • [15] C. Gan, Y. Zhang, J. Wu, B. Gong, and J. B. Tenenbaum (2020) Look, listen, and act: towards audio-visual embodied navigation. ICRA. Cited by: §2.
  • [16] C. Gan, H. Zhao, P. Chen, D. Cox, and A. Torralba (2019) Self-supervised moving vehicle tracking with stereo sound. In ICCV, pp. 7053–7062. Cited by: §2.
  • [17] R. Gao, R. Feris, and K. Grauman (2018) Learning to separate object sounds by watching unlabeled video. In ECCV, Cited by: §1, §2, 3rd item, Table 1.
  • [18] R. Gao and K. Grauman (2018) 2.5 d visual sound. arXiv preprint arXiv:1812.04204. Cited by: §2.
  • [19] R. Gao and K. Grauman (2019) Co-separating sounds of visual objects. ICCV. Cited by: §1, §2, 5th item, Table 1.
  • [20] S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik (2019) Learning individual styles of conversational gesture. In CVPR, pp. 3497–3506. Cited by: §2.
  • [21] R. I. Godøy and M. Leman (2010) Musical gestures: sound, movement, and meaning. Routledge. Cited by: §1.
  • [22] S. Haykin and Z. Chen (2005) The cocktail party problem. Neural computation 17 (9), pp. 1875–1902. Cited by: §2.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2, §4.2.
  • [24] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: discriminative embeddings for segmentation and separation. In ICASSP, pp. 31–35. Cited by: §2.
  • [25] J. R. Hershey and J. R. Movellan (2000) Audio vision: using audio-visual synchrony to locate sounds. In NIPS, pp. 813–819. Cited by: §2.
  • [26] H. Izadinia, I. Saleemi, and M. Shah (2013) Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Transactions on Multimedia 15 (2), pp. 378–390. Cited by: §2.
  • [27] A. Jamaludin, J. S. Chung, and A. Zisserman (2019) You said that?: synthesising talking faces from audio. International Journal of Computer Vision, pp. 1–13. Cited by: §2.
  • [28] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen (2017) Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) 36 (4), pp. 94. Cited by: §2.
  • [29] N. Kolotouros, G. Pavlakos, and K. Daniilidis (2019) Convolutional mesh regression for single-image human shape reconstruction. In CVPR, pp. 4501–4510. Cited by: §3.2.
  • [30] B. Korbar, D. Tran, and L. Torresani (2018) Co-training of audio and video representations from self-supervised temporal synchronization. arXiv preprint arXiv:1807.00230. Cited by: §2.
  • [31] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma (2018) Creating a multitrack classical music performance dataset for multimodal music analysis: challenges, insights, and applications. IEEE Transactions on Multimedia 21 (2), pp. 522–535. Cited by: §1, §4.1.
  • [32] X. Long, C. Gan, G. De Melo, X. Liu, Y. Li, F. Li, and S. Wen (2018) Multimodal keyless attention fusion for video classification. In AAAI, Cited by: §2.
  • [33] X. Long, C. Gan, G. De Melo, J. Wu, X. Liu, and S. Wen (2018) Attention clusters: purely attention based local feature integration for video classification. In CVPR, pp. 7834–7843. Cited by: §2.
  • [34] J. H. McDermott (2009) The cocktail party problem. Current Biology 19 (22), pp. R1024–R1027. Cited by: §2.
  • [35] P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang (2018) Self-supervised generation of spatial audio for 360° video. In NIPS, Cited by: §2.
  • [36] A. Moryossef, Y. Elazar, and Y. Goldberg (2020) At your fingertips: automatic piano fingering detection. Cited by: §1, §4.1.
  • [37] A. Nagrani, S. Albanie, and A. Zisserman (2018) Seeing voices and hearing faces: cross-modal biometric matching. arXiv preprint arXiv:1804.00326. Cited by: §2.
  • [38] A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. ECCV. Cited by: §2, §2, §4.4.
  • [39] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba (2016) Ambient sound provides supervision for visual learning. In ECCV, pp. 801–816. Cited by: §2.
  • [40] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2017) FiLM: visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871. Cited by: §4.3.
  • [41] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. Ellis (2014) mir_eval: a transparent implementation of common MIR metrics. In ISMIR, Cited by: §4.2.
  • [42] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.3.
  • [43] A. Rouditchenko, H. Zhao, C. Gan, J. McDermott, and A. Torralba (2019) Self-supervised audio-visual co-segmentation. In ICASSP, pp. 2357–2361. Cited by: §2.
  • [44] A. Senocak, T. Oh, J. Kim, M. Yang, and I. S. Kweon (2018) Learning to localize sound source in visual scenes. arXiv preprint arXiv:1803.03849. Cited by: §2.
  • [45] E. Shlizerman, L. Dery, H. Schoen, and I. Kemelmacher-Shlizerman (2018) Audio to body dynamics. In CVPR, pp. 7574–7583. Cited by: §2.
  • [46] T. Simon, H. Joo, I. Matthews, and Y. Sheikh (2017) Hand keypoint detection in single images using multiview bootstrapping. In CVPR, Cited by: §3.2.
  • [47] A. J. Simpson, G. Roma, and M. D. Plumbley (2015) Deep karaoke: extracting vocals from musical mixtures using a convolutional deep neural network. In International Conference on Latent Variable Analysis and Signal Separation, pp. 429–436. Cited by: §2.
  • [48] P. Smaragdis and J. C. Brown (2003) Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 177–180. Cited by: §2.
  • [49] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman (2017) Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36 (4), pp. 95. Cited by: §2.
  • [50] S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. G. Rodriguez, J. Hodgins, and I. Matthews (2017) A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG) 36 (4), pp. 93. Cited by: §2.
  • [51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §3.3.
  • [52] T. Virtanen (2007) Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing 15 (3), pp. 1066–1074. Cited by: §2, 1st item, Table 1.
  • [53] X. Xu, B. Dai, and D. Lin (2019) Recursive visual sound separation using minus-plus net. In ICCV, Cited by: §1, §2.
  • [54] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, Cited by: §3.2, §3.2.
  • [55] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In ICASSP, pp. 241–245. Cited by: §2.
  • [56] H. Zhao, C. Gan, W. Ma, and A. Torralba (2019) The sound of motions. ICCV. Cited by: §1, §2, Figure 5, 6th item, §4.1, §4.2, §4.2, §4.2, §4.3, §4.4, §4.4, §4.5, Table 1, Table 2, Table 4, Table 5.
  • [57] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba (2018) The sound of pixels. In ECCV, Cited by: §1, §1, §2, §2, §3.1, 4th item, 5th item, Table 1.
  • [58] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg (2017) Visual to sound: generating natural sound for videos in the wild. arXiv preprint arXiv:1712.01393. Cited by: §2.