Deep Audio-Visual Learning: A Survey

by Mandi Luo, et al.

Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend to leverage these two modalities either to improve the performance of previously considered single-modality tasks or to address new challenging problems. In this paper, we provide a comprehensive survey of recent audio-visual learning development. We divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods as well as the remaining challenges of each subfield are further discussed. Finally, we summarize the commonly used datasets and performance metrics.



1 Introduction

Human perception is multidimensional and includes vision, hearing, touch, taste, and smell. In recent years, along with the vigorous development of artificial intelligence technology, the trend from single-modality learning to multimodality learning has become crucial to better machine perception. Analyses of audio and visual information, representing the two most important perceptual modalities in our daily life, have been widely developed in both academia and industry in the past decades. Prominent achievements include speech recognition [107, 71], facial recognition [54, 42], etc. Audio-visual learning (AVL) using both modalities has been introduced to overcome the limitation of perception tasks in each modality. In addition, exploring the relationship between audio and visual information leads to more interesting and important research topics and ultimately better perspectives on machine learning.

The purpose of this article is to provide an overview of the key methodologies in audio-visual learning, which aims to discover the relationship between audio and visual data for many challenging tasks. In this paper, we mainly divide these efforts into four categories: (1) audio-visual separation and localization, (2) audio-visual correspondence learning, (3) audio and visual generation, and (4) audio-visual representation learning.

Audio-visual separation and localization aim to separate specific sounds emanating from the corresponding objects and localize each sound in the visual context, as illustrated in panel (a) of the overview figure. Audio separation has been investigated extensively in the signal processing community during the past two decades. With the addition of the visual modality, audio separation can be transformed into audio-visual separation, which has proven to be more effective in noisy scenes [43, 3, 39]. Furthermore, introducing the visual modality allows for audio localization, i.e., the localization of a sound in the visual modality according to the audio input. The tasks of audio-visual separation and localization themselves not only lead to valuable applications but also provide the foundation for other audio-visual tasks, e.g., generating spatial audio for 360° video [82]. Most studies in this area focus on unsupervised learning due to the lack of training labels.

Audio-visual correspondence learning focuses on discovering the global semantic relation between audio and visual modalities, as shown in panel (b) of the overview figure. It consists of audio-visual retrieval and audio-visual speech recognition tasks. The former uses audio or an image to search for its counterpart in another modality, while the latter derives from the conventional speech recognition task and leverages visual information to provide a more semantic prior to improve recognition performance. Although both of these tasks have been extensively studied, they still entail major challenges, especially for fine-grained cross-modality retrieval and homonyms in speech recognition.

Audio-visual generation tries to synthesize one modality from the other, which is different from the above two tasks that leverage both audio and visual modalities as inputs. Trying to make a machine that is creative is always challenging, and many generative models have been proposed [51, 67]. Audio-visual cross-modality generation has recently drawn considerable attention. It aims to generate audio from visual signals, or vice versa. Although it is easy for a human to perceive the natural correlation between sounds and appearance, this task is challenging for machines due to heterogeneity across modalities. As shown in panel (c) of the overview figure, vision to audio generation mainly focuses on recovering speech from lip sequences or predicting the sounds that may occur in the given scenes. In contrast, audio to vision generation can be classified into three categories: audio-driven image generation, body motion generation, and talking face generation.

The last task—audio-visual representation learning—aims to automatically discover the representation from raw data. A human can easily recognize audio or video based on long-term brain cognition. However, machine learning algorithms such as deep learning models are heavily dependent on data representation. Therefore, learning suitable data representations for machine learning algorithms may improve performance.

Unfortunately, real-world data such as images, videos and audio do not possess specific algorithmically defined features [14]. Therefore, an effective representation of data determines the success of machine learning algorithms. Recent studies seeking better representation have designed various tasks, such as audio-visual correspondence (AVC) [7] and audio-visual temporal synchronization (AVTS) [69]. By leveraging such a learned representation, one can more easily solve audio-visual tasks mentioned in the very beginning.

In this paper, we present a comprehensive survey of the above four directions of audio-visual learning. The rest of this paper is organized as follows. We introduce the four directions in Secs. 2, 3, 4 and 5. Sec. 6 summarizes the commonly used public audio-visual datasets. Finally, Sec. 8 concludes the paper.

2 Audio-visual Separation and Localization

The objective of audio-visual separation is to separate different sounds from the corresponding objects, while audio-visual localization mainly focuses on localizing a sound in a visual context. As shown in Fig. 1, we classify types of this task by different identities: speakers (Fig. 1 (a)) and objects (Fig. 1 (b)). The former concentrates on a person’s speech and can be used in television programs to enhance the target speakers’ voice, while the latter is a more general and challenging task that separates arbitrary objects rather than speakers only. In this section, we provide an overview of these two tasks, examining the motivations, network architectures, advantages, and disadvantages.

2.1 Speaker Separation

Speaker separation, also known as the ‘cocktail party problem’, is a challenging task that aims to isolate a single speech signal in a noisy scene. Some studies tried to solve the problem of audio separation with only the audio modality and achieved exciting results [64, 80]. Advanced approaches [43, 39] tried to utilize visual information to aid the speaker separation task and significantly surpassed single modality-based methods. The early attempts leveraged mutual information to learn the joint distribution between the audio and the video [34, 41]. Subsequently, several methods focused on analyzing videos containing salient motion signals and the corresponding audio events (e.g., a mouth starting to move or a hand on piano suddenly accelerating) [17, 98].

Figure 1: Illustration of audio-visual separation and localization task. Paths 1 and 2 denote separation and localization tasks, respectively.
Figure 2: Basic pipeline of a noisy audio filter.

Gabbay et al. [43] proposed isolating the voice of a specific speaker and eliminating other sounds in an audio-visual manner. Instead of directly extracting the target speaker’s voice from the noisy sound, which may bias the training model, the researchers first fed the video frames into a video-to-speech model and then predicted the speaker’s voice by the facial movements captured in the video. Afterwards, the predicted voice was used to filter the mixtures of sounds, as shown in Fig. 2.

Although Gabbay et al. [43] improved the quality of separated voice by adding the visual modality, their approach was only applicable in controlled environments. To obtain intelligible speech in an unconstrained environment, Afouras et al. [3] proposed a deep audio-visual speech enhancement network to separate the speaker’s voice of the given lip region by predicting both the magnitude and phase of the target signal. The authors treated the spectrograms as temporal signals rather than images for a network. Additionally, instead of directly predicting clean signal magnitudes, they also tried to generate a more effective soft mask for filtering.

In contrast to previous approaches that require training a separate model for each speaker of interest (speaker-dependent models), Ephrat et al. [39] proposed a speaker-independent model that was only trained once and was then applicable to any speaker. This approach even outperformed the state-of-the-art speaker-dependent audio-visual speech separation methods. The relevant model consists of multiple visual streams and one audio stream, concatenating the features from different streams into a joint audio-visual representation. This feature is further processed by a bidirectional LSTM and three fully connected layers. Finally, an elaborate spectrogram mask is learned for each speaker to be multiplied by the noisy input. The masked spectrogram is then converted back to a waveform to obtain an isolated speech signal for each speaker. Lu et al. [79] designed a network similar to that of [39]. The difference is that the authors enforced an audio-visual matching network to distinguish the correspondence between speech and human lip movements. Therefore, they could obtain clear speech.
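The mask-and-multiply step shared by the methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the network of [39] or [3]: the mask here is an oracle ratio mask computed from known source magnitudes (in the surveyed methods a network predicts it), it is real-valued, and the function names and toy values are hypothetical.

```python
def ratio_mask(target_mag, noise_mag, eps=1e-8):
    """Soft mask in [0, 1]: the share of each time-frequency bin owned by the target."""
    return [[t / (t + n + eps) for t, n in zip(t_row, n_row)]
            for t_row, n_row in zip(target_mag, noise_mag)]

def apply_mask(mixture_spec, mask):
    """Multiply a (complex) mixture spectrogram by a mask, bin by bin."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture_spec)]

# Toy 2x3 spectrogram (frequency bins x time frames); the values are made up.
target = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]]
noise = [[1.0, 3.0, 0.0], [0.5, 1.5, 4.0]]
# Simplification: the mixture is the sum of magnitudes with zero phase.
mixture = [[complex(t + n, 0.0) for t, n in zip(t_row, n_row)]
           for t_row, n_row in zip(target, noise)]

mask = ratio_mask(target, noise)
estimate = apply_mask(mixture, mask)  # approximately recovers `target`
```

In a full pipeline, the masked spectrogram would be passed through an inverse short-time Fourier transform to return to the waveform domain.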

Instead of directly utilizing video as a condition, Morrone et al. [83] further introduced landmarks as a fine-grained feature to generate time-frequency masks for filtering the mixed-speech spectrogram.

2.2 Separating and Localizing Objects’ Sounds

Instead of matching a specific lip movement from a noisy environment as in the speaker separation task, humans focus more on objects while dealing with sound separation and localization. It is difficult to find a clear correspondence between audio and visual modalities due to the challenge of exploring the prior sounds from different objects.

2.2.1 Separation

An early attempt to solve this localization problem can be traced back to 2000 [55], in a study that synchronized low-level features of sounds and videos. Fisher et al. [41] later proposed using a nonparametric approach to learn a joint distribution of visual and audio signals and then project both of them to a learned subspace. Furthermore, several acoustics-based methods [123, 141] were described that required specific devices for surveillance and instrument engineering, such as microphone arrays used to capture the differences in the arrival of sounds.

To learn audio source separation from large-scale in-the-wild videos containing multiple audio sources per video, Gao et al. [44] suggested learning an audio-visual localization model from unlabeled videos and then exploiting the visual context for audio source separation. Their approach relied on a multi-instance multilabel learning framework to disentangle the audio frequencies related to individual visual objects even without observing or hearing them in isolation. The multilabel learning framework was fed by a bag of audio basis vectors for each video, and then the bag-level prediction of the objects presented in the audio was obtained.

| Category | Method | Ideas & Strengths | Weaknesses |
| --- | --- | --- | --- |
| Speaker Separation | Gabbay et al. [43] | Predict speaker’s voice based on faces in video used as a filter | Can only be used in controlled environments |
| Speaker Separation | Afouras et al. [3] | Generate a soft mask for filtering in the wild | Requires training a separate model for each speaker of interest |
| Speaker Separation | Lu et al. [79] | Distinguish the correspondence between speech and human speech lip movements | Two speakers only; hardly applied for background noise |
| Speaker Separation | Ephrat et al. [39] | Predict a complex spectrogram mask for each speaker; trained once, applicable to any speaker | The model is too complicated and lacks explanation |
| Speaker Separation | Morrone et al. [83] | Use landmarks to generate time-frequency masks | Additional landmark detection required |
| Separate and Localize Objects’ Sounds | Gao et al. [44] | Disentangle audio frequencies related to visual objects | Separated audio only |
| Separate and Localize Objects’ Sounds | Senocak et al. [106] | Focus on the primary area by using attention | Localized sound source only |
| Separate and Localize Objects’ Sounds | Tian et al. [119] | Joint modeling of auditory and visual modalities | Localized sound source only |
| Separate and Localize Objects’ Sounds | Pu et al. [98] | Use low rank to extract the sparsely correlated components | Not for the in-the-wild |
| Separate and Localize Objects’ Sounds | Zhao et al. [137] | Mix and separate a given audio; without traditional supervision | Motion information is not considered |
| Separate and Localize Objects’ Sounds | Zhao et al. [136] | Introduce motion trajectory and curriculum learning | Only suitable for synchronized video and audio input |
| Separate and Localize Objects’ Sounds | Rouditchenko et al. [102] | Separation and localization use only one modality input | Does not fully utilize temporal information |
| Separate and Localize Objects’ Sounds | Parekh et al. [94] | Weakly supervised learning via multiple-instance learning | Only a bounding box proposed on the image |

Table 1: Summary of recent audio-visual separation and localization approaches.

2.2.2 Localization

Instead of only separating audio, can machines localize the sound source merely by observing sound and visual scene pairs as a human can? There is evidence both in physiology and psychology that sound localization of acoustic signals is strongly influenced by synchronicity of their visual signals [55]. The past efforts in this domain were limited to requiring specific devices or additional features. Izadinia et al. [65] proposed utilizing the velocity and acceleration of moving objects as visual features to assign sounds to them. Zunino et al. [141] presented a new hybrid device for sound and optical imaging that was primarily suitable for automatic monitoring.

As the number of unlabeled videos on the Internet has been increasing dramatically, recent methods mainly focus on unsupervised learning. Additionally, modeling audio and visual modalities simultaneously tends to outperform independent modeling. Senocak et al. [106] learned to localize sound sources by merely watching and listening to videos. The relevant model mainly consisted of three networks, namely, sound and visual networks and an attention network trained via the distance ratio [56] unsupervised loss.

Attention mechanisms cause the model to focus on the primary area. They provide prior knowledge in a semisupervised setting. As a result, the network can be converted into a unified one that can learn better from data without additional annotations. To enable cross-modality localization, Tian et al. [119] proposed capturing the semantics of sound-emitting objects via the learned attention and leveraging temporal alignment to discover the correlations between the two modalities.
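The role of attention in localization can be illustrated with a small sketch, assuming that a sound embedding and a grid of visual feature vectors already live in a common space. The dot-product scoring, the softmax normalization, and all names below are illustrative assumptions rather than the exact formulation of [106] or [119].

```python
import math

def attention_map(sound_vec, visual_grid):
    """Return a softmax-normalized map of how well each location matches the sound."""
    # Dot product between the sound embedding and the feature at each grid cell.
    scores = [[sum(s * v for s, v in zip(sound_vec, cell)) for cell in row]
              for row in visual_grid]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(x - peak) for x in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

# Toy 2x2 grid of 2-D visual features; only the bottom-right cell aligns with the sound.
sound = [1.0, 0.0]
grid = [[[0.0, 1.0], [0.0, 0.5]],
        [[0.1, 0.0], [2.0, 0.0]]]
attn = attention_map(sound, grid)
```

The location with the largest weight in `attn` is taken as the estimated sound source; no location labels are needed to compute the map itself.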

2.2.3 Simultaneous Separation and Localization

Sound source separation and localization can be strongly associated with each other by assigning one modality’s information to another. Therefore, several researchers attempted to perform localization and separation simultaneously. Pu et al. [98] used a low-rank and sparse framework to model the background. The researchers extracted components with sparse correlations between the audio and visual modalities. However, the scenario of this method had a major limitation: it could only be applied to videos with a few sound-generating objects. Therefore, Zhao et al. [137] introduced a system called PixelPlayer that used a two-stream network and presented a mix-and-separate framework to train the entire network. In this framework, audio signals from two different videos were added to produce a mixed signal as input. The input was then fed into the network that was trained to separate the audio source signals based on the corresponding video frames. The two separated sound signals were treated as outputs. The system thus learned to separate individual sources without traditional supervision.
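The mix-and-separate idea can be sketched as a data-construction step: two soundtracks are summed, and the originals become the training targets. The snippet below is a simplified illustration that works on toy waveforms in the time domain and uses an L1 loss; both choices, as well as the function names and placeholder frame identifiers, are our assumptions and not details taken from [137].

```python
def mix_and_separate_example(audio_a, audio_b, frames_a, frames_b):
    """Build one self-supervised training example from two unrelated videos."""
    # The network input is the sample-wise sum of the two soundtracks.
    mixture = [a + b for a, b in zip(audio_a, audio_b)]
    # The original soundtracks serve as free separation targets.
    return {"mixture": mixture,
            "frames": (frames_a, frames_b),
            "targets": (audio_a, audio_b)}

def l1_loss(estimates, targets):
    """Mean absolute error between estimated and target source signals."""
    diffs = [abs(e - t)
             for est, tgt in zip(estimates, targets)
             for e, t in zip(est, tgt)]
    return sum(diffs) / len(diffs)

# Toy waveforms and placeholder frame identifiers (hypothetical values).
guitar = [0.5, -0.25, 0.0, 0.25]
violin = [0.25, 0.25, -0.5, 0.0]
example = mix_and_separate_example(guitar, violin, "guitar_frames", "violin_frames")
```

A separation network would receive `example["mixture"]` together with `example["frames"]` and be penalized by `l1_loss` against `example["targets"]`, so no manual annotation is involved.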

Instead of merely relying on image semantics while ignoring the temporal motion information in the video, Zhao et al. [136] subsequently proposed an end-to-end network called deep dense trajectory to learn the motion information for audio-visual sound separation. Furthermore, due to the lack of training samples, directly separating sound for a single class of instruments tends to lead to overfitting. Therefore, the authors proposed a curriculum strategy, starting by separating sounds from different instruments and proceeding to sounds from the same instrument. This gradual approach provided a good start for the network to converge better on the separation and localization tasks.

The methods of previous studies [98, 137, 136] could only be applied to videos with synchronized audio. Hence, Rouditchenko et al. [102] tried to perform localization and separation tasks using only video frames or sound by disentangling concepts learned by neural networks. The researchers proposed an approach to produce sparse activations that could correspond to semantic categories in the input using the sigmoid activation function during the training stage and softmax activation during the fine-tuning stage. Afterwards, the researchers assigned these semantic categories to intermediate network feature channels using labels available in the training dataset. In other words, given a video frame or a sound, the approach used the category-to-feature-channel correspondence to select a specific type of source or object for separation or localization. Aiming to introduce weak labels to improve performance, Parekh et al. [94] designed an approach based on multiple-instance learning, a well-known strategy for weakly supervised learning.

3 Audio-visual Correspondence Learning

In this section, we introduce several studies that explored the global semantic relation between audio and visual modalities. We name this branch of research “audio-visual correspondence learning”; it consists of 1) the audio-visual matching task and 2) the audio-visual speech recognition task.

3.1 Audio-visual Matching

Biometric authentication, ranging from facial recognition to fingerprint and iris authentication, is a popular topic that has been researched over many years, but evidence shows that such systems can be attacked maliciously. To detect such attacks, recent studies particularly focus on speech antispoofing measures.

Sriskandaraja et al. [111] proposed a network based on a Siamese architecture to evaluate the similarities between pairs of speech samples. The authors of [16] presented a two-stream network, where the first network was a Bayesian neural network assumed to be overfitting, and the second network was a CNN used to improve generalization. Alanis et al. [47] further incorporated LightCNN [133] and a gated recurrent unit (GRU) [29] as a robust feature extractor to represent speech signals in utterance-level analysis to improve performance.

We note that cross-modality matching is a special form of such authentication that has recently been extensively studied. It attempts to learn the similarity between pairs. We divide this matching task into fine-grained voice-face matching and coarse-grained audio-image retrieval.

3.1.1 Voice-Facial Matching

Given facial images of different identities and the corresponding audio sequences, voice-facial matching aims to identify the face that the audio belongs to (the V2F task) or vice versa (the F2V task), as shown in Fig. 3. The key point is finding the embedding between audio and visual modalities. Nagrani et al. [85] proposed using three networks to address the audio-visual matching problem: a static network, a dynamic network, and an N-way network. The static network and the dynamic network could only handle the problem with a specific number of images and audio tracks. The difference was that the dynamic network added to each image temporal information such as the optical flow or a 3D convolution [121, 109]. Based on the static network, the authors increased the number of samples to form an N-way network that was able to solve the identification problem.

However, the correlation between the two modalities was not fully utilized in the above method. Therefore, Wen et al. [131] proposed a disjoint mapping network (DIMNets) to fully use the covariates (e.g., gender and nationality) [63, 76] to bridge the relation between voice and face information. The intuitive assumption was that for a given voice and face pair, the more covariates were shared between the two modalities, the higher the probability of being a match. The main drawback of this framework was that a large number of covariates led to high data costs. Therefore, Hoover et al. [59] suggested a low-cost but robust approach of detection and clustering on audio clips and facial images. For the audio stream, the researchers applied a neural network model to detect speech for clustering and subsequently assigned a frame cluster to the given audio cluster according to the majority principle. Doing so required a small amount of data for pretraining.
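The majority principle mentioned above can be sketched as a simple vote, assuming that each time step already carries an audio-cluster label and a face-cluster label. The per-frame alignment, the use of `None` for frames without speech or without a face, and all names are hypothetical choices made for illustration, not details from [59].

```python
from collections import Counter

def assign_faces_to_voices(audio_cluster_per_frame, face_cluster_per_frame):
    """Map each audio cluster to the face cluster it co-occurs with most often."""
    votes = {}
    for voice, face in zip(audio_cluster_per_frame, face_cluster_per_frame):
        # Skip frames where no speech or no face was detected.
        if voice is None or face is None:
            continue
        votes.setdefault(voice, Counter())[face] += 1
    # Keep the most frequent face cluster for every audio cluster.
    return {voice: counts.most_common(1)[0][0] for voice, counts in votes.items()}

# Toy per-frame cluster labels (hypothetical values).
audio_clusters = ["A", "A", "A", "B", "B", None]
face_clusters = ["f1", "f1", "f2", "f2", "f2", "f1"]
assignment = assign_faces_to_voices(audio_clusters, face_clusters)
```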

To further enhance the robustness of the network, Chung et al. [30] proposed an improved two-stream training method that increased the number of negative samples to improve the error-tolerance rate of the network. The cross-modality matching task, which is essentially a classification task, allows for wide-ranging applications of the triplet loss. However, it is fragile in the case of multiple samples. To overcome this defect, Wang et al. [129] proposed a novel loss function to expand the triplet loss for multiple samples and a new elastic network (called Emnet) based on a two-stream architecture that can tolerate a variable number of inputs to increase the flexibility of the network.
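For reference, the standard triplet loss can be written as max(0, d(a, p) - d(a, n) + margin), where a is an anchor (e.g., a voice embedding), p a matching face embedding, and n a non-matching one. The sketch below implements this formula and, purely as an illustrative assumption, averages it over several negatives; this averaging is not the loss proposed in [129], and the embeddings are made-up toy values.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)

def multi_negative_loss(anchor, positive, negatives, margin=1.0):
    """Naive extension (an assumption): average the triplet loss over all negatives."""
    return sum(triplet_loss(anchor, positive, n, margin) for n in negatives) / len(negatives)

# Toy 2-D embeddings (hypothetical values).
voice = [0.0, 0.0]
true_face = [0.0, 1.0]
far_face = [3.0, 4.0]   # easy negative: already far from the anchor
near_face = [1.0, 0.0]  # hard negative: as close as the positive

easy = triplet_loss(voice, true_face, far_face)
hard = triplet_loss(voice, true_face, near_face)
combined = multi_negative_loss(voice, true_face, [far_face, near_face])
```

The easy negative contributes no loss, while the hard negative violates the margin and drives the update, which is why the choice and number of negative samples matters for this family of methods.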

Figure 3: Demonstration of Audio-to-Image retrieval (a) and Image-to-Audio retrieval (b).

3.1.2 Audio-image Retrieval

The cross-modality retrieval task aims to discover the relationship between different modalities. Given one sample in the source modality, the proposed model can retrieve the corresponding sample with the same identity in the target modality. Taking audio-image retrieval as an example, the aim is to return a relevant piano sound, given a picture of a girl playing a piano. Compared with the previously considered voice and face matching, this task is more coarse-grained.

Unlike other retrieval tasks such as the text-image task [110, 78, 100] or the sound-text task [12], the audio-visual retrieval task mainly focuses on subspace learning. Didac et al. [114] proposed a new joint embedding model that mapped two modalities into a joint embedding space, and then directly calculated the Euclidean distance between them. The authors leveraged cosine similarity to ensure that the two modalities in the same space were as close as possible while not overlapping. Note that the designed architecture would have a large number of parameters due to the existence of a large number of fully connected layers.
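Once both modalities are mapped into a joint space, retrieval reduces to a nearest-neighbor search. The following sketch ranks audio candidates by Euclidean distance to an image query; the embeddings are made-up toy vectors, and the function and variable names are hypothetical rather than taken from [114].

```python
import math

def retrieve(query_embedding, gallery_embeddings, top_k=1):
    """Return indices of the gallery items closest to the query in the joint space."""
    ranked = sorted(range(len(gallery_embeddings)),
                    key=lambda i: math.dist(query_embedding, gallery_embeddings[i]))
    return ranked[:top_k]

# Toy joint-space embeddings: one image query and three candidate audio clips.
image_query = [0.9, 0.1]
audio_gallery = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
best = retrieve(image_query, audio_gallery, top_k=2)
```

The same function serves audio-to-image retrieval by swapping the roles of query and gallery.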

Hong et al. [58] proposed a joint embedding model that relied on pretrained networks and used CNNs to replace fully connected layers to reduce the number of parameters to some extent. The video and music were fed to the pretrained network and then aggregated, followed by a two-stream network trained via the intermodal ranking loss. In addition, to preserve modality-specific characteristics, the researchers proposed a novel soft intramodal structure loss. However, the resulting network was very complex and difficult to apply in practice. To solve this problem, Arsha et al. [84] proposed a cross-modality self-supervised method to learn the embedding of audio and visual information from a video and significantly reduced the complexity of the network. For sample selection, the authors designed a novel curriculum learning schedule to further improve performance. In addition, the resulting joint embedding could be efficiently and effectively applied in practical applications.

| Category | Method | Ideas & Strengths | Weaknesses |
| --- | --- | --- | --- |
| Voice-Face Matching | Nagrani et al. [85] | The method is novel and incorporates dynamic information | As the sample size increases, the accuracy decreases excessively |
| Voice-Face Matching | Wen et al. [131] | The correlation between modes is utilized | Dataset acquisition is difficult |
| Voice-Face Matching | Wang et al. [128] | Can deal with multiple samples; can change the size of input | Static image only; model complexity |
| Voice-Face Matching | Hoover et al. [59] | Easy to implement | Cannot handle large-scale data |
| Audio-visual Retrieval | Hong et al. [58] | Preserve modality-specific characteristics; soft intra-modality structure loss | Complex network |
| Audio-visual Retrieval | Didac et al. [114] | Metric learning; using fewer parameters | Only two faces; static images |
| Audio-visual Retrieval | Arsha et al. [84] | Curriculum learning; applied value; low data cost | Low accuracy for multiple samples |
| Audio-visual Speech Recognition | Petridis et al. [95] | Simultaneously obtain feature and classification | Lack of audio information |
| Audio-visual Speech Recognition | Wand et al. [126] | Simple method | |
| Audio-visual Speech Recognition | Shillingford et al. [10] | CTC loss | No audio information |
| Audio-visual Speech Recognition | Chung et al. [26] | Audio and visual information; LRS dataset | Noise is not considered |
| Audio-visual Speech Recognition | Trigeorgis et al. [122] | Audio information; the algorithm is robust | Noise is not considered |
| Audio-visual Speech Recognition | Afouras et al. [2] | Study noise in audio; LRS2-BBC dataset | Complex network |

Table 2: Summary of audio-visual correspondence learning.
Figure 4: Demonstration of audio-visual speech recognition.

3.2 Audio-visual Speech Recognition

The recognition of content of a given speech clip has been studied for many years, yet despite great achievements, researchers are still aiming for satisfactory performance in challenging scenarios. Due to the correlation between audio and vision, combining these two modalities tends to offer more prior information. For example, one can predict the scene where the conversation took place, which provides a strong prior for speech recognition, as shown in Fig. 4.

Earlier efforts on audio-visual fusion models usually consisted of two steps: 1) extracting features from the image and audio signals and 2) combining the features for joint classification [37, 96, 97]. Later, taking advantage of deep learning, feature extraction was replaced with a neural network encoder [60, 87, 88]. Several recent studies have shown a tendency to use an end-to-end approach to visual speech recognition. These studies can be mainly divided into two groups. They either leverage the fully connected layers and LSTM to extract features and model the temporal information [95, 126] or use a 3D convolutional layer followed by a combination of CNNs and LSTMs [10, 112]. Instead of adopting a two-step strategy, Petridis et al. [95] introduced an audio-visual fusion model that simultaneously extracted features directly from pixels and spectrograms and performed classification of speech and nonlinguistic vocalizations. Furthermore, temporal information was extracted by a bidirectional LSTM. Although this method could perform feature extraction and classification at the same time, it still followed the two-step strategy.

To this end, Wand et al. [126] presented a word-level lip-reading system using LSTM. In contrast to previous methods, Assael et al. [10] proposed a novel end-to-end LipNet model based on sentence-level sequence prediction, which consisted of spatial-temporal convolutions, a recurrent network and a model trained via the connectionist temporal classification (CTC) loss. Experiments showed that lip-reading outperformed the two-step strategy.
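A model trained with the CTC loss emits one label per frame, including a special blank symbol, and the final transcript is obtained by merging repeated labels and then removing blanks. The sketch below shows only this greedy decoding rule, not the loss itself or the LipNet architecture; the blank symbol `_` and the toy frame labels are our own choices.

```python
from itertools import groupby

BLANK = "_"  # placeholder blank symbol chosen for this illustration

def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeats, then drop blanks (the CTC decoding rule)."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)

# Twelve per-frame predictions; the blank between the two "l" runs keeps both letters.
frames = list("hh_e_ll_lloo")
text = ctc_greedy_decode(frames)
```

Because many frame-level paths collapse to the same transcript, the model does not need frame-aligned labels during training, which suits sentence-level lip-reading.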

However, the limited information in the visual modality may lead to a performance bottleneck. To combine both audio and visual information for various scenes, especially in noisy conditions, Trigeorgis et al. [122] introduced an end-to-end model to obtain a ‘context-aware’ feature from the raw temporal representation.

Chung et al. [26] presented a “Watch, Listen, Attend, and Spell” (WLAS) network to explain the influence of audio on the recognition task. The model took advantage of the dual attention mechanism and could operate on a single or combined modality. To speed up the training and avoid overfitting, the researchers also used a curriculum learning strategy. To analyze an “in-the-wild” dataset, Cui et al. [89] proposed another model based on residual networks and a bidirectional GRU [29]. However, the authors did not take the ubiquitous noise in the audio into account. To solve this problem, Afouras et al. [2] proposed a model for performing speech recognition tasks. The researchers compared two common sequence prediction types: connectionist temporal classification and sequence-to-sequence (seq2seq) methods in their models. In the experiment, they observed that the model using seq2seq could perform better according to word error rate (WER) when it was only provided with silent videos. For pure-audio or audio-visual tasks, the two methods behaved similarly. In a noisy environment, the performance of the seq2seq model was worse than that of the corresponding CTC model, suggesting that the CTC model could better handle background noises.
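The word error rate used in such comparisons is the word-level edit distance between the recognized and reference transcripts, divided by the number of reference words: WER = (S + D + I) / N, with S substitutions, D deletions, and I insertions. A minimal sketch follows; the whitespace tokenization and the example sentences are our own simplifying assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Lower values are better, and the measure can exceed 1 when the hypothesis contains many inserted words.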

4 Audio and Visual Generation

The previously introduced retrieval task shows that the trained model is able to find the most similar audio or visual counterpart. Humans can imagine the scenes corresponding to sounds, and vice versa, and researchers have tried to endow machines with this kind of imagination for many years. Following the invention and advances of generative adversarial networks (GANs) [48], image or video generation has emerged as a topic. It involves several subtasks, including generating images or video from a latent space [9], cross-modality generation [22, 140], etc. These applications are also relevant to other tasks, e.g., domain adaptation [130, 62]. Due to the difference between audio and visual modalities, the latent correlation between them is nonetheless difficult for machines to discover. Generating sound from a visual signal or vice versa, therefore, becomes a challenging task.

In this section, we will mainly review the recent development of audio and visual generation, i.e., generating audio from visual signals or vice versa. Visual signals here mainly refer to images, motion dynamics, and videos. The subsection ‘Visual to Audio’ mainly focuses on recovering the speech from the video of the lip area (Fig. 5 (a)) or generating sounds that may occur in the given scenes (Fig. 5 (b)). In contrast, the discussion of ‘Audio to Visual’ generation will examine generating images from a given audio (Fig. 6 (a)), body motion generation (Fig. 6 (b)), and talking face generation (Fig. 6 (c)).

(a) Demonstration of generating speech from lip sequences
(b) Demonstration of video-to-audio generation
Figure 5: Demonstration of visual-to-audio generation.
| Category | Method | Ideas & Strengths | Weaknesses |
|---|---|---|---|
| Lip sequence to Speech | Cornu et al. [32] | Reconstruct intelligible speech only from visual speech features | Applied to limited scenarios |
| Lip sequence to Speech | Ephrat et al. [38] | Compute optical flow between frames | Applied to limited scenarios |
| Lip sequence to Speech | Cornu et al. [118] | Reconstruct speech using a classification approach combined with feature-level temporal information | Cannot apply to real-time conversational speech |
| General Video to Audio | Davis et al. [35] | Recover real-world audio by capturing vibrations of objects | Requires a specific device; can only be applied to soft objects |
| General Video to Audio | Owens et al. [92] | Use LSTM to capture the relation between material and motion | For a lab-controlled environment only |
| General Video to Audio | Zhou et al. [139] | Leverage a hierarchical RNN to generate in-the-wild sounds | Monophonic audio only |
| General Video to Audio | Morgado et al. [82] | Localize and separate sounds to generate spatial audio from 360° video | Fails sometimes; 360° video required |
Table 3: Summary of recent approaches to video-to-audio generation.

4.1 Vision-to-Audio Generation

Many methods have been explored to extract audio information from visual information, including predicting sounds from visually observed vibrations and generating audio via a video signal. We divide the visual-to-audio generation tasks into two categories: generating speech from lip video and synthesizing sounds from general videos without scene limitations.

4.1.1 Lip Sequence to Speech

There is a natural relationship between speech and lips. Apart from understanding the content of speech by observing lips (lip-reading), several studies have tried to reconstruct speech by observing lips. Cornu et al. [32] attempted to predict the spectral envelope from visual features, combine it with artificial excitation signals, and synthesize audio signals in a speech production model. Ephrat et al. [40] proposed an end-to-end CNN-based model that generates audio features for each silent video frame from its adjacent frames. The waveform was then reconstructed from the learned features to produce intelligible speech.

Using temporal information to improve speech reconstruction has been extensively explored. Ephrat et al. [38] proposed leveraging the optical flow to capture the temporal motion at the same time. Cornu et al. [118] leveraged recurrent neural networks to incorporate temporal information into the prediction.

4.1.2 General Video to Audio

When a sound hits the surfaces of some small objects, the latter will vibrate slightly. Davis et al. [35] utilized this specific feature to recover the sound from vibrations observed passively by a high-speed camera. Note that suitable objects must vibrate easily, as is the case for a glass of water, a potted plant, or a box of napkins. We argue that this work is similar to the previously introduced speech reconstruction studies [32, 40, 38, 118] since all of them use the relation between visual and sound context. In speech reconstruction, the visual part concentrates more on lip movement, while in this work, it focuses on small vibrations.

Owens et al. [92] observed that when different materials were hit or scratched, they emitted a variety of sounds. Thus, the researchers introduced a model that learned to synthesize sound from a video in which objects made of different materials were hit with a drumstick at different angles and velocities. The researchers demonstrated that their model could not only identify different sounds originating from different materials but also learn the pattern of interaction with objects (different actions applied to objects result in different sounds). The model leveraged an RNN to extract sound features from video frames and subsequently generated waveforms through an instance-based synthesis process.

Although Owens et al. [92] could generate sound from various materials, the authors’ approach still could not be applied to real-life applications since the network was trained on videos shot in a lab environment under strict constraints. To improve the result and generate sounds from in-the-wild videos, Zhou et al. [139] designed an end-to-end model. It was structured as a video encoder and a sound generator that together learn the mapping from video frames to sounds. For sound generation, the network leveraged a hierarchical RNN [81]. Specifically, the authors trained a model to directly predict raw audio signals (waveform samples) from input videos. They demonstrated that this model could learn the correlation between sound and visual input for various scenes and object interactions.

The previous efforts we have mentioned focused on monophonic audio generation, while Morgado et al. [82] attempted to convert monophonic audio recorded by a 360° video camera into spatial audio. Performing such an audio spatialization task requires addressing two primary issues: source separation and localization. Therefore, the researchers designed a model to separate the sound sources from mixed-input audio and then localize them in the video. Another multimodality model was used to guide the separation and localization since the audio and video were complementary.

4.2 Audio to Vision

In this section, we provide a detailed review of audio-to-visual generation. We first introduce audio-to-image generation, which is easier than video generation since it does not require temporal consistency between the generated images.

4.2.1 Audio to Image

To generate images of better quality, Wan et al. [125] put forward a conditional GAN model that combined the spectral norm, an auxiliary classifier, and a projection discriminator. The model could output images of different scales according to the volume of the sound, even for the same sound. Instead of generating real-world scenes of the sound that had occurred, Qiu et al. [99] suggested imagining the content from music. The authors extracted features by feeding the music and images into two networks, learned the correlation between those features, and finally generated images from the learned correlation.

Several studies have focused on audio-visual mutual generation. Chen et al. [22] were the first to attempt to solve this cross-modality generation problem using conditional GANs. The researchers defined a sound-to-image (S2I) network and an image-to-sound (I2S) network that generated images and sounds, respectively. Instead of separating S2I and I2S generation, Hao et al. [52] combined the respective networks into one network by considering a cross-modality cyclic generative adversarial network (CMCGAN) for the cross-modality visual-audio mutual generation task. Following the principle of cyclic consistency, CMCGAN consisted of four subnetworks: audio-to-visual, visual-to-audio, audio-to-audio, and visual-to-visual.
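The cyclic-consistency principle mentioned above can be illustrated with a small sketch. The source does not give the CMCGAN loss, so the L1 penalty, the function name `cycle_l1`, and the toy linear "generators" below are assumptions chosen only to show the idea: mapping a signal to the other modality and back should return something close to the original.

```python
def cycle_l1(x, forward, backward):
    """Mean absolute error between x and backward(forward(x))."""
    x_cycled = backward(forward(x))
    return sum(abs(a - b) for a, b in zip(x, x_cycled)) / len(x)


# Toy stand-ins for trained generators (hypothetical, not CMCGAN itself).
def audio_to_visual(a):
    return [2.0 * t for t in a]


def visual_to_audio(v):
    return [0.5 * t for t in v]


def biased_visual_to_audio(v):
    return [0.5 * t + 0.25 for t in v]


audio = [0.1, -0.3, 0.7]
consistent = cycle_l1(audio, audio_to_visual, visual_to_audio)
inconsistent = cycle_l1(audio, audio_to_visual, biased_visual_to_audio)
```

In this toy setting the first pair of mappings inverts exactly, so its cycle penalty is zero, while the biased mapping is penalized by its offset. In an actual model such a term would typically be added to the adversarial losses rather than used alone.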

Most recently, some studies have tried to reconstruct facial images from speech clips. Duarte et al. [36] synthesized facial images containing expressions and poses through the GAN model. Moreover, the authors enhanced their model’s generation quality by searching for the optimal input audio length. To better learn normalized faces from speech, Oh et al. [90] explored a reconstructive model. The researchers trained an audio encoder by learning to align the feature space of speech with a pretrained face encoder and decoder.

(a) Demonstration of audio-to-images generation.
(b) Demonstration of a moving body.
(c) Demonstration of a talking face.
Figure 6: Demonstration of talking face generation and moving body generation.
| Category | Method | Ideas & Strengths | Weaknesses |
|---|---|---|---|
| Audio to Image | Wan et al. [125] | Combined many existing techniques to form a GAN | Low quality |
| Audio to Image | Qiu et al. [99] | Generated images related to music | Low quality |
| Audio to Image | Chen et al. [22] | Generated both audio-to-visual and visual-to-audio models | The models were |
| Audio to Image | Hao et al. [52] | Proposed a cross-modality cyclic generative adversarial network | Generated images only |
| Audio to Motions | Alemi et al. [4] | Generated dance movements from music via real-time GrooveNet | |
| Audio to Motions | Lee et al. [73] | Generated a choreography system via an autoregressive encoder-decoder network | |
| Audio to Motions | Shlizerman et al. [108] | Applied a “target delay” LSTM to predict body keypoints | Constrained to the given dataset |
| Audio to Motions | Tang et al. [116] | Developed a music-oriented dance choreography synthesis method | |
| Audio to Motions | Yalta et al. [134] | Produced weak labels from motion directions for motion-music alignment | |
| Talking Face | Kumar et al. [72] and Supasorn et al. [115] | Generated keypoints by a time-delayed LSTM | Needed retraining for another identity |
| Talking Face | Chung et al. [24] | Developed an encoder-decoder CNN model suitable for more identities | |
| Talking Face | Jalalifar et al. [66] | Combined RNN and GAN and applied keypoints | For a lab-controlled environment only |
| Talking Face | Vougioukas et al. [124] | Applied a temporal GAN for more temporal consistency | |
| Talking Face | Chen et al. [21] | Applied optical flow | Generated lips only |
| Talking Face | Zhou et al. [138] | Disentangled information | Lacked realism |
| Talking Face | Zhu et al. [140] | Asymmetric mutual information estimation to capture modality coherence | Suffered from the “zoom-in-and-out” condition |
| Talking Face | Chen et al. [23] | Dynamic pixelwise loss | Required multistage |
| Talking Face | Wiles et al. [132] | Self-supervised model for multimodality driving | Relatively low quality |
Table 4: Summary of recent studies of audio-to-visual generation.

4.2.2 Body Motion Generation

Instead of directly generating videos, numerous studies have tried to animate avatars using motions. The motion synthesis methods leveraged multiple techniques, such as dimensionality reduction [104, 120], hidden Markov models [18], Gaussian processes [127], and neural networks [117, 33, 57].

Alemi et al. [4] proposed a real-time GrooveNet based on conditional restricted Boltzmann machines and recurrent neural networks to generate dance movements from music. Lee et al. [73] utilized an autoregressive encoder-decoder network to build a system that generates choreography from music. Shlizerman et al. [108] further introduced a model that used a “target delay” LSTM to predict body landmarks. These landmarks were then used to drive the generation of body dynamics. The key idea was to create an animation from the audio that was similar to the action of a pianist or a violinist. In summary, the entire process generated a video of an artist’s performance corresponding to the input audio.

Although previous methods could generate body motion dynamics, the intrinsic beat information of the music had not been used. Tang et al. [116] proposed a music-oriented dance choreography synthesis method that extracted a relation between acoustic and motion features via an LSTM-autoencoder model. Moreover, to achieve better performance, the researchers improved their model with a masking method and temporal indexes. Providing weak supervision, Yalta et al. [134] explored producing weak labels from motion direction for motion-music alignment. The authors generated long dance sequences via a conditional autoconfigured deep RNN that was fed with the audio spectrum.

4.2.3 Talking Face Generation

Exploring audio-to-video generation, many researchers showed great interest in synthesizing people’s faces from speech or music. This has many applications, such as animating movies, teleconferencing, talking agents and enhancing speech comprehension while preserving privacy. Earlier studies of talking face generation mainly synthesized a specific identity from the dataset based on an audio of arbitrary speech. Kumar et al. [72] attempted to generate key points synced to audio by utilizing a time-delayed LSTM [49] and then generated the video frames conditioned on the key points by another network. Furthermore, Supasorn et al. [115] proposed a “teeth proxy” to improve the visual quality of teeth during generation.

Subsequently, Chung et al. [24] attempted to use an encoder-decoder CNN model to learn the correspondences between raw audio and videos. Combining RNN and GAN [48], Jalalifar et al. [66] produced a sequence of realistic faces that were synchronized with the input audio using two networks. One was an LSTM network used to create lip landmarks out of audio input. The other was a conditional GAN (cGAN) used to generate the resulting faces conditioned on a given set of lip landmarks. Instead of applying cGAN, Vougioukas et al. [124] proposed using a temporal GAN [103] to improve the quality of synthesis. However, the above methods were only applicable to synthesizing talking faces with identities limited to those in a dataset.

Synthesis of talking faces of arbitrary identities has recently drawn significant attention. Chen et al. [21] considered correlations between speech and lip movements while generating multiple lip images. The researchers used the optical flow to better express the information between the frames. The optical flow fed to the network represented not only the information of the current shape but also the previous temporal information.

A frontal face photo usually has both identity and speech information. Assuming this, Zhou et al. [138] used an adversarial learning method to disentangle different types of information of one image during generation. The disentangled representation had a convenient property that both audio and video could serve as the source of speech information for the generation process. As a result, it was possible to not only output the features but also express them more explicitly while applying the resulting network.

Most recently, to discover the high-level correlation between audio and video, Zhu et al. [140] proposed a method for approximating the mutual information between modalities. Chen et al. [23] applied landmark and motion attention to generating talking faces. The authors further proposed a dynamic pixelwise loss for temporal consistency. Facial generation is not limited to specific modalities such as audio or visual since the crucial point is whether there is a mutual pattern between these different modalities. Wiles et al. [132] put forward a self-supervised framework called X2Face to learn the embedded features and generate target facial motions. It could produce videos from any input as long as embedded features were learned.

5 Audio-visual Representation Learning

| Type | Method | Ideas & Strengths | Weaknesses |
|---|---|---|---|
| Single modality | Aytar et al. [11] | Student-teacher training procedure with natural video synchronization | Only learned the audio representation |
| Dual modality | Leidal et al. [74] | Regularized the amount of information encoded in the semantic embedding | Focused on spoken utterances and handwritten digits |
| Dual modality | Arandjelovic et al. [7, 8] | Proposed the AVC task | Considered only audio and video correspondence |
| Dual modality | Korbar et al. [69] | Proposed the AVTS task with curriculum learning | The sound source has to feature in the video; only one sound source |
| Dual modality | Parekh et al. [93] | Use video labels for weakly supervised learning | Leverage the prior knowledge of event classification |
| Dual modality | Hu et al. [61] | Disentangle each modality into a set of distinct components | Require a predefined number of clusters |
Table 5: Summary of recent audio-visual representation learning studies.

Representation learning aims to discover the pattern representation from data automatically. It is motivated by the fact that the choice of data representation usually greatly impacts performance of machine learning [14]. However, real-world data such as images, videos and audio are not amenable to defining specific features algorithmically.

Additionally, the quality of data representation usually determines the success of machine learning algorithms. Bengio et al. [14] attributed this to the fact that different representations can better explain the laws underlying the data, and the recent enthusiasm for AI has motivated the design of more powerful representation learning algorithms that capture such priors.

In this section, we will review a series of audio-visual learning methods ranging from single-modality [11] to dual-modality representation learning [8, 7, 69, 74, 61]. The basic pipeline of such studies is shown in Fig. 7.

Figure 7: Basic pipeline of representation learning.

5.1 Single-Modality Representation Learning

Naturally, to determine whether audio and video are related to each other, researchers focus on determining whether audio and video are from the same video or whether they are synchronized in the same video. Aytar et al. [11] exploited the natural synchronization between video and sound to learn an acoustic representation of a video. The researchers proposed a student-teacher training process that used unlabeled video as a bridge to transfer discriminative knowledge from a sophisticated visual recognition model to the sound modality. Although the proposed approach managed to learn an audio-modality representation in an unsupervised manner, discovering audio and video representations simultaneously remained to be solved.

5.2 Learning an Audio-visual Representation

In corresponding audio and images, modality-specific information tends to be noisy, whereas only the semantic content is required rather than the exact visual content. Leidal et al. [74] explored unsupervised learning of a semantic embedding space in which related audio and images are required to have close distributions. The researchers proposed a model that maps an input to two vectors, the mean and the logarithm of the variance of a diagonal Gaussian distribution, and the semantic embeddings are then sampled from this distribution.
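To make the sampling step concrete, the sketch below draws an embedding from a diagonal Gaussian given its mean and log-variance vectors. It assumes the reparameterized form z = mean + std * eps that is common in variational models; the source only states that embeddings are sampled from the predicted distribution, and the name `sample_embedding` is illustrative.

```python
import math
import random


def sample_embedding(mean, log_var, rng):
    """Draw one sample from N(mean, diag(exp(log_var))).

    Uses z = mean + std * eps with eps ~ N(0, 1), where
    std = exp(0.5 * log_var) because log_var is the log of the variance.
    """
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


rng = random.Random(0)
# Hypothetical encoder outputs for a 3-dimensional embedding.
z = sample_embedding([0.2, -1.0, 0.5], [0.0, -2.0, 1.0], rng)
```

Predicting the log-variance rather than the variance keeps the standard deviation positive without any explicit constraint on the encoder output.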

To learn the semantic information of audio and video by simply watching and listening to a large number of unlabeled videos, Arandjelovic et al. [7] introduced an audio-visual correspondence (AVC) learning task for training two (visual and audio) networks from scratch, as shown in Fig. 8 (a). In this task, the corresponding audio and visual pairs (positive samples) were obtained from the same video, while mismatched (negative) pairs were extracted from different videos. To solve this task, the authors proposed an L3-Net that detected whether the semantics in visual and audio fields were consistent. Although this model was trained without additional supervision, it could learn representations of dual modalities effectively.
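The construction of AVC training pairs described above can be sketched as follows. This is an illustrative sampler only: the function name `make_avc_pairs`, the use of video IDs in place of actual frames and audio clips, and the 50/50 split between positives and negatives are assumptions, not details from [7].

```python
import random


def make_avc_pairs(video_ids, num_pairs, seed=0):
    """Return (frame_video_id, audio_video_id, label) triples.

    label is 1 when the frame and the audio come from the same video
    (positive pair) and 0 when they come from different videos (negative).
    """
    if len(set(video_ids)) < 2:
        raise ValueError("need at least two distinct videos to form negatives")
    rng = random.Random(seed)
    pairs = []
    for i in range(num_pairs):
        frame_video = rng.choice(video_ids)
        if i % 2 == 0:
            # Positive: audio taken from the same video as the frame.
            pairs.append((frame_video, frame_video, 1))
        else:
            # Negative: audio taken from any other video.
            others = [v for v in video_ids if v != frame_video]
            pairs.append((frame_video, rng.choice(others), 0))
    return pairs


pairs = make_avc_pairs(["vid_a", "vid_b", "vid_c"], num_pairs=10)
```

Because the labels come for free from the pairing itself, no manual annotation is needed, which is what allows the task to be trained on unlabeled video.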

Exploring the proposed audio-visual correspondence (AVC) task, Arandjelovic et al. [8] continued to investigate AVE-Net, which aimed at finding the visual area most similar to the current audio clip. Owens et al. [91] proposed adopting a model similar to that of [7] but used a 3D convolutional network for the videos instead, which could capture the motion information for sound localization.

In contrast to previous AVC task-based solutions, Korbar et al. [69] introduced another proxy task called audio-visual time synchronization (AVTS) that further considered whether a given audio sample and video clip were “synchronized” or “not synchronized.” In previous AVC tasks, negative samples were obtained as audio and visual samples from different videos. However, exploring AVTS, the researchers trained the model using “harder” negative samples representing unsynchronized audio and visual segments sampled from the same video, forcing the model to learn the relevant temporal features. In this way, not only was the semantic correspondence between the video and the audio enforced, but, more importantly, the synchronization between them was also achieved. The researchers applied the curriculum learning strategy [15] to this task and divided the samples into four categories: positives (the corresponding audio-video pairs), easy negatives (audio and video clips originating from different videos), difficult negatives (audio and video clips originating from the same video without overlap), and super-difficult negatives (audio and video clips that partly overlap), as shown in Fig. 8 (b).
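The four sample categories listed above can be expressed as a small classification rule. The sketch assumes that the audio and video clips have the same length, that start times are given in seconds, and that a positive pair is one with identical start times in the same video; the function name `avts_category` and the category strings are illustrative.

```python
def avts_category(audio_video_id, frame_video_id, audio_start, frame_start, clip_len):
    """Classify an (audio clip, video clip) pair for AVTS-style training."""
    if audio_video_id != frame_video_id:
        # Clips from different videos.
        return "easy_negative"
    offset = abs(audio_start - frame_start)
    if offset == 0:
        # Same video, same time span: synchronized.
        return "positive"
    if offset < clip_len:
        # Same video, time spans partly overlap.
        return "super_difficult_negative"
    # Same video, time spans do not overlap.
    return "difficult_negative"
```

A curriculum could then, for instance, begin training with positives and easy negatives and introduce the harder categories later; the exact schedule is not stated in the text above, so this ordering is only one plausible reading.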

(a) Introduction to the AVC task
(b) Introduction to the AVTS task
Figure 8: Introduction to the representation task
Figure 9: Demonstration of audio-visual datasets.

The above studies rely on two latent assumptions: 1) the sound source should be present in the video, and 2) only one sound source is expected. However, these assumptions limit the applications of the respective approaches to real-life videos. Therefore, Parekh et al. [93] leveraged class-agnostic proposals from both video frames and audio to model the problem as a multiple-instance learning task. As a result, the classification and localization problems could be solved simultaneously. The researchers focused on localizing salient audio and visual components using event classes in a weakly supervised manner. This framework was able to deal with the difficult case of asynchronous audio-visual events. To leverage more detailed relations between modalities, Hu et al. [61] recommended a deep coclustering model that extracted a set of distinct components from each modality. The model continually learned the correspondence between such representations of different modalities. The authors further introduced K-means clustering to distinguish concrete objects or sounds.

6 Recent Public Audio-visual Datasets

Many audio-visual datasets ranging from speech- to event-related data have been collected and released. We divide datasets into two categories: audio-visual speech datasets that record human faces with the corresponding speech, and audio-visual event datasets that consist of musical instrument videos and videos of real events. In this section, we summarize the information of recent audio-visual datasets (Table 6).

6.1 Audio-visual Speech Datasets

Constructing datasets containing audio-visual corpora is crucial to understanding audio-visual speech. The datasets are collected in lab-controlled environments where volunteers read the prepared phrases or sentences, or in-the-wild environments of TV interviews or talks.

6.1.1 Lab-controlled Environment

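| Category | Dataset | Env. | Classes | Length* | Year |
|---|---|---|---|---|---|
| Speech | GRID [31] | Lab | 34 | 33,000 | 2006 |
| Speech | Lombard Grid [5] | Lab | 54 | 54,000 | 2018 |
| Speech | TCD TIMIT [53] | Lab | 62 | - | 2015 |
| Speech | Vid TIMIT [105] | Lab | 43 | - | 2009 |
| Speech | RAVDESS [77] | Lab | 24 | - | 2018 |
| Speech | SEWA [70] | Lab | 180 | - | 2017 |
| Speech | OuluVS [135] | Lab | 20 | 1000 | 2009 |
| Speech | OuluVS2 [6] | Lab | 52 | 3640 | 2016 |
| Speech | Voxceleb [86] | Wild | 1,251 | 154,516 | 2017 |
| Speech | Voxceleb2 [25] | Wild | 6,112 | 1,128,246 | 2018 |
| Speech | LRW [27] | Wild | 1000 | 500,000 | 2016 |
| Speech | LRS [26] | Wild | 1000 | 118,116 | 2017 |
| Speech | LRS3 [28] | Wild | 1000 | 74,564 | 2017 |
| Speech | AVA-ActiveSpeaker [101] | Wild | - | 90,341 | 2019 |
| Music | C4S [13] | Lab | - | 4.5 | 2017 |
| Music | ENST-Drums [46] | Lab | - | 3.75 | 2006 |
| Music | URMP [75] | Lab | - | 1.3 | 2019 |
| Real Event | YouTube-8M [1] | Wild | 3862 | 350,000 | 2016 |
| Real Event | AudioSet [45] | Wild | 632 | 4971 | 2016 |
| Real Event | Kinetics-400 [68] | Wild | 400 | 850* | 2018 |
| Real Event | Kinetics-600 [19] | Wild | 600 | 1400* | 2018 |
| Real Event | Kinetics-700 [20] | Wild | 700 | 1806* | 2018 |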
Table 6: Summary of speech-related audio-visual datasets. These datasets can be used for all tasks related to speech we have mentioned above. Note that the length of a ‘speech’ dataset denotes the number of video clips, while for ‘music’ or ’real event’ datasets, the length represents the total number of hours of the dataset.

Lab-controlled speech datasets are captured in specific environments, where volunteers are required to read the given phrases or sentences. Some of the datasets only contain videos of speakers uttering the given sentences; these datasets include GRID [31], TCD TIMIT [53], and VidTIMIT [105]. Such datasets can be used for lip reading, talking face generation, and speech reconstruction. Development of more advanced datasets has continued: e.g., Livingstone et al. offered the RAVDESS dataset [77], which contains emotional speech and songs. The items in it are also rated according to emotional validity, intensity and authenticity.

Some datasets such as Lombard Grid [5] and OuluVS [135, 6] focus on multiview videos. In addition, the SEWA dataset [70] offers rich annotations, including answers to a questionnaire, facial landmarks, low-level descriptor (LLD) features, hand gestures, head gestures, transcript, valence, arousal, liking or disliking, template behaviors, episodes of agreement or disagreement, and episodes of mimicry.

6.1.2 In-the-wild Environment

The above datasets were collected in lab environments; as a result, models trained on those datasets are difficult to apply in real-world scenarios. Thus, researchers have tried to collect real-world videos from TV interviews, talks and movies and have released several real-world datasets, including LRW and its variants [27, 26, 28], VoxCeleb and its variants [86, 25], AVA-ActiveSpeaker [101] and AVSpeech [39]. The LRW dataset consists of 500 sentences [27], while its variants contain 1000 sentences [26, 28], all of which were spoken by hundreds of different speakers. VoxCeleb and its variant contain over 100,000 utterances of 1,251 celebrities [86] and over a million utterances of 6,112 identities [25], respectively.

The AVA-ActiveSpeaker [101] and AVSpeech [39] datasets contain even more videos. The AVA-ActiveSpeaker [101] dataset consists of 3.65 million human-labeled video frames (approximately 38.5 hrs). The AVSpeech [39] dataset contains approximately 4700 hours of video segments from a total of 290k YouTube videos spanning a wide variety of people, languages, and face poses. The details are reported in Table 6.

6.2 Audio-visual Event Datasets

Another audio-visual dataset category consists of music or real-world event videos. These datasets are different from the aforementioned audio-visual speech datasets in not being limited to facial videos.

6.2.1 Music-related Datasets

Most music-related datasets were constructed in the lab environment. For example, ENST-Drums [46] merely contains drum videos of three professional drummers specializing in different music genres. The C4S dataset [13] consists of 54 videos of 9 distinct clarinetists, each performing 3 different classical music pieces twice (4.5h in total).

The URMP [75] dataset contains a number of multi-instrument musical pieces. However, these videos were recorded separately and then combined. To simplify the use of the URMP dataset, Chen et al. further proposed the Sub-URMP [22] dataset that contains multiple video frames and audio files extracted from the URMP dataset.

6.2.2 Real Events-related Datasets

More and more real-world audio-visual event datasets have recently been released that consist of numerous videos uploaded to the Internet. The datasets often comprise hundreds or thousands of event classes and the corresponding videos. Representative datasets include the following.

Kinetics-400 [68], Kinetics-600 [19] and Kinetics-700 [20] contain 400, 600 and 700 human action classes with at least 400, 600, and 600 video clips for each action, respectively. Each clip lasts approximately 10 s and is taken from a distinct YouTube video. The actions cover a broad range of classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. The AVA-Actions dataset [50] densely annotated 80 atomic visual actions in 430 15-minute movie clips, where actions were localized in space and time, resulting in 1.58M action labels with multiple labels corresponding to a certain person.

AudioSet [45], a more general dataset, consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips. The clips were extracted from YouTube videos and cover a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. YouTube-8M [1] is a large-scale labeled video dataset that consists of millions of YouTube video IDs with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities.

7 Discussion

Audio-visual learning (AVL) is a foundation of the multimodality problem that integrates the two most important perceptions of our daily life. Despite great efforts focused on AVL, there is still a long way to go for real-life applications. In this section, we briefly discuss the key challenges and the potential research directions in each category.

7.1 Challenges

The heterogeneous nature of the two modalities determines the inherent challenges of AVL. Audio tracks use electrical voltage levels to represent analog signals, while the visual modality is usually represented in the RGB color space; the large gap between the two poses a major challenge to AVL. The essence of this problem is to understand the relation between audio and vision, which is also the basic challenge of AVL.

Audio-visual separation and localization is a longstanding problem in many real-life applications. Despite the previous advances in speaker-related and, more recently, object-related separation and localization, the main challenges remain distinguishing the timbre of various objects and finding ways of generating the sounds of different objects. Addressing these challenges requires us to carefully design the models or ideas (e.g., the attention mechanism) for dealing with different objects.

Audio-visual correspondence learning has vast potential applications, such as those in criminal investigations, medical care, transportation, and other industries. Many studies have tried to map different modalities into a shared feature space. However, it is challenging to obtain satisfactory results since extracting clear and effective information from ambiguous input and target modalities remains difficult. Therefore, sufficient prior information (the specific patterns people usually focus on) has a significant impact on obtaining more accurate results.

Audio and vision generation focuses on empowering machines with imagination. In contrast to the conventional discriminative problem, the task of cross-modality generation is to fit a mapping between probability distributions. Therefore, it is usually a many-to-many mapping problem that is difficult to learn. Moreover, despite the large difference between audio and visual modalities, humans are sensitive to the difference between real-world and generated results, and subtle artifacts can be easily noticed, which makes this task more challenging.

Finally, audio-visual representation learning can be regarded as a generalization of other tasks. As we discussed before, both audio represented by electrical voltage and vision represented by the RGB color space are designed to be perceived by humans, which does not make it easy for a machine to discover the common features. The difficulty stems from having only two modalities and lacking explicit constraints. Therefore, the main challenge of this task is to find a suitable constraint. Unsupervised learning, a prevalent approach to this task, provides a well-designed solution, but the absence of external supervision makes it difficult to achieve the goal. The challenge of the weakly supervised approach is to find correct implicit supervision.

7.2 Directions for Future Research

AVL has been an active research field for many years [34, 41] and is crucial to modern life. However, there are still many open questions in AVL due to the challenging nature of the domain itself and people’s increasing demands.

First, from a macro perspective, as AVL is a classic multimodality problem, its primary issue is to learn the mapping between modalities, specifically to map the attributes in audio to the objects in an image or a video. We think that mimicking the human learning process, e.g., by following the ideas of the attention mechanism and a memory bank, may improve the performance of learning this mapping. Furthermore, the second most difficult goal is to learn logical reasoning. Endowing a machine with the ability to reason is not only important for AVL but also an open question for the entire AI community. Instead of directly empowering a machine with the full logic capability, which is a long way off from the current state of development, we can simplify this problem and consider fully utilizing the prior information and constructing a knowledge graph. Building a comprehensive knowledge graph and leveraging it properly in specific areas may help machine thinking.

As to the tasks summarized before, Sec. 2 and Sec. 3 can be referred to as the problem of ‘understanding’, while Sec. 4 and Sec. 5 can be referred to as ‘generation’ and ‘representation learning’, respectively. Significant advances in understanding and generation tasks such as lip-reading, speaker separation, and talking face generation have recently been achieved for human faces. The domain of faces is comparatively simple yet important since the scenes are normally constrained, and a sizable amount of useful prior information is available, for example, a 3D face model. However, these faces usually have neutral expressions, while the emotions underlying facial expressions have not been studied well. Furthermore, apart from faces, the more complicated in-the-wild scenes with more conditions are worth considering. Adapting models to new varieties of audio (stereoscopic audio) or vision (3D video and AR) also opens a new direction. Datasets, especially large and high-quality ones that can significantly improve the performance of machine learning, are fundamental to the research community [113]. However, collecting a dataset is labor- and time-intensive, so small-sample learning also benefits the application of AVL. Learning representations, which is a more general and basic form of other tasks, can also mitigate the dataset problem. While recent studies lacked sufficient prior information or supervision to guide the training procedure, exploring suitable prior information may allow models to learn better representations.

Finally, many studies focus on building more complex networks to improve performance, and the resulting networks generally entail unexplainable mechanisms. To make a model or an algorithm more robust and explainable, it is necessary to learn from the essence of earlier explainable algorithms in order to advance AVL.

8 Conclusions

The desire to better understand the world from the human perspective has drawn considerable attention to audio-visual learning in the deep learning community. This paper provides a comprehensive review of recent advances in audio-visual learning categorized into four research areas: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. Furthermore, we present a summary of datasets commonly used in audio-visual learning. The discussion section identifies the key challenges of each category, followed by potential research directions.

References


  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan (2016) Youtube-8m: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675. Cited by: §6.2.2, Table 6.
  • [2] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman (2018) Deep audio-visual speech recognition. arXiv preprint arXiv:1809.02108. Cited by: §3.2, Table 2.
  • [3] T. Afouras, J. S. Chung, and A. Zisserman (2018) The conversation: deep audio-visual speech enhancement. arXiv preprint arXiv:1804.04121. Cited by: §1, §2.1, Table 1.
  • [4] O. Alemi, J. Françoise, and P. Pasquier (2017) GrooveNet: real-time music-driven dance movement generation using artificial neural networks. networks 8 (17), pp. 26. Cited by: §4.2.2, Table 4.
  • [5] N. Alghamdi, S. Maddock, R. Marxer, J. Barker, and G. J. Brown (2018) A corpus of audio-visual lombard speech with frontal and profile views. The Journal of the Acoustical Society of America, pp. EL523–EL529. Cited by: §6.1.1, Table 6.
  • [6] I. Anina, Z. Zhou, G. Zhao, and M. Pietikäinen (2015) OuluVS2: a multi-view audiovisual database for non-rigid mouth motion analysis. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, pp. 1–5. Cited by: §6.1.1, Table 6.
  • [7] R. Arandjelovic and A. Zisserman (2017) Look, listen and learn. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 609–617. Cited by: §1, §5.2, §5.2, Table 5, §5.
  • [8] R. Arandjelović and A. Zisserman (2017) Objects that sound. arXiv preprint arXiv:1712.06651. Cited by: §5.2, Table 5, §5.
  • [9] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §4.
  • [10] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas (2016) Lipnet: sentence-level lipreading. arXiv preprint. Cited by: §3.2, §3.2, Table 2.
  • [11] Y. Aytar, C. Vondrick, and A. Torralba (2016) Soundnet: learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pp. 892–900. Cited by: §5.1, Table 5, §5.
  • [12] Y. Aytar, C. Vondrick, and A. Torralba (2017) See, hear, and read: deep aligned representations. CoRR. Cited by: §3.1.2.
  • [13] A. Bazzica, J. van Gemert, C. C. Liem, and A. Hanjalic (2017) Vision-based detection of acoustic timed events: a case study on clarinet note onsets. arXiv preprint arXiv:1706.09556. Cited by: §6.2.1, Table 6.
  • [14] Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, pp. 1798–1828. Cited by: §1, §5, §5.
  • [15] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §5.2.
  • [16] R. Białobrzeski, M. Kośmider, M. Matuszewski, M. Plata, and A. Rakowski (2019) Robust bayesian and light neural networks for voice spoofing detection. Proc. Interspeech 2019, pp. 1028–1032. Cited by: §3.1.
  • [17] B. Li, K. Dinesh, Z. Duan, and G. Sharma (2017) See and listen: score-informed association of sound tracks to players in chamber music performance videos. In IEEE International Conference on Acoustics, Speech and Signal Processing, Cited by: §2.1.
  • [18] M. Brand and A. Hertzmann (2000) Style machines. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2000, New Orleans, LA, USA, July 23-28, 2000, pp. 183–192. Cited by: §4.2.2.
  • [19] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman (2018) A short note about kinetics-600. arXiv preprint arXiv:1808.01340. Cited by: §6.2.2, Table 6.
  • [20] J. Carreira, E. Noland, C. Hillier, and A. Zisserman (2019) A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987. Cited by: §6.2.2, Table 6.
  • [21] L. Chen, Z. Li, R. K. Maddox, Z. Duan, and C. Xu (2018) Lip movements generation at a glance. CoRR. Cited by: §4.2.3, Table 4.
  • [22] L. Chen, S. Srivastava, Z. Duan, and C. Xu (2017) Deep cross-modal audio-visual generation. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 349–357. Cited by: §4.2.1, Table 4, §4, §6.2.1.
  • [23] L. Chen (2019) Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2.3, Table 4.
  • [24] J. S. Chung, A. Jamaludin, and A. Zisserman (2017) You said that?. CoRR. Cited by: §4.2.3, Table 4.
  • [25] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622. Cited by: §6.1.2, Table 6.
  • [26] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman (2017) Lip reading sentences in the wild.. In CVPR, pp. 3444–3453. Cited by: §3.2, Table 2, §6.1.2, Table 6.
  • [27] J. S. Chung and A. Zisserman (2016) Lip reading in the wild. In Asian Conference on Computer Vision, pp. 87–103. Cited by: §6.1.2, Table 6.
  • [28] J. S. Chung and A. Zisserman (2017) Lip reading in profile. In British Machine Vision Conference, Cited by: §6.1.2, Table 6.
  • [29] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §3.1, §3.2.
  • [30] S. Chung, J. Son Chung, and H. Kang (2018) Perfect match: Improved cross-modal embeddings for audio-visual synchronisation. ArXiv e-prints, pp. arXiv:1809.08001. Cited by: §3.1.1.
  • [31] M. Cooke, J. Barker, S. Cunningham, and X. Shao (2006) An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, pp. 2421–2424. Cited by: §6.1.1, Table 6.
  • [32] T. L. Cornu and B. Milner (2015) Reconstructing intelligible audio speech from visual speech features. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §4.1.1, §4.1.2, Table 3.
  • [33] L. Crnkovic-Friis and L. Crnkovic-Friis (2016) Generative choreography using deep learning. CoRR. Cited by: §4.2.2.
  • [34] T. Darrell, J. W. Fisher, and P. Viola (2000) Audio-visual segmentation and “the cocktail party effect”. In Advances in Multimodal Interfaces—ICMI 2000, pp. 32–40. Cited by: §2.1, §7.2.
  • [35] A. Davis, M. Rubinstein, N. Wadhwa, G. J. Mysore, F. Durand, and W. T. Freeman (2014) The visual microphone: passive recovery of sound from video. Cited by: §4.1.2, Table 3.
  • [36] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giro-i-Nieto (2019) Speech-conditioned face generation using generative adversarial networks. Cited by: §4.2.1.
  • [37] S. Dupont and J. Luettin (2000) Audio-visual speech modeling for continuous speech recognition. IEEE transactions on multimedia, pp. 141–151. Cited by: §3.2.
  • [38] A. Ephrat, T. Halperin, and S. Peleg (2017) Improved speech reconstruction from silent video. In Proceedings of the IEEE International Conference on Computer Vision, pp. 455–462. Cited by: §4.1.1, §4.1.2, Table 3.
  • [39] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein (2018) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Trans. Graph., pp. 112:1–112:11. Cited by: §1, §2.1, §2.1, Table 1, §6.1.2, §6.1.2.
  • [40] A. Ephrat and S. Peleg (2017) Vid2speech: speech reconstruction from silent video. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5095–5099. Cited by: §4.1.1, §4.1.2.
  • [41] J. W. Fisher III, T. Darrell, W. T. Freeman, and P. A. Viola (2001) Learning joint statistical models for audio-visual fusion and segregation. In Advances in Neural Information Processing Systems 13, pp. 772–778. Cited by: §2.1, §2.2.1, §7.2.
  • [42] C. Fu, X. Wu, Y. Hu, H. Huang, and R. He (2019) Dual variational generation for low-shot heterogeneous face recognition. NeurIPS. Cited by: §1.
  • [43] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg (2018) Seeing through noise: visually driven speaker separation and enhancement. In ICASSP, pp. 3051–3055. Cited by: §1, §2.1, §2.1, §2.1, Table 1.
  • [44] R. Gao, R. Feris, and K. Grauman (2018) Learning to separate object sounds by watching unlabeled video. In ECCV, Cited by: §2.2.1, Table 1.
  • [45] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 776–780. Cited by: §6.2.2, Table 6.
  • [46] O. Gillet and G. Richard (2006) ENST-drums: an extensive audio-visual database for drum signals processing.. In ISMIR, pp. 156–159. Cited by: §6.2.1, Table 6.
  • [47] A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez (2019) A light convolutional gru-rnn deep feature extractor for asv spoofing detection. Proc. Interspeech 2019, pp. 1068–1072. Cited by: §3.1.
  • [48] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §4.2.3, §4.
  • [49] A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, pp. 602–610. Cited by: §4.2.3.
  • [50] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056. Cited by: §6.2.2.
  • [51] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In NIPS, pp. 5767–5777. Cited by: §1.
  • [52] W. Hao, Z. Zhang, and H. Guan (2017) CMCGAN: a uniform framework for cross-modal visual-audio mutual generation. arXiv preprint arXiv:1711.08102. Cited by: §4.2.1, Table 4.
  • [53] N. Harte and E. Gillen (2015) TCD-timit: an audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, pp. 603–615. Cited by: §6.1.1, Table 6.
  • [54] R. He, W. Zheng, and B. Hu (2010) Maximum correntropy criterion for robust face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1561–1576. Cited by: §1.
  • [55] J. Hershey and J. Movellan (2000) Audio-vision: using audio-visual synchrony to locate sounds. In Advances in Neural Information Processing Systems 12, pp. 813–819. Cited by: §2.2.1, §2.2.2.
  • [56] E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Cited by: §2.2.2.
  • [57] D. Holden, J. Saito, and T. Komura (2016) A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG), pp. 138. Cited by: §4.2.2.
  • [58] S. Hong, W. Im, and H. S. Yang (2017) Deep learning for content-based, cross-modal retrieval of videos and music. CoRR. Cited by: §3.1.2, Table 2.
  • [59] K. Hoover, S. Chaudhuri, C. Pantofaru, M. Slaney, and I. Sturdy (2017) Putting a face to the voice: fusing audio and visual signals across a video to determine speakers. CoRR. Cited by: §3.1.1, Table 2.
  • [60] D. Hu, X. Li, et al. (2016) Temporal multimodal learning in audiovisual speech recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3574–3582. Cited by: §3.2.
  • [61] D. Hu, F. Nie, and X. Li (2018) Deep co-clustering for unsupervised audiovisual learning. arXiv preprint arXiv:1807.03094. Cited by: §5.2, Table 5, §5.
  • [62] S. Huang, C. Lin, S. Chen, Y. Wu, P. Hsu, and S. Lai (2018) Auggan: cross domain adaptation with gan-based data augmentation. In ECCV, pp. 718–731. Cited by: §4.
  • [63] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR. Cited by: §3.1.1.
  • [64] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey (2016) Single-channel multi-speaker separation using deep clustering. arXiv preprint arXiv:1607.02173. Cited by: §2.1.
  • [65] H. Izadinia, I. Saleemi, and M. Shah (2013) Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Transactions on Multimedia, pp. 378–390. Cited by: §2.2.2.
  • [66] S. A. Jalalifar, H. Hasani, and H. Aghajan (2018) Speech-driven facial reenactment using conditional generative adversarial networks. CoRR. Cited by: §4.2.3, Table 4.
  • [67] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR, pp. 4401–4410. Cited by: §1.
  • [68] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §6.2.2, Table 6.
  • [69] B. Korbar, D. Tran, and L. Torresani (2018) Co-training of audio and video representations from self-supervised temporal synchronization. arXiv preprint arXiv:1807.00230. Cited by: §1, §5.2, Table 5, §5.
  • [70] J. Kossaifi, R. Walecki, Y. Panagakis, J. Shen, M. Schmitt, F. Ringeval, J. Han, V. Pandit, B. Schuller, K. Star, et al. (2019) SEWA db: a rich database for audio-visual emotion and sentiment research in the wild. arXiv preprint arXiv:1901.02839. Cited by: Table 6.
  • [71] G. Krishna, C. Tran, J. Yu, and A. H. Tewfik (2019) Speech recognition with no speech or with noisy speech. In ICASSP, pp. 1090–1094. Cited by: §1.
  • [72] R. Kumar, J. Sotelo, K. Kumar, A. de Brébisson, and Y. Bengio (2017) ObamaNet: photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442. Cited by: §4.2.3, Table 4.
  • [73] J. Lee, S. Kim, and K. Lee (2018) Listen to dance: music-driven choreography generation using autoregressive encoder-decoder network. CoRR. Cited by: §4.2.2, Table 4.
  • [74] K. Leidal, D. Harwath, and J. R. Glass (2017) Learning modality-invariant representations for speech and images. 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 424–429. Cited by: §5.2, Table 5, §5.
  • [75] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma (2019) Creating a multitrack classical music performance dataset for multimodal music analysis: challenges, insights, and applications. IEEE Transactions on Multimedia, pp. 522–535. Cited by: §6.2.1, Table 6.
  • [76] C. Lippert, R. Sabatini, M. C. Maher, E. Y. Kang, S. Lee, O. Arikan, A. Harley, A. Bernal, P. Garst, V. Lavrenko, K. Yocum, T. Wong, M. Zhu, W. Yang, C. Chang, T. Lu, C. W. H. Lee, B. Hicks, S. Ramakrishnan, H. Tang, C. Xie, J. Piper, S. Brewerton, Y. Turpaz, A. Telenti, R. K. Roby, F. J. Och, and J. C. Venter (2017) Identification of individuals by trait prediction using whole-genome sequencing data. Proceedings of the National Academy of Sciences, pp. 10166–10171. Cited by: §3.1.1.
  • [77] S. R. Livingstone and F. A. Russo (2018) The ryerson audio-visual database of emotional speech and song (ravdess): a dynamic, multimodal set of facial and vocal expressions in north american english. PloS one, pp. e0196391. Cited by: §6.1.1, Table 6.
  • [78] L. R. Long, L. E. Berman, and G. R. Thoma (1996) Prototype client/server application for biomedical text/image retrieval on the Internet. In Storage and Retrieval for Still Image and Video Databases IV, pp. 362 – 372. Cited by: §3.1.2.
  • [79] R. Lu, Z. Duan, and C. Zhang (2018) Listen and look: audio–visual matching assisted speech source separation. IEEE Signal Processing Letters, pp. 1315–1319. Cited by: §2.1, Table 1.
  • [80] Y. Luo, Z. Chen, and N. Mesgarani (2018) Speaker-independent speech separation with deep attractor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 787–796. Cited by: §2.1.
  • [81] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio (2016) SampleRNN: an unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837. Cited by: §4.1.2.
  • [82] P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang (2018) Self-supervised generation of spatial audio for 360 video. arXiv preprint arXiv:1809.02587. Cited by: §1, §4.1.2, Table 3.
  • [83] G. Morrone, S. Bergamaschi, L. Pasa, L. Fadiga, V. Tikhanoff, and L. Badino (2019) Face landmark-based speaker-independent audio-visual speech enhancement in multi-talker environments. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6900–6904. Cited by: §2.1, Table 1.
  • [84] A. Nagrani, S. Albanie, and A. Zisserman (2018) Learnable pins: cross-modal embeddings for person identity. CoRR. Cited by: §3.1.2, Table 2.
  • [85] A. Nagrani, S. Albanie, and A. Zisserman (2018) Seeing voices and hearing faces: cross-modal biometric matching. CoRR. Cited by: §3.1.1, Table 2.
  • [86] A. Nagrani, J. S. Chung, and A. Zisserman (2017) Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. Cited by: §6.1.2, Table 6.
  • [87] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689–696. Cited by: §3.2.
  • [88] H. Ninomiya, N. Kitaoka, S. Tamura, Y. Iribe, and K. Takeda (2015) Integration of deep bottleneck features for audio-visual speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §3.2.
  • [89] M. Nussbaum-Thom, J. Cui, B. Ramabhadran, and V. Goel (2016) Acoustic modeling using bidirectional gated recurrent convolutional units. pp. 390–394. Cited by: §3.2.
  • [90] T. Oh, T. Dekel, C. Kim, I. Mosseri, W. T. Freeman, M. Rubinstein, and W. Matusik (2019) Speech2Face: learning the face behind a voice. arXiv preprint arXiv:1905.09773. Cited by: §4.2.1.
  • [91] A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 631–648. Cited by: §5.2.
  • [92] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman (2016) Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2405–2413. Cited by: §4.1.2, §4.1.2, Table 3.
  • [93] S. Parekh, S. Essid, A. Ozerov, N. Q. Duong, P. Pérez, and G. Richard (2018) Weakly supervised representation learning for unsynchronized audio-visual events. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2518–2519. Cited by: §5.2, Table 5.
  • [94] S. Parekh, A. Ozerov, S. Essid, N. Q. K. Duong, P. Pérez, and G. Richard (2018) Identify, locate and separate: audio-visual object extraction in large video collections using weak supervision. CoRR. Cited by: §2.2.3, Table 1.
  • [95] S. Petridis, Z. Li, and M. Pantic (2017) End-to-end visual speech recognition with lstms. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 2592–2596. Cited by: §3.2, Table 2.
  • [96] S. Petridis and M. Pantic (2016) Prediction-based audiovisual fusion for classification of non-linguistic vocalisations. IEEE Transactions on Affective Computing, pp. 45–58. Cited by: §3.2.
  • [97] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior (2003) Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, pp. 1306–1326. Cited by: §3.2.
  • [98] J. Pu, Y. Panagakis, S. Petridis, and M. Pantic (2017) Audio-visual object localization and separation using low-rank and sparsity. Cited by: §2.1, §2.2.3, §2.2.3, Table 1.
  • [99] Y. Qiu and H. Kataoka (2018) Image generation associated with music data. In 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 2510–2513. Cited by: §4.2.1, Table 4.
  • [100] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R.G. Lanckriet, R. Levy, and N. Vasconcelos (2010) A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia, pp. 251–260. Cited by: §3.1.2.
  • [101] J. Roth, S. Chaudhuri, O. Klejch, R. Marvin, A. Gallagher, L. Kaver, S. Ramaswamy, A. Stopczynski, C. Schmid, Z. Xi, et al. (2019) AVA-activespeaker: an audio-visual dataset for active speaker detection. arXiv preprint arXiv:1901.01342. Cited by: §6.1.2, §6.1.2, Table 6.
  • [102] A. Rouditchenko, H. Zhao, C. Gan, J. H. McDermott, and A. Torralba (2019) Self-supervised audio-visual co-segmentation. CoRR. Cited by: §2.2.3, Table 1.
  • [103] M. Saito, E. Matsumoto, and S. Saito (2017) Temporal generative adversarial nets with singular value clipping. In IEEE International Conference on Computer Vision (ICCV), pp. 5. Cited by: §4.2.3.
  • [104] A. Samadani, E. Kubica, R. Gorbet, and D. Kulic (2013) Perception and generation of affective hand movements. I. J. Social Robotics, pp. 35–51. Cited by: §4.2.2.
  • [105] C. Sanderson and B. C. Lovell (2009) Multi-region probabilistic histograms for robust and scalable identity inference. In International Conference on Biometrics, pp. 199–208. Cited by: §6.1.1, Table 6.
  • [106] A. Senocak, T. Oh, J. Kim, M. Yang, and I. S. Kweon (2018) Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4358–4366. Cited by: §2.2.2, Table 1.
  • [107] R. V. Shannon, F. Zeng, V. Kamath, J. Wygonski, and M. Ekelid (1995) Speech recognition with primarily temporal cues. Science, pp. 303–304. Cited by: §1.
  • [108] E. Shlizerman, L. Dery, H. Schoen, and I. Kemelmacher-Shlizerman (2018) Audio to body dynamics. In Proc. CVPR, Cited by: §4.2.2, Table 4.
  • [109] K. Simonyan and A. Zisserman (2014) Two-Stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems 27, pp. 568–576. Cited by: §3.1.1.
  • [110] R. K. Srihari (1995) Combining text and image information in content-based retrieval. In Proceedings., International Conference on Image Processing, pp. 326–329 vol.1. Cited by: §3.1.2.
  • [111] K. Sriskandaraja, V. Sethu, and E. Ambikairajah (2018) Deep siamese architecture based replay detection for secure voice biometric.. In Interspeech, pp. 671–675. Cited by: §3.1.
  • [112] T. Stafylakis and G. Tzimiropoulos (2017) Combining residual networks with lstms for lipreading. arXiv preprint arXiv:1703.04105. Cited by: §3.2.
  • [113] C. Sun, A. Shrivastava, S. Singh, and A. Gupta (2017) Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, pp. 843–852. Cited by: §7.2.
  • [114] D. Surís, A. Duarte, A. Salvador, J. Torres, and X. Giró i Nieto (2018) Cross-modal embeddings for video and audio retrieval. CoRR. Cited by: §3.1.2, Table 2.
  • [115] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman (2017) Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), pp. 95. Cited by: §4.2.3, Table 4.
  • [116] T. Tang, J. Jia, and H. Mao (2018) Dance with melody: an lstm-autoencoder approach to music-oriented dance synthesis. In 2018 ACM Multimedia Conference on Multimedia Conference, MM 2018, Seoul, Republic of Korea, October 22-26, 2018, pp. 1598–1606. Cited by: §4.2.2, Table 4.
  • [117] G. W. Taylor and G. E. Hinton (2009) Factored conditional restricted boltzmann machines for modeling motion style. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pp. 1025–1032. Cited by: §4.2.2.
  • [118] T. Thomas Le Cornu and B. Milner (2017) Generating intelligible audio speech from visual speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1751–1761. Cited by: §4.1.1, §4.1.2, Table 3.
  • [119] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu (2018) Audio-visual event localization in unconstrained videos. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, pp. 252–268. Cited by: §2.2.2, Table 1.
  • [120] J. Tilmanne and T. Dutoit (2010) Expressive gait synthesis using PCA and gaussian modeling. In Motion in Games - Third International Conference, MIG 2010, Utrecht, The Netherlands, November 14-16, 2010. Proceedings, pp. 363–374. Cited by: §4.2.2.
  • [121] A. Torfi, S. M. Iranmanesh, N. M. Nasrabadi, and J. M. Dawson (2017) Coupled 3d convolutional neural networks for audio-visual recognition. CoRR. Cited by: §3.1.1.
  • [122] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou (2016) Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5200–5204. Cited by: §3.2, Table 2.
  • [123] H. L. Van Trees (2004) Optimum array processing: part iv of detection, estimation, and modulation theory. John Wiley & Sons. Cited by: §2.2.1.
  • [124] K. Vougioukas, S. Petridis, and M. Pantic (2018) End-to-end speech-driven facial animation with temporal gans. In BMVC, Cited by: §4.2.3, Table 4.
  • [125] C. Wan, S. Chuang, and H. Lee (2018) Towards audio to scene image synthesis using generative adversarial network. arXiv preprint arXiv:1808.04108. Cited by: §4.2.1, Table 4.
  • [126] M. Wand, J. Koutník, and J. Schmidhuber (2016) Lipreading with long short-term memory. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 6115–6119. Cited by: §3.2, §3.2, Table 2.
  • [127] J. M. Wang, D. J. Fleet, and A. Hertzmann (2007) Multifactor gaussian process models for style-content separation. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, pp. 975–982. Cited by: §4.2.2.
  • [128] R. Wang, H. Huang, X. Zhang, J. Ma, and A. Zheng (2019) A novel distance learning for elastic cross-modal audio-visual matching. In 2019 IEEE International Conference on Multimedia Expo Workshops (ICMEW), pp. 300–305. Cited by: Table 2.
  • [129] R. Wang, H. Huang, X. Zhang, J. Ma, and A. Zheng (2019) A novel distance learning for elastic cross-modal audio-visual matching. In 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 300–305. Cited by: §3.1.1.
  • [130] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer gan to bridge domain gap for person re-identification. In CVPR, pp. 79–88. Cited by: §4.
  • [131] Y. Wen, M. A. Ismail, W. Liu, B. Raj, and R. Singh (2018) Disjoint Mapping Network for Cross-modal Matching of Voices and Faces. ArXiv e-prints, pp. arXiv:1807.04836. Cited by: §3.1.1, Table 2.
  • [132] O. Wiles, A.S. Koepke, and A. Zisserman (2018) X2Face: a network for controlling face generation by using images, audio, and pose codes. In European Conference on Computer Vision, Cited by: §4.2.3, Table 4.
  • [133] X. Wu, R. He, Z. Sun, and T. Tan (2018) A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, pp. 2884–2896. Cited by: §3.1.
  • [134] N. Yalta, S. Watanabe, K. Nakadai, and T. Ogata (2018) Weakly supervised deep recurrent neural networks for basic dance step generation. CoRR. Cited by: §4.2.2, Table 4.
  • [135] G. Zhao, M. Barnard, and M. Pietikainen (2009) Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, pp. 1254–1265. Cited by: §6.1.1, Table 6.
  • [136] H. Zhao, C. Gan, W. Ma, and A. Torralba (2019) The sound of motions. CoRR. Cited by: §2.2.3, §2.2.3, Table 1.
  • [137] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba (2018) The sound of pixels. arXiv preprint arXiv:1804.03160. Cited by: §2.2.3, §2.2.3, Table 1.
  • [138] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang (2018) Talking face generation by adversarially disentangled audio-visual representation. CoRR. Cited by: §4.2.3, Table 4.
  • [139] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg (2017) Visual to sound: generating natural sound for videos in the wild. arXiv preprint arXiv:1712.01393. Cited by: §4.1.2, Table 3.
  • [140] H. Zhu, A. Zheng, H. Huang, and R. He (2018) High-resolution talking face generation via mutual information approximation. arXiv preprint arXiv:1812.06589. Cited by: §4.2.3, Table 4, §4.
  • [141] A. Zunino, M. Crocco, S. Martelli, A. Trucco, A. Del Bue, and V. Murino (2015) Seeing the sound: a new multimodal imaging device for computer vision. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 6–14. Cited by: §2.2.1, §2.2.2.