Deep neural networks trained in a supervised manner are a popular contemporary choice for various speech-related tasks such as automatic speech recognition (ASR), emotion recognition and age/gender recognition. However, they are a double-edged sword: they provide extremely good performance only when large-scale annotated data is available, and such data is usually expensive to obtain. For problems like emotion recognition, reliably annotated data is also extremely scarce and even modern datasets are very limited in size. Transfer learning approaches attempt to solve this problem by domain adaptation (e.g. using supervised ImageNet pretraining for visual tasks), but even they need a large amount of annotated data for the primary supervised task and generalization is not guaranteed. Self-supervised learning is a recent and rapidly developing area of machine learning which might offer a potential solution to this problem. In this work, we present a method for visually guided self-supervised learning of speech features that outperforms baseline self-supervised methods and also outperforms fully supervised pretraining on the evaluated downstream tasks.
Self-supervision is an interesting way to attempt to combat the paucity of labeled data by capturing the intrinsic structure of the unlabelled data. The idea behind self-supervision is to find a ‘pretext task / proxy task’ for the network to learn that does not require any explicit labeling, but instead the data’s inherent structure provides the labels. During training, the network is tasked with predicting these implicit labels, which could be of various kinds. For instance, predicting the next element or a randomly masked element of a known sequence given the history/context is a popular pretext task. The key idea is that the whole sequence is already available as an unlabeled data sample, and we are just choosing an intrinsic property (here the value of the element to be predicted) as the label for the proxy supervised learning problem. This ‘label’ is provided to us for free by the data and does not require any sort of external annotation. These pretext tasks may also model and span across multiple modalities (e.g. predicting the data or features of one modality from another). This is especially relevant in the context of speech and emotion recognition where we are interested in modeling complementary multimodal information, especially in audio and video.
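To make the idea of a "free" label concrete, the following is a minimal sketch of how a masked-element pretext example could be built from an unlabeled sequence. The function name `make_masked_example` and the `mask_token` placeholder are hypothetical and are not part of the methods proposed in this work.

```python
import random


def make_masked_example(sequence, mask_token=None, rng=random):
    """Hide one element of `sequence` and return (masked_input, position, label).

    The label is simply the hidden element, so no external annotation is needed.
    """
    position = rng.randrange(len(sequence))
    label = sequence[position]      # the label is provided by the data itself
    masked = list(sequence)
    masked[position] = mask_token   # the network would be asked to fill this in
    return masked, position, label


# Toy "unlabeled" sequence standing in for audio frames.
frames = [0.1, 0.4, 0.9, 0.3, 0.2]
masked, position, label = make_masked_example(frames, rng=random.Random(0))
```

The same pattern applies to next-element prediction by always choosing the last position instead of a random one.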
In this work, we investigate self-supervised learning for audio representations. Audio representations are a cornerstone of speech and affect recognition. Most audio-related applications involve the analysis of a speech signal using either handcrafted low level descriptors or through a supervised (or fine tuned) neural network which directly predicts the labels of interest. However, self-supervised learning may offer better representations for these applications, especially in cases where labeled data is hard to come by and unlabeled audio data is readily available. We look into how self-supervision can be used to produce robust audio features.
First, we examine the state-of-the-art in self-supervised audio feature learning which we use as baselines. We then propose a novel visual self-supervised method and two novel audio-only self-supervised methods for learning audio features.
Most existing self-supervised learning approaches are unimodal. The few existing cross-modal approaches typically have some interaction between the modalities in the latent space through pretext tasks like clustering, but they do not produce an intuitive interaction between the two modalities. By contrast, our work proposes audio features that are explicitly guided by the reconstruction of lip movements and facial expressions (see Fig. 1). We implicitly capture visual information related to lip movements and facial expressions in the audio features. The visual modality is needed only during training and our audio features can be evaluated on audio-only datasets.
We summarize our research contributions as follows:
We investigate visual self-supervision for learning audio features. We propose a novel method for visually-guided self-supervised learning of speech representations by face reconstruction (L1). The proposed speech features, which are correlated with lip movements and facial expressions due to being driven by video generation, outperform existing audio-only self-supervision approaches for speech and emotion recognition.
We propose two new audio-only self-supervised methods (Odd One Out and Arrow of Time). Both methods are inspired by visual self-supervised learning and use temporal order verification as the pretext task. Both of them offer competitive performance on the tested datasets.
We combine the proposed audio-only and video-only supervision methods by multi-task learning. We find that the encoder trained in a multi-modal regime encodes richer information about the speech signal and yields the most effective representation that attains the best performance among all tested methods.
We show that pretraining by audio-visual self-supervision produces a better weight initialization for downstream tasks than does training from scratch. This results in faster training and convergence for a variety of hyperparameters for downstream tasks.
We show that the proposed visually-guided audio features are more robust for various levels of noise.
2 Related Work
To position our work with respect to existing literature and to highlight its novelty, we review prior work in: (1) Self-supervised learning, including audio, visual and cross modal methods; (2) Audiovisual speech recognition and methods that exploit both modalities for speech related tasks like emotion recognition.
2.1 Self-Supervised Learning
Self-supervised learning is a rapidly developing field in machine learning, with the promise of being able to learn useful representations from unlabeled data. Perhaps the most seminal and widespread applications of self-supervised learning have come in natural language processing. Extremely popular recent works like ELMo and BERT are based on predicting the next or a masked token of text from the history or context. Self-supervised learning of visual features has also attracted a lot of research interest, whereas self-supervised learning of audio representations has received less attention so far. There have also been a few works on cross-modal self-supervised learning. We briefly survey these trends in the subsequent sub-sections.
2.1.1 Self-supervised video feature learning
There have been numerous recent works on visual self-supervised representation learning. Gidaris et al. predict rotations for unlabeled images that have been rotated by a known amount, which drives the features to encode information about the object shape and appearance. Other works try to predict the relative location of patches, the temporal order of frames in a video, or audio-visual synchronization [19, 27]. BigBiGAN is a recent method proposed for adversarial self-supervised representation learning. The work shows that more accurate and realistic reconstructions tend to produce better visual features for downstream tasks. Cycle consistency is also a concept that has been explored for visual feature learning. DeepCluster was an interesting idea which focused on clustering in the latent space based on iteratively improving labels provided by the model being trained. NoisyStudent was a follow-up work on a similar concept, the idea being that the predictions of the model from a previous training epoch could be used as labels for the current epoch. S4L (Self-Supervised Semi-Supervised Learning) is another recent work which combines self-supervised learning with a small amount of labeled data to learn richer representations. Contrastive learning is a recent trend in self-supervised learning that is focused on separating representations of positive and negative pairs in the latent space. MoCo is an important work in this area, and is based on distancing a positive pair from a large memory bank of negative examples. PIRL extends this idea to produce image representations that are invariant of the chosen pretext task. A more detailed overview of self-supervised methods for visual feature learning can be found in the literature. We draw inspiration from visual self-supervised learning for the learning of audio features. In this work, we apply concepts from visual self-supervision to develop two audio-only self-supervised methods (see section 4) and a cross-modal self-supervised method based on visual reconstruction (see section 3).
2.1.2 Self-supervised audio feature learning
There has also been a wave of recent work on self-supervised audio-only representation learning. CPC (Contrastive Predictive Coding) and APC (Autoregressive Predictive Coding) are similar approaches that model the next token of a speech segment given the history. Another method called LIM (Local Info Max) is based on maximizing the MI (mutual information) among randomly chosen windows in an unsupervised way to learn speaker embeddings. Wav2vec is also an unsupervised pre-training method used in the context of speech recognition. Self-supervised audio features have also been proposed for mobile devices. Another very relevant recent work is PASE (Problem Agnostic Speech Encoder), which aims to learn multi-task speech representations from raw audio by predicting a number of handcrafted features such as MFCCs, prosody and waveform. SeCoSt is a teacher-student self-supervised approach very similar to the ones in the visual domain that iteratively use the predictions of one epoch as the labels for the next one. Phase prediction has also been proposed as an audio-based pretext task. WaveNet is a generative model for raw audio waveforms that can be used for generic audio representations. A new version of CPC has also been proposed for audio in multiple languages. We compare our proposed methods with the best performing audio-only self-supervised baselines in recent literature. A detailed description of the baselines can be found in section 6 and the results can be found in section 7.
2.1.3 Self-supervised cross-modal learning
A few works also exploit the relationship between modalities, such as by predicting cyclic transitions, the relationship between ambient sound and vision, and cross-modal prediction based fusion. XDC extends the idea of clustering as a pretext task across modalities, with the cluster predictions for video coming from audio and vice versa. Piergiovanni et al. [48] propose a method that projects representations of views from different modalities closer in the latent space for positive examples and further apart for negative examples. In this process, the encoders for each view (modality) learn useful representations. Patrick et al. extend the contrastive learning concept from MoCo to a multi-modal setting. Morgado et al. combine audio-visual instance discrimination with cross-modal agreement for self-supervised learning. Zhu et al. present a detailed survey of deep audio-visual learning including cross-modal self-supervised learning. All of these works have shown that it is possible to learn robust multi-task representations from a large amount of unlabeled data that is inexpensive to obtain. We propose a novel self-supervised method based on cross-modal reconstruction to learn audio features. Our method is based on speech-driven facial reconstruction and is explained in detail in section 3.
2.2 Audiovisual Speech and Emotion Recognition
Audiovisual speech data is extremely common and the usage of complementary information from both modalities is a popular concept in many fields of research. The McGurk effect  was the classic example that demonstrated the audio-visual nature of human perception of speech. The visual modality contains information that offers robustness in circumstances where the audio modality may be corrupted with noise .
Audiovisual emotion recognition has also seen a significant amount of recent research effort. Automatic affect recognition has applications in a variety of fields, from detecting depression to more emotionally relevant advertising [45, 44]. Many contemporary affect analysis approaches are based on deep neural networks that study both the visual and audio modalities [18, 58]. However, a big problem in emotion recognition is the lack of reliably annotated data for large datasets, which we try to address (implicitly) in this paper.
3 Visual Self-Supervision for Speech Representation Learning
The proposed method is illustrated in Fig. 1 and is based on our prior work on visually guided speech representation learning through speech-driven facial animation [50, 46]. The model is a temporal encoder-decoder which takes a still image of a face (frame from a 25 fps video) and an audio signal as inputs and generates video frames from these. The model itself can be conceptually divided into three subnetworks (see Fig. 1 and Fig. 2), namely the content/audio encoder (3 layer GRU), the identity encoder (6 layer 2D CNN) and the frame decoder (U-Net  architecture with skip connections from the identity encoder).
The purpose of the audio encoder is to convert the input audio into a latent space audio feature vector. Similarly, the identity encoder (see Fig. 2 top-left), which is made of 6 (Conv2D - BatchNorm - ReLU) blocks, reduces a 64x128 input image (which is the first video frame of the audiovisual speech segment) to a 64x1 feature vector.
We also use a noise generator (see Fig. 1) capable of producing noise that is temporally coherent. A 10 dimensional vector is sampled from a Gaussian distribution with mean 0 and variance of 0.33 and passed through a single-layer GRU to produce the noise sequence. This latent representation accounts for randomness in the face synthesis process (such as the generation of random sequential behaviour like blinks), which leads to a more realistic facial reconstruction.
The frame decoder (see Fig. 2 top-right) is a CNN that uses strided transposed convolutions to produce the video frames. The skip connections to the identity encoder help in preserving subject identity.
An L1 reconstruction loss between a random frame from the generated video and the corresponding frame from the real video is used to train the network. The L1 loss on the pixel level is commonly used in facial reconstruction as opposed to the L2 loss which typically produces blurrier reconstructions. We use the Adam optimizer with a learning rate of 0.06 that is decayed by a factor of 0.98 every 10 epochs. Essentially, our model aims to predict the video modality (face reconstruction) given only the audio modality and speaker identity information from the first frame. In this process, the audio encoder is driven to produce useful speech features that correlate with mouth and facial movements (because we need to generate these lip and facial movements using only the audio information, so the features must encode this in order to reduce the L1 loss). After this process of visually guided self-supervised pretraining, we simply use the trained audio encoder as a pretrained model for audio-only downstream tasks. The features extracted from this model are especially interesting to evaluate on tasks like speech recognition and emotion recognition. This is because these features are explicitly trained (guided by the visual modality) to contain information related to lip movements (highly correlated with speech) and facial expressions (highly correlated with emotion).
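The pixel-level loss described above can be sketched as follows. This is an illustrative pure-Python version that assumes each frame is a flat list of pixel intensities. The helper names are hypothetical, and an actual implementation would operate on tensors in a deep learning framework.

```python
import random


def l1_loss(pred, target):
    """Mean absolute error between two frames given as flat lists of pixels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error, shown only for comparison with the L1 loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def random_frame_l1(generated_video, real_video, rng=random):
    """L1 loss on one randomly chosen frame index shared by both videos."""
    idx = rng.randrange(len(real_video))
    return l1_loss(generated_video[idx], real_video[idx])


# Toy example: two "videos" of two frames with two pixels each.
real = [[0.0, 1.0], [0.5, 0.5]]
fake = [[0.2, 0.8], [0.5, 0.1]]
loss = random_frame_l1(fake, real, random.Random(0))
```

Because the squared error shrinks small residuals much faster than the absolute error, an L2 objective tolerates many small per-pixel deviations, which is one common explanation for the blurrier reconstructions mentioned above.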
3.1 Audio Encoder Architecture
The audio encoder (see Fig. 2 bottom-left) is a log mel spectrogram encoder (closely following ). The log mel spectrogram is computed with 80 frequency bins, a window width of 25ms and a stride of 10ms, which is a standard choice for processing speech signals. This (t, 80) dimensional input then goes through the encoder which is a 3 layer GRU network with each layer having a hidden size of 512 followed by a fully connected layer which converts it into a feature with dimensionality (t, 512). This specific architecture of the audio encoder with 3 GRU layers is the exact same as used in . We chose this for simplicity and to enable direct comparison with this baseline which uses a similar audio input as us (80 dimensional log mel features).
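As a small worked example of the input dimensions, the number of spectrogram frames t can be derived from the window width and stride. The sketch below assumes a 16 kHz sampling rate and no padding at the clip boundaries; neither is stated in the text, so both are labeled assumptions.

```python
def num_frames(num_samples, sample_rate=16000, window_ms=25, stride_ms=10):
    """Number of analysis windows that fit in a clip (no padding assumed)."""
    window = sample_rate * window_ms // 1000   # 400 samples at 16 kHz
    stride = sample_rate * stride_ms // 1000   # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // stride


t = num_frames(16000)        # one second of audio at the assumed 16 kHz rate
input_shape = (t, 80)        # log mel spectrogram fed to the GRU encoder
output_shape = (t, 512)      # feature sequence produced by the encoder
```

Under these assumptions, one second of audio gives t = 98 frames, so the encoder maps a (98, 80) input to a (98, 512) feature sequence.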
We use the above described architecture (see Fig. 2) as the audio encoder in the proposed models in the following sections (for both visual and audio self-supervision). We then use the trained encoder to extract features from the evaluation datasets.
4 Audio Self-Supervision for Speech Representation Learning
This section introduces two audio-only self-supervised methods that we propose for speech representation learning. Both share the same concept and inspiration: temporal order verification for audio as a pretext task. The first proposed method is inspired by a work for video representation learning called the ‘Arrow of Time’, and the second one by a similar work called ‘Odd One Out’.
4.1 Audio feature learning with the Arrow of Time
The temporal order of a sequence carries a lot of potentially useful information about its structure. For video, Wei et al. proposed the Arrow of Time as a self-supervised method that predicts whether a given video sequence is being played forwards or backwards. This helps the encoder that is predicting this pretext task label to learn features that correspond to object semantics and other visually correlated physical characteristics like gravity, forces etc. which may be useful for generic visual feature learning. We adapt the Arrow of Time (henceforth abbreviated as AoT) method for speech signals. The problem reduces to predicting whether a given audio clip is being played forwards or backwards. While learning to predict the task, the encoder learns useful audio features that differentiate between certain phonemes and how they sound when played forward vs backward. In order to predict the direction of the Arrow of Time, the encoder must capture useful characteristics about the phonemes themselves. In our implementation, we simply flip the temporal order of half of the sequences of an input batch (make them play backwards), and train the audio encoder on the supervised binary classification task of predicting the direction (forward or backward). We use the encoder architecture described in Section 3.1.
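The batch construction for AoT can be sketched as below. This is a minimal illustration that assumes each clip is a time-ordered list of frames; the function name `make_aot_batch` and the label convention (1 for backward, 0 for forward) are assumptions made for the example.

```python
import random


def make_aot_batch(clips, rng=random):
    """Reverse half of the clips in time; label 1 = backward, 0 = forward."""
    flipped = set(rng.sample(range(len(clips)), len(clips) // 2))
    batch, labels = [], []
    for i, clip in enumerate(clips):
        if i in flipped:
            batch.append(clip[::-1])   # play the clip backwards
            labels.append(1)
        else:
            batch.append(list(clip))   # keep the original temporal order
            labels.append(0)
    return batch, labels


# Toy batch of four clips with three frames each.
clips = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
batch, labels = make_aot_batch(clips, random.Random(0))
```

The encoder output for each clip would then be passed to a binary classifier trained on `labels`.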
4.2 Odd One Out networks for Audio
Odd One Out networks for video  are based on predicting which one out of multiple sets of ordered sequences of frames is in jumbled order (temporally incorrect order). The intuition behind such a method being able to learn useful features is very similar to that of Arrow of Time. Being able to predict temporal order should drive the encoder to learn generic useful features about the data. We adapt this idea to the audio modality in a straightforward way as well. For a given input batch of audio clips, we jumble 25% of the clips. The jumbling is performed by selecting at random two windows of a length of 15% of the total audio duration and swapping them. The encoder is then tasked with predicting which element in the input batch is the ’Odd One Out’, and is optimized using cross entropy loss. Fig. 3 illustrates the training procedure for Odd One Out networks for audio representation learning. We use the same encoder architecture as before (Section 3.1).
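The jumbling step can be sketched as follows. The text does not say whether the two swapped windows may overlap, so this illustration assumes they do not; the function names and the rounding of the window length are likewise assumptions.

```python
import random


def swap_random_windows(clip, window_frac=0.15, rng=random):
    """Swap two non-overlapping windows, each `window_frac` of the clip length."""
    n = len(clip)
    w = max(1, int(n * window_frac))
    if n < 2 * w:
        raise ValueError("clip too short for two non-overlapping windows")
    first = rng.randrange(0, n - 2 * w + 1)        # start of the earlier window
    second = rng.randrange(first + w, n - w + 1)   # start of the later window
    out = list(clip)
    out[first:first + w] = clip[second:second + w]
    out[second:second + w] = clip[first:first + w]
    return out


def make_odd_batch(clips, jumble_frac=0.25, rng=random):
    """Jumble a fraction of the clips; label 1 marks a jumbled ('odd') clip."""
    k = max(1, int(len(clips) * jumble_frac))
    odd = set(rng.sample(range(len(clips)), k))
    batch = [swap_random_windows(c, rng=rng) if i in odd else list(c)
             for i, c in enumerate(clips)]
    labels = [1 if i in odd else 0 for i in range(len(clips))]
    return batch, labels


clip = list(range(20))
jumbled = swap_random_windows(clip, rng=random.Random(1))
batch, labels = make_odd_batch([list(range(20)) for _ in range(8)], rng=random.Random(2))
```

Swapping windows of equal length keeps the clip duration unchanged, so jumbled and intact clips cannot be told apart by length alone.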
5 Audio-visual Self-supervision for Speech Representation Learning
We combine the proposed audio and visual self-supervision methods by making the encoder jointly predict the visual self-supervision task and the audio self-supervision task. Since we used the same encoder architecture for both the visual and audio tasks, this is straightforward to accomplish. In the pipeline shown in Fig. 1 for visual self-supervision, we also use the optional prediction for the audio-only self-supervised task (either AoT or Odd). This leads to two losses being calculated, one for visual and one for audio self-supervision.
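The total loss is the weighted sum of the L1 reconstruction loss from visual self-supervision, $\mathcal{L}_{visual}$, and the cross entropy loss from the audio-only self-supervision, $\mathcal{L}_{audio}$. The weight factor $\lambda$ controls how much of the loss comes from each type of supervision; the video and audio weights sum to one (see Table III). The total loss is given by the equation:
$$\mathcal{L}_{total} = \lambda \, \mathcal{L}_{visual} + (1 - \lambda) \, \mathcal{L}_{audio} \qquad (1)$$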
We have two possible multimodal self-supervised models depending on the audio self-supervision type, namely L1 + AoT and L1 + Odd.
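The weighted sum of the two losses can be sketched as a single function. This sketch assumes that the video and audio weights sum to one, as the weight pairs in Table III suggest; the function and argument names are hypothetical.

```python
def total_loss(visual_l1, audio_ce, video_weight):
    """Weighted sum of the visual (L1) and audio (cross entropy) losses.

    Assumes the audio weight is 1 - video_weight, as in the Table III pairs.
    """
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")
    return video_weight * visual_l1 + (1.0 - video_weight) * audio_ce


# Toy values: the loss magnitudes below are illustrative, not measured.
combined = total_loss(visual_l1=2.0, audio_ce=4.0, video_weight=0.5)
```

Setting `video_weight` to 1.0 recovers purely visual self-supervision, and setting it to 0.0 recovers the audio-only task.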
6 Datasets and Baselines
This section introduces the various audio-only and audiovisual datasets that were used in the work either for pretraining or evaluating the baseline and proposed models. For all datasets, we divide the data into training, validation and test sets with all samples from each speaker belonging to a particular set only. Table I summarizes the statistics for all the datasets used in this work.
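A speaker-disjoint split of the kind described above can be sketched as follows. The split fractions, the random seed and the function name are assumptions chosen for illustration; the actual partition sizes used in this work are those in Table I.

```python
import random


def speaker_disjoint_split(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (speaker_id, item) pairs so that each speaker is in exactly one set."""
    speakers = sorted({spk for spk, _ in samples})
    random.Random(seed).shuffle(speakers)
    n_val = max(1, round(len(speakers) * val_frac))
    n_test = max(1, round(len(speakers) * test_frac))
    val_speakers = set(speakers[:n_val])
    test_speakers = set(speakers[n_val:n_val + n_test])
    split = {"train": [], "val": [], "test": []}
    for spk, item in samples:
        if spk in val_speakers:
            split["val"].append((spk, item))
        elif spk in test_speakers:
            split["test"].append((spk, item))
        else:
            split["train"].append((spk, item))
    return split


# Toy data: 10 hypothetical speakers with 10 utterances each.
samples = [("spk%d" % (i % 10), i) for i in range(100)]
split = speaker_disjoint_split(samples)
```

Splitting by speaker rather than by utterance prevents the classifier from exploiting speaker identity when it is evaluated.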
The CREMA-D dataset contains a diverse set of 91 actors who utter 12 sentences multiple times, each with a different level of intensity, for each of 6 basic emotional labels (anger, fear, disgust, neutral, happy, sad). We use CREMA-D as a discrete emotion recognition evaluation dataset.
The Ravdess dataset  contains 1440 samples of 24 different actors who acted out two sentences with 8 different basic emotions (anger, calm, sad, neutral, happy, disgusted, surprised, fear) and two different intensity levels. We use Ravdess also as a discrete emotion recognition evaluation dataset.
The IEMOCAP dataset contains dyadic conversations between 10 speakers for a total of 12 hours of audiovisual data. The discrete emotion labels comprise 8 categories (anger, happiness, sadness, neutral, excitement, frustration, fear, surprise); however, we only consider the first 4 categories for our experiments (anger, happiness, sadness, neutral). This is due to the much higher inter-annotator agreement for these categories, and this portion of the dataset has been similarly used in prior studies. This partition also leaves us with around 6.5 hours of data instead of the original 12 hours. We use IEMOCAP as another discrete emotion recognition evaluation dataset.
The SPC (Speech Commands) dataset  contains 64,727 total utterances of 30 different words by 1,881 speakers. We use SPC as a speech recognition evaluation dataset.
The GRID dataset contains audio-visual speech recordings of subjects in full frontal view. It has 33 speakers, each of whom speaks 1000 sentences containing six words. Every sentence in the GRID dataset follows a fixed word order: [command/colour/preposition/letter/digit/adverb]. An example sentence is “Bin blue at F 1 now”. We use GRID as an ASR evaluation dataset, and use only the audio modality for WER (word error rate) evaluation.
The LRW dataset  is a large, in-the-wild dataset of 500 different isolated words primarily from BBC recordings. It is an audiovisual speech dataset and is thus appropriate for training our methods. We use a subset of LRW that has only nearly frontal videos (with yaw, pitch and roll restricted to a maximum of 10 degrees), in order to have a cleaner supervisory signal from the visual modality. This filtering leaves us with a total of around 40 hours of usable data. We use LRW as the self-supervised pretraining dataset for all baseline and proposed methods.
Since the aim of our work is to yield self-supervised audio features, we compare against other baselines focusing on the same goal. The three methods we compare against are CPC , APC  and PASE .
Contrastive Predictive Coding (CPC) is a technique that tries to model a density ratio to maximize mutual information (MI) between the target signal (random raw audio window) and the context (current raw audio window). By maximizing the MI, the method can extract the underlying latent variables that the two different parts of the signal have in common.
Autoregressive Predictive Coding (APC)  is similar to CPC, however the key difference is that APC directly tries to predict the immediate future part of the signal based on the history whereas CPC tries to maximize mutual information between the target (future) and the context (present). The input features for APC are 80 dimensional log mel spectrograms with a window size of 25 ms and a step size of 10 ms. The model tries to predict the log mel spectrograms for the future windows given the history.
PASE  is a raw audio encoder trained in a self supervised way to predict various different handcrafted features such as MFCC, prosody, waveform etc. While predicting these multiple tasks, the encoder learns a very robust and multi-task representation for raw audio that these tasks exemplify (e.g. prosody for emotion).
We also compare our methods against 39 dimensional MFCCs (13 coefficients, 13 deltas, and 13 delta-deltas) which act as baseline features used for supervised learning for audio.
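The 39 dimensional layout can be illustrated by stacking 13 static coefficients with their deltas and delta-deltas. The sketch below uses the common regression-based delta with a half-window of N = 2 and edge replication; the text does not specify the delta formula, so this choice and the function names are assumptions.

```python
def deltas(frames, N=2):
    """Regression-based deltas over time with edge replication at the boundaries."""
    T = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(dim):
            num = sum(
                n * (frames[min(T - 1, t + n)][d] - frames[max(0, t - n)][d])
                for n in range(1, N + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out


def stack_mfcc_39(static):
    """Concatenate 13 static MFCCs with their deltas and delta-deltas per frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 10 frames whose 13 coefficients all increase linearly in time.
static = [[float(t)] * 13 for t in range(10)]
features = stack_mfcc_39(static)
```

For the linear ramp used here, the interior deltas equal the slope of the ramp, which gives a quick sanity check of the formula.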
Table I: Dataset statistics, given as number of samples / duration in hours for each split.
|Dataset||Train||Validation||Test|
|GRID||31639 / 26.4||6999 / 5.80||9976 / 8.31|
|LRW||112658 / 36.3||5870 / 1.90||5980 / 1.90|
|CREMA-D||11594 / 9.70||819 / 0.70||820 / 0.68|
|Ravdess||1509 / 1.76||415 / 0.48||519 / 0.60|
|SPC||51094 / 14.2||6798 / 1.88||6835 / 1.89|
|IEMOCAP||3548 / 4.28||793 / 0.95||942 / 1.31|
7 Experiments and Results
Table II: Results for emotion recognition and speech recognition using the extracted features.
|Self Supervised Methods||Emotion Recognition||Speech Recognition|
|Dataset||CREMA-D||Ravdess||IEMOCAP||GRID||SPC|
|Classifier for (t, dim) features||LSTM||LSTM||LSTM||ESPNet||LSTM|
|Labels||6 emotions||8 emotions||4 emotions||ASR Text||30 words|
|Method||Supervision||Dim.||Accuracy (↑)||Accuracy (↑)||Accuracy (↑)||WER (↓)||Accuracy (↑)|
|L1 + AoT||Audio+Visual||512||49.27||45.86||47.38||4.1||92.49|
|L1 + Odd||Audio+Visual||512||53.17||42.77||47.91||3.8||92.28|
This section presents the details of all experiments that we perform to rigorously validate our proposed method. We present all results for speech and emotion recognition from the extracted features in Table II, for both visual and audio self-supervision for all variants of the models. We also show the results with the combination of the visual and audio self-supervision approaches using multi-task learning. We present numerous ablation studies such as the variation of model performance with change in pretraining set size and noise level. We also compare the frozen encoders (trained on the extracted features) with their finetuned and fully supervised (trained from scratch) equivalents.
7.1 Experimental Setup
We evaluate all extracted features on: (i) Discrete Emotion Recognition and (ii) Automatic Speech Recognition (ASR).
For the emotion recognition task, we first perform self-supervised pretraining on LRW as described, and then use the pretrained models as feature extractors on the CREMA, Ravdess and IEMOCAP datasets. Once we have these features, we then train an LSTM model for the emotion classification task. We opted to use an LSTM for simplicity; however, it can be replaced by any model that can classify variable length sequences into discrete categories (such as BiGRUs, TCNs, LiGRUs). For our experiments, we use a 2 layer LSTM with 256 units in each layer. The initial learning rate is 0.0001 and is decayed by a factor of 0.1 every 30 epochs. We train the LSTM for 100 epochs and use the checkpoint from the epoch which gives the best validation accuracy for evaluation on the test set. We pass the last hidden state of the LSTM to a linear layer with size equal to the number of target classes (6 for CREMA, 8 for Ravdess, 4 for IEMOCAP) followed by a Softmax layer with a cross entropy loss for emotion classification. This exact same process (self-supervised feature extraction + LSTM training) is followed for all the methods being compared (as shown at the bottom of Fig. 1).
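The training schedule and model selection described above can be sketched in a few helper functions. This is a framework-free illustration; the function names are hypothetical and the validation accuracies in the example are made-up toy values, not results from this work.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Cross entropy loss for one example given the linear layer outputs."""
    return -math.log(softmax(logits)[target_index])


def step_decay_lr(epoch, base_lr=1e-4, factor=0.1, every=30):
    """Learning rate after decaying by `factor` every `every` epochs."""
    return base_lr * factor ** (epoch // every)


def best_epoch(val_accuracies):
    """Index of the epoch whose checkpoint would be used on the test set."""
    return max(range(len(val_accuracies)), key=lambda e: val_accuracies[e])


# Toy values only: three epochs of hypothetical validation accuracy.
chosen = best_epoch([0.30, 0.50, 0.40])
lr_at_epoch_65 = step_decay_lr(65)   # two decays have been applied by epoch 65
```

With 100 epochs and a decay every 30 epochs, the learning rate takes four distinct values over the course of training.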
For the speech recognition task, we use the GRID and SPC datasets to evaluate our methods. For the SPC dataset which is a spoken word classification task with 30 different possible labels, we use the exact same protocol as described for emotion recognition (self-supervised feature extraction + LSTM training). We use the same parameters and learning schedule for the LSTM. However, for the GRID dataset, we have a continuous ASR task instead of classification (i.e. we need to decode the full sentence for every utterance instead of just assigning it a class label). Thus we need to change the evaluation pipeline in order to do WER (word error rate) evaluation instead of classification. For this, we use the extracted features converted to Kaldi format and employ the ESPNet  toolkit for the end-to-end ASR training. We use a hybrid CTC/attention based ASR model with the default ESPNet parameters with a BLSTM encoder (as used similarly in ) with 320 units and location aware attention. We train the model for 15 epochs. For decoding, we use a beam search with a beamsize of 20 and a CTC weight of 0.1.
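For reference, WER is the word-level edit distance between the reference and the decoded sentence, divided by the number of reference words. The sketch below is a generic implementation, not the scoring code of ESPNet, and it reports WER as a percentage, which is an assumption about the units used in the results.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# GRID-style sentence with one substituted word out of six.
wer = word_error_rate("bin blue at f 1 now", "bin blue at f 2 now")
```

Since every GRID sentence has six words, a single word error in a sentence corresponds to a sentence-level WER of about 16.7%.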
7.2 Results with Visual Self-Supervision (L1)
Our method for visual self-supervision by face reconstruction from audio is based on an L1 reconstruction loss, and is indicated as L1 throughout the results in Table II. For emotion recognition, irrespective of dataset, our method performs better than any audio self-supervised baseline. On CREMA, L1 achieves an accuracy of 51.09%. The best performing baseline is PASE which achieves an accuracy of 43.16%. For Ravdess, APC is the best baseline with an accuracy of 34.36%, but L1 with an accuracy of 46.05% significantly outperforms this. The same trend can be seen for IEMOCAP, with L1 again being the best performing method with an accuracy of 46.34%. For speech recognition on GRID, L1 is again the best performing method with a WER of 4.5, which is slightly better than the result attained when using MFCCs (WER 4.7). For SPC, L1 is again the best self-supervised method with an accuracy of 90.05%, which is closest to the performance of MFCCs (optimised for speech recognition) at 91.06%.
In summary, the proposed method for visual self-supervision leads to features that significantly outperform those from baseline audio self-supervised methods for both emotion recognition and speech recognition.
7.3 Results with Audio-only Self-Supervision (AoT and Odd)
Our methods for audio-only self-supervision, Arrow of Time and Odd One Out, are respectively indicated as AoT and Odd throughout the discussion of the results. For emotion recognition, AoT is the best performing audio-only self-supervised method on CREMA (achieving an emotion recognition accuracy of 48.78%), while Odd is the best on IEMOCAP (achieving an emotion recognition accuracy of 45.14%). The two perform almost identically on Ravdess (AoT: 39.50%; Odd: 39.49%). For speech recognition, Odd is the best method both on SPC (89.29%) and GRID (WER 5.1). Apart from the MFCC features, PASE is the closest competing method for ASR.
When comparing between the two proposed methods, Odd and AoT seem to be very close in performance on emotion recognition, but Odd seems to slightly outperform AoT on speech recognition (likely due to being a more refined pretext task). Both methods outperform baselines for audio-only self-supervision, however when compared to the L1 method using visual self-supervision, they fall short for all evaluated unimodal experiments. This leads to the observation that the proposed visual self-supervision approach yields better features than all proposed audio-only self-supervised approaches. There is also a performance gap between the proposed unimodal self-supervised methods and MFCCs for speech recognition. We attempt to bridge this gap and yield better features using a multimodal combination of the proposed methods using multi-task learning.
7.4 Results with Audiovisual Self-Supervision (L1 + AoT and L1 + Odd)
In order to determine the optimal weights for each modality for multi-task learning, we tune the weight parameter (equation 1) on the validation sets of the CREMA, Ravdess and SPC datasets (introduced in section 6) for a range of values. The results of the tuning are in Table III. From the table, we observe that the best value of the video weight is 0.67 (audio weight 0.33) for both L1 + AoT and L1 + Odd. Thus, we use the models trained with this value when evaluating on the test sets in all experiments.
Compared with the other results in Table II, audiovisual self-supervision brings a clear improvement. L1 + AoT and L1 + Odd significantly outperform all other methods in every experiment. For emotion recognition, L1 + Odd is the best-performing method on CREMA (accuracy of 53.17%) and IEMOCAP (accuracy of 47.91%), while L1 + AoT is the best-performing method on Ravdess (accuracy of 45.86%). For speech recognition, L1 + AoT is the best-performing method on SPC (accuracy of 92.49%), while L1 + Odd is the best-performing method on GRID (WER 3.8). The significant result here is for speech recognition, in which these methods outperform MFCCs, which neither unimodal method had done. This suggests that the two types of supervision from the two modalities encode complementary information, which leads to audio representations that generalize very well. In summary, multimodal self-supervision methods clearly outperform any unimodal self-supervision method.
Table III: MTL weight tuning for audio and visual tasks (validation sets). All (t, dim) features are classified with an LSTM; accuracies are in %.
|Method||Video weight||Audio weight||Dim.||CREMA accuracy (emotion recognition, 6 emotions)||Ravdess accuracy (emotion recognition, 8 emotions)||SPC accuracy (speech recognition, 30 words)|
|L1 + AoT||0.17||0.83||512||46.22||38.61||88.74|
|L1 + AoT||0.33||0.67||512||47.91||40.18||89.28|
|L1 + AoT||0.50||0.50||512||51.50||43.03||90.36|
|L1 + AoT||0.67||0.33||512||51.77||44.39||91.94|
|L1 + AoT||0.83||0.17||512||48.93||40.40||90.79|
|L1 + Odd||0.17||0.83||512||48.91||42.11||89.97|
|L1 + Odd||0.33||0.67||512||47.48||39.81||88.39|
|L1 + Odd||0.50||0.50||512||50.73||43.26||90.78|
|L1 + Odd||0.67||0.33||512||52.81||44.32||92.17|
|L1 + Odd||0.83||0.17||512||51.17||42.41||91.31|
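As a rough illustration of the weighting scheme tuned in Table III, the sketch below assumes that the multi-task objective is a convex combination of the visual and audio pretext losses. The two weights in the table always sum to one, but the exact form of equation 1 is not reproduced in this section, so that form is an assumption. The sketch also picks the video weight with the highest mean validation accuracy over the three datasets, which is one plausible selection rule and not necessarily the one used here. The names `combined_loss` and `best_video_weight` are hypothetical.

```python
# Validation accuracies (%) copied from Table III:
# video weight -> (CREMA, Ravdess, SPC). The audio weight is 1 - video weight.
TABLE_III = {
    "L1 + AoT": {
        0.17: (46.22, 38.61, 88.74),
        0.33: (47.91, 40.18, 89.28),
        0.50: (51.50, 43.03, 90.36),
        0.67: (51.77, 44.39, 91.94),
        0.83: (48.93, 40.40, 90.79),
    },
    "L1 + Odd": {
        0.17: (48.91, 42.11, 89.97),
        0.33: (47.48, 39.81, 88.39),
        0.50: (50.73, 43.26, 90.78),
        0.67: (52.81, 44.32, 92.17),
        0.83: (51.17, 42.41, 91.31),
    },
}


def combined_loss(video_loss, audio_loss, video_weight):
    """Assumed form of the multi-task loss: a convex combination of both tasks."""
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")
    return video_weight * video_loss + (1.0 - video_weight) * audio_loss


def best_video_weight(results):
    """Return the video weight with the highest mean accuracy across datasets."""
    return max(results, key=lambda w: sum(results[w]) / len(results[w]))


for method, results in TABLE_III.items():
    print(method, "-> selected video weight:", best_video_weight(results))
```

Under this mean-accuracy rule, both model families select a video weight of 0.67, in agreement with the value reported above.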
7.5 Performance in various levels of noise
To further validate the robustness of our proposed models, we investigate their performance under various levels of noise. We create noisy versions of the CREMA and SPC datasets by adding babble noise from the NOISEX database, while varying the SNR from -5 dB to 20 dB in steps of 5 dB. We compare the best-performing proposed methods: (i) L1 using visual self-supervision, (ii) Odd using audio-only self-supervision, and (iii) L1 + Odd for bimodal self-supervision. We examine how the performance of the three methods varies as the level of added noise changes in the evaluation datasets.
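The exact mixing procedure is not detailed here. A common approach, sketched below under that assumption, scales the noise so that the ratio of speech power to noise power matches the target SNR before adding it to the clean waveform. The helper names are hypothetical, and waveforms are represented as plain lists of floats to keep the sketch free of dependencies.

```python
import math

# SNR grid described above: -5 dB to 20 dB in steps of 5 dB.
SNR_LEVELS_DB = list(range(-5, 25, 5))


def mean_power(signal):
    """Average power (mean of squared samples) of a waveform."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so that the mixture has the requested SNR."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise recording so that it covers the whole utterance.
    looped = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = mean_power(speech)
    noise_power = mean_power(looped)
    if speech_power == 0.0 or noise_power == 0.0:
        raise ValueError("speech and noise must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / (scale^2 * P_noise)), solved for scale.
    scale = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, looped)]


def measured_snr_db(clean, noisy):
    """SNR of `noisy` with respect to `clean`, in dB (sanity check)."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(residual))
```

In practice the noise segment would typically be drawn at a random offset from the babble recording for each utterance; the fixed looping here is only for brevity.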
The results for the experiments with added noise can be seen in Fig. 4. For both datasets, we observe that the audiovisual combination outperforms the unimodal methods. For emotion recognition on CREMA, the audio features give the best performance on the clean dataset, and there is a linear degradation of performance as the noise increases. Visual self-supervision is more effective than audio-only self-supervision in almost all scenarios, which may be expected, as visual features are unaffected by auditory noise and can still drive robust learning of audio features. A similar conclusion can be reached for speech recognition. Audio-visual self-supervision nevertheless leads to the best audio representations across all noise levels. The results for speech recognition are also significantly more robust to noise than those for emotion recognition: the decrease in performance with increasing noise is not very sharp until the extremely high noise levels at 0 dB and -5 dB. We can also notice a sharper degradation for the audio-only results when compared to both the video-only and audiovisual results. This suggests that audio features obtained by visual or audio-visual self-supervision are more robust to noise than those obtained by audio-only self-supervision.
7.6 Performance with various sizes of the pretraining set
It is also interesting to see how model performance varies with the amount of data used for self-supervised pretraining. One of the most important advantages of a self-supervised learning approach is the ability to use an arbitrarily large amount of unlabeled data to learn a good representation. However, there is still a tradeoff to be made between training time and model performance. We used a subset of the LRW dataset with a total pretraining size of 112658 samples (36.3 hours of audiovisual speech) to train our full model. For this experiment, we investigate what happens to our model if we use only a fraction of the total available data for pretraining. We use 0.2, 0.4, 0.6 and 0.8 times the dataset size for pretraining. We compare the performance of: (i) L1 for visual self-supervision, (ii) Odd for audio-only self-supervision, and (iii) L1 + Odd for audio-visual self-supervision. As before, we evaluate on the CREMA dataset for emotion recognition and the SPC dataset for speech recognition.
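How the fractional pretraining subsets were drawn is not specified. The sketch below assumes nested random subsets taken from one fixed shuffle of the sample indices, so that every smaller subset is contained in the larger ones and the comparison across sizes is not confounded by different samples. The function name `nested_subsets` and the seed are hypothetical, and the hour estimates assume clips of roughly equal length.

```python
import random

TOTAL_SAMPLES = 112658  # size of the LRW subset used for full pretraining
TOTAL_HOURS = 36.3      # duration of that subset
FRACTIONS = (0.2, 0.4, 0.6, 0.8, 1.0)


def nested_subsets(num_samples, fractions, seed=0):
    """Map each fraction to a prefix of one fixed random ordering of indices."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)
    return {f: order[: round(f * num_samples)] for f in fractions}


subsets = nested_subsets(TOTAL_SAMPLES, FRACTIONS)
for fraction, indices in subsets.items():
    hours = fraction * TOTAL_HOURS  # assumes roughly equal clip lengths
    print(f"{fraction:.0%}: {len(indices)} samples, about {hours:.1f} hours")
```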
The results for the experiments with different pretraining set sizes can be seen in Fig. 5. For emotion recognition on CREMA, there is a steady and somewhat linear degradation in performance as the amount of pretraining data is reduced. The best performance drops from an accuracy of 53.17% with the full dataset to an accuracy of 43.91% with 20% of the dataset. For speech recognition on SPC, the degradation is slower for both the visual and audiovisual methods, but very sharp for the audio-only method as the training data shrinks. With 20% of the training data, the performance of the audio-only method drops to an accuracy of 51.67%, which is a massive gap from the accuracy of 76.60% for the audiovisual method in the same setting. This shows that the method with combined audio and visual self-supervision is more robust to varying amounts of pretraining data. Another observation is that a larger amount of pretraining data helps learn better features: the gain is significant initially but starts to plateau after a point (80% of the data). Fig. 5 also shows that with just 40% / 60% of the LRW dataset for pretraining, the audio-visual self-supervised methods achieve performance similar (42% for emotion, 89% for ASR) to that of PASE, CPC and APC, which use the full pretraining set (see Table II). This is a very interesting result because it shows that our proposed self-supervised methods require less pretraining data than the other tested self-supervised methods to achieve competitive performance.
7.7 Comparison with finetuned and supervised versions of encoders
All the results presented thus far were obtained with the frozen versions of the encoders, i.e. the encoders with their fixed weights were used as feature extractors on the evaluation datasets before training a classifier. However, these encoders can also be fine-tuned on the target dataset. We present results for finetuning in Table IV. We use the weights from the L1 + Odd model as the initialization for the encoder before training on the target datasets in an end-to-end manner. We vary both the learning rate of the encoder (Enc LR) and that of the LSTM classifier (Cls LR). We set the Enc LR to 10e-6 or 10e-4, and the Cls LR to 10e-4 or 10e-3, which gives us 4 different sets of hyperparameters to compare. Note that the frozen versions of the encoders presented in Table II can be interpreted as having an encoder learning rate of 0, and we present those as baseline results in Table IV, which we aim to improve upon by finetuning. We observe that finetuning can give significantly better results, depending on the problem setting. The best result that we get on CREMA is an accuracy of 58.90%, which is better than the 53.17% attained using just the frozen encoder. For SPC, we get an accuracy of 93.56%, which is also the best result seen so far in all experiments. The largest observed gain is on the Ravdess dataset, on which we get an accuracy of 64.35%, a gain of roughly 20 percentage points over the frozen variants.
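The finetuning setup above amounts to using separate learning rates for two parameter groups, with the frozen encoder as the special case of an encoder learning rate of 0. The sketch below illustrates this with a plain gradient step on lists of floats; the optimizer actually used is not stated in this section, so plain SGD is an assumption, and all names are hypothetical. The learning rates are typed exactly as they are written in Table IV; note that Python reads `10e-6` as 1e-5, and whether the intended value is 10^-6 is not stated.

```python
from itertools import product

# Learning rates typed as they appear in Table IV (Python reads 10e-6 as 1e-5).
ENC_LRS = (10e-6, 10e-4)
CLS_LRS = (10e-4, 10e-3)


def finetuning_grid():
    """The four (encoder LR, classifier LR) combinations compared in Table IV."""
    return [{"enc_lr": e, "cls_lr": c} for e, c in product(ENC_LRS, CLS_LRS)]


def sgd_step(params, grads, lr):
    """One plain gradient-descent update (assumed optimizer, for illustration)."""
    return [p - lr * g for p, g in zip(params, grads)]


def update(encoder, classifier, enc_grads, cls_grads, config):
    """Update the two parameter groups with their own learning rates."""
    new_encoder = sgd_step(encoder, enc_grads, config["enc_lr"])
    new_classifier = sgd_step(classifier, cls_grads, config["cls_lr"])
    return new_encoder, new_classifier


# The frozen baseline corresponds to an encoder learning rate of 0.
FROZEN = {"enc_lr": 0.0, "cls_lr": 10e-4}
```

In a deep learning framework, the same idea is usually expressed by giving the optimizer one parameter group per module, each with its own learning rate.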
It is also interesting to compare our methods with a supervised version of an encoder with the same architecture trained from scratch directly on the target dataset (see Table IV). The only difference is the weight initialization, for which we use random initialization for each layer. We compare the supervised version to the finetuned version for the exact same hyperparameter sets (Enc LR and Cls LR). We find that for the first two parameter sets, the supervised model is not able to learn any useful features at all and attains performance close to chance. For the other parameter sets, with Enc LR = 10e-4, the supervised model does converge and offers good performance. However, this is still significantly worse than that obtained by the finetuned version (see Table IV). This is clear evidence that our self-supervised pretraining yields a much better weight initialization, one that is likely to converge for a wider variety of hyperparameters when training on downstream tasks that have smaller datasets.
We observe from Fig. 6 that training a fully supervised model from scratch without self-supervised pretraining is more susceptible to overfitting and non-convergence for certain hyperparameters. It also results in significantly slower training. By finetuning our model, we achieve much better performance for every tested parameter set, despite training for only half the number of epochs.
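The comparison can be checked directly against the numbers reported in Table IV. The short script below only re-reads those accuracies and introduces no new results; the dictionary layout and function names are hypothetical, and the learning rates are kept as strings exactly as written in the table.

```python
DATASETS = ("CREMA", "Ravdess", "SPC")

# (Enc LR, Cls LR) -> accuracies in % on (CREMA, Ravdess, SPC), from Table IV.
SUPERVISED = {
    ("10e-6", "10e-3"): (12.43, 15.22, 5.31),
    ("10e-6", "10e-4"): (17.08, 13.31, 11.38),
    ("10e-4", "10e-3"): (50.21, 44.68, 91.80),
    ("10e-4", "10e-4"): (53.19, 52.68, 92.11),
}
FINETUNED = {
    ("10e-6", "10e-3"): (50.49, 43.93, 92.64),
    ("10e-6", "10e-4"): (53.17, 50.67, 92.76),
    ("10e-4", "10e-3"): (58.78, 57.99, 93.41),
    ("10e-4", "10e-4"): (58.90, 64.35, 93.56),
}
FROZEN = {
    "L1 + AoT": (49.27, 45.86, 92.49),
    "L1 + Odd": (53.17, 42.77, 92.28),
}


def finetuned_minus_supervised():
    """Per-configuration accuracy difference (percentage points) per dataset."""
    return {
        cfg: tuple(round(f - s, 2) for f, s in zip(FINETUNED[cfg], SUPERVISED[cfg]))
        for cfg in FINETUNED
    }


def best_gain_over_frozen():
    """Best finetuned accuracy minus best frozen accuracy, per dataset."""
    gains = {}
    for i, name in enumerate(DATASETS):
        best_finetuned = max(acc[i] for acc in FINETUNED.values())
        best_frozen = max(acc[i] for acc in FROZEN.values())
        gains[name] = round(best_finetuned - best_frozen, 2)
    return gains


print(finetuned_minus_supervised())
print(best_gain_over_frozen())
```

Every difference in the first dictionary is positive, i.e. the finetuned model is ahead of the supervised one for all four hyperparameter sets on all three datasets, while the gain over the best frozen encoder is largest on Ravdess.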
There are a number of interesting observations from the experiments.
Table IV: Frozen vs finetuned vs supervised encoders. All (t, dim) features are classified with an LSTM; accuracies are in %.
|Method||Enc LR||Cls LR||Epochs||CREMA accuracy (emotion recognition, 6 emotions)||Ravdess accuracy (emotion recognition, 8 emotions)||SPC accuracy (ASR, 30 words)|
|Mel encoder (Supervised)||10e-6||10e-3||100||12.43||15.22||5.31|
|Mel encoder (Supervised)||10e-6||10e-4||100||17.08||13.31||11.38|
|Mel encoder (Supervised)||10e-4||10e-3||100||50.21||44.68||91.80|
|Mel encoder (Supervised)||10e-4||10e-4||100||53.19||52.68||92.11|
|L1 + AoT (Frozen)||0||10e-4||100||49.27||45.86||92.49|
|L1 + Odd (Frozen)||0||10e-4||100||53.17||42.77||92.28|
|L1 + Odd (Finetuned)||10e-6||10e-3||50||50.49||43.93||92.64|
|L1 + Odd (Finetuned)||10e-6||10e-4||50||53.17||50.67||92.76|
|L1 + Odd (Finetuned)||10e-4||10e-3||50||58.78||57.99||93.41|
|L1 + Odd (Finetuned)||10e-4||10e-4||50||58.90||64.35||93.56|
The audio-only self-supervised methods outperform the existing self-supervised baselines. There is also a clear observation that visual self-supervision is vastly superior when compared to audio-only self-supervision, both for baseline and proposed methods. This is largely due to the fact that the audio features obtained by visual self-supervision are closely related to the useful information present in lip movements and facial expressions (because they must encode this information for accurate facial reconstruction during pretraining). This property is also especially useful for emotion recognition due to the correlation between emotion and facial expression information, and for speech recognition due to the information from lip movements.
It is also clear that the models that have been trained using a combination of audio and visual self-supervision are able to encode complementary information from each modality to yield the best possible representations among all tested methods in this work. These representations are also the most robust in the presence of various levels of noise in the data, and offer the best performance independently of the size of the pretraining set. This is perhaps the most useful finding of this work, with the implication being that any problem using any sort of speech data can benefit greatly from using visual supervision from available audiovisual speech datasets to enhance the target representations.
Another very useful finding of the work is that fine-tuning the pretrained audiovisual self-supervised models offers not only better performance, but also faster training and convergence for a variety of hyperparameters when compared to training a fully supervised model from scratch. This could be useful for setting a strong baseline for other speech-related problems and for cutting down training time on downstream tasks with small datasets, which is a typical problem setting in various domains. We will make our pretrained models publicly available so as to enable the community to pursue further research on these problems.
A current limitation of our work is that we use a nearly-frontal subset of the LRW dataset (with yaw, pitch and roll restricted to a maximum of 10 degrees each) for pretraining. This leaves out a large portion of the audiovisual dataset with profile faces, which could also contain useful visual supervisory signals. There are also other, larger datasets like AVSpeech which could potentially yield better pretrained models. Another limitation is that we have used a very simple 2-layer LSTM with 256 hidden units as the classifier for our audio classification tasks. This was chosen for simplicity and might not be the optimal method or configuration. Other models such as BiGRUs, LiGRUs or temporal convolutional networks (TCNs) in different configurations may yield even better results. Another possible limitation is that the models in our work start from log mel spectrograms instead of raw audio (as input to the audio encoder). A static frequency-domain transformation is applied to the raw audio to yield the spectrogram representation, whereas a more refined approach might be to use a set of trainable filters (e.g. as used in SincNet) instead of static Mel filters. In summary, the methods presented in this paper demonstrate the principle that visual and bimodal self-supervision lead to much better performance than full supervision from scratch. However, more refined approaches may result in even better performance than that presented here.
8.2 Future work
In this work, we have considered the interaction between the audio and visual modalities and how visual self-supervision can benefit the learning of audio features. There are also other modalities that could be considered, especially the text modality. Multimodal human language comprises text, audio and video combined, and a self-supervised model that can capture the interactions between the three could be very useful. The visual pretext task that we focused on was facial reconstruction optimized by the L1 loss. This process leads to a very realistic facial animation, but this might not be the most desirable objective for learning the best features. Realistic reconstruction needs to capture a lot of additional information related to fine-grained visual characteristics. Much of this information might not be useful if our end goal is simply to learn useful audio representations. Although reconstruction does give us very good performance, the question of what a good alternative or additional visual pretext task might be remains open. This work has also focused on audio features alone. It would be interesting to see how audio self-supervision could guide the learning of visual speech features in an analogous way (by predicting the audio waveform from only the visual modality). These visual features could then be used for problems like facial affect recognition and lipreading, or even combined with our proposed audio features.
Abhinav Shukla’s work was supported by a PhD scholarship from Samsung Electronics UK.
- (2019) Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667.
- (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (4), pp. 335.
- (2014) CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing 5 (4), pp. 377–390.
- Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149.
- Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 2041–2053.
- (2016) Lip reading in the wild. In ACCV.
- An unsupervised autoregressive model for speech representation learning. arXiv:1904.03240.
- (2009) Detecting depression from facial actions and vocal prosody. In 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, pp. 1–7.
- (2006) An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America 120 (5), pp. 2421–2424.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
- (2015) Unsupervised visual representation learning by context prediction. In ICCV, pp. 1422–1430.
- (2019) Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10541–10551.
- (2018) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. SIGGRAPH.
- (2017) Self-supervised video representation learning with odd-one-out networks. In CVPR, pp. 3636–3645.
- (2018) Unsupervised representation learning by predicting image rotations. arXiv:1803.07728.
- (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722.
- (2019) Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005.
- (2019) Deep affect prediction in-the-wild: Aff-Wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, pp. 1–23.
- (2018) Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, pp. 7763–7774.
- (2019) SeCoST: sequential co-supervision for weakly labeled audio event detection. arXiv preprint arXiv:1910.11789.
- (2018) The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PloS one 13 (5), pp. e0196391.
- Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowledge-Based Systems 161, pp. 124–133.
- (1976) Hearing lips and seeing voices. Nature 264 (5588), pp. 746.
- (2019) Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991.
- (2020) Audio-visual instance discrimination with cross-modal agreement. arXiv preprint arXiv:2004.12943.
- (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- (2018) Audio-visual scene analysis with self-supervised multisensory features. arXiv:1804.03641.
- (2018) Learning sight from sound: ambient sound provides supervision for visual learning. IJCV 126 (10), pp. 1120–1137.
- (2019) Learning problem-agnostic speech representations from multiple self-supervised tasks. arXiv:1904.03416.
- (2020) Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298.
- (2018) Deep contextualized word representations. arXiv:1802.05365.
- (2015) Prediction-based audiovisual fusion for classification of non-linguistic vocalisations. IEEE Transactions on Affective Computing 7 (1), pp. 45–58.
- (2018) End-to-end audiovisual speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6548–6552.
- (2019) End-to-end visual speech recognition for small-scale datasets. arXiv preprint arXiv:1904.01954.
- (2019) Found in translation: learning robust joint representations by cyclic translations between modalities. In AAAI, Vol. 33, pp. 6892–6899.
- (2020) Evolving losses for unsupervised video representation learning. arXiv preprint arXiv:2002.12177.
- (2019) Learning audio representations via phase prediction. arXiv preprint arXiv:1910.11910.
- (2018) Learning speaker representations with mutual information. arXiv:1812.00271.
- (2018) Interpretable convolutional filters with SincNet. IEEE SLT Workshop.
- Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2), pp. 92–102.
- (2020) Unsupervised pretraining transfers well across languages. arXiv preprint arXiv:2002.02848.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
- (2019) Wav2vec: unsupervised pre-training for speech recognition. arXiv:1904.05862.
- (2020) Recognition of advertisement emotions with application to computational advertising. IEEE Transactions on Affective Computing.
- (2017) Affect recognition in ads with application to computational advertising. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 1148–1156.
- (2020) Visually guided self supervised learning of speech representations. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- (2019) Self-supervised audio representation learning for mobile devices. arXiv:1905.11796.
- (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849.
- (1993) Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication 12 (3), pp. 247–251.
- (2018) End-to-end speech-driven facial animation with temporal GANs. Proceedings of the British Conference on Machine Vision (BMVC).
- (2019) Video-driven speech reconstruction using generative adversarial networks. Proc. Interspeech 2019, pp. 4125–4129.
- (2019) Realistic speech-driven facial animation with GANs. International Journal of Computer Vision, pp. 1–16.
- Learning correspondence from the cycle-consistency of time. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2566–2576.
- (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
- (2018) ESPnet: end-to-end speech processing toolkit. In Interspeech.
- (2018) Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8060.
- (2019) Self-training with noisy student improves ImageNet classification. arXiv preprint arXiv:1911.04252.
- (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2236–2246.
- (2019) S4L: self-supervised semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476–1485.
- (2020) Deep audio-visual learning: a survey. arXiv preprint arXiv:2001.04758.