The coupling of deep neural networks with large-scale labelled training datasets has produced a number of notable successes, yielding improved performance in speech-related tasks such as ASR and speaker verification [27, 30]. However, the considerable cost of manually producing such labels ultimately limits the potential of fully supervised approaches. By contrast, methods that can learn effective representations from data with few labelled examples can in principle benefit from the ever-increasing quantity of existing unlabelled speech data.
The objective of this paper is to develop one such method for learning compact and robust representations of speaker identity without supervision. These representations can then be used for a number of downstream tasks such as speaker recognition, clustering and diarisation. To achieve this goal, we propose to exploit the natural synchrony between faces and audio in audio-visual video data as a supervisory signal, removing the need for speaker annotation. To facilitate our method, we assume access to a large-scale collection of unlabelled speaking face-tracks, which can be readily obtained through self-supervised techniques for active speaker detection. Beyond access to this data, our approach makes use of two weak statistical cues to define a self-supervised learning objective (Fig. 1): we assume that faces and voices extracted within a face-track at small offsets are likely to share speaker identity but differ in linguistic content, while faces and voices from different face-tracks are likely to differ in both. As we show in Sec. 3, these cues can be combined to learn representations of speaker identity that minimise their dependence on linguistic content. The motivation for doing so is simple: unlike earlier datasets such as TIMIT [17] that are carefully balanced for phonetic and dialectal coverage, more modern (and larger) datasets created from uncontrolled speech ‘in the wild’ are likely to contain a strong correlation between identity and linguistic content. For example, VoxCeleb2 [9] consists of interviews with famous celebrities from a wide variety of professions, whose speech can be closely tied to their occupation: the cricketer Adam Gilchrist says the word ‘cricket’ 17 times and ‘president’ 0 times, whereas the politician Nancy Pelosi says the word ‘President’ 88 times, ‘Democrats’ 19 times and ‘cricket’ 0 times.
Consequently a model trained to represent identity may be incentivised to use linguistic content as a discriminative cue. While some coupling between content and identity is natural, over-reliance on content can prevent generalisation to new settings, harming robustness. More broadly, disentangled representations can, in principle, achieve an exponential improvement in generalisation efficiency over their entangled counterparts, because they are able to represent novel combinations of factors that were encountered separately (but never in combination) during training.
In this work, we make the following contributions: (1) We propose a novel framework for learning speech representations capturing information at different time scales in the speech signal, including in particular the identity of the speaker; (2) we show that we can learn these representations from a large, unlabelled collection of talking faces in videos as a source of free supervision, without the need for any manual annotation; (3) we show that sharing a trunk architecture for two different tasks (content and speaker identity) and adding an explicit disentanglement objective between the two improves performance; and, (4) we evaluate the performance of our self-supervised embeddings on the popular VoxCeleb1 speaker recognition benchmark and compare to fully supervised methods. All data, code and models will be released.
2 Related Work
Representation Learning. The ability to represent variable-length high-dimensional audio segments using compact, fixed-length representations has proven useful for many speech applications such as speaker verification [30, 9], audio emotion classification [1], and spoken term detection (STD) [21], where the representation can be coupled with a standard classifier. The use of fixed-length representations also enables efficient storage and retrieval when paired with an inverted index. These representations can either be hand-crafted, such as MFCCs, or learned from data, such as i-vectors and deep neural network embeddings. While the former may fail to capture the correct underlying factors for the task, the latter require large amounts of expensively labelled training data to be effective. As a consequence, there has recently been renewed interest in learning unsupervised audio representations.
Disentangled Representation Learning. Motivated by their attractive compositional properties and theoretical ability to generalise efficiently, a number of models that seek to learn disentangled representations in a weakly supervised or self-supervised manner have been proposed, such as DC-IGN [20], InfoGAN [7] and VQ-VAE [29]. Due to the proliferation of video data, there has also been renewed interest in learning representations from sequential data [15, 12, 11, 16]. These self-supervised works focus on predicting future, missing or contextual information, all within the same modality. However, to the best of our knowledge, no prior method has sought to learn disentangled representations through cross-modal self-supervision.
Audio-Visual Self-Supervision. A number of recent works [4, 3, 5, 25, 19] have explored the concept of exploiting the correspondence between synchronous audio and visual data, either in teacher-student style architectures (where the ‘teacher’ is a pretrained network) [4, 5] or in two-stream networks where both networks are trained from scratch [3, 10]. Additional work has examined cross-modal relationships between faces and voices specifically, in order to learn identity [23, 22, 18] or emotion [1] representations. In contrast to these works, we aim to learn representations of both content and identity, with a view to explicitly disentangling the two factors; we compare our approach with theirs in Sec. 4.
3 Method
Speech, like many sequential natural signals, can be decomposed into the interaction of several largely-independent causal factors which express themselves over different time scales. The central observation that underpins our approach is that the speaker identity affects fundamental frequency, pitch and volume at the utterance level while linguistic content affects spectral contour and duration of formants more locally.
Without labels, we have no way to directly separate these factors. Instead, we can impose our prior knowledge as to how such representations should behave. Intuitively, representations of identity should change slowly over time (remaining constant for a given speaker), whereas representations of content should change quickly, capturing the local variation in the speech signal. Concretely, we enforce these properties by exploiting the known correspondence between a speech signal and the face of its speaker within a face-track to impose three constraints on the representations for content and identity:
Content constraints. Within a given speaking face-track, speech and face signals extracted concurrently contain redundant (or overlapping) linguistic content (while this information is trivially available in the speech signal, it is perhaps less obvious that it is also present in the face; in fact, it is this signal that enables lip reading). By contrast, the face signal at a small temporal offset from the speech signal is likely to convey different linguistic content. These cues provide a natural source of paired data (positive and negative examples) that we can use to learn a self-supervised representation of linguistic content from a speech signal [10].
Identity constraints. By considering instead face and voice signals across face-tracks, we can obtain a different form of constraint: signals from the same face-track should come from the same speaker, while those from different face-tracks are likely to come from different speakers [22].
Disentangling constraint. Although representations trained to satisfy the intra-track and inter-track constraints may capture a measure of both linguistic content and speaker identity, there is no guarantee that the two factors will be disentangled (represented independently of one another). To achieve this last goal, we place a further constraint on the speech representations themselves, requiring that variation within one factor cannot be predicted from the other, enforcing their independence.
In this work, we train a single model in an end-to-end self-supervised manner to satisfy the constraints described above (the framework is depicted in Fig. 2). In the next section, we describe the architecture used for representation learning and the losses that are used to implement these constraints.
3.1 Network Architecture
Our architecture consists of two sub-networks: one that ingests five cropped faces as input, and another that takes in short-term magnitude spectrograms of 0.2-second speech segments. Each sub-network contains a block of five convolutional layers as the basic feature extraction trunk (these are shared for both content and identity, as it has been speculated that lower-level features, e.g. edges for images and formants for speech, are likely to be common to different high-level tasks [26]). Both sub-networks are based on the VGG-M architecture [6], which strikes a good trade-off between efficiency and performance; see [13] for the exact filter sizes. After this, each block branches into two separate fully connected layers, one producing identity embeddings and one producing content embeddings, both of dimension 1024. For an input of N frames, N-4 identity and content embeddings are produced for each modality stream (Fig. 2), since both sub-networks have temporal receptive fields of 5 frames (0.2 seconds) and strides of 1 frame (0.04 seconds). During training, the identity vectors from the audio stream are then averaged into a single vector, while a single identity vector is selected from the face stream at random. To understand this choice, note that if we were to also average the face embeddings, the task of matching identity representations would simply become one of lip reading, i.e. matching the linguistic content of the audio and visual signals. Hence we pick a single random face vector and make the assumption that a face from a single frame is insufficient to encode linguistic content.
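As a rough illustration of the temporal bookkeeping and identity pooling described above (a NumPy sketch with hypothetical function names; the actual model is a convolutional network implemented in PyTorch):

```python
import numpy as np

def num_embeddings(n_frames, receptive_field=5, stride=1):
    """Embeddings produced by a temporal conv stack: with a 5-frame
    receptive field and stride 1, N input frames yield N - 4 outputs."""
    return (n_frames - receptive_field) // stride + 1

def pool_identity(audio_id_vecs, face_id_vecs, rng):
    """Training-time pooling: audio identity vectors are averaged over
    the track, while a SINGLE face identity vector is picked at random
    (averaging the face vectors would reduce identity matching to lip
    reading, i.e. matching linguistic content)."""
    return audio_id_vecs.mean(axis=0), face_id_vecs[rng.integers(len(face_id_vecs))]
```

For example, a 1.2-second input (30 frames at 0.04 s) would yield 26 identity and content embeddings per stream.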
Self-Supervised Paired Data Inputs. In a single minibatch, we take B face-tracks. Within a face-track, we sample N consecutive face images and the temporally aligned speech from the corresponding 1.2-second speech segment. Hence the batch comprises B·N face images and the corresponding speech segments.
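A sketch of this batch construction (the values of B and N are not specified here; `build_minibatch` and the track layout are assumptions for illustration):

```python
import numpy as np

def build_minibatch(tracks, b, n, rng):
    """Sample b face-tracks; from each, take n consecutive face frames
    and the temporally aligned speech frames. Each element of `tracks`
    is an (aligned_faces, aligned_speech) pair of arrays."""
    faces, speech = [], []
    for idx in rng.choice(len(tracks), size=b, replace=False):
        track_faces, track_speech = tracks[idx]
        start = rng.integers(0, len(track_faces) - n + 1)
        faces.append(track_faces[start:start + n])
        speech.append(track_speech[start:start + n])
    return np.stack(faces), np.stack(speech)  # b tracks x n samples each
```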
3.2 Loss Functions
A content loss (CL) is used to implement the content constraint via a multi-way matching task, as described in [13]. The loss takes one input feature from the visual stream and N-4 features from the audio stream. Since only one of these audio features is a positive sample (i.e. in sync with the visual stream), this can be set up as an (N-4)-way feature matching task. Euclidean distances between the audio and video features are computed, resulting in N-4 distances. A cross-entropy loss is applied after passing the negated distances through a softmax, encouraging the similarity between matching pairs to exceed that of non-matching pairs.
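The content loss can be sketched as follows (a NumPy sketch with illustrative names; the positive index marks the audio feature in sync with the visual one):

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array."""
    m = logits.max()
    return logits - (m + np.log(np.exp(logits - m).sum()))

def content_loss(visual_feat, audio_feats, positive_idx):
    """(N-4)-way matching loss: Euclidean distances between the single
    visual feature and each audio feature are negated, passed through a
    softmax, and scored with cross-entropy against the index of the
    synchronised (positive) audio feature."""
    dists = np.linalg.norm(audio_feats - visual_feat, axis=1)  # N-4 distances
    return float(-log_softmax(-dists)[positive_idx])
```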
An identity loss (IL) is used to implement the identity constraint. It is similar in form to the content loss, but the negative samples are now obtained from different face-tracks, as opposed to from within a track. The task becomes one of selecting the correct track-averaged speech identity representation for a single face representation from among all the tracks in a batch, i.e. a B-way classification task.
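Correspondingly, a sketch of the identity loss (again NumPy, illustrative names); the only structural change from the content loss is where the negatives come from, namely the other B-1 face-tracks in the batch rather than other temporal offsets within a track:

```python
import numpy as np

def identity_loss(face_id_vec, avg_speech_id_vecs, positive_track):
    """B-way identity matching: select the correct track-averaged speech
    identity embedding for a single face embedding; negatives are the
    averaged speech embeddings of the other tracks in the batch."""
    dists = np.linalg.norm(avg_speech_id_vecs - face_id_vec, axis=1)  # B distances
    logits = -dists
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))
    return float(-log_probs[positive_track])
```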
A disentanglement loss (DL) is used to implement the disentangling constraint, following the confusion loss of [2, 28]. This loss assesses the amount of spurious-variation information left in either feature representation and then removes it (for the identity representation, content information is a spurious variation, and vice versa). Minimising this loss seeks to change the feature representation such that it becomes invariant to the spurious variation. We first compute the best classifier for the spurious variation, then compute the cross-entropy between the output predicted by each such classifier and a uniform distribution, and minimise this cross-entropy. See Equations 1-3 for exact details.
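A minimal sketch of this confusion-style loss (illustrative; in practice the spurious-variation classifier is itself fitted with a standard cross-entropy in alternation with this objective, per [2, 28]):

```python
import numpy as np

def confusion_loss(spurious_logits):
    """Cross-entropy H(uniform, p) between a uniform distribution and
    the spurious-variation classifier's predicted distribution p.
    Minimising it w.r.t. the embedding pushes the classifier's output
    towards uniform, i.e. the embedding carries no information about
    the spurious factor."""
    logits = np.asarray(spurious_logits, dtype=float)
    m = logits.max()
    log_p = logits - (m + np.log(np.exp(logits - m).sum()))
    k = logits.shape[0]
    return float(-log_p.sum() / k)
```

The loss attains its minimum, log K for K classes, exactly when the classifier's prediction is uniform.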
We train our model using the following loss combinations: (1) Only the content loss: in this case the identity streams are not present in the network; (2) Using only the identity loss: in this case the content streams are not present in the network; (3) Joint training with both the content and the identity loss; (4) Joint training with the content, identity and disentanglement losses. In all cases the model uses the same trunk architecture and training hyperparameters.
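The four configurations above amount to switching terms of a combined objective on and off; a minimal sketch (an unweighted sum is assumed here, since the loss weighting is not given):

```python
def total_loss(cl, il, dl, use_content=True, use_identity=True, use_disent=False):
    """Combine the content (CL), identity (IL) and disentanglement (DL)
    loss values according to the training configuration. An unweighted
    sum is an assumption for illustration."""
    loss = 0.0
    if use_content:
        loss += cl
    if use_identity:
        loss += il
    if use_disent:
        loss += dl
    return loss
```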
The model is implemented in PyTorch. It is trained end-to-end with a batch size of B face-tracks and N samples per face-track, using SGD with an initial learning rate of 1e-2, decayed over the course of training.
4 Experiments
We train our model on VoxCeleb2 [9], a large-scale audio-visual dataset of interviews obtained from unedited YouTube videos. The dataset consists of over a million utterances from 6,112 identities. No identity labels are used during training. To reduce computational cost, we sample only 20% of the speech per speaker from the VoxCeleb2 dev set for training, and validate performance on the self-supervised learning objectives using speakers from the VoxCeleb2 test set. The statistics of the dataset can be seen in Table 1.
Table 1: Dataset statistics: # face-tracks and # identities.
We first evaluate the performance of our model on the two self-supervised learning objectives that it was trained for, and then evaluate the learned representations on the downstream task of speaker recognition on the standard VoxCeleb1 speaker recognition benchmark.
Table 2: Performance on the two self-supervised learning objectives, evaluated on speakers from the VoxCeleb2 test set.

| Embedding | Training losses | Content task ((N-4)-way cls.) | Identity task (B-way cls.) | EER |
|---|---|---|---|---|
| Content | Content loss only | 49.0% | – | – |
| Identity | Identity loss only | – | 44.3% | 24.8% |
| Content | Con. and Id. loss | 46.7% | 8.5% | 45.7% |
| Content | Con., Id. and Dis. loss | 49.0% | 10.5% | 45.2% |
| Identity | Con. and Id. loss | 19.3% | 48.2% | 23.1% |
| Identity | Con., Id. and Dis. loss | 12.0% | 49.6% | 18.9% |
Learning Objective. We evaluate the self-supervised learning objectives on speakers from the VoxCeleb2 test set (Table 1); the results are shown in Table 2. We evaluate the learned identity representations both on the (N-4)-way matching task within a face-track (content task) and on the B-way identity classification task. From Table 2, it is clear that training both self-supervised objectives jointly improves performance on the identity classification task over training for identity alone (48.2% vs. 44.3%), and training with the disentanglement loss provides a further improvement (49.6%). To further probe the effect of the disentanglement loss, we examine the performance of the identity embeddings on the content classification task (on which they should ideally perform poorly). From Table 2, it can be seen that disentanglement helps remove content information from the identity embedding: accuracy on the content classification task drops from 19.3% to 12.0%.
As an aside, we also report the performance of the content embeddings in the middle two rows of Table 2 (although learning content representations for their own sake is not the objective of this work). We note that joint training actually harms performance on the content classification task compared to training with the content loss alone (from 49.0% to 46.7%); however, this performance is recovered by adding the disentanglement loss. This is to be expected: it is very difficult for identity information to leak into the content representation when it is trained for content alone, since the content objective is trained with a large number of negative pairs within the same face-track, discouraging the embedding from learning identity.
Speaker Recognition. We then extract identity embeddings for the data in the VoxCeleb1 test set (VoxCeleb1-O, 40 speakers) [24]. We first evaluate using the self-supervised embeddings directly (i.e. without any speaker identity labels at all) and report results in Table 3. The negative cosine distance between embeddings is used directly as the similarity score between verification pairs. Once again we see a similar trend: joint training and disentanglement show cumulative gains in performance. We then compare our method to fully supervised performance by freezing the layers of our network and finetuning a single fully connected layer on top of the embedding network with an n-pair loss, using labels from the VoxCeleb1 dev set. We do this for various subsets of the VoxCeleb1 dev set, and demonstrate in Table 4 that up to 500 speakers, our self-supervised method (even with only the identity loss, with further gains from the other two losses) outperforms full supervision. The fully supervised baseline is trained end-to-end and, for a fair comparison, has exactly the same architecture as the audio stream of the cross-modal model.
Table 3: Verification EER on the VoxCeleb1 test set, using the self-supervised embeddings directly.

| Training losses | EER |
|---|---|
| Identity loss only | 23.15% |
| Identity loss + Content loss | 22.59% |
| Identity loss + Content loss + Dis. loss | 22.09% |
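The direct evaluation above can be sketched as follows (hypothetical helper names): each trial pair is scored by the negative cosine distance between its two identity embeddings, and the EER is the operating point at which false-accept and false-reject rates are equal:

```python
import numpy as np

def neg_cosine_distance(a, b):
    """Similarity score for a verification pair: negative cosine
    distance, i.e. cos(a, b) - 1; higher means more likely same speaker."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(cos - 1.0)

def eer(scores, labels):
    """Equal error rate: sweep thresholds over the sorted scores and
    return the point where false-reject and false-accept rates cross
    (labels: 1 = same speaker, 0 = different speakers)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)[np.argsort(scores)[::-1]]
    p = labels.sum()
    n = len(labels) - p
    fnr = 1 - np.cumsum(labels) / p       # false-reject rate per threshold
    fpr = np.cumsum(1 - labels) / n       # false-accept rate per threshold
    idx = np.argmin(np.abs(fnr - fpr))
    return float((fnr[idx] + fpr[idx]) / 2)
```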
Table 4 (excerpt): verification EER for the identity-loss-only model after finetuning on increasingly large labelled subsets of the VoxCeleb1 dev set.

| Id. loss only | 15.05% | 13.00% | 11.16% | 9.85% |
In this work we develop a self-supervised method that learns speaker recognition embeddings from speech without access to any training labels, simply by using the co-occurrence of talking faces in video. By explicitly disentangling factors of variation such as content and identity, and training for both objectives with a common trunk architecture, we show improvements in generalisation to unseen speakers and, in the case of small amounts of training data, even outperform fully supervised methods.
Acknowledgements This work is funded by the EPSRC Programme Grant Seebibyte EP/M013774/1 and ExTol EP/R03298X/1. Arsha is funded by a Google PhD Fellowship.
References
- [1] (2018) Emotion recognition in speech using cross-modal transfer in the wild. In ACM Multimedia.
- [2] (2018) Turning a blind eye: explicit removal of biases and variation from deep neural network embeddings. In ECCV.
- [3] (2017) Look, listen and learn. In ICCV, pp. 609–617.
- [4] (2016) SoundNet: learning sound representations from unlabeled video. In NeurIPS, pp. 892–900.
- [5] (2017) See, hear, and read: deep aligned representations. arXiv preprint arXiv:1706.00932.
- [6] (2014) Return of the devil in the details: delving deep into convolutional nets. In BMVC.
- [7] (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, pp. 2172–2180.
- [8] (2018) State-of-the-art speech recognition with sequence-to-sequence models. In ICASSP, pp. 4774–4778.
- [9] (2018) VoxCeleb2: deep speaker recognition. In INTERSPEECH.
- [10] (2016) Out of time: automated lip sync in the wild. In ACCV, pp. 251–263.
- [11] (2016) Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.
- [12] (2015) A recurrent latent variable model for sequential data. In NeurIPS.
- [13] (2019) Perfect match: improved cross-modal embeddings for audio-visual synchronisation. In ICASSP, pp. 3965–3969.
- [14] (2016) arXiv preprint arXiv:1603.00982.
- [15] (2014) Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.
- [16] (2016) Sequential neural models with stochastic layers. In NeurIPS, pp. 2199–2207.
- [17] (1993) DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST Speech Disc 1-1.1. NASA STI/Recon Technical Report N 93.
- [18] (2018) On learning associations of faces and voices. arXiv preprint arXiv:1805.05553.
- [19] (2018) Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS.
- [20] (2015) Deep convolutional inverse graphics network. In NeurIPS, pp. 2539–2547.
- [21] (2007) Rapid and accurate spoken term detection. In ISCA.
- [22] (2018) Learnable PINs: cross-modal embeddings for person identity. In ECCV.
- [23] (2018) Seeing voices and hearing faces: cross-modal biometric matching. In CVPR.
- [24] (2017) VoxCeleb: a large-scale speaker identification dataset. In INTERSPEECH.
- [25] (2016) Ambient sound provides supervision for visual learning. In ECCV.
- [26] (2016) Adversarial multi-task learning of deep neural networks for robust speech recognition. In INTERSPEECH, pp. 2369–2372.
- [27] Text-independent speaker verification using 3D convolutional neural networks. In ICME, pp. 1–6.
- [28] (2015) Simultaneous deep transfer across domains and tasks. In ICCV, pp. 4068–4076.
- [29] (2017) Neural discrete representation learning. In NeurIPS.
- [30] (2019) Utterance-level aggregation for speaker recognition in the wild. In ICASSP.