Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

This paper investigates self-supervised pre-training for audio-visual speaker representation learning, where a visual stream showing the speaker's mouth area is used alongside speech as input. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conduct extensive experiments probing the effectiveness of pre-training and of the visual modality. Experimental results suggest that AV-HuBERT generalizes well to speaker-related downstream tasks, improving label efficiency roughly tenfold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing EER by 38% and 75% relative in clean and noisy conditions, respectively.


1 Introduction

Personalizing user experiences is essential in spoken language technology systems, e.g., smart speakers and personal banking applications. Robust speaker verification (SV) and recognition models are crucial for enabling authentication and conversational experiences, as well as many other tasks like speaker diarization [Wang2018SpeakerDW], voice conversion [Zhang2020DurIANSCDI] and source separation [Wang2019VoiceFilterTV].

Supervised speaker representation methods have made significant progress over the past decade [Snyder2018xvectors, chung2020in, desplanques2020ecapa, Lee2020two]; however, they require a non-trivial amount of human annotation of speaker identity, which might not comply with evolving privacy-preserving standards. Furthermore, it is challenging to provide speaker labels for multi-speaker dialogues or when speakers’ voices alternate between whispering and shouting [hansen2015speaker]. Self-supervised speaker representation approaches, which work around these challenges, have recently gained popularity. One family of self-supervised speaker representation methods relies on contrastive learning, which constructs positive samples by either augmenting the same speech segment or assuming a single speaker is recorded per utterance [Inoue2020SemiSupervisedCL, tao2021self, Xia2021SelfSupervisedTS]. These methods show strong downstream performance: the unsupervised state-of-the-art approach [tao2021self] achieves an EER of 1.66%, close to some SOTA supervised systems (e.g., 0.41% from [zhao2021speakin]). However, one downside of these approaches is that they are tailored solely for speaker embedding tasks. In contrast, general self-supervised speech representation learning approaches, e.g., wav2vec 2.0 [Baevski2020wav2vec2A] and HuBERT [Hsu2021HuBERT], were found to capture enough speaker information to be competitive on SV while excelling at many other downstream tasks [Yang2021SUPERBSP].

The Audio-Visual Hidden Unit BERT (AV-HuBERT) was recently introduced as a general audio-visual representation learning approach. It learns joint representations over speech and lip-movement streams by alternating between clustering representations using a small codebook mimicking broad phonetic units and learning latent contextual representations through the masked prediction loss. AV-HuBERT achieves SOTA results on lip-reading and audio-visual speech recognition (AVSR) under adverse noise conditions [shi2022robust, avhubert], thanks to the noise-immune visual modality.

This paper goes beyond single-modality speaker representations and works with audio and lip-movement information to learn noise-robust speaker embeddings. We extend the representations learned by AV-HuBERT and study their effectiveness for speaker-based downstream tasks in multimodal settings. Compared to recent specialized unsupervised speaker representation methods [Inoue2020SemiSupervisedCL, tao2021self, Xia2021SelfSupervisedTS], one advantage of utilizing a general approach like AV-HuBERT is its ability to simultaneously serve other downstream tasks beyond speaker embedding. Prior work on audio-visual speaker representation learning focused on the consistency between the audio and the visual information, either by learning speaker embeddings via audio-visual synchronization and identity matching [Nagrani2020DisentangledSE] or by multi-way matching in a joint audio-visual embedding space [Chung2020PerfectMS]. AV-HuBERT offers a more stable training procedure than methods utilizing contrastive and consistency objectives, since its masked prediction loss is computed over offline-learned discrete units.

In our experiments, AV-HuBERT representations are used either in an ELMo-style feature combination protocol [sarzynska2021detecting] or through fine-tuning the whole network for the target downstream task. We report results on speaker classification and verification tasks under four types of interfering noise and five different signal-to-noise ratios (SNR). Our audio-visual models improve label efficiency roughly tenfold over supervised models, and offer 38% and 75% relative equal error rate (EER) reductions for SV under clean and noisy conditions, respectively, compared to audio-only pre-trained models.

2 Method

Figure 1: AV-HuBERT for learning speaker embedding. Dashed box: added during fine-tuning.

2.1 Overview of AV-HuBERT

Audio-Visual Hidden Unit BERT (AV-HuBERT) is a self-supervised model that learns from unlabeled audio-visual speech data. Similar to its audio counterpart, HuBERT [Hsu2021HuBERT], AV-HuBERT was initially benchmarked on speech recognition tasks and achieved state-of-the-art performance in uni-modal (audio-only and video-only) [avhubert] and multimodal (audio-visual) [shi2022robust] setups. As depicted in Figure 1, AV-HuBERT comprises four modules: a feed-forward network (FFN) audio feature extractor, a modified ResNet [Stafylakis2017CombiningRN, Martnez2020LipreadingUT] video feature extractor, a fusion module, and a Transformer [Vaswani2017attention] backend. The two feature extractors generate frame-level representations for their corresponding streams, which are concatenated frame-wise by the fusion module to form the initial audio-visual features. The Transformer backend takes these features and produces contextualized frame-level audio-visual representations. The entire model is optimized to perform masked prediction, where random segments are masked for each stream independently (denoted by the red crosses in Figure 1), and the model learns to predict the cluster assignments of the masked frames (the middle three frames in the example). The cluster assignments are iteratively refined: they are produced by clustering MFCC features in the first iteration, and by clustering the previous iteration’s AV-HuBERT representations in subsequent iterations.
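To make the data flow concrete, the following is a minimal PyTorch sketch of the computation described above, not the official fairseq implementation: module sizes, the toy feature extractors, the mask embeddings, and the simplified loss over masked frames are illustrative assumptions.

```python
# Minimal sketch of the AV-HuBERT forward pass and masked-prediction loss.
# NOT the official implementation; dimensions and module details are assumed.
import torch
import torch.nn as nn

class ToyAVHuBERT(nn.Module):
    def __init__(self, audio_dim=104, video_dim=512, d_model=768,
                 n_layers=12, n_clusters=500):
        super().__init__()
        # Audio feature extractor: a small feed-forward network (FFN).
        self.audio_fe = nn.Sequential(nn.Linear(audio_dim, d_model), nn.GELU())
        # Video feature extractor: stands in for the modified ResNet.
        self.video_fe = nn.Sequential(nn.Linear(video_dim, d_model), nn.GELU())
        # Fusion: frame-wise concatenation followed by a linear projection.
        self.fusion = nn.Linear(2 * d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.backend = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Predicts the (offline) cluster assignment of each frame.
        self.cluster_head = nn.Linear(d_model, n_clusters)
        self.audio_mask_emb = nn.Parameter(torch.zeros(d_model))
        self.video_mask_emb = nn.Parameter(torch.zeros(d_model))

    def forward(self, audio, video, audio_mask, video_mask):
        # audio/video: (B, T, feat_dim); *_mask: (B, T) booleans, True = masked.
        a = self.audio_fe(audio)
        v = self.video_fe(video)
        # Mask each stream independently, as in the masked-prediction objective.
        a = torch.where(audio_mask.unsqueeze(-1), self.audio_mask_emb, a)
        v = torch.where(video_mask.unsqueeze(-1), self.video_mask_emb, v)
        fused = self.fusion(torch.cat([a, v], dim=-1))   # (B, T, d_model)
        ctx = self.backend(fused)                        # contextualized frames
        return self.cluster_head(ctx)                    # (B, T, n_clusters)

# Training step: cross-entropy against iteratively refined cluster IDs,
# computed here only over masked frames (a common HuBERT-style choice).
B, T = 2, 50
model = ToyAVHuBERT()
audio, video = torch.randn(B, T, 104), torch.randn(B, T, 512)
a_mask, v_mask = torch.rand(B, T) < 0.3, torch.rand(B, T) < 0.3
targets = torch.randint(0, 500, (B, T))                  # offline cluster IDs
logits = model(audio, video, a_mask, v_mask)
loss = nn.functional.cross_entropy(
    logits[a_mask | v_mask], targets[a_mask | v_mask])
```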

While the self-supervised learning objective of HuBERT and AV-HuBERT (masked prediction of cluster assignments) resembles the task of automatic speech recognition (ASR) [avhubert], because it can be understood as joint acoustic and language modeling, several analyses have observed that such models still learn rich speaker information, especially in the earlier layers [chang2021distilhubert, Yang2021SUPERBSP, Chen2021WavLM]. These observations can be linked to many studies in the ASR and SV literature. For example, researchers have found that speaker information can improve ASR performance [senior2014improving]. Hence, AV-HuBERT may learn to extract speaker information in early layers to better infer phonetic information in subsequent layers. In addition, statistics of acoustic features extracted from ASR models have been widely used for SV [lei2014novel], which can also explain why HuBERT features benefit SV. Motivated by the strong results of adapting HuBERT for SV, we explore learning audio-visual speaker embeddings from a pre-trained AV-HuBERT model, with emphasis on the noise robustness conferred by the addition of the visual stream.

Figure 2: Example lip images from the LRS3 dataset [afouras2018lrs3]. Images in the same row are from the same speaker.

2.2 Learning multimodal speaker embedding

We describe in this section how to learn speaker embeddings from pairs of audio-visual speech $x$ and speaker label $y$. Given a pre-trained AV-HuBERT model parameterized by $\theta$, the goal is to leverage it to learn a speaker embedder $g$ such that $g(x_i)$ and $g(x_j)$ are similar if $y_i = y_j$, and are dissimilar otherwise.

Two learning protocols are considered, which differ in whether $\theta$ is frozen. The former is used to evaluate representation quality, since it treats the self-supervised model as a fixed feature extractor. Following [Yang2021SUPERBSP], we consider ELMo [Peters2018DeepCW]-style fine-tuning, where the frame-level AV-HuBERT representations at each Transformer layer are weighted and summed using a learnable non-negative weight vector that sums to one. This representation is then passed to a prediction model to produce a fixed-size speaker embedding, followed by a softmax layer to estimate the posterior over a closed set of training speakers. The weight vector, prediction model, and softmax layer are trained to minimize a cross-entropy loss with respect to the speaker labels. The architecture of the prediction model depends on the downstream task but is generally much lighter than AV-HuBERT.
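The sketch below illustrates this frozen-feature protocol under stated assumptions: the layer weights are kept non-negative and summing to one via a softmax, and the mean-pooling prediction head is a stand-in for the SUPERB-style SC/SV heads rather than their exact architecture.

```python
# Sketch of the ELMo-style weighted combination of frozen layer-wise features.
import torch
import torch.nn as nn

class WeightedLayerPooling(nn.Module):
    def __init__(self, n_layers, d_model, n_speakers, emb_dim=256):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))  # softmax -> weights
        self.embedder = nn.Linear(d_model, emb_dim)               # lightweight head
        self.classifier = nn.Linear(emb_dim, n_speakers)          # closed-set softmax

    def forward(self, layer_feats):
        # layer_feats: (n_layers, B, T, d_model), produced by the *frozen* model.
        w = torch.softmax(self.layer_logits, dim=0)                # non-negative, sums to 1
        combined = (w.view(-1, 1, 1, 1) * layer_feats).sum(dim=0)  # (B, T, d_model)
        spk_emb = self.embedder(combined.mean(dim=1))              # pool over time
        return spk_emb, self.classifier(spk_emb)

# Only the weights, embedder, and classifier receive gradients; the upstream
# AV-HuBERT parameters stay frozen.
feats = torch.randn(13, 4, 50, 768)          # e.g. 12 layers + input embedding
head = WeightedLayerPooling(13, 768, n_speakers=1251)
emb, logits = head(feats)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 1251, (4,)))
```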

The other protocol is often adopted for pre-training methods [bert, Baevski2020wav2vec2A, Hsu2021HuBERT, baevski2022data2vec], where the self-supervised model is updated along with the new parameters during fine-tuning. Similar to how BERT handles sequence classification tasks [bert], we prepend an additional trainable [cls] vector to the Transformer input, as shown in Figure 1, and take its contextualized representation at the Transformer output as the speaker embedding. Pooling of information across frames is carried out by the self-attention module at each Transformer layer. As in the other protocol, the speaker embedding is passed to a softmax layer to predict the speaker label and the same loss function is used, but here the entire model is optimized.
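A minimal sketch of the [cls]-based protocol follows, assuming access to the fused frame features and Transformer backend of a model like the toy sketch in §2.1; the wrapper names and initialization are illustrative, not the paper's exact implementation.

```python
# Sketch of full fine-tuning with a prepended trainable [cls] vector.
import torch
import torch.nn as nn

class CLSSpeakerHead(nn.Module):
    def __init__(self, backend: nn.Module, d_model: int, n_speakers: int):
        super().__init__()
        self.backend = backend                       # Transformer, fine-tuned jointly
        self.cls = nn.Parameter(torch.randn(d_model) * 0.02)
        self.classifier = nn.Linear(d_model, n_speakers)

    def forward(self, fused_frames):
        # fused_frames: (B, T, d_model) audio-visual features before the backend.
        B = fused_frames.size(0)
        cls = self.cls.expand(B, 1, -1)              # (B, 1, d_model)
        x = torch.cat([cls, fused_frames], dim=1)    # prepend [cls]
        ctx = self.backend(x)                        # self-attention pools over frames
        spk_emb = ctx[:, 0]                          # [cls] output = speaker embedding
        return spk_emb, self.classifier(spk_emb)
```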

Two tasks, SV and speaker classification (SC), are used to evaluate the learned speaker embeddings. SC evaluates prediction accuracy over a closed set of speakers seen during training. In contrast, SV considers a speaker-independent setup, where a set of test trials is provided, each of which contains two utterances and a label indicating whether the two are from the same speaker. The learned speaker embedder is used to compute a similarity score for each trial, and the equal error rate (EER) is reported, which is the error rate at the operating point where the false positive rate equals the false negative rate.
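For concreteness, a small NumPy helper for the EER metric is sketched below; it sweeps a threshold over sorted trial scores and picks the point where the miss and false-accept rates cross. Production toolkits compute essentially the same quantity, often with ROC interpolation.

```python
# Equal error rate (EER) from per-trial similarity scores and labels.
import numpy as np

def compute_eer(scores, labels):
    """scores: similarity per trial; labels: 1 = same speaker, 0 = different."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    order = np.argsort(-scores)                        # descending score
    labels = labels[order]
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    # Accept the top-k trials for each possible threshold cut.
    fnr = 1.0 - np.cumsum(labels) / n_pos              # misses among positives
    fpr = np.cumsum(1 - labels) / n_neg                # false accepts among negatives
    idx = np.argmin(np.abs(fnr - fpr))                 # where the two rates cross
    return float((fnr[idx] + fpr[idx]) / 2)

# Cosine similarity between two speaker embeddings is a typical trial score:
# compute_eer([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0]) -> 0.0 for this toy case
```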

We should note that, unlike most prior work on learning audio-visual speaker embeddings [shon2019noise, Sari2021AMA], we use lip videos (see Figure 2) instead of whole-face videos or images as input. This can significantly improve robustness in noisy environments compared to audio-based systems, while reducing the amount of biometric information required compared to face-based systems.

3 Experiments

3.1 Setup

Pre-training   We pre-train a 12-layer BASE and a 24-layer LARGE AV-HuBERT following [avhubert] and [shi2022robust], with only one change. In [avhubert], models are pre-trained with 433 hours of LRS3 [afouras2018lrs3] and the English portion of VoxCeleb2 (VC2) [voxceleb2] from both the dev and test splits. In this paper, since we evaluate the speaker embeddings on the VC2 test split, which contains both English and non-English speakers, we combine LRS3 with all VC2 data, English or non-English, but exclude the VC2 test split from pre-training; this sums to roughly 2,800 hours. Both LRS3 and VC2 are sampled at a frame rate of 25 Hz, and AV-HuBERT produces representations at this frame rate. To improve noise robustness, noise randomly sampled from MUSAN [Snyder2015MUSANAM] is added to the audio stream following [shi2022robust]. The audio and video preprocessing steps remain the same as in [avhubert].

Fine-tuning   For the frozen protocol, to compare with results reported in [Yang2021SUPERBSP], we adopt the same prediction heads as [Yang2021SUPERBSP]: an average pooling layer for SC and an x-vector model [snyder2018x] for SV. Two widely used audio-visual speaker recognition datasets, VoxCeleb1 (VC1, 352 hours/1,251 speakers) [nagrani2017voxceleb]¹ and VC2 (2,442 hours/6,112 speakers), are adopted for supervised fine-tuning. Noise-augmented fine-tuning with MUSAN following [shi2022robust] is explored in §3.3. All models are optimized with Adam [kingma2014adam], with the learning rate warmed up to 0.001 over the first third of training steps and then linearly decayed. The pre-trained parameters are frozen for a number of steps before being updated. For the {5h, 50h, 500h, VC2} setups in Table 1, we train the model for {20, 30, 90, 75} K-steps with a batch size of {100, 100, 100, 400} in the audio-only setting and {60, 60, 120, 240} in the audio-visual setting. Whenever the model is used in the audio-only setting, the visual feature (ResNet output) is replaced by an all-zero vector. A code sketch of two of these details follows.

¹ We use the videos provided by [Nagrani18seeing], since [nagrani2017voxceleb] only releases processed audio files. As a few dozen video clips in VoxCeleb1 are missing from the data provided by [Nagrani18seeing], we download the raw videos from their URLs and extract clips with the ground-truth timestamps provided in [nagrani2017voxceleb]. The speaker face is then extracted with the dlib face detector [david2009dlib]. The video files from [Nagrani18seeing] were down-sampled by a factor of six (4.17 Hz). To handle this, we let the ResNet video feature extractor process the downsampled video as is and upsample its output by a factor of six before passing it to the Transformer. Empirically, this gives similar performance to upsampling the video at the ResNet input while reducing memory and compute.
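The following sketch covers the learning-rate schedule and the audio-only zeroing described above, under assumed function names; the exact scheduler and freezing logic in the released recipes may differ.

```python
# Two fine-tuning details, sketched: (1) linear warmup to the peak learning
# rate over the first third of steps, then linear decay; (2) replacing the
# visual feature with an all-zero vector in the audio-only setting.
import torch

def lr_at_step(step, total_steps, peak_lr=1e-3, warmup_frac=1/3):
    """Warm up linearly to peak_lr, then decay linearly to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / max(1, remaining))

def maybe_drop_video(video_feats, audio_only: bool):
    """In the audio-only setting, the ResNet output is replaced by zeros."""
    return torch.zeros_like(video_feats) if audio_only else video_feats
```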

Evaluation   For SV, we follow the standard evaluation protocol used in [chung2020in]. We report SC accuracy (SC-ACC) and SV-EER on VC1 using the official test set and test trials. For VC2-EER, since no official test trials exist, we follow [Sari2021AMA] and create them by sampling one positive trial and one negative trial for each test-set utterance. To probe noise robustness, we follow [shi2022robust] and create 20 noisy test sets for VC1 and VC2, where each clean set is mixed with {Babble, Speech, Music, Other} noise at an SNR in {-10, -5, 0, 5, 10} dB.
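A sketch of how a noisy utterance can be constructed at a target SNR is shown below, in the spirit of the MUSAN-based corruption described above; the exact mixing code in [shi2022robust] may differ in details such as noise cropping or clipping.

```python
# Mix a noise signal into a speech signal at a target SNR (in dB).
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db, then add."""
    # Tile or crop the noise to the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# 20 noisy test sets = {Babble, Speech, Music, Other} x {-10, -5, 0, 5, 10} dB.
```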

By default, we fine-tune all models on VC2 without noise augmentation, using the protocol where AV-HuBERT parameters are updated, and report EER on the clean set. We use AV-HuBERT BASE for analysis purposes (§3.2-3.4).

3.2 Effectiveness of AV-HuBERT pre-training

PT | FT | Mod. | VC2 clean | VC2 noisy | VC1 clean | VC1 noisy
None | VC2-15spk (5h) | A | 26.8 | 39.2 | 25.1 | 39.2
None | VC2-15spk (5h) | AV | 29.8 | 35.9 | 24.6 | 28.7
VC2+LRS3 | VC2-15spk (5h) | A | 23.3 | 33.9 | 20.0 | 33.0
VC2+LRS3 | VC2-15spk (5h) | AV | 22.6 | 28.0 | 19.4 | 21.9
None | VC2-156spk (50h) | A | 18.5 | 34.5 | 16.1 | 34.6
None | VC2-156spk (50h) | AV | 16.4 | 24.7 | 13.1 | 17.7
VC2+LRS3 | VC2-156spk (50h) | A | 11.8 | 28.9 | 9.4 | 29.1
VC2+LRS3 | VC2-156spk (50h) | AV | 9.3 | 18.8 | 7.8 | 12.5
None | VC2-1200spk (485h) | A | 11.1 | 31.6 | 8.6 | 30.5
None | VC2-1200spk (485h) | AV | 9.3 | 17.6 | 7.0 | 9.9
VC2+LRS3 | VC2-1200spk (485h) | A | 7.2 | 26.1 | 4.9 | 25.2
VC2+LRS3 | VC2-1200spk (485h) | AV | 5.7 | 12.6 | 3.8 | 6.1
None | VC2-5h (1740spk) | A | 24.4 | 39.4 | 21.7 | 39.5
None | VC2-5h (1740spk) | AV | 32.8 | 41.0 | 30.2 | 40.3
VC2+LRS3 | VC2-5h (1740spk) | A | 20.1 | 34.7 | 17.7 | 34.5
VC2+LRS3 | VC2-5h (1740spk) | AV | 16.7 | 28.6 | 13.9 | 23.0
None | VC2-50h (5113spk) | A | 20.2 | 35.5 | 16.1 | 34.7
None | VC2-50h (5113spk) | AV | 21.5 | 26.3 | 15.7 | 16.4
VC2+LRS3 | VC2-50h (5113spk) | A | 10.7 | 29.7 | 8.0 | 28.7
VC2+LRS3 | VC2-50h (5113spk) | AV | 7.4 | 19.8 | 4.8 | 11.4
None | VC2-500h (5992spk) | A | 10.6 | 33.1 | 8.0 | 31.4
None | VC2-500h (5992spk) | AV | 6.5 | 14.5 | 5.3 | 7.8
VC2+LRS3 | VC2-500h (5992spk) | A | 4.9 | 23.7 | 3.0 | 22.8
VC2+LRS3 | VC2-500h (5992spk) | AV | 3.7 | 9.2 | 1.7 | 3.9
None | VC2 (5994spk) | A | 7.3 | 29.2 | 5.1 | 27.8
None | VC2 (5994spk) | AV | 5.1 | 11.3 | 2.9 | 4.7
VC2+LRS3 | VC2 (5994spk) | A | 3.4 | 20.9 | 1.9 | 20.0
VC2+LRS3 | VC2 (5994spk) | AV | 2.4 | 7.8 | 1.0 | 2.5
Table 1: SV performance (EER, %) on clean and noisy test sets when fine-tuned on various VC2 subsets. For the noisy test sets, the EER averaged over 20 setups (5 SNRs × 4 noise types) is reported.

We first study the label efficiency of AV-HuBERT pre-training for audio-only and audio-visual speaker verification, and evaluate performance on both clean and noisy test trials (Table 1). In the “noisy” columns, we report the average EER over the 20 test configurations. Three subset sizes are considered: 20%, 2%, and 0.2% of the labeled data. For each size, we generate subsets in two ways (sketched after this paragraph): one samples x% of the utterances, the other samples x% of the speakers and selects all their utterances. For a given size, the former contains more speakers but fewer utterances per speaker.
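The two subset-sampling strategies can be sketched as follows, assuming a manifest of (utterance_id, speaker_id) pairs; the function names and seed handling are illustrative.

```python
# The two labeled-subset sampling strategies compared in Table 1.
import random

def sample_by_utterance(manifest, frac, seed=0):
    """Keep frac of the utterances: more speakers, fewer clips per speaker."""
    rng = random.Random(seed)
    return rng.sample(manifest, int(len(manifest) * frac))

def sample_by_speaker(manifest, frac, seed=0):
    """Keep frac of the speakers and all of their utterances."""
    rng = random.Random(seed)
    speakers = sorted({spk for _, spk in manifest})
    kept = set(rng.sample(speakers, int(len(speakers) * frac)))
    return [(utt, spk) for utt, spk in manifest if spk in kept]
```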

For each labeled subset, AV-HuBERT outperforms models trained from scratch (PT=None) when the same modalities are used as input. Comparing across subsets, we find that AV-HuBERT often matches or outperforms a randomly initialized model that uses 10 times more labeled data (e.g., AV-HuBERT reaches 16.7% on the VC2 clean set with VC2-5h data and AV input, while the baseline reaches 21.5% with VC2-50h data and AV input).

Next, we observe that with pre-training, audio-visual models always outperform audio-only ones. The consistent gain on the clean set shows that lip videos bring complementary information for recognizing speakers, unlike speech recognition, where visual information helps very little in clean conditions [shi2022robust, ma2021conformer]. Moreover, audio-visual models bring substantial gains in noisy conditions compared to audio-based models (e.g., reducing EER from 20.0% to 2.5% on the noisy VC1 test set when pre-trained and fine-tuned on the entire VC2). This demonstrates the improved noise robustness gained by incorporating lip information.

Comparing the two subset sampling strategies for 0.2% of VC2 (VC2-5h versus VC2-15spk), we see that pre-trained models benefit from having more speakers (w/ PT, 16.7% for VC2-5h vs. 22.6% for VC2-15spk on VC2 clean with AV input),² while the supervised audio-visual baselines struggle with too few examples per speaker (w/o PT, 32.8% vs. 29.8% on VC2 clean with AV input). This implies that a visual encoder trained from scratch hardly benefits from an increasing number of speakers given only a few utterances per speaker. In the 5h setting (few utterances per speaker), the extra visual modality is effective only with pre-training, whereas an audio-visual model trained from scratch is consistently worse than its audio-only counterpart regardless of the audio conditions (PT=None, A vs. AV). To conclude, we find that pre-training leads to better generalization in few-shot learning.

² The model achieves similar performance on the noisy test sets for both subset sampling strategies, which is caused by the mismatch between training and testing audio conditions (clean vs. noisy). With noise-augmented fine-tuning, the same trend emerges again (w/ PT, 5h vs. 15spk: 26.1%/13.8% vs. 30.7%/21.6% with A/AV input on VC1 noisy).

3.3 Noise-augmented fine-tuning

Observing the gap between audio-based and audio-visual models on SV in the previous section, we study whether applying noise augmentation during fine-tuning can bridge it. We fine-tune AV-HuBERT on VC2-500h audio and audio-visual data with and without the noise augmentation described in §3.1.

The performance of the four models (A or AV, with or without noise augmentation) on the 20 noisy test sets is presented in Table 2. Unsurprisingly, all models perform worse at lower SNRs. Nevertheless, we observe much larger degradation for the audio-based models, especially when corrupted with speech (S) or babble (B) noise, because audio models cannot determine who the target speaker is from a mixture of voices. In contrast, audio-visual models suffer only minor degradation in noisier conditions, because lip videos help identify the target speaker whose embedding should be inferred.

We also see that noise augmentation reduces the average EER at -10 dB from 43.7% to 30.8% for audio-only models, and from 6.3% to 3.3% for audio-visual models. These results suggest that while noise augmentation is beneficial, it alone cannot bridge the gap between audio and audio-visual models pre-trained and fine-tuned on the same amount of data.

Aug? | Type | A -10 dB | A -5 dB | A 0 dB | A 5 dB | A 10 dB | AV -10 dB | AV -5 dB | AV 0 dB | AV 5 dB | AV 10 dB
N | B | 48.2 | 36.4 | 18.5 | 9.6 | 6.0 | 4.4 | 3.9 | 3.4 | 2.6 | 2.2
N | S | 48.8 | 46.5 | 36.5 | 18.3 | 8.5 | 8.6 | 6.8 | 4.8 | 3.4 | 2.5
N | M | 39.3 | 26.9 | 14.5 | 8.3 | 5.5 | 6.2 | 4.3 | 3.1 | 2.4 | 2.0
N | O | 34.0 | 23.3 | 13.7 | 8.8 | 5.9 | 6.0 | 4.3 | 3.2 | 2.6 | 2.3
Y | B | 48.1 | 27.2 | 12.7 | 7.3 | 5.2 | 3.4 | 3.2 | 2.5 | 2.2 | 2.0
Y | S | 24.4 | 14.9 | 11.8 | 12.3 | 9.6 | 3.2 | 2.8 | 2.6 | 2.3 | 2.0
Y | M | 27.3 | 14.3 | 8.2 | 5.6 | 4.4 | 3.5 | 2.8 | 2.4 | 2.0 | 1.8
Y | O | 23.6 | 13.0 | 8.0 | 5.8 | 4.7 | 3.1 | 2.6 | 2.3 | 2.1 | 2.0
Table 2: VC1 EER (%) of AV-HuBERT fine-tuned on VC2-500h with audio (A) or audio-visual (AV) input, with (Y) or without (N) noise augmentation, per noise type and SNR. Abbreviations: B: Babble, S: Speech, M: Music, O: Other.

3.4 Choice of visual input

Using lip videos instead of face videos is one of the key features of the proposed model, as it better addresses privacy concerns while exhibiting excellent noise robustness. We quantify in this section how much our model degrades by comparing it with a variant of AV-HuBERT pre-trained and fine-tuned on face videos. As a baseline, we also evaluate a widely used face recognition model, RetinaFace [deng2020retina], taking a still image as input, which only achieves an EER of 14.6% on VC2. Table 3 shows that the face-based AV-HuBERT model indeed performs slightly better (0.9% absolute EER reduction), a trade-off to be considered.

Model | Input | VC2 EER (%)
AV-HuBERT | audio + face video | 2.8
AV-HuBERT | audio + lip video | 3.7
Table 3: Comparison of different visual inputs. AV-HuBERT is fine-tuned on VC2-500h.
Method | PT Data | Mod. | VC1 SC-Acc | VC1 SV-EER
FBANK [Yang2021SUPERBSP] | - | A | 8.5E-4 | 9.56
wav2vec2-B [Yang2021SUPERBSP] | LS (960 hr) | A | 75.18 | 6.02
HuBERT-B [Yang2021SUPERBSP] | LS (960 hr) | A | 81.42 | 5.11
WavLM-B [Chen2021WavLM] | Mix (94k hr) | A | 89.42 | 4.07
wav2vec2-L [Yang2021SUPERBSP] | LL (60k hr) | A | 86.14 | 5.65
HuBERT-L [Yang2021SUPERBSP] | LL (60k hr) | A | 90.33 | 5.98
WavLM-L [Chen2021WavLM] | Mix (94k hr) | A | 95.49 | 3.77
AV-HuBERT-B | VC2+LRS3 (2.8k hr) | A | 80.99 | 5.85
AV-HuBERT-B | VC2+LRS3 (2.8k hr) | AV | 93.90 | 4.85
AV-HuBERT-L | VC2+LRS3 (2.8k hr) | A | 91.56 | 4.42
AV-HuBERT-L | VC2+LRS3 (2.8k hr) | AV | 98.06 | 2.95

Method | PT Data | FT Data | Mod. | VC1 SC-Acc | VC1 SV-EER | VC2 SV-EER
Nagrani et al. [Nagrani2020DisentangledSE] | 20% VC2 | VC1 | AV | - | 9.43 | -
WavLM-L [Chen2021WavLM] | Mix (94k hr) | VC2 | A | - | 0.38 | -
Shon et al. [shon2019noise] | - | VC2 | AV | - | - | 5.29
Unimodal [Sari2021AMA] | - | VC2 | A | - | 2.2 | 3.5
Multi-view [Sari2021AMA] | - | VC2 | AV | - | 1.8 | 2.4
Feature Fusion [Sari2021AMA] | - | VC2 | AV | - | 1.4 | 2.0
Ensemble [Sari2021AMA] | - | VC2 | AV | - | 0.7 | 1.6
AV-HuBERT-B | VC2+LRS3 (2.8k hr) | VC2 | A | - | 1.92 | 3.43
AV-HuBERT-B | VC2+LRS3 (2.8k hr) | VC2 | AV | - | 1.00 | 2.41
AV-HuBERT-L | VC2+LRS3 (2.8k hr) | VC2 | A | - | 1.71 | 3.11
AV-HuBERT-L | VC2+LRS3 (2.8k hr) | VC2 | AV | - | 0.84 | 2.29
Table 4: (Top) Comparison with prior work following the SUPERB fine-tuning protocol; models are fine-tuned on VC1. (Bottom) Comparison with prior work that does not follow the SUPERB evaluation protocol.

3.5 Comparison with prior work

We first compare AV-HuBERT with other self-supervised models using the SUPERB [Yang2021SUPERBSP] protocol, which evaluates the quality of frozen representations (upper half of Table 4); models are fine-tuned and evaluated on VC1 for SC and SV. With audio-only input, AV-HuBERT-L outperforms wav2vec2-L and HuBERT-L but is inferior to WavLM-L. With audio-visual input, AV-HuBERT outperforms all audio-only pre-trained models and achieves significantly better results on both SC and SV, even with an order of magnitude less pre-training data.

The lower half of Table 4 compares AV-HuBERT with other supervised and self-supervised models that report results on VC1 or VC2 without following the SUPERB protocol. Specifically, we fine-tune our pre-trained model on the whole VC2 labeled data. As expected, fine-tuning with a large amount of labeled data improves performance. In the audio-visual setting, our best model (0.84%) outperforms the single models of [Sari2021AMA] (1.8% and 1.4%) and falls slightly behind their ensemble (0.7%). Note that, in contrast to prior works [Nagrani2020DisentangledSE, Sari2021AMA, shon2019noise], which use the whole face, our model relies only on the lip area of the speaker as visual input and achieves a better trade-off between privacy and performance. In addition, we acknowledge the gap between our best model and the current SOTA on VC1 ([Chen2021WavLM]: 0.38%). This gap is attributed to the relatively smaller amount of pre-training data we use, as well as the simplicity of our training and evaluation pipeline (e.g., [Chen2021WavLM] uses the Inter-TopK penalty [zhao2021speakin], ECAPA-TDNN [desplanques2020ecapa] as a downstream network, second-stage large-margin fine-tuning, adaptive s-norm [Karam2011TowardsRF, Cumani2011ComparisonOS], and calibration [Thienpondt2021theidlab] in evaluation). As the goal of this paper is to study the speaker embeddings learned by AV-HuBERT, we stick to a simple and standard pipeline for speaker verification. Incorporating these extra techniques into the AV-HuBERT framework is left as future work.

4 Conclusion

In this paper, we study learning lip-based audio-visual speaker embeddings from AV-HuBERT. We show that AV-HuBERT, a general-purpose audio-visual speech representation learning framework, is able to learn high-quality speaker embeddings that improve performance on speaker-related tasks, including classification and verification, using less labeled data. The proposed model also greatly improves robustness to a variety of noise types while better preserving privacy, as it requires only lip videos instead of whole-face videos.

References