Cross-modal Speaker Verification and Recognition: A Multilingual Perspective

Recent years have seen a surge of interest in finding associations between faces and voices in cross-modal biometric applications, alongside speaker recognition. Inspired by this, we introduce the challenging task of establishing the association between faces and voices across multiple languages spoken by the same set of persons. The aim of this paper is to answer two closely related questions: "Is face-voice association language independent?" and "Can a speaker be recognised irrespective of the spoken language?". These two questions are important for understanding the effectiveness of, and for boosting the development of, multilingual biometric systems. To answer them, we collected a Multilingual Audio-Visual dataset containing human speech clips of 154 identities with three language annotations, extracted from various videos uploaded online. Extensive experiments on the three splits of the proposed dataset have been performed to investigate and answer these novel research questions, and they clearly point out the relevance of the multilingual problem.


1 Introduction

Half of the world population is bilingual, with people often switching between their first and second language while communicating (www.washingtonpost.com/local/education/half-the-world-is-bilingual-whats-our-problem/2019/04/24/1c2b0cc2-6625-11e9-a1b6-b29b90efa879_story). Therefore it is essential to investigate the effect of multiple languages on computer vision and machine learning tasks. As introduced in Figure 1, this paper probes two closely related and relevant questions, which deal with the recently introduced cross-modal biometric matching tasks in the wild:

Q1. Is face-voice association language independent?
Q2. Can a speaker be recognised irrespective of the spoken language?

Regarding the first question, a strong correlation has recently been found between the face and voice of a person, which has attracted significant research interest [19, 25, 32, 33, 34, 46]. Although previous works have established a strong association between faces and voices, none of these approaches investigates the effect of multiple languages on this task. In addition, existing datasets containing audio-visual information, such as VoxCeleb [31, 21, 30], FVCeleb [19] and FVMatching [25], do not provide language-level annotations. Therefore, we cannot deploy these datasets to analyse the effect of multiple languages on the association between faces and voices.

Figure 1: Multimodal data may provide an enriched understanding to improve verification performance. Joaquin can wear make-up that makes visual identification challenging, but voice can still provide enough cues to verify his identity. In this work, we are interested in understanding the effect of multilingual input when processed by an audio-visual verification model (Q1) or using the audio input alone (Q2). Joaquin is a perfectly bilingual English-Spanish speaker: would the system still be able to verify Joaquin when he speaks Spanish, even if the system was trained with English audio only?

Thus, in order to answer both questions, we create a new Multilingual Audio-Visual (MAV-Celeb) dataset comprising video and audio recordings of a large number of celebrities speaking more than one language in the wild. The proposed dataset paves the way to analyzing the impact of multiple languages on the association between faces and voices. We then propose a cross-modal verification approach to answer Q1 by analyzing the effect of multiple languages on face-voice association. In addition, the audio part of the dataset supplies language-annotated samples, which serves as a foundation to answer Q2.

To summarise, the paper's main contributions are as follows:

  • We first propose a cross-modal verification approach to analyze the effect of multiple languages on face-voice association;

  • Likewise, we perform an analysis that highlights the very same problem of multilingualism for speaker recognition;

  • We propose the MAV-Celeb dataset, containing human speech clips of celebrities with language annotations, extracted from videos uploaded online.

The rest of the paper is structured as follows: Section 2 explores the related literature on the two introduced questions; Section 3 introduces the proposed dataset, followed by experimental evidence to answer both questions in Sections 4 and 5. Finally, conclusions are presented in Section 6.

2 Related Work

We summarize previous work relevant to the two questions raised in the introduction. Q1 falls under the cross-modal verification topic, while Q2 deals with speaker recognition tasks.

2.1 Cross-modal Verification Between Faces and Voices

The last decade has witnessed an increasing use of multimodal data in challenging Computer Vision tasks, including visual question answering [2, 3], image captioning [22, 44], classification [18, 24], cross-modal retrieval [35, 45] and multimodal named entity recognition [5, 50].

Typically, multimodal applications are built on image and text information; however, recent years have seen an increased interest in leveraging audio-visual information [20, 36, 42, 48]. Previous works [1, 7] capitalize on the natural synchronization between audio and visual information to learn rich audio representations via cross-modal distillation. More recently, Nagrani et al. [33] leveraged audio and visual information to establish an association between faces and voices in a cross-modal biometric matching task. Furthermore, recent works [25, 32] introduced joint embeddings to establish correspondences between faces and voices. These methods extract audio and face embeddings so as to minimize the distance between embeddings of the same speaker while maximizing the distance among embeddings of different speakers. Similarly, Nawaz et al. [34] extracted audio and visual information with a single stream network to learn a shared deep latent space representation. This framework uses speaker identity information to eliminate the need for pairwise or triplet supervision [32, 33]. Wen et al. [46] presented a disjoint mapping network to learn a shared representation for audio and visual information by mapping them individually to common covariates (gender, nationality, identity).

Our goal is similar to previous works [25, 32, 33, 35, 46]; however, we investigate a novel problem: to understand whether the association between faces and voices is language independent.

2.2 Speaker Recognition

Speaker recognition dates back to the 1960s, when Pruzansky [38] laid the groundwork for speaker recognition systems, attempting to find a similarity measure between two speech signals by using filter banks and digital spectrograms. In the following, we provide a brief overview of speaker recognition methods clustered into two main classes: traditional and deep learning methods.


Traditional Methods – For a long time, low-dimensional short-term representations of the audio input, e.g. Mel Frequency Cepstrum Coefficients (MFCC) and Linear Predictive Coding (LPC) based features, have been the basis of speaker recognition. These features are extracted from short overlapping segments of the audio samples. Reynolds et al. [39] introduced a speaker verification method based on Gaussian Mixture Models using MFCCs. Differently, Joint Factor Analysis (JFA) models the speaker and channel subspaces separately [23]. Dehak et al. [15] introduced i-vectors, which combine both JFA and Support Vector Machines (SVM); other works employed JFA as a feature extractor in order to train an SVM classifier. Furthermore, traditional methods have also been applied to analyze the effect of multiple languages on speaker recognition tasks [6, 27, 29]. Although traditional methods showed reasonable performance on speaker recognition tasks, the majority of these approaches suffer performance degradation in real-world scenarios.
Deep Learning Methods – Neural networks have provided more effective methods for speaker recognition; therefore, the community has experienced a shift from hand-crafted features to deep neural networks. Ellis et al. [16] introduced a system in which a classifier (a Gaussian Mixture Model) is trained on embeddings from the hidden layers of a neural network. Salman et al. [40] proposed a deep neural network which learns speaker-specific characteristics from MFCC features for speaker segmentation and clustering. Chen et al. [12] used a Siamese feed-forward neural network which can discriminatively compare two voices based on MFCC features. Lei et al. [26] introduced a deep neural model with i-vectors as input features for the task of automatic speaker recognition. More recently, Nagrani et al. [30] proposed an adapted convolutional neural network (VGG-Vox) operating on spectrograms for speaker recognition. This paper has similarities with previous work, i.e. speaker identification and verification; however, the objective is different: we evaluate and provide an answer about the effect of multiple languages on speaker identification and verification strategies in the wild. To this end, we propose a dataset instrumental for answering such questions.

2.3 Related Datasets

There are various existing datasets for multilingual speaker recognition tasks, but they are not instrumental to answering Q1/Q2 due to at least one of the following reasons: i) they are obtained in a constrained environment [14]; ii) they are manually annotated and thus limited in size; iii) they are not freely available [8]; iv) they are not audio-visual [14]; v) they are missing language annotations [31, 21, 30]. A comparison of these datasets with our proposed MAV-Celeb dataset is given in Table 1.

Dataset Condition Free Language annotations
The Mixer Corpus [14] Telephone, Microphone
Verbmobil [8] Telephone, Microphone
Common Voice [4] Microphone
SITW [28] Multimedia
VoxCeleb [31, 21, 30] Multimedia
MAV-Celeb [proposed] Multimedia
Table 1: Comparison of our proposed dataset with existing multilingual datasets.

3 Dataset Description

The Multilingual Audio-Visual (MAV-Celeb) dataset (which we will release) provides data of celebrities in three languages (English, Hindi, Urdu). These three languages were selected for several reasons: i) they are spoken by approximately 1.4 billion bilingual/trilingual people; ii) this population is highly proficient in two or more languages; iii) there is a relevant corpus of media that can be extracted from available online repositories (e.g. YouTube). The collected videos cover a wide range of 'in the wild', unconstrained, challenging multi-speaker environments including political debates, press conferences, outdoor interviews, quiet studio interviews, drama and movie clips.

Dataset split                        EU                         EH                          EHU
Languages                            U / E / EU                 H / E / EH                  E / HU
# of celebrities                     70                         84                          154
# of male celebrities                44                         56                          100
# of female celebrities              26                         28                          54
# of videos                          560 / 407 / 967            546 / 669 / 1,215           2,182
# of hours                           59 / 32 / 91               48 / 60 / 109               200
# of utterances                      11,836 / 5,551 / 17,387    9,975 / 13,313 / 23,288     41,674
Avg # of videos per celebrity        8 / 6 / 14                 6 / 8 / 14                  14
Avg # of utterances per celebrity    169 / 79 / 248             119 / 158 / 277             270
Avg length of utterances (s)         17.9 / 17.8 / 17.8         17.4 / 16.5 / 16.9          17.3
Table 2: Dataset statistics. The dataset is divided into three splits (EU, EH, EHU) containing audio samples from three languages: English (E), Hindi (H) and Urdu (U). Per-language figures in each column follow the order given in the 'Languages' row.
Figure 2: Audio-visual samples selected from the proposed dataset. The visual data contains various variations such as pose, lighting conditions and motion. The green block contains information from celebrities speaking English, and the red block presents data of the same celebrities speaking Urdu.

It is also interesting to note that the visual data spans a vast range of variations including pose, motion blur, background clutter, video quality, occlusions and lighting conditions. In addition, the videos are degraded with real-world noise such as background chatter, music, overlapping speech and compression artifacts. Fig. 2 shows some audio-visual samples, while Table 2 reports statistics of the dataset. The dataset contains the splits English–Urdu (EU), English–Hindi (EH) and English–Hindi/Urdu (EHU) to analyze performance measures across multiple languages. The pipeline followed in creating the dataset is discussed in Appendix 0.A.

4 Face-voice Association

We introduce a cross-modal verification approach to analyze face-voice association across multiple languages using the MAV-Celeb dataset, in order to answer the question:

Q1. Is face-voice association language independent?



For example, consider a model trained with faces and voice samples of one language. At inference time, the model is evaluated with faces and audio samples of both the same language and a completely unheard language. This experimental setup provides a foundation to analyze the association between faces and voices across languages to answer Q1. Therefore, we extract face and voice embeddings from two subnetworks trained on VGGFace2 [10] and on voice samples from the MAV-Celeb dataset, respectively. Previous works showed that the face and voice subnetworks can be trained jointly to bridge the gap between the two modalities [25, 32]. However, we build a shallow architecture on top of the face and voice embeddings to reduce the gap between them, inspired by previous work on images and text [45].

The details of these subnetworks and of the shallow architecture are as follows:
Face Subnetwork – The face subnetwork must produce discriminative features for the face verification task. However, CNNs trained with 'softmax' alone produce features which lack discriminative capability [9, 47]. Therefore, we jointly train a CNN with 'softmax' and 'center loss' [47] to extract discriminative embeddings for faces. The network learns a center for the embeddings of each class and penalizes the distances between the embeddings and their corresponding class centers. At inference time, the network produces discriminative embeddings of the kind typically employed for face recognition tasks [9, 47]. We train the Inception ResNet-V1 network [43] on the VGGFace2 [10] dataset jointly with 'softmax' and 'center loss'.
Voice Subnetwork – Similarly, the audio network must also produce discriminative embeddings. Nagrani et al. [30] introduced the VGG-Vox network to process audio information; the network is trained with the 'softmax' loss function. In the current work, we modify the last layer of the network to configure it with the center loss. After this modification, VGG-Vox is jointly trained with 'softmax' and 'center loss' to produce discriminative embeddings for the verification task.


Center Loss – Suppose each class $c$ contains the samples of one identity. During training, the geometric center of the features of each class is computed, and the distance of each feature from its corresponding class center is minimized using the center loss:

$$L_C = \frac{1}{2}\sum_{i=1}^{m}\left\lVert x_i - c_{y_i}\right\rVert_2^2 \qquad (1)$$

The center loss simultaneously learns the centers of all classes and minimizes the distances between each class center and the features in a mini-batch. If there are $n$ classes and $m$ samples in a mini-batch, the joint loss function is given by:

$$L = L_S + \lambda L_C = -\sum_{i=1}^{m}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i + b_j}} + \frac{\lambda}{2}\sum_{i=1}^{m}\left\lVert x_i - c_{y_i}\right\rVert_2^2 \qquad (2)$$

where $x_i \in \mathbb{R}^{d}$ denotes the $i$-th deep feature, belonging to the $y_i$-th class, $d$ is the feature dimension and $c_{y_i}$ is the center of the $y_i$-th class. The vector $W_j \in \mathbb{R}^{d}$ denotes the $j$-th column of the weight matrix $W$ of the last fully connected layer and $b$ is the bias term. The scalar $\lambda$ balances the 'softmax' and center losses; the 'softmax' loss can be considered a special case of this joint supervision when $\lambda$ is set to $0$ [47].
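
For illustration only, the joint supervision of Eqs. (1) and (2) can be sketched in a few lines of PyTorch; this is not the authors' implementation, and the class count, feature dimension and balancing weight used below are placeholder values.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss of Eq. (1): half the squared distance of each feature
    from its (learnable) class center, summed over the mini-batch."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        batch_centers = self.centers[labels]               # c_{y_i} for each sample
        return 0.5 * ((features - batch_centers) ** 2).sum()

# Joint supervision of Eq. (2): softmax cross-entropy plus lambda times the center loss.
num_classes, feat_dim, lam = 154, 512, 0.003               # placeholder values
center_loss = CenterLoss(num_classes, feat_dim)
cross_entropy = nn.CrossEntropyLoss(reduction='sum')       # the 'softmax' loss L_S

def joint_loss(logits, features, labels):
    return cross_entropy(logits, labels) + lam * center_loss(features, labels)
```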
Cross-modal verification – Finally, we learn a face-voice association for the cross-modal verification approach using a two-stream neural network (which we name Two-Branch) with a single layer of nonlinearities on top of the face and voice representations. Fig. 3 shows the Two-Branch shallow architecture along with the pre-trained subnetworks. The shallow architecture indeed consists of two branches, each composed of a fully connected layer, with weight matrices $W_1$ and $W_2$ respectively, followed by a Rectified Linear Unit (ReLU). At the end of each branch, we add $L_2$ normalization.

Figure 3: Cross-modal verification network configuration. The left side represents the audio and face subnetworks, trained separately. Afterwards, the audio and face embeddings are fed to train the shallow architecture represented on the right side.
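
As a rough sketch of the shallow architecture described above (not the released model; the embedding and joint-space dimensions are assumptions), each branch reduces to a single fully connected layer followed by a ReLU and $L_2$ normalization:

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoBranch(nn.Module):
    """One fully connected layer + ReLU + L2 normalization per modality,
    mapping face and voice embeddings into a shared space."""
    def __init__(self, face_dim=512, voice_dim=1024, joint_dim=256):  # assumed sizes
        super().__init__()
        self.face_fc = nn.Linear(face_dim, joint_dim)    # weight matrix W1
        self.voice_fc = nn.Linear(voice_dim, joint_dim)  # weight matrix W2

    def forward(self, face_emb, voice_emb):
        f = F.normalize(F.relu(self.face_fc(face_emb)), p=2, dim=1)
        v = F.normalize(F.relu(self.voice_fc(voice_emb)), p=2, dim=1)
        return f, v
```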

Loss Function – Given a training face $f$, let $V^{+}$ and $V^{-}$ represent the sets of positive and negative voice samples, respectively. We impose the distance between $f$ and each positive voice sample to be smaller than the distance between $f$ and each negative voice sample by a margin $m$:

$$d(f, v^{+}) + m < d(f, v^{-}), \quad \forall\, v^{+} \in V^{+},\ v^{-} \in V^{-} \qquad (3)$$

Eq. (3) is modified for a voice $v$ as:

$$d(f^{+}, v) + m < d(f^{-}, v), \quad \forall\, f^{+} \in F^{+},\ f^{-} \in F^{-} \qquad (4)$$

where $F^{+}$ and $F^{-}$ represent the sets of positive and negative faces for $v$.

Finally, these constraints are converted into a training objective using the hinge loss. The resulting adapted loss function is given by:

$$L = \sum_{f}\sum_{v^{+},\, v^{-}} \max\!\big[0,\ m + d(f, v^{+}) - d(f, v^{-})\big] \;+\; \sum_{v}\sum_{f^{+},\, f^{-}} \max\!\big[0,\ m + d(f^{+}, v) - d(f^{-}, v)\big] \qquad (5)$$

The shallow architecture, configured with the adapted loss function, produces a joint embedding of faces and voices to study face-voice association across multiple languages using the proposed dataset.
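
A possible batched implementation of the adapted hinge loss of Eq. (5) is sketched below; it assumes one positive voice per face in the mini-batch and treats all other identities in the batch as negatives, which is one common sampling choice rather than the exact scheme used in the paper.

```python
import torch

def bidirectional_hinge_loss(f, v, margin=0.2):
    """f, v: L2-normalized face and voice embeddings of shape (B, D),
    where f[i] and v[i] belong to the same identity."""
    d = torch.cdist(f, v)                       # pairwise Euclidean distances (B, B)
    pos = d.diag().unsqueeze(1)                 # d(f_i, v_i): matching pairs
    eye = torch.eye(d.size(0), dtype=torch.bool, device=d.device)

    # Eq. (3): anchor faces, negatives are voices of other identities.
    face_anchor = (margin + pos - d).clamp(min=0.0).masked_fill(eye, 0.0)
    # Eq. (4): anchor voices, negatives are faces of other identities.
    voice_anchor = (margin + pos.t() - d).clamp(min=0.0).masked_fill(eye, 0.0)

    return face_anchor.sum() + voice_anchor.sum()
```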

4.1 Experimental Protocol

We propose an evaluation protocol in order to answer Q1, which deals with face-voice association across multiple languages. The MAV-Celeb dataset is divided into train and test splits consisting of disjoint identities from the same language, typically known as the unseen-unheard configuration [32, 33]. Fig. 4 shows the evaluation protocol during the training and testing stages. At inference time, the network is evaluated on a heard and on a completely unheard language. The protocol is more challenging than the standard unseen-unheard configuration due to the presence of an unheard language in addition to the disjoint identities. Each of the EU, EH and EHU splits is divided into disjoint train and test identities.

Figure 4: Evaluation protocol to analyze the impact of multiple languages on the association between faces and voices. The green and red blocks represent the training and testing strategies, respectively. At test time, the network is evaluated in the unseen-unheard configuration on the same language heard during training (English) along with a completely unheard language (Urdu).

4.2 Results and Discussion

We evaluate cross-modal verification between faces and voices with the proposed model along with a previously introduced method, the Deep Latent Space framework [34], to analyze the same performance measure across multiple languages. The goal of the cross-modal verification task is to verify whether a voice segment and a face image belong to the same identity or not. Table 3 shows the results of face-voice association across multiple languages using the proposed evaluation protocol. On average, a performance drop of 11.9% and 5.9% (EU and EH splits) occurred with the Two-Branch network, and of 15.6% and 15.7% with the Deep Latent Space method. These results clearly demonstrate that the association between faces and voices is not language independent. The performance degradation is due to the different data distributions of the two languages, typically known as domain shift [41]. Moreover, the model does not generalize well to unheard languages. However, performance is still better than random verification, which is not trivial considering the challenging nature of the evaluation protocol.

Furthermore, it is clear that the Deep Latent Space framework performance is superior to that of the proposed Two-Branch network, because the former is trained from scratch while the latter is trained on embeddings extracted from pre-trained models. In any case, both approaches experience a performance drop when tested on unseen-unheard identities along with a completely unheard language.
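
Since Tables 3 and 5 report the Equal Error Rate (EER), a small sketch of how EER can be computed from verification scores is given here for reference; it follows the usual definition (the operating point where the false acceptance and false rejection rates coincide) and is not code released with the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for same-identity (face, voice) pairs, 0 otherwise.
    scores: similarity scores (higher means more likely the same identity)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fnr - fpr))          # point where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2.0

# Example: scores could be negative Euclidean distances between the joint
# face and voice embeddings produced by the Two-Branch network.
```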

EU split
Method                   Configuration   Eng. test (EER)   Urdu test (EER)    Drop (%)
Two-Branch (proposed)    Eng. train      41.0              47.8               16.6
Two-Branch (proposed)    Urdu train      48.9              45.6               7.2
Deep Latent Space [35]   Eng. train      39.4              46.9               19.1
Deep Latent Space [35]   Urdu train      45.9              33.4               12.1

EH split
Method                   Configuration   Eng. test (EER)   Hindi test (EER)   Drop (%)
Two-Branch (proposed)    Eng. train      45.5              48.8               7.3
Two-Branch (proposed)    Hindi train     47.3              45.8               4.4
Deep Latent Space [35]   Eng. train      34.5              41.1               19.3
Deep Latent Space [35]   Hindi train     42.7              38.1               12.0

Table 3: Cross-modal verification between faces and voices across multiple languages on various test configurations of the MAV-Celeb dataset (lower is better).

5 Speaker Recognition

This section investigates the performance of speaker recognition across multiple languages to answer the following question.

Q2. Can a speaker be recognised irrespective of the spoken language?



For example, consider a model trained with voice samples of one language. At inference time, the model is evaluated with audio samples of the same language and of a completely unheard language of the same identities. This experimental setup provides a foundation for speaker recognition across multiple languages to answer Q2. We developed the following methodology for speaker recognition across multiple languages using the MAV-Celeb dataset.
Input features – The audio signals are converted into single-channel bit streams, with the sampling frequency chosen in accordance with the frame rate. The encoded audio signals are represented as short-term magnitude spectrograms generated directly from the raw audio of each speech segment, computed in a sliding-window fashion using a Hamming window.
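
A minimal sketch of this spectrogram extraction is shown below; the 16 kHz sampling rate and the 25 ms / 10 ms window and hop sizes are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np
from scipy import signal

def magnitude_spectrogram(waveform, sample_rate=16000,
                          win_ms=25, hop_ms=10):          # assumed window/hop sizes
    """Short-term magnitude spectrogram from a single-channel waveform."""
    nperseg = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    _, _, stft = signal.stft(waveform, fs=sample_rate, window='hamming',
                             nperseg=nperseg, noverlap=nperseg - hop)
    return np.abs(stft)                                    # (freq_bins, frames)
```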
Architecture – Speaker identification under a closed set can be considered a multi-class classification problem. Nagrani et al. [30] introduced the VGG-Vox architecture by modifying the VGG-M [11] model to adapt it to spectrogram input. Specifically, the fully connected fc6 layer of VGG-M is replaced by two layers: a fully connected layer and an average pooling layer.
Identification – Since the identification task is considered a multi-class classification problem, the last-layer output of VGG-Vox is fed into a 'softmax' to produce a probability distribution over the total number of speakers in the dataset.


Verification – For verification, feature vectors are obtained from the classification network (VGG-Vox) jointly trained with 'softmax' and center loss. The last layer (fc8) of the network is modified to produce embeddings of a fixed size. Finally, the Euclidean distance is used to compare embeddings for the verification task.
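
As an illustration of this verification step (a sketch only; `extract_embedding` is a hypothetical helper standing in for a forward pass through the modified VGG-Vox, and the threshold is an assumed value tuned on held-out data):

```python
import numpy as np

def verify(emb_a, emb_b, threshold):
    """Accept the trial if the Euclidean distance between the two
    speaker embeddings falls below the decision threshold."""
    distance = np.linalg.norm(emb_a - emb_b)
    return distance < threshold

# emb_a = extract_embedding(spectrogram_a)   # hypothetical embedding extractor
# emb_b = extract_embedding(spectrogram_b)
# same_speaker = verify(emb_a, emb_b, threshold=1.0)  # threshold set on a dev set
```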

5.1 Experimental Protocol

We propose an evaluation protocol in order to analyze the impact of multiple languages on speaker recognition and answer Q2. The MAV-Celeb dataset is divided into a typical classification scenario for speaker identification; however, different voice tracks of the same person are used for train, validation and test. The network is trained with one language and tested with the same language and with a completely unheard language of the same identities. Moreover, the dataset is split into disjoint identities for speaker verification [30]. Fig. 5 shows the evaluation protocol for speaker recognition across multiple languages. The protocol is consistent with previous studies on human subjects for speaker identification [37].

Figure 5: Evaluation protocol to analyze the impact of multiple languages on speaker recognition. The green and red blocks represent the training and testing strategies, respectively. At test time, the network is evaluated on the same language heard during training along with a completely unheard language of the same identities.

5.2 Results and Discussion

We evaluate the performance of speaker recognition across multiple languages. Table 4 shows speaker identification performance on the splits (EU, EH, EHU) of the MAV-Celeb dataset. Averaging the per-configuration drops in Table 4, performance drops of 18.7%, 34.8% and 15.4% occurred on a completely unheard language for the EU, EH and EHU splits respectively. The speaker identification model (VGG-Vox) does not generalize well to an unheard language and overfits to a particular language. However, its performance is still better than random classification on the unheard language. Based on these results, we conclude that speaker identification is a language-dependent task. Furthermore, these results are in line with previous studies, which show that humans' speaker identification performance is higher for people speaking a familiar language than for people speaking an unknown language [37].

EU split
Configuration       Eng. test Top-1 (%)   Urdu test Top-1 (%)         Drop (%)
Eng. train          54.7                  43.4                        26.0
Urdu train          47.5                  52.9                        11.4

EH split
Configuration       Eng. test Top-1 (%)   Hindi test Top-1 (%)        Drop (%)
Eng. train          65.7                  40.0                        64.3
Hindi train         49.9                  52.5                        5.2

EHU split
Configuration       Eng. test Top-1 (%)   Hindi/Urdu test Top-1 (%)   Drop (%)
Eng. train          60.1                  54.0                        11.3
Hindi/Urdu train    46.9                  56.0                        19.4

Table 4: Speaker identification results across multiple languages on test configurations of MAV-Celeb dataset (higher is better).

Similarly, Table 5 shows speaker verification performance on the splits (EU, EH, EHU) of the MAV-Celeb dataset. Averaging the per-configuration drops in Table 5, performance drops of 5.5%, 12.6% and 9.0% occurred on a completely unheard language for the EU, EH and EHU splits respectively. Therefore, speaker verification is also not language independent.

EU split
Configuration       Eng. test (EER)   Urdu test (EER)         Drop (%)
Eng. train          36.7              38.7                    5.4
Urdu train          37.6              35.6                    5.6

EH split
Configuration       Eng. test (EER)   Hindi test (EER)        Drop (%)
Eng. train          30.1              32.9                    9.3
Hindi train         32.7              28.2                    15.9

EHU split
Configuration       Eng. test (EER)   Hindi/Urdu test (EER)   Drop (%)
Eng. train          35.7              39.1                    9.5
Hindi/Urdu train    34.5              31.8                    8.5

Table 5: Speaker verification results across multiple languages on various test configurations of MAV-Celeb dataset (lower is better).

6 Conclusion

In this work, the effect of language on cross-modal verification between faces and voices, as well as on speaker recognition, is explored. A new audio-visual dataset of celebrities with language-level annotations is presented. The dataset contains splits in which the same set of identities speaks English/Urdu, English/Hindi and both. In the cross-modal verification experiments, a performance drop is observed when changing the training and test language, indicating that face-voice association is not language independent. In the case of speaker recognition, a similar drop in performance is observed, leading to the conclusion that speaker recognition is also a language-dependent task. The drop in performance is due to the domain shift caused by the two different languages.

Appendix 0.A Dataset Collection Pipeline

Figure 6: Data collection pipeline. It consists of two blocks: the upper block downloads static images while the bottom block downloads and processes videos from YouTube.

In this section we present a semi-automated pipeline, inspired by Nagrani et al. [30], for collecting the proposed dataset. The pipeline is shown in Fig. 6 and its stages are discussed below.
Stage 1 – List of Persons of Interest: In this stage, a candidate list of Persons of Interest (POIs) is generated by scraping Wikipedia. The POIs cover a wide range of identities including sports persons, actors, actresses, politicians, entrepreneurs and singers.
Stage 2 – Collecting lists of YouTube links. In this step we used crowd-sourcing to collect lists of YouTube videos. Keywords such as "Urdu interview", "English interview", "public speech English" and "public speech Urdu" are appended to the search query to increase the likelihood that the search results contain an instance of the POI speaking. The links of the search results are stored in text files, and the videos are then automatically downloaded from these links.
Stage 3 – Face tracks. In this stage, we employ Multi-task Cascaded Convolutional Networks (MTCNN) for joint face detection and alignment [49]. MTCNN can detect faces in extreme conditions and across different poses. After face detection and alignment, shot boundaries are detected by comparing colour histograms across consecutive frames. Based on key frames from the shot boundaries and the detected faces, face tracks are generated.
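
A sketch of the shot-boundary step using OpenCV colour histograms is given below; the histogram bins and the correlation threshold are assumptions, not values from the paper.

```python
import cv2

def shot_boundaries(video_path, threshold=0.7):            # assumed threshold
    """Flag frames whose colour histogram correlates poorly with the
    previous frame's histogram, i.e. likely shot changes."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```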
Stage 4 – Active speaker verification. The goal of this stage is to determine the visibly speaking faces. We carry out this stage using 'SyncNet', which estimates the correlation between mouth motion and the audio track [13]. Based on the scores from this model, face tracks with no visibly speaking face, with voice-over or with background speech are rejected.
Stage 5 – Static images. In this stage, static images are automatically downloaded using the Google Custom Search API based on the list of POIs obtained from Stage 1. MTCNN is employed to detect and align the static face images. A clustering mechanism based on a popular density-based clustering algorithm, DBSCAN [17], is used to remove false positives from the detected and aligned faces. Interestingly, DBSCAN does not require an a priori specification of the number of clusters in the data. Intuitively, the clustering algorithm groups together faces of an identity that are closely packed.
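
As a sketch of this filtering step (the eps and min_samples values are assumptions), DBSCAN from scikit-learn can be applied directly to the face embeddings, with points labelled -1 discarded as false positives:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def filter_false_positives(face_embeddings, eps=0.9, min_samples=5):  # assumed parameters
    """Cluster face embeddings of one identity; DBSCAN labels outliers
    (likely false detections) with -1, and the largest cluster is kept."""
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric='euclidean').fit_predict(np.asarray(face_embeddings))
    valid = labels[labels != -1]
    if valid.size == 0:                      # everything flagged as noise
        return np.array([], dtype=int)
    dominant = np.bincount(valid).argmax()   # largest cluster = the identity
    return np.where(labels == dominant)[0]   # indices of the kept faces
```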
Stage 6 – Face track classification. In this stage, active speaker face tracks are classified according to whether they belong to the POI or not. We train an Inception ResNet-V1 network [43] on the VGGFace2 dataset [10] with center loss [47] to extract discriminative embeddings from face tracks and static images. A Support Vector Machine classifier is trained on the static face embeddings. Finally, classification is performed by thresholding the score obtained for each face track.
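
A sketch of this final classification step is shown below; the helper inputs and the acceptance threshold are hypothetical, and per-frame probabilities are simply averaged over a track, which is one plausible reading of the score-and-threshold description above.

```python
import numpy as np
from sklearn.svm import SVC

def train_poi_classifier(poi_embeddings, other_embeddings):
    """Binary classifier: POI static-image embeddings (label 1) versus
    embeddings of other identities (label 0)."""
    X = np.vstack([poi_embeddings, other_embeddings])
    y = np.concatenate([np.ones(len(poi_embeddings)),
                        np.zeros(len(other_embeddings))])
    return SVC(probability=True).fit(X, y)

def track_belongs_to_poi(classifier, track_embeddings, threshold=0.8):  # assumed threshold
    """Average the per-frame POI probabilities over a face track and
    accept the track if the mean score exceeds the threshold."""
    scores = classifier.predict_proba(track_embeddings)[:, 1]
    return scores.mean() > threshold
```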

References

  • [1] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman (2018) Emotion recognition in speech using cross-modal transfer in the wild. In Proceedings of the 26th ACM international conference on Multimedia, pp. 292–301. Cited by: §2.1.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: §2.1.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §2.1.
  • [4] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber (2019) Common voice: a massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670. Cited by: Table 1.
  • [5] O. Arshad, I. Gallo, S. Nawaz, and A. Calefati (2019) Aiding intra-text representations with visual context for multimodal named entity recognition. In 2019 15th IAPR International Conference on Document Analysis and Recognition (ICDAR), Cited by: §2.1.
  • [6] R. Auckenthaler, M. J. Carey, and J. S. Mason (2001) Language dependency in text-independent speaker verification. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), Vol. 1, pp. 441–444. Cited by: §2.2.
  • [7] Y. Aytar, C. Vondrick, and A. Torralba (2016) Soundnet: learning sound representations from unlabeled video. In Advances in neural information processing systems, pp. 892–900. Cited by: §2.1.
  • [8] S. Burger, K. Weilhammer, F. Schiel, and H. G. Tillmann (2000) Verbmobil data collection and annotation. In Verbmobil: Foundations of speech-to-speech translation, pp. 537–549. Cited by: §2.3, Table 1.
  • [9] A. Calefati, M. K. Janjua, S. Nawaz, and I. Gallo (2018) Git loss for deep face recognition. In British Machine Vision Conference, Cited by: §4.
  • [10] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman (2018) Vggface2: a dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 67–74. Cited by: Appendix 0.A, §4, §4.
  • [11] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531. Cited by: §5.
  • [12] D. Chen, S. Tsai, V. Chandrasekhar, G. Takacs, H. Chen, R. Vedantham, R. Grzeszczuk, and B. Girod (2011) Residual enhanced visual vectors for on-device image matching. In 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 850–854. Cited by: §2.2.
  • [13] J. S. Chung and A. Zisserman (2016) Out of time: automated lip sync in the wild. In Asian conference on computer vision, pp. 251–263. Cited by: Appendix 0.A.
  • [14] C. Cieri, J. P. Campbell, H. Nakasone, K. Walker, and D. Miller (2004) The mixer corpus of multilingual, multichannel speaker recognition data. Technical report PENNSYLVANIA UNIV PHILADELPHIA. Cited by: §2.3, Table 1.
  • [15] N. Dehak, P. Kenny, R. Dehak, O. Glembek, P. Dumouchel, L. Burget, V. Hubeika, and F. Castaldo (2009) Support vector machines and joint factor analysis for speaker verification. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4237–4240. Cited by: §2.2.
  • [16] D. P. Ellis, R. Singh, and S. Sivadas (2001) Tandem acoustic modeling in large-vocabulary recognition. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), Vol. 1, pp. 517–520. Cited by: §2.2.
  • [17] M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise.. In Kdd, Vol. 96, pp. 226–231. Cited by: Appendix 0.A.
  • [18] I. Gallo, A. Calefati, and S. Nawaz (2017) Multimodal classification fusion in real-world scenarios. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 5, pp. 36–41. Cited by: §2.1.
  • [19] S. Horiguchi, N. Kanda, and K. Nagamatsu (2018) Face-voice matching using cross-modal embeddings. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1011–1019. Cited by: §1.
  • [20] J. Huang and B. Kingsbury (2013) Audio-visual deep learning for noise robust speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7596–7599. Cited by: §2.1.
  • [21] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. In INTERSPEECH. Cited by: §1, §2.3, Table 1.
  • [22] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §2.1.
  • [23] P. Kenny (2005) Joint factor analysis of speaker and session variability: theory and algorithms. CRIM, Montreal,(Report) CRIM-06/08-13 14, pp. 28–29. Cited by: §2.2.
  • [24] D. Kiela, E. Grave, A. Joulin, and T. Mikolov (2018) Efficient large-scale multi-modal classification. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.1.
  • [25] C. Kim, H. V. Shin, T. Oh, A. Kaspar, M. Elgharib, and W. Matusik (2018) On learning associations of faces and voices. In Asian Conference on Computer Vision, pp. 276–292. Cited by: §1, §2.1, §2.1, §4.
  • [26] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren (2014) A novel scheme for speaker recognition using a phonetically-aware deep neural network. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1695–1699. Cited by: §2.2.
  • [27] L. Lu, Y. Dong, X. Zhao, J. Liu, and H. Wang (2009) The effect of language factors for robust speaker recognition. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4217–4220. Cited by: §2.2.
  • [28] M. McLaren, L. Ferrer, D. Castan, and A. Lawson (2016) The speakers in the wild (sitw) speaker recognition database.. In Interspeech, pp. 818–822. Cited by: Table 1.
  • [29] A. Misra and J. H. Hansen (2014) Spoken language mismatch in speaker verification: an investigation with nist-sre and crss bi-ling corpora. In 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 372–377. Cited by: §2.2.
  • [30] A. Nagrani, J. S. Chung, and A. Zisserman (2017) VoxCeleb: a large-scale speaker identification dataset. In INTERSPEECH, Cited by: Appendix 0.A, §1, §2.2, §2.3, Table 1, §4, §5.1, §5.
  • [31] A. Nagrani (2019) VoxCeleb: large-scale speaker verification in the wild.. Computer Speech & Language. Cited by: §1, §2.3, Table 1.
  • [32] A. Nagrani, S. Albanie, and A. Zisserman (2018) Learnable pins: cross-modal embeddings for person identity. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 71–88. Cited by: §1, §2.1, §2.1, §4.1, §4.
  • [33] A. Nagrani, S. Albanie, and A. Zisserman (2018) Seeing voices and hearing faces: cross-modal biometric matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8427–8436. Cited by: §1, §2.1, §2.1, §4.1.
  • [34] S. Nawaz, M. K. Janjua, I. Gallo, A. Mahmood, and A. Calefati (2019) Deep latent space learning for cross-modal mapping of audio and visual signals. In 2019 Digital Image Computing: Techniques and Applications (DICTA), pp. 1–7. Cited by: §1, §2.1, §4.2.
  • [35] S. Nawaz, M. Kamran Janjua, I. Gallo, A. Mahmood, A. Calefati, and F. Shafait (2019) Do cross modal systems leverage semantic relationships?. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §2.1, §2.1, Table 3.
  • [36] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. Cited by: §2.1.
  • [37] T. K. Perrachione, S. N. Del Tufo, and J. D. Gabrieli (2011) Human voice recognition depends on language ability. Science 333 (6042), pp. 595–595. Cited by: §5.1, §5.2.
  • [38] S. Pruzansky (1963) Pattern-matching procedure for automatic talker recognition. The Journal of the Acoustical Society of America 35 (3), pp. 354–358. Cited by: §2.2.
  • [39] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn (2000) Speaker verification using adapted gaussian mixture models. Digital signal processing 10 (1-3), pp. 19–41. Cited by: §2.2.
  • [40] A. Salman and K. Chen (2011) Exploring speaker-specific characteristics with deep learning. In The 2011 International Joint Conference on Neural Networks, pp. 103–110. Cited by: §2.2.
  • [41] H. Shimodaira (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference 90 (2), pp. 227–244. Cited by: §4.2.
  • [42] N. Srivastava and R. R. Salakhutdinov (2012) Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems, pp. 2222–2230. Cited by: §2.1.
  • [43] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence. Cited by: Appendix 0.A, §4.
  • [44] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §2.1.
  • [45] L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5005–5013. Cited by: §2.1, §4.
  • [46] Y. Wen, M. A. Ismail, W. Liu, B. Raj, and R. Singh (2019) Disjoint mapping network for cross-modal matching of voices and faces. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §1, §2.1, §2.1.
  • [47] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: Appendix 0.A, §4, §4.
  • [48] Y. Wu, L. Zhu, Y. Yan, and Y. Yang (2019) Dual attention matching for audio-visual event localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6292–6300. Cited by: §2.1.
  • [49] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: Appendix 0.A.
  • [50] Q. Zhang, J. Fu, X. Liu, and X. Huang (2018) Adaptive co-attention network for named entity recognition in tweets. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.1.