I Introduction
Why do people watching a large screen in a movie theater hear an actor's voice coming from his face, even though the audio speakers are on the sides of the hall? The illusion of a third voice, arising from the association of visual data with auditory information, is known as the McGurk effect [17]. It is not surprising that auditory experience is influenced by visual input. This is also the case with ventriloquists, who exploit the visual capture ability of the audience to sell the idea that their puppets speak: since the audience sees the puppet moving its lips, the perceived location of the sound shifts simultaneously and the audience experiences the McGurk effect.
It is a well studied fact that humans associate the voices and faces of people [13, 5], owing to the fact that the neuro-cognitive pathways for voices and faces share the same structure [10]. This behavioral tendency is often exploited in Hollywood films, where a person can be identified just by hearing his or her voice; in fact, the movie ‘Taken’ hinges on identifying a person using only auditory input. Recognizing speakers from their voices requires information from different domains to be mapped onto a shared latent space [19]. Once the joint representation is obtained, a family of algorithms can be applied, ranging from matching, verification and authentication to retrieval and search. However, speaker recognition under unconstrained conditions is an exceedingly difficult task, since real-world scenarios are not limited to noise-free environments. Background noise plays an important role and is an inevitable part of everyday life; without characterizing background noise and variation, holistic understanding and robust learning are not possible [20].
The studies above support the hypothesis that an association may also be found between voices and faces, which motivates the task of cross-modal mapping. With this in perspective, the current paper focuses on obtaining a joint representation of auditory and visual input using a single stream network for both modalities. The problem setup is that we have a corpus of auditory and visual information (voices and faces) and we perform the following tasks on it with a single network: matching, retrieval and verification.
One of the major hindrances in performing such experiments has been the unavailability of a large-scale corpus containing information from both domains (images and audio). Recently, however, the VoxCeleb dataset [20] was introduced, comprising video and audio recordings of a large number of celebrities. Previous works [14, 19, 18] have modeled the problem of cross-modal matching by employing separate networks for the different modalities, either in a triplet-network fashion or as subnetworks. Separate networks in triplet fashion may help with modularity when only a few modalities (two in this case) are present at the input, but it is important to take into account the possibility of multiple input streams (text, image, voice, etc.). In a triplet fashion, the space of candidate triplets grows as $O(n^3)$, where $n$ is the sample size. The VoxCeleb dataset contains 1,251 identities, so the number of cross-modal combinations is already very large even if we consider a single image and a single voice per identity, and it grows rapidly as the number of instances per identity increases.
Nagrani et al. [19] performed experiments in static and dynamic settings. Their five-stream dynamic-fusion architecture (two face subnetworks, one voice subnetwork, and two dynamic-feature subnetworks) requires five subnetworks to account for this fusion. We achieve the five-stream dynamic-fusion results with a single network trained in an end-to-end fashion, and our network operates under no restrictions in terms of triplet selection. We perform a series of experiments inspired by [19, 29]. Furthermore, we carry out two additional experiments to establish the robustness of our methodology. Our main contributions are listed below.
-
We introduce a single, end-to-end trainable network for performing auditory and visual information matching, verification and retrieval.
-
We propose a novel training procedure which can be coupled with deep neural networks to map multiple modalities to shared latent space without pairwise or triplet information.
-
We perform a series of cross-modal matching, verification and retrieval experiments in the wild, considering demographic factors such as age, gender and nationality.
The rest of the paper is structured as follows: we explore the related literature in Section II; details of the proposed approach are discussed in Section III, followed by experiments and evaluations in Section IV and Section V respectively. Section VI presents discussion, followed by ablation study in Section VII. Finally, conclusions are in Section VIII.
II Related Work
In this section we review previous work under multiple subsections. We first ground the problem in cognitive neuroscience studies, followed by the introductory work done in the vision community.
II-A Cognitive Neuroscience Studies
The characteristic display of emotion through auditory information dates back to Darwin: in his book The Expression of the Emotions in Man and Animals, Darwin recognized emotions as non-private experiences and described their characteristic displays. Speech perception studies in psychology provide experimental support for Darwin's idea of a link between auditory and visual information [9, 17, 31]. Studies such as [13, 5] provide evidence that equivalent information about identity is available cross-modally from both the auditory and the visual domains. The McGurk effect illustrates the fundamental idea that what is seen can affect what is heard. It also explains how ventriloquists enhance spatial attention to sound by exploiting cross-modal information from visual cues. More generally, the study in [27] shows that having the speaker's face visible improves the perception of speech in noise. It is important to note that human perception studies have performed static matching of voices and faces, but report that the results lie at chance level [16, 13]. These questions have previously been examined in detail from a cognitive-psychological perspective and have only recently been introduced to the vision community.
II-B Audio and Visual Recognition
Audio and visual recognition are important tasks, and mainstream vision techniques have been directed towards solving them. In recent years, face and audio recognition have seen extensive progress [6, 21, 23, 12, 22, 8]. These approaches allow for effective representation of unimodal information; however, they do not account for the alignment of information learned across modalities. In the current work we present a single stream architecture trained with a distance learning paradigm to obtain effective representations of audio and visual signals in a shared latent space.
II-C Joint Latent Space Representation
To effectively capture cross-modal embeddings, information across all modalities has to be learned and mapped onto a joint latent space. Cross-modal learning employing visual and textual data has seen significant progress [11, 28, 32, 1]. However, much less work has been directed towards exploring joint representations of audio and visual data. A few works focus on audio-visual matching for scene understanding [2, 3, 4, 24, 25], while others learn cross-modal associations between faces and voices [7, 18, 19, 20, 29].
III Deep Latent Space Learning Approach
One of the core ideas of this paper is to bridge the gap between images and encoded voice signals, i.e. spectrograms. Our proposed approach eliminates the need for a separate network per modality, since similar results can be achieved with a single network. Fig. 1 visually explains the framework of the proposed approach.

Consider training data points $\{(a_i, v_i, y_i)\}_{i=1}^{N}$, where $a_i$ and $v_i$ represent the audio and visual signals belonging to a class and $y_i$ represents the label. The objective of latent space learning is the minimization of the distance between the features of $a_i$ and $v_i$ and the center $c_{y_i}$ of class $y_i$.
This can be achieved with the help of a single stream convolutional neural network which is trained in an end-to-end fashion.
The details of the proposed approach are presented in the following subsections:
III-A Visual Signals
The input to the single stream network consists of a three-channel (RGB) facial image cropped to contain only the facial region and resized to a fixed square resolution. Note that, under the proposed framework, the single stream network is capable of handling both dynamic and static input without any conditioning.
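A minimal sketch of this input pipeline is given below, assuming a square crop resolution (`IMG_SIZE` is a placeholder, since the exact input size is not reproduced here) and a bounding box supplied by an external face detector.

```python
# Minimal sketch of the face input pipeline described above. The crop
# resolution IMG_SIZE is an assumption; the paper fixes a specific square
# input size that is not reproduced in this sketch.
from PIL import Image
import numpy as np

IMG_SIZE = 160  # assumed square input resolution

def load_face(path, box):
    """Crop the facial region given a bounding box (left, top, right, bottom)
    and resize it to the fixed network input size."""
    img = Image.open(path).convert("RGB")
    face = img.crop(box).resize((IMG_SIZE, IMG_SIZE), Image.BILINEAR)
    return np.asarray(face, dtype=np.float32) / 255.0  # HxWx3 array in [0, 1]
```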
III-B Audio Signals
In addition to visual input, audio signals are also fed to the network. The encoded audio signals are short-term magnitude spectrograms generated directly from raw audio clips of three seconds. The audio stream is extracted and converted to single-channel audio at a 16 kHz sampling rate. The spectrogram extraction follows [20, 19]; however, we do not perform any normalization as part of pre- or post-processing.
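The following sketch shows how such a short-term magnitude spectrogram could be computed for a 3-second, 16 kHz clip. The window and hop lengths (25 ms / 10 ms) are assumptions chosen as common speech-processing defaults, not values taken from the paper; no normalization is applied, matching the text above.

```python
# Sketch of the short-term magnitude spectrogram computation for a 3 s,
# 16 kHz, single-channel clip. Window/hop lengths are assumed defaults.
import numpy as np
from scipy import signal

def audio_to_spectrogram(waveform, sr=16000, win_ms=25, hop_ms=10):
    nperseg = int(sr * win_ms / 1000)                 # samples per window
    noverlap = nperseg - int(sr * hop_ms / 1000)      # overlap between windows
    _, _, spec = signal.spectrogram(
        waveform, fs=sr, nperseg=nperseg, noverlap=noverlap, mode="magnitude"
    )
    return spec.astype(np.float32)                    # (freq_bins, time_frames)

clip = np.random.randn(3 * 16000)          # stand-in for a 3 s raw audio clip
spectrogram = audio_to_spectrogram(clip)   # fed to the network like an image
```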
III-C The Single Stream Network
Our proposed framework is generic and any appropriate deep neural network can be employed. In our implementation, we use InceptionResNet-V1 as the single stream network for the joint embedding of audio and visual signals (Fig. 1). The network is trained using both face images and spectrograms. Suppose a class representing an identity contains audio spectrograms $a_1, \dots, a_m$ and face images $v_1, \dots, v_n$. Each image and spectrogram is input to the same network $\Phi$, which produces the feature vectors

$z^{a}_{i} = \Phi(a_{i}), \qquad z^{v}_{j} = \Phi(v_{j}).$    (1)
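An illustrative sketch of this single-stream idea is shown below. It is not the paper's exact architecture: a torchvision ResNet-18 stands in for InceptionResNet-V1, the embedding size is an assumption, and spectrograms are resized and replicated to three channels so that the same backbone can consume both modalities.

```python
# Illustrative sketch (not the paper's exact architecture): one backbone
# embeds both modalities. Faces arrive as 3-channel images; spectrograms are
# resized and replicated to 3 channels so the same network can consume them.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class SingleStreamNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = resnet18(weights=None)                       # stand-in backbone
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, x):                 # x: (B, 3, H, W) face or spectrogram image
        return self.backbone(x)           # (B, embed_dim) shared embedding

def spectrogram_to_input(spec, size=224):
    """Turn a (freq, time) magnitude spectrogram into a 3-channel tensor."""
    x = torch.as_tensor(spec).unsqueeze(0).unsqueeze(0)         # (1, 1, F, T)
    x = F.interpolate(x, size=(size, size), mode="bilinear",
                      align_corners=False)
    return x.repeat(1, 3, 1, 1)                                 # (1, 3, size, size)
```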
Thus, during the training phase, face images and spectrograms are treated in the same fashion, and a single stream network can effectively bridge the gap between image and audio, eliminating the need for a separate network per modality. In our implementation, instead of using traditional pairwise or triplet loss functions, we extend the center loss of [30], jointly trained with a softmax loss, to cross-modal distance learning. This loss simultaneously learns a center for every class from both the face images and the spectrograms in a mini-batch and minimizes the distances between each center and its associated images and spectrograms. It thus imposes a neighborhood-preserving constraint within each modality as well as across modalities. If there are $k$ classes in a mini-batch of size $m$, the loss function is given by
$\mathcal{L} = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} z_i + b_{y_i}}}{\sum_{j=1}^{k} e^{W_{j}^{T} z_i + b_j}} + \frac{\lambda}{2} \sum_{i=1}^{m} \left\lVert z_i - c_{y_i} \right\rVert_2^2.$    (2)

In Eq. 2, $z_i$ denotes the $i$-th deep feature (of a face image or a spectrogram), belonging to the $y_i$-th class, $c_{y_i}$ denotes the corresponding class center, $W_j$ and $b_j$ are the weights and bias of the softmax layer, and $\lambda$ balances the two terms.
This loss function minimizes the variation between face images and spectrograms within a class and effectively preserves the neighborhood structure: face images and spectrograms that do not belong to the same identity do not fall in the same neighborhood. Implementation details are explained in the next section.
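A sketch of this joint supervision is given below. It follows the softmax-plus-center-loss formulation of [30] applied to a mixed batch of face and spectrogram embeddings; treating the centers as learnable parameters and the weight `lambda_c` are assumptions of this sketch, not the paper's exact implementation.

```python
# Sketch of the joint supervision in Eq. 2: a softmax classification loss plus
# a center-loss term pulling each embedding (face or spectrogram alike)
# towards its identity's center. Learnable centers and lambda_c are assumptions.
import torch
import torch.nn as nn

class SoftmaxCenterLoss(nn.Module):
    def __init__(self, num_classes, embed_dim, lambda_c=0.5):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_classes)      # softmax head
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.ce = nn.CrossEntropyLoss()
        self.lambda_c = lambda_c

    def forward(self, embeddings, labels):
        # embeddings: (B, D) mixed face/spectrogram features; labels: (B,) identities
        softmax_loss = self.ce(self.classifier(embeddings), labels)
        center_loss = 0.5 * (embeddings - self.centers[labels]).pow(2).sum(dim=1).mean()
        return softmax_loss + self.lambda_c * center_loss
```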
IV Experiments
We perform a series of experiments on various tasks consisting of cross-modal verification, cross-modal matching and cross-modal retrieval to evaluate the embeddings learned by the single stream network under the proposed framework. The experimental setup and dataset details are explained below.
IV-A Experimental Setup
We perform three different experiments, described below.
IV-A1 Cross-Modal Verification
The first task is cross-modal verification, where the goal is to verify whether an audio segment and a face image belong to the same identity. Two inputs are considered, i.e. a face and a voice, and the verification decision depends on a threshold applied to their similarity. The threshold can be adjusted to trade off wrong rejections of true matches against wrong acceptances of false matches. We report results using standard verification metrics, i.e. the area under the ROC curve (AUC) and the equal error rate (EER).
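The sketch below shows how these two metrics could be computed from similarity scores of face–voice test pairs; the score and label arrays are placeholders for illustration only.

```python
# How AUC and EER could be computed from cosine similarities between face and
# voice embeddings of test pairs; the arrays below are illustrative placeholders.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def verification_metrics(scores, labels):
    """scores: similarity per face-voice pair; labels: 1 = same identity."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    eer = fpr[np.nanargmin(np.abs(fnr - fpr))]   # operating point where FPR == FNR
    return auc, eer

scores = np.array([0.91, 0.12, 0.78, 0.33, 0.65, 0.20])
labels = np.array([1, 0, 1, 0, 1, 0])
print(verification_metrics(scores, labels))
```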
IV-A2 Cross-Modal Matching
The second task is cross-modal matching, where the goal is to match the input modality (probe) against a gallery of size $N$ consisting of the other modality. We increase $N$ to determine how the results change. For example, in the $1{:}2$ matching task, we are given one modality at the input, e.g. a face, and the gallery contains two inputs from the other modality, e.g. audio; one of them is the true match and the other serves as an imposter. We report results using the matching accuracy. We perform this task in five settings, with $N$ increasing over $\{2, 4, 6, 8, 10\}$. A sketch of this evaluation protocol follows.
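The sketch below implements the forced $1{:}N$ matching evaluation on precomputed embeddings: a probe is counted correct if the true gallery entry has the highest cosine similarity. The input arrays are placeholders for embeddings produced by the network.

```python
# Sketch of the 1:N forced-matching protocol: each probe embedding is compared
# against a gallery of N embeddings from the other modality (exactly one true
# match) and counted correct if the true match has the highest cosine similarity.
import numpy as np

def n_way_accuracy(probes, galleries, true_idx):
    """probes: (Q, D); galleries: (Q, N, D); true_idx: (Q,) index of the match."""
    p = probes / np.linalg.norm(probes, axis=1, keepdims=True)
    g = galleries / np.linalg.norm(galleries, axis=2, keepdims=True)
    sims = np.einsum("qd,qnd->qn", p, g)            # cosine similarity per gallery entry
    return float(np.mean(sims.argmax(axis=1) == true_idx))
```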
IV-A3 Cross-Modal Retrieval
Lastly, we evaluate the learned embeddings on cross-modal retrieval. Given an input from a single modality, the task is to retrieve all semantic matches from the opposite modality. We perform this task in both the Face → Voice and Voice → Face directions. We report results using a recall metric that evaluates accuracy over the first retrieved results for each query.
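The exact metric symbol is not reproduced above; a recall-at-K style evaluation is one plausible instantiation and is sketched below, with `k` and the input arrays treated as illustrative assumptions.

```python
# Sketch of a recall-at-K style retrieval evaluation: a query embedding from
# one modality ranks the whole gallery of the other modality, and the query
# counts as a hit if any of its top-K results shares the query identity.
import numpy as np

def recall_at_k(query_emb, query_ids, gallery_emb, gallery_ids, k=10):
    """All embeddings are numpy arrays; *_ids are numpy arrays of identity labels."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                                   # (num_queries, gallery_size)
    topk = np.argsort(-sims, axis=1)[:, :k]          # indices of the K best matches
    hits = [(gallery_ids[idx] == qid).any() for idx, qid in zip(topk, query_ids)]
    return float(np.mean(hits))
```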
IV-B Dataset
Recently, Nagrani et al. [20] introduced a large-scale dataset of audio-visual human speech videos extracted ‘in the wild’ from YouTube. Nagrani et al. [18] created two train/test splits of this dataset to perform various cross-modal tasks: the first split consists of disjoint videos from the same set of speakers, while the second split contains disjoint identities. We train the model on these two training sets, allowing us to evaluate on both test sets, the first for seen-heard identities and the second for unseen-unheard identities. Note that we follow the same train, validation and test split configurations as [18] for fair comparison.
IV-C Implementation Details
We train the single stream network with a standard hyper-parameter setting. The input images and spectrograms are resized to a fixed resolution, and the output feature vector is extracted from the last fully connected layer of the single stream network. For optimization we employ the Adam optimizer [15] because of its ability to adapt the learning rate during training, together with a weight decay strategy that reduces the learning rate by a constant factor. The network is trained end-to-end for a fixed number of epochs, with each mini-batch composed of randomly selected images and spectrograms. Training with mini-batches speeds up the process and helps with generalization.
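A minimal training-loop sketch combining the earlier sketches is shown below. The learning rate, decay schedule, number of epochs, batch composition and class count are placeholders, not the paper's values, and `train_loader` is a hypothetical loader yielding mixed face/spectrogram batches with identity labels.

```python
# Minimal training-loop sketch combining SingleStreamNet and SoftmaxCenterLoss
# from the earlier sketches. All hyperparameter values here are placeholders.
import torch

model = SingleStreamNet(embed_dim=128)
criterion = SoftmaxCenterLoss(num_classes=1251, embed_dim=128)   # placeholder class count
params = list(model.parameters()) + list(criterion.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

def train(train_loader, epochs=30):
    # train_loader yields (inputs, labels): mixed mini-batches of face images and
    # 3-channel spectrogram tensors with their identity labels (hypothetical loader).
    for _ in range(epochs):
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```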
V Evaluation
V-A Cross-modal Verification
In this section we report results of the single stream network on the cross-modal verification task, the aim of which is to determine whether an audio segment and a face image come from the same identity. Recently, [18] used the VoxCeleb dataset to benchmark this task under two evaluation protocols, one for seen-heard identities and the other for unseen-unheard identities. We evaluate on the same test pairs provided in [18] (http://www.robots.ox.ac.uk/~vgg/research/LearnablePins) for each protocol, covering both the seen-heard and the unseen-unheard identities. The results for cross-modal verification are reported in Table I, using the area under the ROC curve (AUC) and the equal error rate (EER). As can be seen from the table, our model trained from scratch outperforms the state of the art on both the seen-heard and the unseen-unheard protocols.
Furthermore, we examine the effect of Gender (G), Nationality (N) and Age (A) separately, each of which influences both face and voice verification. It is important to note that [18] employed a pre-trained network, whereas we train the model from scratch. Our network outperforms [18] on G, N, A and their combination (GNA) in the seen-heard formulation, despite their use of a pre-trained backbone (see Table II). On the unseen-unheard formulation our network shows comparable results for N, A and GNA, while outperforming [18] on the random and G configurations regardless of its pre-trained backbone (see Table II).
TABLE I: Cross-modal verification results (AUC and EER).

| Method | AUC % | EER % |
|---|---|---|
| Seen-Heard | | |
| Learnable Pins [18] | 73.8 | 34.1 |
| Proposed SSNet | 91.1 | 17.2 |
| Unseen-Unheard | | |
| Learnable Pins [18] | 63.5 | 39.2 |
| Proposed SSNet | 78.8 | 29.5 |
TABLE II: Effect of demographic criteria, Gender (G), Nationality (N) and Age (A), on cross-modal verification (AUC %).

| Demographic Criteria | Configuration | Random | G | N | A | GNA |
|---|---|---|---|---|---|---|
| Seen-Heard (AUC %) | | | | | | |
| Learnable Pins [18] | Scratch | 73.8 | - | - | - | - |
| Learnable Pins [18] | Pre-train | 87.0 | 74.2 | 85.9 | 86.6 | 74.0 |
| Proposed SSNet | Scratch | 91.2 | 82.5 | 89.9 | 90.7 | 81.8 |
| Unseen-Unheard (AUC %) | | | | | | |
| Learnable Pins [18] | Scratch | 63.5 | - | - | - | - |
| Learnable Pins [18] | Pre-train | 78.5 | 61.1 | 77.2 | 74.9 | 58.8 |
| Proposed SSNet | Scratch | 78.8 | 62.4 | 53.1 | 73.5 | 51.4 |
V-B Cross-modal Matching
In this section we evaluate the single stream network on the cross-modal matching task, performing $1{:}N$ matching to assess the performance of our approach. Unlike others [19, 18], we do not require positive or negative pair selection, since under the proposed framework the network learns without such pairwise supervision. Table III reports results on this task along with a comparison against recent approaches. In Table III, the probe is a voice while the matching gallery consists of faces; for instance, given a voice as input in the $1{:}N$ matching task, we find the entry in the gallery of $N$ faces that best matches the input. It is important to note that for the $1{:}N$ tasks, the work in [19] trains a separate network for each $N$, whereas a major advantage of training under the proposed framework is that it is not restricted to a particular value of $N$: the single stream network can handle any $N$ without increasing the number of subnetworks. We observe that increasing $N$ decreases performance roughly linearly, as the task becomes more challenging. Fig. 2 shows a qualitative analysis of the forced $N$-way matching task. The results are comparable with the state of the art; a detailed discussion follows in Section VI.
TABLE III: Cross-modal matching ($1{:}N$) accuracy with a voice probe and a face gallery.

| Inputs (N) | Learnable PINs [18] | SVHF-Net [19] | Proposed SSNet |
|---|---|---|---|
| Voice → Face (%) | | | |
| 2 | 84 | 78 | 78 |
| 4 | 54 | 46 | 56 |
| 6 | 42 | 39 | 42 |
| 8 | 36 | 34 | 36 |
| 10 | 30 | 28 | 30 |

TABLE IV: Cross-modal retrieval results.

| | Voice → Face | | Face → Voice | |
|---|---|---|---|---|
| | Random | Gender | Random | Gender |
| Seen-Heard | | | | |
| Proposed SSNet | 36.27 | 37.20 | 50.00 | 51.20 |
| Unseen-Unheard | | | | |
| Proposed SSNet | 8.70 | - | 13.20 | - |
| Configuration | AUC % | EER % |
|---|---|---|
| Seen-Heard | | |
| | 81.2 | 26.3 |
| | 91.1 | 17.2 |
| Unseen-Unheard | | |
| | 72.6 | 33.6 |
| | 78.8 | 29.5 |
| Inputs (N) | | |
|---|---|---|
| Voice → Face (%) | | |
| 2 | 73 | 78 |
| 4 | 49 | 56 |
| 6 | 38 | 42 |
| 8 | 34 | 36 |
| 10 | 29 | 30 |
V-C Cross-modal Retrieval
In this section we evaluate the cross-modal retrieval task, employing both face and voice as the probe with the other modality at the retrieval end. We report results in terms of a recall metric that evaluates the top retrieved results. Table IV presents the quantitative results of our approach on this task. We also perform retrieval conditioned on gender for both directions. Note that in Table IV, ‘Random’ corresponds to retrieval results on the complete test set regardless of gender, nationality or age considerations; as an added experiment, we also report retrieval results conditioned on gender.
V-D Qualitative Evaluation
Fig. 3 shows the t-SNE [26] embedding of the learned features extracted from the test set of the VoxCeleb dataset.
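The sketch below shows how a Fig. 3-style visualization could be produced with scikit-learn; the feature matrix and identity labels are placeholders standing in for the embeddings extracted from the test set.

```python
# Sketch of a Fig. 3-style visualization: project learned face/spectrogram
# embeddings with t-SNE [26] and color the points by identity. Inputs are
# placeholders for the extracted test-set features.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.randn(500, 128)        # stand-in for learned features
identities = np.random.randint(0, 10, 500)    # stand-in identity labels

coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=identities, s=8, cmap="tab10")
plt.title("t-SNE of joint face/voice embeddings")
plt.show()
```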

VI Result Discussion
In this section we discuss the results obtained for all three cross-modal tasks, i.e. verification, matching and retrieval, and the performance of our approach. We compare our approach with [19] and [18] to highlight the benefits of employing the deep latent space learning framework as a training procedure, together with the single stream network coupled with the formulated loss function.
VI-A No Pair Selection
One of the main benefits of learning features in a shared latent space is that there is no overhead of pair or triplet selection. As datasets grow rapidly over time, so does this overhead. The proposed framework instead leverages class information to penalize the distance between the learned embeddings and their class centers.
VI-B Ease of Fine-tuning
A biometric system should be robust to age progression and capable of learning end-to-end from dynamic scenes as well. Fine-tuning pre-existing systems can certainly address age progression. However, a major drawback of existing approaches is the need for new pairwise information as soon as the input data changes: for every identity, new pairs have to be selected to ensure robust representations. This is not the case with SSNet and the proposed loss function, since they rely only on class information; even though faces change over time, their parent class, i.e. the identity, remains the same. Consequently, fine-tuning requires only the new input data, with no pre/post-processing and no pretraining.
VI-C Component Modularity
We argue that our approach is modular, since the single stream network and the loss function are simply components trained under the proposed framework. It is therefore inexpensive to switch between networks depending on the data.
VII Ablation Study
We experiment with the hyperparameter of the loss function in Eq. 2; the two tables above report the corresponding cross-modal verification and matching results for the two configurations considered.
VIII Conclusion
In this work, we presented a novel training procedure coupled with a single stream network capable of jointly embedding visual and audio signals into a shared latent space without any pairwise or triplet supervision. Furthermore, we introduced a novel supervision signal coupled with the single stream network to aid the joint projection of embeddings. We demonstrated results on various cross-modal tasks, verification, matching and retrieval, considering several demographic factors such as age, gender, and nationality. We achieved state-of-the-art results on several tasks and comparable results on others despite the simplicity of our strategy.
References
- [1] (2015) VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433.
- [2] (2017) Look, listen and learn. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 609–617.
- [3] (2018) Cross-modal scene networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (10), pp. 2303–2314.
- [4] (2016) SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pp. 892–900.
- [5] (2004) Thinking the voice: Neural correlates of voice perception. Trends in Cognitive Sciences 8 (3), pp. 129–135.
- [6] (2018) Git loss for deep face recognition. In British Machine Vision Conference.
- [7] (2018) VoxCeleb2: Deep speaker recognition. In INTERSPEECH, pp. 1086–1090.
- [8] (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
- [9] (1996) Enhancement of selective listening by illusory mislocation of speech sounds due to lip-reading. Nature 381 (6577), pp. 66.
- [10] (1989) Neuro-cognitive processing of faces and voices. In Handbook of Research on Face Processing, pp. 207–215.
- [11] (2013) DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pp. 2121–2129.
- [12] (2016) MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102.
- [13] (2003) ‘Putting the face to the voice’: Matching identity across modality. Current Biology 13 (19), pp. 1709–1714.
- [14] (2018) On learning associations of faces and voices. In Asian Conference on Computer Vision, pp. 276–292.
- [15] (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations.
- [16] (2004) Specification of cross-modal source information in isolated kinematic displays of speech. The Journal of the Acoustical Society of America 116 (1), pp. 507–518.
- [17] (1976) Hearing lips and seeing voices. Nature 264 (5588), pp. 746.
- [18] (2018) Learnable PINs: Cross-modal embeddings for person identity. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 71–88.
- [19] (2018) Seeing voices and hearing faces: Cross-modal biometric matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8427–8436.
- [20] (2017) VoxCeleb: A large-scale speaker identification dataset. In INTERSPEECH, pp. 2616–2620.
- [21] (2015) Deep face recognition. In BMVC, Vol. 1, pp. 6.
- [22] (2017) Deep neural network embeddings for text-independent speaker verification. In Proc. Interspeech, pp. 999–1003.
- [23] (2014) DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708.
- [24] (2018) Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 247–263.
- [25] (2017) 3D convolutional neural networks for cross audio-visual matching recognition. IEEE Access 5, pp. 22081–22091.
- [26] (2014) Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research 15 (1), pp. 3221–3245.
- [27] (1998) The moving face during speech communication. In Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory-Visual Speech 2, pp. 123.
- [28] (2016) Order-embeddings of images and language. In International Conference on Learning Representations.
- [29] (2019) Disjoint mapping network for cross-modal matching of voices and faces. In International Conference on Learning Representations.
- [30] (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515.
- [31] (1998) Quantitative association of vocal-tract and facial behavior. Speech Communication 26 (1-2), pp. 23–43.
- [32] (2013) Learning the visual interpretation of sentences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1681–1688.