Human faces carry abundant attributes that implicitly indicate family heredity through visual appearance. This phenomenon has been studied in psychology, with the aim of discovering how humans visually identify kin-related cues from faces [1, 2, 3, 4]. In computer vision, automatic kinship verification from facial images was first addressed in 2010 [5].
In the literature, kinship verification is mainly explored from the perspective of visual appearance, with most techniques based on still images. Just as people with kin relations tend to share common facial attributes, they may also share common voice attributes. In the genetic study domain, to determine how the human voice is passed down through generations, and to study the key factors influencing our voice, researchers from the University of Nottingham carried out a pilot study on the heritability of human voice parameters (https://www.nottingham.ac.uk/news/pressreleases/2016/january/help-the-scientists-find-out-why-you-sound-like-your-parents.aspx). Inspired by this study, we address, for the first time, the use of vocal information for kinship verification. Despite the long history of speech research, assessing kinship relations from voice has received very little attention in the literature. The closest related studies address the potential performance degradation of automatic speaker verification (ASV) when tested with the voices of persons with a close kinship relation, such as identical twins [8, 9]. Furthermore, many related applications, like expression recognition in affective computing, have benefited from techniques that combine face and voice modalities [10, 11, 12, 13]. In this paper, we hypothesize that fusing face and voice modalities captured in video sequences can improve the accuracy and robustness of systems for kinship verification.
Audio-visual kinship verification has many potential applications, ranging from social media analytics, forensics, surveillance and security to kin-related authentication. For instance, social media applications involve an overwhelming amount of data, including face images and videos. Automatic kinship verification could be used for the semi-automatic organization of kin relations among different social relations, such as friends or colleagues. Another application is finding missing children after some years, even when their appearance has changed due to ageing, without resorting to expensive and invasive DNA tests. It can also be employed in surveillance and security control for abnormal behavior detection. By analysing surveillance video footage and verifying kinship relations, crimes such as child kidnapping can be detected, and kinship verification could serve as a decision support tool for forensic investigation. Automatic kinship verification systems can also be used in kin-related authentication. For instance, the United States Department allows people with relatives residing in the U.S. to enter as refugees. Audio-visual kinship verification can implement a real-time kin test at low cost. It can also be used for automatic video organization and annotation.
This work focuses on kinship verification using audio-visual information. We investigate verification systems that fuse facial and vocal modalities to encode discriminative kin information. The main contributions of this work are summarized as follows.
Since no existing kinship database supports the study of multi-modal kinship verification, we collected and analysed a new kinship database called TALking KINship (TALKIN). It consists of both visual (facial) and audio (vocal) information of individuals captured talking in videos. We consider four kin relations: Father-Son (FS), Father-Daughter (FD), Mother-Son (MS) and Mother-Daughter (MD).
We consider sub-problems driven by the TALKIN database: kinship verification from facial images, from voice, and from audio-visual information. We investigate the impact on performance (accuracy and complexity) when going from uni-modal to multi-modal cases. Benchmark results are provided for uni-modal kinship verification, and several fusion methods are then analysed and compared with state-of-the-art uni-modal and multi-modal fusion techniques.
A deep Siamese fusion network with contrastive loss is proposed for audio-visual information fusion, to enhance the reliability of kinship predictions. Experiments show that the proposed fusion methods outperform baseline uni-modal and multi-modal fusion methods.
This paper extends our preliminary investigation on audio-visual kinship verification in several ways. In particular: (1) a comprehensive analysis of related literature from the perspective of kinship verification, automatic speaker verification for close kin relations and multi-modal methods, for a more self-contained presentation; (2) a detailed description of the proposed and baseline methods for kinship verification based on face, voice and multiple modalities; and (3) more proof-of-concept experimental results and interpretations, including a detailed analysis of performance for audio-based vs. video-based kinship verification.
The rest of this paper is organized as follows. Section II provides background and previous research related to existing kinship databases, proposed kinship verification techniques, automatic speaker verification for identical twins and multi-modal fusion applications. Section III introduces the TALking KINship (TALKIN) dataset. In Section IV, the kinship verification problem is presented from the perspective of one modality (face vs. voice) and of multiple modalities (face & voice), together with techniques for both uni-modal and multi-modal kinship verification. Section V describes the experimental methodology employed for performance evaluation. Finally, Section VI presents and discusses the experimental results.
II Related Work
Since 2010, several kinship databases have been published: Cornell KinFace, UB KinFace [15, 16, 17], KinFaceW, UvA-NEMO Smile [18, 19], TSKinFace, KFVW and FIW. All these datasets address the general problem of kinship verification modeling using facial images or videos, but differ both in their exact task settings and in the quality and quantity of data. We provide below a brief review of each dataset.
Cornell KinFace is the first kinship database that aims to verify kin relations using computer vision and machine learning methods. It includes 150 parent-child face image pairs of celebrities. The facial images were collected from the Internet by researchers at Cornell University and therefore represent uncontrolled, in-the-wild data with no control over environments, cameras or poses.
UB KinFace [15, 16, 17] is the only kinship verification database that includes images of parents both when they were young and when they were old. UB KinFace consists of two parts focused on Asian and non-Asian subjects, respectively. Each part has 100 groups of facial images. Each group has one image of the child and one image each of the parent when young and when old. Thus, in total there are 600 ($2 \times 100 \times 3$) facial images. The resolution of the images is $89 \times 96$.
KinFaceW has two subsets, KinFaceW-I and KinFaceW-II. The images of each parent-child pair in KinFaceW-I are collected from different photos, while image pairs in KinFaceW-II come from the same family photograph. The facial images are aligned according to eye position and cropped to $64 \times 64$.
The TSKinFace database addresses the observation that children may partially resemble one parent and partially resemble the other. It consists of 1015 tri-subject groups (Father-Mother-Child) in total.
Families in the Wild (FIW) is the largest and most comprehensive visual kinship database, with over 13,000 family photos of 1,000 families (an average of 13 photos per family). Facial images are resized to $224 \times 224$.
The UvA-NEMO Smile [18, 19] database addresses the problem of video-based kinship verification. It was collected in a constrained environment with limited real-world variation, in which subjects produce both spontaneous and deliberate smiles. To study video-based kinship verification in more complex environments, Yan et al. collected the Kinship Face Videos in the Wild (KFVW) database under unconstrained conditions. It was collected from TV shows on the Internet and contains 418 pairs of facial videos.
While the above databases cover multiple aspects of kinship verification from faces, no publicly available audio-visual kinship verification database exists. To explore the problem of kinship verification from face and voice modalities, we collected and analysed the TALKIN dataset (see Section III).
II-B Kinship verification from faces
Kinship verification from facial images was first addressed by Fang et al. in 2010. Since then, many works have been proposed and several competitions have been organized [23, 24, 25, 26]. We briefly review below related works on visual kinship verification from still images and videos.
II-B1 Image-based verification
Initial works focused on feature-based methods. Fang et al. extracted 22 facial features and selected the 14 most discriminative ones for classification; the distance between two images is calculated and fed into K-nearest neighbor (KNN) and support vector machine (SVM) back-ends to verify the kin/non-kin relation. Yan et al. proposed the prototype-based discriminative feature learning (PDFL) method to learn a feature representation from the Labeled Faces in the Wild (LFW) dataset without kin labels. Wu et al. extracted color texture features to study the importance of color in the kinship verification problem. Besides the works on feature representation, metric learning has also shown good performance. Lu et al. proposed the neighborhood repulsed metric learning (NRML) method, which aims to repulse images without a kin relation and minimize the distance between images with a kin relation. Liu et al., in turn, proposed the status-aware projection metric learning (SPML) method to address the asymmetry between parent and child, who have different statuses because the parent is usually older than the child. Finally, deep learning shows high performance in the field of computer vision, kinship verification being no exception [31, 32]. Zhang et al. proposed an end-to-end convolutional neural network (CNN) architecture for kinship verification that takes a pair of RGB images as input and uses a softmax layer to predict the kinship relation. Compared with other state-of-the-art methods, such as discriminative multimetric learning (DMML), it yielded improvements of 5.2% and 10.1% on KinFaceW-I and KinFaceW-II, respectively. To further demonstrate the discriminative power of CNNs, Lu et al. presented the discriminative deep metric learning (DDML) method to learn a non-linear distance metric. The back-propagation algorithm was used to train the model so that the distance between positive pairs is reduced and the distance between negative pairs is enlarged.
II-B2 Video-based verification
Video-based kinship verification has attracted less attention than still-image-based kinship verification. However, facial expression dynamics can provide useful information for kinship verification. It has been shown that people from the same family display similar facial expressions, such as anger, joy, or sadness. The video-based kinship verification problem was studied by Dibeklioglu et al. The authors localized 17 facial landmarks and used temporal Completed Local Binary Pattern (CLBP) descriptors to describe the expressions. Combined with spatial facial features, the temporal CLBP features are fed into an SVM to classify kin or non-kin relations. Then, Boutellaa et al. proposed to use both shallow spatio-temporal features and deep features to characterize a dynamic face, which brought a further improvement. Unconstrained video-based kinship verification was recently proposed by Yan et al. They collected a new kinship database with videos captured in the wild and evaluated several state-of-the-art metric learning algorithms on the video-based kinship verification problem. Yet, previous work has mainly been carried out in the computer vision domain. No study has focused on kinship verification combining face and voice cues.
II-C Speaker verification for identical twins
As far as we know, there are neither dedicated databases nor kinship verification studies using voice. Some related work exists within the reliability assessment of speaker recognition. Voices of two different persons with a close kinship relation might be confusable. One special case, the voices of identical twins, was addressed almost five decades ago, when it was found to confuse listeners in same/different speaker discrimination.
More recent studies, involving mostly automatic systems, have demonstrated that the voices of identical twins can also be confusable for automatic systems. One study examined automatic speaker verification (ASV) performance using the voices of identical twins collected at a twin research institute in the UK. In total, 49 identical twin pairs (40 female and 9 male pairs) were involved. A Gaussian mixture model - universal background model (GMM-UBM) with 2048 Gaussians was trained. The authors reported a 0.4% equal error rate (EER) when tested with all speakers, which degraded to 5.2% EER when tested with twin voices. With short utterances, the EER increased from 2.8% (all) to 10.5% (twins).
Another study evaluated a commercial forensic automatic speaker recognition system with identical twin data. The author compared likelihood ratio distributions graphically and reported EERs from various experiments. Under the matched-text condition, the author reported 0% and 0.5% EERs for males and females, respectively, when unrelated speakers were used as non-targets; these errors increased to 11% (male) and 19.2% (female), respectively, when twins were used as non-targets. The 19.2% figure rose to 48% with mismatched texts. In summary, the tested automatic system experienced performance degradation for both genders, and a much larger one for females.
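The equal error rate quoted in these studies is the operating point at which the false acceptance rate and the false rejection rate coincide. A minimal sketch of how it can be estimated from trial scores is given below, assuming that higher scores indicate stronger support for the target hypothesis. The function name and the toy scores are illustrative only and are not taken from the cited studies.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    Returns (eer, threshold). A trial is accepted when score >= threshold.
    """
    best = None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: target trials scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above the threshold.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Toy scores (hypothetical): one target trial scores lower than one non-target trial.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer, thr = equal_error_rate(targets, nontargets)
print(f"EER = {eer:.2%} at threshold {thr}")
```

With finite trial lists the two error rates rarely match exactly, so the sketch reports the midpoint at the threshold where they are closest.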
Besides observing the performance change of automatic systems, a number of studies focus on the acoustic differences between twins. For instance, one study examined the formant dynamics of 8 Shanghainese-Mandarin bilingual identical twin pairs, focusing on the common diphthong /ua/ found in both languages. The authors discovered that, although very similar, identical twins did have significant differences in their formant dynamics. The authors constructed a simple linear discriminant analysis (LDA) classifier from the first three formants (F1 to F3) and reported speaker classification rates between 80% and 90%.
Despite the use of small datasets, the above review does suggest that the voices of identical twins are potentially confusable for some listeners and ASV systems. While detrimental for ASV, this is good news from the perspective of kinship verification: it looks possible to devise a system or a method that is sensitive to kinship cues in the human voice, to be used for detecting how closely two speakers are related. While identical twins are a rare special case in the general population, an interesting open question is how accurately kinship relations could be determined from voice for the more common kinship relationships addressed in visual kinship studies. One of the main aims of introducing our TALKIN database is to help answer this question.
Table I: Comparison of TALKIN with existing kinship databases.
| Database | Modalities | Size | Resolution | Family structure | Controlled environment |
|---|---|---|---|---|---|
| Cornell KinFace | Image | 150 pairs | | No | No |
| UB KinFace | Image | 200 groups | | No | No |
| KinFaceW (KinFaceW-I) | Image | 533 pairs | | No | No |
| TSKinFace | Image | 1015 tri-subjects | | No | No |
| UvA-NEMO Smile | Video | 1240 videos | | No | Yes |
| FIW | Image | 1000 family trees | | Yes | No |
| KFVW | Video | 418 pairs of videos | | No | No |
| TALKIN (ours) | Video & Audio | 400 pairs of videos | $1920 \times 1080$ | No | No |
II-D Multi-modal methods
Multi-modal fusion methods have successfully improved the recognition accuracy in many applications found in affective computing, person recognition, large-scale video classification and gesture recognition, because they can exploit complementary sources of information. Different sources of information are typically integrated through early fusion (feature level) or through late fusion (score or decision levels). Feature-level fusion using concatenation or aggregation (e.g., canonical correlation analysis, CCA) is often considered to provide a high level of accuracy, although feature patterns may be incompatible and the fusion may increase system complexity. Techniques for score-level fusion using deterministic (e.g., average fusion) or learned functions are commonly employed, but are sensitive to the impact of score normalization methods on the overall decision boundaries and to the availability of representative training samples. Despite reducing the information content about modalities, techniques for decision-level fusion (e.g., majority voting) can provide a simple framework for combination, although limitations are placed on decision boundaries due to the restricted operations that can be performed on binary decisions.
In the deep learning literature, Neverova et al. proposed a multi-scale and multi-modal early fusion method, multimodal dropout (ModDrop), for gesture recognition problems. First, the weights of each modality are pre-trained. Then, a gradual fusion method randomly drops separate channels to learn cross-modal correlations while preserving uni-modality-specific representations. Liu et al., in turn, introduced the multi-modal factorized bilinear pooling (MFB) method to combine visual and audio representations for video-based classification. In affective computing applications, Tzirakis et al. proposed an end-to-end multimodal deep neural network for emotion recognition. Visual and speech modalities are first trained separately to speed up the fusion training phase. Then, the fusion network is trained in an end-to-end fashion. Concerning late fusion, one study considered multiple score fusion techniques for person recognition in indoor surveillance. Experimental results showed the advantage of multimodal methods over unimodal approaches.
To sum up, prior results in the literature suggest that improvements in accuracy and robustness can be obtained by using multi-modal methods over uni-modal techniques. To improve the accuracy and robustness of kinship verification, we therefore investigate algorithms for the fusion of face and voice modalities. To the best of our knowledge, our work is the first attempt to study kinship verification from both visual and audio information.
III The TALKIN Database
In this section, we describe our new kinship database called TALking KINship (TALKIN). Compared with existing kinship databases with facial videos (UvA-NEMO Smile and KFVW), TALKIN contains both video and audio captured in unconstrained environments. It contains several videos of subjects talking in the wild (with unconstrained background, illumination and recording conditions). The purpose of collecting our new database is to investigate the problem of audio-visual kinship verification in the wild. A comparison of TALKIN with existing kinship databases is shown in Table I.
III-A Data collection pipeline
The overall collection pipeline of the TALKIN dataset is shown in Fig. 1.
Step 1. List of celebrities or family TV shows. First, we prepared a list of names for which we intended to obtain videos. The target amount of video for each relation is 100 pairs of clips. Most of the list consists of celebrities, such as musicians, actors and politicians, with the rest drawn from TV series involving family interaction (non-celebrities).
Step 2. Downloading videos from YouTube. We downloaded the videos from YouTube by searching for the names of the celebrities or TV series. To avoid biases encountered in some of the previous kinship databases [42, 43], we collected the parent's videos and the child's videos from different video clips corresponding to different backgrounds and recording conditions.
Step 3. Data preparation. After obtaining the raw data from the web, we pre-processed it. For face detection and alignment, we employed the Multi-task Cascaded Convolutional Networks (MTCNN) algorithm to detect 5 facial landmarks in every frame of the video. The videos are then cropped and aligned according to the landmarks, and the facial regions are resized to $224 \times 224$. Both hand-crafted features and deep features are extracted to represent each individual. To represent the audio information, we directly extracted audio from the video clips. The sample rates are all set to 44.1 kHz. Three standard techniques from the speech field, namely GMM-UBM, i-vectors and deep neural networks, are used for text-independent kinship analysis.
III-B Parameters of the dataset
The TALKIN dataset focuses on four kin relations: Father-Son (FS), Father-Daughter (FD), Mother-Son (MS) and Mother-Daughter (MD), with 100 pairs of videos (with audio) for each relation. As all the data originates from uncontrolled Internet sources, the speech content varies from subject to subject and from video to video, making the voice-related sub-task text-independent kinship verification, analogous to text-independent speaker verification. That is, the task is to verify kinship relations regardless of what was said by the individuals.
TALKIN incorporates a wide range of backgrounds, recording environments, poses, occlusions and ethnicities. Table II shows the distribution of ethnicity in TALKIN. The distribution is counted by kin pair rather than by individual, since one parent might appear multiple times with more than one child. Note, however, that we exclude mixed-race trials; that is, the parent and child in each trial have the same ethnicity. The dataset has two parts: video and audio. The length of the videos varies from 4.032 seconds to 15 seconds, with a resolution of $1920 \times 1080$. Audio is extracted from the video files. Besides the varied text content, the audio files contain substantial channel variations (e.g. due to differing recording devices). Some of them also contain reverberation and additive noise.
In this section, we focus on the kinship verification problem using both uni-modal (face vs. voice) and multi-modal methods. Fig. 2 shows examples of the signals employed for kinship verification from face and voice modalities. Fig. 4 illustrates the architectures proposed for uni-modal systems. We address kinship detection as a hypothesis testing problem: given a pair of signals (a pair of video sequences or speech utterances), say $(X_1, X_2)$, the task is to evaluate support for two mutually exclusive hypotheses, the null hypothesis $H_0$ and the alternative hypothesis $H_1$:

$$H_0: X_1 \text{ and } X_2 \text{ originate from persons with a kin relation}, \qquad H_1: X_1 \text{ and } X_2 \text{ originate from persons with no kin relation}.$$

In practice, we represent $X_1$ and $X_2$ using frame-level feature vectors that are then used to derive recording-level representations of fixed size (regardless of the number of frames). A kinship score, a numerical indicator with higher values associated with stronger support in favor of $H_0$, is then obtained by computing a similarity score between the feature representations. We consider both hand-crafted and data-driven (learned) feature representations and similarity scoring techniques. The following three subsections present methods for face-based feature representations, voice-based feature representations and data fusion, respectively.
IV-A Face-based kinship verification
We employed both hand-crafted and deep feature representations for facial kinship verification. In particular, we adopted image-based representations produced with binarized statistical image feature (BSIF), local phase quantization (LPQ), and local binary pattern (LBP) [47, 48] descriptors. These features are extracted from each video frame and averaged over the frames to represent a video sequence. In addition, we considered local binary patterns from three orthogonal planes (LBP-TOP), which are more naturally suited to representing faces over multiple frames in a video. Besides these conventional hand-crafted features, we further included a state-of-the-art deep Siamese architecture. The Siamese network performs pair-wise matching, with feature representations learned through metric learning. To assess the kinship similarity score between two video sequences, we computed the cosine similarity between two feature representation vectors, $\mathbf{x}_1$ and $\mathbf{x}_2$:

$$S(\mathbf{x}_1, \mathbf{x}_2) = \frac{\mathbf{x}_1^{\top} \mathbf{x}_2}{\|\mathbf{x}_1\|_2 \, \|\mathbf{x}_2\|_2}.$$

A threshold is applied to $S(\mathbf{x}_1, \mathbf{x}_2)$ to determine whether the two inputs have a kin relation.
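The scoring step described above (frame-level features averaged into one vector per video, followed by cosine similarity and a threshold) can be sketched as follows. This is a minimal illustration in plain Python; the function names, the toy feature values and the threshold of 0.5 are assumptions made for the example and are not taken from the paper.

```python
import math


def average_frames(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    n = len(frame_features)
    return [sum(column) / n for column in zip(*frame_features)]


def cosine_similarity(x1, x2):
    """Cosine of the angle between two equally sized feature vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention for degenerate all-zero vectors
    return dot / (norm1 * norm2)


def is_kin(x1, x2, threshold=0.5):
    """Accept the kin hypothesis when the score reaches the (hypothetical) threshold."""
    return cosine_similarity(x1, x2) >= threshold


# Toy frame-level features for two videos (hypothetical values).
parent_video = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
child_video = [[1.0, 0.0, 1.0]]
score = cosine_similarity(average_frames(parent_video), average_frames(child_video))
print(score, is_kin(average_frames(parent_video), average_frames(child_video)))
```

In practice the threshold would be tuned on held-out trials rather than fixed in advance.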
IV-A1 Image-based representation
Prior work has demonstrated the effectiveness of the HSV color space for the kinship verification problem. We therefore first converted the facial images into the HSV color space. We considered several descriptors for extracting features from the facial images. BSIF is a binary texture descriptor that learns its filters from a small set of natural images used as a training set. LPQ is a blur-invariant image texture descriptor. LBP, which has proven effective in face analysis, computes a binary code for each pixel in an image; the binary patterns are then accumulated into a histogram that represents the image texture.
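To make the LBP description concrete, the sketch below computes the basic 3x3 LBP code of every interior pixel of a single-channel image and accumulates the codes into a normalized 256-bin histogram. It is an illustrative simplification: the neighbor ordering, the use of `>=` for thresholding and the absence of uniform-pattern mapping or multi-channel (HSV) handling are assumptions of this example, not details given in the paper.

```python
def lbp_histogram(image):
    """Normalized 256-bin histogram of basic 3x3 LBP codes.

    `image` is a 2-D list of gray levels (one channel). Border pixels are
    skipped because they lack a full 8-neighborhood.
    """
    # Clockwise neighbor offsets starting at the top-left pixel (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Each neighbor contributes one bit: 1 if it is >= the center.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    if total == 0:
        return [0.0] * 256
    return [count / total for count in hist]


flat_patch = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]   # all neighbors equal the center
peak_patch = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]   # center brighter than all neighbors
print(lbp_histogram(flat_patch)[255], lbp_histogram(peak_patch)[0])
```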
IV-A2 Video-based representation
LBP-TOP is an extension of LBP in which the local binary patterns are extracted from three orthogonal planes of a frame sequence: XY, XT and YT, where X and Y denote the spatial coordinates and T denotes the time coordinate. A video or image sequence can be viewed as a stack of XY planes along axis T, of XT planes along axis Y, and of YT planes along axis X. LBP-TOP extracts features from each plane orientation separately and concatenates them into one feature vector.
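The plane decomposition can be sketched as follows for a small video volume indexed as `volume[t][y][x]`. The sketch slices the volume into its XY, XT and YT planes, accumulates a basic 3x3 LBP histogram per orientation and concatenates the three histograms. It is a simplified illustration under stated assumptions: border handling, neighborhood radii and block partitioning in the original LBP-TOP descriptor may differ, and all names are hypothetical.

```python
def lbp_codes(plane):
    """Basic 3x3 LBP codes of the interior pixels of a 2-D list."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = []
    for r in range(1, len(plane) - 1):
        for c in range(1, len(plane[0]) - 1):
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                if plane[r + dr][c + dc] >= plane[r][c]:
                    code |= 1 << bit
            codes.append(code)
    return codes


def orientation_histogram(planes):
    """Normalized 256-bin histogram accumulated over all planes of one orientation."""
    hist = [0] * 256
    for plane in planes:
        for code in lbp_codes(plane):
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * 256


def lbp_top(volume):
    """Concatenate XY, XT and YT histograms of a volume indexed as [t][y][x]."""
    n_t, n_y, n_x = len(volume), len(volume[0]), len(volume[0][0])
    xy_planes = [volume[t] for t in range(n_t)]                       # one per frame
    xt_planes = [[[volume[t][y][x] for x in range(n_x)] for t in range(n_t)]
                 for y in range(n_y)]                                 # one per row y
    yt_planes = [[[volume[t][y][x] for y in range(n_y)] for t in range(n_t)]
                 for x in range(n_x)]                                 # one per column x
    return (orientation_histogram(xy_planes)
            + orientation_histogram(xt_planes)
            + orientation_histogram(yt_planes))


# A constant 3x3x3 toy volume: every LBP code is 255 in all three orientations.
toy_volume = [[[7] * 3 for _ in range(3)] for _ in range(3)]
feature = lbp_top(toy_volume)
print(len(feature))
```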
IV-A3 Face network
While the above hand-crafted face descriptors have the benefit of being simple and interpretable, they are not specifically optimized for kinship cue representation. As in other visual pattern classification tasks, we expect substantially better results from data-driven approaches that are directly optimized for the given task. To this end, we implemented the VGG-Face CNN cascaded with a Long Short-Term Memory (LSTM) network for the facial representations. The VGG-Face network is trained on a large face dataset with 2.6 million images of 2,622 people (http://www.robots.ox.ac.uk/~vgg/software/vgg_face/). This network has shown strong performance on face verification using both images and videos. Furthermore, it has also proven effective for kinship verification with constrained facial videos. As shown at the top of Fig. 3, it consists of 13 convolution layers, each followed by a rectified linear unit (ReLU). Some of them are also followed by a max pooling operator. The last two layers are fully connected (FC) layers that have 4096 outputs. We fed the facial frames one by one and collected the deep features from layer fc7. An LSTM layer with 4096 cells is stacked on top of the VGG-Face descriptor and trained to integrate the spatial information into spatio-temporal features. The network is trained in 'Siamese' fashion using contrastive loss. Here, the contrastive loss is defined as:

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y_n d_n^2 + (1 - y_n) \max(\tau - d_n, 0)^2 \right],$$

where the threshold $\tau$ denotes the margin, $N$ is the mini-batch size, $d_n = \|\mathbf{x}_{n,1} - \mathbf{x}_{n,2}\|_2$, $\mathbf{x}_{n,1}$ and $\mathbf{x}_{n,2}$ denote the two sample feature vectors collected from the last state of the LSTM, and $y_n$ is the label of the sample pair: $y_n$ equals 1 when the inputs have a kin relation and 0 otherwise.
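As a numerical illustration, the sketch below evaluates a margin-based contrastive loss on a toy mini-batch. It assumes the commonly used form in which kin pairs (label 1) are penalized by their squared Euclidean distance and non-kin pairs (label 0) by the squared shortfall from the margin, averaged over the batch with a factor of one half. The function name, the margin value and the toy vectors are hypothetical.

```python
import math


def contrastive_loss(pairs, labels, margin=1.0):
    """Margin-based contrastive loss over a mini-batch.

    pairs  : list of (x1, x2) feature-vector tuples
    labels : 1 for kin pairs, 0 for non-kin pairs
    """
    total = 0.0
    for (x1, x2), y in zip(pairs, labels):
        # Euclidean distance between the two embeddings of the pair.
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
        # Kin pairs are pulled together; non-kin pairs are pushed beyond the margin.
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))


batch = [([0.0, 0.0], [3.0, 4.0]),   # kin pair at distance 5
         ([0.0, 0.0], [0.6, 0.8])]   # non-kin pair at distance 1, inside the margin
loss = contrastive_loss(batch, [1, 0], margin=2.0)
print(loss)
```

A non-kin pair that is already farther apart than the margin contributes nothing, which is what lets training concentrate on hard negative pairs.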
IV-B Voice-based kinship verification
As for the voice modality, we adopted three methods from the related task of automatic speaker verification (ASV). Two of them, Gaussian mixture model - universal background model (GMM-UBM) and identity vector (i-vector), are standard statistical classifiers, while the last one uses deep learning.
We first trained a UBM from a training set of speakers disjoint from those used for kinship scoring. The UBM, denoted here by $\lambda_{\mathrm{UBM}}$, models the speaker-independent distribution of the Mel-frequency cepstral coefficient (MFCC) features. It serves both as a prior model to obtain speaker-dependent models via maximum a posteriori (MAP) adaptation, and as the likelihood model for the alternative hypothesis. If we denote the MFCC sequence of a test utterance by $X$ and the speaker model of the $k$th speaker by $\lambda_k$, the detection score is given by the log-likelihood ratio (LLR) $\log p(X \mid \lambda_k) - \log p(X \mid \lambda_{\mathrm{UBM}})$. When the speaker identities of $X$ and $\lambda_k$ are the same, the LLR score is high (relative to the situation when their identities differ).
For kinship modeling, speakers with a positive kin relation share the same source identity (kin label). When we evaluate a particular kin hypothesis, we compare the MFCCs of the test speaker against the speaker model of another speaker. A positive kinship trial occurs when the test speaker (source of $X$) and the reference speaker (source of $\lambda_k$) have a positive kin relation (e.g. mother-son). Pairs with no kin relation constitute negative trials. From this perspective, GMM-UBM is used in exactly the same way as in ASV, though the trial labels are defined differently. Note that in kinship verification, all the compared speaker pairs have disjoint identities.
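The log-likelihood ratio scoring can be sketched with small diagonal-covariance GMMs as below. The sketch assumes that the reference speaker's model has already been obtained (for example by MAP adaptation of the UBM means, which is not shown) and averages the log-likelihood over frames, a common normalization that the text does not specify. The model parameters, the one-dimensional "MFCC" frames and all names are toy assumptions.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_avg_loglik(frames, gmm):
    """Average per-frame log-likelihood; gmm is a list of (weight, mean, var)."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        top = max(comps)
        # Log-sum-exp over mixture components for numerical stability.
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total / len(frames)


def llr_score(frames, reference_gmm, ubm):
    """LLR between the reference speaker's model and the UBM."""
    return gmm_avg_loglik(frames, reference_gmm) - gmm_avg_loglik(frames, ubm)


# Toy 1-D models: the reference model differs from the UBM in one adapted mean.
ubm = [(0.5, [0.0], [1.0]), (0.5, [4.0], [1.0])]
reference = [(0.5, [1.0], [1.0]), (0.5, [4.0], [1.0])]

close_frames = [[1.0], [1.2]]   # resemble the reference speaker
far_frames = [[-1.0]]           # better explained by the UBM
print(llr_score(close_frames, reference, ubm), llr_score(far_frames, reference, ubm))
```

For a kinship trial, the frames would come from one family member and the reference model from the other; only the trial label changes relative to speaker verification.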
IV-B2 I-vector based method
I-vector [54, 55] is a compact representation of a speech recording. It is extensively used in speaker and language recognition to represent speech utterances of different lengths as fixed-dimensional embeddings. Akin to GMM-UBM, the i-vector paradigm builds upon GMM modeling of short-term spectral observations. Unlike GMM-UBM, however, the i-vector model exploits statistical redundancy across different recordings by imposing subspace constraints on the mean vectors of a GMM. Specifically, the model assumes that the mean vector of the $c$th Gaussian in recording $r$, denoted by $\boldsymbol{\mu}_{c,r}$, can be expressed as

$$\boldsymbol{\mu}_{c,r} = \boldsymbol{\mu}_c + \mathbf{T}_c \mathbf{w}_r,$$

where $\boldsymbol{\mu}_c$ is the recording-independent mean (from the UBM), $\mathbf{T}_c$ is the recording-independent factor loading matrix and $\mathbf{w}_r$ is a latent random variable with a standard normal prior, $\mathbf{w}_r \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Here, $\{\boldsymbol{\mu}_c, \mathbf{T}_c\}_{c=1}^{C}$, where $C$ is the number of Gaussians, are the model hyper-parameters trained from offline data (i.e. speakers disjoint from those used in kinship training/testing). The i-vector itself, denoted by $\boldsymbol{\phi}_r$, is the posterior mean of $\mathbf{w}_r$ conditioned on recording-specific Baum-Welch sufficient statistics collected using the UBM. We point the interested reader to the references above for further details.
A key point is that an i-vector serves as a recording-level feature vector that compactly represents stationary recording-level cues embedded in the GMM means. Importantly, the i-vector extractor is trained in an unsupervised way: the training of $\boldsymbol{\mu}_c$ (the UBM means) and $\mathbf{T}_c$ (the factor loading matrices) is done via a dedicated expectation-maximization (EM) approach that requires no training labels. This makes the i-vector itself agnostic to the classification task at hand. To be useful for a given task (here, kinship verification), one further trains a back-end classifier with labeled i-vectors (here, with known family identity). The support for the positive kinship hypothesis for a pair of i-vectors (e.g. hypothesized mother-son) can then be evaluated with the back-end classifier. After a number of tentative experiments, we settled on linear discriminant analysis (LDA) trained with family labels, followed by cosine scoring.
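For readers who want the computation behind the posterior mean, the following is a sketch of the standard closed form from the i-vector literature; it is not a formula stated in this paper. All notation is assumed here for illustration: $\gamma_t(c)$ is the UBM posterior of Gaussian $c$ for frame $\mathbf{x}_t$, $N_{c,r}$ and $\tilde{\mathbf{f}}_{c,r}$ are the zeroth-order and centered first-order Baum-Welch statistics of recording $r$, $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$ are the UBM mean and covariance of Gaussian $c$, $\mathbf{T}_c$ is the factor loading matrix, $\boldsymbol{\phi}_r$ is the resulting i-vector, and $\mathbf{A}$ is the LDA projection used before cosine scoring.

```latex
\begin{align}
N_{c,r} &= \sum_{t} \gamma_t(c), \qquad
\tilde{\mathbf{f}}_{c,r} = \sum_{t} \gamma_t(c)\,(\mathbf{x}_t - \boldsymbol{\mu}_c), \\
\boldsymbol{\phi}_r &= \Big(\mathbf{I} + \sum_{c=1}^{C} N_{c,r}\,
  \mathbf{T}_c^{\top} \boldsymbol{\Sigma}_c^{-1} \mathbf{T}_c\Big)^{-1}
  \sum_{c=1}^{C} \mathbf{T}_c^{\top} \boldsymbol{\Sigma}_c^{-1}\, \tilde{\mathbf{f}}_{c,r}, \\
s(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) &=
  \frac{(\mathbf{A}^{\top}\boldsymbol{\phi}_1)^{\top} (\mathbf{A}^{\top}\boldsymbol{\phi}_2)}
       {\|\mathbf{A}^{\top}\boldsymbol{\phi}_1\|_2 \, \|\mathbf{A}^{\top}\boldsymbol{\phi}_2\|_2}.
\end{align}
```

The first line collects the sufficient statistics, the second gives the i-vector as a regularized least-squares style solution, and the third is the cosine score computed after LDA projection.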
Table III: Summary of the uni-modal kinship verification methods.
| Modality | Technique | Operation | External data usage | Procedure: training / layers for fine-tuning | Procedure: kinship classifier |
|---|---|---|---|---|---|
| Audio | I-vector | Data-driven | No (within TALKIN)* | Train UBM and T matrix from scratch | LDA + cosine score |
| Audio | GMM-UBM | Data-driven | No (within TALKIN)* | Train UBM from scratch | |
| | | | | Last two layers | Cosine score |
* Disjoint speakers from those used in kinship training and scoring.
Iv-B3 Voice network
Both the GMM-UBM and the i-vector methods are built upon fixed acoustic feature extractor (MFCC extractor), followed up by data-driven recording-level representation learning and back-end scoring. Even if both techniques have been successful in a number of speech-related tasks, one may question the usefulness of a fixed acoustic front-end. To this end, we wanted to further replace the i-vector embedding with a deep neural network model that uses convolutive models to extract features from the spectrogram, instead. In specific, we rely on a pre-trained ResNet-50 model trained from a very large speaker verification dataset called VoxCeleb2 . We then fine-tune with TALKIN data to get feature embedding from it for audio based kinship verification, where we fix the convolutional layers and finetune the last fully connected layer.
The audio samples are first converted to a single channel and downsampled to 16 kHz to be consistent with VoxCeleb2. The audio is then segmented into 3-second chunks, and a Hamming window of 25 ms duration with a 10 ms step is applied. Spectrograms of size 512 × 300 (frequency bins × frames) are extracted. After mean and variance normalization of each frequency bin of the spectrum, the normalized spectrograms are fed into the ResNet-50. Similar to the face network, to pull positive pairs (with kin relations) together and push negative pairs (without kin relations) apart, the voice network is set up as a Siamese network with a contrastive loss at the end.
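The spectrogram front-end described above can be sketched in NumPy as follows. The 16 kHz rate, 25 ms Hamming window, 10 ms step, 512 × 300 size and per-bin normalization come from the text; the FFT size of 1024 (keeping the first 512 bins) and the zero-padding used to reach exactly 300 frames are assumptions made for this sketch.

```python
import numpy as np

SAMPLE_RATE = 16000
WIN = SAMPLE_RATE * 25 // 1000   # 25 ms window -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000   # 10 ms step   -> 160 samples
N_FFT = 1024                     # assumption: 1024-point FFT
N_BINS = 512                     # keep the first 512 frequency bins
N_FRAMES = 300                   # frames per 3-second chunk

def spectrogram_chunk(signal):
    """Return a normalized (512, 300) magnitude spectrogram of one chunk."""
    # Assumption: zero-pad (or truncate) so that exactly 300 frames fit.
    needed = WIN + (N_FRAMES - 1) * HOP
    x = np.zeros(needed)
    n = min(len(signal), needed)
    x[:n] = signal[:n]
    # Slice the signal into overlapping frames and apply the Hamming window.
    index = np.arange(WIN)[None, :] + HOP * np.arange(N_FRAMES)[:, None]
    frames = x[index] * np.hamming(WIN)
    magnitude = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))  # (300, 513)
    spec = magnitude[:, :N_BINS].T                            # (512, 300)
    # Mean and variance normalization of every frequency bin over time.
    mean = spec.mean(axis=1, keepdims=True)
    std = spec.std(axis=1, keepdims=True)
    return (spec - mean) / (std + 1e-8)

# Toy input (hypothetical): 3 seconds of white noise at 16 kHz.
rng = np.random.default_rng(0)
spec = spectrogram_chunk(rng.standard_normal(3 * SAMPLE_RATE))
```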
The overall uni-modal kinship verification methods are summarized in Table III.
IV-C Multi-modal kinship verification
Up to this point, we have considered the visual and voice modalities in isolation from each other. In this subsection we study effective ways to combine them; the audio-visual kinship verification problem is illustrated in Fig. 4. We introduce a novel deep Siamese network for the fusion of the two modalities, and also use traditional early and late fusion strategies.
IV-C1 Baseline fusion methods
Two baseline methods for multi-modal kinship verification are applied: early (feature-level) fusion and late (score-level) fusion. For early fusion, after extracting features from the face and voice networks, PCA is used to bring the video and audio features to the same dimensionality. Z-score normalization is then applied to the video and audio features separately, and the two are concatenated into a single fused feature vector. The cosine similarity between the fused features of a pair is computed to decide whether the pair has a kin relation.
For late fusion, video-based and audio-based kinship verification are evaluated separately, each yielding its own match score. The average of the two scores is then taken as the fused score.
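The two baselines can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the helper names, the SVD-based PCA, and the random toy features are hypothetical, and for brevity PCA and z-score statistics are fitted on the same toy data, whereas in an actual experiment they would be estimated on the training folds only.

```python
import numpy as np

def fit_pca(X, k):
    """Return the mean and the top-k principal directions of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k].T

def early_fusion_features(face_feats, voice_feats, k):
    """PCA to a common size, z-score per modality, then concatenate."""
    parts = []
    for X in (face_feats, voice_feats):
        mean, components = fit_pca(X, k)
        reduced = (X - mean) @ components
        z = (reduced - reduced.mean(axis=0)) / (reduced.std(axis=0) + 1e-8)
        parts.append(z)
    return np.hstack(parts)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def late_fusion_score(face_score, voice_score):
    """Late fusion: plain average of the two uni-modal match scores."""
    return 0.5 * (face_score + voice_score)

# Toy features (hypothetical): 20 videos, 64-d face and 32-d voice vectors.
rng = np.random.default_rng(0)
face = rng.normal(size=(20, 64))
voice = rng.normal(size=(20, 32))

fused = early_fusion_features(face, voice, k=8)      # shape (20, 16)
early_score = cosine(fused[0], fused[1])             # pair (video 0, video 1)
late_score = late_fusion_score(cosine(face[0], face[1]),
                               cosine(voice[0], voice[1]))
```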
IV-C2 A Siamese network for A-V fusion
The overall architecture of the deep Siamese network is shown in Fig. 5. It is trained to evaluate pair-wise similarities based on the face and voice modalities. In our implementation, for the face modality we fine-tune the VGG-Face CNN cascaded with an LSTM network, as described in detail in the face network subsection. For the voice modality extracted from the videos (Section IV-B3), we fine-tune on TALKIN a ResNet-50 pre-trained on VoxCeleb2. For both the face and the voice network, we use a contrastive loss to learn intra-class similarity and inter-class dissimilarity among subjects.
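For reference, a commonly used form of the contrastive loss is sketched below. The text does not state the exact formulation or margin, so the squared-distance form with a 1/2 factor and the margin value are assumptions for illustration.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, is_kin, margin=1.0):
    """Mean contrastive loss over a batch of embedding pairs.

    is_kin is 1 for positive (kin) pairs and 0 for negative pairs.
    Positive pairs are penalized by their squared distance; negative
    pairs are penalized only when closer than the (assumed) margin.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    is_kin = np.asarray(is_kin, dtype=float)
    distance = np.linalg.norm(emb_a - emb_b, axis=1)
    positive_term = is_kin * distance ** 2
    negative_term = (1.0 - is_kin) * np.maximum(0.0, margin - distance) ** 2
    return float(np.mean(0.5 * (positive_term + negative_term)))

# Toy batch: one kin pair at distance 5, one non-kin pair at distance 1.
a = [[0.0, 0.0], [0.0, 0.0]]
b = [[3.0, 4.0], [0.6, 0.8]]
loss = contrastive_loss(a, b, is_kin=[1, 0], margin=2.0)
```

In this toy batch the kin pair contributes 0.5 · 25 and the non-kin pair 0.5 · (2 − 1)², so the mean is 6.5.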
After training the face and voice networks, we collect their features: 4096 features from the face network and 512 features from the voice network. To balance the dimensionality of the facial and vocal representations, we apply PCA to reduce both to 130 dimensions. They are then concatenated into a 260-dimensional feature, followed by an FC layer with 260 nodes. The system is trained on TALKIN, with no family overlap between training and testing, using backpropagation and a contrastive loss to learn the correlation between parent and child based on the audio-visual modalities. Adding a contrastive loss to the fusion part lets the network learn the fusion rule for kinship verification automatically, narrowing the distance between pairs with a kin relation and enlarging the distance between negative pairs. After training, the feature extracted from the added FC layer is taken as the fused feature of one facial video and its audio signal. The cosine similarity $\mathrm{sim}(\mathbf{x}_p, \mathbf{x}_c) = \frac{\mathbf{x}_p^\top \mathbf{x}_c}{\|\mathbf{x}_p\| \, \|\mathbf{x}_c\|}$ is then computed to measure the closeness of two inputs (e.g. a parent and a child represented by the feature vectors $\mathbf{x}_p$ and $\mathbf{x}_c$). A threshold is applied to sim to determine whether a kin relation exists.
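The inference path of the fusion stage (two 130-dimensional inputs, concatenation to 260 dimensions, an FC layer with 260 nodes, cosine similarity and a threshold) can be sketched as follows. The FC weights here are random placeholders standing in for the learned parameters, no activation is applied because the text does not specify one, and the threshold value is hypothetical.

```python
import numpy as np

FUSED_DIM = 260
rng = np.random.default_rng(0)
# Placeholder parameters: in the actual system these are learned with
# backpropagation and contrastive loss.
fc_weights = rng.normal(scale=0.05, size=(FUSED_DIM, FUSED_DIM))
fc_bias = np.zeros(FUSED_DIM)

def fuse(face_feat, voice_feat):
    """Concatenate 130-d face and voice features and apply the FC layer."""
    joint = np.concatenate([face_feat, voice_feat])   # 260-d
    return fc_weights @ joint + fc_bias

def cosine_similarity(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def verify(parent, child, threshold=0.5):
    """Return (similarity, decision); the threshold is a hypothetical value."""
    sim = cosine_similarity(fuse(*parent), fuse(*child))
    return sim, sim >= threshold

# Toy inputs (hypothetical): PCA-reduced face and voice features per subject.
parent = (rng.normal(size=130), rng.normal(size=130))
child = (rng.normal(size=130), rng.normal(size=130))
sim, decision = verify(parent, child)
```

In practice the threshold would be chosen on held-out data, for example at the equal error rate operating point.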
V Experimental setup
The TALKIN dataset is used to evaluate the performance of uni-modal and multi-modal kinship verification. For each kin relation (FS, FD, MS and MD) there are 100 pairs of videos with a positive kin relation. Likewise, we randomly generate 100 pairs of videos without any kin relation as negative pairs. Thus, each sub-task has 100 positive and 100 negative pairs. We use a 5-fold cross-validation setup: for a given test fold of 40 pairs, we train a kinship detector on the remaining 160 pairs. There is no family overlap between the 5 folds.
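One way to build such family-disjoint folds is sketched below. It assumes, hypothetically, one positive pair per family for a given relation, and it forms negative pairs by matching a parent with a child from another family in the same fold; the exact sampling procedure used for the negative pairs is not specified in the text.

```python
import random

def make_folds(family_ids, n_folds=5, seed=0):
    """Split families into folds of (parent_family, child_family, label) pairs."""
    rng = random.Random(seed)
    families = list(family_ids)
    rng.shuffle(families)
    folds = []
    for k in range(n_folds):
        fold_families = families[k::n_folds]
        # Positive pairs: parent and child come from the same family.
        pairs = [(fam, fam, 1) for fam in fold_families]
        # Negative pairs (assumption): parent from one family, child from
        # the next family in the shuffled fold, so no family leaves its fold.
        n = len(fold_families)
        pairs += [(fam, fold_families[(i + 1) % n], 0)
                  for i, fam in enumerate(fold_families)]
        folds.append(pairs)
    return folds

folds = make_folds(range(100))
test_fold = folds[0]
train_pairs = [pair for fold in folds[1:] for pair in fold]
```

With 100 families this gives 40 test pairs and 160 training pairs per split, matching the setup above.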
V-A Parameter setup of methods
V-A1 Hand-crafted features
We employed the following image-based feature representations: BSIF, LPQ and LBP. We averaged these frame-by-frame features to represent each video by a single feature vector. The facial frames are first converted into the HSV color space with a size of 64 × 64 × 3. For BSIF feature extraction, images are divided into non-overlapping 32 × 32 blocks in each color channel. Each block is represented using 256 features, and the whole face with 256 × 4 × 3 = 3072 features. For LPQ feature extraction, images are likewise divided into non-overlapping 32 × 32 blocks in each color channel, and each block is represented using 256 features, leading to a 3072-dimensional (256 × 4 × 3) representation of the whole face. For LBP feature extraction, the images are divided into non-overlapping 16 × 16 blocks in each color channel. The LBP radius is set to 1 and the number of sampling points to 8, and 59 histogram values are used to represent each block. Thus, each facial image is represented using 59 × 16 × 3 = 2832 features. Furthermore, we also evaluated a video representation, LBP-TOP. Here the frames are converted into gray scale and divided into non-overlapping 56 × 56 blocks. The features extracted from all block volumes are concatenated to represent the appearance and motion of the kinship video. The radius is 1. For each block volume, we extracted 59 histogram features in each of the XY, XT and YT planes. Thus, one video is represented by 59 × 3 × 16 = 2832 features. Finally, we computed the cosine similarity between the two facial feature vectors of a pair.
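The feature dimensionalities quoted above can be checked with a few lines of arithmetic. The sketch assumes that the 59 histogram values correspond to the standard uniform-pattern LBP histogram (58 uniform 8-bit patterns plus one bin for all remaining patterns) and that the LBP-TOP frames are 224 × 224, which is what 16 blocks of 56 × 56 would imply; neither is stated explicitly in the text.

```python
def n_blocks(image_side, block_side):
    """Number of non-overlapping square blocks in a square image."""
    return (image_side // block_side) ** 2

def count_uniform_patterns(bits=8):
    """Count circular binary patterns with at most two 0/1 transitions."""
    count = 0
    for code in range(2 ** bits):
        transitions = sum(
            ((code >> i) & 1) != ((code >> ((i + 1) % bits)) & 1)
            for i in range(bits)
        )
        if transitions <= 2:
            count += 1
    return count

# Assumption: 58 uniform patterns + 1 bin for non-uniform ones = 59 bins.
lbp_bins = count_uniform_patterns(8) + 1

bsif_dim = 256 * n_blocks(64, 32) * 3           # 256 x 4 x 3
lpq_dim = 256 * n_blocks(64, 32) * 3            # 256 x 4 x 3
lbp_dim = lbp_bins * n_blocks(64, 16) * 3       # 59 x 16 x 3
# Assumption: 224 x 224 gray-scale frames for LBP-TOP (16 blocks of 56 x 56).
lbp_top_dim = lbp_bins * 3 * n_blocks(224, 56)  # 59 x 3 x 16
```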
V-A2 Face network
We fed the facial frames one by one with size of 90 × 224 × 224 × 3. The network is trained for 3 epochs with a mini-batch size of 40. Learning rate is set to. After collecting features from the last state of the LSTM, PCA is performed to reduce the dimension to 110.
V-A3 GMM-UBM & I-vector
We used the MSR Identity Toolbox to implement the GMM-UBM and i-vector methods. For both methods, we extract 12 Mel-frequency cepstral coefficients (MFCCs) from the audio samples with a frame size of 256 and a sample rate of 44.1 kHz. The UBM is trained with 128 Gaussian components. Finally, we obtained i-vectors with a dimensionality of 100, and used LDA to reduce the number of dimensions further to 79.
V-A4 Voice network
The voice network is pre-trained on the VoxCeleb2 dataset. We then fine-tune the last two layers of the network with learning rate of . The network is trained with a mini-batch size of 40 for 10 epochs. After training, audio features of dimensionality 512 are extracted from the last fully connected layer. PCA is then applied to reduce the feature dimension to 144.
V-A5 Baseline fusion methods
For both early and late fusion, we used PCA to keep 144 dimensions for both the video and the audio features.
V-A6 Siamese network for A-V fusion
V-B Performance evaluation
In our experiments, we adopted the Equal Error Rate (EER) and ROC curves with Area Under Curve (AUC) as measures to evaluate and compare the accuracy of the techniques. Note that a low EER and a high AUC indicate good performance of an algorithm.
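Both measures can be computed from a list of match scores and kin/non-kin labels, as in the following sketch. It assumes that higher scores indicate a kin relation, sweeps every distinct score as a threshold, and takes the EER at the threshold where the false negative and false positive rates are closest; the helper names and toy scores are hypothetical.

```python
import numpy as np

def roc_points(scores, labels):
    """False and true positive rates for every distinct score threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def eer(fpr, tpr):
    """Equal error rate: point where miss rate and false alarm rate meet."""
    fnr = 1.0 - tpr
    i = int(np.argmin(np.abs(fnr - fpr)))
    return float((fnr[i] + fpr[i]) / 2.0)

# Toy scores (hypothetical): 4 kin pairs and 4 non-kin pairs.
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
fpr, tpr = roc_points(scores, labels)
toy_auc = auc(fpr, tpr)   # 13 of 16 kin/non-kin score pairs are ordered correctly
toy_eer = eer(fpr, tpr)
```

For these toy scores the AUC is 0.8125 and the EER is 0.25.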
VI Experimental results and discussion
In this section, we present and analyze the experimental results on TALKIN videos for the different uni-modal and multi-modal kinship verification methods.
VI-A Uni-modal kinship verification
|LBP-Average [48, 47]||50.0||44.0||45.0||46.0||46.3|
|VGG-Face + LSTM||27.0||35.0||34.0||34.0||32.5|
VI-A1 Face-based kinship verification
Table IV shows the EERs of visual kinship verification for the four relations, along with their average. On average, our proposed VGG-Face network cascaded with an LSTM layer performs better, with an EER more than 29.3% lower in relative terms than that of the hand-crafted features.
The EERs in Table IV indicate the difficulty of kinship detection from faces. The EERs are high overall. In fact, as the chance level is 50%, the hand-crafted feature extraction techniques shown in the first four rows do little (or no) better than random guessing. This may not be surprising, given that none of these methods uses kinship/family labels. Moreover, cosine scoring treats all feature dimensions as equally informative, which may not hold for in-the-wild data such as TALKIN. Even though the EERs of the VGG + LSTM approach are better than random guessing, the error rates are too high to be of practical relevance. This motivates the study of voice-based and bi-modal kinship methods.
VI-A2 Voice-based kinship verification
EERs for voice-based kinship verification are shown in Table V. As with the face modality, the EERs are high. There is little difference between the two GMM-based techniques (GMM-UBM and i-vector), both yielding results close to the chance rate. As expected, here too the deep approach (ResNet-50) provides the lowest overall EER.
VI-A3 Comparison of face and voice cases
VI-B Multi-modal kinship verification
The comparison of the different fusion methods with uni-modal performance is given in Table VI for EERs and in Fig. 6 for ROC curves, where the area under the ROC curve (AUC) values are also provided. From Table VI, the proposed fusion method with a deep Siamese network achieves the best (lowest) average EER. For the FS and MD relations, early fusion and late fusion perform best, with EERs of 20.0% and 30.0%, respectively.
Compared with uni-modal kinship verification methods, fusing the face and voice modalities leads to better performance, with an average EER about 3%-10% lower. This demonstrates that the face and voice modalities provide complementary information for the kinship verification task, and that multi-modal techniques can help improve the robustness of a kinship verification system.
VII Conclusion and Future Directions
Using machine learning techniques for kinship verification has become an application of interest within the computer vision community. Inspired by a study showing that people with a kinship relation share similar vocal features and can confuse a speaker verification system, we proposed leveraging vocal information for kinship verification.
In the absence of a kinship database that contains vocal information, we collected a new kinship database, TALKIN, which comprises both facial and vocal information captured from videos of subjects talking. First, we conducted experiments on uni-modal kinship verification from both the face and the voice. Two state-of-the-art deep architectures (face and voice) were trained in a Siamese fashion with contrastive loss and provided the best average accuracy. We also proposed a deep Siamese fusion network that combines visual and vocal information for kinship verification and compares favourably to baseline late and early fusion methods. The experimental results also showed that multi-modal kinship verification provides a higher level of accuracy than uni-modal kinship verification.
In the future, we plan to investigate deep architectures for spatio-temporal fusion of visual and audio signals. A discriminative analysis will be carried out to explore the discriminative capability of the face and voice modalities. Additionally, the efficiency of deep learning models for feature extraction and fusion is also a concern. We are currently enlarging the database and planning to make it publicly available to the research community.
The authors wish to acknowledge CSC-IT Center for Science, Finland, for computational resources. The initial help from Dr. Miguel Bordallo López and Dr. Elhocine Boutellaa is also acknowledged. The authors wish to thank T.H. Kinnunen and A. Hadid for their technical advice for this paper.
-  L. M. DeBruine, F. G. Smith, B. C. Jones, S. C. Roberts, M. Petrie, and T. D. Spector, “Kin recognition signals in adult faces,” Vision research, vol. 49, no. 1, pp. 38–43, 2009.
-  M. DalMartello and L. Maloney, “Lateralization of kin recognition signals in the human face,” Journal of vision, vol. 10, no. 8, p. 9, 2010.
-  M. Dal-Martello and L. Maloney, “Where are kin recognition signals in the human face?” Journal of Vision, vol. 6, no. 12, p. 2, 2006.
-  H. Wu, S. Yang, S. Sun, C. Liu, and Y.-J. Luo, “The male advantage in child facial resemblance detection: Behavioral and erp evidence,” Social neuroscience, vol. 8, no. 6, pp. 555–567, 2013.
-  R. Fang, K. D. Tang, N. Snavely, and T. Chen, “Towards computational models of kinship verification,” in ICIP 2010.
-  J. Lu, X. Zhou, Y.-P. Tan, Y. Shang, and J. Zhou, “Neighborhood repulsed metric learning for kinship verification,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, no. 2, pp. 331–345, 2014.
-  N. Kohli, D. Yadav, M. Vatsa, R. Singh, and A. Noore, “Supervised mixed norm autoencoder for kinship verification in unconstrained videos,” IEEE Transactions on Image Processing, 2018.
-  A. Ariyaeeinia, C. Morrison, A. Malegaonkar, and S. Black, “A test of the effectiveness of speaker verification for differentiating between identical twins,” Science & Justice: Journal of the Forensic Science Society, vol. 48, no. 4, pp. 182–186, Dec. 2008.
-  H. Künzel, “Automatic Speaker Recognition of Identical Twins,” International Journal of Speech Language and the Law, vol. 17, no. 2, Feb. 2011. [Online]. Available: http://www.equinoxjournals.com/IJSLL/article/view/7829
-  P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017.
-  A. Chowdhury, Y. Atoum, L. Tran, X. Liu, and A. Ross, “Msu-avis dataset: Fusing face and voice modalities for biometric recognition in indoor surveillance videos,” in ICPR 2018. IEEE, 2018.
-  J. Liu, Z. Yuan, X. Wang, and C. Wang, “Towards good practices for multi-modal fusion in large-scale video classification,” arXiv preprint arXiv:1809.05848, 2018.
-  N. Neverova, C. Wolf, G. Taylor, and F. Nebout, “Moddrop: adaptive multi-modal gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1692–1706, 2016.
-  X. Wu, E. Granger, T. Kinnunen, X. Feng, and A. Hadid, “Audio-visual kinship verification in the wild,” in ICB 2019.
-  S. Xia, M. Shao, and Y. Fu, “Kinship verification through transfer learning,” in IJCAI 2011.
-  M. Shao, S. Xia, and Y. Fu, “Genealogical face recognition based on ub kinface database,” in CVPRw 2011.
-  S. Xia, M. Shao, J. Luo, and Y. Fu, “Understanding kin relationships in a photo,” Multimedia, IEEE Transactions on, vol. 14, no. 4, pp. 1046–1056, 2012.
-  H. Dibeklioğlu, A. Salah, and T. Gevers, “Are you really smiling at me? spontaneous versus posed enjoyment smiles,” in ECCV 2012.
-  ——, “Like father, like son: Facial expression dynamics for kinship verification,” in ICCV 2013.
-  X. Qin, X. Tan, and S. Chen, “Tri-subject kinship verification: Understanding the core of a family,” arXiv preprint arXiv:1501.02555, 2015.
-  H. Yan and J. Hu, “Video-based kinship verification using distance metric learning,” Pattern Recognition, vol. 75, pp. 15–24, 2018.
-  J. P. Robinson, M. Shao, Y. Wu, H. Liu, T. Gillis, and Y. Fu, “Visual kinship recognition of families in the wild,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  J. Lu, J. Hu, X. Zhou, J. Zhou, M. Castrillón-Santana, J. Lorenzo-Navarro, L. Kou, Y. Shang, A. Bottino, and T. Figuieiredo Vieira, “Kinship verification in the wild: The first kinship verification competition,” in IJCB 2014.
-  J. Lu, J. Hu, V. E. Liong, X. Zhou, A. Bottino, I. U. Islam, T. F. Vieira, X. Qin, X. Tan, S. Chen et al., “The fg 2015 kinship verification in the wild evaluation,” in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, vol. 1. IEEE, 2015, pp. 1–7.
-  J. P. Robinson, M. Shao, H. Zhao, Y. Wu, T. Gillis, and Y. Fu, “Rfiw: Large-scale kinship recognition challenge,” 2017, pp. 1971–1973.
-  ——, “Recognizing families in the wild (rfiw): Data challenge workshop in conjunction with acm mm 2017,” in WRFW 2017.
-  H. Yan, J. Lu, and X. Zhou, “Prototype-based discriminative feature learning for kinship verification,” IEEE Transactions on Cybernetics, vol. 45, no. 11, pp. 2535–2545, Nov 2015.
-  X. Wu, E. Boutellaa, M. B. López, X. Feng, and A. Hadid, “On the usefulness of color for kinship verification from face images,” in WIFS 2016.
-  H. Liu and C. Zhu, “Status-aware projection metric learning for kinship verification,” in Multimedia and Expo (ICME), 2017 IEEE International Conference on. IEEE, 2017, pp. 319–324.
-  I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
-  K. Zhang, Y. Huang, C. Song, H. Wu, and L. Wang, “Kinship verification with deep convolutional neural networks,” in BMVC 2015.
-  J. Lu, J. Hu, and Y.-P. Tan, “Discriminative deep metric learning for face and kinship verification,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4269–4282, 2017.
-  H. Yan, J. Lu, W. Deng, and X. Zhou, “Discriminative multimetric learning for kinship verification,” Information Forensics and Security, IEEE Transactions on, vol. 9, no. 7, pp. 1169–1178, 2014.
-  G. Peleg, G. Katzir, O. Peleg, M. Kamara, L. Brodsky, H. Hel-Or, D. Keren, and E. Nevo, “Hereditary family signature of facial expression,” Proceedings of the National Academy of Sciences, vol. 103, no. 43, pp. 15 921–15 926, 2006.
-  E. Boutellaa, M. Bordallo, S. Ait-Aoudia, X. Feng, and A. Hadid, “Kinship verification from videos using texture spatio-temporal features and deep learning features,” in International Conference on Biometrics (ICB’16), 2016.
-  A. Rosenberg, “Listener performance in speaker verification tasks,” IEEE Transactions on Audio and Electroacoustics, vol. 21, no. 3, pp. 221–225, Jun. 1973.
-  D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models,” Digital Signal Processing, vol. 10, no. 1, pp. 19–41, Jan. 2000. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1051200499903615
-  D. Zuo and P. P. K. Mok, “Formant dynamics of bilingual identical twins,” Journal of Phonetics, vol. 52, pp. 1–12, Sep. 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0095447015000182
-  T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  N. M. Correa, T. Adali, Y.-O. Li, and V. D. Calhoun, “Canonical correlation analysis for data fusion and group inferences,” IEEE signal processing magazine, vol. 27, no. 4, pp. 39–50, 2010.
-  Z. Yu, J. Yu, J. Fan, and D. Tao, “Multi-modal factorized bilinear pooling with co-attention learning for visual question answering,” in ICCV 2017.
-  M. Bordallo López, E. Boutellaa, and A. Hadid, “Comments on the ‘kinship face in the wild’ data sets,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
-  M. Dawson, A. Zisserman, and C. Nellåker, “From same photo: Cheating on visual kinship challenges,” arXiv preprint arXiv:1809.06200, 2018.
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, Oct 2016.
-  J. Kannala and E. Rahtu, “BSIF: Binarized statistical image features,” in ICPR 2012.
-  V. Ojansivu and J. Heikkilä, “Blur insensitive texture classification using local phase quantization,” in Image and Signal Processing, vol. 5099, 2008, pp. 236–243.
-  T. Ojala, M. Pietikäinen, and D. Harwood, “A comparative study of texture measures with classification based on featured distributions,” Pattern recognition, vol. 29, no. 1, pp. 51–59, 1996.
-  T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
-  G. Zhao and M. Pietikainen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 915–928, 2007.
-  A. Hyvärinen, J. Hurri, and P. O. Hoyer, Natural image statistics: A probabilistic approach to early computational vision. Springer Science & Business Media, 2009, vol. 39.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conf., 2015.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  L. Li, X. Feng, X. Wu, Z. Xia, and A. Hadid, “Kinship verification from faces via similarity metric based convolutional neural network,” in International Conference Image Analysis and Recognition. Springer, 2016, pp. 539–548.
-  N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
-  P. Kenny, “A small footprint i-vector extractor,” in Odyssey 2012: The Speaker and Language Recognition Workshop, Singapore, June 25-28, 2012, 2012, pp. 1–6. [Online]. Available: http://www.isca-speech.org/archive/odyssey_2012/od12_001.html
-  J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in Interspeech 2018.
-  S. O. Sadjadi, M. Slaney, and L. Heck, “Msr identity toolbox v1.0: A matlab toolbox for speaker recognition research,” Tech. Rep., September 2013. [Online]. Available: https://www.microsoft.com/en-us/research/publication/msr-identity-toolbox-v1-0-a-matlab-toolbox-for-speaker-recognition-research-2/
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous systems, 2015,” Software available from tensorflow. org, vol. 1, no. 2, 2015.
-  A. Vedaldi and K. Lenc, “Matconvnet – convolutional neural networks for matlab,” in Proceeding of the ACM Int. Conf. on Multimedia, 2015.