Identifying relatives from their faces is a common and easy task for humans. Relatives often wonder which facial attributes a newborn baby has inherited from which family member. The human ability of kinship recognition has been the subject of many psychological studies [5, 6]. Inspired by these studies, automatic kinship (or family) verification [9, 23] has recently been considered an interesting and open research problem in computer vision, and it is receiving increasing attention from the research community.
Automatic kinship verification from faces aims at determining whether two persons have a biological kin relation or not by comparing their facial attributes. Kinship verification is important for automatically analyzing the huge number of photos shared daily on social media, since it helps in understanding the family relationships in these photos. It is also useful in cases of missing children, elderly people with Alzheimer's disease, or possible kidnapping. For instance, suspicious behavior between two persons (e.g. an adult and a child) captured by a surveillance camera can be subject to further analysis to determine whether they are from the same family or not, in order to prevent crimes and kidnapping. Kinship verification can also be used for automatically organizing family albums and generating family trees.
Kinship verification using only facial images is a very challenging task. It inherits the research problems of face verification from images captured in the wild under adverse pose, expression, illumination and occlusion conditions. In addition, kinship verification should deal with wider intra-class and inter-class variations, as persons from the same family may look very different while faces of persons with no kin relation may look similar. Moreover, automatic kinship verification poses new challenges, since a pair of input images may be from persons of different sex (e.g. brother-sister kin) and/or with a large age difference (e.g. father-daughter kin).
Works dealing with automatic kinship verification over the past few years have shown some promising results. Typical current best-performing methods combine several face descriptors, apply metric learning approaches and compute Euclidean distances between pairs of features for kinship verification. Most of these works are based on shallow handcrafted features. Hence, they do not benefit from the recent significant progress in machine learning, which suggests the use of deep features. Moreover, the role of facial dynamics in kinship verification is mostly unexplored, as almost all the existing works focus on analyzing still facial images instead of video sequences. Based on these observations, we propose to approach the problem of kinship verification from a spatio-temporal point of view and to exploit the recent progress in deep learning for facial analysis.
Given two face video sequences, to verify their kin relationship, our proposed approach starts with detecting, segmenting and aligning the face images based on eye coordinates. Then, two types of descriptors are extracted: shallow spatio-temporal texture features and deep features. As spatio-temporal features, we extract local binary patterns (LBP), local phase quantization (LPQ) and binarized statistical image features (BSIF). These features are all extracted from Three Orthogonal Planes (TOP) of the videos. Deep features are extracted by convolutional neural networks (CNNs). The feature vectors of the face pairs to compare are then combined and used as inputs to Support Vector Machines (SVM) for classification. We conduct extensive experiments on the benchmark UvA-NEMO Smile database, obtaining very promising results, especially with the deep features. The results also clearly demonstrate the superiority of using videos over still images, hence pointing out the important role of facial dynamics in kinship verification. Furthermore, the fusion of the two types of features (i.e. shallow spatio-temporal texture features and deep features) results in significant performance improvements compared to state-of-the-art methods.
2 Related work
Kinship verification using facial images is receiving increasing interest from the research community. This is mainly motivated by its potential applications, especially in analyzing the data shared daily on the social web. The first approaches to tackle kinship verification were based on low-level handcrafted feature extraction and SVM or K-NN classifiers. For instance, Gabor gradient orientation pyramids have been used by Zhou et al.; Yan et al. used a spatial pyramid learning descriptor; and self-similarity of Weber faces was used by Kohli et al. However, the best performance is usually obtained by combining several types of features. For example, in the last kinship competition, all the proposed methods used three or more descriptors. The best performing method in this competition employed four different local features (LBP, HOG, OCLBP and Fisher vectors).
On the other hand, different metric learning approaches have been investigated to tackle the kinship verification problem. For example, Lu et al.  learned a distance metric where the face pairs with a kin relation are pulled close to each other and those without a kin relation are pushed away. Recently, Zhou et al.  applied ensemble similarity learning for solving the kinship verification problem. They learned an ensemble of sparse bi-linear similarity bases from kinship data by minimizing the violation of the kinship constraints between pairs of images and maximizing the diversity of the similarity bases. Yan et al.  and Hu et al.  learned multiple distance metrics based on various features, by simultaneously maximizing the kinship constraint (pairs with a kinship relation must have a smaller distance than pairs without a kinship relation) and the correlation of different features.
The most recent trends are motivated by the impressive success of deep learning approaches in various image representation and classification tasks in general and face recognition in particular. Zhang et al. recently proposed a convolutional neural network architecture for face-based kinship verification. The proposed architecture is composed of two convolution max-pooling layers followed by a convolution layer and then a fully connected layer. A two-way softmax classifier is used as the final layer to train the network. The network takes a pair of RGB face images of different persons as an input, checking the possible kin relations. However, their reported results do not outperform the shallow methods presented in the FG15 kinship competition on the same datasets. The reason behind this may be the scarcity of training data, since deep learning approaches require a sufficient number of training samples, a requirement that is not fulfilled by the currently available face kinship databases.
While most of the published works address the kinship problem from images, to our knowledge the only work that performed kinship verification from videos was conducted by Dibeklioglu et al. The authors combined facial expression dynamics with temporal facial appearance as features and used SVM for classification. In the present work, we aim to exploit the temporal information present in face videos, investigating the use of both spatio-temporal shallow features and deep features describing faces.
3 Video-based kinship verification
This section describes our approach for kinship verification using facial videos. In the following, the steps of the proposed approach are detailed.
3.1 Face detection and cropping
In our approach, the first step consists of segmenting the face region from each video sequence. For that purpose, we employ an active shape model (ASM) based approach that detects 68 facial landmarks. The regions containing faces are then cropped from every frame in the video using the detected landmarks. Finally, the face regions are aligned using key landmark points and registered to a predefined template.
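The registration step can be illustrated with a minimal sketch. It assumes that alignment is done with a two-point similarity transform (rotation, uniform scale and translation) that maps the detected eye centres onto fixed template positions; the function names and the coordinate values below are hypothetical and are not taken from the original implementation.

```python
def eye_alignment_transform(left_eye, right_eye, template_left, template_right):
    """Return (a, b) such that z -> a * z + b maps the detected eyes onto the template.

    Points are (x, y) tuples. Treating them as complex numbers lets a single
    complex factor `a` encode rotation and uniform scale, and `b` the translation.
    """
    s1, s2 = complex(*left_eye), complex(*right_eye)
    t1, t2 = complex(*template_left), complex(*template_right)
    if s1 == s2:
        raise ValueError("eye centres must be distinct")
    a = (t2 - t1) / (s2 - s1)
    b = t1 - a * s1
    return a, b


def apply_transform(transform, point):
    """Apply the similarity transform to one (x, y) landmark."""
    a, b = transform
    z = a * complex(*point) + b
    return (z.real, z.imag)


# Hypothetical detected eye centres and template positions (in pixels).
transform = eye_alignment_transform((30.0, 40.0), (70.0, 50.0),
                                    (20.0, 24.0), (44.0, 24.0))
aligned_left = apply_transform(transform, (30.0, 40.0))
aligned_right = apply_transform(transform, (70.0, 50.0))
```

The same transform would then be applied to every pixel (or to the remaining landmarks) of the frame, so that all faces share the same eye positions before feature extraction.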
3.2 Face description
For describing faces from videos, we use two types of features: spatio-temporal texture features and deep learning features. These features are introduced in this subsection.
3.2.1 Spatio-temporal features
Spatio-temporal texture features have been shown to be efficient for describing faces in various face analysis tasks, such as face recognition and facial expression classification. In this work, we extract three local texture descriptors: LBP, LPQ and BSIF. Each of these features describes an image using a histogram of decimal values. The code corresponding to each pixel in the image is computed from a series of binary responses of the pixel neighborhood to a filter bank. In LBP and LPQ the filters are handcrafted, while the filters of BSIF are learned from natural images. Specifically, the binary code of a pixel in LBP is computed by thresholding the circularly symmetric neighboring pixels (located on a circle of a given radius) against the value of the pixel itself. LPQ encodes the local phase information of four frequencies of the short-term Fourier transform (STFT) over a local window surrounding the pixel. BSIF binarizes the responses of a set of independent filters of a given size learnt by independent component analysis (ICA).
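For reference, the basic LBP operator described above is commonly written as follows. The symbols are notation assumed here for illustration: $P$ is the number of sampled neighbors, $R$ the radius of the circle, $g_c$ the gray value of the center pixel and $g_p$ the gray value of the $p$-th neighbor.

$$
\mathrm{LBP}_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^{p},
\qquad
s(u) =
\begin{cases}
1, & u \geq 0 \\
0, & u < 0
\end{cases}
$$

With $P = 8$ neighbors, each pixel receives a code between 0 and 255, and the image is described by the 256-bin histogram of these codes.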
The spatio-temporal textural dynamics of the face in a video are extracted from three orthogonal planes XY, XT, and YT, separately. X and Y are the horizontal and vertical spatial axes of the video, and T refers to the time. The texture features of each plane are aggregated into a separate histogram. Then the three histograms are concatenated into a single feature vector. To benefit from the multi-resolution representation, the three features are extracted at multiple scales, varying their parameters. For the LBP descriptor, the selected parameters are and . For LPQ and BSIF descriptors, the filter sizes were selected as .
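The three-plane aggregation can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: it assumes a single scale with 8 neighbors at radius 1 taken on the square pixel grid (no circular interpolation), skips border pixels, and stores the video as a nested list `video[t][y][x]`. All function names are hypothetical.

```python
def lbp_codes(plane):
    """8-neighbour, radius-1 LBP codes for the interior pixels of a 2-D list."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = []
    for r in range(1, len(plane) - 1):
        for c in range(1, len(plane[0]) - 1):
            centre = plane[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Threshold each neighbour against the centre pixel.
                if plane[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            codes.append(code)
    return codes


def plane_histogram(planes):
    """Normalised 256-bin histogram of LBP codes over a stack of 2-D slices."""
    hist = [0] * 256
    for plane in planes:
        for code in lbp_codes(plane):
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def lbp_top(video):
    """video[t][y][x] -> concatenated XY, XT and YT histograms (3 * 256 values)."""
    n_t, n_y, n_x = len(video), len(video[0]), len(video[0][0])
    xy = [video[t] for t in range(n_t)]                                # one slice per frame
    xt = [[video[t][y] for t in range(n_t)] for y in range(n_y)]       # rows = t, cols = x
    yt = [[[video[t][y][x] for y in range(n_y)] for t in range(n_t)]
          for x in range(n_x)]                                         # rows = t, cols = y
    return plane_histogram(xy) + plane_histogram(xt) + plane_histogram(yt)


# Tiny synthetic volume: 4 frames of 5 x 6 pixels with deterministic values.
video = [[[(7 * t + 3 * y + 5 * x) % 11 for x in range(6)]
          for y in range(5)] for t in range(4)]
descriptor = lbp_top(video)
```

A multi-scale descriptor would repeat this extraction with different neighborhood parameters and concatenate the resulting vectors; the LPQ and BSIF variants would only replace `lbp_codes` with their own per-pixel coding.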
3.2.2 Deep learning features
Deep neural networks have recently been outperforming the state of the art in various classification tasks. In particular, convolutional neural networks (CNNs) have demonstrated impressive performance in object classification in general and face recognition in particular. However, deep neural networks require a huge amount of training data to learn efficient features. Unfortunately, such an amount of data is not available in the current kinship databases. We conducted preliminary experiments using a Siamese CNN architecture as well as a deep architecture proposed in a previous work. As expected, both approaches resulted in lower performance than using shallow features, due to the lack of sufficient training data. An alternative for extracting deep face features is to use a pre-trained network. A number of very deep pre-trained architectures have already been made available to the research community. Motivated by the similarities between the face recognition and kinship verification problems, where the goal is to compute the common features in two facial representations, we decided to use the VGG-face network. VGG-face was initially trained for face recognition on a reasonably large dataset of 2.6 million images of over 2,600 people. This network has been evaluated for face verification from both pairs of images and videos, showing interesting performance compared against the state of the art.
The detailed parameters of the VGG-face CNN are provided in Table 1. The input of the network is an RGB face image of size 224 × 224 pixels. The network is composed of 13 linear convolution layers (conv), each followed by a non-linear rectification layer (relu). Some of these rectification layers are followed by a non-linear max pooling layer (mpool). These are followed by two fully connected layers (fc), both outputting a vector of size 4096. At the top of the initial network are a fully connected layer whose size equals the number of classes to predict (2622) and a softmax layer for computing the class posterior probabilities.
In this context, to extract deep face features for kinship verification, we input the video frames one by one to the CNN and collect the feature vector issued by the fully connected layer fc7 (all the layers of the CNN except the class predictor fc8 layer and the softmax layer are used). Finally, all the frames’ features of a given face video are averaged, resulting in a video descriptor that can be used for classification.
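The frame-to-video pooling step can be sketched as follows. The function `extract_features` is a hypothetical stand-in for a forward pass through the pre-trained CNN truncated at fc7; here it is replaced by a toy function so that the example runs without any deep learning library.

```python
def average_frame_features(frames, extract_features):
    """Average per-frame feature vectors into a single video descriptor."""
    features = [extract_features(frame) for frame in frames]
    if not features:
        raise ValueError("the video contains no frames")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all frame features must have the same length")
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def toy_extractor(frame):
    """Placeholder for the CNN: returns a 2-D 'feature' (sum and max of the frame)."""
    return [float(sum(frame)), float(max(frame))]


# Two toy 'frames', each a flat list of pixel values.
frames = [[1.0, 2.0], [3.0, 6.0]]
video_descriptor = average_frame_features(frames, toy_extractor)
```

Averaging keeps the descriptor length independent of the number of frames, which is what allows videos of different durations to be compared with a fixed-size classifier input.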
To classify a pair of face features as positive (the two persons have a kinship relation) or negative (no kinship relation between the two persons), we use a binary linear Support Vector Machine (SVM) classifier. Before feeding the features to the SVM, each pair of features has to be transformed into a single feature vector, as required by the classifier. We have examined various ways of combining a pair of features, such as concatenation and vector distances. We have empirically found that the normalized absolute difference shows the best performance. Therefore, in our experiments, each pair of feature vectors is represented by a single vector given by their normalized absolute difference.
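A minimal sketch of such a pair representation is given below. The exact normalization used in the experiments is not recoverable from the text here, so this example assumes, purely for illustration, that the element-wise absolute difference is divided by its L1 norm; the function name is hypothetical.

```python
def pair_vector(x, y):
    """Combine two feature vectors into one via a normalised absolute difference.

    Assumption: the normalisation divides by the sum of absolute differences
    (L1 norm). Other choices would fit the description equally well.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    diff = [abs(a - b) for a, b in zip(x, y)]
    norm = sum(diff)
    if norm == 0:
        return diff  # identical inputs: the difference is the zero vector
    return [d / norm for d in diff]


# Toy 3-D descriptors for the two persons of a pair.
x = [1.0, 4.0, 2.0]
y = [3.0, 1.0, 2.0]
f = pair_vector(x, y)
```

Unlike concatenation, this representation is symmetric in the two inputs, so the classifier sees the same vector regardless of which person of the pair comes first.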
Table 2 (columns): Relation | Subj. # | Vid. # | Subj. # | Vid. #
4.1 Database and test protocol
To evaluate the proposed approach, we use the UvA-NEMO Smile database, which is currently the only available video kinship database. The database was initially collected for analyzing posed versus spontaneous smiles of subjects. Videos are recorded with a resolution of pixels at a rate of frames per second under controlled illumination conditions. A color chart is placed on the background of the videos to allow further illumination and color normalization. The videos are collected in controlled conditions and do not show any kind of bias. The ages of the subjects in the database vary from to years. Many families participated in the database collection, allowing its use for the evaluation of automatic kinship verification from videos. A total of kin relations were identified between subjects in the database. There are seven different kin relations between pairs of videos: Sister-Sister (S-S), Brother-Brother (B-B), Sister-Brother (S-B), Mother-Daughter (M-D), Mother-Son (M-S), Father-Daughter (F-D), and Father-Son (F-S). The association of the videos of persons having kinship relations gives pairs of spontaneous and pairs of posed smile videos. The statistics of the database are summarized in Table 2.
Following previous work, we randomly generate negative kinship pairs corresponding to each positive pair. Specifically, for each positive pair we associate the first video with a video of another person within the same kin subset, while ensuring that there is no relation between the two subjects. For all the experiments, we perform a per-relationship evaluation and report the average of spontaneous and posed videos. The accuracy for the whole database, obtained by pooling all the relations, is also provided. Since the number of pairs of each relation is small, we apply a leave-one-out evaluation scheme.
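The protocol above can be sketched as follows. This is an illustrative reading of the description, not the original code: it works on subject identifiers rather than videos, assumes that the replacement person is drawn from the second members of the other pairs in the same subset, and uses a fixed random seed. All names are hypothetical.

```python
import random


def make_negative_pairs(positive_pairs, related, seed=0):
    """Create one negative pair per positive pair within a kin subset.

    positive_pairs: list of (first, second) subject identifiers.
    related: set of frozenset({a, b}) for every known kin relation.
    """
    rng = random.Random(seed)
    candidates = sorted({second for _, second in positive_pairs})
    negatives = []
    for first, _ in positive_pairs:
        pool = [c for c in candidates
                if c != first and frozenset((first, c)) not in related]
        if not pool:
            raise ValueError(f"no unrelated candidate for {first!r}")
        negatives.append((first, rng.choice(pool)))
    return negatives


def leave_one_out(n_samples):
    """Yield (train_indices, test_index) for every sample in turn."""
    for held_out in range(n_samples):
        yield [i for i in range(n_samples) if i != held_out], held_out


# Toy Father-Son subset with three positive pairs.
positive_pairs = [("f1", "s1"), ("f2", "s2"), ("f3", "s3")]
related = {frozenset(pair) for pair in positive_pairs}
negatives = make_negative_pairs(positive_pairs, related)
splits = list(leave_one_out(len(positive_pairs)))
```

In each leave-one-out round the classifier is trained on all pairs but one and tested on the held-out pair, so every pair is used for testing exactly once.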
4.2 Results and analysis
We have performed various experiments to assess the performance of the proposed approach. In the following, we present and analyze the reported results.
Deep features against shallow features: First we compare the performance of deep features against the spatio-temporal features. The results for different features are reported in Table 3. The ROC curves for separate relations as well as for the whole database are depicted in Fig. 1. The performances of the three spatio-temporal features (LBPTOP, LPQTOP and BSIFTOP) show competitive results on different kinship relations. Considering the average accuracy and the accuracy of the whole set, LPQTOP is the best performing method, closely followed by the BSIFTOP, while LBPTOP shows the worst performance.
On the other hand, deep features report the best performance on all kinship relations significantly improving the verification accuracy. The gain in verification performance of the deep features varies between and , for different relations, when compared to the best spatio-temporal accuracy. These results highlight the ability of CNNs in learning face descriptors. Even though the network has been trained for face recognition, the extracted face deep features are highly discriminative when used in the kinship verification task.
Comparing relations: The best verification accuracy is obtained for B-B and F-S, while the lowest is obtained for S-B and M-S. These results may be due to the different sex of the persons in these pairs, which suggests that checking the kinship relation is easier between persons of the same gender. However, a further analysis of this point is needed, as the accuracy of S-S is only average in our case. It is also notable that the performance of kinship verification between males (B-B and F-S) is better than between females (M-D and S-S). Moreover, large age differences between the persons composing a pair have an effect on the kinship verification accuracy. For instance, the age difference between brothers (best performance) is lower than that of M-S pairs (lowest performance).
| Fang et al. | 61.36 | 56.67 | 56.25 | 56.14 | 55.56 | 57.14 | 55.26 | 56.91 | 53.51 |
| Guo & Wang | 65.91 | 56.67 | 60.94 | 58.77 | 62.50 | 67.86 | 55.26 | 61.13 | 56.14 |
| Zhou et al. | 63.64 | 70.00 | 60.94 | 57.02 | 56.94 | 66.07 | 60.53 | 62.16 | 58.55 |
| Dibeklioglu et al. | 75.00 | 70.00 | 68.75 | 67.54 | 75.00 | 75.00 | 78.95 | 72.89 | 67.11 |
Videos vs. images: We have carried out an experiment to check whether verifying kinship relations from videos instead of images is worthwhile. For this purpose, we employ the first frame from each video of the database. For this experiment, spatial variants of the texture features (LBP, LPQ and BSIF) and deep features are extracted from the face images. Fig. 2 shows the ROC curves comparing the performance of videos against still images for the pool of all relationships. The superiority of videos over still images is evident for each feature, demonstrating the importance of face dynamics in verifying kinship between persons. Again, deep features extracted from still face images demonstrate high discriminative ability, outperforming both the spatial texture features extracted from images and the spatio-temporal features extracted from videos. We note that, in still images (see Fig. 2), LPQ features outperform both LBP and BSIF, achieving results analogous to the ones computed using video data.
Feature fusion and comparison against state of the art: In order to check their complementarity, we have fused the spatio-temporal features and the deep features. We performed preliminary experiments and empirically found that score-level fusion performs better than feature-level fusion. In this context, and for simplicity, we have opted for a simple sum at the score level to perform the fusion. Table 4 shows a comparison of the fusion results with the previous works. Overall, the proposed fusion scheme further improved the verification accuracy by a significant margin. This effect is more evident in the relationships involving different sexes and larger age variation, such as M-S (improved by ) and F-D (improved by ).
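Sum-rule fusion at the score level can be illustrated with a short sketch. It assumes that each feature type has its own classifier producing one signed decision score per test pair, that these scores are on comparable scales, and that a fused score above zero means "kin"; the threshold, the function names and the score values are hypothetical.

```python
def sum_fusion(score_lists):
    """Element-wise sum of the decision scores produced by several classifiers."""
    n = len(score_lists[0])
    if any(len(scores) != n for scores in score_lists):
        raise ValueError("all classifiers must score the same test pairs")
    return [sum(scores) for scores in zip(*score_lists)]


def decide(fused_scores, threshold=0.0):
    """Label a pair as kin (True) when its fused score exceeds the threshold."""
    return [score > threshold for score in fused_scores]


# Toy decision scores for three test pairs from two classifiers.
texture_scores = [0.5, -1.0, 0.25]   # e.g. a spatio-temporal texture classifier
deep_scores = [-0.25, -0.5, 1.0]     # e.g. a deep-feature classifier
fused = sum_fusion([texture_scores, deep_scores])
decisions = decide(fused)
```

In the first toy pair the texture classifier outweighs a weakly negative deep score, which is the kind of complementarity that score-level fusion is meant to exploit.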
Comparing our results against the previously reported state-of-the-art demonstrates considerable improvements in all the kinship subsets, as shown in Table 4. Depending on the relation type, the improvement in verification accuracy of our approach compared with the best performing method presented by Dibeklioglu et al.  ranges from to . The average accuracy of all the kin relations has been improved by over .
In this work, we have investigated the kinship verification problem from face video sequences. In our experiments, faces are described using both spatio-temporal features and deep learned features. Experimental evaluation has been performed using the kinship protocol of the UvA-NEMO Smile video database. We have shown that, in kinship verification, approaches using video information outperform the ones using still images. Our study demonstrates the high efficiency of deep features learned for face recognition in describing faces for inferring kinship relations. Further fusion of spatio-temporal features and deep features yielded additional improvements in the verification accuracy. A comparison of our approach against the previous state-of-the-art work indicates significant improvements in verification accuracy.
Even with a CNN pre-trained for face recognition, we have obtained improved results for kinship verification. This demonstrates the generalization ability of deep features to similar tasks. Even though the deep feature experiments shown in our work are very promising, these features are extracted on a frame-by-frame basis. Employing a deep video architecture would probably lead to better results. However, the scarcity of kinship videos in currently available databases prevented us from opting for such a solution. In this context, future work includes the collection of a large kinship video database including real-world challenges to enable learning deep video features.
E. Boutellaa acknowledges the financial support of the Algerian MESRS and CDTA under the grant number 060/PNE/ENS/FINLANDE/2014-2015. The support of the Academy of Finland, Northwestern Polytechnical University and the Shaanxi Province is also acknowledged.
- [1] T. Ahonen, A. Hadid, and M. Pietikäinen. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 28(12):2037–2041, 2006.
- [2] T. Ahonen, E. Rahtu, V. Ojansivu, and J. Heikkilä. Recognition of blurred faces using local phase quantization. In Int. Conf. on Pattern Recognition, pages 1–4, 2008.
- [3] M. Bordallo Lopez, E. Boutellaa, and A. Hadid. Comments on the "kinship face in the wild" data sets. IEEE Trans. Pattern Anal. Mach. Intell., 2016.
- [4] C. H. Chan, M. Tahir, J. Kittler, and M. Pietikäinen. Multiscale local phase quantization for robust component-based face recognition using kernel fusion of multiple descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 35(5):1164–1177, 2013.
- [5] M. F. DalMartello and L. T. Maloney. Where are kin recognition signals in the human face? Journal of Vision, 6(12):2, 2006.
- [6] L. M. DeBruine, F. G. Smith, B. C. Jones, S. C. Roberts, M. Petrie, and T. D. Spector. Kin recognition signals in adult faces. Vision Research, 49(1):38–43, 2009.
- [7] H. Dibeklioğlu, A. Salah, and T. Gevers. Are you really smiling at me? spontaneous versus posed enjoyment smiles. In Computer Vision–ECCV, pages 525–538. Springer, 2012.
- [8] H. Dibeklioğlu, A. Salah, and T. Gevers. Like father like son facial expression dynamics for kinship verification. In IEEE Int. Conf. on Computer Vision, pages 1497–1504, 2013.
- [9] R. Fang, K. Tang, N. Snavely, and T. Chen. Towards computational models of kinship verification. In IEEE Int. Conf. on Image Processing, pages 1577–1580, 2010.
- [10] G. Guo and X. Wang. Kinship measurement on salient facial features. IEEE Instrum. Meas. Mag., 61(8):2322–2325, 2012.
- [11] J. Hu, J. Lu, J. Yuan, and Y.-P. Tan. Large margin multi-metric learning for face and kinship verification in the wild. In Computer Vision–ACCV, pages 252–267. Springer, 2015.
- [12] J. Kannala and E. Rahtu. BSIF: Binarized statistical image features. In Int. Conf. on Pattern Recognition (ICPR), pages 1363–1366, 2012.
- [13] N. Kohli, R. Singh, and M. Vatsa. Self-similarity representation of weber faces for kinship classification. In IEEE Int. Conf. on Biometrics: Theory, Applications and Systems (BTAS), pages 245–250, 2012.
- [14] J. Lu et al. Kinship verification in the wild: The first kinship verification competition. In IEEE Int. Joint Conf. on Biometrics (IJCB), pages 1–6, 2014.
- [15] J. Lu et al. The FG 2015 kinship verification in the wild evaluation. In IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG), volume 1, pages 1–7, 2015.
- [16] J. Lu, X. Zhou, Y.-P. Tan, Y. Shang, and J. Zhou. Neighborhood repulsed metric learning for kinship verification. IEEE Trans. Pattern Anal. Mach. Intell., 36(2):331–345, 2014.
- [17] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conf., 2015.
- [18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- [19] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 1891–1898, 2014.
- [20] H. Yan, J. Lu, W. Deng, and X. Zhou. Discriminative multimetric learning for kinship verification. IEEE Trans. Inf. Forensics Security, 9(7):1169–1178, 2014.
- [21] K. Zhang, Y. Huang, C. Song, H. Wu, and L. Wang. Kinship verification with deep convolutional neural networks. In British Machine Vision Conf. (BMVC), 2015.
- [22] G. Zhao and M. Pietikäinen. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell., 29(6):915–928, 2007.
- [23] X. Zhou, J. Hu, J. Lu, Y. Shang, and Y. Guan. Kinship verification from facial images under uncontrolled conditions. In ACM Int. Conf. on Multimedia, pages 953–956, 2011.
- [24] X. Zhou, J. Lu, J. Hu, and Y. Shang. Gabor-based gradient orientation pyramid for kinship verification under uncontrolled environments. In ACM Int. Conf. on Multimedia, pages 725–728, 2012.
- [25] X. Zhou, Y. Shang, H. Yan, and G. Guo. Ensemble similarity learning for kinship verification from facial images in the wild. Information Fusion, 2015.