Face recognition using neural networks and RGB image data is still on the rise thanks to the abundance of raw data on the Internet, more public datasets [12, 20, 4], and popular computational frameworks being open source. Algorithms in this field reach new records [10, 11] in both quality and resource consumption that were unimaginable several years ago.
Automatic kinship recognition using visual information is very similar to face verification and, thanks to the growing Families In the Wild (FIW) dataset, is getting more attention in the research community.
The annual Recognizing Families In the Wild (RFIW) Data Challenge tries to bring this difficult problem closer to a solution applicable to real-world tasks. Every year contestants show new approaches based on metric learning [17, 22], ensembling of different facial features, and the introduction of additional data pre-processing. We decided to take a step back, start with a better baseline model, and then gradually apply different approaches to improve performance. To our surprise, basic fine-tuning was enough to outperform the other competitors. In this work we make three main contributions:
- We propose a new baseline for evaluating kinship recognition methods, which is based on recent advancements in the face recognition field.
- We design a pipeline for fine-tuning face verification models for the visual kinship recognition task based on the new baseline.
- We show that our approach achieves the best performance in the Recognizing Families In the Wild Data Challenge (tracks 1 and 3). We encourage you to download our implementation and test it: https://github.com/vuvko/fitw2020
The rest of this paper is organized as follows: in section II we briefly review existing methods that are useful for our task of kinship recognition and family search, in section III we explain our proposed baseline and pipeline, in section IV we evaluate our approach on the RFIW challenge dataset, and in section V we conclude our work and discuss possible further improvements.
II Related Work
Our task is to recognize a binary feature (kin, non-kin) given two face images. There are several different approaches we can use to tackle this problem, from hand-crafted features to generative methods for data augmentation. Here we focus on closely related work in similar fields.
Convolutional neural networks (CNNs) have become the backbone of a variety of methods in computer vision. Since the introduction of AlexNet, hand-crafted features have been quickly replaced by better and sometimes even faster algorithms based on CNNs.
Face recognition is an example of a fast-moving field in computer vision. The progress is due to novel datasets [12, 4, 20] for training and evaluation, the introduction of different architectures and learning methods [25, 18, 26, 6], more data challenges [14, 8], and industrial interest. Its similarity to visual kinship recognition makes face recognition models good candidates for fine-tuning for our task. It has also been shown that better base models can achieve much higher performance than carefully tuned older ones.
Image retrieval has also benefited from incorporating CNNs. The deep representation from the penultimate layer was shown [9, 2] to be a good feature. Typically, methods were tuned for the retrieval task with metric learning, but the classification approach using proxies [21, 27] has become a promising new design.
III Proposed Pipeline
Fine-tuned SphereFace was previously proposed as the best baseline benchmark for our task. However, better algorithms have been proposed in recent years. ArcFace achieved better performance in face verification and is widely recognized in the GitHub community (original implementation: https://github.com/deepinsight/insightface), with several re-implementations readily available in every popular framework. Thus, we used it as a new baseline and as the foundation for our design.
III-A Extracting Face Embeddings
We could use images from the FIW dataset without special preparation to obtain the image embeddings, but face recognition models behave differently depending on the face alignment technique that was used during their training. Moreover, better face detection and registration further improve model performance. Knowing this, we re-detected faces in the challenge’s dataset and aligned them with landmarks from the RetinaFace detector. At this step, some faces were not detected with the selected confidence threshold and were removed from the training and validation sets. For the test set, such images were simply resized to fit the input of the face recognition model. There were a total of images removed from the training set, images from validation, and problematic images that occurred in the test set.
The ArcFace model was used for feature extraction. This model was pre-trained on the cleaned MS-Celeb-1M dataset (the clean dataset can be downloaded from https://github.com/deepinsight/insightface/wiki/Dataset-Zoo) and has an embedding dimension of 512.
To compare two images and we used cosine distance between their computed embeddings and :
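The comparison of two embeddings can be sketched in a few lines of Python. This is a minimal sketch, assuming the conventional definitions: cosine similarity is the normalized dot product, and cosine distance is one minus that similarity. The function names are hypothetical and plain Python lists stand in for the 512-dimensional ArcFace embeddings. If the comparison is instead expressed directly as a similarity, the ordering of pairs is simply reversed.

```python
import math
from typing import Sequence


def cosine_similarity(e1: Sequence[float], e2: Sequence[float]) -> float:
    """Normalized dot product of two embedding vectors, in [-1, 1]."""
    if len(e1) != len(e2):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(e1, e2))
    norm1 = math.sqrt(sum(a * a for a in e1))
    norm2 = math.sqrt(sum(b * b for b in e2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


def cosine_distance(e1: Sequence[float], e2: Sequence[float]) -> float:
    """Assumed convention: distance = 1 - similarity, so 0 means same direction."""
    return 1.0 - cosine_similarity(e1, e2)
```

Because both quantities depend only on the angle between the vectors, rescaling an embedding does not change the result.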
III-B Transfer Learning
The main difference between the face and kinship recognition tasks is in the relative distance between different people. While face recognition cares mostly about pictures of the same person being close in embedding space, kinship recognition tries to achieve a much harder goal. In the latter, related people must be closer to each other than to other groups of people, ideally forming family clusters. The difference in the amount of available labeled data and the similarity between the tasks make face recognition a great source domain for transfer to the kinship recognition domain.
In previous iterations of RFIW, metric learning was used as a transfer learning approach and achieved great performance. However, it requires pairs of images to be carefully sampled to confidently estimate the distribution of all distances in the metric space. Another concern is that it requires a significant amount of time to train the final model. Given the family association of each person, we can instead construct a family classification problem similar to recent methods in face recognition [18, 6, 26] and metric learning for image retrieval [21, 27]. With this, our loss function is:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n}e^{W_j^{T}x_i + b_j}}$$
where $N$ is the batch size, $n$ is the number of families, $x_i$ is the image embedding of a member of the $y_i$-th family, $W$ is the weight matrix (with $W_j$ denoting its $j$-th column), and $b$ is the bias of the classification layer.
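The family classification loss described above is a softmax cross-entropy over family labels. The following is a minimal sketch in plain Python, assuming each embedding is a list of floats and the classification layer is given as one weight vector and one bias per family. The function name and the `normalize` flag are hypothetical; the flag illustrates the variant in which embeddings are scaled to unit norm before the classification layer.

```python
import math
from typing import Sequence


def family_classification_loss(
    embeddings: Sequence[Sequence[float]],  # one embedding per image in the batch
    labels: Sequence[int],                  # family index of each image
    weights: Sequence[Sequence[float]],     # one weight vector per family
    biases: Sequence[float],                # one bias per family
    normalize: bool = False,                # assumption: unit-norm embeddings first
) -> float:
    """Mean softmax cross-entropy of a linear family classifier."""
    total = 0.0
    for x, y in zip(embeddings, labels):
        if normalize:
            norm = math.sqrt(sum(v * v for v in x))
            x = [v / norm for v in x]
        logits = [
            sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
            for w, b in zip(weights, biases)
        ]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_sum_exp - logits[y]
    return total / len(labels)
```

With all weights and biases set to zero, every family is equally likely and the loss equals the logarithm of the number of families, which is a convenient sanity check before training.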
III-C Forming Validation Pairs
In track 1 of the RFIW Data Challenge, images are divided between families. As the distribution of images between persons and between families in the challenge’s data is non-uniform, we need to be careful when sampling pairs, since validating the model offline is crucial given the limited number of submissions. We sampled positive and negative pairs uniformly across all families in the validation set using algorithm 1.
This approach lets us validate our model without over-representing popular families that have many images. We used the AUC ROC metric on the validation pairs (see fig. 1) to choose the best model for submission. Another problem occurs, though: as we select uniformly between families, families with a low member count are given higher priority. We chose to resolve this issue with a higher binarization threshold.
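The idea of drawing validation pairs uniformly over families can be sketched as follows. This is not the algorithm referenced in the text, only an illustration under stated assumptions: the data layout (family id, then person id, then image paths), the function name, and the simplification that any two different members of a family form a positive pair are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Families = Dict[str, Dict[str, List[str]]]  # family id -> person id -> image paths
Pair = Tuple[str, str, int]                 # (image a, image b, label); 1 = kin


def sample_validation_pairs(families: Families, n_pairs: int, seed: int = 0) -> List[Pair]:
    """Return n_pairs positive and n_pairs negative pairs, uniform over families."""
    rng = random.Random(seed)
    all_families = sorted(families)
    # A positive pair needs two different persons from the same family.
    eligible = [f for f in all_families if len(families[f]) >= 2]
    if not eligible or len(all_families) < 2:
        raise ValueError("need a family with two members and at least two families")
    pairs: List[Pair] = []
    for _ in range(n_pairs):
        # Positive: the family is drawn uniformly, so large families are not favoured.
        fam = rng.choice(eligible)
        p1, p2 = rng.sample(sorted(families[fam]), 2)
        pairs.append((rng.choice(families[fam][p1]), rng.choice(families[fam][p2]), 1))
        # Negative: two different families, each drawn uniformly.
        f1, f2 = rng.sample(all_families, 2)
        q1 = rng.choice(sorted(families[f1]))
        q2 = rng.choice(sorted(families[f2]))
        pairs.append((rng.choice(families[f1][q1]), rng.choice(families[f2][q2]), 0))
    return pairs
```

Sampling the family first and the image second is what removes the bias towards families with many images; it is also what gives small families a relatively higher weight, as noted above.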
III-D Choosing Binarization Threshold
Given a comparison of two image embeddings using cosine distance (1), we need a binarization function to obtain the final kin or non-kin decision. In our case, simple binarization by threshold was used. The threshold can be chosen based on the trade-off between the false positive rate and the true positive rate. As we had no prior knowledge of how the test pairs were selected, we chose the target false positive rate based on our three submissions for the final phase.
Given the information about the pair’s possible kind of kinship relation, we could have chosen the threshold for every kind separately. Alas, we could only submit results and some kinds of relations had too few pairs to confidently select a threshold. We further discuss this in section IV-C.
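Choosing a threshold for a target false positive rate can be illustrated with a short sketch. It assumes that each validation pair has a comparison score where larger means more likely kin (for a true distance the inequalities flip), and that a pair is predicted as kin when its score is strictly above the threshold. The function names and the example target rate are hypothetical.

```python
import math
from typing import Sequence, Tuple


def threshold_for_target_fpr(scores: Sequence[float], labels: Sequence[int],
                             target_fpr: float) -> float:
    """Lowest threshold whose false positive rate does not exceed target_fpr."""
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must be in [0, 1]")
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not negatives:
        raise ValueError("at least one non-kin pair is required")
    allowed = int(target_fpr * len(negatives))  # non-kin pairs allowed above threshold
    if allowed >= len(negatives):
        return -math.inf
    return negatives[allowed]


def rates(scores: Sequence[float], labels: Sequence[int],
          threshold: float) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) for 'score > threshold'."""
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > threshold)
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > threshold)
    n_neg = sum(1 for y in labels if y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    return fp / n_neg, (tp / n_pos if n_pos else 0.0)
```

The same function can be applied to the pairs of a single relation type to obtain a per-type threshold, provided that type has enough non-kin pairs for the estimate to be stable.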
III-E Using for Retrieval
The retrieval task can be reduced to a series of verification tasks. Every probe image is matched with every gallery image to construct a retrieval matrix (every row contains the retrieval result for one probe). Our task also has more than one image for every probe person, and we can handle this in different ways. One way is to create an aggregated feature for each probe person by averaging all the embeddings from their images. This reduces our task to a single feature per probe, from which we can form a retrieval matrix.
The other way is to use the matching results from all images. To do so, we need an aggregation function that consolidates the distances between images. Given a gallery image $g$ and a probe person $P$ with images $p_1, \dots, p_k$, we can sort the gallery images using the distance:
$$d(g, P) = \operatorname{agg}\big(d(g, p_1), \dots, d(g, p_k)\big)$$
The aggregation function $\operatorname{agg}$ should take a vector with an arbitrary number of elements and return a single real number. There are different options for such a function, but we tested only mean and max in this challenge.
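Both retrieval strategies can be sketched side by side. This is an illustration only, assuming that the per-image comparison is expressed as a cosine similarity (larger means closer), so that max aggregation picks the gallery image closest to any probe image. All function names are hypothetical and plain lists stand in for the embeddings.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_score(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def rank_by_mean_embedding(probe_images: Sequence[Vector],
                           gallery: Sequence[Vector]) -> List[int]:
    """Average the probe's embeddings first, then compare once per gallery image."""
    mean_probe = [sum(column) / len(column) for column in zip(*probe_images)]
    scores = [cosine_similarity(g, mean_probe) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def rank_by_aggregation(probe_images: Sequence[Vector], gallery: Sequence[Vector],
                        aggregate: Callable[[List[float]], float]) -> List[int]:
    """Compare every probe image with every gallery image, then aggregate."""
    scores = [aggregate([cosine_similarity(g, p) for p in probe_images]) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
```

For example, `rank_by_aggregation(probe_images, gallery, max)` and `rank_by_aggregation(probe_images, gallery, mean_score)` give the two aggregation variants. The mean-embedding variant needs one comparison per gallery image, while the aggregated variant needs one per probe image and gallery image.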
With this approach, we need to compare the embedding of every probe image with every gallery image. This can be difficult to compute with reasonable resources when there is a large number of images per person. There is a variety of approaches for speeding up such a search, from reducing the latent space [13, 3] to constructing special structures, but in this work we only show (see IV-D) that using a mean embedding for a person is comparable with searching using the aggregation function.
IV-A Recognizing Families In the Wild Data Challenge
The Recognizing Families In the Wild Data Challenge (RFIW2020) focuses on determining blood relations based on visual facial similarities. For that, it has images from families for training and validation sets. This year there were a total of three tracks: one-to-one kinship verification (track 1), two-to-one kinship verification (track 2), and family search and retrieval (track 3). Our team chose to participate in tracks 1&3 so we will focus only on them.
In track 1 there were pairs for the final testing. Methods were evaluated based on the average accuracy of binary classification (kin, non-kin) over all of the testing pairs. Additional measurements were provided separately for every kinship type.
Track 3 is the image retrieval problem with one family member as a query (or probe) and other family members with distractors as a gallery. There were probe subjects (each with a different number of images) and images in the gallery. Methods were evaluated based on mean average precision (mAP) and rank@k metrics.
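The retrieval metrics can be sketched as follows. This is a minimal sketch using common definitions of average precision and rank@k; the exact definitions used by the challenge evaluation may differ, and the function names are hypothetical. Each probe is represented by a list of booleans that marks, in ranked order, whether each gallery image belongs to the probe's family.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Mean of precision values taken at the position of every relevant item."""
    hits = 0
    precision_sum = 0.0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked: Sequence[Sequence[bool]]) -> float:
    """Average precision averaged over all probes."""
    return sum(average_precision(r) for r in all_ranked) / len(all_ranked)


def rank_at_k(all_ranked: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of probes with at least one relevant item in the top k results."""
    return sum(1 for r in all_ranked if any(r[:k])) / len(all_ranked)
```

Under these definitions, mAP rewards placing all family members high in the ranking, while rank@k only checks whether any family member appears among the first k results.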
IV-B Implementation Details
We used MXNet for the implementation of our pipeline. For detection and feature extraction, the insightface Python package was used: in particular, retinaface_r50_v1, which is a RetinaFace implementation with ResNet50 as the backbone, and arcface_r100_v1, which is a modified ResNet101 trained with the ArcFace loss on the cleaned MS-Celeb-1M dataset.
Re-detected and aligned faces (as described in III-A) were given to the feature extractor model to obtain image embeddings. The performance of this approach (labeled pretrained in fig. 1) was used as a baseline to test our hypotheses.
First, we added a simple classification layer and fine-tuned the whole model on the training set with stochastic gradient descent with base learning rate of, momentum , linear warmup for first batches of size , linear cooldown for last batches, multiplying learning rate by on epochs for epochs. Random color jitter and random lighting with parameter were used for the data augmentation. No random cropping or similar technique was used, so as not to confuse the model, which was trained on similarly aligned images. After that, we added normalization of the embeddings and retrained the model starting from the pre-trained weights. The performance of these two models on our sampled validation pairs (see III-C) can be seen in fig. 1.
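The shape of the learning rate schedule described above (linear warmup, step decay at selected epochs, linear cooldown) can be sketched as a function of the training step. This is an illustration only: the function name, the parameter names, and every numeric value used with it are hypothetical and are not the settings used in the experiments.

```python
from typing import Sequence


def learning_rate(step: int, total_steps: int, steps_per_epoch: int, base_lr: float,
                  warmup_steps: int, cooldown_steps: int,
                  decay_epochs: Sequence[int], decay_factor: float) -> float:
    """Learning rate for a zero-based training step."""
    epoch = step // steps_per_epoch
    # Step decay: multiply by decay_factor once for every decay epoch reached.
    lr = base_lr * decay_factor ** sum(1 for e in decay_epochs if epoch >= e)
    if step < warmup_steps:
        # Linear warmup from a fraction of the rate up to the full rate.
        return lr * (step + 1) / warmup_steps
    remaining = total_steps - step
    if remaining <= cooldown_steps:
        # Linear cooldown towards zero over the last steps.
        return lr * remaining / cooldown_steps
    return lr
```

For instance, with hypothetical values of 100 total steps, 10 steps per epoch, a base rate of 0.1, 5 warmup and 5 cooldown steps, and a decay by 0.1 at epoch 5, the rate rises to 0.1 within the first five steps, drops to 0.01 from step 50, and falls linearly over the last five steps.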
IV-C Verification Results
We needed to binarize our predictions to submit for test verification on track 1. We chose threshold such that we had true positive rate (TPR) of on our sampled validation images for our first submission (pretrained in table I). Subsequent submissions were tested with different thresholds; among several strategies, the one that showed the best average performance was to choose the threshold such that the method would have a false positive rate (FPR) of . Other entries in table I are provided with that strategy of choosing the binarization threshold.
We compared simple fine-tuning using a classification layer (+classification) with a similar approach where the embeddings are normalized to have a unit norm before the classification layer (+normalization). On both our validation data and the test set, the second approach was superior. This indicates that consistency with the cosine distance metric that we use for image comparison is crucial for fine-tuning the model for kinship verification.
From comparison table I we can see that our approach performs poorly on the grandparents-grandchildren type of kinship because there is a small number of images with this type of relationship in the training set, but we can mitigate this bias by using a different threshold for every kind of relationship. Sadly, we could not fully test this idea due to lack of time, but we provide evidence for this hypothesis with the +different thresholds submission, where we improved the performance for the grandmother-grandson relationship by lowering the binarization threshold.
We should note that, though our approach scores first on average, it is not by a large margin and is mostly due to our strong performance on sibling pairs. Having an average accuracy at best, automatic kinship recognition still needs to be improved to be considered for usage in real-world applications. For reference, the best face verification models perform with FPR around .
IV-D Retrieval Results
In track 3 we needed to aggregate several embeddings per probe person to rank gallery images. For our baseline submission (pretrained), we used the averaged embedding of each probe and compared it to every gallery image using the cosine metric to get the resulting retrieval matrix. The same procedure was used with our best model from track 1 (+norm+class), and we can see that our approach improves not only the verification task but also retrieval. Then we compared this search method with different aggregation functions (see III-E). We can see that averaging embeddings for probe subjects performs worse than searching using all available embeddings with an aggregation function, but is still comparable. Furthermore, we can see that the max aggregation function, which searches for the gallery image that is closest to any of the query images, has a higher rank@K metric than mean aggregation but a lower mAP.
Table II shows that even the pre-trained ArcFace model performs much better than the other competitors and our pipeline improved this performance even further. But even such great performance is too low for trying to use this pipeline in a real-world scenario.
V Conclusions and Future Works
In this work we show that using better face verification models is crucial for improving kinship recognition due to more available data. We presented a new baseline for kinship verification and retrieval tasks, which is based on a more accurate face recognition model than the previous baseline. Furthermore, our pipeline designed for the verification task improved on this result and achieved the best performance in the recent challenge.
In future work we plan to provide a more thorough analysis of methods suitable for the automatic kinship recognition task. Different feature extractors and ensembling are the most promising next steps from our perspective.
The author would like to thank Nikolai Amiantov, Konstantin Aleshkin, and Anastasia Belikova for the helpful discussion in preparing this publication, all reviewers for their valuable comments, and the competition organizers for the opportunity to show this work.
[1] (2019-05) Heatmap-guided balanced deep convolution networks for family classification in the wild. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019).
[2] Aggregating local deep features for image retrieval. In Proceedings of the IEEE international conference on computer vision, pp. 1269–1277.
[3] (2018) Revisiting the inverted indices for billion-scale approximate nearest neighbors. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 202–216.
[4] (2018) Vggface2: a dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 67–74.
[5] Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems.
[6] (2019) ArcFace: additive angular margin loss for deep face recognition. In CVPR.
[7] (2019) RetinaFace: single-stage dense face localisation in the wild. In arxiv.
[8] (2019) Lightweight face recognition challenge. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0.
[9] (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pp. 647–655.
[10] (2018-11) Ongoing face recognition vendor test (FRVT) part 2: identification. Technical report, National Institute of Standards and Technology.
[11] (2019-12) Face recognition vendor test part 3: demographic effects. Technical report, National Institute of Standards and Technology.
[12] (2016) Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. In European conference on computer vision, pp. 87–102.
[13] (2010) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128.
[14] The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4873–4882.
[15] (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
[16] Kinship verification based deep and tensor features through extreme learning machine. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pp. 1–4.
[17] (2017) Kinnet: fine-to-coarse deep metric learning for kinship verification. In Proceedings of the 2017 Workshop on Recognizing Families In the Wild, pp. 13–20.
[18] (2017) SphereFace: deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] (2018) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence.
[20] (2018) Iarpa janus benchmark-c: face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pp. 158–165.
[21] (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368.
[22] (2019) Kinship verification using deep siamese convolutional neural network. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pp. 1–5.
[23] (2016) Families in the wild (fiw): large-scale kinship image database and benchmarks. In Proceedings of the 2016 ACM on Multimedia Conference, pp. 242–246.
[24] (2018) Visual kinship recognition of families in the wild. IEEE transactions on pattern analysis and machine intelligence 40 (11), pp. 2624–2637.
[25] (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823.
[26] (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274.
[27] (2019) Classification is a strong baseline for deep metric learning. In British Machine Vision Conference (BMVC).
[28] (2017) SIFT meets cnn: a decade survey of instance retrieval. IEEE transactions on pattern analysis and machine intelligence 40 (5), pp. 1224–1244.