Top 3 in FG 2021 Families In the Wild Kinship Verification Challenge

by Junyi Huang, et al.

Kinship verification is the task of determining whether a parent-child, sibling, or grandparent-grandchild relationship exists between two people and is important in social media applications, forensic investigations, finding missing children, and reuniting families. We demonstrate high quality kinship verification by participating in the FG 2021 Recognizing Families in the Wild challenge which provides the largest publicly available dataset in the field. Our approach is among the top 3 winning entries in the competition. We ensemble models written by both human experts and OpenAI Codex. We make our models and code publicly available.




I Introduction

The ability to recognize kinship between faces based only on images presents an important contribution to applications such as social media, forensics, reuniting families, and genealogy. However, these fields each possess unique datasets that vary widely in image quality, lighting conditions, pose, facial expression, and viewing angle, which makes creating a generally applicable image processing algorithm quite challenging. To address these issues, the annual Recognizing Families In the Wild (RFIW) automatic kinship recognition challenge releases a sizeable multi-task dataset to aid the development of modern data-driven approaches to these important visual kin-based problems [5, 6, 7, 8].

We develop and evaluate models for the kinship verification task of the RFIW 2021 challenge, which entails the binary classification of two pictures’ relationship as kin or non-kin. We use the architecture shown in Figure 1, in which a variety of models are written both by human experts and automatically by OpenAI Codex [1]. This is the first use of program synthesis to generate a diverse set of neural network models. The models are then ensembled to predict the confidence that a pair of face images are kin. Each model utilizes Siamese convolutional backbones with pre-trained weights to encode a one-dimensional embedding of each image. We combine the embeddings through feature fusion [2, 3, 11] and feed the fused encoding through a series of fully connected layers in order to make a prediction. The predictions of many models are ensembled before applying a threshold to obtain a binary classification. We present an ablation study to quantify the impact of our methods.

Fig. 1: System architecture: We use multiple deep Siamese networks. A pair of images for verification are fed through a pre-trained convolutional backbone [2, 3]. The backbones project the images into a latent feature space which are flattened and then combined by feature fusion [11]. The result of the feature fusion is fed through a fully connected network in which the final layer is a single binary classification predicting kin or non-kin. We ensemble multiple Siamese networks written by both human experts and OpenAI Codex.
Fig. 2: Example face pairs of each relationship type in the FIW dataset: BB - Brother-Brother, SS - Sister-Sister, SIBS - Brother-Sister (left column); FD - Father-Daughter, FS - Father-Son, MD - Mother-Daughter, MS - Mother-Son (second left column); GFGD - Grandfather-Granddaughter, GFGS - Grandfather-Grandson, GMGD - Grandmother-Granddaughter, GMGS - Grandmother-Grandson (third left column); Not Kin (right column).

II Methods

II-A Dataset

Families In the Wild (FIW) [5] is the largest database for kinship recognition to date. The FIW dataset is split into disjoint training, validation, and test sets. Table I provides our exact splits, with the number of unique faces and families in our dataset.

The test set consists of roughly an equal number of positive and negative examples. For each image, the dataset contains: (i) a binary kinship label, kin or not kin; (ii) the type of relationship if one exists; (iii) a unique ID for each person in a family, and the families are disjoint.

There are 11 types of relationships in the RFIW challenge dataset split into three overarching groups as shown in Figure 2:

  • Parent-child pairs (4): father-daughter (FD), father-son (FS), mother-daughter (MD), mother-son (MS).

  • Sibling pairs (3): sister-sister (SS), brother-brother (BB), and brother-sister (SIBS).

  • Grandparent-grandchild pairs (4): grandfather-granddaughter (GFGD), grandfather-grandson (GFGS), grandmother-granddaughter (GMGD), grandmother-grandson (GMGS).

II-B Data Augmentation

We apply image transformations to regularize our models and improve their generalization ability, and perform experiments to identify transformations that improve validation and test accuracy. Our best performing model, shown in Table III, applies data augmentation to the input pair, although we did not observe a significant change in the generalization error of our models. Further experimentation shows that transformations such as large-angle rotations and vertical flips of the images during training may degrade model performance. Augmentations such as random small-angle rotations, minor crops, horizontal flips, and color channel transformations such as brightness shifts regularize our models, particularly in scenarios where we allow fine-tuning of the backbones.

In addition to leveraging data augmentation during training, we also introduce test time augmentation [10]. We evaluate our trained model on the raw test image pair, and also on a variety of augmented versions of this pair. We generate two additional copies of the input pair and perform a horizontal flip on one copy and a color transformation of the other. We then predict the kinship between the two augmented pairs and original pair, and average their confidences. We leave the order of the pair consistent, since our model’s feature fusion [11] is invariant to the ordering of the images. Other models that are not invariant to the image order may benefit from swapping the images as an additional transformation.
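A minimal numpy sketch of this test-time augmentation scheme: the raw pair plus a horizontally flipped copy and a brightness-shifted copy are scored and the confidences averaged. Here `predict` stands in for the trained model's confidence function, which is an assumption; the paper does not name its interface.

```python
import numpy as np

def predict_with_tta(predict, img_a, img_b):
    """Average the kin confidence over the raw pair and two augmented copies:
    one horizontally flipped, one brightness-shifted (per the text)."""
    def hflip(img):
        return img[:, ::-1, :]                 # mirror the width axis

    def brighten(img, delta=0.1):
        return np.clip(img + delta, 0.0, 1.0)  # simple brightness shift

    pairs = [
        (img_a, img_b),                        # original pair
        (hflip(img_a), hflip(img_b)),          # flipped copy
        (brighten(img_a), brighten(img_b)),    # color-transformed copy
    ]
    return float(np.mean([predict(a, b) for a, b in pairs]))
```

Because the model's feature fusion is order-invariant, the pair order is left unchanged; an order-sensitive model could add the swapped pair as a fourth entry.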

Split   # of unique faces   # of families
TABLE I: Dataset splits: Number of unique faces and number of families used for training, validation, and testing.

II-C Architecture

We utilize a deep Siamese network for kinship verification [11]. The deep Siamese network contains two separate branches, where each branch is given one image from the pair selected for verification. Each branch begins with a deep convolutional neural network which projects the image into a latent feature space. The resulting feature vectors from the separate branches are then combined by feature fusion [11] and fed into a fully connected network to capture non-linear interactions between the two feature vectors. The final layer of the fully connected network is a binary classifier which predicts whether a given pair of images is kin or non-kin.
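As an illustration of the data flow only (not the paper's actual implementation), a toy numpy version with a fixed linear projection standing in for the pre-trained convolutional backbone and a single linear layer standing in for the fully connected network:

```python
import numpy as np

def backbone(img, W):
    """Stand-in for a pre-trained convolutional backbone (ResNet50/SENet50
    in the paper): a fixed linear projection of the flattened image into a
    latent feature space."""
    return np.tanh(img.reshape(-1) @ W)

def siamese_verify(img_a, img_b, W, head_w, head_b=0.0):
    """Both branches share the same backbone weights W (the Siamese property);
    the fused features pass through a linear head and a sigmoid to yield a
    kin confidence in (0, 1)."""
    za, zb = backbone(img_a, W), backbone(img_b, W)   # shared weights
    fused = np.concatenate([za * zb, (za - zb) ** 2]) # order-invariant fusion
    logit = fused @ head_w + head_b
    return 1.0 / (1.0 + np.exp(-logit))
```

Because both fusion terms are symmetric in the two embeddings, swapping the input images leaves the prediction unchanged.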

II-D Feature Fusion

Each backbone of our architecture produces a one-dimensional embedding for an input image. Each image embedding x1 in the pair is then fused with its counterpart x2 by (i) taking the Hadamard product of the feature vectors (x1 ⊙ x2); (ii) the squared difference of the feature vectors ((x1 − x2)²); and (iii) the absolute value of the difference of squares (|x1² − x2²|).
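The three fusion terms can be written out as a short sketch. Concatenating the terms is an assumption here, suggested by the ablation study's mention of "concatenated features" but not spelled out in this section:

```python
import numpy as np

def fuse(x1, x2):
    """Fuse two 1-D embeddings via the three terms in the text:
    Hadamard product, squared difference, and absolute difference
    of squares, concatenated into one feature vector."""
    return np.concatenate([
        x1 * x2,                # (i)  Hadamard product x1 ⊙ x2
        (x1 - x2) ** 2,         # (ii) squared difference (x1 − x2)²
        np.abs(x1**2 - x2**2),  # (iii) |x1² − x2²|
    ])
```

All three terms are symmetric in x1 and x2, which is why the fusion is invariant to the ordering of the input images.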

II-E Ensemble

To increase the generalization ability and robustness of our model, we ensemble the results of a diverse set of network architectures to obtain a final prediction. These architectures include models with different feature extraction backbones to leverage the features learned by disparate network structures. We use ResNet50 [2], SENet50 [3], FaceNet [9], and VGGFace [4] as backbones.

During training, we also ensemble across different splits of the training data. We first split the data into k folds, selecting one fold as the validation data for networks trained on all other folds. We repeat the process from fold 1 through fold k, which yields k ensemble member networks per backbone model. With the 4 backbone models mentioned above, for instance, 4k ensemble member networks are generated for prediction.
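A sketch of the fold construction and prediction-time averaging, assuming uniform contiguous index splits (the actual fold assignment is not specified in the text):

```python
import numpy as np

def kfold_members(n_samples, k):
    """Split sample indices into k folds; member i trains on all folds
    except fold i and validates on fold i, yielding k ensemble members
    per backbone (so 4 backbones give 4k members in total)."""
    folds = np.array_split(np.arange(n_samples), k)
    members = []
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        members.append({"train": train, "val": folds[i]})
    return members

def ensemble_confidence(member_confidences):
    """Average the per-member kin confidences into the ensemble prediction."""
    return float(np.mean(member_confidences))
```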

II-F Sampling

Key features of the FIW dataset make sampling non-trivial: some people have more pictures than others, and some families have more people than others. Any sampling method must therefore compromise between sampling evenly across people and across families. Based on our ablation studies, we found that models perform well when we increase diversity. Therefore, we prioritize sampling evenly across (i) families, (ii) people, and (iii) pictures, in that order of priority.
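This priority order amounts to hierarchical sampling: family first, then person, then picture, so large families and heavily photographed people are not over-represented. The nested-dict layout below (family id → person id → list of picture paths) is an illustrative assumption, not the dataset's actual format:

```python
import random

def sample_face(families, rng=random):
    """Hierarchical sampling per the stated priority: uniform over families,
    then uniform over a family's members, then uniform over that person's
    pictures."""
    fam = rng.choice(sorted(families))          # (i)   uniform over families
    person = rng.choice(sorted(families[fam]))  # (ii)  uniform over members
    picture = rng.choice(families[fam][person]) # (iii) uniform over pictures
    return fam, person, picture
```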

II-G Test Set Split

The test set includes roughly the same numbers of positive pairs and negative pairs. We utilize this information by setting an adaptive threshold on the original model outputs which makes our final prediction roughly equally split into positive and negative labels. Knowing the structure of the test set distributions allows us to factor in prior probabilities to our predictions, which slightly improves model performance.
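One way to realize such an adaptive threshold is to cut at the median confidence, which by construction splits the predictions roughly 50/50. This is a sketch of the idea; the paper does not specify its exact thresholding mechanism:

```python
import numpy as np

def adaptive_threshold(confidences):
    """Threshold at the median model confidence so that the binary labels
    split roughly evenly into kin / non-kin, matching the known balance
    of the test set. Returns (labels, threshold)."""
    conf = np.asarray(confidences, dtype=float)
    t = float(np.median(conf))
    return (conf > t).astype(int), t
```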

II-H Program Synthesis

We leverage program synthesis to improve performance on this challenge by synthesizing architectural components and hyperparameters. To do so, we provide prompts including part of our model code to OpenAI’s Codex [1] and incorporate the generated architectural and hyperparameter changes as model variants.

These architectural changes written automatically by OpenAI Codex [1] suggest different combinations for stacking and mixing feature maps in Siamese networks. The same applies to prompting Codex to ensemble multiple models together – through a series of well-defined sentences provided as prompts, Codex is able to write code for ensembling multiple models together and improve overall performance.

While Codex is unable to solve an open-ended coding task such as “Build a model that can recognize whether a kinship relation is present between two facial images”, we find that providing guidance through human code snippets allows Codex to solve these tasks by automatically writing variants of existing code. We apply program synthesis to rapidly generate a diverse set of models, and include these variations in our ensemble.

III Results

We perform an ablation study, as shown in Table II, on the number of dropout layers, batch normalization layers, the addition of the difference of squares in feature fusion, sampling techniques, test-time augmentation, and model ensembling. The results show that all the proposed architectural components improve upon the baseline. The strongest result is achieved by ensembling multiple instances of multiple models, which is our full model.

Model                                                                    Accuracy
Ensembling multiple instances of multiple models                         0.741
Ensembling multiple instances of one model                               0.726
Adding |x1² − x2²| to the concatenated features, test-time augmentation  0.730
Adding batch normalization layers and improving sampling                 0.685
Adding dropout layers                                                    0.633
Adding training pairs                                                    0.579
Adding a dropout layer                                                   0.525
Baseline model                                                           0.510
TABLE II: Human ablation study: A ranking of different methods that we use to improve our models. Multiple ablation experiments are performed on a smaller dataset consisting of 5,045 training images and 4,437 test images for faster turnaround time. Each improvement in ranking represents an enhancement on top of the prior model. The best performing method consisted of an ensemble of multiple models with different architectures.
Team          Year  Avg       BB        SIBS      SS        FD        FS        MD        MS        GFGD      GFGS      GMGD      GMGS
zxm123 2021 0.80 (1) 0.82 (1) 0.80 (1) 0.84 (1) 0.75 (4) 0.82 (1) 0.80 (1) 0.77 (2) 0.76 (3) 0.71 (4) 0.75 (2) 0.59 (10)
vuvko 2020 0.78 (2) 0.80 (2) 0.77 (2) 0.80 (2) 0.75 (7) 0.81 (5) 0.78 (2) 0.74 (7) 0.78 (2) 0.69 (6) 0.76 (1) 0.60 (9)
nc2893 2021 0.77 (3) 0.79 (3) 0.75 (4) 0.79 (3) 0.76 (2) 0.78 (12) 0.75 (7) 0.74 (9) 0.70 (15) 0.67 (9) 0.70 (8) 0.59 (10)
jh3450 2021 0.77 (3) 0.79 (3) 0.75 (4) 0.79 (3) 0.76 (2) 0.78 (12) 0.75 (7) 0.74 (9) 0.70 (15) 0.67 (9) 0.70 (8) 0.59 (10)
paw2140 2021 0.77 (4) 0.78 (4) 0.75 (5) 0.79 (4) 0.75 (6) 0.78 (12) 0.76 (4) 0.74 (8) 0.68 (18) 0.69 (7) 0.72 (5) 0.59 (10)
DeepBlueAI 2020 0.76 (5) 0.77 (5) 0.75 (6) 0.77 (5) 0.74 (8) 0.81 (6) 0.75 (6) 0.74 (10) 0.72 (7) 0.73 (3) 0.67 (11) 0.68 (1)
ustc-nelslip 2021 0.76 (6) 0.75 (6) 0.72 (9) 0.74 (8) 0.76 (3) 0.82 (2) 0.75 (8) 0.75 (4) 0.79 (1) 0.69 (7) 0.76 (1) 0.67 (2)
TABLE III: RFIW kinship verification accuracy scores of top 7 entries in 2020-2021. The Table shows accuracy for each of the 11 types of relationships (three sibling pairs BB, SIBS, SS; four parent-child FD, FS, MD, MS; and four grandparent-grandchild relationships GFGD, GFGS, GMGD, GMGS) and the average accuracy. Our top 3 entries are shown in bold.
Method Average BB SIBS SS FD FS MD MS
Ensemble 0.77 0.80 0.77 0.80 0.74 0.76 0.77 0.75
Codex model variant 1 0.75 0.78 0.75 0.78 0.71 0.73 0.75 0.72
Codex model variant 2 0.76 0.80 0.76 0.79 0.72 0.76 0.76 0.72
Human model variant 1 0.75 0.78 0.74 0.78 0.73 0.75 0.75 0.73
Human model variant 2 0.76 0.79 0.76 0.79 0.75 0.77 0.75 0.75
TABLE IV: RFIW kinship verification accuracy scores of our ensemble, Codex model variants and human model variants. Human model variant 1 is based on ResNet50 [2] and human model variant 2 is based on SENet50 [3]. Codex model variants automatically write multiple fully connected layers given the human model variants. Finally, we ensemble all four model variations.

Table III contains the results of our best performing model trained on the 2021 RFIW kinship verification dataset, shown by the bold entries in the team column. We compare our results to the top submissions from 2020-2021. Our model performs in the top three overall. In Table IV, the prediction accuracies of four ensemble networks and a super ensemble model are compared. Human variant 1 and Codex variant 1 are constructed with ResNet50 [2] as the backbone model, while human variant 2 and Codex variant 2 use SENet50 [3]. The Codex variants contain architectural modifications of the human variants with more fully connected layers after the Siamese networks. Multiple instances of each model are trained by k-fold cross-validation. All models are ensembled to form a model which performs best overall. We make our code and models publicly available.

IV Conclusions

This work achieves a top 3 position in the FG 2021 kinship verification challenge over all years. We use a base Siamese network architecture for predictions, and our top performing model is an ensemble of diverse models. This work is the first to use models written both by human experts and automatically by OpenAI Codex [1]. It provides a strong method for the important kinship verification task and demonstrates that a hybrid human-machine approach may advance the field when applied to other common task framework challenges.


  • [1] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [3] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
  • [4] O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. In British Machine Vision Conference.
  • [5] J. P. Robinson, M. Shao, Y. Wu, and Y. Fu (2016) Families in the wild (FIW): Large-scale kinship image database and benchmarks. In Proceedings of the ACM on Multimedia Conference.
  • [6] J. P. Robinson, M. Shao, and Y. Fu (2021) Survey on the analysis and modeling of visual kinship: A decade in the making. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [7] J. P. Robinson, M. Shao, Y. Wu, H. Liu, T. Gillis, and Y. Fu (2018) Survey on the analysis and modeling of visual kinship: A decade in the making. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [8] J. P. Robinson, Y. Yin, Z. Khan, M. Shao, S. Xia, M. Stopa, S. Timoner, M. A. Turk, R. Chellappa, and Y. Fu (2020) Recognizing families in the wild (RFIW): The 4th edition. In IEEE International Conference on Automatic Face and Gesture Recognition, pp. 857–862.
  • [9] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
  • [10] Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt (2020) Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pp. 9229–9248.
  • [11] J. Yu, M. Li, X. Hao, and G. Xie (2020) Deep fusion siamese network for automatic kinship verification. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 892–899.