The ability to recognize kinship between faces from images alone is valuable for applications such as social media, forensics, reuniting families, and genealogy. However, each of these fields has its own datasets, which vary widely in image quality, lighting conditions, pose, facial expression, and viewing angle, making it challenging to create an image processing algorithm that works in general. To address these issues, the annual automatic kinship recognition challenge Recognizing Families In the Wild (RFIW) releases a sizeable multi-task dataset to aid the development of modern data-driven approaches for solving these important visual kin-based problems [5, 6, 7, 8].
We develop and evaluate models for the task of kinship verification in the RFIW 2021 challenge, which entails binary classification of the relationship between two pictures as kin or non-kin. We use the architecture shown in Figure 1, in which a variety of models are written both by human experts and automatically by OpenAI Codex. To our knowledge, this is the first use of program synthesis to generate a diverse set of neural network models. The models are then ensembled to predict the confidence that a pair of face images are kin. Each model uses Siamese convolutional backbones with pre-trained weights to encode each image as a one-dimensional embedding. We combine the embeddings through feature fusion [2, 3, 11] and feed the fused encoding through a series of fully connected layers to make a prediction. The predictions of many models are ensembled before a threshold is applied to obtain a binary classification. We present an ablation study to quantify the impact of our methods.
Families In the Wild (FIW) is the largest database for kinship recognition to date. The FIW dataset is split into disjoint training, validation, and test sets. Table I gives our exact splits, with the number of unique faces and families in each split.
The test set consists of roughly an equal number of positive and negative examples. For each image, the dataset contains: (i) a binary kinship label, kin or not kin; (ii) the type of relationship if one exists; (iii) a unique ID for each person in a family, and the families are disjoint.
There are 11 types of relationships in the RFIW challenge dataset split into three overarching groups as shown in Figure 2:
Parent-child pairs (4): father-daughter (FD), father-son (FS), mother-daughter (MD), mother-son (MS).
Sibling pairs (3): sister-sister (SS), brother-brother (BB), and brother-sister (SIBS).
Grandparent-grandchild pairs (4): grandfather-granddaughter (GFGD), grandfather-grandson (GFGS), grandmother-granddaughter (GMGD), grandmother-grandson (GMGS).
II-B Data Augmentation
We apply image transformations to regularize our models and improve their generalization ability, and we run experiments to identify the transformations that improve validation and test accuracy. Our best performing model, shown in Table III, applies data augmentation to the input pair. We did not observe a significant change in the generalization error of our models. Further experimentation shows that transformations such as large-angle rotations and vertical flips during training may degrade model performance. Augmentations such as random small-angle rotations, minor crops, horizontal flips, and color channel transformations such as brightness shifts regularize our models, particularly when we allow fine-tuning of the backbones.
In addition to using data augmentation during training, we also introduce test time augmentation. We evaluate our trained model on the raw test image pair and on augmented versions of this pair. We generate two additional copies of the input pair, applying a horizontal flip to one copy and a color transformation to the other. We then predict kinship for the original pair and the two augmented pairs, and average the three confidences. We keep the order of the images within the pair fixed, since our model's feature fusion is invariant to the ordering of the images. Models that are not invariant to image order may benefit from swapping the images as an additional transformation.
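The test time augmentation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: images are represented as nested lists of RGB tuples, `predict_kin` is a hypothetical callable that returns a kinship confidence in [0, 1], and the brightness shift of 20 is an arbitrary stand-in for the unspecified color transformation.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[int, ...]
Image = List[List[Pixel]]  # rows of RGB pixels; a stand-in for a real image tensor


def hflip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in img]


def brighten(img: Image, delta: int = 20) -> Image:
    """Shift every color channel by `delta`, clipped to [0, 255] (assumed transform)."""
    return [[tuple(min(255, max(0, c + delta)) for c in px) for px in row] for row in img]


def tta_confidence(
    predict_kin: Callable[[Image, Image], float], img_a: Image, img_b: Image
) -> float:
    """Average the confidence over the original pair and two augmented copies."""
    pairs = [
        (img_a, img_b),                      # raw test pair
        (hflip(img_a), hflip(img_b)),        # horizontally flipped copy
        (brighten(img_a), brighten(img_b)),  # color-transformed copy
    ]
    scores = [predict_kin(a, b) for a, b in pairs]
    return sum(scores) / len(scores)
```

The image order within each pair is left unchanged, which matches the text: swapping would add nothing for an order-invariant fusion.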
|Split||# single unique faces||# of families|
We use a deep Siamese network for kinship verification. The network contains two separate branches, each of which is given one image from the pair selected for verification. Each branch begins with a deep convolutional neural network that projects the image into a latent feature space. The feature vectors from the two branches are then combined by feature fusion and fed into a fully connected network to capture non-linear interactions between them. The final layer of the fully connected network is a binary classifier that predicts whether a given pair of images is kin or non-kin.
II-D Feature Fusion
Each backbone of our architecture produces a one-dimensional embedding for an input image. The two embeddings in a pair, denoted here $x_1$ and $x_2$, are then fused by taking (i) the Hadamard product of the feature vectors, $x_1 \odot x_2$; (ii) the squared difference of the feature vectors, $(x_1 - x_2)^2$; and (iii) the absolute value of the difference of squares, $|x_1^2 - x_2^2|$, with all operations applied element-wise.
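As a minimal sketch of the three fusion terms, write the two embeddings as `x1` and `x2` (names chosen here for illustration). Joining the three terms by concatenation is an assumption, suggested by the mention of concatenated features in the ablation table; the function name `fuse` is hypothetical.

```python
from typing import List, Sequence


def fuse(x1: Sequence[float], x2: Sequence[float]) -> List[float]:
    """Fuse two embeddings with three element-wise, order-invariant terms."""
    if len(x1) != len(x2):
        raise ValueError("embeddings must have the same length")
    hadamard = [a * b for a, b in zip(x1, x2)]                   # x1 * x2
    sq_diff = [(a - b) ** 2 for a, b in zip(x1, x2)]             # (x1 - x2)^2
    abs_diff_sq = [abs(a * a - b * b) for a, b in zip(x1, x2)]   # |x1^2 - x2^2|
    # Assumption: the three terms are concatenated into one fused vector.
    return hadamard + sq_diff + abs_diff_sq
```

Each term is unchanged when `x1` and `x2` are swapped, so `fuse(x1, x2) == fuse(x2, x1)`. This is the property that makes the prediction independent of image order.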
To increase the generalization ability and robustness of our model, we ensemble the results of a diverse set of network architectures to obtain a final prediction. These architectures use different feature extraction backbones to leverage the features learned by disparate network structures. We use ResNet50, SENet50, FaceNet, and VGGFace as backbones.
During training, we also ensemble across different splits of the training data. We first split the data into $k$ folds, selecting one fold as the validation data for networks trained on all other folds. We repeat the process for fold 1 to fold $k$, which results in $k$ ensemble member networks based on one individual backbone model. With the 4 backbone models mentioned above, for instance, $4k$ ensemble member networks are generated for prediction.
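The fold-by-backbone bookkeeping can be sketched as below. This is an illustration under assumptions: the helper names are hypothetical, the items being split could be family IDs (so that families stay disjoint between training and validation), and the value of k used in the usage note is arbitrary.

```python
from typing import List, Sequence, Tuple


def kfold_splits(items: Sequence, k: int) -> List[Tuple[list, list]]:
    """Partition `items` into k folds and hold each fold out once as validation."""
    folds = [list(items[i::k]) for i in range(k)]
    splits = []
    for held_out in range(k):
        val = folds[held_out]
        train = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
        splits.append((train, val))
    return splits


def ensemble_members(backbones: Sequence[str], items: Sequence, k: int) -> List[tuple]:
    """One (backbone, train, val) member per backbone and fold."""
    return [(b, train, val) for b in backbones for train, val in kfold_splits(items, k)]
```

For example, with the four backbones named in the text and an illustrative `k = 5`, `ensemble_members` returns 20 members, each of which would be trained separately and then averaged at prediction time.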
Two features of the FIW dataset make sampling non-trivial: some people have more pictures than others, and some families have more people than others. Any sampling method must compromise between sampling evenly across pairs of people and across families. In our ablation studies, we found that models perform well when we increase diversity. We therefore prioritize sampling evenly across (i) families, (ii) people, and (iii) pictures, in that order.
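One way to realize this priority order is hierarchical sampling: pick a family uniformly, then a person within it, then a picture of that person. The sketch below is a simplification that draws a single picture rather than a pair, and the nested-dictionary layout and function name are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

# Assumed layout: family ID -> person ID -> list of picture paths.
Dataset = Dict[str, Dict[str, List[str]]]


def sample_picture(data: Dataset, rng: random.Random) -> Tuple[str, str, str]:
    """Sample uniformly over families, then people, then pictures."""
    family = rng.choice(sorted(data))
    person = rng.choice(sorted(data[family]))
    picture = rng.choice(data[family][person])
    return family, person, picture
```

Under this scheme a family with one picture is drawn as often as a family with hundreds, which favors diversity across families over uniform coverage of individual pictures.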
II-G Test Set Split
The test set includes roughly the same number of positive and negative pairs. We use this information by setting an adaptive threshold on the raw model outputs so that our final predictions are split roughly equally between positive and negative labels. Knowing the structure of the test set distribution allows us to factor prior probabilities into our predictions, which slightly improves model performance.
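A simple way to obtain a roughly even split is to threshold at the median of the test-set confidences. The text does not state how the adaptive threshold is computed, so the median rule and the function names below are assumptions.

```python
from statistics import median
from typing import List, Sequence


def adaptive_threshold(scores: Sequence[float]) -> float:
    """Median of the scores: about half of the pairs fall on each side."""
    return median(scores)


def classify(scores: Sequence[float]) -> List[int]:
    """Label a pair as kin (1) when its score exceeds the adaptive threshold."""
    threshold = adaptive_threshold(scores)
    return [1 if s > threshold else 0 for s in scores]
```

With scores `[0.9, 0.8, 0.85, 0.95]`, a fixed threshold of 0.5 would label every pair as kin, whereas the median rule labels exactly two of the four as kin.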
II-H Program Synthesis
We use program synthesis to improve performance on this challenge by synthesizing architectural components and hyperparameters. To do so, we provide OpenAI's Codex with prompts that include part of our model code, and incorporate the generated architectural and hyperparameter changes as model variants.
These architectural changes, written automatically by OpenAI Codex, suggest different combinations for stacking and mixing feature maps in Siamese networks. The same applies to ensembling: given a series of well-defined sentences as prompts, Codex is able to write code that ensembles multiple models together and improves overall performance.
While Codex is unable to solve an open-ended coding task such as "Build a model that can recognize whether a kinship relation is present between two facial images", we find that providing guidance through human-written code snippets allows Codex to solve these tasks by automatically writing variants of existing code. We apply program synthesis to rapidly generate a diverse set of models, and include these variations in our ensemble.
We perform an ablation study, shown in Table II, on the number of dropout layers, batch normalization layers, the addition of the difference of squares in feature fusion, sampling techniques, test time augmentation, and model ensembling. The results show that all the proposed architectural components improve upon the baseline. The strongest result is achieved by ensembling multiple instances of multiple models, which is our full model.
|Ensembling multiple instances of multiple models||0.741|
|Ensembling multiple instances of one model||0.726|
|Adding to the concatenated features|
|Test time data augmentation||0.730|
|Adding batch normalization layers and improving sampling||0.685|
|Adding dropout layers||0.633|
|Adding training pairs||0.579|
|Adding a dropout layer||0.525|
|zxm123||2021||0.80 (1)||0.82 (1)||0.80 (1)||0.84 (1)||0.75 (4)||0.82 (1)||0.80 (1)||0.77 (2)||0.76 (3)||0.71 (4)||0.75 (2)||0.59 (10)|
|vuvko||2020||0.78 (2)||0.80 (2)||0.77 (2)||0.80 (2)||0.75 (7)||0.81 (5)||0.78 (2)||0.74 (7)||0.78 (2)||0.69 (6)||0.76 (1)||0.60 (9)|
|nc2893||2021||0.77 (3)||0.79 (3)||0.75 (4)||0.79 (3)||0.76 (2)||0.78 (12)||0.75 (7)||0.74 (9)||0.70 (15)||0.67 (9)||0.70 (8)||0.59 (10)|
|jh3450||2021||0.77 (3)||0.79 (3)||0.75 (4)||0.79 (3)||0.76 (2)||0.78 (12)||0.75 (7)||0.74 (9)||0.70 (15)||0.67 (9)||0.70 (8)||0.59 (10)|
|paw2140||2021||0.77 (4)||0.78 (4)||0.75 (5)||0.79 (4)||0.75 (6)||0.78 (12)||0.76 (4)||0.74 (8)||0.68 (18)||0.69 (7)||0.72 (5)||0.59 (10)|
|DeepBlueAI||2020||0.76 (5)||0.77 (5)||0.75 (6)||0.77 (5)||0.74 (8)||0.81 (6)||0.75 (6)||0.74 (10)||0.72 (7)||0.73 (3)||0.67 (11)||0.68 (1)|
|ustc-nelslip||2021||0.76 (6)||0.75 (6)||0.72 (9)||0.74 (8)||0.76 (3)||0.82 (2)||0.75 (8)||0.75 (4)||0.79 (1)||0.69 (7)||0.76 (1)||0.67 (2)|
|Codex model variant 1||0.75||0.78||0.75||0.78||0.71||0.73||0.75||0.72|
|Codex model variant 2||0.76||0.80||0.76||0.79||0.72||0.76||0.76||0.72|
|Human model variant 1||0.75||0.78||0.74||0.78||0.73||0.75||0.75||0.73|
|Human model variant 2||0.76||0.79||0.76||0.79||0.75||0.77||0.75||0.75|
Table III contains the result of our best performing model trained on the 2021 RFIW kinship verification dataset, shown by the bold entry in the user column. We compare our results to the top submissions from 2020-2021. Our model ranks in the top three overall. Table IV compares the prediction accuracy of four ensemble networks and a super ensemble model. Human variant 1 and Codex variant 1 use ResNet50 as the backbone model, while human variant 2 and Codex variant 2 use SENet50. The Codex variants are architectural modifications of the human variants, with more fully connected layers after the Siamese networks. Multiple instances of each model are trained by k-fold cross-validation. All models are ensembled to form the model that performs best overall. We make our code and models publicly available at https://tinyurl.com/xfe72kvr.
This work achieves a top 3 position in the FG 2021 kinship verification challenge over all years. We use a base Siamese network architecture for predictions. Our top performing model is an ensemble of diverse models. This work is the first to use models written both by human experts and automatically by OpenAI Codex. It provides a strong method for the important kinship verification task, and demonstrates that a hybrid human-machine approach may advance the field when applied to other common task framework challenges.
[1] Evaluating large language models trained on code. arXiv preprint 2107.03374.
[2] (2016) Deep residual learning for image recognition. pp. 770–778.
[3] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
[4] Deep face recognition. British Machine Vision Conference.
[5] (2016) Families in the wild (FIW): Large-scale kinship image database and benchmarks. In Proceedings of the ACM on Multimedia Conference.
[6] (2021) Survey on the analysis and modeling of visual kinship: A decade in the making. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[7] (2018) Survey on the analysis and modeling of visual kinship: A decade in the making. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[8] (2020) Recognizing families in the wild (RFIW): The 4th Edition. In IEEE International Conference on Automatic Face and Gesture Recognition, pp. 857–862.
[9] (2015) FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
[10] Test-time training with self-supervision for generalization under distribution shifts. International Conference on Machine Learning, pp. 9229–9248.
[11] (2020) Deep fusion siamese network for automatic kinship verification. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 892–899.