Recognizing Families In the Wild (RFIW): The 5th Edition

by Joseph P. Robinson, et al.

Recognizing Families In the Wild (RFIW), held as a data challenge in conjunction with the 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG), is a large-scale, multi-track visual kinship recognition evaluation. This is our fifth edition of RFIW, for which we continue the effort to attract scholars, bring together professionals, publish new work, and discuss prospects. In this paper, we summarize submissions for the three tasks of this year's RFIW: specifically, we review the results for kinship verification, tri-subject verification, and family member search and retrieval. We take a look at the RFIW problem, as well as share current efforts and make recommendations for promising future directions.




I Introduction

Automatic kinship recognition has many potential uses, such as forensic analysis, automated photo app management, historical genealogy, multimedia analysis, missing children and human trafficking cases, and immigration and border patrol concerns. Nonetheless, the challenges in such face-based tasks (i.e., fine-grained classification in unconstrained settings) are only magnified in kin-based problem sets: the data exhibit a high degree of variability in pose, illumination, background, and clarity, and the target labels are soft biometrics, which is further complicated by the need to consider the direction of relationships. As a result, the practical benefits of improving kinship-based technologies are counterbalanced by the difficulties posed by the problem of automated kinship comprehension. The RFIW challenge series was born out of this need: a large-scale data challenge supporting various tasks with the goal of advancing kinship detection technology. RFIW acts as a venue for expert and junior researchers to present and discuss their findings in an open forum.

In conjunction with FG, participants of the fifth RFIW (see the RFIW 2021 webpage) continue to push the state-of-the-art (SOTA) in each of the supported tasks (Fig. 1). In parallel to this effort is the focus on improving and extending the Families In the Wild (FIW) dataset [robinson2016families, robinson2018visual, wang2017kinship], a large-scale, multi-task image set for kinship recognition (see the FIW project page). The size and scope of FIW have proven to match the demand of modern-day, data-hungry deep networks [AdvNet, ertugrul2017will, gao2019will, li2017kinnet, wu2018kinship]. Nonetheless, there is room to grow in the quality, quantity, and even organization of the existing labels and evaluation protocols. For this, we propose new technologies to interface with the data, along with future plans to improve the scope of the experiments (i.e., more realistic settings).

Figure 1: An illustration of the three tasks supported in RFIW.
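| Set | Count | 1st generation | | | 2nd generation | | | | 3rd generation | | | | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| train | P | 991 | 1,029 | 1,588 | 712 | 721 | 736 | 716 | 136 | 124 | 116 | 114 | 6,983 |
| train | F | 303 | 304 | 286 | 401 | 404 | 399 | 402 | 81 | 73 | 71 | 66 | 2,790 |
| train | S | 39,608 | 27,844 | 35,337 | 30,746 | 46,583 | 29,778 | 46,969 | 2,003 | 2,097 | 1,741 | 1,834 | 264,540 |
| val | P | 433 | 433 | 206 | 220 | 261 | 200 | 234 | 53 | 48 | 56 | 42 | 2,186 |
| val | F | 74 | 57 | 90 | 134 | 135 | 124 | 130 | 32 | 29 | 36 | 27 | 868 |
| val | S | 8,340 | 5,982 | 21,204 | 7,575 | 9,399 | 8,441 | 7,587 | 762 | 879 | 714 | 701 | 71,584 |
| test | P | 469 | 469 | 217 | 202 | 257 | 230 | 237 | 40 | 31 | 36 | 33 | 2,221 |
| test | F | 149 | 150 | 89 | 126 | 133 | 136 | 132 | 22 | 21 | 20 | 22 | 1,190 |
| test | S | 3,459 | 2,956 | 967 | 3,019 | 3,273 | 3,184 | 2,660 | 121 | 96 | 71 | 84 | 39,743 |

Table I: Counts for T-1: the number of unique pairs (P), families (F), and faces (S) [robinson2020recognizing].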

Specifically, the FIW dataset introduced imagery scraped in the wild, with unconstrained family photos from the web. Although it is the largest and most comprehensive dataset of its kind, several concerns remain. For starters, the diversity of the data: the nearly 1,000 families are vast relative to other datasets, but not compared to the real world. Furthermore, still faces in images are not the only cue in multimedia from which kinship can be inferred: facial dynamics in videos and speech signals were shown to complement still faces (i.e., FIW in Multimedia (FIW MM)). Hence, future RFIW editions will support multimedia content. Lastly, the experimental settings should be tweaked and further enhanced to bridge the gap between research and reality.

Like in [robinson2020recognizing], the 2021 RFIW comprised three tasks depicted in Fig. 1: (T-1) Kinship verification, (T-2) Tri-subject verification, and (T-3) Family member search & retrieval.

The remainder of the paper is organized as follows. First, we review the related work (Section II). A brief review of the data and task protocols follows (Section III). We then introduce the top methods of the challenge (Section IV). Lastly, we end with a discussion (Section V) and conclusion (Section VI).

II Related Works

Kinship understanding is a critical vision problem first published at ICIP 2010 [fang2010towards]. As a human-centered visual learning problem, earlier works focused on low-level features of facial imagery. To combat challenges inherent in visual kinship verification (e.g., variations from age), researchers incorporated mainstream learning approaches like transfer learning [Xia201144, xia2012understanding] and metric learning [lu2014neighborhood, wang2017kinship]. More recently, advances in the evaluation protocols paved the way to pragmatic problem formulations (e.g., tri-subject kinship verification [qin2015tri] and family recognition [robinson2016families, robinson2018visual]). Furthermore, as a part of this RFIW data challenge series, the search and retrieval of missing family members, which debuted in last year's edition [robinson2020recognizing], mimics a variety of practical use-cases and thus further closes the gap from research to reality. Robinson et al. analyze the evolution of the visual kinship problem domain over time, along with the various paradigms, SOTA, and promising future directions in a recent survey [robinson2021survey] and dissertation [robinson2020automatic].

During the past decade, deep learning has been prominent across face-based vision systems [wang2020deep]. Initial visual kinship benchmarks (e.g., KinWild [lu2014neighborhood] and Family101 [fang2013kinship]) had a great impact in organizing and promoting the problem. However, such minimal data were insufficient to meet the capacity needed to train deep models, with few exceptions (e.g., in [zhang12kinship], smaller, parts-based models gained an incremental boost in performance compared with the pre-existing low-level methods). Furthermore, these earlier datasets mostly paired faces cropped from the same photos. As a result, simple color features [lopez2016comments] and, later, same-photo detectors [dawson2018same] claimed SOTA, highlighting the problems with constructing a verification set using face samples from the same photo.

The shortage of quality data and labels, and, hence, the absence of a proper data distribution of the faces of families, motivated the release of the large-scale dataset FIW [FIW]. Ever since, FIW has supported various deep learning approaches [wu2018kinship, wei2019adversarial], generative models that predict the appearances of family members [gao2019will, ozkan2018kinshipgan], additional modalities (i.e., FIW MM [robinson2021families]), and tutorials at top-tier conferences (i.e., ACM MM [robinson2018recognize] and CVPR).

As a series of workshops and data challenges based on FIW, RFIW has been held in different venues over the past four years [robinson2017recognizing, robinson2020recognizing]. In addition, FIW premiered in a Kaggle competition that attracted over 500 teams to make submissions. As part of FG, we hosted the fifth edition of RFIW (i.e., 2021 RFIW). Hence, there is a continued effort to keep updating and promoting the FIW dataset to push SOTA and inspire researchers in the years to come.

| Team | 1st generation | | | 2nd generation | | | | 3rd generation | | | | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline [robinson2020recognizing] | 0.61 | 0.66 | 0.69 | 0.62 | 0.66 | 0.71 | 0.73 | 0.68 | 0.57 | 0.64 | 0.50 | 0.64 |
| TeamCNU [id2new] | 0.82 | 0.84 | 0.80 | 0.76 | 0.82 | 0.75 | 0.77 | 0.76 | 0.71 | 0.75 | 0.59 | 0.80 |
| vuvko [id4] | 0.75 | 0.81 | 0.78 | 0.74 | 0.78 | 0.69 | 0.76 | 0.60 | 0.80 | 0.80 | 0.77 | 0.78 |
| nc2893 [id1] | 0.76 | 0.78 | 0.75 | 0.74 | 0.70 | 0.67 | 0.70 | 0.59 | 0.79 | 0.79 | 0.75 | 0.77 |
| jh3450 [id1] | 0.76 | 0.78 | 0.75 | 0.74 | 0.70 | 0.67 | 0.70 | 0.59 | 0.79 | 0.79 | 0.75 | 0.77 |
| paw2140 [id1] | 0.75 | 0.78 | 0.76 | 0.74 | 0.68 | 0.69 | 0.72 | 0.59 | 0.78 | 0.79 | 0.75 | 0.77 |
| DeepBlueAI [id3] | 0.74 | 0.81 | 0.75 | 0.74 | 0.72 | 0.73 | 0.67 | 0.68 | 0.77 | 0.77 | 0.75 | 0.76 |
| ustc-nelslip [id6, id3new] | 0.76 | 0.82 | 0.75 | 0.75 | 0.79 | 0.69 | 0.76 | 0.67 | 0.75 | 0.74 | 0.72 | 0.76 |
| stefhoer [id2] | 0.77 | 0.80 | 0.77 | 0.78 | 0.70 | 0.73 | 0.64 | 0.60 | 0.66 | 0.65 | 0.76 | 0.74 |

Table II: Averaged verification accuracy scores for T-1.

III Task Evaluations, Protocols, Benchmarks

We briefly introduce each task: we refer readers to our previous white paper for additional details [robinson2020recognizing]. Also, see Fig. 1 for a visual depiction of each.

Historically, most of the focus has been on T-1 [duan2017advnet, li2017kinnet, wang2017kinship, wu2018kinship]. We then introduced T-2 and T-3 in 2020 [robinson2020recognizing], in which several participants engaged [id2, id3, id4, id5, id6, id8, id9]. The three data splits are formed at the family level to enable multi-task solutions, along with the previous need to keep the ground truth for a subset of the families from the public (i.e., for blind testing). In other words, the three sets (i.e., train, val, and test) contain the same families across all tasks. Specifically, 60% of the families make up the train set, while the remaining 40% was split between val and test. Hence, the three sets are disjoint in family and identity, and they remain consistent across the different tasks.
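The family-level split can be illustrated with a minimal sketch. The function name `split_families`, the family ID format, and the even 20%/20% division of the remaining families between val and test are assumptions for illustration; the text only states that 60% of families form the train set and that the rest is shared between val and test.

```python
import random


def split_families(family_ids, train_frac=0.6, val_frac=0.2, seed=0):
    """Partition family IDs into disjoint train/val/test lists.

    The 0.2 val fraction is an assumption; the paper only says the
    remaining 40% of families is split between val and test.
    """
    ids = sorted(set(family_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(train_frac * len(ids))
    n_val = round(val_frac * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


# Hypothetical family IDs, used only to exercise the function.
families = [f"F{i:04d}" for i in range(1, 11)]
splits = split_families(families)
```

Because the partition is made over families rather than over faces or pairs, any pair, triplet, or probe built from a split inherits the same family (and identity) disjointness, which is what allows one split to serve all three tasks.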

Since the ground truth was released after the last RFIW, it was fairest to provide all data and labels. As usual, teams were asked to process the test set only when generating submissions, and any attempt to analyze or understand the test pairs was prohibited. All submissions were received and scored on the server via Codalab (i.e., T-1, T-2, and T-3).

All faces were encoded via the SphereFace Convolutional Neural Network (CNN) [Liu_2017_CVPR] (i.e., 512 D), with the pre-processing and training from the original work. Cosine similarity determined the closeness of a pair of faces by comparing features $p_1$ and $p_2$, which is defined as $\text{CS}(p_1, p_2) = \frac{p_1 \cdot p_2}{\|p_1\| \, \|p_2\|}$ [nguyen2010cosine].
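As a small illustration of the baseline comparison, the sketch below computes the cosine similarity of two feature vectors in plain Python. The function name and the toy 2-D vectors are assumptions for illustration; the actual baseline features are 512-D face encodings.

```python
import math


def cosine_similarity(p1, p2):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    return dot / (norm1 * norm2)


# Toy vectors: identical directions score 1, orthogonal directions score 0.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Since the score depends only on direction, it is insensitive to the magnitude of the encodings.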

III-A Kinship Verification (T-1)

To verify kinship is to predict whether a pair of individuals are blood relatives. In computer vision, we compare faces to classify the pairs as KIN or NON-KIN (i.e., true or false, respectively). This one-to-one view of kinship recognition typically assumes prior knowledge of the relationship type [robinson2018recognize]. Hence, relationship types are evaluated independently. With the introduction of FIW, the number of face pairs and relationship types for kinship verification (i.e., T-1) has significantly increased. Three sets of the data (i.e., train, val, and test) are partitioned for RFIW (Table I). The test set had an equal number of positive and negative pairs and no family (and, hence, subject identity) overlaps between sets. As of 2020, the challenge began to support grandparent-grandchild types, i.e., grandfather-granddaughter (GFGD), grandfather-grandson (GFGS), grandmother-granddaughter (GMGD), and grandmother-grandson (GMGS). Due to insufficient counts across folds, and the severe bias that results from only a few families contributing pairs that far apart in generation, the great grandparent-great grandchild pairs of FIW are omitted from T-1 of RFIW.

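Verification accuracy is used to evaluate performance. Specifically, $\text{Accuracy}_j = \frac{\text{number of correct predictions for the } j\text{-th pair type}}{\text{total number of pairs of the } j\text{-th type}}$, where $j \in \{1, \dots, 11\}$ indexes the relationship types. Then, the overall accuracy is the weighted sum of the per-type accuracies. The threshold separating positive from negative pairs was set to the value that maximizes the accuracy on the val set. Results are listed in Table II.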
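The evaluation steps can be sketched as follows, under two assumptions that are not spelled out here: the per-type accuracies are weighted by each type's share of the pairs, and candidate thresholds are taken from the val scores themselves. All function names and the toy scores are hypothetical.

```python
def accuracy(scores, labels, threshold):
    """Fraction of pairs whose thresholded score matches the label (1 = KIN)."""
    correct = sum((s > threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def best_threshold(val_scores, val_labels):
    """Pick the candidate threshold with the highest val accuracy.

    Candidates are the observed scores plus one value below all of them
    (an assumption; any sweep over thresholds would do).
    """
    candidates = [min(val_scores) - 1.0] + sorted(set(val_scores))
    return max(candidates, key=lambda t: accuracy(val_scores, val_labels, t))


def weighted_accuracy(per_type):
    """per_type maps a relationship type to (n_correct, n_pairs).

    Each per-type accuracy is weighted by its share of all pairs (assumed).
    """
    total = sum(n_pairs for _, n_pairs in per_type.values())
    return sum((n_pairs / total) * (n_correct / n_pairs)
               for n_correct, n_pairs in per_type.values())


# Toy val scores and labels, used only to exercise the functions.
val_scores = [0.9, 0.8, 0.4, 0.3]
val_labels = [1, 1, 0, 0]
threshold = best_threshold(val_scores, val_labels)
overall = weighted_accuracy({"FD": (8, 10), "GFGD": (1, 2)})
```

With pair-count weights, the weighted sum reduces to the total number of correct predictions divided by the total number of pairs, so abundant types dominate the overall figure.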

| Team | FMS | FMD | Avg. |
|---|---|---|---|
| Baseline [robinson2020recognizing] | 0.68 | 0.68 | 0.68 |
| TeamCNU [id2new] | 0.86 | 0.82 | 0.84 |
| stefhoer [id2] | 0.74 | 0.72 | 0.73 |
| DeepBlueAI [id3] | 0.77 | 0.76 | 0.77 |
| ustc-nelslip [id6, id3new] | 0.80 | 0.78 | 0.79 |

Table III: Verification accuracy scores for T-2.

III-B Tri-Subject Verification (T-2)

Tri-subject verification focuses on a different view of kinship verification: the goal is to decide if a child is related to a pair of parents. First introduced in [qin2015tri], it makes a more realistic assumption, as knowing one parent often means the other potential parent(s) can be easily inferred.

Triplet pairs consist of Father (F) / Mother (M) - Child (C) (FMC) pairs, where the child C could be either a Son (S) or a Daughter (D) (i.e., triplet pairs are FMS and FMD).

Triplets were formed by first matching each mother-father pair with their biological children to form the list of positives. Then, negative triplets were generated by shuffling the children in the positive list such that pairs of parents remained constant to yield the same number of negatives. Note that the number of possible negatives is far more than positives, so a pair of faces of a parent pair was used once and only once to produce a balanced list. Again, no family or subject identity overlaps between sets: the same families make up the train, val, and test sets across the tasks.
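The negative sampling described above can be sketched as a shuffle of the children that keeps each parent pair fixed and rejects any arrangement that returns a child to its own family. The tuple layout, the function name, and the rejection-sampling strategy are assumptions for illustration, not the challenge's actual tooling.

```python
import random


def make_negative_triplets(positives, seed=0, max_tries=1000):
    """Build one negative triplet per positive by shuffling the children.

    positives: list of (family_id, father, mother, child) tuples.
    Parent pairs keep their positions; each receives a child from a
    different family, so the negatives match the positives in number.
    """
    rng = random.Random(seed)
    children = [(fid, child) for fid, _, _, child in positives]
    for _ in range(max_tries):
        rng.shuffle(children)
        if all(fid != c_fid
               for (fid, _, _, _), (c_fid, _) in zip(positives, children)):
            return [(f, m, c)
                    for (_, f, m, _), (_, c) in zip(positives, children)]
    raise RuntimeError("no shuffle without same-family triplets was found")


# Hypothetical positives: one father-mother-child triplet per family.
positives = [
    ("F0001", "f1", "m1", "c1"),
    ("F0002", "f2", "m2", "c2"),
    ("F0003", "f3", "m3", "c3"),
]
negatives = make_negative_triplets(positives)
```

Using every parent pair exactly once yields a balanced list even though far more negative combinations are possible.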

Verification accuracy was first calculated per triplet-pair type (i.e., FMD and FMS), and then averaged via the weighted sum. A score was assigned to each triplet in the val and test sets using the formula $\text{score}_i = \text{avg}\left(\cos(F_i, C_i), \cos(M_i, C_i)\right)$, where $F_i$, $M_i$, and $C_i$ are the feature vectors of the $i$-th triplet. Scores were compared to a threshold to infer a label (i.e., KIN if the score surpasses the threshold; else, NON-KIN). The threshold was found experimentally on the val set and then applied to the test set (Table III).
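A minimal sketch of triplet scoring follows. It assumes the score is the average of the father-child and mother-child cosine similarities, which is one plausible reading of the baseline; the function names, toy vectors, and threshold value are hypothetical.

```python
import math


def cos(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def triplet_score(father, mother, child):
    """Average of father-child and mother-child similarity (assumed rule)."""
    return 0.5 * (cos(father, child) + cos(mother, child))


def predict(score, threshold):
    """KIN if the score surpasses the threshold; otherwise NON-KIN."""
    return "KIN" if score > threshold else "NON-KIN"


# Toy 2-D features; the threshold would normally be tuned on the val set.
score = triplet_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
label = predict(score, threshold=0.5)
```

Averaging treats both parents symmetrically; a weighted combination would be a natural variant when one parent's evidence is more reliable.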

(a) Top scoring.
(b) Runner-up.
(c) Lowest scoring.
Figure 2: The top-ranking (a), runner-up (b), and last-place (c) families, on average, in T-3. The probe (i.e., search query) is marked with a white star in the lower-left corner of the respective montage. The top two families (i.e., (a) and (b)) are queries of children with both parents in the gallery, while the family with the lowest accuracy was a subject with children and a sibling (i.e., a sister) present. Note that connections indicate blood relatives: in (a) and (b), the parents are connected only through the children they share (i.e., the only blood connection between them is via the children), while the top row of (c) does not include the mother of the children, for she is not a blood relative of the query (i.e., who is a father and a brother in the relationship types in focus here).

III-C Search and Retrieval (T-3)

As a search cue, kinship information can improve conventional facial recognition (FR) search systems as prior knowledge for mining social or ancestral relationships. However, the task is most directly related to missing persons. Thus, we pose T-3 as a set-based paradigm. For this, we imitate template-based evaluations on the probe side but with a gallery of faces [whitelam2017iarpa]. That is, the goal is to find relatives of search subjects (i.e., subject-level probes) in a search pool (i.e., face-level gallery).

The protocol of T-3 could be used to find parents and other relatives of unknown, missing children. The gallery contains 31,787 facial images from 190 families in the test set. The inputs are sets of media for a subject (i.e., probes), and the outputs are ranked lists of all faces in the gallery. The number of relatives varies for each subject, ranging anywhere from 1 to 20+. Furthermore, probes have one-to-many samples, and how best to fuse feature sets (i.e., a probe's media) is an open research question [zhaomulti]. This many-to-many task is currently set up in closed form (i.e., every probe has relative(s) in the gallery).

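For each of the $N$ test probes (i.e., family $f$), the average precision (AP) is calculated:

$$\text{AP}(f) = \frac{1}{P_f} \sum_{tp=1}^{P_f} \text{Prec}(tp) = \frac{1}{P_f} \sum_{tp=1}^{P_f} \frac{tp}{\text{rank}(tp)},$$

where $P_f$ is the number of true-positives for the $f$-th family. The average AP (i.e., mean average precision (mAP)) is then reported: $\text{mAP} = \frac{1}{N} \sum_{f=1}^{N} \text{AP}(f)$. Finally, Rank@5 is reported: the fraction of probes that returned at least one true-positive (TP) in the top five gallery faces. The baseline, along with others on the scoreboard, is shown in Table IV.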
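The retrieval metrics can be sketched under the usual definitions: AP averages the precision measured at the rank of each true-positive, mAP averages AP over probes, and Rank@k is the fraction of probes with at least one true-positive among the top k results. The function names and toy rankings are assumptions for illustration.

```python
def average_precision(ranked_relevance):
    """AP for one probe.

    ranked_relevance: booleans over the ranked gallery, best match first,
    True where the gallery face is a relative of the probe.
    """
    hits = 0
    precisions = []
    for rank, is_relative in enumerate(ranked_relevance, start=1):
        if is_relative:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(all_rankings):
    """Mean of the per-probe AP values."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


def rank_at_k(all_rankings, k=5):
    """Fraction of probes with at least one true-positive in the top k."""
    return sum(any(r[:k]) for r in all_rankings) / len(all_rankings)


# Toy rankings for two probes over a tiny gallery.
probe_a = [True, False, True, False]
probe_b = [False, False, False, False, False, True]
m_ap = mean_average_precision([probe_a, probe_b])
r5 = rank_at_k([probe_a, probe_b], k=5)
```

In the closed-set protocol every probe has at least one relative in the gallery, so the empty-list fallback in `average_precision` is only a safeguard.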

| Team | Average | mAP | Rank@5 |
|---|---|---|---|
| ustc-nelslip [id6, id3new] | 0.35 | 0.15 | 0.54 |
| Baseline [robinson2020recognizing] | 0.28 | 0.11 | 0.45 |
| TeamCNU [id2new] | 0.40 | 0.21 | 0.60 |
| HCMUS notweeb [id9] | 0.17 | 0.07 | 0.28 |
| DeepBlueAI [id3] | 0.19 | 0.06 | 0.32 |
| vuvko [id4] | 0.39 | 0.18 | 0.60 |

Table IV: Performance ratings for T-3.

IV Summary of Submissions

New solutions published as part of the 2021 RFIW challenge in the FG proceedings are introduced. Readers are referred to the paper references for additional details.

Figure 3: FIW is now available in FiftyOne, an open-source tool for dataset curation and model analysis via an easy-to-use Python interface and application. Our motivations for choosing FiftyOne, and its corresponding dataset zoo, are to offer simpler access and built-in data exploration features and to inspire collective contributions from the community, and, hence, to acquire a better understanding of how FIW can be improved, extended, and demonstrated. All tasks will be supported in this initiative and, eventually, so will the multimedia variant FIW MM [robinson2021families].


TeamCNU proposed a contrastive learning framework to tackle, and lead, all three tasks [id2new]. The core idea of their solution is that self-supervised contrastive learning can learn powerful representations for different downstream tasks, such as the three tracks in RFIW. Following the 2020 RFIW winning team, vuvko [id4], ArcFace (i.e., a pre-trained ResNet-101) was used to encode raw faces, and a multi-layer perceptron (MLP) was then attached to obtain the low-dimensional feature pair for computing the contrastive loss [chen2020simple]. To output kinship verification results for all three tasks, the MLP layers are removed, and the mid-level features extracted by the Siamese backbone are used to compute the similarity score. A predefined threshold is then applied to select the positive pairs. The approach outperforms all former SOTA methods on most of the tracks in the RFIW challenge.
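To make the contrastive objective concrete, the sketch below implements a simplified, one-directional InfoNCE-style loss over a batch of kin pairs in plain Python. It is not TeamCNU's implementation: the function names, the temperature value, and the choice to treat the other pairs in the batch as negatives are assumptions in the spirit of [chen2020simple].

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(feats_a, feats_b, temperature=0.1):
    """Simplified InfoNCE-style loss (illustrative assumption).

    feats_a[i] and feats_b[i] are projected features of a kin pair; every
    feats_b[j] with j != i acts as a negative for anchor feats_a[i].
    """
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    losses = []
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: correctly aligned kin pairs versus deliberately swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each anchor is closest to its own kin partner and large otherwise, which is the property that pulls kin pairs together in the learned feature space.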


The authors of [id1] used ensemble learning at both the data and network levels. Specifically, this team applied data augmentation to obtain more representative input data, such as random rotations, minor crops, horizontal flips, and channel-wise transformations. Moreover, the authors used multiple copies of each data sample with different augmentations as the ensemble inputs for testing. The network ensemble employed multiple architectures, such as ResNet50 [he2016deep], FaceNet [schroff2015facenet], VGGFace [parkhi2015deep], and SENet50 [hu2018squeeze], as backbones, and the features from the different backbones were fused in multiple ways. The Hadamard product, squared difference, and absolute difference of the pairs of features are concatenated for the high-level similarity quantization. Moreover, there is an ensemble across different splits of the training data with k folds. For each instance, the final 4*k ensemble member networks, trained with k different splits and four Siamese networks per split, are applied for the prediction. To further boost performance, they used program synthesis based on OpenAI's Codex [chen2021evaluating] to automatically generate variants of networks for the ensemble.
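The pairwise feature fusion can be sketched as below, reading the three fused terms as the element-wise (Hadamard) product, the squared difference, and the absolute difference. This reading, the function name, and the toy vectors are assumptions; the team's own code may differ.

```python
def fuse_pair_features(x1, x2):
    """Concatenate element-wise product, squared difference, and absolute
    difference of two equal-length feature vectors (assumed fusion)."""
    hadamard = [a * b for a, b in zip(x1, x2)]
    squared_diff = [(a - b) ** 2 for a, b in zip(x1, x2)]
    abs_diff = [abs(a - b) for a, b in zip(x1, x2)]
    return hadamard + squared_diff + abs_diff


# Toy 2-D features for the two faces of a pair.
fused = fuse_pair_features([1.0, 2.0], [3.0, 0.5])
```

All three terms are symmetric in their arguments, so the fused vector does not depend on which face of the pair is presented first. A classifier head on top of the fused vector would then output the kinship decision.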


ustc-nelslip used a Siamese neural network designed for all three tasks [id3new]. For the one-vs-one kinship verification (T-1), the team used a two-branch deep Siamese neural network with enhanced feature fusion (i.e., a concatenation of the squared difference, the difference of squared features, and the dot product). To address the two-vs-one verification (T-2), they proposed a pair of deep Siamese neural networks, each composed of four branches. There is only one branch network for the child data, which is shared by the two Siamese networks, while the other two branches are designed for the mother and father images, respectively. The obtained mother-child and father-child similarity scores are then weighted to generate the parent-child similarity score for kinship verification. To rank gallery images by similarity to a given query (T-3), ustc-nelslip used both the feature-fusion similarity and cosine similarity as measures.

V Discussion

To explore the easy and hard cases of T-3, we took the average across the baseline and two submissions of this RFIW. The top two queries in mAP are of children for whom both parents are present in the gallery. The last-place query is a subject with no parents present in the gallery (Fig. 2). It is interesting to note that the parents of the second family are inter-racial: as expected, children tend to inherit features from both parents, on average.

A challenge in facial recognition problems is age: it is typically more difficult to recognize someone at an older age. This is especially true in visual kinship recognition, as the patterns that come with age vary. The faces representing the last-place query are all from an older age (Fig. 2(c)). Furthermore, all blood relatives present in the gallery are of the opposite sex (i.e., one sister and three daughters), whereas both the top-scoring (Fig. 2(a)) and the runner-up (Fig. 2(b)) queries have at least half of their known (i.e., present) relatives of the same sex, and these account for a majority of the face samples.

FIW was added to FiftyOne's dataset zoo [moore2020fiftyone], which makes accessing and exploring the data efficient. Additionally, we plan to release evaluations via Voxel51's Python API (Fig. 3). FIW becomes more accessible and extendable as we work to incorporate improved protocols, data quality, and multimedia (MM) in future RFIWs.

VI Conclusion

With another year of RFIW held in conjunction with the 2021 FG, the SOTA in kin-based vision models continues to improve across all three tasks currently supported by the FIW dataset. TeamCNU topped the score charts via a contrastive learning framework geared to learn better representations for comparing faces in all three kinship recognition tasks. We are working to improve the existing data and settings by making FIW more accessible, while also bringing multimedia data into the benchmarks. Baseline code is available online.