I Introduction
Automatic kinship recognition has numerous uses: as an aid in forensic investigations, automated photo library management, historical lineage and genealogical studies, social-media-based analysis, cases of missing children and human trafficking, and immigration and border patrol. Nonetheless, the challenges of such face-based tasks (i.e., fine-grained classification in unconstrained settings) are only amplified in kin-based problems: the data exhibits a high degree of variability in pose, illumination, background, and clarity, and the target labels are soft biometrics (i.e., kinship), which become harder still once the direction of a relationship is considered. Hence, the practical benefits of improving kinship-based technology are matched by the challenges posed by automatic kinship understanding. This motivated the launch of the Recognizing Families In the Wild (RFIW) challenge series: a large-scale data challenge supporting multiple tasks with the aim of advancing kinship recognition technologies. We intend for RFIW to serve as a platform for expert and junior researchers to present and share ideas in an open forum.
The Families In the Wild (FIW) dataset [17, 18, 21], a large-scale, multi-task image set for kinship recognition, supports the annual RFIW (FIW project page: https://web.northeastern.edu/smilelab/fiw/). The aim of the RFIW challenge is to bridge the gap between research and reality using the dataset's large scale, variation, and rich label information. These properties make modern data-driven approaches possible, as has been seen since its release in 2016 [1, 3, 6, 9, 23].
We summarize the evaluation protocols (practical motivation, technical background, data splits, metrics, and benchmarks) of the 2020 RFIW challenge. Specifically, this manuscript serves as a white paper for the RFIW held in conjunction with FG. Additional and supplemental information is available on the challenge website (RFIW2020 webpage: https://web.northeastern.edu/smilelab/rfiw2020/).
The remainder of the paper is organized as follows. The three tasks that make up RFIW2020 are introduced separately (Sections III-B, III-C, and III-D). For each task, a clear problem statement, the intended use, data splits, task protocols (i.e., evaluation settings and metrics), and benchmark results are provided. We then discuss broader impacts and potential next steps (Section IV) and conclude (Section IV-B).
II Related Works
Kinship recognition in machine vision started with [5], where minimal data and low-level features set the stage for the task of kinship verification between parents and children. Soon thereafter, [24] took a gender-specific view of the problem; moreover, the problem was cast as a low-rank transfer subspace problem, where the source and target are faces of the parent at younger and older ages, respectively [20]. Family101 [4] was the first facial image dataset with family tree labels; at about the same time, KinWild [12] was released and used to organize data challenges [11]. The task of tri-subject kinship verification (i.e., Track 2) was inspired by the work that came next [16], for which data (i.e., TS-Kin) and benchmarks were released. Until the release of FIW in 2016 [17], deep learning models were not widely applied to the kin-based domain, with rare exception (i.e., [25]), as the data needs of these more complex models were not met by previous datasets. As part of the first RFIW [19], FIW was further extended [18, 21], making ever more kin-based problems approachable [6, 8]. A major focus of this edition (i.e., RFIW 2020) is to establish a record of the state of the art on the latest version of the FIW image set.

Table I

Split | Count | BB | SS | SIBS | FD | FS | MD | MS | GFGD | GFGS | GMGD | GMGS | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | P | 991 | 1,029 | 1,588 | 712 | 721 | 736 | 716 | 136 | 124 | 116 | 114 | 6,983 |
train | F | 303 | 304 | 286 | 401 | 404 | 399 | 402 | 81 | 73 | 71 | 66 | 2,790 |
train | S | 39,608 | 27,844 | 35,337 | 30,746 | 46,583 | 29,778 | 46,969 | 2,003 | 2,097 | 1,741 | 1,834 | 264,540 |
val | P | 433 | 433 | 206 | 220 | 261 | 200 | 234 | 53 | 48 | 56 | 42 | 2,186 |
val | F | 74 | 57 | 90 | 134 | 135 | 124 | 130 | 32 | 29 | 36 | 27 | 868 |
val | S | 8,340 | 5,982 | 21,204 | 7,575 | 9,399 | 8,441 | 7,587 | 762 | 879 | 714 | 701 | 71,584 |
test | P | 469 | 469 | 217 | 202 | 257 | 230 | 237 | 40 | 31 | 36 | 33 | 2,221 |
test | F | 149 | 150 | 89 | 126 | 133 | 136 | 132 | 22 | 21 | 20 | 22 | 1,190 |
test | S | 3,459 | 2,956 | 967 | 3,019 | 3,273 | 3,184 | 2,660 | 121 | 96 | 71 | 84 | 39,743 |

Table II

Split | Count | FM-S | FM-D | Total |
---|---|---|---|---|
train | P | 662 | 639 | 1,331 |
train | F | 375 | 364 | 739 |
train | S | 8,575 | 8,588 | 17,163 |
val | P | 202 | 177 | 379 |
val | F | 116 | 117 | 233 |
val | S | 2,859 | 2,493 | 5,352 |
test | P | 205 | 178 | 383 |
test | F | 116 | 114 | 230 |
test | S | 2,805 | 2,400 | 5,205 |

Table III

Method | FD | FS | MD | MS | SIBS | BB | SS | GFGD | GFGS | GMGD | GMGS | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sphereface | 0.61 | 0.66 | 0.69 | 0.62 | 0.66 | 0.71 | 0.73 | 0.68 | 0.57 | 0.64 | 0.50 | 0.64 |

III Task Evaluations, Protocols, Benchmarks
RFIW 2020 supported three tasks: kinship verification (T-1), tri-subject verification (T-2), and search & retrieval of family members for missing children (T-3). We next describe each task separately, following the same outline: the problem statement and motivation, data splits and protocols, and benchmark experiments (i.e., baselines). A brief section on experimental settings common to all tasks precedes the detailed descriptions of each task in separate subsections.
III-A Experimental settings
The FIW dataset provides the most extensive set of face pairs for kin-based face recognition. FIW provides the data needed to train modern data-driven deep models [2, 9, 21, 23]. FIW was split into three parts: train, val, and test. Specifically, 60% of the families were assigned to the train set; the remaining 40% was split evenly between val and test. The three sets are completely disjoint in family and identity. Labeled train and unlabeled val were released first, with servers open for scoring (Phase 1). Then, ground truth for val was made available (Phase 2). Finally, the “blind” test set was released at the start of Phase 3, which lasted ten days to allow teams to process the data and make final submissions for scoring. Teams were asked to process the test set only to generate submissions; any attempt to analyze or understand the test pairs was prohibited.

As preprocessing, faces for all three sets were encoded via the Sphereface CNN [10] (i.e., as 512-D features). All preprocessing and the model weights were taken from the original work (https://github.com/wy1iu/sphereface).
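The family-level split described above can be illustrated with a minimal sketch. It assumes only that each sample carries a family identifier; the function name `split_families`, the seed, and the `F0001`-style IDs are hypothetical and are not taken from the challenge toolkit.

```python
import random


def split_families(family_ids, seed=0):
    """Assign whole families to train/val/test in a 60/20/20 ratio.

    Splitting by family (not by face or pair) keeps the three sets
    disjoint in both family and subject identity.
    """
    fids = sorted(set(family_ids))
    random.Random(seed).shuffle(fids)
    n_train = round(0.6 * len(fids))
    n_val = (len(fids) - n_train) // 2
    return {
        "train": set(fids[:n_train]),
        "val": set(fids[n_train:n_train + n_val]),
        "test": set(fids[n_train + n_val:]),
    }


# Hypothetical family identifiers, for illustration only.
splits = split_families([f"F{i:04d}" for i in range(1, 1001)])
print({name: len(fams) for name, fams in splits.items()})
```

Because every face, pair, or triplet inherits the split of its family, no subject can leak from one set into another.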
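Also common to all tasks is the use of cosine similarity to determine the closeness of a pair of facial features $p_1$ and $p_2$ [14]. This is defined as

$$\mathrm{CS}(p_1, p_2) = \frac{p_1 \cdot p_2}{\lVert p_1 \rVert \, \lVert p_2 \rVert}.$$

Scores were then either compared to a threshold (i.e., a score above the threshold infers KIN; else, NON-KIN) or sorted (i.e., T-3).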
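As a minimal sketch of this scoring step, the snippet below computes the cosine similarity of two feature vectors and thresholds it. The helper names and the threshold value are illustrative assumptions; in practice the features would be the 512-D Sphereface encodings and the threshold would be tuned on val.

```python
import math


def cosine_similarity(p1, p2):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    return dot / (norm1 * norm2)


def predict_kin(p1, p2, threshold):
    """Return True (KIN) if the similarity exceeds the threshold, else False (NON-KIN)."""
    return cosine_similarity(p1, p2) > threshold


# Toy 2-D features; the threshold of 0.5 is a placeholder, not a tuned value.
same_direction = predict_kin([1.0, 0.0], [2.0, 0.0], threshold=0.5)
orthogonal = predict_kin([1.0, 0.0], [0.0, 1.0], threshold=0.5)
print(same_direction, orthogonal)
```

Cosine similarity ignores vector magnitude, so only the direction of the embeddings affects the decision.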
III-B Kinship verification
The goal of kinship verification is to determine whether a pair of faces are blood relatives. This classical Boolean problem has two possible outcomes, KIN or NON-KIN (i.e., true or false, respectively). Hence, this is the one-to-one view of kin-based problems. The classical problem can be further extended by considering the type of kin relation between a pair of faces, rather than treating all kin relations equally.
Prior research mainly considered parent-child kinship types, i.e., father-daughter (FD), father-son (FS), mother-daughter (MD), and mother-son (MS). Less attention has been given to sibling pairs, i.e., sister-sister (SS), brother-brother (BB), and brother-sister (SIBS). Research in psychology and computer vision has found that different relationship types share different familial features [13]. Hence, each relationship type can be modeled and evaluated independently, and additional kinship types would further both our understanding of and our capabilities in automatic kinship recognition. With FIW, the number of facial pairs accessible for kinship verification has dramatically increased, with a subset of the pair types and face pairs listed in Table I. Additionally, benchmarks now include grandparent-grandchild types, i.e., grandfather-granddaughter (GFGD), grandfather-grandson (GFGS), grandmother-granddaughter (GMGD), and grandmother-grandson (GMGS).
III-B1 Data splits
FIW supports the eleven relationship types used in RFIW (Table I). The test set had an equal number of positive and negative pairs, with no family (and, hence, no subject identity) overlap between sets.
III-B2 Settings and metrics
Conventional face verification supports different modes [7], which are followed here:
- Unsupervised: No labels are provided, i.e., no prior knowledge about kinship or subject IDs.
- Image-restricted: Kinship labels (i.e., KIN/NON-KIN) are provided for a training set that is completely disjoint from the “blind” evaluation set, i.e., no subject or family overlap between training and evaluation sets.
- Image-unrestricted: Along with the kinship labels, subject IDs are provided. This makes it possible to generate additional negative pairs.
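Verification accuracy is used for evaluation. Specifically,

$$\mathrm{Accuracy}_j = \frac{\text{number of correct predictions for pair type } j}{\text{total number of pairs of type } j},$$

where $j$ indexes the relationship types. Then, the overall accuracy is calculated as a weighted sum (i.e., each type is weighted by its pair count to determine the average accuracy).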
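The per-type accuracy and its pair-count-weighted average can be sketched as follows. The record layout `(pair_type, true_label, predicted_label)` and the function name are assumptions made for illustration, not the format used by the scoring server.

```python
from collections import defaultdict


def verification_accuracy(records):
    """Compute accuracy per relationship type and the pair-count-weighted average.

    records: iterable of (pair_type, true_label, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair_type, truth, pred in records:
        total[pair_type] += 1
        correct[pair_type] += int(truth == pred)
    per_type = {t: correct[t] / total[t] for t in total}
    # Weight each type's accuracy by its number of pairs.
    overall = sum(total[t] * per_type[t] for t in total) / sum(total.values())
    return per_type, overall


# Toy predictions: 4 FD pairs (3 correct) and 2 BB pairs (1 correct).
records = [
    ("FD", 1, 1), ("FD", 0, 0), ("FD", 1, 0), ("FD", 0, 0),
    ("BB", 1, 1), ("BB", 0, 1),
]
per_type, overall = verification_accuracy(records)
print(per_type, overall)
```

Weighting by pair count makes the overall figure equal to the fraction of all pairs classified correctly, so frequent types contribute more than rare ones such as the grandparent-grandchild pairs.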
III-B3 Baseline results
III-C Tri-subject verification
Tri-subject verification takes a different view of kinship verification: the goal is to decide whether a child is related to a pair of parents. First introduced in [16], it makes a more realistic assumption, as knowledge of one parent often means the other potential parent(s) can be easily inferred.
Triplets consist of Father (F) / Mother (M) - Child (C) (FMC) groupings, where the child C can be either a Son (S) or a Daughter (D) (i.e., the triplet types are FMS and FMD).
Table IV

Method | FMS | FMD | Avg. |
---|---|---|---|
Baseline | 0.68 | 0.68 | 0.68 |

III-C1 Data splits
Following the procedure in [16], we create positive triplets (i.e., with a kin relation) by matching each husband-wife spouse pair with their biological children, and negative triplets (i.e., with no kin relation) by shuffling the positive triplets until every spouse pair is matched with a child who is not theirs (Table II). Because the number of potential negative samples far exceeds the number of potential positive samples, we generate only one negative triplet for each positive triplet, again following the procedure of [16].
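A minimal sketch of this negative-generation step is given below. It assumes each positive triplet is a tuple `(family_id, father_id, mother_id, child_id)` and reshuffles the children until no spouse pair is matched with a child from its own family; the function name, the tuple layout, and the retry limit are illustrative assumptions rather than the procedure used to build the released lists.

```python
import random


def make_negative_triplets(positives, seed=0, max_tries=1000):
    """Create one negative (father, mother, child) triplet per positive triplet.

    positives: list of (family_id, father_id, mother_id, child_id).
    Children are shuffled until every couple is paired with a child
    from a different family.
    """
    rng = random.Random(seed)
    children = [(family, child) for family, _, _, child in positives]
    for _ in range(max_tries):
        rng.shuffle(children)
        if all(family != child_family
               for (family, _, _, _), (child_family, _) in zip(positives, children)):
            return [(father, mother, child)
                    for (_, father, mother, _), (_, child) in zip(positives, children)]
    raise RuntimeError("no valid shuffle found; check the family distribution")


# Toy positives from three hypothetical families.
positives = [
    ("F1", "f1", "m1", "c1"),
    ("F2", "f2", "m2", "c2"),
    ("F3", "f3", "m3", "c3"),
    ("F3", "f3", "m3", "c4"),
]
negatives = make_negative_triplets(positives)
print(negatives)
```

Shuffling the existing children, instead of sampling new ones, yields exactly one negative per positive and reuses the same pool of faces.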
We post-process the positive triplets before generating negatives to ensure balance among individuals, families, and spouse pairs. A naive selection procedure that weights every face sample equally would leave some individuals and families severely over-represented, because some identities and families have an abundance of face samples. The post-processing first limits the number of samples of any triplet $(F, M, C)$, where $F$, $M$, and $C$ are the identities of a father, mother, and child, to 5; it then limits the appearances of each spouse pair to 15; finally, it limits the number of triplet samples from each family to 30. The test set has an equal number of positive and negative triplets. Lastly, note that there is no family or subject identity overlap between any of the sets.
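The three successive caps can be sketched as below. The sketch assumes each triplet sample is a dictionary with `family`, `father`, `mother`, and `child` keys, and it keeps the first samples encountered for each key; whether the original procedure kept the first samples or drew them at random is not stated, so that choice is an assumption.

```python
from collections import Counter


def cap_by(samples, key, limit):
    """Keep at most `limit` samples per value of key(sample), preserving order."""
    seen = Counter()
    kept = []
    for sample in samples:
        k = key(sample)
        if seen[k] < limit:
            seen[k] += 1
            kept.append(sample)
    return kept


def balance_triplets(samples, per_triplet=5, per_couple=15, per_family=30):
    """Apply the triplet, spouse-pair, and family caps in that order."""
    samples = cap_by(samples, lambda s: (s["father"], s["mother"], s["child"]), per_triplet)
    samples = cap_by(samples, lambda s: (s["father"], s["mother"]), per_couple)
    samples = cap_by(samples, lambda s: s["family"], per_family)
    return samples


# Toy data: one hypothetical couple with 4 children and 8 samples per child.
samples = [
    {"family": "F1", "father": "f1", "mother": "m1", "child": f"c{k}", "face": n}
    for k in range(4)
    for n in range(8)
]
balanced = balance_triplets(samples)
print(len(samples), "->", len(balanced))
```

In this toy case the spouse-pair cap is the binding one. Keeping the first samples in order means later children can be dropped entirely, so a shuffled input may be preferable in practice.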
III-C2 Settings and metrics
Per convention in face verification, we offer 3 modes (i.e., the same as in task 1 listed in Section III-B2). The metric used is, again, verification accuracy, which is first calculated per triplet-pair type (i.e., FMD and FMS). Then, the weighted sum (i.e., average accuracy) determines the leader-board.
III-C3 Baseline results
Baseline results are shown in Table IV. A score was assigned to each triplet in the validation and test sets using the formula

$$\mathrm{score}_i = \mathrm{avg}\big(\cos(F_i, C_i), \cos(M_i, C_i)\big),$$

where $F_i$, $M_i$, and $C_i$ are the feature vectors of the father, mother, and child images, respectively, from the $i$-th triplet. Scores were compared to a threshold to infer a label (i.e., predict KIN if the score was above the threshold; else, NON-KIN). The threshold was found experimentally on the val set and then applied to the test set (Table IV).
III-D Search and retrieval
T-3 is posed as a many-to-many task, i.e., there are one-to-many samples per subject (Fig. 3). Thus, we imitate template-based evaluations on the probe side, but faces in the gallery are not labeled by subject. The goal is to find relatives of search subjects (i.e., probes) in a search pool (i.e., gallery).

Table V

Split | Count | Probe | Gallery | Total |
---|---|---|---|---|
train | I | – | 3,021 | 3,021 |
train | F | – | 571 | 571 |
train | S | – | 15,845 | 15,845 |
val | I | 192 | 802 | 994 |
val | F | 192 | 192 | 192 |
val | S | 1,086 | 4,030 | 5,116 |
test | I | 190 | 783 | 973 |
test | F | 190 | 190 | 190 |
test | S | 1,487 | 31,787 | 33,274 |

Kin information, as a search cue, can be leveraged to improve conventional FR search systems, or even serve as prior knowledge for mining social or family relationships in services such as Ancestry.com. However, the task is most directly related to missing persons. Thus, we formulate it as such.
T-3 mimics finding parents and other relatives of unknown, missing children. The gallery contains 31,787 facial images from 190 families (Fig. 4): inputs are subject labels (i.e., probes), and outputs are ranked lists of all faces in the gallery. The number of relatives varies for each subject, ranging anywhere from 0 to 20+. Furthermore, probes have one-to-many samples; how best to fuse the samples of a probe is an open research question. This many-to-many task is currently set up in closed form (i.e., all probes have relative(s)).
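To make the input-output contract concrete, the sketch below fuses the feature vectors of a probe by element-wise averaging and ranks every gallery face by cosine similarity to the fused template. Mean fusion is only one simple option for the open fusion question noted above, and all function names and toy features are illustrative assumptions.

```python
import math


def cosine_similarity(p1, p2):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    return dot / (norm1 * norm2)


def mean_fuse(features):
    """Fuse several feature vectors of one probe by element-wise averaging."""
    count = len(features)
    return [sum(values) / count for values in zip(*features)]


def rank_gallery(probe_features, gallery):
    """Return gallery face IDs sorted from most to least similar to the probe.

    probe_features: list of feature vectors for one probe subject.
    gallery: dict mapping face ID to feature vector.
    """
    template = mean_fuse(probe_features)
    scores = {face_id: cosine_similarity(template, feature)
              for face_id, feature in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-D features standing in for 512-D face encodings.
probe_features = [[1.0, 0.0], [0.8, 0.2]]
gallery = {"g1": [1.0, 0.1], "g2": [0.0, 1.0], "g3": [0.5, 0.5]}
ranking = rank_gallery(probe_features, gallery)
print(ranking)
```

The output is a full ordering of the gallery, which is the form the retrieval metrics expect for each probe.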
III-D1 Data splits
This task is composed of search subjects (i.e., probes) from different families. Probes are supported by several samples of the query subject, a text description of the family (e.g., ethnicity, some relationships between selected members, etc.), and a list of relatives present in the gallery. The test set consists only of sets of images for the probes. Again, three disjoint sets were split (Table V).
Table VI

Run ID | Network(s) | mAP | Rank@5 |
---|---|---|---|
Baseline-2 | Sphereface | 0.016 | 0.098 |

III-D2 Evaluation settings
Each subject (i.e., probe) is searched independently, with 190 in total: hence, 190 families make up the test set. Probes have one-to-many faces. Following the template conventions of other many-to-many face evaluations, facial images for unique subjects are separated by identity, with a gallery containing a variable number of relatives, each with a variable number of faces [22].
Teams were allowed to submit up to six final submissions, each consisting of a ranked list of all subjects in the gallery for every probe in the test set. Submissions were accompanied by a brief (text) description of the system used to generate results. Per RFIW rules, participants were not permitted to analyze test results, as this was the purpose of the 192 families provided as the val set.
Evaluation Metric
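Mean average precision (MAP) was the underlying metric used for comparisons. Mathematically speaking, scores for each of the missing children are calculated as follows:

$$\mathrm{AP}(f) = \frac{1}{P_F} \sum_{tp=1}^{P_F} \mathrm{Prec}(tp) = \frac{1}{P_F} \sum_{tp=1}^{P_F} \frac{tp}{\mathrm{rank}(tp)},$$

where average precision (AP) is a function of family $f$ with a total of $P_F$ true positives. We then average all AP scores to determine the overall MAP score as follows:

$$\mathrm{MAP} = \frac{1}{N} \sum_{f=1}^{N} \mathrm{AP}(f),$$

where $N$ is the number of families (i.e., probes).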
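The metric can be sketched in a few lines. The sketch assumes each probe comes with a ranked list of gallery face IDs and a set of the IDs that are its true relatives; the function names and toy rankings are hypothetical, and the official scoring code may differ in details such as tie handling.

```python
def average_precision(ranked_ids, relatives):
    """Average precision of one ranked list given the set of true relatives."""
    relatives = set(relatives)
    if not relatives:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, face_id in enumerate(ranked_ids, start=1):
        if face_id in relatives:
            hits += 1
            precision_sum += hits / rank  # precision at this true positive
    return precision_sum / len(relatives)


def mean_average_precision(rankings, ground_truth):
    """Average the per-probe AP scores over all probes."""
    scores = [average_precision(rankings[probe], ground_truth[probe])
              for probe in rankings]
    return sum(scores) / len(scores)


# Toy rankings for two hypothetical probes.
rankings = {"probe1": ["a", "x", "b", "y"], "probe2": ["x", "c"]}
ground_truth = {"probe1": {"a", "b"}, "probe2": {"c"}}
ap1 = average_precision(rankings["probe1"], ground_truth["probe1"])
ap2 = average_precision(rankings["probe2"], ground_truth["probe2"])
map_score = mean_average_precision(rankings, ground_truth)
print(ap1, ap2, map_score)
```

For the first probe the relatives appear at ranks 1 and 3, giving precisions of 1/1 and 2/3, so AP is their mean; the second probe has its only relative at rank 2, giving an AP of 1/2.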
III-D3 Baseline results

IV Discussion
IV-A A broader impact
The fourth RFIW gained fair attention. Task 1, kinship verification, saw the most activity (10+ submissions). Tracks 2 (i.e., tri-subject verification) and 3 (i.e., search and retrieval) were both supported by RFIW for the first time, are more complex than the classic T-1, and are practically motivated. Submissions for all tracks surpassed the baselines by notable margins (leader-board coming soon).
The scope of kin-based problems is much wider than RFIW. Specifically, work on applications (e.g., generative tasks [6, 15]) and experimental settings [8] focuses on particular views of the visual kinship recognition problem. The tasks of RFIW were deemed appropriate given their difficulty and practicality; how best to formulate the problem is itself an open research question.
IV-B Conclusion
This paper presented the 2020 RFIW challenge, organized in conjunction with FG. The 2020 challenge is the fourth edition of the annual RFIW evaluation. For this edition, we added two new tracks, tri-subject verification and search & retrieval of missing children; the traditional kinship verification task continued to be supported as well. The FIW dataset was used to pose each of the challenge tracks. Challenging as the tasks may be, many entries outperformed the “vanilla” baselines in all tasks. Regardless, in all three cases, there remains much room for improvement. Accuracy on the verification and tri-subject verification tasks has just begun to approach the 80% mark, with search & retrieval further behind. Code and baselines are available online (github.com/visionjo/pykinship). RFIW supports research efforts. As we see it, the story of FIW is still in its infancy.
References
- [1] Q. Duan and L. Zhang. Advnet: Adversarial contrastive residual net for 1 million kinship recognition. In Proceedings on RFIW Workshop in ACM MM, 2017.
- [2] Q. Duan and L. Zhang. Advnet: Adversarial contrastive residual net for 1 million kinship recognition. In Proceedings on RFIW Workshop in ACM MM, pages 21–29, 2017.
- [3] I. Ö. Ertugrul and H. Dibeklioglu. What will your future child look like? modeling and synthesis of hereditary patterns of facial dynamics. In IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2017.
- [4] R. Fang, A. Gallagher, T. Chen, and A. Loui. Kinship classification by modeling facial feature heredity. In International Conference on Image Processing (ICIP). IEEE, 2013.
- [5] R. Fang, K. D. Tang, N. Snavely, and T. Chen. Towards computational models of kinship verification. In International Conference on Image Processing (ICIP). IEEE, 2010.
- [6] P. Gao, S. Xia, J. Robinson, J. Zhang, C. Xia, M. Shao, and Y. Fu. What will your child look like? dna-net: Age and gender aware kin face synthesizer. arXiv:1911.07014, 2019.
- [7] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, UMass, Amherst, 2007.
- [8] C. Kumar, R. Ryan, and M. Shao. Adversary for social good: Protecting familial privacy through joint adversarial attacks. In Conference on Artificial Intelligence (AAAI), 2020.
- [9] Y. Li, J. Zeng, J. Zhang, A. Dai, M. Kan, S. Shan, and X. Chen. Kinnet: Fine-to-coarse deep metric learning for kinship verification. In Proceedings on RFIW Workshop in ACM MM, pages 13–20, 2017.
- [10] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [11] J. Lu, J. Hu, V. E. Liong, X. Zhou, A. Bottino, I. Ul Islam, T. Figueiredo Vieira, X. Qin, X. Tan, S. Chen, et al. The fg 2015 kinship verification in the wild evaluation. In IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2015.
- [12] J. Lu, X. Zhou, Y.-P. Tan, Y. Shang, and J. Zhou. Neighborhood repulsed metric learning for kinship verification. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 36(2), 2014.
- [13] S. Xia, M. Shao, and Y. Fu. Genealogical face recognition based on ub kinface database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, 2011.
- [14] H. V. Nguyen and L. Bai. Cosine similarity metric learning for face verification. In Asian Conference on Computer Vision. Springer, 2010.
- [15] S. Ozkan and A. Ozkan. Kinshipgan: Synthesizing of kinship faces from family photos by regularizing a deep face network. In International Conference on Image Processing (ICIP), 2018.
- [16] X. Qin, X. Tan, and S. Chen. Tri-subject kinship verification: Understanding the core of a family. CoRR, abs/1501.02555, 2015.
- [17] J. P. Robinson, M. Shao, Y. Wu, and Y. Fu. Families in the wild (fiw): Large-scale kinship image database and benchmarks. In ACM on International Conference on Multimedia (MM), 2016.
- [18] J. P. Robinson, M. Shao, Y. Wu, H. Liu, T. Gillis, and Y. Fu. Visual kinship recognition of families in the wild. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2018.
- [19] J. P. Robinson, M. Shao, H. Zhao, Y. Wu, T. Gillis, and Y. Fu. Recognizing families in the wild (rfiw). In Proceedings on RFIW Workshop in ACM MM, 2017.
- [20] M. Shao, C. Castillo, Z. Gu, and Y. Fu. Low-rank transfer subspace learning. In 2012 IEEE 12th International Conference on Data Mining, pages 1104–1109. IEEE, 2012.
- [21] S. Wang, J. P. Robinson, and Y. Fu. Kinship verification on families in the wild with marginalized denoising metric learning. In Conference on Automatic Face and Gesture Recognition (FG), 2017.
- [22] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. Adams, T. Miller, N. Kalka, A. K. Jain, J. A. Duncan, K. Allen, et al. Iarpa janus benchmark-b face dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, pages 90–98, 2017.
- [23] Y. Wu, Z. Ding, H. Liu, J. Robinson, and Y. Fu. Kinship classification through latent adaptive subspace. In Conference on Automatic Face and Gesture Recognition. IEEE, 2018.
- [24] S. Xia, M. Shao, J. Luo, and Y. Fu. Understanding kin relationships in a photo. IEEE Trans. on Multimedia, 14(4):1046–1056, 2012.
- [25] K. Zhang, Y. Huang, C. Song, H. Wu, and L. Wang. Kinship verification with deep convolutional neural networks. In Proceedings of the British Machine Vision Conference (BMVC), pages 148.1–148.12. BMVA Press, September 2015.