, which is valuable in applications such as finding missing children, criminal pursuit, and social media analysis. In the field of computer vision, most existing kinship-related works focus on kinship identification [16, 12, 31, 32, 47], i.e., judging whether a given pair of faces are kin. Very few works except [13], [14], and [34] focus on kinship face synthesis, i.e., synthesizing a child face given a parent face, which was first proposed by Ertuğrul et al. [13]. Kinship face synthesis is referred to as descendant face synthesis in this paper since we focus on the kinship between parents and children.
The drawbacks of these methods are summarized as follows. First, modeling a one-versus-one relation ignores complementary information from the other parent face, since a child face resembles both parent faces [1]. Second, they lack control over the resemblance of the synthesized face to the parent faces because they implicitly learn the connection between one parent face and one child face without explicit emphasis on the resemblance. Third, in the training data one input face might correspond to multiple output faces because a couple could have several children of the same gender. This one-to-many ambiguity can confuse the model during training. Fourth, they have control over the gender of the synthesized face, but no control over its age.
To alleviate the above issues, we propose a novel method that models the two-versus-one relation between two parent faces and one child face for controllable descendant face synthesis based on generative adversarial networks. It has explicit control over the resemblance of facial components between the synthesized face and the parent faces, and also has control over age and gender. Note that the two-versus-one relation has been studied only for kinship verification [36], but not for descendant face synthesis. As shown in Fig. 2(b), our framework consists of two modules, i.e., an inheritance module and an attribute enhancement module. The former is designed to control the resemblance of facial components between the synthesized face and the parent faces. If a component of a child face resembles that of the father face, we say that the child inherits the component from the father. This module generates high-quality intermediate faces according to the control vector of the inheritance of facial components. Though a couple might have multiple children, specifying the inheritance makes a pair of parent faces correspond to almost only one child face during training, which alleviates the third issue above. The latter is designed to enhance age and gender on the intermediate faces. Both modules are jointly learned in an end-to-end manner.
Currently, there is no large-scale database with kinship annotations of father-mother-child triplets. TSKinFace [36] contains only 1015 tri-subject groups. Families in the Wild (FIW) [37] has a large set of pairwise kinship annotations such as father-son, mother-son, etc., but it has only 2059 tri-subject groups. These are not enough to train a deep network to model the two-versus-one relation. Hence, we propose an effective strategy for model learning without using the ground truth descendant faces by exploiting low-quality synthetic faces and the designed component exchange strategy. Fig. 1 shows the descendant faces of two generations generated by our method with control over the inheritance of components, gender, and age.
Our primary contributions are summarized as follows:
- We propose a novel method to model the two-versus-one kin relation for controllable descendant face synthesis. It has explicit control over the resemblance of facial components between the synthesized face and its parent faces and also has control over age and gender.
- We propose an effective strategy for model learning by exploiting low-quality synthetic faces and the component exchange to compensate for the lack of a large-scale database with father-mother-child kin annotation.
2 Related Work
Face synthesis. Great improvements have been achieved in several sub-areas of face synthesis on the basis of GANs [18], including face reconstruction [22, 28], face swap [4, 27], facial attribute manipulation [46, 6], face makeup transfer [30, 5], and face aging [49, 45]. These methods aim to modify local facial regions according to a specified attribute, to swap the whole face region, to transfer makeup from a specified template, or to generate faces at different age stages. However, they do not focus on generating descendant faces.
Kinship verification. Most previous studies on kin relation focus on kinship verification [16], including pairwise kinship [44, 10, 47, 17, 32, 29, 31, 48, 43] and triplet-wise kinship [36, 15]. To exploit temporal information, video-based kinship verification methods are proposed in [11] and [12]. Kinship is also used to assist the learning of age progression in [40]. These methods aim to judge whether a given pair or triplet of faces are kin, rather than to synthesize a descendant face.
Kinship (descendant) face synthesis. Very few works have studied kinship face synthesis except [13], [14], and [34]. [34] uses a GAN for descendant face synthesis and uses a gender label to control the gender of the synthesized face. [14] uses four auto-encoders to model the relations of father-son, father-daughter, mother-son, and mother-daughter, respectively. The gender is controlled by selecting one of the four auto-encoders. Both methods aim to generate one child face given only one parent face by modeling the one-versus-one kin relation. The synthesized face is supposed to be the same as the ground truth child face. They simply treat the parent face as the input of an auto-encoder or GAN and use the child face as the output to learn a direct mapping between them, as shown in Fig. 2(a). However, the image quality of the visual results is poor in [13], [14], and [34]. They do not preserve the resemblance between the parent face and the ground truth child face well, which defeats their original purpose. In contrast, our method synthesizes a descendant face by modeling the two-versus-one relation with explicit control over the resemblance to the parent faces as well as control over age and gender, while [14] and [34] model the one-versus-one relation by implicitly learning the mapping from one parent face to one child face without any guarantee on the resemblance or the age.
3 The Proposed Approach
We propose a novel method to model two-versus-one kin relation for descendant face synthesis with control over the resemblance of facial components between the synthesized face and its parent faces as well as control over age and gender. The framework of the proposed method is shown in Fig. 3. We first introduce the strategy for learning without using the ground truth descendant faces in Sec. 3.1. Then, we present the structures of two modules in Sec. 3.2 and Sec. 3.3 followed by the designed losses in Sec. 3.4.
3.1 Learning without using the ground truth descendant faces
Previous studies mainly focus on kinship verification [16], but very few works focus on descendant face synthesis except [14] and [34]. Existing databases such as Families in the Wild (FIW) [37], TSKinFace [36], Sibling-Face [20], Family 101 [15] and KinFaceW-I/II [32] are constructed for kinship verification. Since most works on kinship verification aim to identify pairwise kinship, most databases contain only pairwise kinship annotations. They can be used for one-versus-one descendant face synthesis [14, 34], but are not applicable to our two-versus-one descendant face synthesis. The largest databases with triplet-wise father-mother-child annotations are FIW and TSKinFace, which contain only 2059 and 1015 tri-subject groups, respectively. They are not enough to train a deep model that contains millions of parameters.
We propose a strategy for learning without the ground truth of descendant faces by decomposing the task into two sub-tasks and leveraging low-quality synthetic faces. One sub-task is to control the resemblance of facial components between the synthesized face and its parent faces, which is handled by the inheritance module. The other is to control age and gender, which is handled by the attribute enhancement module. To supervise the learning of the inheritance module, we exchange facial components of parent faces according to the control vector of inheritance to generate synthetic faces. The selected components of one parent face are replaced with the corresponding components from the other parent face by using color correction [7]. Each patch of each component is divided by a Gaussian blur of itself and then multiplied by a Gaussian blur of the target face. Note that the quality of the synthetic faces is low since there are noticeable artifacts around facial components. We use the low-quality synthetic faces as the input of the inheritance module.
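The blur-ratio color correction described above can be sketched in a few lines. This is a minimal illustration on single-channel patches stored as nested lists, not the authors' implementation; the function names, the kernel radius, `sigma`, and the `eps` guard against division by zero are all assumptions made for the example.

```python
import math

def gaussian_kernel(radius, sigma):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(image, radius=2, sigma=1.0):
    """Separable Gaussian blur of a 2-D list of floats (edge pixels are replicated)."""
    kernel = gaussian_kernel(radius, sigma)
    h, w = len(image), len(image[0])
    clamp = lambda v, size: max(0, min(size - 1, v))
    # Horizontal pass.
    rows = [[sum(kernel[k + radius] * image[y][clamp(x + k, w)]
                 for k in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]
    # Vertical pass.
    return [[sum(kernel[k + radius] * rows[clamp(y + k, h)][x]
                 for k in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]

def color_correct(source_patch, target_patch, eps=1e-6):
    """Divide the source patch by its own blur, then multiply by the blur of the target."""
    src_blur = blur(source_patch)
    tgt_blur = blur(target_patch)
    return [[source_patch[y][x] / (src_blur[y][x] + eps) * tgt_blur[y][x]
             for x in range(len(source_patch[0]))]
            for y in range(len(source_patch))]
```

For a uniformly dark source patch pasted onto a uniformly bright target region, the corrected patch takes on the brightness of the target while keeping the (here trivial) detail of the source. A color image would apply the same operation per channel.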
Inside the inheritance module, facial components are exchanged back in the latent space according to the same control vector. The intermediate face generated by the decoder is then compared with the original face to provide supervision. Let denote the input male parent face and denote the female one. Let and denote faces after component exchange by color correction [7]. is a 5-bit binary control vector of the inheritance, of which bits correspond to facial components, including left eye&brow, right eye&brow, nose, mouth and profile. A bit value of 0 means that the corresponding facial component inherits from the male face, while a value of 1 means it inherits from the female face. ‘eye&brow’ means eye and brow are included in one patch. Let and denote the age and gender of the male face and and for the female. The generation of synthetic faces by component exchange can be represented as
where and are the inputs of the inheritance module.
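The exchange-and-exchange-back idea can be illustrated with a small sketch. It assumes that each face is a dictionary from component name to patch, that bit 0 selects the male component and bit 1 the female one (as in the ‘00110’ example of Sec. 4.2.1), and that one output follows the control vector while the other follows its inverse. The names `COMPONENTS`, `combine`, and `exchange` are hypothetical.

```python
COMPONENTS = ["left_eye_brow", "right_eye_brow", "nose", "mouth", "profile"]

def combine(first, second, control):
    """Take component i from `first` when bit i is 0 and from `second` when it is 1."""
    return {name: (second if bit else first)[name]
            for name, bit in zip(COMPONENTS, control)}

def exchange(male, female, control):
    """Return the face that follows `control` and the face that follows its inverse."""
    inverse = [1 - bit for bit in control]
    return combine(male, female, control), combine(male, female, inverse)

# Toy "patches": strings that record where each component came from.
male = {name: "M:" + name for name in COMPONENTS}
female = {name: "F:" + name for name in COMPONENTS}
control = [0, 0, 1, 1, 0]

synthetic_m, synthetic_f = exchange(male, female, control)
restored_m, restored_f = exchange(synthetic_m, synthetic_f, control)
```

Applying the same exchange twice returns the original pair, which is why the original faces can supervise the output of the inheritance module even though no ground truth child face is used.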
Please note that if a large scale database with the father-mother-child kinship annotation is available, our method can be easily extended to exploit the ground truth descendant faces by adding a reconstruction loss between them and the generated descendant faces instead of using the low-quality synthetic faces as input.
3.2 Inheritance module
The inheritance module is designed to control the resemblance of facial components between the synthesized face and its parent faces. The inputs of the module consist of three parts, i.e., a pair of parent faces, the control vector of inheritance, and the age and gender of each parent face. As shown in Fig. 3, parent faces are firstly decomposed into five facial components according to facial landmarks. Each component is represented by a patch. Note that the profile is the face image whose components are filled with black masks. Then each component is fed into an encoder individually to get its feature map. Since components have different appearances, we use individual encoders to capture their specific features of shape, color, and texture.
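As a rough illustration of the decomposition step, the sketch below crops four component patches from a face and builds the profile by blacking out those regions. The image is a plain 2-D list, and the bounding boxes are hypothetical values standing in for the boxes derived from facial landmarks; none of the names come from the paper.

```python
def crop(image, box):
    """Return the patch covered by box = (top, left, height, width)."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]

def decompose(image, boxes):
    """Split a face into component patches plus a profile with the components masked.

    `boxes` maps the four inner component names to (top, left, height, width).
    """
    patches = {name: crop(image, box) for name, box in boxes.items()}
    profile = [row[:] for row in image]  # copy so the input stays untouched
    for top, left, height, width in boxes.values():
        for y in range(top, top + height):
            for x in range(left, left + width):
                profile[y][x] = 0  # black mask over the component region
    patches["profile"] = profile
    return patches

# Toy 6x6 "face" with distinct non-zero pixel values and made-up boxes.
face = [[y * 6 + x + 1 for x in range(6)] for y in range(6)]
boxes = {
    "left_eye_brow": (0, 0, 2, 2),
    "right_eye_brow": (0, 4, 2, 2),
    "nose": (2, 2, 2, 2),
    "mouth": (4, 2, 1, 2),
}
parts = decompose(face, boxes)
```

Each of the five resulting patches would then be passed to its own encoder.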
The inheritance of facial components is performed by exchanging feature maps between the female and the male in the latent space according to the control vector. Two combinations of feature maps can be generated through the component exchange. One follows the control vector and the other follows the inverse of the control vector. The feature maps of each combination are integrated into a new feature map according to their positions in the input face. To incorporate the information of age and gender, we expand the labels of age and gender into two feature maps of the same size as the integrated feature map. Then these feature maps are concatenated with a noise feature map and fed into a decoder to generate an intermediate face. Let denote the output male face of the inheritance module and denote the female. The inheritance module can be represented as
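The label-expansion step of this module can be sketched as follows, under assumptions that are not stated in the text: the integrated feature map is a list of channels, each a 2-D list; age is given as a stage index and gender as 0 or 1; and the noise map is Gaussian. The function names are hypothetical.

```python
import random

def expand_label(value, height, width):
    """Broadcast a scalar label to a constant feature map of the given size."""
    return [[float(value)] * width for _ in range(height)]

def build_decoder_input(feature_maps, age, gender, rng=None):
    """Stack the integrated feature maps with age, gender and noise maps (channel axis)."""
    rng = rng or random.Random()
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    noise = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
    return feature_maps + [
        expand_label(age, height, width),
        expand_label(gender, height, width),
        noise,
    ]

# Four integrated channels of size 2x3, age stage index 2, gender label 1.
integrated = [[[0.0] * 3 for _ in range(2)] for _ in range(4)]
decoder_input = build_decoder_input(integrated, age=2, gender=1, rng=random.Random(0))
```

The decoder then sees three extra channels, so the same integrated features can be decoded under different age and gender conditions.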
3.3 Attribute enhancement module
The attribute enhancement module is used to enhance gender and age on the intermediate faces from the inheritance module. The intermediate faces are encoded into a latent space by an encoder. The latent features are concatenated with the expanded vectors of age and gender and then fed into a decoder to generate the final descendant faces. The attribute enhancement module can be represented as
where and are the final descendant faces.
3.4 Losses for joint learning of both modules
WGAN [19] is used to train the generator of the inheritance module because of its training stability. The adversarial loss is
where is a randomly sampled image and is the hyperparameter of WGAN. is the discriminator that distinguishes real images from fake ones. As in [39], it outputs a 2 × 2 probability map instead of a single scalar value. As described in Sec. 3.1, by using synthetic faces and the exchange of components, the difference between the output faces of the inheritance module and the original faces can be used to provide supervision. The pixel-wise loss is defined as
where and . Since facial components of the intermediate face are inherited from parent faces that could be very different in appearance and age, we use the information of age and gender to improve their consistency. We use ResNet18 [21] to build a pre-trained age classifier and a gender classifier to constrain the generator. The losses are defined as
where is the gender classifier and is the age classifier, which classifies four age stages (‘infant’, ‘teen’, ‘adult’, and ‘older adult’), i.e., ‘A’ (0-5), ‘B’ (6-15), ‘C’ (16-45), and ‘D’ (over 45). Note that the age can be divided into more groups with only a minor change in the number of output neurons of the age classifier. and are the age and gender labels of the input face . Besides, we use the pre-trained 19-layer VGG to compute the perceptual loss [41] to recover more facial details. The perceptual loss is defined as
where is the feature map obtained by the -th convolution layer before the -th maxpooling layer in VGG19. The total loss of the inheritance module is computed as
The conditional auto-encoder [3] is augmented with a discriminator to distinguish real images from fake ones in CAAE [49]. However, the generated images are generally ambiguous. Inspired by , to better preserve the identity and improve the quality of generated faces, we enhance CAAE with a perceptual loss for modeling both age and gender in the attribute enhancement module. The loss of the discriminator is defined as
where is the discriminator to discriminate real faces from synthesized descendant faces. The reconstruction loss is defined by using the pixel-wise difference between the synthesized faces and the input faces, i.e.,
Following the same definition as , the perceptual loss is
The total loss of the attribute enhancement module is
The full objective function of the joint learning of both modules is defined as
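For reference, the gradient-penalty form of the Wasserstein critic loss (from "Improved training of Wasserstein GANs" in the reference list), on which the adversarial term of this section builds, is commonly written as below. The symbols follow that reference and are assumed here for illustration; they are not necessarily the notation used in this paper.

```latex
\mathcal{L}_{D} =
  \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\big[ D(\tilde{x}) \big]
  - \mathbb{E}_{x \sim \mathbb{P}_r}\big[ D(x) \big]
  + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}
    \Big[ \big( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \big)^2 \Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1].
```

Here $x$ is a real image, $\tilde{x}$ is a generated image, $\hat{x}$ is sampled at random on the line between them, and $\lambda$ weights the gradient penalty. This matches the description of a randomly sampled image and a WGAN hyperparameter in the adversarial loss of the inheritance module.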
Datasets. CelebAHQ [24] is a high-resolution database from which 11,915 female subjects and 7,756 male subjects are collected for our task. SiblingDB [42] is a high-resolution database from which 77 female subjects and 103 male subjects are collected. Each subject has one image. We use part of the images in both databases for training and the rest for testing. There is no overlap between the training and testing sets. TSKinFace [36] is a database with father-mother-child kinship annotations, which contains only 1015 tri-subject groups. It is used to compare the generated faces with the ground truth children faces. Note that our method does not require the input pair of parents to be a true couple during training, so the face of any male and the face of any female can be paired and fed into our network, which enables us to construct a large set of pairs for model learning.
for age estimation. Faces in the two databases are aligned according to the positions of the two eye centers, and then cropped and resized into the size of. After face alignment, the positions and sizes of bounding boxes of facial components are determined. The sizes are , , , , and for left eye&brow, right eye&brow, nose, mouth, and face profile, respectively. The inputs of our network are a pair of parent faces, a control vector of inheritance, and the age and gender labels of the parent faces. An image pair consists of a female face and a male face, which are randomly paired within each database. We randomly generate 76,800 female-male face pairs and control vectors in SiblingDB, and about 4 million image pairs and control vectors in CelebAHQ.
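Because any female face can be paired with any male face, the training pairs and their control vectors can be drawn at random. The sketch below shows one way to do this; the identifier lists, the function name `sample_pairs`, and the seed are placeholders rather than details from the paper.

```python
import random

def sample_pairs(female_ids, male_ids, num_pairs, seed=0):
    """Draw random (female, male, control_vector) training samples.

    The control vector has five bits, one per facial component.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_pairs):
        control = [rng.randint(0, 1) for _ in range(5)]
        samples.append((rng.choice(female_ids), rng.choice(male_ids), control))
    return samples

# Toy identifiers standing in for image paths in one database.
females = ["f0", "f1", "f2"]
males = ["m0", "m1"]
pairs = sample_pairs(females, males, num_pairs=10)
```

With 77 female and 103 male subjects, as in SiblingDB, sampling with replacement in this way can yield far more distinct (pair, control vector) combinations than there are images.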
Structure. The decoder and encoder of the inheritance module have residual blocks [21]. The encoder and decoder of the attribute enhancement module have convolution layers and fully connected layer. Each convolution layer is followed by a max-pooling layer. The details of the networks are presented in the supplementary material.
Training. We jointly learn both modules of DFS-GAN. The hyperparameters are , , , , , and . We use Adam [26] for optimization. The learning rate is 0.0001 and the batch size is 8. The attribute enhancement module is pre-trained on UTKFace [49] and the training sets of SiblingDB and CelebAHQ. UTKFace [49] is used to compensate for the imbalanced age distribution in SiblingDB and CelebAHQ. For joint learning, we update the attribute enhancement module once every 500 training iterations of the inheritance module.
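The alternating update schedule can be sketched as a simple loop. The actual optimizer steps are replaced by counters here, and the choice to trigger the attribute update on iterations 500, 1000, and so on (counting from 1) is an assumption; the function names are hypothetical.

```python
def updates_attribute_module(iteration, period=500):
    """Return True on iterations where the attribute enhancement module is also updated."""
    return iteration > 0 and iteration % period == 0

def run_schedule(num_iterations, period=500):
    """Count how many update steps each module receives under the schedule."""
    steps = {"inheritance": 0, "attribute": 0}
    for iteration in range(1, num_iterations + 1):
        steps["inheritance"] += 1  # placeholder for one inheritance-module step
        if updates_attribute_module(iteration, period):
            steps["attribute"] += 1  # placeholder for one attribute-module step
    return steps

schedule = run_schedule(2000)
```

Over 2000 iterations the inheritance module is updated 2000 times and the pre-trained attribute enhancement module only 4 times.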
Ablation study. We have four types of losses: adversarial loss (AD), pixel loss (PI), age and gender control loss (AG), and perceptual loss (PE). Results of using different losses are shown in Fig. 7, including AD+PI, AD+PI+AG, and AD+PI+AG+PE. AD+PI is the baseline. AG and PE are used for further enhancement. The performance gets better in terms of image quality and facial details when adding AG and PE gradually. PE contributes more than AG.
4.2 Visual results
4.2.1 Control over the inheritance of components
The control vector of inheritance determines the resemblance of facial components between the descendant face and its parent faces. Fig. 4 shows the descendant faces generated by our method for the same parent faces under different control vectors. Our observations are summarized as follows. Firstly, the combination of facial components exactly follows the specified control vector. The descendant faces preserve the similarity of components to the corresponding components of their parent faces. For example, the vector ‘00110’ means that the left and right eye&brows inherit from the male, the nose and mouth inherit from the female, and the profile inherits from the male. Comparing each component of the synthesized face with those of the parent faces, the shape and texture of the eye&brow and profile remain similar to the father, while the nose and mouth remain similar to the mother. Secondly, a descendant face under one control vector can be distinguished from the descendant face under another control vector by its facial appearance. Thirdly, though the texture, shape, color, and lighting of the two parent faces are very different, our method fuses the inherited components harmoniously on the descendant faces. These observations show that our method has accurate control over the inheritance of facial components and can generate harmonious descendant faces that retain appearance details.
4.2.2 Control over the age and gender
We use the attribute enhancement module to control the age and gender of the descendant face. Fig. 5 presents descendant faces with the specified control vector of inheritance at four different age stages. The results show that our model captures the distinctive appearance features of different age stages, including the shape and size of facial components, the wrinkles, the color of the lips, and the tightness and glossiness of the skin. For example, in the first row of the left figure, the tightness and glossiness of the synthesized face decrease as the age increases and the wrinkles become more noticeable. The eyes of a child face are larger and brighter than those of an older adult. Besides, the redness of the lips decreases as the age increases.
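The four age stages used for age control (Sec. 3.4) amount to a simple lookup from age to stage label. The sketch below assumes that stage ‘D’ covers every age above 45; the names `STAGE_UPPER_BOUNDS` and `age_stage` are hypothetical.

```python
# Upper age bound (inclusive) of stages 'A', 'B' and 'C'; 'D' is everything above.
STAGE_UPPER_BOUNDS = [("A", 5), ("B", 15), ("C", 45)]

def age_stage(age):
    """Map an age in years to one of the four stage labels 'A', 'B', 'C', 'D'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for label, upper in STAGE_UPPER_BOUNDS:
        if age <= upper:
            return label
    return "D"  # assumed: all ages above 45

stages = [age_stage(a) for a in (3, 6, 15, 16, 45, 46)]
```

Splitting the age range into more groups only requires extending the table, which mirrors the remark that the age classifier needs just a change in its number of output neurons.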
For the evaluation of control over gender, Fig. (a) presents the synthesized faces with different genders given the specified control vector. Fig. (b) illustrates the evolution of gender from a female face to a male face. The results show that our method is able to capture the differences in facial appearance between female and male descendant faces in terms of the beard, the thickness of the brows, and the texture of the skin. For example, as shown in Fig. (a), the beard of the male descendant face becomes much more noticeable than that of the female as the age increases. The brows of the male face are thicker than those of the female. The cheeks of the female face are plumper than those of the male face. The evolution of these details can be observed in Fig. (b). As the evolution progresses, male features such as the beard, the cheeks, and the thickness of the brows become more noticeable on the descendant face. These visual results demonstrate the capability of our method to control gender.
4.2.3 Enrich the diversity of descendant faces
As shown in Fig. 4, our method has accurate control over the inheritance of each facial component. However, descendant faces that differ in only one component look similar when the other components are nearly the same. Fortunately, our model is flexible enough to enrich the diversity of the facial appearance of descendant faces by introducing random noise in the latent feature space. During the phase of component exchange in the inheritance module, we can select one or multiple facial components and add random noise to their latent features to increase the diversity of facial appearance.
Fig. 8 shows the results of adding different noise to an individual component as well as to all components. When different noise is added to an individual component, the change in its appearance is noticeable. When we add different noise to all components, as shown in the last row of Fig. 8, the descendant faces differ in their overall facial appearance. The results show that we can increase the diversity of descendant faces by simply using random noise in our model during inference.
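The noise injection can be sketched as follows, assuming for illustration that the latent features of each component are a flat list of floats and that the noise is zero-mean Gaussian. The function name `perturb`, the noise scale `sigma`, and the seed are hypothetical.

```python
import random

def perturb(latents, selected, sigma=0.1, seed=None):
    """Add Gaussian noise to the latent features of the selected components only."""
    rng = random.Random(seed)
    perturbed = {}
    for name, features in latents.items():
        if name in selected:
            perturbed[name] = [v + rng.gauss(0.0, sigma) for v in features]
        else:
            perturbed[name] = list(features)  # untouched copy
    return perturbed

latents = {"nose": [0.0, 1.0], "mouth": [2.0, 3.0]}
noisy = perturb(latents, selected={"nose"}, seed=1)
```

Only the selected component changes, so repeated calls with different seeds give descendant faces that vary in that component while the rest of the face is held fixed.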
4.3 Comparison with the state-of-the-art
The comparison with [14] and [34] on the TSKinFace database [36] is shown in Fig. (i). Since [14] and [34] generate a descendant face given only one parent face, we present their results for father-son (F-S), father-daughter (F-D), mother-son (M-S), and mother-daughter (M-D). As our model generates a descendant face given a pair of parent faces, we present the results for parents-son (P-S) and parents-daughter (P-D).
Our observations are summarized as follows. Firstly, our method achieves much better image quality than the competing methods, since we use carefully designed modules while they simply exploit an auto-encoder or GAN. They also encounter the issue that one input corresponds to multiple outputs during training, which confuses the network. We use the control vector of inheritance to alleviate this issue. Secondly, descendant faces generated by our method have higher similarity to the ground truth descendant faces. Our method also keeps a better resemblance between the generated face and its parent faces. Thirdly, our method has better diversity in terms of the profile and the appearance of facial components. The profiles of the faces synthesized by [34] are almost the same given different input faces.
4.4 Quantitative evaluation
Kinship verification. To quantitatively evaluate the proposed method, we perform a cross-database kinship verification experiment. We train the kinship classifier [48] on FIW [37] and test on TSKinFace [36]. The two databases have no overlap. We apply our method and the competing methods [14, 34] on TSKinFace [36] to synthesize descendant faces and then generate F-S, F-D, M-S, and M-D pairs for testing. The verification results are shown in Table 2. Our method achieves much better verification accuracy than the other methods. Our result is slightly worse than using the ground truth children faces for verification. The results further demonstrate the effectiveness of the proposed method.
To verify whether synthesized faces can be distinguished from parent faces, we use off-the-shelf face recognition models to perform face verification on TSKinFace, including VGG-Face [35], Microsoft Face API [33], and Amazon Rekognition API [2]. Verification results are shown in Table 2. The accuracies of the three models are low, which shows that most of the synthetic descendant faces can be distinguished from their parent faces.
User study evaluation. To further demonstrate the effectiveness of the proposed method, we perform three user studies to evaluate our method in terms of the resemblance of facial components, age estimation, and gender recognition. For the resemblance, the average accuracy of identifying which parent face each component of the descendant face comes from is . The accuracy of ranking ages of descendant faces at four stages is . The accuracy of gender recognition is . The user studies show that our method can capture the resemblance of a descendant face to its parent faces and capture the difference of facial appearance under different ages and genders. Detailed settings and results are presented in the supplementary material.
We propose a novel method to model two-versus-one kin relation for controllable descendant face synthesis with explicit control over the resemblance between the synthesized face and its parent faces as well as control over age and gender. Our model contains an inheritance module for controlling the resemblance and an attribute enhancement module for controlling age and gender. As the databases with father-mother-child kinship annotation are relatively small, we propose an effective strategy for model learning by using low-quality synthetic faces instead. Evaluations including visual results and quantitative evaluations demonstrate the effectiveness of our method.
-  (2007) Differential facial resemblance of young children to their parents: who do children look like more?. Evolution and Human behavior 28 (2), pp. 135–144. Cited by: §1, §1.
-  Amazon Rekognition API. Note: https://aws.amazon.com/rekognition/ Cited by: §4.4, Table 2.
-  (2012) Autoencoders, unsupervised learning, and deep architectures. In ICML workshop, Cited by: §3.4.
-  (2008) Face swapping: automatically replacing faces in photographs. In TOG, Cited by: §2.
-  (2018) Pairedcyclegan: asymmetric style transfer for applying and removing makeup. In CVPR, Cited by: §2.
-  (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, Cited by: §2.
-  Color balance. Note: https://en.wikipedia.org/wiki/Color_balance (Wikipedia, 2018-10-17) Cited by: §3.1, §3.1.
-  (2006) Where are kin recognition signals in the human face?. Journal of Vision 6 (12), pp. 2–2. Cited by: §1.
-  (2009) Kin recognition signals in adult faces. Vision research 49 (1), pp. 38–43. Cited by: §1.
-  (2014) Who do i look like? determining parent-offspring resemblance via gated autoencoders. In CVPR, Cited by: §2.
-  (2013) Like father, like son: facial expression dynamics for kinship verification. In ICCV, Cited by: §2.
-  (2017) Visual transformation aided contrastive learning for video-based kinship verification. In ICCV, Cited by: §1, §2.
-  (2017) What will your future child look like? modeling and synthesis of hereditary patterns of facial dynamics. In FG, pp. 33–40. Cited by: §1, §2.
-  (2018) Modeling and synthesis of kinship patterns of facial expressions. IVC. Cited by: Figure 2, (a)a, §1, §2, §3.1, §4.3, §4.4, (c)c, (g)g, Table 2, Table 2.
-  (2013) Kinship classification by modeling facial feature heredity. In ICIP, Cited by: §2, §3.1.
-  (2018) Modeling of facial aging and kinship: a survey. IVC 80, pp. 58–79. Cited by: §1, §2, §3.1.
-  (2014) Family verification based on similarity of individual family member’s facial segments. Machine Vision and Applications. Cited by: §2.
-  (2014) Generative adversarial nets. In NIPS, Cited by: §2.
-  (2017) Improved training of wasserstein gans. Cited by: §3.4.
-  (2014) Graph-based kinship recognition. In ICPR, Cited by: §3.1.
-  (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.4, §4.1.
-  (2017) Beyond face rotation: global and local perception gan for photorealistic and identity preserving frontal view synthesis. In ICCV, Cited by: §2.
-  (2013) Children’s consideration of relevant and non-relevant facial features in kinship detection. L’Année psychologique 113 (3), pp. 321–334. Cited by: §1.
-  (2018) Progressive growing of gans for improved quality, stability, and variation. ICLR. Cited by: §4.1.
-  (2014) One millisecond face alignment with an ensemble of regression trees. In CVPR, Cited by: §4.1.
-  (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §4.1.
-  (2017) Fast face-swap using convolutional neural networks. ICCV. Cited by: §2.
-  (2018) Unsupervised holistic image generation from key local patches. ECCV. Cited by: §2.
-  (2017) Kinnet: fine-to-coarse deep metric learning for kinship verification. In Proceedings of the 2017 Workshop on Recognizing Families In the Wild, pp. 13–20. Cited by: §2.
-  (2016) Makeup like a superstar: deep localized makeup transfer network. IJCAI. Cited by: §2.
-  (2017) Discriminative deep metric learning for face and kinship verification. TIP. Cited by: §1, §2.
-  (2014) Neighborhood repulsed metric learning for kinship verification. TPAMI. Cited by: §1, §2, §3.1.
-  Microsoft face api. Note: https://azure.microsoft.com/en-au/services/cognitive-services/face/ Cited by: §4.4, Table 2.
-  (2018) Kinshipgan: synthesizing of kinship faces from family photos by regularizing a deep face network. In ICIP, Cited by: Figure 2, (a)a, §1, §2, §3.1, §4.3, §4.3, §4.4, (d)d, (h)h, Table 2, Table 2.
-  (2015) Deep face recognition.. In BMVC, Cited by: §4.4, Table 2.
-  (2015) Tri-subject kinship verification: understanding the core of a family. TMM. Cited by: §1, §1, §2, §3.1, §4.1, §4.3, §4.4, §4.4.
-  (2016) Families in the wild (fiw): large-scale kinship image database and benchmarks. In ACM MM, Cited by: §1, §3.1, §4.4.
-  (2015) Dex: deep expectation of apparent age from a single image. In ICCV workshops, Cited by: §4.1.
-  (2016) Learning from simulated and unsupervised images through adversarial training. CVRR. Cited by: §3.4.
-  (2016) Kinship-guided age progression. Pattern Recognition 59, pp. 156–167. Cited by: §2.
-  (2014) Very deep convolutional networks for large-scale image recognition. CVPR. Cited by: §3.4.
-  (2014) Detecting siblings in image pairs. The Visual Computer. Cited by: §4.1.
-  (2018) Cross-generation kinship verification with sparse discriminative metric. TPAMI. Cited by: §2.
-  (2014) Leveraging appearance and geometry for kinship verification. In ICIP, Cited by: §2.
-  (2018) Face aging with identity-preserved conditional generative adversarial networks. In CVPR, Cited by: §2.
-  (2018) ELEGANT: exchanging latent encodings with gan for transferring multiple face attributes. ECCV. Cited by: §2.
-  (2014) Discriminative multimetric learning for kinship verification. TIFS. Cited by: §1, §2.
-  (2015) Kinship verification with deep convolutional neural networks. In BMVC, Cited by: §2, §4.4.
-  (2017) Age progression/regression by conditional adversarial autoencoder. In CVPR, Cited by: §2, §3.4, §4.1.