A desirable capability for biometric identification is matching different biometric modalities to each other. For example, the recent speech2face study [oh2019speech2face] proposed a method to generate face images from speech input. This way, one can compare speech signals with the face images in a gallery, performing cross-modal person identification. Although audio-visual person identification datasets are abundant, speech and image signals have different characteristics, so it is still very challenging to learn a mapping between them. On the other hand, there are various visual biometric modalities, such as fingerprint, ear, iris, hand, and face, and since they are all from the visual domain, it should be easier to learn a mapping between them. In this paper, we focus on learning a deep mapping between face and ear modalities. We opted for face and ear modalities because multimodal person identification datasets are rather limited and because we were able to collect a large number of ear-face image pairs by utilizing the Multi-PIE [multi_pie] and FERET [FERET_dataset] datasets.
In humans, genotype is one of the main factors that determine the appearance of the face [peng2013detecting, claes2014modeling, richmond2018facial, srinivas2017dna2face] and other biometric traits. Several works have shown that a relationship between DNA information and facial appearance can be established [peng2013detecting, claes2014modeling, srinivas2017dna2face, richmond2018facial, crouch2018genetics, sero2019facial]. Since both ear and face are biometric traits of an individual and their phenotypes are determined by the genotype, we expect an implicit relationship between ear and face through genetic information [srinivas2017dna2face, crouch2018genetics, richmond2018facial, sero2019facial]. These previous works have motivated us to investigate the correlation between different visual biometric modalities, namely ear and face, and to learn a mapping between them in order to build a cross-modal biometric identification system.
In this work, we focus on learning this relationship between two different modalities, ear and face, using generative adversarial networks (GANs). Since GANs were proposed, they have demonstrated superior performance on many different tasks, such as image-to-image translation, e.g., sketch2image [hu2017now, sangkloy2017scribbler, chen2018sketchygan], style transfer [huang2017arbitrary, chen2018gated], and cross-modal learning [kim2017learning], e.g., speech2face [oh2019speech2face], DNA2Face [srinivas2017dna2face]. The works most relevant to our study are [oh2019speech2face, wav2pix2019icassp], in which the authors aim at learning the mapping between voice and face modalities.
The proposed GAN model, named Ear2Face, takes an ear image as input and generates a frontal face image, as shown in Figure 1. We formulated the problem as an image-to-image translation task and developed our model based on the model proposed in [isola2017image]. In addition, we benefited from feature and style reconstruction losses to enhance the learning capacity of the network. We summed the adversarial loss, pixel loss, feature reconstruction loss, and style reconstruction loss in the objective function, and we introduced two coefficients to restrict the effect of the feature and style reconstruction losses on the overall loss function. We also employed a coefficient for the pixel loss, as in [isola2017image]. For the experiments, we collected ear and frontal face image pairs from the Multi-PIE [multi_pie] and FERET [FERET_dataset] datasets, which are popular face datasets that also contain profile views of the subjects. Since we need to preserve identity information during reconstruction, we matched the ear and face images of each subject to perform paired training, and we then fed the network with these image pairs. In the training phase, we utilized the ResNet-50 [he2016deep] deep CNN model with weights trained on VGGFace2 [cao2018vggface2], which is a robust face recognition model, to extract features from the reconstructed and ground truth images in order to measure the feature reconstruction and style reconstruction losses. Finally, we evaluated our model with five different metrics in terms of the quality of the generated face images. We also conducted face recognition experiments using the reconstructed face images.
Our contributions can be summarized as follows:
We presented a novel study on biometric modality mapping. We formulated the problem as an image-to-image translation between images of different biometric modalities.
We showed that the genotype-based implicit relationship between ear and face can be learnt via GANs.
We created a GAN model based on [isola2017image] and added feature reconstruction and style reconstruction losses, in addition to the adversarial loss and pixel loss, to improve the quality of the generated images.
We used five different metrics to evaluate our method’s reconstruction performance. Besides, we conducted face recognition experiments using the reconstructed face images. Moreover, we performed cross-dataset experiments to assess the generalization capacity of our deep mapping model.
We showed that our model not only generates perceptually appealing face images but also preserves identity information.
The rest of the paper is organized as follows: In Section 2, we give brief information about previous work on ear biometrics, image-to-image translation, and cross-modal learning. In Section 3, we explain the proposed network, used loss functions, and the training procedure. In Section 4, we first present the datasets and evaluation metrics, then, we discuss the results of Ear2Face model on both datasets. We also present and discuss the obtained face recognition results using the reconstructed face images. Finally, we conclude our work in Section 5.
2 Related Work
Ear biometrics. Ear images have been utilized in many different works for the purpose of person identification [more_than_ziga, emervsivc2017training, our_ear_journal], age estimation [yaman2018age, yaman2019multimodal], and gender classification [pflug2012ear, abaza2013survey, gnanasivam2013gender, khorsandi2013gender, lei2013gender, yaman2018age, yaman2019multimodal]. All these works show the usefulness and effectiveness of the ear as a biometric trait. Besides, there exist some works [zhang2011hierarchical, yaman2019multimodal] that utilize both profile face and ear images together to improve the biometric system’s performance. The results indicate that profile face and ear have complementary features and that using them together leads to a performance improvement, especially for age and gender prediction.
Image2image translation. Since the generative adversarial network [goodfellow2014generative] was proposed, it has been adapted for different tasks beyond artificial image generation from noise. In particular, image-to-image translation is one of the important and popular fields in generative works. GANs are used in many different image-to-image translation works, such as domain transfer [isola2017image, zhu2017unpaired, choi2018stargan, zhang2018self, tang2019multi], super-resolution [ledig2017photo, lim2017enhanced, wang2018high], and style transfer [huang2017arbitrary, chen2018gated]. All these studies show the effectiveness of GAN models in learning a mapping between different domains for different purposes.
Cross-modal learning. The high performance of generative models on image generation and image-to-image translation tasks has led to their use in other areas, such as cross-modal learning. Image generation from text [reed2016generative, dash2017tac, zhang2017stackgan, xu2018attngan, zhu2019dm] and audio [chen2017deep, qiu2018image, hao2018cmcgan, wav2pix2019icassp, duan2019cascade, oh2019speech2face, wan2019towards, deng2020unsupervised] are the most common examples of cross-modal learning. The idea is to explore the relationship in the feature space between different modalities with generative models in order to translate these modalities to each other. In [wav2pix2019icassp], face images are reconstructed from speech data using conditional GANs in an end-to-end manner. In [oh2019speech2face], the authors aim at learning a mapping between face features and speech features. They convert the speech data to spectrograms, which are fed to a voice encoder that embeds them into the feature space. In this system, face and voice data are used in pairs and features are extracted from both the spectrogram and the face image. While the encoder network employed for embedding the voice data is the trainable part of the system, the other parts, namely the feature extractor and the decoder for face reconstruction, are well-known pretrained models. Moreover, the features of the voice and face data are utilized to calculate the loss during training. In [qiu2018image], music data is employed to generate scene images that represent the feeling evoked by the related audio. In [wan2019towards], a GAN-based model is employed to create higher-quality audio-image pairs. Other studies [chen2017deep, hao2018cmcgan, duan2019cascade] focus on audio data from musical instruments to reconstruct a scene with the related instrument using GANs. Two different GAN models are developed to generate images from audio and audio from images in [chen2017deep]. Unlike the previous study, in [hao2018cmcgan], a combined cyclic generative adversarial network, named CMCGAN, is proposed. Lastly, to enhance the quality of the coarse outputs and obtain fine-grained results, the authors of [duan2019cascade] proposed a two-stage GAN. For a detailed review of audio-image translation tasks, please refer to a recent survey [zhu2020deep].
3 Face Reconstruction
In this section, we explain the proposed GAN model and the loss functions employed in training. The proposed Ear2Face network is shown in Figure 2. While the generator part of this model takes an ear image as input and tries to reconstruct the face image, the discriminator is trained with real images and is responsible for discriminating between real and fake data. The pretrained VGGFace2 model [cao2018vggface2] is employed for feature extraction, and the pixel loss, feature reconstruction loss [johnson2016perceptual], and style reconstruction loss [gatys2015texture, gatys2015neural] are measured between the generated face image and the real face image in the pixel space and the feature space.
For our deep biometric modality mapping system, we employed a GAN architecture based on the pix2pix model [isola2017image] and named our model Ear2Face. In [isola2017image], the generator network is based on the U-Net architecture [ronneberger2015u] and includes skip connections. For the discriminator, an architecture similar to that of [li2016precomputed] is employed.
While the conditional generator network strives to generate artificial data that can deceive the discriminator, the discriminator network tries to learn the training data distribution and is responsible for discriminating between real and fake data. In this work, the generator takes random noise, $z$, and a source image, $x$, as input and learns the relationship between the input data and the target image, $y$, i.e., $G: \{x, z\} \rightarrow y$.
3.2 Loss functions
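Adversarial loss. The objective function of the conditional GAN is

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right] \tag{1}$$

where $G$ is the generator, $D$ is the discriminator network, and $y$ is the real target image. While an unconditional GAN tries to generate artificial data from random noise, $z$, alone, a conditional GAN receives an additional input, which in this work is a source image, represented as $x$ in Equation 1.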
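As a minimal sketch of how Equation 1 can be evaluated on a batch, assuming the discriminator outputs probabilities in (0, 1), the snippet below averages the two log terms. The function name `cgan_objective`, its argument names, and the small `eps` constant are illustrative assumptions and are not taken from the original implementation.

```python
import math


def cgan_objective(d_real, d_fake, eps=1e-12):
    """Empirical value of the conditional GAN objective for one batch.

    d_real: discriminator outputs D(x, y) for real (ear, face) pairs.
    d_fake: discriminator outputs D(x, G(x, z)) for generated pairs.
    eps: hypothetical safeguard that keeps log() away from zero.
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives 2 * log(0.5).
value = cgan_objective([0.5, 0.5], [0.5, 0.5])
```

The discriminator is trained to increase this value, while the generator is trained to decrease it.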
Pixel loss. In addition to the adversarial loss, we also included the L1 distance as in [isola2017image], which is expressed in Equation 2. This term compares the generated data and the real data in the pixel space and thus forces the network to generate samples that are close to the target data. Since several papers reported that using the L2 distance causes blurry images, the L1 distance is employed.

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\lVert y - G(x, z) \rVert_1\right] \tag{2}$$

In this equation, $y$ represents the target (real) image and $G(x, z)$ is the reconstructed face image.
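The pixel loss of Equation 2 can be illustrated with a short sketch. It assumes images are given as nested lists of pixel values and averages the absolute differences over all pixels, which differs from the plain L1 norm only by a constant factor; the function name `l1_pixel_loss` is hypothetical.

```python
def l1_pixel_loss(target, generated):
    """Mean absolute difference between two images given as rows of pixel values."""
    diffs = [
        abs(t - g)
        for t_row, g_row in zip(target, generated)
        for t, g in zip(t_row, g_row)
    ]
    return sum(diffs) / len(diffs)


target = [[0.0, 1.0], [0.5, 0.5]]
generated = [[0.0, 0.0], [0.5, 1.0]]
loss = l1_pixel_loss(target, generated)  # (0 + 1 + 0 + 0.5) / 4
```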
Feature reconstruction loss. In order to compare generated and real images in a feature space, a feature reconstruction loss is added to the objective function. The intention is to encourage the network to learn a feature representation similar to that of the target data in order to retain structure and content [johnson2016perceptual]. The feature reconstruction loss is defined as

$$\mathcal{L}_{feat}(G) = \frac{1}{C_j W_j H_j} \left\lVert \phi_j(G(x, z)) - \phi_j(y) \right\rVert_2^2 \tag{3}$$

where $C_j$, $W_j$, and $H_j$ are the number of channels, width, and height of the feature map, respectively. $\phi$ represents the model that is employed for feature extraction, and $j$ is the layer of $\phi$ from which the features are obtained. That is, the normalized squared Euclidean distance between the features of the generated image and the real image is calculated as the feature reconstruction loss. In our system, the pretrained VGGFace2 model [cao2018vggface2], which is based on the ResNet-50 architecture [he2016deep], is employed. Features are extracted from the global average pooling layer of this model. The dimension of the feature vector is 2048.
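A minimal sketch of the feature reconstruction loss follows. It assumes the squared, normalized form of [johnson2016perceptual] and pooled feature vectors, for which the width and height are 1, so the normalization reduces to the vector length. The function name and the toy vectors are illustrative only.

```python
def feature_reconstruction_loss(feat_generated, feat_real):
    """Squared Euclidean distance between two feature vectors, divided by their length."""
    if len(feat_generated) != len(feat_real):
        raise ValueError("feature vectors must have the same length")
    squared = sum((a - b) ** 2 for a, b in zip(feat_generated, feat_real))
    return squared / len(feat_real)


# Toy 3-dimensional features; a pooled ResNet-50 vector would be much longer.
loss = feature_reconstruction_loss([1.0, 2.0, 3.0], [1.0, 0.0, 3.0])  # 4 / 3
```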
Style reconstruction loss. In addition to the feature reconstruction loss, a style reconstruction loss [gatys2015texture, gatys2015neural] is utilized as well. The main motivation behind employing this loss function is to penalize differences between fake and real images in terms of their style, such as textures and colors. In order to compute the style reconstruction loss, the Gram matrix needs to be calculated beforehand. For an image $I$, the Gram matrix is defined as

$$\mathrm{Gram}_j^{\phi}(I)_{c,c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_j(I)_{h,w,c} \, \phi_j(I)_{h,w,c'} \tag{4}$$

where $\phi_j(I)$ is the feature map acquired from the global average pooling layer of the ResNet-50 model [he2016deep], as in the feature reconstruction loss, and $C_j$, $H_j$, and $W_j$ are its number of channels, height, and width. The feature map and its transpose are multiplied and then normalized to obtain the Gram matrix. This is repeated for the feature map of the generated image and the feature map of the target image. Afterwards, the style reconstruction loss is calculated using these Gram matrices as follows:

$$\mathcal{L}_{style}(G) = \left\lVert \mathrm{Gram}_j^{\phi}(G(x, z)) - \mathrm{Gram}_j^{\phi}(y) \right\rVert_F^2 \tag{5}$$

In this equation, $j$ is the layer of the ResNet-50 model ($\phi$) used for feature extraction, as in the feature reconstruction loss. The extracted feature maps are used to compute the Gram matrices with Equation 4 for both the fake and the real image, and the style reconstruction loss is then obtained as the squared Frobenius norm of the difference between these Gram matrices.
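The Gram matrix and style reconstruction loss of Equations 4 and 5 can be sketched as below. The sketch assumes a feature map stored as C rows, each holding the H·W activations of one channel; the function names and the toy feature maps are hypothetical.

```python
def gram_matrix(feature_map):
    """Gram matrix of a feature map stored as C rows of H*W activations each."""
    channels = len(feature_map)
    positions = len(feature_map[0])
    norm = channels * positions  # C * H * W
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / norm for row_j in feature_map]
        for row_i in feature_map
    ]


def style_reconstruction_loss(feat_generated, feat_real):
    """Squared Frobenius norm of the difference between two Gram matrices."""
    gram_gen = gram_matrix(feat_generated)
    gram_real = gram_matrix(feat_real)
    return sum(
        (g - r) ** 2
        for gen_row, real_row in zip(gram_gen, gram_real)
        for g, r in zip(gen_row, real_row)
    )


fake = [[1.0, 0.0], [0.0, 1.0]]  # 2 channels, 2 spatial positions
real = [[1.0, 1.0], [1.0, 1.0]]
loss = style_reconstruction_loss(fake, real)
```

For a pooled feature vector with a single spatial position, each row has one entry and the Gram matrix reduces to the normalized outer product of the vector with itself.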
3.3 Training procedure
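The final objective function of the proposed method is presented in Equation 6.

$$G^{*} = \arg\min_{G}\max_{D} \; \mathcal{L}_{cGAN}(G, D) + \lambda_1 \mathcal{L}_{L1}(G) + \lambda_2 \mathcal{L}_{feat}(G) + \lambda_3 \mathcal{L}_{style}(G) \tag{6}$$

This formula is the combination of the adversarial loss ($\mathcal{L}_{cGAN}$), L1 loss ($\mathcal{L}_{L1}$), feature reconstruction loss ($\mathcal{L}_{feat}$), and style reconstruction loss ($\mathcal{L}_{style}$). All of these loss terms, except the adversarial loss, are multiplied by the corresponding coefficients, $\lambda_1$, $\lambda_2$, and $\lambda_3$. According to empirical evaluation, we set the $\lambda_1$, $\lambda_2$, and $\lambda_3$ parameters to 10, 0.25, and 0.1, respectively. Finally, the objective is minimized with respect to the generator, $G$, and maximized with respect to the discriminator, $D$.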
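To make the weighting concrete, the following sketch combines four already computed loss values into the generator-side objective. It assumes that the reported values 10, 0.25, and 0.1 apply to the pixel, feature, and style terms in that order, following the order in which the losses are listed; the function and parameter names are hypothetical.

```python
def generator_objective(adversarial, pixel, feature, style,
                        w_pixel=10.0, w_feature=0.25, w_style=0.1):
    """Weighted sum of the four loss terms; the adversarial term is unweighted."""
    return adversarial + w_pixel * pixel + w_feature * feature + w_style * style


# Toy loss values: 1.0 + 10 * 0.5 + 0.25 * 2.0 + 0.1 * 3.0
total = generator_objective(adversarial=1.0, pixel=0.5, feature=2.0, style=3.0)
```

With these weights, the pixel term contributes the most per unit of loss, while the feature and style terms are scaled down so that they refine rather than dominate the objective.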
3.4 Face recognition
In order to assess whether the reconstructed face images preserve identity information, and are hence useful for person identification, we conducted face recognition experiments. The proposed face recognition scheme is shown in Figure 3. In the experiments, we employed the pretrained ResNet-50 CNN model that was trained on the VGGFace2 dataset [cao2018vggface2] for feature extraction. We extracted features both from the reconstructed face images, which correspond to the probe set, and from the original face images, which correspond to the gallery set. Then, the cosine similarity between the feature vectors of each pair of images from the gallery and probe sets is calculated to generate a similarity matrix. Afterwards, the face recognition accuracy is calculated using this similarity matrix.
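The matching step described above can be sketched as follows: cosine similarities between probe and gallery features are ranked, and a probe counts as correct at rank k if its identity appears among the k most similar gallery entries. The function names, the toy 2-dimensional features, and the one-image-per-subject gallery are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k):
    """Fraction of probes whose identity is among the k most similar gallery entries."""
    hits = 0
    for feat, true_id in zip(probe_feats, probe_ids):
        scores = [cosine_similarity(feat, g) for g in gallery_feats]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top_ids = {gallery_ids[i] for i in ranked[:k]}
        if true_id in top_ids:
            hits += 1
    return hits / len(probe_feats)


gallery_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery_ids = ["a", "b", "c"]
probe_feats = [[0.9, 0.1], [0.6, 0.5]]
probe_ids = ["a", "b"]

rank1 = rank_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k=1)
rank3 = rank_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k=3)
```

In this toy example the first probe is matched correctly at rank 1, whereas the second probe is closest to subject "c" and is only recovered at rank 3.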
4 Experimental Results
In this section, we present the datasets and evaluation metrics that we used in our experiments. We then provide and discuss the face reconstruction and face recognition performance. Finally, we share cross-dataset experiment results to quantify the proposed model’s generalization capability.
4.1 Datasets
In this work, we generated paired images from both the Multi-PIE [multi_pie] and FERET [FERET_dataset] datasets. We followed a similar approach to compose ear-face image pairs from these datasets. We first ran the OpenCV [opencv] ear detection algorithm and the dlib face detector [dlib] to capture ear and frontal face images. Afterwards, we resized the detected ear and face images to the same size.
In the Multi-PIE dataset [multi_pie], we detected and created 8533 ear-frontal face image pairs belonging to 250 subjects. We separated the Multi-PIE dataset into a training set and three different test sets. For the training set, we selected 240 out of the 250 subjects and obtained 6544 ear-face image pairs. The remaining 10 subjects are used for the subject independent test set. Overall, we have two subject dependent test sets and one subject independent test set. The details are explained below.
Subject independent (S.ID.) test set. We used the remaining 10 subjects, who are not in the training set. The proposed model did not see these 10 subjects during training, and this way we investigated the subject independent performance of the model.
Subject dependent (S.D.) test set 1. In this set, there are 1677 images of 240 subjects. These 240 subjects are the same as the subjects in the training set; however, the images are different. That is, there are no common images in the training and test sets.
Subject dependent (S.D.) test set 2. 10 subjects are randomly selected from the subject dependent test set 1. The purpose of this set is to create a subject dependent test set that has the same number of subjects as the subject independent test set, in order to compare subject dependent and subject independent results fairly.
| Dataset | Training set: # of sub. | Training set: # of img. | Test set: # of sub. | Test set: # of img. | Test set name |
| Multi-PIE | 240 | 6544 | 240 | 1677 | Subject dependent test set 1 |
| Multi-PIE | 240 | 6544 | 10 | 95 | Subject dependent test set 2 |
| FERET | 504 | 623 | 504 | 504 | Subject dependent test set 1 |
| FERET | 504 | 623 | 55 | 55 | Subject dependent test set 2 |
For the FERET dataset, we obtained 1182 ear-face image pairs from 559 different subjects. In this dataset, while 504 subjects have more than one ear-face image pair, the remaining 55 subjects have only one ear-face image pair. One image pair from each of the 504 subjects is selected for the subject dependent test set, and the remaining images are used in the training set.
Subject independent (S.ID.) test set. As mentioned above, 55 subjects have only one ear-face pair. Because of that, we selected these 55 subjects for the subject independent test.
Subject dependent (S.D.) test set 1. 504 images belonging to 504 subjects are chosen for this set. Each subject has one ear-face image pair, which is not included in the training set.
Subject dependent (S.D.) test set 2. In order to have the same number of subjects as the subject independent test set, we created this subset by randomly selecting 55 subjects from the subject dependent test set 1. This way, we obtain 55 ear-face image pairs belonging to 55 subjects, none of which are included in the training set.
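The FERET split described above can be sketched as a small function: subjects with a single pair go to the subject independent test set, and for every other subject one pair is held out for the subject dependent test set while the rest are used for training. Holding out the first pair of each subject, as well as the function name and the toy identifiers, are assumptions made for illustration; how the held-out pair was chosen is not specified here.

```python
from collections import defaultdict


def split_pairs(pairs):
    """Split (subject_id, pair_id) tuples into train, S.D. test, and S.ID. test lists."""
    by_subject = defaultdict(list)
    for subject_id, pair_id in pairs:
        by_subject[subject_id].append(pair_id)

    train, sd_test, sid_test = [], [], []
    for subject_id, pair_ids in by_subject.items():
        if len(pair_ids) == 1:
            # Subjects with a single pair are never seen during training.
            sid_test.append((subject_id, pair_ids[0]))
        else:
            # Hold out one pair for testing and train on the remaining ones.
            sd_test.append((subject_id, pair_ids[0]))
            train.extend((subject_id, p) for p in pair_ids[1:])
    return train, sd_test, sid_test


pairs = [("s1", "a"), ("s1", "b"), ("s2", "c"), ("s3", "d"), ("s3", "e"), ("s3", "f")]
train, sd_test, sid_test = split_pairs(pairs)
```

With the counts reported for FERET, 1182 − 504 − 55 = 623 pairs remain for training, which matches the training set size listed in Table 1.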
We summarize the related information about the training and test sets of both datasets in Table 1. In addition to these experiments, we also performed cross-dataset experiments to explore the generalization capacity of the learned models. That is, we tested a model that was trained on the training set of the Multi-PIE dataset on the test sets of the FERET dataset, and vice versa. Moreover, using the same setups as in Table 1, we conducted face recognition experiments using generated face images and original face images to investigate whether the identity information is preserved during reconstruction.
4.2 Evaluation metrics
| Dataset | Test set | # of subjects | Pixel difference | Feature difference | Style difference | PSNR | SSIM |
In order to quantitatively assess the reconstruction performance, we employed five evaluation metrics computed between the reconstructed and ground truth images: pixel difference, feature difference, style difference, Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM).
Pixel difference. We calculated L1 distance between ground truth and reconstructed face image to measure the similarity between them in the pixel space.
Feature difference. Besides comparing ground truth and reconstructed face images in the pixel space, we also benefited from feature similarity to compare them in the feature space. We employed Equation 3 to measure squared Euclidean distance between features of reconstructed and real face images.
Style difference. We also used features of reconstructed and real face images in order to calculate style differences between them. We used Equation 5 to calculate the style difference.
PSNR and SSIM. While PSNR measures the numerical similarity between images by calculating the ratio between the squared maximum possible pixel value and the mean squared error between the pixels of the generated and real images, SSIM measures the structural similarity between them.
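As a minimal sketch of the PSNR computation, the snippet below uses the standard definition, 10·log10(MAX² / MSE), assuming 8-bit images given as nested lists; the function name and the `max_value` default are illustrative. SSIM is omitted here because it relies on windowed local statistics and is usually taken from an image processing library.

```python
import math


def psnr(target, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images given as rows of pixel values."""
    squared_errors = [
        (t - g) ** 2
        for t_row, g_row in zip(target, generated)
        for t, g in zip(t_row, g_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


target = [[100, 120], [140, 160]]
generated = [[101, 119], [141, 159]]  # every pixel is off by one intensity level
score = psnr(target, generated)
```

Higher values indicate a closer match; an error of one intensity level per pixel on an 8-bit image corresponds to roughly 48 dB.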
4.3 Reconstruction Results
Results of the subject dependent and subject independent experiments are presented in Table 2. The first column contains the dataset used, and the second column indicates whether the test is subject dependent (S.D.) or subject independent (S.ID.). The following column shows the number of subjects in the test set, and the other columns contain the evaluation scores. The subject dependent test set contains the same subjects as the training set but different images of them. On the other hand, the subject independent test set includes subjects different from the ones in the training set.
Quantitative evaluation. According to the experimental results in Table 2, the results of the Multi-PIE subject dependent and subject independent experiments are similar in all five evaluation metrics. This outcome indicates that our model performs similarly on subjects who do not exist in the training set. In particular, the relatively high PSNR and SSIM values indicate the high face reconstruction capability of the network from input ear images.
We performed the same experiments on the FERET dataset. As in the Multi-PIE experiments, the subject dependent results are slightly better than subject independent ones in terms of considered evaluation metrics. Besides, both subject dependent and subject independent results with all metrics, except PSNR, are better than the Multi-PIE results. Especially, feature and style differences are significantly lower than the ones obtained on the Multi-PIE dataset. One reason for this could be the higher number of subjects available in the FERET dataset, which might have led to a better modelling of the appearance variations.
In Table 2, the subject dependent results represent both the subject dependent test set 1 and the subject dependent test set 2. Since the results are almost the same for this experiment, we did not present the subject dependent test set 2 in a different row.
Qualitative evaluation. Example images from the Multi-PIE and FERET datasets are presented in Figure 4. In these figures, input represents the ear image used for the face reconstruction, output is the face image reconstructed from the input ear image, and target contains the real face image of the corresponding subject. These example images are from the test sets. While the first 6 columns are from the subject dependent test set 1, the other 3 columns are selected from the subject independent test set. When the reconstructed face images are examined, they show some partial deterioration, such as an asymmetric face or partial skewness in the eyes, nose, or mouth. Despite such slight skewness or asymmetry, the reconstructed face images resemble the original face images and there are very few artifacts in the images. Besides, considering that all of these face images are reconstructed from only ears using a relatively small dataset, the results are very promising. Moreover, the texture quality of the generated face images, especially for the FERET dataset, is satisfactory in terms of similarity to the ground truth images and real human skin.
| Dataset | Images | Test set | # Subj. | Rank-1 | Rank-2 | Rank-5 | Rank-10 | Rank-20 |
4.4 Face Recognition Results
We present face recognition results on both datasets in Table 3. The images column indicates the images employed for the recognition task, and the next column shows the test set. Again, the S.D. and S.ID. abbreviations represent the subject dependent and subject independent sets, respectively. We also performed a subject dependent experiment on a small subset of the subject dependent test set 1, which has the same number of subjects as the S.ID. set, in order to have a fair comparison with the subject independent test results.
According to the experimental results, face recognition accuracies on the real data, i.e., the frontal face images from the datasets, are extremely high on both datasets, especially on the Multi-PIE dataset. Face recognition performance using the reconstructed face images is also very promising. Since we obtained better reconstruction performance on the FERET dataset, this also led to better face recognition accuracies on the FERET dataset. For example, we reached 88.7% and 90.9% Rank-10 identification accuracies on the FERET subject dependent and subject independent setups, respectively. Since the FERET dataset contains more subjects, it can cover more identity variations. On account of this, in both the face reconstruction and face recognition experiments, the subject dependent and subject independent results are very close to each other. On the other hand, we attained 60.6% and 43.1% Rank-10 identification accuracies on the Multi-PIE subject dependent and subject independent setups, respectively. The larger performance gap between the subject dependent and subject independent results could be due to the fact that the Multi-PIE dataset contains fewer subjects in the training set and therefore has a limited capacity to model identity variations.
4.5 Cross-dataset Results
In order to measure the generalization performance of the proposed network, we conducted cross-dataset experiments and present their results in this section. In Table 4, the first column, dataset, indicates the test dataset, while the model column states the dataset that was used to train the model. Subject dependent and subject independent refer to the test sets of the corresponding dataset. As in Table 2, we did not report test results on the subject dependent test set 2 in Table 4, since the results are almost the same as the ones on the subject dependent test set 1.
| Dataset | Model | Test set | Pixel difference | Feature difference | Style difference | PSNR | SSIM |
According to the cross-dataset results on the Multi-PIE and FERET datasets in Table 4, the proposed model also performs well in the cross-dataset setup. Except for the style difference metric, all metrics showed performance similar to the corresponding same-dataset experiments. The style difference metric, on the other hand, gave poor results, 6.21 and 6.92 for FERET and 3.64 and 4.67 for Multi-PIE, compared to the results obtained using the same dataset for training and testing, which were 1.81 and 1.82 for FERET and 2.64 and 2.99 for Multi-PIE. The reason for this outcome is related to the difference between the data distributions of the Multi-PIE and FERET datasets. Ear images from the Multi-PIE and FERET datasets have the same context information, since the area surrounding the ear, e.g., hair and head, has a similar structure in almost all datasets. However, the ground truth images that are employed for the comparison with the reconstructed images have characteristic features inherent to the corresponding dataset. Because of that, during face reconstruction in the cross-dataset experiments, the generative model tends to reconstruct face images with a background and colors similar to the ones in the dataset used for training. This tendency causes a relatively high style difference between the reconstructed and ground truth images. In addition, when testing on the Multi-PIE dataset using the FERET model, the obtained performance is better than in the reverse experiment, and the evaluation metrics are almost the same as the original results.
5 Conclusion
In this work, we presented a novel study on biometric modality mapping, i.e., we explored a mapping to reconstruct a frontal face image from an ear image input. We formulated the problem as a paired image-to-image translation task and investigated the learning capability of the GAN for deep biometric modality mapping. We employed style reconstruction and feature reconstruction losses, in addition to adversarial and pixel losses. We tested our model on the Multi-PIE and FERET datasets using both subject dependent and subject independent experimental setups. We also performed cross-dataset experiments to analyze the generalization capability of the proposed method. Moreover, we conducted face recognition experiments using reconstructed face images and original face images to assess the usability of the generated faces for person identification purposes. According to the experimental results, although there are some artifacts in some local parts of the generated faces, they are still very similar to the original face images. This outcome is also quantitatively measured using five different metrics. These results indicate that the GAN model can learn the indirect relationship between ear and face modalities. Using a large-scale dataset that contains high appearance variations would increase the quality of the reconstructed face images as well as the generalization of the model. Besides, the face recognition performance is found to be very promising, especially on the FERET dataset, on which we have better reconstruction performance. This outcome also validates the obtained face reconstruction quality. In our future work, we will further develop the proposed network to improve the reconstruction quality, remove the local artifacts, and enhance the generalization capacity of the model. We also plan to collect a large-scale dataset of ear-face image pairs in the wild to capture more appearance variations.