1 Introduction
The emergence of deep convolutional neural networks (CNNs) [26, 35, 38, 21, 22] has greatly advanced the frontier of face recognition (FR) [43, 36, 34]. By acquiring experience and knowledge from training data, deep networks simulate the perception of the human brain to perform FR, and they have boosted accuracy to nearly 100% on the Labeled Faces in the Wild (LFW) dataset [23]. However, more and more people find that a problematic issue, namely racial bias, has been concealed in previous studies by biased benchmarks while clearly degrading the performance of realistic FR systems [2, 10, 18, 7, 16, 32]. For example, Amazon's Rekognition tool incorrectly matched the photos of 28 U.S. congressmen with the faces of criminals, and the error rate was as high as 39% for non-Caucasian people; according to [18], a year-long investigation across 100 police departments revealed that African-American individuals are more likely to be stopped by law enforcement; Buolamwini et al. [10] found that the accuracy of three commercial gender classification algorithms drops substantially on darker female faces. Based on these findings, MIT Technology Review [2] suggested that racial bias in databases is reflected in algorithms, so the performance of FR systems depends on race. However, so little testing information is available that it is hard to measure the racial bias in existing FR algorithms, and there has yet to be a comprehensive study of how deep FR algorithms are affected by it.

Table 1: Face verification accuracy (%) of commercial APIs on LFW and on the four RFW testing subsets.

Model          LFW    RFW Caucasian  RFW Indian  RFW Asian  RFW African
Microsoft [5]  98.22  87.60          82.83       79.67      75.83
Face++ [4]     97.03  93.90          88.55       92.47      87.50
Baidu [3]      98.67  89.13          86.53       90.27      77.97
Amazon [1]     98.50  90.45          87.20       84.87      86.27
mean           98.11  90.27          86.28       86.82      81.89
To facilitate research on this issue in deep FR, we construct a new Racial Faces in-the-Wild (RFW) database containing 625K images of 25K celebrities of different races, as shown in Fig. 1 and Table 3. Unlike existing datasets, RFW is collected to offer several new uses:
1) Measure the racial bias of FR algorithms. Four testing subsets, namely Caucasian, Asian, Indian and African, are constructed for face verification. By following a unified standard (e.g. a similar distribution of pose, age and gender), the four testing subsets exclude factors other than race that could cause performance differences, so they can be used to fairly evaluate and compare the recognition ability of an algorithm on different races.
2) Reduce racial bias by transfer learning. A training set with four race subsets is also released. The Caucasian subset consists of about 500K labeled images of 10K identities, and each of the other-race subsets contains about 50K unlabeled images. We recommend using transfer learning (TL) to transfer recognition knowledge among different races.
Based on RFW, we take a first step and verify experimentally that racial bias exists in realistic FR systems. As shown in Table 1, existing commercial recognition APIs indeed work unequally well for different races: the maximum difference in error rate between the best and worst groups is 12%. This phenomenon has been concealed in previous papers because the number of non-Caucasian test subjects is quite small. To reduce the racial bias, we propose a domain adaptation (DA) approach designed specifically for FR, namely a deep information maximization adaptation network (IMAN). It identifies a feature space in which data from the source and target domains are similarly distributed, and it also learns this feature space discriminatively by optimizing a mutual-information loss as a proxy for maximizing the decision margin on the unlabeled target domain. Comprehensive experiments on RFW show that IMAN can narrow the racial bias.
Our contributions can be summarized in three aspects. 1) A new, challenging RFW dataset is constructed and released (http://www.whdeng.cn/RFW/index.html). Compared with existing datasets, RFW can be used to fairly evaluate and study racial bias in FR algorithms. To the best of our knowledge, RFW is the first benchmark in which face pairs of different races are fairly integrated into the evaluation of an FR system. 2) Based on comprehensive experiments on RFW, we are the first to show that deep FR algorithms are also susceptible to the "other-race effect". 3) An effective IMAN model is proposed; it delivers competitive results and shows that racial bias can be substantially reduced.
2 Related work
Database. In the last few years, various databases have been developed to facilitate FR research. In 2007, the LFW dataset [23] was introduced, marking the beginning of FR under unconstrained conditions. It contains 13,233 images of 5,749 unique individuals and provides 6,000 pairs for face verification. CASIA-WebFace [48] provided the first widely used public training dataset for deep models; it consists of 0.5M images of 10K celebrities collected from the web. Currently, more databases provide publicly available large-scale training data, in particular three databases with over 1M images, namely MS-Celeb-1M [20], VGGFace2 [12] and MegaFace [24]. However, almost all of these databases contain significant racial bias, as shown in Table 2. Training on them may lead to unfair results across races, and testing on them overlooks the poor performance on non-Caucasian subjects.
Table 2: Racial distribution (%) of commonly used training and testing databases.

Train/Test  Database            Caucasian  Asian  Indian  African
train       CASIA-WebFace [48]  84.5       2.6    1.6     11.3
train       VGGFace2 [12]       74.2       6.0    4.0     15.8
train       MS-Celeb-1M [20]    76.3       6.6    2.6     14.5
test        LFW [23]            69.9       13.2   2.9     14.0
test        IJB-A [25]          66.0       9.8    7.2     17.0
test        RFW                 25.0       25.0   25.0    25.0
Deep face recognition. Driven by graphics processing units (GPUs), massive annotated data and deep learning, deep FR [43] has made significant advances in recent years. Increasingly powerful loss functions have been explored to learn deep features with small intra-class differences and large inter-class differences. DeepFace [39] was the first to use a nine-layer CNN with softmax to perform FR. With 3D alignment for data processing, it reaches an accuracy of 97.35% on LFW. The DeepID series [36, 46, 37] combined softmax with a contrastive loss to learn a discriminative representation. FaceNet [34] used a large private dataset to train a GoogLeNet; it adopted a triplet loss and achieved 99.63% on LFW. Wen et al. [45] proposed a center loss to reduce intra-class feature variations. SphereFace [27] and ArcFace [14] were proposed to make learned features separable with a larger angular distance, which is equivalent to the geodesic distance on a hypersphere manifold.

Deep unsupervised domain adaptation. Due to many factors, e.g., illumination, pose and image quality, there is always a distribution change, or domain shift, between training and testing sets that can degrade performance. Mimicking the human vision system, DA is a particular case of TL that utilizes labeled data in one or more relevant source domains to execute new tasks in a target domain [44]. Several methods use the maximum mean discrepancy (MMD) loss for this purpose [42, 28, 29, 47, 30]. The deep domain confusion network (DDC) [42] used MMD in addition to the classification loss to reduce the distribution mismatch through one adaptation layer. The deep adaptation network (DAN) [28] matched the shift by adding multiple adaptation layers and exploring multiple kernels. Other methods use an adversarial loss to minimize domain shift [17, 41, 40, 13]. The domain-adversarial neural network (DANN) [17] integrated a gradient reversal layer (GRL) into the standard architecture so that the feature distributions over the two domains become similar. However, due to the lack of appropriate face databases, DA research has been limited to digit classification and object classification. Considering that there is a domain gap between faces of different races, our RFW promotes the development of DA in FR.
3 Racial Faces in-the-Wild: RFW
3.1 Creating RFW
In this section, we describe the dataset collection process, which is summarized in Fig. 2.
Database selection. Instead of downloading images from websites and cleaning them carefully, we collect images of different races from existing databases. Three principles guide the selection of candidate databases: 1) the candidates should be clean enough, 2) the number of identities in the candidates should be large enough, and 3) as few candidates as possible should be selected, in order to avoid noise between databases. After comparing all publicly available large-scale training datasets (over 1M images), we found MS-Celeb-1M [6] to be the best match.
Race detection. The Face++ API [4] is used to estimate the race of each celebrity in MS-Celeb-1M. However, the API cannot correctly detect the race of every image, and different images of one person may be assigned to different races. Hence, an identity is accepted only if almost all of its images are estimated as the same race; otherwise it is discarded.
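The acceptance rule described above can be sketched as a per-identity vote. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `filter_identities`, the input layout, and the agreement threshold `min_agreement=0.9` are hypothetical, since the text only says "almost all".

```python
from collections import Counter


def filter_identities(predictions, min_agreement=0.9):
    """Keep identities whose per-image race estimates mostly agree.

    predictions: dict mapping identity -> list of per-image race labels
                 (for example, the outputs of a race-estimation API).
    min_agreement: hypothetical threshold standing in for "almost all".
    Returns a dict mapping each accepted identity to its majority race.
    """
    accepted = {}
    for identity, labels in predictions.items():
        if not labels:
            continue  # no images, nothing to vote on
        race, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[identity] = race
    return accepted


example = {
    "id_a": ["Asian"] * 19 + ["Indian"],          # 95% agreement: kept
    "id_b": ["African"] * 6 + ["Caucasian"] * 4,  # 60% agreement: dropped
}
kept = filter_identities(example)
```

A stricter threshold discards more identities but leaves fewer mislabeled ones for the later manual cleaning step.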
Data recleaning. Because automatic methods have already been used to clean MS-Celeb-1M, we only reclean the testing set of RFW manually. We randomly select 3000 people from each race and take extreme care to manually remove outlier faces for each identity as well as outlier identities for each race.
Testing set construction. We construct our testing set similarly to LFW: 10 disjoint splits of image pairs are defined, each containing 300 positive pairs and 300 negative pairs. However, we find that the random selection of face pairs in LFW may make the task easy and unrealistic, so it is natural to impose certain constraints on the face pairs. We apply the cosine similarity measure of the well-established ArcFace descriptor (https://github.com/deepinsight/insightface). For each identity, we randomly select one positive pair from the 50% of pairs with smaller cosine similarity to enlarge the intra-class difference, and we randomly select one negative pair from the 1% of pairs with larger cosine similarity to decrease demographic biases. Because this strategy selects noisy images more easily, we carefully clean these face pairs again.

Training set construction. We further collect a training set to facilitate future research on racial bias. For the source domain, about 500K images of 10K Caucasian people are randomly selected, and their labels are provided. For the target domains, we select about 50K images each of Indians, Asians and Africans, and their labels are unavailable.
Table 3: Number of subjects and images in the RFW training and testing subsets.

Subsets    Train # Subjects  Train # Images  Test # Subjects  Test # Images
Caucasian  10000             468139          2959             10196
Indian     -                 52285           2984             10308
Asian      -                 54188           2492             9688
African    -                 50588           2995             10415
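The pair-selection constraint used for the testing set in Section 3.1 can be sketched as follows. This is a minimal illustration, not the released tooling: the function names, the feature layout (one list of floats per image) and the use of a seeded `random.Random` are assumptions, and the features are assumed to be non-zero vectors from some face descriptor.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_positive(features, rng, fraction=0.5):
    """Pick one same-identity pair among the `fraction` least similar pairs."""
    pairs = sorted(
        combinations(range(len(features)), 2),
        key=lambda p: cosine(features[p[0]], features[p[1]]),
    )
    pool = pairs[: max(1, int(len(pairs) * fraction))]
    return rng.choice(pool)


def hard_negative(own_feats, other_feats, rng, fraction=0.01):
    """Pick one cross-identity pair among the `fraction` most similar pairs."""
    pairs = sorted(
        ((i, j) for i in range(len(own_feats)) for j in range(len(other_feats))),
        key=lambda p: cosine(own_feats[p[0]], other_feats[p[1]]),
        reverse=True,
    )
    pool = pairs[: max(1, int(len(pairs) * fraction))]
    return rng.choice(pool)


rng = random.Random(0)
identity = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # three images of one person
others = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]    # images of other people
pos_pair = hard_positive(identity, rng)           # indices into `identity`
neg_pair = hard_negative(identity[:1], others, rng)
```

Sorting all pairs is quadratic in the number of images, which is acceptable per identity but would need an approximate nearest-neighbour search for negatives at the scale of a full race subset.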
3.2 Statistics and analyses of RFW
The performance of FR is known to be influenced by many factors, such as pose, age and gender. To ensure that our testing set can be used to fairly evaluate the recognition ability of an algorithm on different races, we compare these attributes across the four testing subsets so as to exclude factors other than race that could cause performance differences. The Face++ API is used to estimate pose, age and gender for each image, and the distributions are given in Fig. 3. According to the figures, there are no large differences in the pose, age and gender distributions of the Caucasian, Indian and Asian testing sets, so performance differences among them can be attributed to race. The African set has a smaller age gap and the least balanced gender distribution, containing only 668 female images. Considering that recognition systems perform better on males with small age variation [10], the images of our African set are easier to recognize than those of the other sets. In other words, the recognition accuracy for African people may be even lower in reality.
We also estimate the pose and age gap distributions of positive pairs in LFW and RFW, as illustrated in Fig. 5. Compared with LFW, the pose and age variations in RFW are much larger. Examples of difficult pairs selected by cosine distance in RFW are presented in Fig. 4. The positive pairs in RFW contain obvious pose and age differences, and the negative pairs are confusing even for human observers under careful inspection. This confirms the effectiveness of our constraints on selecting face pairs.
Then, based on RFW, we examine four commercial recognition APIs, i.e. Face++, Baidu, Amazon and Microsoft, to observe whether racial bias exists. Face verification accuracies are shown in Table 1 and ROC curves are presented in Fig. 9. First, FR accuracies are far from saturated on our Caucasian testing subset: Microsoft achieves 98.22% on LFW, and its accuracy drops by about 10% on RFW. Second, FR systems indeed work unequally well for different races: the maximum difference in error rate between the best and worst groups is 12%. Taking the mean of the four commercial APIs as the measurement, existing FR systems yield average accuracies of 90.27%, 86.28%, 86.82% and 81.89% on the Caucasian, Indian, Asian and African testing subsets, respectively. Third, an interesting phenomenon emerges from our experiments: APIs developed by East Asian companies perform better on Asian subjects than APIs developed in the Western hemisphere.
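The averages and the best-to-worst gap quoted above can be re-derived from the RFW columns of Table 1. The sketch below is only a sanity check; reading "maximum difference between the best and worst groups" as the largest per-API gap between its best and worst race subset is an assumption about how the 12% figure was obtained.

```python
# RFW verification accuracy (%) per API, copied from Table 1.
accuracy = {
    "Microsoft": {"Caucasian": 87.60, "Indian": 82.83, "Asian": 79.67, "African": 75.83},
    "Face++":    {"Caucasian": 93.90, "Indian": 88.55, "Asian": 92.47, "African": 87.50},
    "Baidu":     {"Caucasian": 89.13, "Indian": 86.53, "Asian": 90.27, "African": 77.97},
    "Amazon":    {"Caucasian": 90.45, "Indian": 87.20, "Asian": 84.87, "African": 86.27},
}
races = ["Caucasian", "Indian", "Asian", "African"]

# Mean accuracy of the four APIs on each race subset.
mean_acc = {r: sum(api[r] for api in accuracy.values()) / len(accuracy) for r in races}

# Gap between the best and worst race subset, computed separately for each API.
gaps = {name: max(api.values()) - min(api.values()) for name, api in accuracy.items()}
worst_api = max(gaps, key=gaps.get)
max_gap = gaps[worst_api]

for race in races:
    print(f"{race:10s} mean accuracy: {mean_acc[race]:.2f}")
print(f"largest best-to-worst gap: {max_gap:.2f} ({worst_api})")
```

Under this reading, the largest gap comes from a single API whose best subset is Asian rather than Caucasian, which is consistent with the third observation in the paragraph above.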
4 Deep information maximization adaptation network
To reduce the racial bias, there is a strong incentive to transfer recognition knowledge from Caucasians (source) to other races (target) instead of training a special model for each race. This is a typical unsupervised DA problem. Ben-David et al. [8] suggested that the expected DA loss for a target domain is bounded by three terms: 1) the expected loss for the source domain; 2) the domain divergence between source and target; and 3) the sum of the loss on the source and target domains. We call these the three DA principles in this paper. Due to the absence of labeled target samples, most deep DA methods for object classification, such as those based on MMD, only minimize the first two terms. Considering the large number of identities in the target domain in FR, optimizing only these two terms is not enough. Some methods [33, 50] try to minimize the third term using pseudo target labels generated from the maximum posterior probability of the source classifier. However, pseudo target labels cannot be obtained directly from the source classifier in FR, because the source and target domains share no classes. Inspired by [49, 19], we propose a deep information maximization adaptation network (IMAN), which optimizes a mutual-information loss as a proxy for the expected error on the target domain. Combined with MMD, IMAN simultaneously optimizes the three terms and effectively narrows the racial bias between domains, as shown in Fig. 6.

In our case, the source domain is a labeled training set with images of Caucasian subjects, namely $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{M}$, and the target domain is an unlabeled training set with images of Asian, Indian or African subjects, namely $\mathcal{D}_t = \{x_j^t\}_{j=1}^{N}$.
4.1 Mutual-information loss
As the most widely used classification loss, the Softmax loss maximizes the conditional probability of the correct class by minimizing the cross-entropy over the training samples:

$$\mathcal{L}_S = -\frac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{K} y_{ik}\log p_{ik}, \qquad p_{ik} = \frac{e^{z_{ik}}}{\sum_{j=1}^{K} e^{z_{ij}}} \tag{1}$$

where $y_i$ is the one-hot ground-truth label of the $i$-th of $M$ labeled samples, $p_i$ is the vector of probabilities assigned to each class, computed by the softmax function above, $z_{ik}$ is the network output of the $i$-th sample for the $k$-th class, and $K$ is the total number of classes. But when there are no labels for the target data, how can we learn discriminative representations from the probabilities alone?
Based on the desideratum that an ideal posterior probability vector $p_i$ should look like $[0, \dots, 0, 1, 0, \dots, 0]$, we reduce the number of confusing labels by minimizing the entropy of $p_i$ on the target samples. This coincides with the idea of SphereFace [27] and ArcFace [14], which encourage a large decision margin between classes. However, minimizing this entropy alone removes decision boundaries and assigns most samples to the same class. To trade off class separation against class balance, we simultaneously maximize the entropy of $\hat{p}$, where $\hat{p}$ is an estimate of the marginal distribution of the labels, given by $\hat{p} = \frac{1}{N}\sum_{i=1}^{N} p_i$ over the $N$ target samples. Our mutual-information loss is therefore:

$$\mathcal{L}_M = \frac{1}{N}\sum_{i=1}^{N} H(p_i) - \gamma\, H(\hat{p}) \tag{2}$$

where $H(p) = -\sum_{k} p_k \log p_k$ denotes the entropy of a probability vector $p$, and $\gamma$ is the trade-off parameter between the two entropies. For $\gamma = 1$, $\mathcal{L}_M = -I(X; O)$, where $I(X; O)$ is the mutual information between the empirical distribution of the inputs $X$ and the estimated label distribution $O$. Minimizing our loss is therefore equivalent to maximizing this mutual information.
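To make the behaviour of this loss concrete, here is a small sketch in plain Python. It assumes the reading used throughout this section: the loss is the mean per-sample entropy of the predicted class probabilities minus a trade-off weight times the entropy of their average. The function names and the default `gamma=0.2` (the third trade-off value listed in the implementation details) are illustrative choices, not the authors' code.

```python
import math


def entropy(p, eps=1e-12):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(q * math.log(q + eps) for q in p)


def mutual_information_loss(probs, gamma=0.2):
    """Mean conditional entropy minus gamma times the marginal entropy.

    probs: list of per-sample class-probability vectors (each sums to 1).
    A low value means confident per-sample predictions that are still
    spread evenly over the classes.
    """
    n, k = len(probs), len(probs[0])
    conditional = sum(entropy(p) for p in probs) / n
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    return conditional - gamma * entropy(marginal)


balanced = mutual_information_loss([[1.0, 0.0], [0.0, 1.0]])   # confident, balanced
collapsed = mutual_information_loss([[1.0, 0.0], [1.0, 0.0]])  # everything in one class
uncertain = mutual_information_loss([[0.5, 0.5], [0.5, 0.5]])  # no decision at all
```

The three toy cases show why both entropies are needed: the collapsed solution has zero conditional entropy but gains nothing from the marginal term, so the confident and balanced assignment scores lowest.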
Because the mutual-information loss depends heavily on initialization to produce a reliable probability vector in the first place, we initialize the target classifier with pseudo target labels. These pseudo labels are obtained with the clustering algorithm used in MegaFace [31], which is similar to spectral clustering. A graph is constructed in which the nodes represent images and an edge indicates that two images have a large cosine similarity. Each connected component with at least three nodes is then saved as a cluster (identity). We use these pseudo labels to initialize our network for the mutual-information loss.
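The graph-based pseudo-labeling step can be sketched with a union-find structure. This is an illustration of the connected-component idea only, not the MegaFace implementation: the function name, the similarity threshold `threshold=0.9` and the all-pairs comparison are assumptions; only the minimum cluster size of three comes from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cluster_pseudo_labels(features, threshold=0.9, min_size=3):
    """Label each image by its connected component in a similarity graph.

    An edge joins two images whose cosine similarity exceeds `threshold`
    (a hypothetical value). Components smaller than `min_size` get None.
    """
    n = len(features)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(features[i], features[j]) > threshold:
                parent[find(i)] = find(j)

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)

    labels = [None] * n
    next_label = 0
    for members in components.values():
        if len(members) >= min_size:
            for i in members:
                labels[i] = next_label
            next_label += 1
    return labels


feats = [
    [1.0, 0.0], [0.99, 0.05], [0.98, 0.1],   # one tight group
    [0.0, 1.0], [0.05, 0.99], [0.1, 0.98],   # a second tight group
    [-1.0, -1.0],                            # isolated image
]
pseudo = cluster_pseudo_labels(feats)
```

Images left with `None` carry no pseudo label, which is one reason the pseudo-labeled set is smaller than the full unlabeled target set.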
4.2 Our adaptation network
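In this paper, we embed the ideas of mutual information and MMD into a deep network to learn transferable features. MMD has been widely adopted as a standard distance metric for measuring the discrepancy between distributions. It maps the extracted deep features to a reproducing kernel Hilbert space (RKHS) endowed with multiple kernels and compares the squared distance between the empirical kernel mean embeddings. Following DAN [28], an empirical estimate of MMD is given as:

$$\mathrm{MMD}^2(\mathcal{D}_s, \mathcal{D}_t) = \left\| \frac{1}{M}\sum_{i=1}^{M}\phi(x_i^s) - \frac{1}{N}\sum_{j=1}^{N}\phi(x_j^t) \right\|_{\mathcal{H}}^{2} \tag{3}$$

where $x_i^s$ and $x_j^t$ are the $M$ source and $N$ target samples, and $\phi(\cdot)$ is the function that maps the original data to an RKHS $\mathcal{H}$. The kernel function associated with this mapping, $k(x^s, x^t) = \langle \phi(x^s), \phi(x^t) \rangle$, is defined as a convex combination of $U$ PSD kernels $\{k_u\}$, namely $k = \sum_{u=1}^{U} \beta_u k_u$ with $\beta_u \ge 0$ and $\sum_{u} \beta_u = 1$, where $\beta_u$ is the coefficient of the $u$-th kernel.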
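A direct way to evaluate the squared distance between the two kernel mean embeddings is to expand it into three averaged kernel sums. The sketch below does this for a convex combination of Gaussian kernels. It is a quadratic-time illustration under assumed settings: the bandwidths, the uniform kernel weights and the function names are hypothetical, and practical implementations often use cheaper estimators.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths, weights):
    """Convex combination of Gaussian kernels with the given weights."""
    return sum(w * gaussian_kernel(x, y, b) for w, b in zip(weights, bandwidths))


def mmd2(source, target, bandwidths=(0.5, 1.0, 2.0), weights=None):
    """Squared MMD between two samples, expanded via the kernel trick.

    ||mean phi(s) - mean phi(t)||^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]
    """
    if weights is None:
        weights = [1.0 / len(bandwidths)] * len(bandwidths)  # uniform, sums to 1

    def mean_kernel(a, b):
        total = sum(multi_kernel(x, y, bandwidths, weights) for x in a for y in b)
        return total / (len(a) * len(b))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


src = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
near = [[0.05, 0.05], [0.1, 0.1], [0.0, 0.0]]
far = [[3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]
same_gap = mmd2(src, src)
near_gap = mmd2(src, near)
far_gap = mmd2(src, far)
```

The estimate is zero for identical samples and grows as the two feature clouds move apart, which is the property the adaptation term exploits.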
Our IMAN, pre-trained on source data, is simultaneously 1) fine-tuned on labeled source samples, 2) optimized with the MMD loss to minimize the domain distribution discrepancy and learn domain-invariant representations, and 3) fine-tuned with the mutual-information loss to learn discriminative representations on the unlabeled target domain. The overall objective function of IMAN is:

$$\mathcal{L} = \mathcal{L}_C(X_s, Y_s) + \alpha \sum_{l \in \mathbb{L}} \mathrm{MMD}^2(\mathcal{D}_s^l, \mathcal{D}_t^l) + \beta\, \mathcal{L}_M(X_t) \tag{4}$$

where $\alpha$ and $\beta$ are the trade-off parameters between the three terms. $\mathcal{L}_M(X_t)$ is our mutual-information loss on the unlabeled target data $X_t$. If only unlabeled target samples were used for training, the network could learn unreliable representations, so source samples are used for training as well to preserve accuracy: $\mathcal{L}_C(X_s, Y_s)$ denotes the classification loss on the source data $X_s$ and the source labels $Y_s$. $\mathcal{D}_s^l$ and $\mathcal{D}_t^l$ are the $l$-th layer hidden representations of the source and target examples, $\mathbb{L}$ is the set of adapted layers, and $\mathrm{MMD}^2(\mathcal{D}_s^l, \mathcal{D}_t^l)$ is the multi-kernel MMD between the source and target evaluated on the $l$-th layer representation. $\mathcal{L}_C$, $\mathrm{MMD}^2$ and $\mathcal{L}_M$ correspond to the three DA principles, respectively.

Three important points distinguish IMAN from the related literature [49, 19]: 1) Deep DA. We introduce mutual information to deep DA. 2) Initialization method. To overcome the non-overlapping source and target categories in FR, we propose to initialize the target classifier using pseudo labels generated by clustering algorithms. 3) Combination with MMD. We combine the mutual-information loss with MMD to further minimize the domain distribution discrepancy.
5 Experiments on RFW
In this section, we conduct an exploratory experiment on the "other-race effect" and then evaluate the proposed unsupervised DA method on our RFW dataset.
5.1 "Other-race effect" experiment
Psychological research indicates that humans recognize faces of their own race more accurately than faces of other races. The "contact" hypothesis suggests that this "other-race effect" results from the greater experience we have with own-race faces than with other-race faces. Given that Caucasian subjects are overwhelmingly dominant in training databases, do deep FR algorithms inherit this "other-race effect"? We conduct experiments on the four testing subsets of RFW to investigate this question.
Existence of domain gap. We generate the average faces of the four testing subsets and compare them visually. As shown in Fig. 1, there are indeed discrepancies among different races in facial features and complexion, especially for African people.
Then, visualization and quantitative comparisons are conducted at the feature level. To extract deep features, we train a deep model based on ResNet-34 using CASIA-WebFace as the training data and Softmax as the loss function. With this model, the deep features of 3000 images from each testing subset are extracted and visualized using t-SNE embeddings [15], as shown in Fig. 7(a). The features separate almost completely according to race, but there is no clear boundary between the features of Indians and Caucasians, which conforms to the common sense that the faces of Indians are more westernized than the faces of Africans and Asians. Moreover, we use MMD [9, 11] to compute the distribution discrepancy between Caucasians and other races. The results, shown in Fig. 7(b), lead to the same conclusions as t-SNE: 1) the distribution discrepancies between Caucasians and other races are much larger than that among Caucasians themselves, which indicates the existence of a domain gap; 2) Africans and Asians have the largest domain discrepancies with Caucasians, followed by Indians.
"Other-race effect". Having shown that a domain gap exists, we conduct further experiments to investigate whether deep FR algorithms are susceptible to the "other-race effect". Based on the features extracted by our trained Softmax model, we first compare the distributions of cosine distances of the 6000 pairs, as shown in Fig. 8. Overlap between the histograms of positive and negative pairs corresponds to False Negatives (FN) and False Positives (FP) in face verification. The degree of overlap in the Caucasian set is much smaller than that in the Indian, Asian and African sets, which shows visually that the recognition error for non-Caucasian subjects is much higher.

Then, we also examine some well-established methods, i.e. Center loss [45], SphereFace [27], VGGFace2 [12] and ArcFace [14], on RFW. We directly download the Center loss (https://github.com/walkoncross/caffefacecenterloss), SphereFace (https://github.com/wy1iu/sphereface) and VGGFace2 (SeNet learned from scratch, https://github.com/oxvgg/vgg_face2) models from their websites. Because the ArcFace model (https://github.com/deepinsight/insightface) provided by its authors is trained on the MS-Celeb-1M database, which fully contains our RFW, we instead train an ArcFace model based on ResNet-34 with CASIA-WebFace. The 10-fold cross-validation accuracies and ROC curves are given in Table 4 and Fig. 9. All the tested algorithms perform best on the Caucasian testing subset, followed by Indian, and worst on Asian and African. For example, the accuracy of the ArcFace model reaches 92.15% on the Caucasian testing subset but decreases dramatically to 83.98% on the Asian subset. This is because almost all well-established models are predominantly trained on Caucasian faces. Therefore, deeply learned features tend to be biased towards distinguishing Caucasians rather than other races, and the learned representations discard information useful for discerning non-Caucasian faces. This phenomenon has been concealed in previous papers because the number of non-Caucasian test subjects is quite small.
Table 4: Verification accuracy (%) of well-established methods on LFW and on the four RFW testing subsets.

Model             LFW    RFW Caucasian  RFW Indian  RFW Asian  RFW African
Center loss [45]  98.75  87.18          81.92       79.32      78.00
SphereFace [27]   99.27  90.80          87.02       82.95      82.28
ArcFace¹ [14]     99.40  92.15          88.00       83.98      84.93
VGGFace2² [12]    99.30  89.90          86.13       84.93      83.38
mean              99.18  90.01          85.77       82.80      82.15

¹ Different from the model provided by the paper, ArcFace here is a ResNet-34 model trained with CASIA-WebFace.
² VGGFace2 here is a SeNet model trained with the VGGFace2 database from scratch.
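The 10-fold verification accuracies reported here follow an LFW-style protocol. One common way to compute such a number is sketched below: for each held-out split, a similarity threshold is chosen on the remaining splits and then applied to the held-out pairs. That threshold-selection rule, the function names and the toy scores are assumptions for illustration, since the text does not spell out the procedure.

```python
def accuracy(scores, labels, threshold):
    """Fraction of pairs classified correctly at a similarity threshold."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def best_threshold(scores, labels):
    """Threshold (taken from the observed scores) with the highest accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = accuracy(scores, labels, t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def kfold_accuracy(folds):
    """Mean held-out accuracy over folds of (scores, labels).

    labels use 1 for a positive (same-identity) pair and 0 otherwise.
    """
    accs = []
    for k, (test_scores, test_labels) in enumerate(folds):
        train_scores = [s for i, f in enumerate(folds) if i != k for s in f[0]]
        train_labels = [y for i, f in enumerate(folds) if i != k for y in f[1]]
        t = best_threshold(train_scores, train_labels)
        accs.append(accuracy(test_scores, test_labels, t))
    return sum(accs) / len(accs)


# Three tiny, perfectly separable folds; RFW itself uses 10 splits of 600 pairs.
toy_folds = [([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]) for _ in range(3)]
toy_accuracy = kfold_accuracy(toy_folds)
```

Choosing the threshold on the other splits keeps the reported accuracy from being tuned on the pairs it is measured on.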
5.2 Domain adaptation experiment
To validate the proposed IMAN, we conduct experiments on our RFW dataset.
Implementation detail. For preprocessing, we use the same alignment method as ArcFace [14]. We use five facial landmarks for a similarity transformation, then crop and resize the faces to 112×112. Each pixel (in [0, 255]) of the RGB images is normalized by subtracting 127.5 and then dividing by 128. The baseline model is a ResNet-34 trained with the Caucasian training subset of RFW as the training data and ArcFace as the loss function. The learning rate starts at 0.1 and is divided by 10 twice when the error plateaus.
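The pixel normalization described above is simple enough to state as code. The sketch below operates on nested Python lists for clarity; the function names are illustrative, and a real pipeline would apply the same arithmetic to an image array after alignment and cropping.

```python
def normalize_pixel(value):
    """Map an 8-bit intensity in [0, 255] to roughly [-1, 1]."""
    return (value - 127.5) / 128.0


def normalize_image(image):
    """Apply the normalization to every channel of every pixel.

    image: nested list indexed as image[row][col][channel].
    """
    return [[[normalize_pixel(c) for c in pixel] for pixel in row] for row in image]


tiny = [[[0, 127.5, 255]]]            # a 1x1 "image" with three channels
normalized = normalize_image(tiny)    # [[[-0.99609375, 0.0, 0.99609375]]]
```

Dividing by 128 rather than 127.5 keeps the result strictly inside (-1, 1).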
In IMAN, we train our method on top of the baseline network. We first initialize the target classifier with the pseudo-labeled target samples using learning rate of . Then, using all labeled source and unlabeled target samples, we fine-tune the network by Equation (4) with learning rate of . We use ArcFace as the classification loss; the trade-off parameters of Equation (4), α and β, and the entropy trade-off parameter γ of the mutual-information loss are set to 15, 5 and 0.2, respectively. For MMD, we follow the settings in DAN [28] and apply MMD to the last two fully-connected layers. In all experiments, we set the batch size, momentum, and weight decay as 200, 0.9 and , respectively.
Experimental result. Three DA tasks are performed, namely transferring knowledge from Caucasian to Indian, Asian and African. All learning algorithms are given about 500K labeled source examples and 50K unlabeled target examples, and are then evaluated on the separate target testing subsets. The empirical results are given in Table 5, followed by details of the procedure used for each learning algorithm:
Table 5: Verification accuracy (%) of domain adaptation methods on the RFW testing subsets.

Methods        Caucasian  Indian  Asian  African
Baseline       94.78      90.48   86.27  85.13
DDC [42]       -          91.63   87.55  86.28
DAN [28]       -          91.78   87.78  86.30
PL5 [31]       -          92.00   88.33  87.67
PL5+PL1        -          92.08   88.80  88.12
PL5+MMD        -          92.00   88.65  87.92
IMAN (ours)    -          93.55   89.87  88.88
IMAN* (ours)   -          94.15   91.15  91.42
All these algorithms are based on the baseline network and fine-tune it with source samples supervised by the ArcFace loss. The differences are as follows.
DDC and DAN simultaneously fine-tune the network with MMD. DDC applies MMD on one layer and DAN applies it on the last two layers.
PL5 initializes the target classifier and simultaneously fine-tunes the network with the Softmax loss. The training data are pseudo-labeled target samples generated by MegaFace's clustering algorithm [31]. And the learning rate is .
PL5+PL1 is based on PL5. It keeps fine-tuning the network with the Softmax loss on these pseudo labels, but the learning rate is .
PL5+MMD is based on PL5. It fine-tunes the network with MMD with learning rate of . The training data are all unlabeled target data.
IMAN is based on PL5 and adds the mutual-information loss to PL5+MMD. The learning rate is , training data is all unlabeled target data.
From Table 5, we can see that IMAN clearly outperforms all of the competing methods and achieves gains of about 3% over the baseline. Moreover, we make the following observations in light of the three DA principles described before. First, DDC [42] and DAN [28] only optimize the first two DA terms with the help of MMD. DAN is superior to the baseline by only about 1%, and DDC improves even less. This confirms our view that optimizing only these two terms is not enough for FR. Second, PL5+PL1, which fine-tunes the network with pseudo-labeled target samples, outperforms DAN and DDC. This shows that optimizing the third DA term with pseudo labels is more effective than optimizing the second one. Finally, we compare the results of PL5+PL1, PL5+MMD and IMAN. PL5+MMD is worse than PL5+PL1, but IMAN outperforms PL5+PL1 by about 1% after adding the mutual-information loss to PL5+MMD. This demonstrates the effectiveness of the mutual-information loss. Why does IMAN, which uses unlabeled data, outperform PL5+PL1, which uses pseudo-labeled data? The reasons are that 1) the amount of pseudo-labeled data is much smaller than that of unlabeled data due to the limitations of clustering algorithms, 2) we cannot ensure the correctness of the pseudo labels, and 3) our mutual-information loss reduces confusion and encourages a large decision margin. Furthermore, when we initialize the target classifier with the ArcFace loss instead of the Softmax loss, IMAN (denoted as IMAN*) improves further and obtains the best performance, with 94.15%, 91.15% and 91.42% on the Indian, Asian and African sets.
6 Conclusion
An ideal face recognition algorithm should perform perfectly and fairly on different demographic groups. While the problem of racial bias is yet to be comprehensively studied, we have taken a first step and created a benchmark for it. Our RFW database contains 1) four testing subsets, namely Caucasian, Asian, Indian and African, so that FR algorithms can be fairly evaluated and compared on different races, and 2) four training subsets that enable FR algorithms to transfer recognition knowledge from Caucasians to other races. Through experiments on RFW, we first show that there is a domain gap among different races and that deep models trained on the current benchmarks do not perform well on non-Caucasian faces. Then, a deep information maximization adaptation network (IMAN) is introduced. It makes the representations of the source and target domains similar and also learns the feature space discriminatively using unlabeled target data. Comprehensive experiments demonstrate the potential and effectiveness of IMAN in reducing racial bias.
References
 [1] Amazon's Rekognition tool. https://aws.amazon.com/rekognition/.
 [2] Are face recognition systems accurate? depends on your race. https://www.technologyreview.com/s/601786.
 [3] Baidu cloud vision api. http://ai.baidu.com.
 [4] Face++ research toolkit. www.faceplusplus.com.
 [5] Microsoft azure. https://www.azure.cn.
 [6] Msceleb1m challenge 3: Face feature test/trillion pairs. http://trillionpairs.deepglint.com/.
 [7] M. Alvi, A. Zisserman, and C. Nellaker. Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings. arXiv preprint arXiv:1809.02169, 2018.
 [8] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
 [9] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
 [10] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81, pages 77–91, 2018.
 [11] R. Cafiero, A. Gabrielli, and M. A. Muñoz. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
 [12] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. arXiv preprint arXiv:1710.08092, 2017.
 [13] Z. Cao, M. Long, J. Wang, and M. I. Jordan. Partial transfer learning with selective adversarial networks. arXiv preprint arXiv:1707.07901, 2017.
 [14] J. Deng, J. Guo, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
 [15] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
 [16] N. Furl, P. J. Phillips, and A. J. O’Toole. Face recognition algorithms and the other-race effect: computational mechanisms for a developmental contact hypothesis. Cognitive Science, 26(6):797–815, 2002.

 [17] Y. Ganin. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
 [18] C. Garvie. The perpetual lineup: Unregulated police face recognition in america. Georgetown Law, Center on Privacy & Technology, 2016.
 [19] R. Gomes, A. Krause, and P. Perona. Discriminative clustering by regularized information maximization. In International Conference on Neural Information Processing Systems, pages 775–783, 2010.
 [20] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In ECCV, pages 87–102. Springer, 2016.
 [21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [22] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
 [23] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
 [24] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In CVPR, pages 4873–4882, 2016.
 [25] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a. In CVPR, pages 1931–1939, 2015.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 [27] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, volume 1, 2017.
 [28] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015.
 [29] M. Long, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. arXiv preprint arXiv:1605.06636, 2016.
 [30] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.
 [31] A. Nech and I. Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In CVPR, pages 3406–3415. IEEE, 2017.
 [32] P. J. Phillips, F. Jiang, A. Narvekar, J. Ayyad, and A. J. O’Toole. An other-race effect for face recognition algorithms. ACM Transactions on Applied Perception (TAP), 8(2):14, 2011.
 [33] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. arXiv preprint arXiv:1702.08400, 2017.
 [34] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
 [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [36] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, pages 1988–1996, 2014.
 [37] Y. Sun, D. Liang, X. Wang, and X. Tang. Deepid3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
 [38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
 [39] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014.

 [40] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.
 [41] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
 [42] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. Computer Science, 2014.
 [43] M. Wang and W. Deng. Deep face recognition: A survey. arXiv preprint arXiv:1804.06655, 2018.
 [44] M. Wang and W. Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
 [45] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515. Springer, 2016.
 [46] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In CVPR, pages 2892–2900, 2015.
 [47] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. arXiv preprint arXiv:1705.00609, 2017.
 [48] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
 [49] Y. Shi and F. Sha. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. In International Conference on Machine Learning, pages 1275–1282, 2012.
 [50] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In The European Conference on Computer Vision (ECCV), September 2018.