To train a robust deep model, abundant training data and well-designed training strategies are indispensable. It is also worth pointing out that most existing training data sets, such as the ILSVRC object detection task, which contains 200 basic-level categories, were carefully filtered so that the number of instances per object class is kept similar, avoiding a long-tailed distribution.
More specifically, the long-tail property refers to the condition in which only a limited number of object classes appear frequently, while most of the others appear relatively rarely. If a model is trained on such an extremely imbalanced dataset (in which only limited and deficient training samples are available for most of the classes), it is very difficult to obtain good performance. In other words, insufficient samples in poor classes/identities cause the intra-class dispersion to spread over a relatively large and loose area, while at the same time compacting the inter-class dispersion.
Bengio introduced the term “representation sharing”: humans possess the ability to recognize objects they have seen only once, or even never, through representation sharing. Poor classes can benefit from knowledge learned from semantically similar but richer classes. In practice, however, rather than learning transferable features from richer classes, previous work mainly cuts or simply replicates some of the data to avoid the potential risks that a long-tailed distribution may cause. According to a previous verification, even if 40% of the positive samples are left out of feature learning, detection performance improves slightly when the samples are more uniform. The flaw of such a disposal method is obvious: by simply abandoning part of the data, the information contained in these identities is also lost.
In this paper, we propose a new loss function, namely range loss, to effectively enhance the model’s ability to learn from tailed data/classes/identities. Specifically, this loss takes the maximum Euclidean distance over all sample pairs of a class as the range of that class. During each training iteration, we aim to minimize the range of each class within one batch and recompute the new range of this subspace simultaneously.
The main contributions of this paper can be summarized as follows:
1. We extensively investigate the long tail effect in deep face recognition, and propose a new loss function called range loss to overcome this problem in deep face recognition. To the best of our knowledge, this is the first work in the literature to discuss and address this important problem.
2. Extensive experiments have demonstrated the effectiveness of our new loss function in overcoming the long tail effect. We further demonstrate the excellent generalizability of our new method on two famous face recognition benchmarks (LFW and YTF).
2 Related Work
Deep learning has proved to have a great ability for feature learning and achieves strong performance in a series of vision tasks such as object detection [7, 24, 16, 8, 27], face recognition [20, 23, 26, 2, 32, 18, 29], and so forth. By increasing the depth of the model to 16-19 layers, VGG achieved a significant improvement on VOC 2012 and Caltech 256. Building on this previous work, the Residual Network, proposed by Kaiming He et al., presents a residual learning framework to ease the training of substantially deeper networks. A new supervision signal for the face recognition task, called center loss, has also been proposed. Similar to the main practice of our range loss, center loss minimizes the distances between the deep features and their corresponding class centers (defined as arithmetic mean values).
In a 2015 workshop talk, Bengio described the long tail distribution as the enemy of machine learning. Much better super-pixel classification results have been achieved by expanding the samples of the poor classes. Another study investigates many factors that influence fine-tuning performance for object detection with a long-tailed distribution of samples. Its analysis and empirical results indicate that classes with more samples have a greater impact on feature learning, and that it is better to make the number of samples more uniform across classes.
3 The Proposed Approach
In this section, we first elaborate on our exploratory experiments, implemented with VGG on LFW’s face verification task, which give us an intuitive understanding of the potential effects of long-tailed data. Based on the conclusions drawn from these two experiments, we propose a new loss function, namely range loss, to improve the model’s tolerance and utilization of highly imbalanced data, followed by some discussion.
3.1 Problem formulation
In statistics, the long tail of a distribution is the portion of the distribution having a large number of occurrences far from the “head” or central part of the distribution. To investigate the long-tail property deeply and thoroughly in the context of deep face recognition, we first trained several VGG-16 models with the softmax loss function on data sets with an extremely imbalanced distribution (the distribution of our training data is illustrated in Fig. 2). We constructed our long-tail-distributed training set from the MS-Celeb-1M and CASIA-WebFace data sets; it consists of 1.7 million face images of almost 100k identities. In this set, there are 700k images for roughly 10k of the identities, and 1 million images for the remaining 90k identities. To better understand the potential effect of long-tailed data on the extracted identity representation features, we slice the raw data into several groups according to the proportions in Table 1. As shown in Fig. 2, classes that contain fewer than 20 images are defined as poor classes (tailed data). As shown in Table 1, group A-0 is the raw training set; 20%, 50%, 70% and 100% of the poor classes in A-0 are cut to construct groups A-1, A-2, A-3 and A-4, respectively.
| Groups | Num of Identities | Images | Division Ratio |
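The construction of groups A-0 to A-4 can be sketched as follows. This is a minimal illustration, assuming the data set is summarized as a mapping from identity to image count and that the poor classes to be removed are picked at random; the function name `cut_tail`, the random selection and the toy counts are assumptions made for illustration and are not specified in the text.

```python
import random

POOR_THRESHOLD = 20  # classes with fewer than 20 images are "poor" (from the text)


def cut_tail(images_per_identity, cut_ratio, seed=0):
    """Return a copy of the index with `cut_ratio` of the poor classes removed.

    Assumption: the poor classes to drop are chosen uniformly at random.
    """
    poor = sorted(k for k, n in images_per_identity.items() if n < POOR_THRESHOLD)
    rng = random.Random(seed)
    n_drop = round(len(poor) * cut_ratio)
    dropped = set(rng.sample(poor, n_drop))
    return {k: n for k, n in images_per_identity.items() if k not in dropped}


# Toy index (hypothetical counts): 2 rich identities and 10 poor ones.
toy = {f"rich_{i}": 70 for i in range(2)}
toy.update({f"poor_{i}": 5 for i in range(10)})

ratios = [("A-0", 0.0), ("A-1", 0.2), ("A-2", 0.5), ("A-3", 0.7), ("A-4", 1.0)]
groups = {name: cut_tail(toy, ratio) for name, ratio in ratios}

for name, group in groups.items():
    print(name, "identities:", len(group), "images:", sum(group.values()))
```

Rich classes are never touched, so every group keeps the same head of the distribution and differs only in how much of the tail remains.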
We conduct our experiments on LFW’s face verification task, and the accuracies are compared in Table 2. As shown in Table 2, group A-2 achieves the highest accuracy in series A. As the tail grows, groups A-1 and A-0 achieve lower performance even though they contain more identities and images.
These results indicate that tailed data stand a great chance of having a negative effect on the trained model’s ability. Based on the above findings, we can characterize the long-tail effect: conventional deep visual models do not always benefit as much from a larger data set with the long-tailed property as they do from a larger, uniformly distributed data set. Moreover, a long-tailed data set, if cut so that only a specific proportion of the tail remains (50% here), will contribute to deep models’ training.
In fact, the face recognition task has some distinctive features: the intra-class variation is large because face images are easily influenced by facing direction, lighting conditions and original resolution. On the other hand, compared with other recognition tasks, the inter-class variation in face recognition is much smaller. As the number of identities grows, it becomes possible to include two identities with similar faces. Worse still, their face images may be so few that they cannot give a good description of their own identities.
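Table 2: Accuracy on LFW of VGG-16 with softmax loss trained on groups A-0 to A-4.
| Groups | Acc. on LFW |
| A-0 (with long-tail) | 97.87% |
| A-1 (cut 20% tail) | 98.03% |
| A-2 (cut 50% tail) | 98.25% |
| A-3 (cut 70% tail) | 97.18% |
| A-4 (cut 100% tail) | 95.97% |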
3.2 Study of VGG Net with Contrastive and Triplet Loss on Subsets of Object Classes
Consider the characteristics of long-tailed distributions: a small number of generic objects/entities appear very often while most others appear much more rarely. It is natural to consider using the contrastive loss or the triplet loss to address the long-tail effect because of their pair-based training strategy.
The contrastive loss function involves two types of samples: positive samples of similar pairs and negative samples of dissimilar pairs. The gradients of the loss function act like a force that pulls positive pairs together and pushes negative pairs apart. Triplet loss minimizes the distance between an anchor and a positive sample, both of which have the same identity, and maximizes the distance between the anchor and a negative sample of a different identity.
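The two losses described above can be sketched in NumPy as follows. This is a minimal sketch using commonly used formulations of the contrastive and triplet losses; the exact variants used in the experiments are not given in the text, and the margin values and toy feature vectors are hypothetical.

```python
import numpy as np


def contrastive_loss(x1, x2, same, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs beyond `margin`."""
    d = np.linalg.norm(x1 - x2)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(margin - d, 0.0) ** 2


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Require the anchor-negative squared distance to exceed the
    anchor-positive squared distance by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)


# Toy 2-D features (hypothetical values).
a = np.array([0.0, 0.0])   # anchor
p = np.array([0.1, 0.0])   # same identity as the anchor
n = np.array([1.0, 0.0])   # different identity, already far away

print(contrastive_loss(a, p, same=True))
print(contrastive_loss(a, n, same=False))
print(triplet_loss(a, p, n))
```

Both losses are defined on individual pairs or triplets, so each identity only contributes through the pairs that can actually be formed from its images.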
In this section, we apply the contrastive loss and triplet loss to VGG-16 with the same long-tail-distributed data constructed above. The goal of this experiment is, to some extent, to gain insight into how well the contrastive loss and triplet loss handle long-tailed data. We conduct the LFW face verification experiment on the two most representative groups, A-0 and A-2, which contain all and half of the long-tailed data, respectively. As for the training pairs, we first divide all identities into two parts with the same number of identities. The former part contains only the richer classes and the latter the poor classes. Positive pairs (images of the same person) are randomly selected from the former part, and negative pairs are generated from different identities in the latter part. After training, we obtained the contrastive and triplet results shown in Table 3 and Table 4, respectively. From these tables, we can clearly see that the long-tail effect still exists in models trained with contrastive loss and triplet loss: with 291,277 more tailed images in group A-0’s training set, instead of improving the verification performance, accuracy is reduced by 0.15%. Moreover, contrastive loss improves the accuracy by 0.46% and 0.21% compared with VGG-16 with softmax loss.
A probable cause of the long-tail effect under contrastive loss is the following: although the pair and triplet training strategies can avoid the direct negative effect that a long-tail distribution may bring, classes in the tail are more likely to be selected when the training pairs are constructed (poor classes account for 90% of the classes). Because a massive number of classes with rare samples pile up in the tail, pairs containing pictures of the same person are extremely limited, resulting in a lack of sufficient description of the intra-class variation. Motivated by these deficiencies of contrastive and triplet loss, we find it necessary to propose a loss function specially customized for training data with a long-tail distribution. Such a loss function is designed primarily to make better use of the tailed data, which we believe has been submerged by the information of the richer classes and not only has almost zero impact on the model, but also hinders the model’s effectiveness in learning discriminative features.
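Table 3: Accuracy on LFW of VGG-16 trained with contrastive loss.
| Training Groups | Acc. on LFW |
| A-0 (with long-tail) | 98.35% |
| A-2 (cut 50% of tail) | 98.47% |
Table 4: Accuracy on LFW of VGG-16 trained with triplet loss.
| Training Groups | Acc. on LFW |
| A-0 (with long-tail) | 98.10% |
| A-2 (cut 50% of tail) | 98.40% |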
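The scarcity of positive pairs in the tail can be made concrete with a short calculation. An identity with n images yields n(n-1)/2 distinct positive pairs. The sketch below uses the approximate average class sizes implied by the data set description in Section 3.1 (about 70 images per rich identity and about 11 per poor identity); these averages are derived here for illustration and the helper name `positive_pairs` is hypothetical.

```python
from math import comb


def positive_pairs(n_images):
    """Number of distinct same-identity pairs available from n_images images."""
    return comb(n_images, 2)


avg_rich = 700_000 // 10_000     # about 70 images per rich identity
avg_poor = 1_000_000 // 90_000   # about 11 images per poor identity

print("rich identity:", positive_pairs(avg_rich), "positive pairs")
print("poor identity:", positive_pairs(avg_poor), "positive pairs")
```

Because the number of pairs grows quadratically with class size, a roughly sixfold gap in images per identity becomes a gap of well over an order of magnitude in available positive pairs.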
3.3 The Range Loss
Intrigued by the experimental results above, which show that the long-tail effect does exist in models trained with contrastive loss and triplet loss, we delve deeper into this phenomenon, give a qualitative explanation of why a new loss is needed for this problem, and further discuss the merits and disadvantages of the existing methods.
In long-tail-distributed data, samples of the tailed data are usually extremely rare: there are only very limited images for each person in our dataset. Contrastive loss optimizes the model in such a way that neighbors are pulled together and non-neighbors are pushed apart. To construct such a training set, consisting of similar pairs and dissimilar pairs, sufficient pairs of the same person are indispensable, but these are impossible to obtain from long-tailed data.
Moreover, as discussed in the previous section, richer classes have a greater impact on the model’s training. Ways to leverage the imbalanced data should therefore be considered.
The objectives of designing range loss are summarized as follows:
1. Range loss should strengthen the tailed data’s impact in the training process to prevent poor classes from being submerged by the rich classes.
2. Range loss should penalize the dispersion of the sparse samples brought by poor classes.
3. Range loss should enlarge the inter-class distance at the same time.
Inspired by the contrastive loss, we design the range loss in a form that reduces intra-personal variations while enlarging inter-personal differences simultaneously. But in contrast to the contrastive loss function’s optimization over positive and negative pairs, the range loss function calculates gradients and performs back-propagation based on the overall distances of classes within one mini-batch. In other words, a statistical value over the whole class replaces the value of a single sample pair. As to the second goal, prior work uses the idea of hard negative mining to deal with such samples. For the sparse training samples in poor classes, features located at the spatial edge of the feature space (edge features) can be viewed as the points that enlarge the intra-class variation most. These samples can, to a certain degree, also be viewed as hard negative samples. Inspired by this idea, range loss is designed to minimize the distances between those hard negative samples and thus lessen the exaggerated intra-class variation caused by tailed data. Based on this, we calculate the harmonic mean of the $k$ greatest ranges over the feature set extracted from the last FC layer as the intra-class loss in our function. The range value can be viewed as the distance between the two hardest negative samples within a class. For the inter-class loss, the shortest distance between class feature centers provides the supervision.
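To be more specific, range loss can be formulated as:

$$L_R = \alpha L_{R_{intra}} + \beta L_{R_{inter}}$$

where $\alpha$ and $\beta$ are the two weights of the range loss, and $L_{R_{intra}}$ denotes the intra-class loss that penalizes the maximum harmonic range of each class:

$$L_{R_{intra}} = \sum_{i \in I} L^{i}_{R_{intra}} = \sum_{i \in I} \frac{k}{\sum_{j=1}^{k} \frac{1}{D_j}}$$

where $I$ denotes the complete set of classes/identities in this mini-batch and $D_j$ is the $j$-th largest distance. For example, we define $D_1 = \|x_1 - x_2\|_2^2$ and $D_2 = \|x_3 - x_4\|_2^2$. $D_1$ and $D_2$ are the largest and second-largest Euclidean ranges for a specific identity $i$, respectively. Inputs $x_1$ and $x_2$ denote the two face samples with the longest distance and, similarly, inputs $x_3$ and $x_4$ are the samples with the second-longest distance. Equivalently, the overall cost is the harmonic mean of the $k$ largest ranges within each class. Experience shows that $k = 2$, as in the example above, brings good performance.

$L_{R_{inter}}$ represents the inter-class loss:

$$L_{R_{inter}} = \max(m - D_{Center}, 0) = \max(m - \|\bar{x}_Q - \bar{x}_R\|_2^2, 0)$$

where $D_{Center}$ is the shortest distance between class centers, which are defined as the arithmetic mean of all output features in the class. In a mini-batch, the distance between the centers of class $Q$ and class $R$ is the shortest among all pairs of class centers. $m$ denotes a hyperparameter, the maximum optimization margin, which excludes any $D_{Center}$ greater than this margin from the computation of the loss.

In order to prevent the loss from degrading to zero during training, we use our loss jointly with the softmax loss as the supervisory signals. The final loss function can be formulated as:

$$L = L_M + \lambda L_R = -\sum_{i=1}^{M} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}} + \lambda L_R$$

In the above expression, $L_M$ is the softmax loss, $M$ refers to the mini-batch size and $n$ is the number of identities in the training set. $x_i$ denotes the features of identity $y_i$ extracted from the last fully connected layer of our deep model. $W_j$ and $b_j$ are the parameters of the last FC layer.

$\lambda$ is inserted as a scalar to balance the two supervisions. If it is set to 0, the overall loss function reduces to the conventional softmax loss. According to the chain rule, the gradient of the range loss with respect to $x_i$ can be computed as:

$$\frac{\partial L_R}{\partial x_i} = \alpha \frac{\partial L_{R_{intra}}}{\partial x_i} + \beta \frac{\partial L_{R_{inter}}}{\partial x_i}$$

For a specific identity, let $S = \sum_{j=1}^{k} \frac{1}{D_j}$, where $D_j$ is the distance between $x_{j1}$ and $x_{j2}$, two features of this identity. Then, for the pair that defines $D_j$,

$$\frac{\partial L_{R_{intra}}}{\partial x_i} = \frac{2k}{(D_j S)^2} \begin{cases} x_{j1} - x_{j2}, & x_i = x_{j1} \\ x_{j2} - x_{j1}, & x_i = x_{j2} \\ 0, & \text{otherwise} \end{cases}$$

$$\frac{\partial L_{R_{inter}}}{\partial x_i} = \begin{cases} -\frac{2}{n_Q}\left(\bar{x}_Q - \bar{x}_R\right), & x_i \in Q,\ D_{Center} < m \\ -\frac{2}{n_R}\left(\bar{x}_R - \bar{x}_Q\right), & x_i \in R,\ D_{Center} < m \\ 0, & \text{otherwise} \end{cases}$$

where $n_i$ denotes the total number of samples in class $i$. The computation of the loss value and the gradient is summarized in Algorithm 1.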
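The loss value described in this section can be sketched in NumPy as follows. This is an illustrative sketch rather than the Caffe layer used in the experiments: it assumes squared Euclidean distances, computes only the forward value (no gradients), and falls back to the available non-zero pairs when a class has fewer than k of them. The function names, the weights `alpha` and `beta`, the margin and the toy features are all hypothetical.

```python
from itertools import combinations

import numpy as np


def intra_class_loss(features, k=2):
    """Harmonic mean of the k largest squared pairwise distances in one class."""
    dists = sorted(
        (float(np.sum((a - b) ** 2)) for a, b in combinations(features, 2)),
        reverse=True,
    )
    # Assumption: skip zero distances and use fewer than k ranges if needed.
    top = [d for d in dists[:k] if d > 0.0]
    if not top:
        return 0.0
    return len(top) / sum(1.0 / d for d in top)


def inter_class_loss(centers, margin):
    """Hinge on the smallest squared distance between any two class centers."""
    if len(centers) < 2:
        return 0.0
    d_center = min(float(np.sum((a - b) ** 2)) for a, b in combinations(centers, 2))
    return max(margin - d_center, 0.0)


def range_loss(features, labels, alpha, beta, margin, k=2):
    """Weighted sum of the intra-class and inter-class terms over one mini-batch."""
    per_class = [features[labels == c] for c in np.unique(labels)]
    intra = sum(intra_class_loss(g, k) for g in per_class)
    inter = inter_class_loss([g.mean(axis=0) for g in per_class], margin)
    return alpha * intra + beta * inter


# Toy mini-batch: 2 identities with 3 two-dimensional features each.
features = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0],
                     [10.0, 0.0], [10.0, 2.0], [12.0, 0.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
loss = range_loss(features, labels, alpha=1.0, beta=1.0, margin=100.0)
print(loss)
```

In a framework with automatic differentiation the same forward computation would be sufficient; a hand-written layer additionally needs the gradients with respect to the selected samples and the two closest class centers.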
3.4 Discussions on Range Loss’s Effectiveness
Generally speaking, range loss adopts two statistics with stronger discriminative power than those of contrastive loss and others: the distance between the peripheral points in the intra-class subspace, and the distance between class centers. Both the range value and the center value are calculated from groups of samples. Statistically speaking, range loss uses the training samples of one mini-batch jointly instead of individually or in pairs, thus keeping the model’s optimization direction comparatively balanced. To give an intuitive explanation of the range loss, we have simulated a 2-D feature distribution graph in one mini-batch with 4 classes (see Fig. 3).
4 Experiments
In this section, we evaluate our range loss based models on two well-known face recognition benchmarks, the LFW and YTF data sets. We first implement our range loss with the VGG architecture and train on 50% and 100% of the long-tailed data to measure its performance on the face verification task. Beyond that, building on the recently proposed center loss, which achieves state-of-the-art performance on LFW and YTF, we implement our range loss with the same network structure to see whether the range loss can handle long-tailed data better than other loss functions in a more general CNN structure.
4.1 Implementation Details of VGG with Range Loss
Training Data and Preprocessing:
To obtain high-quality training data, we compute a mean feature vector for each identity from its own pictures in the data set. For a specific identity, images whose feature vectors are far from the identity’s mean feature vector are removed. After carefully filtering and cleaning the MS-Celeb-1M and CASIA-WebFace data sets, we obtain a dataset that contains 5M images of 100k unique identities. We use the recently proposed multi-task cascaded CNN to conduct face detection and alignment. Training images are cropped to 224×224 and 112×94 RGB 3-channel images as the input for VGG and our CNN model, respectively. In this process, estimating a reasonable mini-batch size is of crucial importance. In our experience, it is better to construct mini-batches that contain multiple classes with the same number of samples in each class. For example, we set the mini-batch size to 32 in our experiment, with 4 different identities in one batch and 8 images for each identity. For small-scale nets, it is normal to set the batch size to 256, with 16 identities in one batch and 16 images per identity. Generally speaking, including more identities in one mini-batch benefits both the supervision of the softmax loss and the inter-class part of the range loss.
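The mini-batch construction described above (4 identities with 8 images each, 32 samples in total) can be sketched as follows. This is a minimal sketch: the function name `sample_balanced_batch` and the toy file names are hypothetical, and drawing with replacement for identities that have fewer than 8 images is an assumption, since the text does not say how such classes are handled.

```python
import random


def sample_balanced_batch(images_by_identity, n_identities=4, n_images=8, seed=None):
    """Draw a batch with the same number of images for each sampled identity."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(images_by_identity), n_identities)
    batch = []
    for identity in chosen:
        images = images_by_identity[identity]
        if len(images) >= n_images:
            picks = rng.sample(images, n_images)
        else:
            # Assumption: poor classes are sampled with replacement.
            picks = rng.choices(images, k=n_images)
        batch.extend((image, identity) for image in picks)
    return batch


# Toy index: 6 identities with 3, 7, 11, 15, 19 and 23 images (hypothetical).
toy = {f"id_{i}": [f"id_{i}_img_{j}.jpg" for j in range(3 + 4 * i)] for i in range(6)}
batch = sample_balanced_batch(toy, n_identities=4, n_images=8, seed=0)
print(len(batch))
```

Fixing the number of images per identity guarantees that every sampled class has enough pairs in the batch to define its range, which the intra-class term needs.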
The VGG net is a heavy convolutional neural network model, especially when facing a training set with a large number of identities. For 100k identities, in our experience, the mini-batch size can never exceed 32 because of the limitation of GPU memory. The net is initialized from a Gaussian distribution. The inter-class and intra-class parts of the range loss each have their own loss weight, and the remaining loss parameter is set to a fixed value. The initial learning rate is reduced by half at a fixed iteration interval. We extract the features of each testing sample from the last fully connected layer.
4.2 Performances on LFW and YTF Data sets
LFW is a database of face photographs designed for unconstrained face recognition. The data set contains more than 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured. 1,680 of the people have two or more distinct photos in this data set.
The YouTube Faces (YTF) database is a database of face videos designed for studying the problem of unconstrained face recognition in videos. The data set contains 3,425 videos of 1,595 different people. All the videos were downloaded from YouTube. An average of 2.15 videos are available for each subject.
We implement the CNN model using the Caffe library with our customized range loss layers. For comparison, we train three models under the supervision of softmax loss (model A), joint contrastive loss and softmax loss (model B), and softmax combined with range loss (model C). From the results shown in Table 5, we can see that model C (jointly supervised by the range loss and softmax loss) beats the baseline model A (supervised by softmax loss only) by a large gap: from 97.87% to 98.53% on LFW. Contrary to our previous experimental result that models trained with the complete long-tailed data reach a lower accuracy, the performance of our model (model C) on the complete long tail exceeds the result of the 50% long-tail group by 0.43%. This shows, firstly, that compared with softmax loss and contrastive loss, range loss is the most capable of learning discriminative features from long-tailed data. Secondly, integrating range loss into the model enables the latter 50% of the tailed data to contribute to the model’s learning. In other words, the original drawback that tailed data may bring has not only been eliminated but converted into a notable contribution, which demonstrates the advantage of our proposed range loss in dealing with long-tailed data.
| Deep FR | 2.6M | 98.95% | 97.30% |
4.3 Performance of Range Loss on other CNN structures
To measure the performance and impact of the range loss comprehensively and thoroughly, we further adopt a residual CNN supervised by the joint signals of range loss and softmax. Deep residual nets have in recent years been shown to generalize well on recognition tasks. They present a residual learning framework that eases the training of networks substantially deeper than those used previously, with up to 152 layers on the ImageNet dataset. Our choice of this joint signal can be largely ascribed to the softmax loss’s strong ability to produce discriminative boundaries among classes. Different from our previous practice, the model is trained on 1.5M filtered images from MS-Celeb-1M and CASIA-WebFace, a data set that is smaller than the original long-tail dataset and has a more uniform distribution. The intention of this experiment is as follows: apart from the ability to utilize large amounts of imbalanced data, we want to verify our loss function’s ability to generalize to training a universal CNN model and to achieve state-of-the-art performance. We evaluate the performance of the range loss based residual net on the LFW and YTF face verification tasks. The model’s architecture is illustrated in Fig. 7. In Table 6, we compare our method against many existing models, including DeepID-2+, FaceNet, Baidu, DeepFace and our baseline model D (our residual net structure supervised by softmax loss). From the results in Table 6, we have the following observations. Firstly, our model E (supervised by softmax and range loss) beats the baseline model D (supervised by softmax only) by a significant margin (from 98.27% to 99.52% on LFW, and from 93.10% to 93.70% on YTF). This indicates that the joint supervision of range loss and softmax loss can notably enhance the ability of deep neural models to extract discriminative features. Secondly, the residual network integrated with range loss is not inferior to the existing well-known networks and even outperforms most of them. This shows our loss function’s ability to generalize to training a universal CNN model and to achieve state-of-the-art performance. Lastly, our proposed networks are trained on a database far smaller than those of the others (shown in Table 6), which indicates the advantage of our network.
5 Conclusions
In this paper, we deeply explore the potential effects that the long tail distribution may have on deep model training. Contrary to intuition, long-tailed data, if tailored properly, can contribute to the model’s training. We proposed a new loss function, namely range loss. By combining the range loss with the softmax loss to jointly supervise the learning of CNNs, it is able to effectively reduce the intra-class variations and enlarge the inter-class distance under imbalanced long-tailed data. The optimization goal for the poor classes should therefore focus on the thorny samples within each class. The performance of range loss on several large-scale face benchmarks has convincingly demonstrated the effectiveness of the proposed approach.
References
[1] A. Bingham and D. Spradlin. The long tail of expertise. 2011.
[2] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. Pages 3025–3032, 2013.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
[5] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
[6] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
[7] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from rgb-d images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360. Springer, 2014.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[11] J. F. Henriques, J. Carreira, R. Caseiro, and J. Batista. Beyond hard negative mining: Efficient detector learning via block-circulant decomposition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2760–2767, 2013.
[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[16] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[17] J. Liu, Y. Deng, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv preprint arXiv:1506.07310, 2015.
[18] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
[19] W. Ouyang, X. Wang, C. Zhang, and X. Yang. Factors in finetuning deep model for object detection with long-tail distribution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 864–873, 2016.
[20] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, volume 1, page 6, 2015.
[21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[22] S. Bengio. The battle against the long tail. Talk at the Workshop on Big Data and Statistical Machine Learning, 2015.
[23] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[24] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[26] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996, 2014.
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[28] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
[29] Y. Wen, Z. Li, and Y. Qiao. Latent factor guided convolutional neural networks for age-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4893–4901, 2016.
[30] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
[31] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 529–534. IEEE, 2011.
[32] J. Yang, B. Price, S. Cohen, and M.-H. Yang. Context driven scene parsing with attention to rare classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3294–3301, 2014.
[33] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
[34] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multi-task cascaded convolutional networks. arXiv preprint arXiv:1604.02878, 2016.
[35] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856, 2014.
[36] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
[37] E. Zhou, Z. Cao, and Q. Yin. Naive-deep face recognition: Touching the limit of lfw benchmark or not? arXiv preprint arXiv:1501.04690, 2015.