Wildest Faces: Face Detection and Recognition in Violent Settings

05/19/2018 · by Mehmet Kerim Yucel, et al.

With the introduction of large-scale datasets and deep learning models capable of learning complex representations, impressive advances have emerged in face detection and recognition tasks. Despite such advances, existing datasets do not capture the difficulty of face recognition in the wildest scenarios, such as hostile disputes or fights. Furthermore, existing datasets do not represent completely unconstrained cases of low resolution, high blur and large pose/occlusion variances. To this end, we introduce the Wildest Faces dataset, which focuses on such adverse effects through violent scenes. The dataset consists of an extensive set of violent scenes of celebrities from movies. Our experimental results demonstrate that state-of-the-art techniques are not well-suited for violent scenes, and therefore, Wildest Faces is likely to stir further interest in face detection and recognition research.


1 Introduction

Likewise, there has been a plethora of studies on face recognition. Compared to the pioneering works of [Turk and Pentland(1991), Ahonen et al.(2006)Ahonen, Hadid, and Pietikainen, Xie et al.(2010)Xie, Shan, Chen, and Chen, Edwards et al.(1998)Edwards, Cootes, and Taylor, Wright et al.(2009)Wright, Yang, Ganesh, Sastry, and Ma, Wiskott et al.(1997)Wiskott, Krüger, Kuiger, and Von Der Malsburg], face recognition models that benefit from deep learning and concentrate on better formulations of distance metric optimization have raised the bar [Schroff et al.(2015)Schroff, Kalenichenko, and Philbin, Taigman et al.(2014)Taigman, Yang, Ranzato, and Wolf, Parkhi et al.(2015)Parkhi, Vedaldi, Zisserman, et al., Wen et al.(2016)Wen, Zhang, Li, and Qiao, Sun et al.(2013)Sun, Wang, and Tang, Sun et al.(2014)Sun, Chen, Wang, and Tang, Sun et al.(2015)Sun, Liang, Wang, and Tang]. In addition to face recognition in still images, video-based face recognition studies have also emerged (see [Ding and Tao(2016)] for a recent survey). Ranging from local feature-based methods [Li et al.(2013)Li, Hua, Lin, Brandt, and Yang, Parkhi et al.(2014)Parkhi, Simonyan, Vedaldi, and Zisserman, Li et al.(2014)Li, Hua, Shen, Lin, and Brandt] to manifolds [Huang et al.(2015b)Huang, Wang, Shan, Li, and Chen] and metric learning [Cheng et al.(2018)Cheng, Zhou, and Han, Huang et al.(2017)Huang, Wang, Van Gool, Chen, et al., Goswami et al.(2017)Goswami, Vatsa, and Singh], recent studies have focused on finding informative frames in image sets [Goswami et al.(2014)Goswami, Bhardwaj, Singh, and Vatsa] and on efficient, fast feature aggregation [Chowdhury et al.(2016)Chowdhury, Lin, Maji, and Learned-Miller, Yang et al.(2017a)Yang, Ren, Chen, Wen, Li, and Hua, Rao et al.(2017a)Rao, Lin, Lu, and Zhou, Rao et al.(2017b)Rao, Lu, and Zhou].

Nevertheless, real-life conditions still challenge state-of-the-art algorithms due to variations in scale, background, pose, expression, lighting, occlusion, age, blur and image resolution. As shown in [Yang et al.(2016)Yang, Luo, Loy, and Tang], several leading algorithms produce severely degraded results in rather unconstrained conditions. Recently, there have been many attempts at building large-scale datasets with a variety of real-life conditions. FDDB [Jain and Learned-Miller(2010)], AFW [Zhu and Ramanan(2012)], PASCAL Faces [Yan et al.(2014)Yan, Zhang, Lei, and Li], Labeled Faces in the Wild (LFW) [Huang et al.(2007)Huang, Ramesh, Berg, and Learned-Miller], CelebFaces [Sun et al.(2013)Sun, Wang, and Tang], YouTube Faces (YTF) [Wolf et al.(2011)Wolf, Hassner, and Maoz], IJB-A [Klare et al.(2015)Klare, Klein, Taborsky, Blanton, Cheney, Allen, Grother, Mah, and Jain], MS-Celeb-1M [Guo et al.(2016)Guo, Zhang, Hu, He, and Gao], VGG-Face [Parkhi et al.(2015)Parkhi, Vedaldi, Zisserman, et al.], VGG2-Face [Cao et al.(2017)Cao, Shen, Xie, Parkhi, and Zisserman], MegaFace [Kemelmacher-Shlizerman et al.(2016)Kemelmacher-Shlizerman, Seitz, Miller, and Brossard] and WIDER Face [Yang et al.(2016)Yang, Luo, Loy, and Tang] have been made publicly available for research purposes. Datasets of extreme scale, such as those of [Schroff et al.(2015)Schroff, Kalenichenko, and Philbin] and [Taigman et al.(2014)Taigman, Yang, Ranzato, and Wolf], have also been used but have not been disclosed to the public. However, these datasets can still be considered "controlled" in several regards, such as resolution, the presence of motion blur and the very quality of the image. Moreover, these datasets mostly omit noisy samples and are not representative of extreme expressions, such as anger and fear in violent scenes.

Figure 1: Our dataset creation pipeline is shown in the first row. Faces with green bounding boxes indicate the celebrities that are used for recognition. The second row shows sample recognition images from the Wildest Faces dataset, which exhibit a variety of real-life conditions. Note the amount of pose variation, blur and low image quality. Moreover, Wildest Faces offers considerable age variance, extreme facial expressions as well as severe occlusion.

In this paper, we present a new benchmark dataset, namely Wildest Faces, where we put the emphasis on violent scenes with virtually unconstrained scenarios. In addition to previously studied adverse conditions, the Wildest Faces dataset contains images spanning a large spectrum of image quality, resolution and motion blur (see Fig. 1). The dataset consists of videos of celebrities in fight scenes. There are 67,889 images (i.e. frames) and 2,186 shots of 64 celebrities, and all of the video frames are manually annotated to foster research on both detection and recognition of "faces in the wildest". It is especially important from a surveillance perspective to identify the people involved in crime scenes, and we believe that the availability of such a dataset of violent faces will stir further research in this direction as well.

We provide a detailed discussion of the statistics and an evaluation of state-of-the-art methods on the proposed dataset. We exploit the dataset in the context of face detection as well as image-based and video-based face recognition. For video face recognition, we also introduce an attention-based temporal pooling technique that aggregates video frames in a simple and effective way. Our experimental results demonstrate that this technique compares favorably to alternatives, whilst there is still large room for improvement on this challenging dataset, which is likely to facilitate further research.

2 Discussion on available datasets

Face Detection Datasets: AFW [Zhu and Ramanan(2012)] contains background clutter with different face variations, and its annotations include bounding boxes, facial landmarks and pose angle labels. FDDB [Jain and Learned-Miller(2010)] is built using Yahoo! News images, where faces without both eyes in clear sight are discarded, which leads to a rather constrained distribution in terms of pose and occlusion. IJB-A [Klare et al.(2015)Klare, Klein, Taborsky, Blanton, Cheney, Allen, Grother, Mah, and Jain] is one of the few datasets that contains annotations for both recognition and detection tasks. MALF [Yang et al.(2015a)Yang, Yan, Lei, and Li] incorporates rich annotations, in the sense that they contain pose, gender and occlusion information as well as expression information with a certain level of granularity. PASCAL Faces [Yan et al.(2014)Yan, Zhang, Lei, and Li] contains images selected from PASCAL VOC [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman]. In AFLW [Martin Koestinger and Bischof(2011)], annotations come with rich facial landmark information. WIDER Face [Yang et al.(2016)Yang, Luo, Loy, and Tang] is one of the largest datasets released for face detection. Collected using categories chosen from LSCOM [Naphade et al.(2006)Naphade, Smith, Tesic, Chang, Hsu, Kennedy, Hauptmann, and Curtis], each annotation is categorized according to its scale, occlusion, pose, overall difficulty and event, which facilitates in-depth analysis. Detailed information on these datasets can be found in Table 1.

Dataset # Images # Faces Source Type Public
AFW [Zhu and Ramanan(2012)] 205 473 Flickr Images Yes
FDDB [Jain and Learned-Miller(2010)] 2,845 5,171 Yahoo! News Images Yes
IJB-A [Klare et al.(2015)Klare, Klein, Taborsky, Blanton, Cheney, Allen, Grother, Mah, and Jain] 24,327 49,579 Internet Images / Videos Yes
MALF [Yang et al.(2015a)Yang, Yan, Lei, and Li] 5,250 11,931 Flickr, Baidu Inc. Images Yes
AFLW [Martin Koestinger and Bischof(2011)] 21,997 25,993 Flickr Images Yes
PASCAL Faces [Yan et al.(2014)Yan, Zhang, Lei, and Li] 851 1,335 PASCAL VOC Images Yes
WIDER Face [Yang et al.(2016)Yang, Luo, Loy, and Tang] 32,203 393,703 Google, Bing Images Yes
Wildest Faces 67,889 109,771 YouTube Videos Yes
Table 1: Face detection datasets.

Face Recognition Datasets: Labeled Faces in the Wild (LFW) [Huang et al.(2007)Huang, Ramesh, Berg, and Learned-Miller] is one of the most widely used datasets in the recognition literature. The Viola-Jones detector [Viola and Jones(2001)] is used to detect faces during the collection phase, followed by manual correction of the annotations. PubFig [Kumar et al.(2009)Kumar, Berg, Belhumeur, and Nayar] is created as a complement to LFW; its faces are images of public celebrities collected using Google and Flickr. Celebrity Faces [Sun et al.(2013)Sun, Wang, and Tang] is likewise constructed from public figures. In one of the turning points of face recognition, the large-scale VGG Face dataset [Parkhi et al.(2015)Parkhi, Vedaldi, Zisserman, et al.] is released with the help of automated face detection and a stunning 200 human annotators; during its collection, care is taken to avoid sharing individuals with the LFW and YTF datasets. Recently, this dataset is further expanded in [Cao et al.(2017)Cao, Shen, Xie, Parkhi, and Zisserman] as VGG Face-2, which is considerably larger than its predecessor. FaceScrub [Ng and Winkler(2014)] is another dataset comprised primarily of celebrities. CASIA-WebFace [Yi et al.(2014)Yi, Lei, Liao, and Li] is another popular dataset, though its authors note that they cannot guarantee all images are annotated correctly. MS-Celeb-1M [Guo et al.(2016)Guo, Zhang, Hu, He, and Gao] contains approximately 10 million images of 100,000 individuals, where 1,500 of them are celebrities. In one of the latest publicly released benchmarks, MegaFace [Kemelmacher-Shlizerman et al.(2016)Kemelmacher-Shlizerman, Seitz, Miller, and Brossard] contains a large set of pictures from Flickr, each face at least 50 pixels in both dimensions, where faces are detected using HeadHunter [Mathias et al.(2014)Mathias, Benenson, Pedersoli, and Van Gool]. The authors of [Kemelmacher-Shlizerman et al.(2016)Kemelmacher-Shlizerman, Seitz, Miller, and Brossard] also present an improved version of MegaFace, dubbed MF2 [Nech and Kemelmacher-Shlizerman(2017)], that builds on its predecessor. Additionally, tech giants have utilized their proprietary datasets in Facebook's DeepFace [Taigman et al.(2014)Taigman, Yang, Ranzato, and Wolf], Google's FaceNet [Schroff et al.(2015)Schroff, Kalenichenko, and Philbin] and NTechLab's internal dataset (https://ntechlab.com).

For video face recognition, YouTube Faces [Wolf et al.(2011)Wolf, Hassner, and Maoz] uses [Viola and Jones(2001)] to automatically detect faces. Each face in the data is centered, expanded with a 2.2 magnification factor, and the annotation size is fixed at 100 pixels in both dimensions. Two other prominent video face recognition datasets are COX [Huang et al.(2015a)Huang, Shan, Wang, Zhang, Lao, Kuerban, and Chen] and PaSC [Beveridge et al.(2013)Beveridge, Phillips, Bolme, Draper, Givens, Lui, Teli, Zhang, Scruggs, Bowyer, et al.]. Despite their relatively large sizes, PaSC suffers from video location constraints, and COX from demographic as well as video location constraints. Detailed information on these datasets can be found in Table 2.

Limitations of the available datasets: Except for WIDER Face, the available datasets generally focus on high-resolution, high-quality images. Moreover, several of these datasets filter out low-quality, occluded and blurred images, and thus do not represent what is out there in the real world. Although there are video recognition datasets that inherently contain motion-blurred or comparably low-quality images (e.g. [Wolf et al.(2011)Wolf, Hassner, and Maoz]), the majority of datasets are likely to suffer from the bias of automatically performed face detection. In addition, to the best of our knowledge, none of these datasets primarily focuses on violent scenes, where unconstrained scenarios might actually introduce unconstrained effects.

Dataset # Images (or videos) # Individuals Source Type
Wildest Faces 2,186 videos (64,242 frames) 64 YouTube Videos
COX [Huang et al.(2015a)Huang, Shan, Wang, Zhang, Lao, Kuerban, and Chen] 3,000 1,000 Custom Videos
PaSC [Beveridge et al.(2013)Beveridge, Phillips, Bolme, Draper, Givens, Lui, Teli, Zhang, Scruggs, Bowyer, et al.] 2,802 videos + 9,376 frames 293 Custom Videos
YTF [Wolf et al.(2011)Wolf, Hassner, and Maoz] 3,425 1,595 YouTube Videos
LFW [Huang et al.(2007)Huang, Ramesh, Berg, and Learned-Miller] 13,233 5,749 Yahoo! News Images
PubFig [Kumar et al.(2009)Kumar, Berg, Belhumeur, and Nayar] 60,000 200 Google, Flickr Images
CelebA [Yang et al.(2015b)Yang, Luo, Loy, and Tang] 202,599 10,177 Google, Bing Images
CelebFaces [Sun et al.(2013)Sun, Wang, and Tang] 87,628 5,436 Flickr, Baidu Inc. Images
VGG Face [Parkhi et al.(2015)Parkhi, Vedaldi, Zisserman, et al.] 2.6M 2,622 Google, Bing Images
FaceScrub [Ng and Winkler(2014)] 106,863 530 Internet Images
CASIA-WebFace [Yi et al.(2014)Yi, Lei, Liao, and Li] 494,414 10,000 IMDB Images
MegaFace [Kemelmacher-Shlizerman et al.(2016)Kemelmacher-Shlizerman, Seitz, Miller, and Brossard] 1M 690,572 Flickr Images
VGG-2 [Cao et al.(2017)Cao, Shen, Xie, Parkhi, and Zisserman] 3.2M 9,131 Google Images
MF2 [Nech and Kemelmacher-Shlizerman(2017)] 4.7M 672,000 Flickr Images
MS-Celeb-1M [Guo et al.(2016)Guo, Zhang, Hu, He, and Gao] 10M 100,000 Internet, Bing Images
DeepFace [Taigman et al.(2014)Taigman, Yang, Ranzato, and Wolf] 4M 4,000 Internal Images
FaceNet [Schroff et al.(2015)Schroff, Kalenichenko, and Philbin] 500M 8M Internal Images
NTechLab † 18.4M 200,000 Internal Images
Table 2: Face recognition datasets. † indicates a private dataset. Among the available video face recognition datasets, Wildest Faces has the highest video count per individual.

3 Wildest Faces Dataset

Figure 2: K-means cluster centers for Al Pacino images in Wildest Faces, FaceScrub [Ng and Winkler(2014)] and YouTube Faces [Wolf et al.(2011)Wolf, Hassner, and Maoz] are shown in the first, second and third rows, respectively. k=8 for Wildest Faces and FaceScrub, and k=3 for YouTube Faces, as higher k values produce repetitive images. Average faces from Wildest Faces are the least recognizable, indicating a large degree of variance in adverse effects. Images are histogram equalized for convenience.

Human faces are in their wildest form during violence or fights, with their expressions uncontrolled. Besides, fast movements during violence naturally result in challenges in pose, occlusion and blur. Based on these observations, we construct the Wildest Faces dataset from YouTube videos by focusing on violent scenes of celebrities in movies.

3.1 Data Collection and Annotation

We first identified celebrities who are known for acting in movies with violence. We then picked their videos from YouTube in a variety of scene settings: car chases, indoor fist fights, gun fights, heated arguments and science-fiction/fantasy battles. This abundance of scene settings provides an inherent variety of possible occluding objects, poses, background clutter and blur (see Fig. 1). The majority of the frames of each video contain a celebrity's face, though in some frames the celebrity may not be present. Videos, with an average frame rate of 25 FPS, are then divided into shots with a maximum duration of 10 seconds.

In total, we choose 64 celebrities and collect 2,186 shots from 410 videos, which results in 67,889 frames with 109,771 manually annotated bounding boxes. In order to test generalization ability thoroughly, we split the dataset based on videos and do not include any shots from a training video in the other splits. The training, validation and test splits yield ratios of 56%-23%-21% video-wise and 61%-20%-19% frame-wise. Video-based splitting also introduces age variation across splits; e.g. the training set includes Sean Connery in his early acting days, whereas the test set solely includes him in the late stages of his career.

Ground-truth locations of faces have been annotated by 12 annotators using VoTT (https://github.com/Microsoft/VoTT). We also label our celebrities with a target tag for recognition and label the rest of the faces as non-target. We do not omit any adverse effect; we label extremely tiny, occluded, frontal/profile and blurred faces. When creating the recognition set, we simply crop the target label from each frame in the dataset and expand the area by a factor of 0.15 to make sure we do not miss any facial parts. An example illustration can be seen in Fig. 1. As we do not have celebrity faces in every collected frame, our recognition set consists of 64,242 frames in total.
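For illustration, here is a minimal sketch of that crop expansion; only the 0.15 factor comes from the text, while the function name, per-side expansion and border clipping are our assumptions:

```python
def expand_face_box(x, y, w, h, img_w, img_h, factor=0.15):
    """Expand a ground-truth face box by `factor` of its size on each
    side and clip it to the image borders (clipping is assumed)."""
    dx, dy = w * factor, h * factor
    x0 = max(0, int(round(x - dx)))
    y0 = max(0, int(round(y - dy)))
    x1 = min(img_w, int(round(x + w + dx)))
    y1 = min(img_h, int(round(y + h + dy)))
    return x0, y0, x1, y1  # crop = frame[y0:y1, x0:x1]
```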

3.2 Statistics

Figure 3: Wildest Faces statistics: (a) detection scales, (b) detection blur, (c) recognition blur, (d) recognition age. In (a), blue and red correspond to width and height, respectively. The detection set contains severely blurred data, whereas the recognition set has a more balanced blur distribution. For detection scales, we see an equal emphasis on small and large faces.

The Wildest Faces dataset has a diverse distribution of faces. In Figure 2, k-means cluster centers of Al Pacino's images (dataset-wide) are shown for FaceScrub [Ng and Winkler(2014)], YouTube Faces [Wolf et al.(2011)Wolf, Hassner, and Maoz] and Wildest Faces. It is clear that our dataset has a wide spectrum of adverse effects, as its cluster centers are far from recognizable as Al Pacino. Wildest Faces offers good scale variance for detection, as well as a high amount of blur. The recognition set offers a good distribution over several blur levels, as well as a noticeable average age variance. Occluded shots make up roughly half of the available data, which poses an additional challenge. Moreover, pose variance is sufficiently large in each shot, which promotes pose-invariance in video face recognition. In the following, we present the analysis of these effects.

Scale. We classify faces into small, medium and large categories with respect to their heights: below 100 pixels as small, between 100 and 300 pixels as medium, and larger than 300 pixels as large. Scale statistics for the detection set are shown in Figure 3(a). For the recognition set, the balance shifts slightly from small to medium.
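In code, this categorization reduces to simple thresholding on face height (a minimal sketch using the thresholds above; the function name is ours):

```python
def scale_category(face_height_px):
    """Map a face to the paper's height-based scale categories."""
    if face_height_px < 100:
        return "small"
    if face_height_px <= 300:
        return "medium"
    return "large"
```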

Blur. We follow a multi-stage procedure to quantify the blur present in images. Inspired by [Pech-Pacheco et al.(2000)Pech-Pacheco, Cristóbal, Chamorro-Martinez, and Fernández-Valdivia], we perform contrast normalization and then convert the images to grayscale. Grayscale images are then convolved with a 3x3 Laplacian kernel, and the variance of the result is used as a blurriness value, from which we empirically find thresholds that divide the images into blur categories. We then manually correct any wrong blur labels. Blur statistics are shown in Figure 3(b) and Figure 3(c).
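A rough OpenCV sketch of this measure follows; the use of histogram equalization for the contrast-normalization step and the numeric cut-offs are our assumptions, since the text fixes them empirically and corrects labels by hand:

```python
import cv2

def blur_score(image_bgr):
    """Variance-of-Laplacian sharpness measure after contrast
    normalization, in the spirit of Pech-Pacheco et al. (2000)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)                    # contrast normalization (assumed operator)
    lap = cv2.Laplacian(gray, cv2.CV_64F, ksize=3)   # 3x3 Laplacian kernel
    return lap.var()                                 # low variance => blurry

def blur_category(score, cuts=(50.0, 100.0, 300.0)):
    """Map a score to the four blur levels; cut-offs are placeholders."""
    severe, high, medium = cuts
    if score < severe:
        return "severe"
    if score < high:
        return "high"
    if score < medium:
        return "medium"
    return "low"
```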

Age. For each individual, we also measure the age variance, i.e. the difference between the dates of their earliest and latest movies in our dataset. We see drastic age variation for certain individuals, up to 40 years (see Figure 3(d)). On average, 13 years of age variation per individual is observed.

Occlusion. We provide occlusion information at the shot level for the recognition set; we label shots as no occlusion, mixed or significant. Shots labelled mixed have occlusion in several frames of the shot, but never with more than half of the face occluded. The significant label indicates that there are several frames with heavy occlusion, where at least half of the face is occluded. We randomly select 250 shots from our dataset and analyze them; this leads to ratios of 20%, 28% and 52% for the significant, mixed and no-occlusion tags, respectively.

Pose. For selected individuals, we present four average faces (each computed from a shot). We make sure that there is no occlusion or high blur in these shots, so that only pose variation is at play. It can be clearly seen from Figure 4 that high pose variance leads to unidentifiable average faces, supporting the complexity of the Wildest Faces dataset.

(a) Al Pacino
(b) Dwayne Johnson
(c) Bruce Willis
(d) Chuck Norris
Figure 4: Pairs of average faces taken from sample shots (low-blur, minimally occluded) of example subjects in Wildest Faces. In each pair, the first image is the average of a shot with minimal pose variation and the second is the average of a shot with severe pose variation. The comparison between these images indicates large pose diversity in our dataset. Images are histogram equalized for convenience.

4 Attentive Temporal Pooling

For the purpose of video face recognition, we propose a simple yet effective technique which we refer to as attentive temporal pooling, inspired by [Yang et al.(2017a)Yang, Ren, Chen, Wen, Li, and Hua]. The intuition behind this model is to exploit the hidden pose information in a trainable fashion, so as to extract useful information from noisy sequences of video frames. The proposed approach consists of three main components: i) an attention layer, ii) a pooling layer, and iii) a fully connected layer. The attention module learns to promote the informative parts of a given image sequence. Through the pooling layer, the overall sequence information is aggregated and fed into the fully connected layer. This simple framework operates over CNN features.

More formally, the input is a matrix $X \in \mathbb{R}^{T \times d}$ of $d$-dimensional CNN feature vectors coming from $T$ frames. An attention weight matrix $A \in \mathbb{R}^{k \times d}$ is initialized using the Xavier normal method [Glorot and Bengio(2010)]; $k$ is a hyperparameter that needs to be tuned empirically. The attention scores are calculated by

$$E = A X^\top, \qquad (1)$$

which results in a $k \times T$ matrix. This matrix is then fed to a softmax function that operates over the temporal dimension, yielding $S = \mathrm{softmax}(E)$. The $i$-th row of the resulting matrix $S$ can be considered as a weight distribution over the frames, for the pose captured by the $i$-th row of the matrix $A$. We use the estimated attention weights to temporally pool the per-frame feature vectors. More specifically, we extract the video representation by computing the weighted sums

$$V = S X, \qquad (2)$$

where the resulting matrix $V$ is of size $k \times d$. The output is then aggregated with max-pooling and fed into the fully connected layer, which is used for classification with a cross-entropy loss. The model is implemented in PyTorch [Paszke et al.(2017)Paszke, Gross, Chintala, and Chanan]. The network parameters are optimized using SGD with a learning rate of 0.0001 and a momentum of 0.9. The batch size is set to 1. We note that this approach can be considered as a generalization of the aggregation scheme proposed in [Yang et al.(2017a)Yang, Ren, Chen, Wen, Li, and Hua], which is equivalent to Eq. 2 for $k = 1$.
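A minimal PyTorch sketch of this module follows; the feature dimension, the value of k and the classifier head sizes are illustrative defaults and not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AttentiveTemporalPooling(nn.Module):
    """Attentive temporal pooling over per-frame CNN features (sketch)."""
    def __init__(self, feat_dim=4096, k=4, num_classes=64):
        super().__init__()
        # Attention weight matrix A (k x d), Xavier-normal initialized.
        self.A = nn.Parameter(torch.empty(k, feat_dim))
        nn.init.xavier_normal_(self.A)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, X):
        # X: (T, d) per-frame CNN features for one shot.
        E = self.A @ X.t()            # Eq. 1: (k, T) attention scores
        S = torch.softmax(E, dim=1)   # softmax over the temporal axis
        V = S @ X                     # Eq. 2: (k, d) weighted sums
        v = V.max(dim=0).values       # max-pool the k pooled vectors
        return self.fc(v)             # class logits

# Batch size 1, SGD with lr 0.0001 and momentum 0.9, as in the text.
model = AttentiveTemporalPooling()
opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
logits = model(torch.randn(30, 4096))  # a hypothetical 30-frame shot
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([7]))
```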

5 Experimental Results

5.1 Face Detection

We first evaluate face detection performance on the Wildest Faces dataset. For this purpose, we pick three recent techniques: the Single-Shot Scale-Invariant Face Detector (SFD) [Zhang et al.(2017)Zhang, Zhu, Lei, Shi, Wang, and Li], Tiny Faces [Hu and Ramanan(2017)] and the Single Stage Headless detector (SSH) [Najibi et al.(2017)Najibi, Samangouei, Chellappa, and Davis] (we use the code released by the papers' authors). We also evaluate a light-weight, SSD [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg]-based face detector available in OpenCV (https://github.com/opencv/opencv/tree/master/samples/dnn/face_detector). We use all these techniques in an "as-is" configuration; we apply the available pre-trained models (trained on WIDER Face [Yang et al.(2016)Yang, Luo, Loy, and Tang]) to all our data (train, validation and test splits combined). Since our main focus in this work is on video face recognition, we do not perform any training on Wildest Faces; hence we compute the performance of the detectors over the entire dataset of 67,889 images.
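As an illustration of this "as-is" usage, here is a minimal sketch of running the OpenCV SSD-based face detector on a single frame; the model file names follow the samples/dnn/face_detector directory linked above, and the 0.5 confidence threshold is our choice, not the paper's:

```python
import cv2

net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                               "res10_300x300_ssd_iter_140000.caffemodel")
frame = cv2.imread("frame.jpg")
h, w = frame.shape[:2]
blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                             (300, 300), (104.0, 177.0, 123.0))
net.setInput(blob)
detections = net.forward()  # shape (1, 1, N, 7): [., ., conf, x0, y0, x1, y1]
for i in range(detections.shape[2]):
    conf = float(detections[0, 0, i, 2])
    if conf > 0.5:
        # Box coordinates are normalized; scale back to image size.
        x0, y0, x1, y1 = (detections[0, 0, i, 3:7] * [w, h, w, h]).astype(int)
        print(conf, (x0, y0, x1, y1))
```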

Overall. Detection results are shown in Table 3 and Figure 5(h). It can be said that our dataset offers a new challenge for all the detectors. Performance-wise, we see Tiny Faces [Hu and Ramanan(2017)] and SSH [Najibi et al.(2017)Najibi, Samangouei, Chellappa, and Davis] performing on par with each other. SFD [Zhang et al.(2017)Zhang, Zhu, Lei, Shi, Wang, and Li] is the third best, whereas the light-weight SSD [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] performs the worst.

Blur. Our blur analysis results are shown in Figures 5(a) to 5(d). We observe that blur severely degrades every detector: the higher the blur, the worse the detection performance. SSH [Najibi et al.(2017)Najibi, Samangouei, Chellappa, and Davis] appears to be the detector most robust to blur, whereas in low-blur cases Tiny Faces [Hu and Ramanan(2017)] performs better by a slight margin.

Scale. We test the performance of the detectors at different scales. Results are shown in Figures 5(e) to 5(g). The same trend as in the overall performance is visible here as well: Tiny Faces [Hu and Ramanan(2017)] takes the lead on large faces, with SSH [Najibi et al.(2017)Najibi, Samangouei, Chellappa, and Davis] closely trailing, whereas the others fall visibly behind. As faces become smaller, SSH catches up and takes the lead from Tiny Faces. All detectors show degraded performance as faces become smaller. We perform the same assessment for width and observe a similar trend.

These findings indicate that there is still considerable room for improvement for face detection in challenging cases like extreme blur or small size.

Method Large Medium Small Severe Blur High Blur Medium Blur Low Blur Overall
SSD [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg]-based detector 73.2% 47.1% 19.9% 36% 56.7% 68% 70.2% 51.6%
SFD [Zhang et al.(2017)Zhang, Zhu, Lei, Shi, Wang, and Li] 84.6% 75.9% 69.5% 74.3% 78.4% 84% 87% 77.3%
Tiny Faces [Hu and Ramanan(2017)] 95.6% 89.3% 80.7% 85.2% 89.6% 92.5% 94.6% 90.5%
SSH [Najibi et al.(2017)Najibi, Samangouei, Chellappa, and Davis] 94.1% 90.7% 82.4% 88.4% 92% 93.7% 94% 90.7%
Table 3: Detection AP values. Small, Medium and Large refer to height scale categories.
Figure 5: Detection results under different conditions: (a) severe blur, (b) high blur, (c) medium blur, (d) low blur, (e) large height, (f) medium height, (g) small height, (h) overall. Smaller scales and high blur levels severely degrade the results of all face detectors.

5.2 Face Recognition

5.2.1 Image-based Face Recognition

For image-based face recognition, we use the train, validation and test splits of the Wildest Faces dataset, which consist of 39,459, 12,088 and 12,695 face images, respectively. We use two prominent face recognition approaches: VGG Face [Parkhi et al.(2015)Parkhi, Vedaldi, Zisserman, et al.] and Center Loss [Wen et al.(2016)Wen, Zhang, Li, and Qiao] (trained on LFW [Huang et al.(2007)Huang, Ramesh, Berg, and Learned-Miller]). We first train these models from scratch on Wildest Faces, but observe that they achieve significantly better results with pretrained models (trained on considerably larger datasets). We resize face regions to 96x96 and perform the relevant preprocessing steps in line with each technique's implementation using Caffe [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell]. We make minimal changes to the original hyperparameters during training to improve convergence.

The image-based recognition results are shown in Table 4. Besides comparing face recognition techniques, we also test the effect of alignment. For this purpose, we utilize the MTCNN alignment technique [Zhang et al.(2016)Zhang, Zhang, Li, and Qiao]; we bypass MTCNN's detector and use the ground-truth face locations during training. We add fully connected layers to the end of both networks of [Parkhi et al.(2015)Parkhi, Vedaldi, Zisserman, et al.] and [Wen et al.(2016)Wen, Zhang, Li, and Qiao] to cast them as classifiers, since the original models were built for identification. The experimental results show that when no alignment is used, Center Loss [Wen et al.(2016)Wen, Zhang, Li, and Qiao] yields superior results. On the contrary, VGG Face [Parkhi et al.(2015)Parkhi, Vedaldi, Zisserman, et al.] benefits significantly from alignment and yields on-par performance when alignment is present.

5.2.2 Video Face Recognition

Our dataset consists of video clips of celebrities, so it is well-suited as a benchmark for video face recognition. The train, validation and test splits consist of 1,347, 387 and 452 shots, respectively. The simplest baseline is majority voting using the techniques presented for image-based face recognition; a sketch of this baseline is given below. Results are shown in Table 5. We measure recognition performance both at the frame level and at the shot level: frame-level performance is evaluated as accuracy over 12,695 images, and shot-level performance as accuracy over 452 shots.
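A minimal sketch of the majority-voting baseline, assuming per-frame class scores are available; the function names and data layout are ours:

```python
from collections import Counter
import numpy as np

def shot_prediction(frame_scores):
    """Each frame votes with its argmax identity; the shot takes the
    most common vote. `frame_scores` is a (T, C) array of class scores."""
    votes = np.argmax(frame_scores, axis=1)
    return Counter(votes.tolist()).most_common(1)[0][0]

def shot_level_accuracy(shots, labels):
    """shots: list of (T_i, C) score arrays; labels: true identities."""
    hits = sum(shot_prediction(s) == y for s, y in zip(shots, labels))
    return hits / len(labels)
```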

For video face recognition, we also train several LSTM [Hochreiter and Schmidhuber(1997)] architectures. Using the fine-tuned VGG features aligned with MTCNN, we implement a single-layer LSTM, a 2-layer LSTM (LSTM2) and a bi-directional LSTM (BiLSTM), and compare their performance with the attentive temporal pooling method described above. The RMSprop optimizer with a learning rate of 0.0001 is used in all LSTM configurations for a fair comparison. Hidden sizes are fixed to 4096. Results are shown in Table 5.
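For reference, a minimal sketch of one such baseline, the BiLSTM; the hidden size of 4096 and the RMSprop settings follow the text, while the classification head and the use of the last time step are our assumptions:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bi-directional LSTM over per-frame features (sketch)."""
    def __init__(self, feat_dim=4096, hidden=4096, num_classes=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):             # x: (batch, T, feat_dim)
        out, _ = self.lstm(x)         # (batch, T, 2 * hidden)
        return self.fc(out[:, -1])    # logits from the final time step

model = BiLSTMClassifier()
opt = torch.optim.RMSprop(model.parameters(), lr=1e-4)  # lr as in the text
```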

As expected, majority voting over standard image-based techniques fails to yield competitive results at the shot level, whereas the frame-level accuracy of VGG Face [Parkhi et al.(2015)Parkhi, Vedaldi, Zisserman, et al.] is competitive with the video-based recognition techniques. At the shot level, the best-performing LSTM model is the single-layer LSTM, whereas the two-layer LSTM performs better at the frame level. Overall, we observe that the proposed attentive temporal pooling model performs the best on average. Note that the accuracies all hover around the 50% mark, indicating that violent face recognition research can benefit from more tailored models.

Table 4: Image-based face recognition results.
Method Frame-Level Shot-Level
VGG Face [Parkhi et al.(2015)Parkhi, Vedaldi, Zisserman, et al.] 51.98% 49.5%
Center Loss [Wen et al.(2016)Wen, Zhang, Li, and Qiao] 49.6% 46.6%
LSTM 52.1% 51.9%
LSTM2 52.3% 49.3%
BiLSTM 49.6% 50.6%
AttTempPool 52.2% 52.6%
Table 5: Accuracy values for video face recognition. In shot-level evaluation, accuracy is calculated over shots, whereas in frame-level evaluation, accuracy is calculated over frames by assigning every frame in a shot the label predicted for the sequence.

6 Conclusion

Inspired by the lack of a publicly available face detection and recognition dataset that concentrates primarily on violent scenes, we introduce the Wildest Faces dataset, which encompasses a large spectrum of adverse effects, such as severe blur, low resolution and significant diversity in pose and occlusion. The dataset includes annotations for face detection as well as recognition, with various tags such as blur severity, scale and occlusion. To the best of our knowledge, this is the first face dataset that focuses on violent scenes, which inherently contain extreme facial expressions along with other challenging aspects.

We also provide benchmarks using prominent detection and recognition techniques, and introduce an attention-based temporal pooling technique to aggregate video frames in a simple and effective way. We observe that existing approaches fall short of tackling the challenges of Wildest Faces. We hope Wildest Faces will boost face recognition and detection research towards such edge cases. We will provide continuous improvements and additions to the Wildest Faces dataset in the future. (The dataset with annotations will be made available upon publication.)

References

  • [Ahonen et al.(2006)Ahonen, Hadid, and Pietikainen] Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE transactions on pattern analysis and machine intelligence, 28(12):2037–2041, 2006.
  • [Beveridge et al.(2013)Beveridge, Phillips, Bolme, Draper, Givens, Lui, Teli, Zhang, Scruggs, Bowyer, et al.] J Ross Beveridge, P Jonathon Phillips, David S Bolme, Bruce A Draper, Geof H Givens, Yui Man Lui, Mohammad Nayeem Teli, Hao Zhang, W Todd Scruggs, Kevin W Bowyer, et al. The challenge of face recognition from digital point-and-shoot cameras. In Biometrics: Theory, Applications and Systems (BTAS), 2013 IEEE Sixth International Conference on, pages 1–8. IEEE, 2013.
  • [Cao et al.(2017)Cao, Shen, Xie, Parkhi, and Zisserman] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. arXiv preprint arXiv:1710.08092, 2017.
  • [Cheng et al.(2018)Cheng, Zhou, and Han] Gong Cheng, Peicheng Zhou, and Junwei Han. Duplex metric learning for image set classification. IEEE Transactions on Image Processing, 27(1):281–292, 2018.
  • [Chowdhury et al.(2016)Chowdhury, Lin, Maji, and Learned-Miller] Aruni Roy Chowdhury, Tsung-Yu Lin, Subhransu Maji, and Erik Learned-Miller. One-to-many face recognition with bilinear cnns. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pages 1–9. IEEE, 2016.
  • [Ding and Tao(2016)] Changxing Ding and Dacheng Tao. A comprehensive survey on pose-invariant face recognition. ACM Transactions on intelligent systems and technology (TIST), 7(3):37, 2016.
  • [Edwards et al.(1998)Edwards, Cootes, and Taylor] Gareth J Edwards, Timothy F Cootes, and Christopher J Taylor. Face recognition using active appearance models. In European conference on computer vision, pages 581–595. Springer, 1998.
  • [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [Farfade et al.(2015)Farfade, Saberian, and Li] Sachin Sudhakar Farfade, Mohammad J Saberian, and Li-Jia Li. Multi-view face detection using deep convolutional neural networks. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pages 643–650. ACM, 2015.
  • [Glorot and Bengio(2010)] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
  • [Goswami et al.(2014)Goswami, Bhardwaj, Singh, and Vatsa] Gaurav Goswami, Romil Bhardwaj, Richa Singh, and Mayank Vatsa. Mdlface: Memorability augmented deep learning for video face recognition. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–7. IEEE, 2014.
  • [Goswami et al.(2017)Goswami, Vatsa, and Singh] Gaurav Goswami, Mayank Vatsa, and Richa Singh. Face verification via learned representation on feature-rich video frames. IEEE Transactions on Information Forensics and Security, 12(7):1686–1698, 2017.
  • [Guo et al.(2016)Guo, Zhang, Hu, He, and Gao] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: Challenge of recognizing one million celebrities in the real world. Electronic Imaging, 2016(11):1–6, 2016.
  • [Hochreiter and Schmidhuber(1997)] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [Hu and Ramanan(2017)] Peiyun Hu and Deva Ramanan. Finding tiny faces. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1522–1530. IEEE, 2017.
  • [Huang et al.(2007)Huang, Ramesh, Berg, and Learned-Miller] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
  • [Huang et al.(2015a)Huang, Shan, Wang, Zhang, Lao, Kuerban, and Chen] Zhiwu Huang, Shiguang Shan, Ruiping Wang, Haihong Zhang, Shihong Lao, Alifu Kuerban, and Xilin Chen. A benchmark and comparative study of video-based face recognition on cox face database. IEEE Transactions on Image Processing, 24(12):5967–5981, 2015a.
  • [Huang et al.(2015b)Huang, Wang, Shan, Li, and Chen] Zhiwu Huang, Ruiping Wang, Shiguang Shan, Xianqiu Li, and Xilin Chen. Log-euclidean metric learning on symmetric positive definite manifold with application to image set classification. In International conference on machine learning, pages 720–729, 2015b.
  • [Huang et al.(2017)Huang, Wang, Van Gool, Chen, et al.] Zhiwu Huang, Ruiping Wang, Luc Van Gool, Xilin Chen, et al. Cross euclidean-to-riemannian metric learning with application to face recognition from video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [Jain and Learned-Miller(2010)] Vidit Jain and Erik Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. University of Massachusetts, Amherst, Tech. Rep. UM-CS-2010-009, 2(7):8, 2010.
  • [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
  • [Kemelmacher-Shlizerman et al.(2016)Kemelmacher-Shlizerman, Seitz, Miller, and Brossard] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
  • [Klare et al.(2015)Klare, Klein, Taborsky, Blanton, Cheney, Allen, Grother, Mah, and Jain] Brendan F Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah, and Anil K Jain. Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1931–1939, 2015.
  • [Kumar et al.(2009)Kumar, Berg, Belhumeur, and Nayar] Neeraj Kumar, Alexander C Berg, Peter N Belhumeur, and Shree K Nayar. Attribute and simile classifiers for face verification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 365–372. IEEE, 2009.
  • [Li et al.(2013)Li, Hua, Lin, Brandt, and Yang] Haoxiang Li, Gang Hua, Zhe Lin, Jonathan Brandt, and Jianchao Yang. Probabilistic elastic matching for pose variant face verification. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3499–3506. IEEE, 2013.
  • [Li et al.(2014)Li, Hua, Shen, Lin, and Brandt] Haoxiang Li, Gang Hua, Xiaohui Shen, Zhe Lin, and Jonathan Brandt. Eigen-pep for video face recognition. In Asian Conference on Computer Vision, pages 17–33. Springer, 2014.
  • [Li et al.(2015)Li, Lin, Shen, Brandt, and Hua] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015.
  • [Li and Zhang(2013)] Jianguo Li and Yimin Zhang. Learning surf cascade for fast and accurate object detection. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3468–3475. IEEE, 2013.
  • [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [Martin Koestinger and Bischof(2011)] Peter M. Roth Martin Koestinger, Paul Wohlhart and Horst Bischof. Annotated Facial Landmarks in the Wild: A Large-scale, Real-world Database for Facial Landmark Localization. In Proc. First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.
  • [Mathias et al.(2014)Mathias, Benenson, Pedersoli, and Van Gool] Markus Mathias, Rodrigo Benenson, Marco Pedersoli, and Luc Van Gool. Face detection without bells and whistles. In European Conference on Computer Vision, pages 720–735. Springer, 2014.
  • [Najibi et al.(2017)Najibi, Samangouei, Chellappa, and Davis] Mahyar Najibi, Pouya Samangouei, Rama Chellappa, and Larry Davis. Ssh: Single stage headless face detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4875–4884, 2017.
  • [Naphade et al.(2006)Naphade, Smith, Tesic, Chang, Hsu, Kennedy, Hauptmann, and Curtis] Milind Naphade, John R Smith, Jelena Tesic, Shih-Fu Chang, Winston Hsu, Lyndon Kennedy, Alexander Hauptmann, and Jon Curtis. Large-scale concept ontology for multimedia. IEEE multimedia, 13(3):86–91, 2006.
  • [Nech and Kemelmacher-Shlizerman(2017)] Aaron Nech and Ira Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3406–3415. IEEE, 2017.
  • [Ng and Winkler(2014)] Hong-Wei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In Image Processing (ICIP), 2014 IEEE International Conference on, pages 343–347. IEEE, 2014.
  • [Parkhi et al.(2014)Parkhi, Simonyan, Vedaldi, and Zisserman] Omkar M Parkhi, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. A compact and discriminative face track descriptor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1693–1700, 2014.
  • [Parkhi et al.(2015)Parkhi, Vedaldi, Zisserman, et al.] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
  • [Paszke et al.(2017)Paszke, Gross, Chintala, and Chanan] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch. 2017.
  • [Pech-Pacheco et al.(2000)Pech-Pacheco, Cristóbal, Chamorro-Martinez, and Fernández-Valdivia] José Luis Pech-Pacheco, Gabriel Cristóbal, Jesús Chamorro-Martinez, and Joaquín Fernández-Valdivia. Diatom autofocusing in brightfield microscopy: a comparative study. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, volume 3, pages 314–317. IEEE, 2000.
  • [Rao et al.(2017a)Rao, Lin, Lu, and Zhou] Yongming Rao, Ji Lin, Jiwen Lu, and Jie Zhou. Learning discriminative aggregation network for video-based face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3781–3790, 2017a.
  • [Rao et al.(2017b)Rao, Lu, and Zhou] Yongming Rao, Jiwen Lu, and Jie Zhou. Attention-aware deep reinforcement learning for video face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3931–3940, 2017b.
  • [Samangouei et al.(2018)Samangouei, Najibi, Davis, and Chellappa] Pouya Samangouei, Mahyar Najibi, Larry Davis, and Rama Chellappa. Face-magnet: Magnifying feature maps to detect small faces. arXiv preprint arXiv:1803.05258, 2018.
  • [Schroff et al.(2015)Schroff, Kalenichenko, and Philbin] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  • [Sun et al.(2013)Sun, Wang, and Tang] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Hybrid deep learning for face verification. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1489–1496. IEEE, 2013.
  • [Sun et al.(2014)Sun, Chen, Wang, and Tang] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pages 1988–1996, 2014.
  • [Sun et al.(2015)Sun, Liang, Wang, and Tang] Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. Deepid3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
  • [Taigman et al.(2014)Taigman, Yang, Ranzato, and Wolf] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
  • [Tang et al.(2018)Tang, Du, He, and Liu] Xu Tang, Daniel K Du, Zeqiang He, and Jingtuo Liu. Pyramidbox: A context-assisted single shot face detector. arXiv preprint arXiv:1803.07737, 2018.
  • [Turk and Pentland(1991)] Matthew A Turk and Alex P Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, 1991. Proceedings CVPR’91., IEEE Computer Society Conference on, pages 586–591. IEEE, 1991.
  • [Viola and Jones(2001)] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.
  • [Viola and Jones(2004)] Paul Viola and Michael J Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004.
  • [Wang et al.(2017)Wang, Yuan, and Yu] Jianfeng Wang, Ye Yuan, and Gang Yu. Face attention network: An effective face detector for the occluded faces. arXiv preprint arXiv:1711.07246, 2017.
  • [Wen et al.(2016)Wen, Zhang, Li, and Qiao] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
  • [Wiskott et al.(1997)Wiskott, Krüger, Kuiger, and Von Der Malsburg] Laurenz Wiskott, Norbert Krüger, N Kuiger, and Christoph Von Der Malsburg. Face recognition by elastic bunch graph matching. IEEE Transactions on pattern analysis and machine intelligence, 19(7):775–779, 1997.
  • [Wolf et al.(2011)Wolf, Hassner, and Maoz] Lior Wolf, Tal Hassner, and Itay Maoz. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 529–534. IEEE, 2011.
  • [Wright et al.(2009)Wright, Yang, Ganesh, Sastry, and Ma] John Wright, Allen Y Yang, Arvind Ganesh, S Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE transactions on pattern analysis and machine intelligence, 31(2):210–227, 2009.
  • [Xie et al.(2010)Xie, Shan, Chen, and Chen] Shufu Xie, Shiguang Shan, Xilin Chen, and Jie Chen. Fusing local patterns of gabor magnitude and phase for face recognition. IEEE transactions on image processing, 19(5):1349–1361, 2010.
  • [Yan et al.(2014)Yan, Zhang, Lei, and Li] Junjie Yan, Xuzong Zhang, Zhen Lei, and Stan Z Li. Face detection by structural models. Image and Vision Computing, 32(10):790–799, 2014.
  • [Yang et al.(2014)Yang, Yan, Lei, and Li] Bin Yang, Junjie Yan, Zhen Lei, and Stan Z Li. Aggregate channel features for multi-view face detection. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–8. IEEE, 2014.
  • [Yang et al.(2015a)Yang, Yan, Lei, and Li] Bin Yang, Junjie Yan, Zhen Lei, and Stan Z Li. Fine-grained evaluation on face detection in the wild. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, volume 1, pages 1–7. IEEE, 2015a.
  • [Yang et al.(2017a)Yang, Ren, Chen, Wen, Li, and Hua] Jiaolong Yang, Peiran Ren, Dong Chen, Fang Wen, Hongdong Li, and Gang Hua. Neural aggregation network for video face recognition. arXiv preprint, 2017a.
  • [Yang et al.(2015b)Yang, Luo, Loy, and Tang] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. From facial parts responses to face detection: A deep learning approach. In Proceedings of the IEEE International Conference on Computer Vision, pages 3676–3684, 2015b.
  • [Yang et al.(2016)Yang, Luo, Loy, and Tang] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5525–5533, 2016.
  • [Yang et al.(2017b)Yang, Xiong, Loy, and Tang] Shuo Yang, Yuanjun Xiong, Chen Change Loy, and Xiaoou Tang. Face detection through scale-friendly deep convolutional networks. arXiv preprint arXiv:1706.02863, 2017b.
  • [Yi et al.(2014)Yi, Lei, Liao, and Li] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
  • [Zafeiriou et al.(2015)Zafeiriou, Zhang, and Zhang] Stefanos Zafeiriou, Cha Zhang, and Zhengyou Zhang. A survey on face detection in the wild: past, present and future. Computer Vision and Image Understanding, 138:1–24, 2015.
  • [Zhang et al.(2016)Zhang, Zhang, Li, and Qiao] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
  • [Zhang et al.(2017)Zhang, Zhu, Lei, Shi, Wang, and Li] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z Li. Single shot scale-invariant face detector. arXiv preprint arXiv:1708.05237, 2017.
  • [Zhu et al.(2018)Zhu, Tao, Luu, and Savvides] Chenchen Zhu, Ran Tao, Khoa Luu, and Marios Savvides. Seeing small faces from robust anchor’s perspective. arXiv preprint arXiv:1802.09058, 2018.
  • [Zhu and Ramanan(2012)] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.