Robust face detection is one of the fundamental components supporting various face-related problems, such as face alignment, face recognition, face verification, and face tracking. From the cornerstone work by Viola and Jones to the recent work by Hu et al., the performance of face detection has improved dramatically. The recent introduction of the WIDER FACE dataset, which contains a large number of small faces, exposes the performance gap between humans and current face detection techniques due to a number of challenges in practice. Different from classical face detection, tiny face detection mainly copes with low resolution, large scale variation, and severe occlusion, as shown in Fig. 1. All of these challenges suggest that the information available on small objects is very limited.
The existing methods for finding small objects in images can be grouped into three categories. The first group aims to extract scale-invariant features using pre-trained deep networks. However, their performance drops dramatically as the target faces become too small. The second group tries to generate additional information inside the objects by interpolation; for example, one work demonstrated that interpolating the lowest layer of the image pyramid is significantly beneficial for capturing small objects. The last group seeks to incorporate information surrounding the objects (i.e., context) to improve the performance of tiny face detection. It is clear that additional contextual information helps to accurately classify small faces. Is there another way to improve the performance of small object detection?
Note that existing classification-based tiny face detectors simply apply a threshold to a classification score to determine whether a candidate is a face or not, as shown in the first stage of Fig. 2. However, the optimal threshold is often difficult to obtain. In this paper, we propose a novel idea that exploits the semantic information (spatial location, scale, and texture) of a candidate's neighbors to classify the candidate as face or background. Specifically, based on such semantic information, we try to group all of the faces into one cluster while keeping backgrounds far away from it. For this purpose, we propose a Metric Learning and Graph-Cut (MLGC) framework, which carries out a further classification of the candidates produced by other object detectors. Fig. 2 illustrates this framework.
We first obtain a high-recall classifier that aims to retrieve all of the targets in an image but unavoidably introduces many false positives. Our goal is to keep faces with low classification scores while removing these false positives. To do this, we design a metric learning method that learns a similarity matrix evaluating the similarity of each pair of candidates. A graph model is built to represent this similarity matrix, and the graph-cut technique is used to divide the graph into several groups such that candidates in the same group are similar and those in different groups are dissimilar. Finally, the candidates in each group are jointly classified as faces or non-faces by voting.
The main contributions of this paper can be highlighted as follows. First, to boost detection performance, we propose a novel metric learning and graph-cut framework that exploits the semantic information among a target object's neighbors. Second, to depict local neighborhood relationships, we introduce a pairwise constraint into the tiny face detector to improve detection accuracy. Third, to realize this pairwise constraint, we convert the regression problem of estimating the similarity between candidates into a classification problem that produces a classification score for each pair of candidates.
2 Related Work
Face detection is a classic topic in computer vision. The pioneering work on the topic was published by Viola and Jones, who designed a cascade of weak classifiers using Haar features and AdaBoost for fast and robust face detection. Similar in spirit, numerous approaches have been developed to improve the performance with more sophisticated hand-crafted features and more powerful classifiers. However, these methods used non-robust hand-crafted features and optimized each component independently, which led to sub-optimal face detection results. Recently, face detectors based on CNNs have greatly narrowed the gap between human vision and artificial detectors.
Tiny face detection aims to detect a large number of small faces in crowded and cluttered scenes. It differs substantially from detecting normal faces, because the cues for detecting a 3-pixel tall face are fundamentally different from those for detecting a 300-pixel tall face. Bell et al. presented the Inside-Outside Net (ION) to model the context outside a region of interest and showed improvements on small object detection. Very recently, Hu and Ramanan designed a foveal descriptor that captures both coarse context and high-resolution image features to effectively encode contextual information, which has achieved state-of-the-art performance on the WIDER FACE dataset. Nevertheless, extracting deep features only from the texture inside an object region is not sufficient for detecting small objects: these approaches neglect local semantic information. We observe that, putting aside viewpoint variations, local coherent relationships in terms of spatial location, scale, and texture exist in high-density tiny face detection. For example, as shown in Fig. 1, face bounding boxes close to each other are similar in scale and texture. Such local semantic information helps tiny face detectors better eliminate false alarms. To introduce these local coherent relationships, we learn a metric that represents this coherence and use the graph-cut algorithm to divide candidates into several groups, such that candidates in the same group are similar and those in different groups are dissimilar.
3 The Proposed Method
Our goal is to integrate local coherent relationships into tiny face detection. To represent them, we define pairwise constraints: an equivalence constraint for pairs of data points belonging to the same class, and an inequivalence constraint for pairs of data points belonging to different classes.
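As a concrete illustration, the two kinds of pairwise constraints can be generated from class labels as follows (a minimal sketch; the function name and the string labels are hypothetical, not part of the paper):

```python
from itertools import combinations

def pairwise_constraints(labels):
    """Build equivalence (same-class) and inequivalence (different-class)
    constraints from per-candidate class labels."""
    equiv, inequiv = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            equiv.append((i, j))    # same class: equivalence constraint
        else:
            inequiv.append((i, j))  # different classes: inequivalence constraint
    return equiv, inequiv

# Example: candidates 0 and 1 are faces, candidate 2 is background.
equiv, inequiv = pairwise_constraints(["face", "face", "background"])
# equiv == [(0, 1)], inequiv == [(0, 2), (1, 2)]
```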
As shown in Fig. 2, we present a metric learning and graph-cut (MLGC) approach for high-density tiny face detection. We first use a linear SVM to estimate the similarity matrix among all candidates (Sect. 3.1), then construct a graph model and use the graph-cut algorithm to divide the candidates into several groups, and finally design a voting method to classify the groups (Sect. 3.2).
3.1 Metric learning based on linear-SVM
Let $C=\{c_1, c_2, \dots, c_N\}$ denote the set of candidates (i.e., face or non-face bounding boxes). To introduce the pairwise constraint, we first build a similarity matrix $W=(w_{ij})_{N \times N}$, where $w_{ij} \in [0,1]$ represents the similarity between $c_i$ and $c_j$: $w_{ij}=1$ means that $c_i$ has a strong resemblance to $c_j$, and $w_{ij}=0$ means that $c_i$ is completely different from $c_j$.
To obtain the similarity score between two candidates $c_i$ and $c_j$, we treat it as a classification problem and propose an unsupervised way to compute this score with an SVM. Note that the classification score and the deep feature of each candidate are obtained from the tiny face detector. During the training stage, we sort $C$ by the classification scores in descending order; let $C_f$ denote the top-ranked candidates of $C$, which are regarded as face patches, and let $C_b$ denote the bottom-ranked non-face patches in $C$. As shown in Stage 2 of Fig. 2, we build a training set of pairs $(x_{ij}, y_{ij})$, where $x_{ij}$ concatenates the features of $c_i$ and $c_j$: $y_{ij}=1$ if both $c_i$ and $c_j$ belong to $C_f$, and $y_{ij}=0$ if one belongs to $C_f$ and the other to $C_b$. During the testing stage, we feed each candidate pair to the trained SVM classifier and use its output score as the similarity $w_{ij}$ between $c_i$ and $c_j$. Thus, we build the similarity matrix $W$.
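The pairwise similarity step can be sketched as below. This is a minimal illustration with scikit-learn, not the paper's implementation: the feature matrix and detector scores are assumed given, the split size `k` is an illustrative choice, and squashing the SVM decision value through a sigmoid to obtain a score in $(0,1)$ is our assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

def similarity_matrix(features, scores, k=50):
    """Estimate a pairwise similarity matrix with a linear SVM.

    features: (N, d) deep features of the N candidates (from the detector)
    scores:   (N,)   detector classification scores
    k:        number of top/bottom candidates used as pseudo face/non-face sets
    """
    order = np.argsort(-scores)            # descending by detector score
    faces, backs = order[:k], order[-k:]   # confident faces / confident non-faces

    # Training pairs: concatenated features; label 1 for face-face pairs,
    # label 0 for face-background pairs.
    X, y = [], []
    for i in faces:
        for j in faces:
            if i != j:
                X.append(np.concatenate([features[i], features[j]])); y.append(1)
        for j in backs:
            X.append(np.concatenate([features[i], features[j]])); y.append(0)
    clf = LinearSVC().fit(np.array(X), np.array(y))

    # Score every candidate pair; squash the decision value into (0, 1)
    # so it can serve as an edge weight.
    n = len(features)
    W = np.zeros((n, n))
    for i in range(n):
        pairs = np.concatenate([np.tile(features[i], (n, 1)), features], axis=1)
        W[i] = 1.0 / (1.0 + np.exp(-clf.decision_function(pairs)))
    np.fill_diagonal(W, 1.0)
    return (W + W.T) / 2                   # symmetrize
```

Symmetrizing the matrix keeps the later graph construction well defined, since an undirected edge should carry a single weight.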
3.2 Graph-cut based on spectral clustering
Given the set of candidates $C$ and the similarity matrix $W$, our goal is to cluster $C$ into different groups, such that candidates in the same group are similar and those in different groups are dissimilar. In this work, we adopt the graph-cut algorithm for this purpose. First, we build a graph model $G=(V,E)$ to represent $W$, where each vertex $v_i$ represents a candidate $c_i$, and the weight $w_{ij}$ of the edge between $v_i$ and $v_j$ represents the similarity between the corresponding candidates. Then, clustering $C$ into $K$ groups can be reformulated on the graph as in Eq. 1. We seek a partition of the graph such that the weights of edges between different subgraphs are very low (points in different clusters are dissimilar) and the weights of edges within the same subgraph are very high (points within the same cluster are similar). Formally,

$$\min_{A_1, \dots, A_K} \mathrm{cut}(A_1, \dots, A_K) = \frac{1}{2} \sum_{k=1}^{K} W(A_k, \bar{A}_k), \quad (1)$$

where $\bar{A}_k = V \setminus A_k$ and $W(A, B) = \sum_{v_i \in A,\, v_j \in B} w_{ij}$, which is used to boost local neighborhood relationships.
However, the solution of Eq. 1 often simply separates one individual vertex from the rest of the graph. To avoid such an unbalanced graph cut, where the subgraphs differ greatly in size, we introduce the size $|A_k|$ of each subgraph, i.e., the number of vertices in $A_k$, to ensure that every subgraph is reasonably large. Thus, Eq. 1 is transformed as follows:

$$\min_{A_1, \dots, A_K} \frac{1}{2} \sum_{k=1}^{K} \frac{W(A_k, \bar{A}_k)}{|A_k|}. \quad (2)$$
According to , Eq. 2 can be relaxed into the trace minimization

$$\min_{H \in \mathbb{R}^{N \times K}} \mathrm{Tr}(H^{\top} L H) \quad \text{s.t. } H^{\top} H = I, \quad (3)$$

where $L = D - W$ is the graph Laplacian, $D$ is the diagonal degree matrix with $d_i = \sum_j w_{ij}$, and the indicator matrix $H = (h_1, \dots, h_K)$ is defined by $h_{i,k} = 1/\sqrt{|A_k|}$ if $v_i \in A_k$ and $h_{i,k} = 0$ otherwise.
Eq. 3 is the standard form of a trace minimization problem. According to the Rayleigh-Ritz theorem, the solution is given by choosing $H$ as the matrix containing the first $K$ eigenvectors of $L$ (those corresponding to the smallest eigenvalues) and then running the $k$-means algorithm on the rows of $H$. In this way, we cluster $C$ into groups $A_1, \dots, A_K$. Finally, the candidates in each group are jointly classified as face or non-face by voting.
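The relaxed graph-cut step and the final per-group vote can be sketched as follows. This is a minimal NumPy/scikit-learn sketch of unnormalized spectral clustering under our reconstruction of Eqs. 1-3; the majority-vote rule and the score threshold are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def graph_cut_groups(W, K):
    """Partition N candidates into K groups via unnormalized spectral clustering."""
    d = W.sum(axis=1)
    L = np.diag(d) - W               # graph Laplacian L = D - W
    _, vecs = np.linalg.eigh(L)      # eigenvalues returned in ascending order
    H = vecs[:, :K]                  # relaxed cluster-indicator matrix (Eq. 3)
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(H)

def vote_groups(groups, scores, thresh=0.5):
    """Label every candidate of a group as a face iff the majority of the
    group's detector scores exceed the threshold."""
    labels = np.zeros(len(groups), dtype=bool)
    for g in np.unique(groups):
        members = groups == g
        labels[members] = (scores[members] > thresh).mean() > 0.5
    return labels
```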
4 Experiments

In this section, we first demonstrate the effectiveness of our proposed semantic similarity metric and then evaluate the whole model on three widely used face detection benchmarks: WIDER FACE, Annotated Faces in the Wild (AFW), and PASCAL FACE.
To demonstrate the effectiveness of our proposed semantic metric (Sect. 3.1) for similarity measurement, we create positive samples, i.e., the ground-truth face regions, and negative samples, i.e., patches randomly sampled from the background, and evaluate the discriminative ability of the computed similarity matrix on the WIDER FACE validation set. We report the per-image average precision on three test sets: one composed of both positive and negative samples, one of positive samples only, and one of negative samples only.
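One way to measure this discriminative ability can be sketched as below (our assumption of the protocol, not the paper's exact one): each candidate is scored by its mean similarity to the face candidates, and average precision measures how well that score ranks faces above background patches.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def similarity_ap(W, is_face):
    """AP of ranking candidates by mean similarity to the (other) faces.

    W:       (N, N) similarity matrix
    is_face: length-N ground-truth labels (True for face candidates)
    """
    is_face = np.asarray(is_face, dtype=bool)
    scores = np.empty(len(is_face))
    for i in range(len(is_face)):
        mask = is_face.copy()
        mask[i] = False              # never compare a candidate with itself
        scores[i] = W[i, mask].mean()
    return average_precision_score(is_face, scores)
```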
4.1 The AFW and PASCAL FACE Dataset Results
The AFW dataset has 205 images containing 473 labelled faces in total. We evaluate our model against HR, DPM, HeadHunter, SquaresChnFtrs, Structured Models, Shen et al., TSM, and commercial detectors (e.g., Face.com, Face++ and Picasa). As illustrated in Fig. LABEL:exp:allb, our MLGC outperforms all the other detectors in terms of precision-recall (PR) curves.
The PASCAL FACE dataset contains 1,335 labelled faces in 851 images collected from the PASCAL person layout subset. Because this paper focuses on face detection, we ignore images without persons in the original dataset, similar to DPM. We again evaluate our model against HR, DPM, HeadHunter, SquaresChnFtrs, Structured Models, Shen et al., TSM, and commercial detectors (e.g., Face++ and Picasa). As shown in Fig. LABEL:exp:allc, our MLGC outperforms all the other detectors in terms of PR curves.
4.2 Results Obtained on the WIDER FACE Dataset
The WIDER FACE dataset is one of the most challenging public face datasets due to its wide variety of face scales and occlusions. It contains 32,203 images, split into training (40%), validation (10%) and testing (50%) sets. The validation and testing sets are divided into "easy", "medium", and "hard" subsets according to the difficulty of detection.
We compare our MLGC with HR, MSCNN, ScaleFace, CMS-RCNN, and Multitask Cascade CNN. The PR curves on the testing set are presented in Fig. LABEL:exp:alla; our method outperforms HR by 0.2% on the "easy" subset. The PR curves on the validation set are presented in Fig. LABEL:fig:overall; our method outperforms HR by 0.5%, 0.2%, and 0.3% on the "easy", "medium", and "hard" subsets, respectively.
5 Conclusion

In this paper, aiming to improve the performance of tiny face detection, we have proposed a novel idea that exploits the semantic similarity among a target object's neighbors and created a pairwise constraint to depict this similarity. We have then formulated a framework that adopts metric learning and graph-cut techniques to boost the accuracy of existing tiny object classifiers. Experiments conducted on three widely used face detection benchmarks have demonstrated an improvement over the state of the art. The mechanism of the proposed framework is generic, indicating its great potential for other small and generic object detectors; this forms our work for the next step.
References

-  Xuehan Xiong and Fernando De la Torre, “Supervised descent method and its applications to face alignment,” in CVPR, 2013.
-  Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z Li, “Face alignment across large poses: A 3d solution,” in CVPR, 2016, pp. 146–155.
-  Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al., “Deep face recognition.,” in BMVC, 2015.
-  Florian Schroff, Dmitry Kalenichenko, and James Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823.
-  Xiangyu Zhu, Zhen Lei, Junjie Yan, Dong Yi, and Stan Z Li, “High-fidelity pose and expression normalization for face recognition in the wild,” in CVPR, 2015.
-  Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang, “Deep learning face representation by joint identification-verification,” in NIPS, 2014.
-  Minyoung Kim, Sanjiv Kumar, Vladimir Pavlovic, and Henry Rowley, “Face tracking and recognition with visual constraints in real-world videos,” in CVPR. IEEE, 2008, pp. 1–8.
-  Paul Viola and Michael Jones, “Robust real-time face detection,” IJCV, vol. 57, no. 2, pp. 137–154, 2004.
-  Peiyun Hu and Deva Ramanan, “Finding tiny faces,” in CVPR. IEEE, 2017, pp. 1522–1530.
-  Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang, “Wider face: A face detection benchmark,” in CVPR, 2016, pp. 5525–5533.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg, “Ssd: Single shot multibox detector,” in ECCV, 2016.
-  Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton, “A simple way to initialize recurrent networks of rectified linear units,” CoRR, vol. abs/1504.00941, 2015.
-  Bin Yang, Junjie Yan, Zhen Lei, and Stan Z Li, “Aggregate channel features for multi-view face detection,” in IJCB. IEEE, 2014.
-  Minh-Tri Pham and Tat-Jen Cham, “Fast training and selection of haar features using statistics in boosting-based face detection,” in ICCV, 2007, pp. 1–7.
-  Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua, “A convolutional neural network cascade for face detection,” in CVPR, 2015.
-  Bin Yang, Junjie Yan, Zhen Lei, and Stan Z Li, “Convolutional channel features,” in CVPR, 2015, pp. 82–90.
-  Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang, “From facial parts responses to face detection: A deep learning approach,” in ICCV, 2015.
-  Sean Bell, C Lawrence Zitnick, Kavita Bala, and Ross Girshick, “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in CVPR, 2016, pp. 2874–2883.
-  Ulrike Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, 2007.
-  Helmut Lutkepohl, “Handbook of matrices.,” Computational Statistics and Data Analysis, 1997.
-  Xiangxin Zhu and Deva Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in CVPR. IEEE, 2012, pp. 2879–2886.
-  Junjie Yan, Xuzong Zhang, Zhen Lei, and Stan Z Li, “Face detection by structural models,” Image and Vision Computing, vol. 32, no. 10, pp. 790–799, 2014.
-  Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan, “Object detection with discriminatively trained part-based models,” TPAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
-  Markus Mathias, Rodrigo Benenson, Marco Pedersoli, and Luc Van Gool, “Face detection without bells and whistles,” in ECCV. Springer, 2014, pp. 720–735.
-  Xiaohui Shen, Zhe Lin, Jonathan Brandt, and Ying Wu, “Detecting and aligning faces by image retrieval,” in CVPR. IEEE, 2013, pp. 3460–3467.
-  Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno Vasconcelos, “A unified multi-scale deep convolutional neural network for fast object detection,” in ECCV. Springer, 2016, pp. 354–370.
-  Shuo Yang, Yuanjun Xiong, Chen Change Loy, and Xiaoou Tang, “Face detection through scale-friendly deep convolutional networks,” arXiv, 2017.
-  Chenchen Zhu, Yutong Zheng, Khoa Luu, and Marios Savvides, “Cms-rcnn: contextual multi-scale region-based cnn for unconstrained face detection,” in Deep Learning for Biometrics, pp. 57–79. Springer, 2017.
-  Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, 2016.