I Introduction
Deep learning has made great progress in recent years; some of its most compelling achievements are in computer vision. Multi-face alignment, also known as multiple facial landmark localization or detection, aims to identify the locations of the key points of multiple faces in images or videos.
The multi-face alignment task can be grouped into bottom-up and top-down approaches. For a long time, academia and industry have employed the top-down method: detect faces first, then send each face to a single-face alignment network. Conventional single-face alignment methods [1, 2, 3, 4, 5] can be divided into those that generate landmarks directly and those that generate them indirectly [6]. The quadratic time complexity of NMS and Soft-NMS [7] is the most critical deficiency of the top-down method. The face boxes produced by detectors [8, 9, 10, 11, 12] and kept by NMS or Soft-NMS [7] are then sent to the single-face alignment network, so the cost of this stage grows linearly with the number of faces. What is worse, single-face alignment networks traditionally have very deep convolutional structures; repeatedly running such networks over image crops greatly slows down the entire pipeline. Once the number of faces increases, speed is greatly sacrificed.
Therefore, it is important to develop a bottom-up structure for the multi-face alignment task. Some bottom-up human pose estimation algorithms [15, 16, 14] use Part Affinity Fields and a greedy parse to resolve individuals. Inspired by them, our multi-face bottom-up method can be divided into two steps [14]: first finding all possible face landmarks, and then parsing the discrete key points into individuals. Since this method operates on the entire image, it can take global texture information into account. Therefore, compared with detection-plus-NMS pipelines, it is more robust to occlusion. Last but not least, this method is independent of the number of faces; in the multi-face alignment task, bottom-up approaches outperform top-down methods in speed by a large margin. In this paper, we present a bottom-up algorithm that iteratively parses out single faces using global semantic segmentation information, in contrast to the top-down method, in which detected faces are cyclically sent to a single-face landmark network. While our face task does not have clear connections like limbs [14], pixel embedding [17] learns implicit features that capture the corresponding spatial relationships.
In summary, our main contributions are threefold.

We explored a bottom-up multi-face alignment structure whose runtime is not correlated with the number of faces in an image.

We proposed the Fox Block, which can blend the global features and texture information of the face.

We proposed a new loss function, the Cosine Discriminative Loss, which introduces the cosine function into the Discriminative Loss and can classify facial features in high-dimensional space with better performance.
The paper is organized as follows: Section 2 describes the proposed method, Section 3 shows experimental results, and Section 4 concludes this work.
II Proposed Method
Figure 1A illustrates our bottom-up method. The method takes an RGB image and generates the landmarks and their corresponding faces. FoxNet simultaneously predicts the landmark candidates at the segmentation branch and their high-dimensional features, which encode spatial information, at the feature branch. As shown in Figure 3, the features, combined with the non-maximum suppression result of the segmentation branch, are clustered to produce the landmarks of multiple faces.
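The candidate-picking step of this pipeline can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function name `heatmap_nms`, the threshold, and the 3x3 window are hypothetical choices for selecting landmark candidates as local maxima of the segmentation response.

```python
import numpy as np

def heatmap_nms(heatmap, threshold=0.5, window=1):
    """Return (row, col) peaks: pixels that exceed `threshold` and are
    local maxima within a (2*window+1)^2 neighbourhood."""
    h, w = heatmap.shape
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r, c]
            if v < threshold:
                continue
            r0, r1 = max(0, r - window), min(h, r + window + 1)
            c0, c1 = max(0, c - window), min(w, c + window + 1)
            if v >= heatmap[r0:r1, c0:c1].max():
                peaks.append((r, c))
    return peaks

# Toy heatmap with two well-separated responses.
hm = np.zeros((8, 8))
hm[2, 2] = 0.9
hm[6, 5] = 0.8
print(heatmap_nms(hm))  # [(2, 2), (6, 5)]
```

The surviving peaks play the role of the landmark candidates that are later grouped into faces by clustering their embeddings.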
II-A Architecture
In our proposed network, FoxNet, as illustrated in Figure 1B, the first stage produces a set of abstract features via the head of FoxNet (e.g., ResNet [18]). In each subsequent stage, the block inherits multi-scale information from the previous stage to produce more robust features. At the end, two pointwise convolutions with different numbers of channels generate the segmentation result and the feature result. In order to fully utilize the facial multi-scale information, we adopt Hourglass [19] as our backbone. On top of it, we designed a Fox Block that can blend multi-resolution identified features at the same scale.
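The two output heads can be illustrated with a small sketch: a pointwise (1x1) convolution is a per-pixel linear map over channels, so both heads share the same backbone features and differ only in output width. The channel counts below (1 segmentation channel, 8 embedding channels) are hypothetical, not taken from the paper.

```python
import numpy as np

def pointwise_conv(features, weight, bias):
    """1x1 convolution: a per-pixel linear map over channels.
    features: (C_in, H, W); weight: (C_out, C_in); bias: (C_out,)."""
    c_in, h, w = features.shape
    flat = features.reshape(c_in, h * w)   # (C_in, H*W)
    out = weight @ flat + bias[:, None]    # (C_out, H*W)
    return out.reshape(weight.shape[0], h, w)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 4, 4))     # shared backbone features

# Hypothetical head widths: 1 segmentation channel, 8 embedding channels.
seg = pointwise_conv(feat, rng.standard_normal((1, 16)), np.zeros(1))
emb = pointwise_conv(feat, rng.standard_normal((8, 16)), np.zeros(8))
print(seg.shape, emb.shape)  # (1, 4, 4) (8, 4, 4)
```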
Our Fox Block, as shown in Figure 2, applies average pooling with four different kernel sizes, each with a stride chosen to preserve the original resolution. During inference, the feature branch classifies the landmark candidates that come from the segmentation branch. During training, however, as shown in Equation 1, we make all facial pixels participate in the calculation so that more discriminative features are learned:
\[ L_{train} = \sum_{p \in F} L_{cos}\big(E_p(I), y_p\big) \tag{1} \]
where I is the input image, p is a pixel belonging to the face region F, E_p(I) is the embedding predicted at pixel p, y_p is the corresponding classification label, and L_{cos} is our cosine loss.
To localize the landmarks, global information about the image is required. We therefore use the Fox Block to give our proposed model a larger receptive field.
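A minimal sketch of the Fox Block idea follows, assuming stride-1 average pooling with zero padding and channel-wise stacking as the blending step; the paper does not fix these details, and the kernel sizes 3, 5, 7, 9 are hypothetical.

```python
import numpy as np

def avg_pool_same(x, k):
    """Stride-1 average pooling with zero padding, preserving H x W."""
    h, w = x.shape
    p = k // 2
    padded = np.pad(x, p)
    out = np.empty_like(x, dtype=float)
    for r in range(h):
        for c in range(w):
            out[r, c] = padded[r:r + k, c:c + k].mean()
    return out

def fox_block(x, kernels=(3, 5, 7, 9)):
    """Blend several receptive fields at the input resolution by
    stacking the input with its multi-scale pooled versions."""
    branches = [x] + [avg_pool_same(x, k) for k in kernels]
    return np.stack(branches)  # (1 + len(kernels), H, W)

x = np.ones((12, 12))
y = fox_block(x)
print(y.shape)  # (5, 12, 12)
```

Each pooled branch summarizes a progressively wider neighbourhood while keeping the spatial size, which is how a larger effective receptive field can be mixed with the original texture features.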
II-B Cosine Discriminative Loss
Pixel embedding [17] is a differentiable transform that maps each image pixel to a high-dimensional vector for better classification. The objective of our loss function is to increase the inter-class distance and minimize the intra-class distance. The discriminative loss [17] has had great success in the semantic segmentation field; it enforces the network to map each pixel in the image to an n-dimensional vector in feature space. We observed, however, that introducing a cosine loss, which takes normalized features as input and learns highly discriminative features by maximizing the inter-class cosine margin, could better exploit the cosine-related discriminative information. [17] uses a variance term to force embeddings close to their cluster center, a distance term to push the cluster centers away from each other, and a regularization term to pull all embeddings towards the origin. We inherit these three terms, but replace the Euclidean distance with the cosine distance and change the regularization pull into a push. In our task, we only need the orientation of an embedding to obtain discriminative features. As shown in Figure 1A, our segmentation branch predicts the landmark candidates, which carry more precise semantic information than the length of the embedding, which represents the response of a landmark on the feature branch. If we used the regularization term of the discriminative loss, forcing embeddings of different lengths towards the origin, the surface area of the characteristic hypersphere would be too small for classification. Inspired by [20], we instead place embeddings on a hypersphere with a large surface area, which allows a better distribution to be learned, and normalize before clustering. The details of our proposed Fox Loss are given in Equation 5, which maximizes inter-class variance and minimizes intra-class variance by integrating Equations 2 to 4. The variance term (Equation 2) is an intra-cluster pull force that draws embeddings towards the mean embedding. The distance term (Equation 3) is an inter-cluster push force that pushes clusters away from each other, increasing the distance between the cluster centers. The regularization term (Equation 4) is a small force that keeps the activations bounded. In the equations, C is the number of clusters in the ground truth, N_c is the number of elements in cluster c, x_i is an embedding, \mu_c is the mean embedding of cluster c (the cluster center), D(\mu_c, x_i) is the cosine distance between \mu_c and x_i, [x]_+ = max(0, x) denotes the hinge, and \delta_v and \delta_d are respectively the margins for the variance and distance loss.
Our cosine discriminative loss is defined as follows:
\[ L_{var} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c}\big[D(\mu_c, x_i) - \delta_v\big]_+^2 \tag{2} \]
\[ L_{dist} = \frac{1}{C(C-1)}\sum_{c_A=1}^{C}\sum_{\substack{c_B=1 \\ c_B \neq c_A}}^{C}\big[2\delta_d - D(\mu_{c_A}, \mu_{c_B})\big]_+^2 \tag{3} \]
\[ L_{reg} = \frac{1}{C}\sum_{c=1}^{C}\lVert\mu_c\rVert \tag{4} \]
\[ L = \alpha \cdot L_{var} + \beta \cdot L_{dist} + \gamma \cdot L_{reg} \tag{5} \]
where \alpha, \beta, and \gamma weight the three terms.
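A sketch of this loss is given below, using the cosine distance 1 - cos(a, b), which depends only on the orientation of the embeddings. The regularization term is omitted for brevity, and the margin values `delta_v` and `delta_d` are hypothetical.

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance 1 - cos(a, b); depends only on orientation."""
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_discriminative_loss(embeddings, labels,
                               delta_v=0.1, delta_d=0.5,
                               alpha=1.0, beta=1.0):
    """Variance (pull) and distance (push) terms of the discriminative
    loss [17], with Euclidean distance replaced by cosine distance."""
    clusters = np.unique(labels)
    centers = {c: embeddings[labels == c].mean(axis=0) for c in clusters}

    # Variance term: pull embeddings towards their cluster centre.
    l_var = 0.0
    for c in clusters:
        pts = embeddings[labels == c]
        hinged = [max(0.0, cosine_dist(centers[c], p) - delta_v) ** 2
                  for p in pts]
        l_var += np.mean(hinged)
    l_var /= len(clusters)

    # Distance term: push cluster centres apart.
    l_dist = 0.0
    if len(clusters) > 1:
        for ca in clusters:
            for cb in clusters:
                if ca != cb:
                    l_dist += max(0.0, 2 * delta_d
                                  - cosine_dist(centers[ca], centers[cb])) ** 2
        l_dist /= len(clusters) * (len(clusters) - 1)

    return alpha * l_var + beta * l_dist

# Two tight, near-orthogonal clusters: both terms vanish.
emb = np.array([[1.0, 0.01], [1.0, -0.01], [0.01, 1.0], [-0.01, 1.0]])
lab = np.array([0, 0, 1, 1])
print(cosine_discriminative_loss(emb, lab))  # 0.0
```

Because only the angle matters, two clusters pointing in the same direction incur a large distance penalty even if their norms differ, which matches the motivation for discarding the embedding length.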
II-C Semi-supervised Face Separation with Mean Shift
Different from the traditional structure that iteratively passes facial information into prediction networks, all facial information is present on our segmentation branch. The corresponding facial landmarks share particular features, and landmarks that belong to the same face can be seen as a cluster in Euclidean space: the landmarks of one face are closer to each other than to those of other faces. Mean shift [21] is a procedure for locating the modes of a density function given discrete data sampled from that function; it iteratively shifts a kernel to a higher-density region until convergence, always moving in the direction of the maximum increase in density. Its complexity tends towards O(T·n·log n) in lower dimensions, with n the number of samples and T the number of points. Mean shift is therefore suitable for clustering facial landmarks. It is a semi-supervised clustering algorithm whose input does not require the number of clusters. We perform the mean shift algorithm to separate the corresponding face information.
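Grouping landmark candidates by mean shift can be sketched as follows. This is a simple flat-kernel variant in 2-D for illustration only; the actual embeddings are high-dimensional, and the bandwidth below is a hypothetical choice.

```python
import numpy as np

def mean_shift(points, bandwidth=0.5, iters=50):
    """Flat-kernel mean shift: move each point to the mean of its
    neighbours within `bandwidth` until convergence, then merge modes."""
    shifted = points.astype(float).copy()
    for _ in range(iters):
        for i in range(len(shifted)):
            mask = np.linalg.norm(shifted - shifted[i], axis=1) < bandwidth
            shifted[i] = shifted[mask].mean(axis=0)
    # Assign labels by merging modes closer than half the bandwidth.
    labels = -np.ones(len(points), dtype=int)
    modes = []
    for i, p in enumerate(shifted):
        for j, m in enumerate(modes):
            if np.linalg.norm(p - m) < bandwidth / 2:
                labels[i] = j
                break
        else:
            modes.append(p)
            labels[i] = len(modes) - 1
    return labels

# Two faces: landmark embeddings form two well-separated groups.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
print(mean_shift(pts))  # [0 0 0 1 1 1]
```

Note that the number of faces is never given as input; it emerges as the number of modes, which is what makes the clustering step semi-supervised.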
As presented in Figure 3, during inference we perform a non-maximum suppression (NMS) operation on the segmentation branch and use the resulting landmark candidates as input to the mean shift, which separates the different faces.
III Experiments
We evaluate our method on two datasets, the single-face dataset WFLW and our multi-face AISA dataset, for both precision and speed.
TABLE I: NME comparison on the WFLW test set and its subsets (lower is better).
Method  Full Set  Pose  Expression  Illumination  Makeup  Occlusion  Blur
ESR [22]  11.13  25.88  11.47  10.49  11.05  13.75  12.20
CFSS [23]  9.07  21.36  10.09  8.30  8.74  11.76  9.96
LAB [6]  5.27  10.24  5.51  5.23  5.15  6.79  6.32
OURS  5.80  10.50  8.94  5.71  6.30  6.53  6.30
WFLW dataset: WFLW contains 10,000 faces (7,500 for training and 2,500 for testing) with manually annotated landmarks.
AISA dataset: To facilitate bottom-up multi-face alignment algorithms, we introduce a new dataset based on 300W [24], split into training and testing sets. Its difficulty is reflected in face scale, occlusion, and the number of faces.
Evaluation metric. We use the standard normalized mean error (NME) to evaluate face landmarks; moreover, we use the F1 score to evaluate face detection.
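NME can be computed as sketched below. The choice of normalizing distance (e.g. the inter-ocular distance) is an assumption here, since the paper does not state which normalization it uses.

```python
import numpy as np

def nme(pred, gt, norm_dist):
    """Normalized mean error: mean landmark-to-ground-truth distance,
    divided by a normalizing distance (e.g. inter-ocular distance)."""
    errors = np.linalg.norm(pred - gt, axis=1)
    return errors.mean() / norm_dist

gt = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
pred = gt + np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(nme(pred, gt, norm_dist=10.0))  # ≈ 0.0667
```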
III-A Evaluating Single-Face Alignment
We compare our method against the state-of-the-art methods ESR [22], CFSS [23], and LAB [6] on WFLW. The results, which come from the segmentation branch after NMS, are shown in Table I.
Our method achieves 5.80 NME on the full test set, higher than LAB (5.27), while performing better on the Occlusion (6.53 vs. 6.79) and Blur (6.30 vs. 6.32) subsets. This margin shows that our method has a larger receptive field and obtains more global features.
III-B Comparing Bottom-up and Top-down Methods
The top-down multi-face alignment method consists of detection followed by single-face alignment, so we compare our approach against these two steps respectively, as shown in Table II.
TABLE II: Comparison with top-down pipelines (F1 score for detection, NME for alignment).
Detection Method  Single-Face Alignment Method  F1 Score  NME
MTCNN [25]  LAB  0.56  6.10
SSH [26]  LAB  0.89  6.10
OURS  OURS  0.80  6.80
III-C Runtime Analysis
To analyze the runtime performance of our method, we uniformly resize images during test time to fit GPU memory. The runtime analysis is performed on a single NVIDIA GeForce GTX 1080 Ti GPU. We combine the face detector SSH [26] with two single-face alignment methods, DAN [27] and LAB, as top-down baselines, whose runtime is roughly proportional to the number of people in an image. The results are illustrated in Fig. 4. Compared with the SSH+LAB and MTCNN+LAB baselines, the slope of our method's runtime with respect to the number of faces is minimal. Our proposed method is thus not only the fastest on this task, but its runtime also increases relatively slowly with the increasing number of people. The runtime consists of two major parts:

In our structure, the CNN is run only once, so its cost is constant with respect to the number of people;

Multi-face parsing time, whose runtime complexity grows with the number of faces. However, the parsing time does not significantly influence the overall runtime, because even for many faces it remains one order of magnitude less than the CNN processing time.
III-D Training Details
IV Conclusion
We have developed an extremely fast structure for the multi-face alignment task; it is the first bottom-up structure for this task.
In our approach, we first proposed the FoxNet structure to address receptive-field deficiencies, and we use the Fox Block to provide the additional contextual information that facial landmark detection may need. We achieve a high-speed bottom-up solution while maintaining most of the accuracy. The approach is independent of the number of people to be detected and could be applied to large-scale real-time facial alignment tasks.
V Acknowledgments
This research was supported in part by the National Key Technology R&D Program of China YFD and the Hunan Province innovative experiment and research study program for college students SCX.
References

[1] Sachin Sudhakar Farfade, Mohammad J. Saberian, and Li-Jia Li, “Multi-view face detection using deep convolutional neural networks,” in Proceedings of the 5th ACM International Conference on Multimedia Retrieval. ACM, 2015, pp. 643–650.
[2] Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa, “A deep pyramid deformable part model for face detection,” arXiv preprint arXiv:1508.04389, 2015.
[3] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang, “From facial parts responses to face detection: A deep learning approach,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3676–3684.
[4] Ping Luo, Xiaogang Wang, and Xiaoou Tang, “Hierarchical face parsing via deep learning,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2480–2487.
[5] Yi Sun, Xiaogang Wang, and Xiaoou Tang, “Deep convolutional network cascade for facial point detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3476–3483.
[6] Wayne Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou, “Look at boundary: A boundary-aware face alignment algorithm,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2129–2138.
[7] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis, “Soft-NMS: Improving object detection with one line of code,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 5562–5570.
[8] Peiyun Hu and Deva Ramanan, “Finding tiny faces,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 1522–1530.
[9] Xu Tang, Daniel K. Du, Zeqiang He, and Jingtuo Liu, “PyramidBox: A context-assisted single shot face detector,” arXiv preprint arXiv:1803.07737, 2018.
[10] Huaizu Jiang and Erik Learned-Miller, “Face detection with the Faster R-CNN,” in 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE, 2017, pp. 650–657.
[11] Yitong Wang, Xing Ji, Zheng Zhou, Hao Wang, and Zhifeng Li, “Detecting faces using region-based fully convolutional networks,” arXiv preprint arXiv:1709.05256, 2017.
[12] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z. Li, “S³FD: Single shot scale-invariant face detector,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 192–201.
[13] Ross Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[14] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh, “Realtime multi-person 2D pose estimation using part affinity fields,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 1302–1310.
[15] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh, “Convolutional pose machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
[16] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh, “OpenPose: Realtime multi-person 2D pose estimation using part affinity fields,” arXiv preprint arXiv:1812.08008, 2018.
[17] Bert De Brabandere, Davy Neven, and Luc Van Gool, “Semantic instance segmentation with a discriminative loss function,” arXiv preprint arXiv:1708.02551, 2017.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[19] Alejandro Newell, Kaiyu Yang, and Jia Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 483–499.
[20] Yutong Zheng, Dipan K. Pal, and Marios Savvides, “Ring loss: Convex feature normalization for face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5089–5097.
[21] Dorin Comaniciu and Peter Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002.
[22] Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun, “Face alignment by explicit shape regression,” International Journal of Computer Vision, vol. 107, no. 2, pp. 177–190, 2014.
[23] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang, “Face alignment by coarse-to-fine shape searching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4998–5006.
[24] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic, “300 faces in-the-wild challenge: Database and results,” Image and Vision Computing, vol. 47, pp. 3–18, 2016.
[25] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[26] Mahyar Najibi, Pouya Samangouei, Rama Chellappa, and Larry S. Davis, “SSH: Single stage headless face detector,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4875–4884.
[27] Marek Kowalski, Jacek Naruniec, and Tomasz Trzcinski, “Deep alignment network: A convolutional neural network for robust face alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
[28] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” 2017.