A lot of researches has been dedicated and reserved to identity document analysis and recognition on mobile devices, willing to facilitate the data input process . A practical and common approach to this problem involves detecting and comparing individual’s live face to the face image found in his/her identity document. However, the face detection and recognition on the identity documents poses numerous challenges that are different from general face detection and recognition on the normal visual images. The main challenges of face detection in images on identity documents are mainly due to the challenging conditions in which camera captured images on identity documents have been taken, such as occlusion, defocus, complex background, the variation of the orientation of identity documents and the varying illumination conditions caused by the reflect. Fig.1 shows the challenging conditions of camera captured images on identity documents from the MIDV-500 dataset . Besides, collecting information from identity documents (such as faces, signatures, texts..) is still a challenge to solve due to the strict privacy laws and regulations. Since identity documents contain sensitive personal data, people are not willing to risk the leak of the personal information that may lead to some complications. Thus, it becomes difficult to evaluate and compare various identity document analysis methods to each other, since they have been tested on private industry-oriented and locked down data [3, 4], which are collected from customers and are not available for the public in the light of security and market advantage. Therefore, the lack of public datasets for images in identity documents also hinder the research on this sensitive field. Thanks to the availability of large face datasets and the progress in deep neural network architectures [5, 6, 7]
, face detection and recognition in general face images has made tremendous strides in the past decade. Recently, there have been a necessity to build a model to accurately differentiate faces from the backgrounds in challenging conditions, while keeping real-time performance. A deep cascaded multi-task architecture built on deep convolutional neural networks (CNNs) employs a coarse-to-fine strategy for face detection, by adopting different stages of CNNs in order to predict the face progressively[8, 9, 10]. Benefiting from the simplicity of the global networks and the progressive increase of depth of the CNNs, the deep cascaded based CNNs for face detection can achieve the state-of-the-art performance in terms of accuracy and speed. Specifically, Cascade-CNN  firstly proposed the deep Cascade-CNNs architecture for face detection task. The MTCNN , based on the deep cascaded multi-task framework, achieved also the state-of-the-art performance and attained very high speed for face detection and alignment. Furthermore, PCN  which is also based on the deep cascaded CNNs, has tackled the detection of rotating faces in a coarse-to-fine manner. It can accurately detect faces with arbitrary rotation-in-plane angles, which are common to the camera captured images on identity documents. In this work, these three state-of-the-art methods Cascade-CNN, MTCNN and PCN, that are the most widely used for face detection tasks with different advantages, are adopted to detect faces on identity document images. The recent published MIDV-500 dataset, which is a Mobile Identity Document Video dataset consisting of 500 video clips for 50 different identity with 17 types of ID cards, is used to evaluate the three different face detection frameworks.
In summary, our main contributions of this paper are:
Evaluating the three state-of-the-art methods Cascade-CNN, MTCNN and PCN for face detection in camera captured images on identity documents under challenging environments.
The bounding box coordinates of face regions in arbitrary frames in the MIDV-500 dataset’s videos have been manually annotated for the evaluation.
We have demonstrated that, under the challenging conditions, with various orientation of the document, complex background and the varying illumination, the MTCNN model shows its superiority to the other methods evaluated on the MIDV-500 dataset.
The rest of paper is organized as follows: in Section II, we briefly review the existing studies in the area of identity document analysis and recognition; Section III describes the frameworks of the evaluated methods; in Section IV, we present the Mobile Identity Document Video dataset (MIDV-500)  and the experimental evaluation results; the final Section V draw a conclusion and presents the future work.
Ii Related Work
Lately, numerous researches have been devoted to document analysis and recognition, analyzing and identifying different identity fields: document numbers, document holder name components, machine readable zone, signatures and face detection using state-of-the-art visual recognition approaches [11, 12, 13]. As for the related work provided on , a text field OCR study was proposed to perform text field extraction, recognition and identity document data extraction from video clips.
The face detection problem has been an important task in computer vision systems that aims to extract information from face images, with respect to the head position and occlusion, while detecting faces quickly and accurately on input images. Previous face detection systems were mostly based on hand-crafted features extracted by the feature engineering for the classification methods. A number of variant methods and local descriptors have been proposed for face detection and recognition task, such as, rapid object detection using a boosted cascade of simple features, LBP , HOG , Gabor-LBP , SIFT . Recent researches in this area focus more on the uncontrolled face detection problem, where various poses, complex lighting, occlusion, exaggerated expressions, full rotation-in-plane angle face images can lead to large visual variations, and to significant divergence in face appearances, while detecting faces in full rotation-in-plane angles can greatly advance the performance of both face alignment and face detection-recognition.
, assuming that all face images are frontal faces without large expression variations as most methods do. The algorithm is similar to a general constrained face matcher, except that it is developed for a document photo dataset. Since the popularity of deep neural networks could partially be attributed to a special property that the low-level image features are transferable, i.e. they are not limited to a particular task, but applicable to many image analysis tasks. Given this property, one can first train a domain specific neural network by transfer learning on a relatively small dataset, as proposed by Y. Shi and Anil K. Jain. A recent face detection study was performed in 
using open source libraries[22, 23] with default frontal face detectors. It was conducted using original frames, and using projectively restored document images based on ground truth document coordinates. The ground truth coordinates were projectively transformed from the template coordinates to frame coordinates (according to the ground truth document boundaries). The same process was done for face detection results in the cropped document detection mode.
Iii Face Detection Frameworks
The frameworks have been chosen for their performance in terms of high accuracy and speed, their low time cost in a fast run-time on the Multi-oriented FDDB  and WiderFace  datasets, and also for their specific architecture based on deep cascaded multi-task convolutional neural networks (CNNs). High accuracy is achieved with a deep neural network; having three networks - each with multiple layers - allow for higher precision, as each network can fine-tune the results of the previous one. This technique is so called Hard Sample Mining. As for the three methods, training only the first stage is not perfect, because it would recognize some images with no faces in it as positive samples (window candidates containing at least one face). These images are known as false positives, so we include them into The second stage, since its role is to refine bounding box edges and reduce false positives. This can help the second stage targets the first network’s weaknesses and improve accuracy. Similarly, it is applied to the third stage as well.
Since different tasks are employed in each CNNs, different types of training images are used in the learning process, such as faces, non-faces and partially aligned faces, and instead of defining only one loss function for all tasks, one loss function each is defined, they are switched back and forth with different ratios to balance different loss functions, according to the task importance.
As for the training data used to train the models, three kinds of images are employed: positive samples, negative samples and suspected samples. Positive samples are those windows with IoU ratio over 0.7/0.65 to any ground truth faces; negative regions are those windows with IoU ratio smaller than 0.3 to a ground truth face, and suspected/part samples are those windows with IoU between 0.4 and 0.7/0.65. Positive and Negative samples are used for the classification task of faces/non-faces. Positive and suspected/part samples contribute to the training of bounding box regression and calibration.
Iii-a Cascade-CNN face detector
The Cascade-CNN face detector 
, compared to other face detection systems, that learns the classifier by relying on hand-craft features, evaluates the input given image to reject non-face regions and distinguishes faces from high covered regions at higher resolution. After that, a calibration network is applied to process the remaining detection windows to adjust its size and calibrate its location to approach a potential face. The proposed cascade face detector is a three-stage deep convolutional network. After each stage, Non-Maximum Suppression is introduced to eliminate highly overlapped detection windows to make a more accurate detection window at the correct scale, with an Intersection-over-Union ratio exceeding a set ratio. The remaining calibrated bounding boxes are considered as the outputs of the model.
The experiments on the the challenging Face Detection Dataset and Benchmark FDDB show that the Cascade-CNN detector outperform the state-of-the-art methods in the continuous score evaluation, and is comparable to the state-of-the-art methods on the Annotated Faces in the Wild AFW .
Iii-B Joint Face Detection and Alignment
The proposed MTCNN framework  uses multi-task cascaded convolutional networks with three stages, to predict face and landmark location in a coarse-to-fine manner. The cascaded convolutional network aims to reduce the extra computational expense of the cascade face detector proposed by Viola and Jones , it utilizes Haar features with boosted cascade framework to evaluate frontal faces, but the framework is relatively weak towards uncontrolled applications where faces are in varied poses and complex unexpected lighting. The need of convolutional networks was necessary to achieve remarkable performance in a variety of computer vision tasks. The contribution of this method is to combine face alignment and face detection for real time performance.
The overall framework of this approach consists of obtaining different scaled images by passing in the sliding window and image pyramid, each candidate window goes through the detector stage by stage. In each stage, the detector rejects faces with low confidence level, regresses the bounding boxes of remaining faces. For the remaining face candidates, they are updated to the new bounding boxes that are regressed. After each stage, non-maximum-suppression is performed on every box to merge those highly overlapped face candidates. The output of the last stage will sort bounding boxes of the remaining calibrated face candidates with facial landmarks’ positions, having only one bounding box for every face in the image.
The performance of the model was evaluated on FDDB and WIDER FACE datasets, outperforming the state-of-the-art methods by a large margin in both benchmarks.
Iii-C Progressive Calibration Networks (PCN)
The proposed progressive calibration network (PCN) 
is a real-time and accurate face detector, which aims to progressively calibrate the rotation-in-plane angle of each face candidate to upright in a three-stage multi-task deep convolutional network. More specifically, the calibration process to precise the rotation-in-plane angle is divided into several progressive steps and only predicts the coarse orientation in each stage. Given an input image, a sliding window and image pyramid principle are applied to obtain and detect all different sized faces within the image. Each face will be passed through the detector stage by stage. After passing in the image, the detector gathers face candidates and their bounding box coordinates, parse the stage output to get a list of confidence levels for each bounding box, then rejects most candidates with low face confidence, i.e. (Boxes that the network is not quite sure contains a face). Simultaneously, the estimated bounding boxes of remaining face candidates are regressed and calibrated according to the predicted coarse rotation-in-plane angles. After each stage, non-maximum-suppression is conducted to eliminate redundant boxes. The calibration based on the coarse rotation-in-plane prediction brings almost no additional time cost, leading to accurate and fast calibration.
The experiments on the multi-oriented FDDB and Rotation WIDER FACE datasets show that PCN performs way better than the baseline Cascade, with almost no extra time cost benefited from the efficient calibration process.
Iv Experimental Baselines
In the following part, we decribe at first the MIDV-500 dataset. After that, we present the evaluation results on the dataset to demonstrate the effectiveness and the limits of the methods described above.
Iv-a Dataset Structure
The MIDV-500 is a Mobile Identity Document Video dataset consisting of 500 video clips for 50 different identity document types with ground truth, including 17 types of ID cards, 14 types of passports, 13 types of driving licences and 6 other identity documents of various countries. The video clips were recorded in 5 different conditions:
CA, CS: ”Clutter” - The document lays on table with many objects in the background.
HA, HS: ”Hand” - The document is held on hand.
KA, KS: ”Keyboard” - The document lays on keyboards.
PA, PS: ”Partial” - The document is partially or completely hidden off-screen.
TA, TS: ”Table” - The document is presented on a table.
The dataset has about (50 documents) x (5 conditions) x (2 devices), each video was split into 10 frames per second while the duration of each video is 3 seconds, which makes it (30 images) x (2 devices) for every condition. In total there are 15 000 annotated frame images in the whole dataset.
For each extracted video frame, bounding box annotation for each face was performed manually. In total, there are 48 ID-faces out of 50 identity documents. The MIDV-500 dataset allows to perform studies of object detection and information extraction algorithms, measures their accuracy and their robustness against various distortions. For our study, we regard the face detection experiment to evaluate the state-of-the-art frameworks described above. For that, we consider only a part of the dataset taking into account the presence of a face on the document. Since each condition contains 30 frames per 3 seconds, most frames remain the same. So, we have taken only 2 frames per condition to evaluate the robustness of the three models. This way, 1000 frames (50 document) x (5 conditions) x (2 devices) x (2 frames) out of 15 000 were filtered for the experiment.
Iv-B Results on MIDV-500 Dataset
Face detection was performed using Cascade-CNN, MTCNN and PCN. The detection was conducted using original frames, where faces presented on the document images were annotated manually. We haven’t conducted face detection on projective restoration of document images based on ground truth document coordinates, since our objective is to perform the robustness of the three models on various poses and different face appearances in different angles. Following the protocol proposed by  of the evaluation of the face detector, we evaluate the three face detectors mentioned above in terms of ROC curves on MIDV-500 dataset shown in Fig. 2. In the ROC curve, the 20 to 200 FP is usually a sensible operating range in the existing works . We can see that the MTCNN (red curve) is the best face detector on MIDV-500 dataset, then the PCN (on blue curve) and the Cascade-CNN (green curve) performs worst. Even if the PCN face detector is designed for the rotation-invariant face detection, the PCN does not show its superiority on MIDV-500 dataset. As well as the PCN, the Cascade-CNN can also detect the rotating face images by using four small CNNs to deal with four directions of the rotation, i.e. up, down, right, left. However for each direction, the CNN used for the detection is small and the performance is inferior than the other detectors. Overall, MTCNN shows a good generalization ability on the new challenging dataset, and is tolerant to the moderate rotation although it is not designed for the rotated faces. We also evaluate the speed of the three face detectors on MIDV-5OO dataset as shown in Table I. The evaluation of speed is employed on CPU (Intel (R) Core(TM) i7-6500 CPU @ 2.50GHZ 2.59GHZ) and GPU (Nvidia-Titan X) respectively. Meanwhile the recall rate at 20, 50, 100, 200, 500 false positives on MIDV-500 are also shown in Table I. Fig. 3 shows the examples of the detection results obtained by the PCN, MTCNN and Cascade-CNN respectively. From the detected results, we can see that the MTCNN gains the best detection results in overall. It shows that the condition of detection is quite challenging, where it varies from the background of the environment, the complex illumination (such as the reflect light) and the distortion of the images introduced by the pose variation in the 3D space. Comparing to the other two methods, the Cascade-CNN has most mistakes like much more false positive detection and bad location of the coordinates of the bounding boxes. The PCN is much more better than Cascade-CNN, but in terms of the accuracy of locating the bounding boxes, it is inferior to MTCNN. It has to say, the examples does not include the extreme rotation case of the images, so for the up-right images, the MTCNN shows better performance than PCN. Comparing to the low resolution of the images in the Multi-Oriented FDDB or WiderFace datasets (from 40x40 to 400x400 maximum), the high resolution (1920x1080) of MIDV-500 images is the main reason for the low speed results.
|Method||Recall rate at FP on MIDV-500||Speed/FPS|
Comparing to the performance of these three methods on the classic datasets Multi-Oriented FDDB or WiderFace, either the recall rate or the speed of the three methods PCN, MTCNN and Casecade-CNN on the MIDV-500 dataset are greatly deteriorated. This is partly due to the quite challenging conditions of the MIDV-500 dataset, and also the heterogeneous face recognition problems. Since the methods evaluated in this work are mainly trained on the datasets such as FDDB or WiderFace which are mainly visual photos or selfies, faces in MIDV-500 dataset are the images in the ID-cards, passports, driving licences and other identity documents. These ones are way different from the training data images used in other datasets. In our case, we may find black and white photos and some relative poor image quality. However, the generalization ability is still a problem in a deep learning based data-driven detection method, which is strongly based on the data used to train the model. In particularly, for the detection task, the different way or manner used to label faces can also affect the performance. For instance, some datasets would include more parts of the head to completely cover the face, so the resulting detected bounding boxes are larger than others, this will cause the difference when we calculate the intersection-over-union used to evaluate the model. Fig.4 shows how the difference of annotating bounding boxes between different datasets affects the detection. The green bounding boxes are the detected results by PCN with very high confidence (larger than 0.99), they indicate that the model thought
has detected faces very well. However, we can see that the detected bounding boxes do not really fit the ground truth (the IoU with the ground truth bounding box is less than 0.5). This divergence is probably caused by the different way to annotate the bounding boxes between MIDV-500 and the training dataset.
In this paper, we evaluated three state-of-the-art face detection methods PCN, MTCNN and Cascade-CNN on the new challenging MIDV-500 dataset. Since the face detection is a necessary step for the following identity document analysis and recognition, it is worth to survey how the current face detection methods work on the ID card-like documents such as passports, citizen cards and ID-cards. The evaluation results show the performance and the limitations of the current methods for face detection on the new challenging MIDV-500 dataset. Yet, there is still much space to improve for future works for the face detection task under challenging conditions.
This work was supported by the MOBIDEM project, part of the “Systematic Paris-Region” and “Images & Network” Clusters, funded by the French government.
-  L. De Koker, “Money laundering compliance—the challenges of technology,” in Financial Crimes: Psychological, Technological, and Ethical Issues. Springer, 2016, pp. 329–347.
-  V. V. Arlazarov, K. Bulatov, T. Chernov, and V. L. Arlazarov, “Midv-500: A dataset for identity documents analysis and recognition on mobile devices in video stream,” arXiv preprint arXiv:1807.05786, 2018.
-  N. Skoryukina, D. P. Nikolaev, A. Sheshkus, and D. Polevoy, “Real time rectangular document detection on mobile devices,” in Seventh International Conference on Machine Vision (ICMV 2014), vol. 9445. International Society for Optics and Photonics, 2015, p. 94452A.
-  S. Usilin, D. Nikolaev, V. Postnikov, and G. Schaefer, “Visual appearance based document image classification,” in 2010 IEEE International Conference on Image Processing. IEEE, 2010, pp. 2133–2136.
P. Hu and D. Ramanan, “Finding tiny faces,” in
Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 951–959.
-  M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis, “Ssh: Single stage headless face detector,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4875–4884.
-  S. Yang, P. Luo, C.-C. Loy, and X. Tang, “From facial parts responses to face detection: A deep learning approach,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3676–3684.
-  X. Shi, S. Shan, M. Kan, S. Wu, and X. Chen, “Real-time rotation-invariant face detection with progressive calibration networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2295–2303.
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
-  H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5325–5334.
-  A. M. Awal, N. Ghanmi, R. Sicre, and T. Furon, “Complex document classification and localization application on identity document images,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 426–431.
-  M. Simon, E. Rodner, and J. Denzler, “Fine-grained classification of identity document types with only one example,” in 2015 14th IAPR International Conference on Machine Vision Applications (MVA). IEEE, 2015, pp. 126–129.
-  L.-P. De las Heras, O. R. Terrades, J. Llados, D. Fernandez-Mota, and C. Canero, “Use case visual bag-of-words techniques for camera based identity document classification,” in 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 721–725.
-  P. Viola, M. Jones et al., “Rapid object detection using a boosted cascade of simple features,” CVPR (1), vol. 1, no. 511-518, p. 3, 2001.
-  R. G. Cinbis, J. Verbeek, and C. Schmid, “Unsupervised metric learning for face identification in tv video,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 1559–1566.
-  L. Wolf, T. Hassner, and I. Maoz, Face recognition in unconstrained videos with matched background similarity. IEEE, 2011.
-  J. Sivic, M. Everingham, and A. Zisserman, “Person spotting: video shot retrieval for face sets,” in International conference on image and video retrieval. Springer, 2005, pp. 226–236.
C. Lu and X. Tang, “Surpassing human-level face verification performance on
lfw with gaussianface,” in
Twenty-ninth AAAI conference on artificial intelligence, 2015.
-  V. Starovoitov, D. Samal, and D. Briliuk, “Three approaches for face recognition,” in The 6-th International Conference on Pattern Recognition and Image Analysis October, 2002, pp. 21–26.
-  V. Starovoitov, D. Samal, and B. Sankur, “Matching of faces in camera images and document photographs,” in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), vol. 4. IEEE, 2000, pp. 2349–2352.
-  Y. Shi and A. K. Jain, “Docface: Matching id document photos to selfies,” in 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE, 2018, pp. 1–8.
D. E. King, “Dlib-ml: A machine learning toolkit,”Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1755–1758, 2009.
-  G. Bradski, “The opencv library. dr. dobb’s j. of softw. tools,” 2000.
-  V. Jain and E. Learned-Miller, “Fddb: A benchmark for face detection in unconstrained settings,” 2010.
-  S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: A face detection benchmark,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5525–5533.
X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 2879–2886.
-  M. Opitz, G. Waltner, G. Poier, H. Possegger, and H. Bischof, “Grid loss: Detecting occluded faces,” in European conference on computer vision. Springer, 2016, pp. 386–402.