In recent years, many computer-aided diagnosis (CAD) systems have been developed as supplementary tools for clinicians [1, 2, 3]. Such systems can substantially reduce both the workload of clinical experts and the occurrence of misdiagnosis. Among the imaging protocols, panoramic X-ray imaging is a popular diagnostic method owing to its very small radiation dose compared with cone beam computed tomography. This method captures the entire oral structure in a two-dimensional (2D) image and supports noninvasive planning of treatments such as implants and tooth extraction. Moreover, forensic identification can be conducted by analyzing the corresponding individual teeth of the subjects. Various applications, such as classification and segmentation [7, 8], have been developed using dental panoramic X-ray images. In particular, automated detection and identification of individual teeth are the most demanded algorithms and a critical prerequisite for other applications.
To localize objects of interest in images, various object detection methods have been extensively developed. Initially, classical machine learning-based algorithms were proposed. These classical methods typically employed feature descriptors and trained classifiers to obtain object boxes [10, 11]. Subsequently, convolutional neural network (CNN)-based detectors were introduced [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. Modern CNN-based detection methods can be categorized into two primary groups: 1) anchor-based [16, 20] and 2) point-based approaches [21, 22, 23]. The anchor-based methods employ exhaustive classifications on predefined anchor boxes and typically perform non-maximum suppression to localize each object. Conversely, point-based object detection regresses points that delineate an object, such as its center point. Besides the center point, several key points (e.g., left-top and right-bottom corners) can be simultaneously regressed for accurate object detection [22, 24]. Recent studies show that the point-based approaches achieve more promising results than the anchor-based methods in terms of accuracy and efficiency.
Although CNN-based detection methods show groundbreaking results, high accuracy must be guaranteed before an algorithm can serve as an important auxiliary diagnostic measure in clinical practice. A previous study reported that an automated deep learning-based algorithm is useful for teeth detection and identification in dental panoramic images; however, the study showed possible errors in detection, which can result in subsequent identification errors. Thus, the simple adoption of a general CNN-based detection algorithm cannot provide the high standard of accuracy that is required in clinics. It is critical to build a robust metric that can guarantee the accuracy of the algorithm in order to verify the applicability of the system.
In this paper, we propose a CNN-based individual tooth detection and identification algorithm based on direct regression of object points. First, we propose a point regression neural network that employs a spatial distance regularization (DR) loss. The proposed network performs center point regression of 32 fixed anatomical teeth, which automatically assigns anatomical identifiers. A novel inter-point DR penalty is applied to the output prediction of the 32 points on a neighborhood basis. Subsequently, the bounding box of each individual tooth is localized in a cascaded fashion. For the final box regression, a multitask framework is applied with an additional offset vector regression, which is trained to delineate a marginal error vector of a center point. The superiority of the proposed method depends not only on accurate localization but also on automated individual identification of teeth. An additional identification algorithm is not required in the proposed method. The proposed method automatically identifies each tooth by localizing all 32 possible regions of the teeth, including missing ones. The experimental results showed that our proposed method outperforms other state-of-the-art methods; moreover, various conditions of test images illustrated the clinical validity of the algorithm. The primary contributions of this work can be summarized as follows:
- Integration of a point-based detection method and fixed 32-point regression in a cascaded fashion.
- The proposed method does not require any additional classification methods.
- Introduction of a DR loss between neighboring teeth to improve the regression.
- Multitask training of box parameters and the marginal offset vector of the center point.
The remainder of this paper is structured as follows. In Section 2, we review related works on object detection and identification methods. Further, we describe our proposed method in Section 3. Section 4 demonstrates the experimental results and Sections 5 and 6 present the discussion and conclusion, respectively.
2 Object Detection
In this section, modern CNN-based object detection methods are reviewed. We first present several anchor-based approaches; subsequently, we highlight the state-of-the-art point-based methods that were developed recently.
2.1 Anchor-based Object Detection
In the earlier developments of CNN-based object detection, an exhaustive or a selective search of object regions was proposed [12, 13, 14, 15]. The separated two-step training (i.e., extraction of the region-of-interest and the subsequent classification using optional regression) was integrated after the region proposal network (RPN) was proposed. The RPN module employed grid-based anchors that were used to perform box regressions.
Most of the presented methods identified objects as axis-aligned boxes in an image. The candidate boxes of the objects were generated from predefined anchors, with several multiscale boxes for each anchor. An exhaustive classification was performed on all candidate boxes positioned according to the predefined anchors. To remove multiple overlapping boxes, post-processing such as non-maximum suppression (NMS) was required. For real-time applications (e.g., automobile or surveillance vision), one-stage classification and regression networks (i.e., single-shot detectors) were proposed [17, 18, 19, 20]. An exhaustive classification-based approach, such as faster regions with CNN (Faster R-CNN), was considered to be superior to the single-shot detectors [17, 18] in terms of accuracy, primarily because of the exhaustive local classifications and post-regression procedure.
The anchor-based methods have certain significant drawbacks: 1) the RPN module based on the anchors requires exhaustive classifications for each anchor box; 2) the additional classification network requires additional GPU memory; 3) the anchor-based predefined windows exclude small objects in the image. The first exhaustive classification in the RPN module significantly affects the final performance of a detector. That is, the method introduces additional challenges such as a class imbalance problem, NMS processing, and positive or negative example mining. The method also requires careful tuning of several hyperparameters such as the number of anchors, size of the anchor boxes, and scales.
2.2 Point-based Object Detection
More recently, object detection methods based on key points were proposed [21, 22, 23, 24]. The primary concept of these methods is the indication of significant “points” in the object of interest. One such approach considers an object as a single point, i.e., the center point of the bounding box of the object. The major advantage of the point-based algorithms is that the architecture does not employ anchor-based exhaustive classifications, and no region classification is required. The bounding boxes of the objects can be obtained by performing regression on orientation and sizes. Point-based methods [21, 23] can be considered as anchor-free one-stage approaches [17, 18, 26]. The primary difference is that the point-based approaches do not constrain the shapes and anchor positions. The anchor positions are extracted based on key-point detection [27, 28]; consequently, neither classification nor NMS post-processing is required. The detection of two corner points or of five key points including the center point was also presented. The center-point-based approaches [21, 23] do not require combinatorial grouping of key points after detection; thus, they are simpler and more effective than the methods that employ multiple key points.
3 Proposed Method
To accurately localize an individual tooth, we propose a point regression-based object detection method. First, our method performs direct regression of the center points for all possible 32 teeth regardless of the existence of the teeth, which automatically identifies each tooth. Subsequently, teeth boxes are individually localized using a cascaded neural network that utilizes intermediate feature maps for initial predictions and cropped patches from the original image. A simultaneous offset training is employed on the final output based on a multitask framework to improve the accuracy.
3.1 Network Architecture
As illustrated in Fig. 1, the proposed network initially estimates the center points of the 32 teeth based on direct regression. We used several backbone networks for the initial 32-point regression, i.e., residual network (ResNet), deep layer aggregation (DLA), and stacked hourglass (HG-Stacked), with a single modification of the final output tensor as a single vector of size 64 for 32 2D points. The output feature maps are pooled using the global average pooling (GAP) method and subsequently passed through a single fully connected (FC) layer to represent the positions of the 32 points.
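The regression head described above (GAP followed by one FC layer that emits 64 values) can be illustrated with a minimal pure-Python sketch. The toy feature-map size, the random weights, and the interleaved (x, y) layout of the output vector are assumptions made for illustration only; the helper names `global_average_pool`, `fully_connected`, and `as_points` are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Tuple

FeatureMaps = List[List[List[float]]]  # indexed as [channel][row][column]


def global_average_pool(features: FeatureMaps) -> List[float]:
    """Reduce each channel to the mean of its spatial values."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]


def fully_connected(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Affine layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def as_points(vector: List[float]) -> List[Tuple[float, float]]:
    """Read a flat vector as (x, y) pairs; the interleaved layout is an assumption."""
    return [(vector[i], vector[i + 1]) for i in range(0, len(vector), 2)]


rng = random.Random(0)
channels, height, width = 8, 4, 6  # toy backbone output, not the paper's sizes
features = [
    [[rng.random() for _ in range(width)] for _ in range(height)]
    for _ in range(channels)
]
weights = [[rng.uniform(-1.0, 1.0) for _ in range(channels)] for _ in range(64)]
bias = [0.0] * 64

pooled = global_average_pool(features)
points = as_points(fully_connected(pooled, weights, bias))
print(len(pooled), len(points))  # 8 pooled channels, 32 (x, y) points
```

Because the output size is fixed at 64, the position of a value in the vector determines which tooth it belongs to, which is why no separate classification step is needed in this design.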
The final teeth localization is subsequently obtained in a cascaded fashion for every patch. The second network utilizes the estimated center positions and the intermediate features that were used for the first estimation. The feature maps are upsampled to the original image resolution and concatenated with the original input image for further cascaded detection. All 32 patches are cropped to a fixed size, each centered on the corresponding center point. For the second CNN, we employed ResNet-18 for simplicity. The final offsets and box parameters are obtained, similar to the first stage, through a series of GAP and FC layers.
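The patch extraction step can be sketched as follows. This is a simplified illustration on a single-channel 2D array; the patch size, the rounding of the center, and the border handling (shifting the window so that it stays inside the image) are assumptions, and `crop_patch` and `crop_all` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]  # indexed as [row][column]


def crop_patch(
    image: Image, center: Tuple[float, float], patch_size: Tuple[int, int]
) -> Tuple[Image, Tuple[int, int]]:
    """Crop a fixed-size patch around center = (x, y).

    patch_size is (height, width). The window is shifted to stay inside the
    image, which must be at least as large as the patch. Returns the patch
    and the (left, top) corner of the window in image coordinates.
    """
    img_h, img_w = len(image), len(image[0])
    patch_h, patch_w = patch_size
    left = int(round(center[0])) - patch_w // 2
    top = int(round(center[1])) - patch_h // 2
    left = max(0, min(left, img_w - patch_w))
    top = max(0, min(top, img_h - patch_h))
    patch = [row[left:left + patch_w] for row in image[top:top + patch_h]]
    return patch, (left, top)


def crop_all(image: Image, centers: Sequence[Tuple[float, float]], patch_size: Tuple[int, int]):
    """One patch per estimated center point (32 in the setting described above)."""
    return [crop_patch(image, c, patch_size) for c in centers]


# Toy image whose pixel value encodes its position: value = 100 * row + column.
image = [[100 * y + x for x in range(12)] for y in range(10)]
patch, origin = crop_patch(image, (11.0, 9.0), (4, 4))
print(origin, patch[0][0])  # window is pushed back inside the image
```

Returning the window origin alongside the patch makes it straightforward to map patch-level predictions back to image coordinates.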
A multitask training is employed for the final estimation based on two objectives: 1) box parameter regression (i.e., width and height) and 2) center point offset (i.e., marginal vector for center point correction) regression. Figure 2 shows the ground-truth offsets that are used while training the network. The objective of training offsets is to improve the accuracy of the box position.
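The two regression targets of the multitask stage can be combined into a final box as in the sketch below. The sign convention (offset = ground truth minus first estimate, so that the corrected center is the estimate plus the offset) is an assumption consistent with the description of the offset as a correction vector; the function names are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def ground_truth_offset(gt_center: Point, est_center: Point) -> Point:
    """Training target for the offset branch (assumed: ground truth minus estimate)."""
    return (gt_center[0] - est_center[0], gt_center[1] - est_center[1])


def assemble_box(est_center: Point, offset: Point, size: Point) -> Box:
    """Correct the first-stage center with the offset, then expand by (width, height)."""
    cx = est_center[0] + offset[0]
    cy = est_center[1] + offset[1]
    width, height = size
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


estimate = (100.0, 50.0)      # first-stage center point
truth = (104.0, 47.0)         # annotated box center
offset = ground_truth_offset(truth, estimate)
box = assemble_box(estimate, offset, (20.0, 40.0))
print(offset, box)
```

With a perfectly predicted offset, the assembled box is centered exactly on the annotated center, which is the sense in which the offset branch refines the box position.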
3.2 Training the Network
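As described in the previous sections, the proposed network is trained based on four different loss terms: 1) initial regression of center points, 2) DR, 3) offsets of center points (i.e., marginal error vector), and 4) box parameter regression (i.e., width and height). For training the first center point regression, we employed a mean squared error (MSE) function. The loss function can be defined as

$$\mathcal{L}_{cen} = \frac{1}{N} \lVert \hat{\mathbf{p}} - \mathbf{p} \rVert_2^2, \tag{1}$$

where $\hat{\mathbf{p}}$ and $\mathbf{p}$ are the estimated vector of the center points and the ground-truth vector of the center positions (i.e., $\hat{\mathbf{p}}, \mathbf{p} \in \mathbb{R}^{64}$), respectively, and $N = 32$ is the number of teeth. In addition to the MSE loss, a novel DR penalty is employed on the output of the initial prediction of 32 points, based on a regularization loss on the Laplacian of spatially aligned inter-point distances. The vector $\hat{\mathbf{p}}$ is composed of two vectors, $\hat{\mathbf{p}} = [\hat{\mathbf{p}}^{u}, \hat{\mathbf{p}}^{l}]$, where $\hat{\mathbf{p}}^{u}$ and $\hat{\mathbf{p}}^{l}$ indicate the upper and lower teeth position vectors, respectively. Each vector is spatially aligned corresponding to the positions on the ground-truth image (e.g., left to right; Fig. 3). We first calculated two neighborhood distance vectors corresponding to the upper and lower position vectors:

$$\mathbf{d}^{u} = \left[ \lVert \hat{\mathbf{p}}^{u}_{i+1} - \hat{\mathbf{p}}^{u}_{i} \rVert_2 \right]_{i}, \tag{2}$$

$$\mathbf{d}^{l} = \left[ \lVert \hat{\mathbf{p}}^{l}_{i+1} - \hat{\mathbf{p}}^{l}_{i} \rVert_2 \right]_{i}, \tag{3}$$

where $i = 1, \dots, 15$ and $\hat{\mathbf{p}}^{u}_{i}$ ($\hat{\mathbf{p}}^{l}_{i}$) denotes the 2D position of the $i$-th upper (lower) tooth. The distance vectors represent the spatially aligned distances between the neighboring teeth. Finally, we modeled a DR loss corresponding to the distance vectors:

$$\mathcal{L}_{DR} = \lVert \nabla^{2} \mathbf{d}^{u} \rVert + \lVert \nabla^{2} \mathbf{d}^{l} \rVert, \tag{4}$$

where $\nabla$ is a gradient operator. Equation (4) is a norm regularization of the Laplacian of the distance vectors. Figure 3 illustrates the $\mathbf{d}^{u}$ and $\mathbf{d}^{l}$ vectors schematically. The primary underlying principle of the regularization term in (4) is to smooth the variation of distances between proximate teeth so that outlying positions are removed. The red-colored distances in Fig. 3 yield a high value when evaluated using (4), indicating that minimizing the term will regularize the proximate distances for accurate regression.

The final offset and box parameter regression is trained similarly to (1). The offset loss is defined as

$$\mathcal{L}_{off} = \frac{1}{N} \lVert \hat{\mathbf{o}} - \mathbf{o} \rVert_2^2, \tag{5}$$

where $\hat{\mathbf{o}}$ and $\mathbf{o}$ are the estimated vector of the final offset and the ground-truth offset at the current iteration, respectively. The ground-truth offset can be calculated by $\mathbf{o} = \mathbf{p} - \hat{\mathbf{p}}$, where $\hat{\mathbf{p}}$ is the estimated vector of the center points at the current iteration. Similarly, the loss function for the bounding box parameters is defined as

$$\mathcal{L}_{box} = \frac{1}{N} \lVert \hat{\mathbf{b}} - \mathbf{b} \rVert_2^2, \tag{6}$$

where $\hat{\mathbf{b}}$ and $\mathbf{b}$ are the estimated vector of the final box parameters and the ground truth, respectively.

The overall loss function is defined by combining all the presented loss functions into a single objective function:

$$\mathcal{L}(W) = \mathcal{L}_{cen} + \lambda_{1} \mathcal{L}_{DR} + \lambda_{2} \mathcal{L}_{off} + \lambda_{3} \mathcal{L}_{box} + \lVert W \rVert_2^2, \tag{7}$$

where $W$ indicates the weights of the proposed network and $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the weighting coefficients. The first two terms in the equation are related to the initial center regression. The third and fourth terms are the offset and box parameter losses, respectively. The final term is a global regularization term. The weighting coefficients were fixed in all the experiments.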
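The center-point MSE and the distance regularization described in this section can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the L1 norm on the discrete Laplacian, the unit loss weights, and all function names are assumptions, and the global weight regularization term is omitted.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


def neighbor_distances(points: Sequence[Point]) -> List[float]:
    """Euclidean distances between consecutive, spatially ordered points."""
    return [math.hypot(q[0] - p[0], q[1] - p[1]) for p, q in zip(points, points[1:])]


def laplacian_1d(values: Sequence[float]) -> List[float]:
    """Discrete second difference v[i-1] - 2*v[i] + v[i+1] at interior positions."""
    return [values[i - 1] - 2 * values[i] + values[i + 1] for i in range(1, len(values) - 1)]


def dr_loss(upper: Sequence[Point], lower: Sequence[Point]) -> float:
    """Sum of absolute Laplacian responses for both jaws (L1 norm is an assumption)."""
    total = 0.0
    for jaw in (upper, lower):
        total += sum(abs(v) for v in laplacian_1d(neighbor_distances(jaw)))
    return total


def total_loss(l_cen: float, l_dr: float, l_off: float, l_box: float,
               w_dr: float = 1.0, w_off: float = 1.0, w_box: float = 1.0) -> float:
    """Weighted sum of the four terms; the weights are placeholders, not the paper's values."""
    return l_cen + w_dr * l_dr + w_off * l_off + w_box * l_box


even = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)]
outlier = [(0.0, 0.0), (10.0, 0.0), (35.0, 0.0), (30.0, 0.0)]
print(dr_loss(even, even))     # evenly spaced points give zero penalty
print(dr_loss(outlier, even))  # a displaced point raises the penalty
```

The toy example shows the intended behavior: a single mislocalized point changes two adjacent distances in opposite directions, which the second difference amplifies, whereas a smooth change in spacing along the arch is penalized only mildly.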
3.3 Training Data
Each image was annotated by clinical experts in the field, as illustrated in Fig. 4. All 32 teeth were annotated using axis-aligned bounding boxes with corresponding anatomical identifiers for each tooth. Instead of annotating only the existing teeth in the image, we annotated all 32 teeth boxes, including even the missing teeth (Fig. 4(b)). By enforcing the annotation for all 32 points, the neural network can be trained through direct fixed-point positional regression of all points.
All training images were preprocessed using the contrast limited adaptive histogram equalization (CLAHE) method. The primary purpose of employing CLAHE was to minimize the variance of image contrasts among different machines.
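To make the contrast-limiting idea concrete, the sketch below equalizes a single tile with a clipped histogram. It is a deliberately simplified illustration: full CLAHE additionally splits the image into tiles and interpolates between the per-tile mappings, both of which are omitted here. The function name `clipped_equalize` and the clip limit used in the example are assumptions.

```python
from typing import List

Tile = List[List[int]]  # integer gray levels in [0, levels - 1]


def clipped_equalize(tile: Tile, clip_limit: float, levels: int = 256) -> Tile:
    """Histogram-equalize one tile after clipping its histogram.

    Bins above clip_limit are truncated and the removed counts are spread
    uniformly over all bins, which bounds the slope of the intensity mapping
    and therefore limits contrast amplification.
    """
    pixels = [p for row in tile for p in row]
    hist = [0.0] * levels
    for p in pixels:
        hist[p] += 1.0

    # Clip each bin and redistribute the excess uniformly.
    excess = sum(max(0.0, h - clip_limit) for h in hist)
    hist = [min(h, clip_limit) + excess / levels for h in hist]

    # The cumulative distribution defines the gray-level mapping.
    total = float(len(pixels))
    mapping: List[int] = []
    running = 0.0
    for h in hist:
        running += h
        mapping.append(min(levels - 1, round((levels - 1) * running / total)))
    return [[mapping[p] for p in row] for row in tile]


tile = [[10, 10, 10, 200], [10, 10, 10, 200]]
equalized = clipped_equalize(tile, clip_limit=2.0)
print(equalized)
```

In practice, a library routine would typically be used instead of a hand-written version; for example, OpenCV provides `cv2.createCLAHE(clipLimit=..., tileGridSize=...)`. The source does not state which implementation or parameters were used.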
4 Experimental Results
4.1 Data Acquisition and Configuration
This study was conducted in association with Osstem Implant Co., Ltd. All datasets were acquired with the consent of the patients and for academic purposes only. The images were sourced from four different machines provided by Osstem Implant, HDX WILL, PointNix, and Genoray. The datasets were obtained from dental clinics for ordinary diagnostic purposes, which indicates that the datasets represent a variety of conditions commonly observed in clinics. In the dataset, the width and height of the panoramic X-ray images ranged from 2093 to 3432 pixels and from 1012 to 1504 pixels, respectively. The x- and y-axis pixel spacing ranged from 0.07 to 0.10 mm.
A total of 818 images were collected for training, validation, and testing. We used 574 images for training, 162 images for validation, and 82 images for testing. The quantitative evaluations for all the experimental results were conducted on 82 test images.
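Table 1. Detection accuracy and computational performance.
| Method | Backbone | AP | AP50 | AP75 | mIoU | FPS |
|---|---|---|---|---|---|---|
| Faster R-CNN (1 class) | ResNet-18 | 0.69 | 0.88 | 0.51 | 0.78 | 9.02 |
| Faster R-CNN (32 classes) | ResNet-18 | 0.64 | 0.81 | 0.50 | 0.76 | 5.68 |
| CenterNet (1 class) | ResNet-18 | 0.60 | 0.81 | 0.17 | 0.57 | 35.38 |
| CenterNet (32 classes) | ResNet-18 | 0.67 | 0.90 | 0.34 | 0.70 | 34.83 |

Table 2. Ablation studies.
| Method | Backbone | AP | AP50 | AP75 | mIoU |
|---|---|---|---|---|---|
| Our method w/o DR | ResNet-18 | 0.79 | 0.91 | 0.81 | 0.82 |
| Our method w/o OFF | ResNet-18 | 0.58 | 0.69 | 0.23 | 0.60 |
| Our method w/o DR and OFF | ResNet-18 | 0.57 | 0.69 | 0.22 | 0.60 |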
4.2 Evaluation Metrics
To measure the accuracy of tooth detection, we employed the average precision (AP) metric. This metric is calculated based on the area under the receiver operating characteristics (ROC) curve by considering the criterion of intersection-over-union (IoU):

$$\mathrm{IoU} = \frac{|B_{p} \cap B_{gt}|}{|B_{p} \cup B_{gt}|}, \tag{8}$$

where $B_{p}$ is the inferred box area and $B_{gt}$ is the corresponding ground-truth box area. The precision (i.e., $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$) and recall (i.e., $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$) can be calculated, where TP, FP, and FN indicate the number of true positives, false positives, and false negatives, respectively, according to a certain IoU threshold value. We calculated the area under the ROC curve using IoU thresholds ranging from 0 to 1 at intervals of 0.05.
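The IoU criterion and the threshold-dependent precision and recall can be sketched as below. The greedy one-to-one matching of detections to ground-truth boxes is an assumption made for the sketch (the text does not specify the matching rule used for AP), and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    detections: Sequence[Box], ground_truths: Sequence[Box], iou_threshold: float
) -> Tuple[float, float]:
    """Count TP/FP/FN at one IoU threshold with greedy one-to-one matching."""
    matched = set()
    tp = 0
    for det in detections:
        best_j, best_iou = None, 0.0
        for j, gt in enumerate(ground_truths):
            if j in matched:
                continue
            value = iou(det, gt)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)
            tp += 1
    fp = len(detections) - tp
    fn = len(ground_truths) - tp
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return precision, recall


gts: List[Box] = [(0, 0, 10, 10), (20, 0, 30, 10)]
dets: List[Box] = [(1, 0, 11, 10), (20, 5, 30, 15), (50, 50, 60, 60)]

# Sweep the IoU threshold from 0 to 1 in steps of 0.05, as described above.
for k in range(0, 21, 5):
    threshold = k / 20
    print(threshold, precision_recall(dets, gts, threshold))
```

Raising the threshold turns loosely localized detections from true positives into false positives, so both precision and recall are non-increasing along the sweep.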
To measure the accuracy of localization in successful detections, we introduced the mean IoU (mIoU) metric. We first matched the detected boxes with the ground-truth boxes based on the criterion of maximum IoU. Each detected box was assigned to a single ground-truth box if its IoU value was greater than zero. For the final assignment, each ground-truth box selected the single detected box that had the maximum IoU value among those assigned to it. The mIoU value is calculated based on the matched pairs of boxes, excluding false detections:

$$\mathrm{mIoU} = \frac{1}{M} \sum_{k=1}^{M} \mathrm{IoU}_{k}, \tag{9}$$

where $M$ is the number of matched pairs and $\mathrm{IoU}_{k}$ is the IoU value of the $k$-th matched pair.
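The two-step matching that precedes the mIoU computation can be sketched as follows. The sketch follows the description above (each detection is assigned to the ground-truth box with which it overlaps most, then each ground-truth box keeps its best detection); tie-breaking and the return value for the case of no matches are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(detections: Sequence[Box], ground_truths: Sequence[Box]) -> float:
    """Mean IoU over matched (ground truth, detection) pairs."""
    # Step 1: assign each detection to the ground-truth box of maximum IoU (> 0).
    assigned: Dict[int, List[float]] = {}
    for det in detections:
        overlaps = [iou(det, gt) for gt in ground_truths]
        if not overlaps:
            continue
        j = max(range(len(overlaps)), key=overlaps.__getitem__)
        if overlaps[j] > 0:
            assigned.setdefault(j, []).append(overlaps[j])
    # Step 2: each ground-truth box keeps only its best assigned detection.
    best = [max(values) for values in assigned.values()]
    return sum(best) / len(best) if best else 0.0


gts: List[Box] = [(0, 0, 10, 10), (20, 0, 30, 10)]
dets: List[Box] = [
    (1, 0, 11, 10),    # best match for the first ground-truth box
    (2, 0, 12, 10),    # weaker duplicate on the same tooth, discarded in step 2
    (20, 5, 30, 15),   # match for the second ground-truth box
    (50, 50, 60, 60),  # false detection, excluded from the average
]
print(mean_iou(dets, gts))
```

Because false detections and unmatched ground-truth boxes are excluded, mIoU isolates localization quality from detection quality, which is measured separately by AP.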
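To measure the identification accuracy, we measured the precision and recall metrics:

$$\mathrm{Precision} = \frac{N_{tp}}{N_{det}}, \tag{10}$$

$$\mathrm{Recall} = \frac{N_{tp}}{N_{gt}}, \tag{11}$$

where $N_{gt}$, $N_{det}$, and $N_{tp}$ denote the number of ground-truth tooth boxes (i.e., existing teeth), the number of detected boxes with respect to a threshold value, and the number of true positive identifications (i.e., accurate numbering) among $N_{det}$, respectively.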
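The identification metrics reduce to simple counting once each detected box has been paired with the identifier of the tooth it covers. A minimal sketch follows; the input format (a list of predicted and true identifier pairs for the detected boxes) and the integer identifiers are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def identification_precision_recall(
    matched_ids: Sequence[Tuple[int, int]], num_ground_truth: int
) -> Tuple[float, float]:
    """Precision and recall of tooth numbering.

    matched_ids holds one (predicted_id, true_id) pair per detected box that
    passed the detection threshold; num_ground_truth is the number of
    existing teeth in the image.
    """
    n_det = len(matched_ids)
    n_tp = sum(1 for predicted, true in matched_ids if predicted == true)
    precision = n_tp / n_det if n_det else 0.0
    recall = n_tp / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


# Hypothetical case: 28 existing teeth, 27 detected, one of them misnumbered.
pairs = [(tooth, tooth) for tooth in range(1, 27)] + [(27, 28)]
precision, recall = identification_precision_recall(pairs, num_ground_truth=28)
print(precision, recall)
```

Precision penalizes wrong numbers among the detected teeth, while recall additionally penalizes existing teeth that were not detected at all.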
4.3 Accuracy Evaluation
We performed a comparative analysis of the tooth detection results with state-of-the-art detectors, including the anchor-based Faster R-CNN and a center-point-based method. We present two different settings for each network: tooth or non-tooth classification (i.e., 1 class) and classification of all 32 teeth (i.e., 32 classes). Various backbone networks, including ResNet-18, DLA-34, and HG-Stacked, were used in the experiment for a comprehensive analysis. Table 1 lists the values of AP, mIoU, and the computational performance in frames per second (FPS). The accuracy of detection was evaluated by considering several AP values, i.e., AP, AP50, and AP75. AP50 and AP75 indicate the scores where the IoU threshold is 0.50 and 0.75, respectively. The localization accuracy of successful detections was evaluated by mIoU, as presented in (9).
The results demonstrate that our proposed method is superior to other state-of-the-art methods for tooth detection. Regardless of the backbone network, the proposed method showed the best AP and mIoU scores when compared to other detection approaches. The similar AP scores of our method for each backbone network indicate that the complexity of the network (i.e., network architecture) is not an important factor for the final accuracy. Among the networks, CenterNet showed the highest FPS.
Figure 10 illustrates certain sample results. The numbers denote the anatomical tooth identifiers and the percentage values indicate the IoU values. The proposed method accurately localized each individual tooth when compared to other state-of-the-art methods. The networks that were trained with one class, i.e., Faster R-CNN-1 and CenterNet-1, did not detect non-teeth regions because the networks only detected existing teeth from the input images. The proposed method demonstrated the most successful results among the networks regardless of the missing teeth (the second row in Fig. 10).
4.4 Ablation Studies
To verify the effect of our proposed method, several ablation studies were conducted (Table 2), i.e., without DR (i.e., (4)), without offset (OFF; i.e., (5)), and without both DR and OFF. As listed in Table 2, the improvement of accuracy was primarily achieved through the multitask offset training. The accuracy of detection severely decreased without the offset branch (i.e., our proposed method without OFF). The distance-based regularization loss showed no significant difference without the offset training. The best accuracy was obtained by employing both DR and OFF losses (Table 1).
The MSE values for center point regression are also listed in Table 3 to assess the performance of localization. Two values are reported: the MSE of the first estimation of center points (i.e., center points in Fig. 1) and the MSE of the final center point regression calculated from the first estimation and the offset (i.e., center offsets in Fig. 1). The accuracy of center point regression significantly improved owing to multitask offset training. The proposed DR loss also improved the accuracy of the first and final outputs. In the case of the network without offset, the first accuracy improved, which indicates that the proposed cascaded fashion of two-step training concentrated on the final accuracy rather than the first regression. Sample visualizations of center point regression are illustrated in Fig. 11. The proposed DR successfully regularized the proximate distances between neighboring teeth; thus, an improvement in regression accuracy was observed.
4.5 Tooth Identification
The accuracy of identification was evaluated by comparing the proposed method with Faster R-CNN-32 and CenterNet-32. Each of these networks was trained to classify each tooth by its anatomical identifier based on the ground-truth tooth numbers. Table 4 lists the results of the precision (according to (10)) and recall (according to (11)) values. The accuracy was evaluated based on the existing teeth in the test images. Our proposed method showed superior performance in teeth identification. The primary advantage was obtained by the fixed 32-point regression and DR loss. Our method does not require any additional classifications as employed in other detection methods such as Faster R-CNN-32 and CenterNet-32. Ablation cases of the proposed network demonstrated that our proposed DR and OFF losses significantly improved the accuracy of identification. The recall value significantly decreased without offset training. Figure 12 illustrates sample visualizations of the confusion matrix based on the ResNet-18 backbone network.
5 Discussion
Individual tooth detection and identification is a clinical CAD application that requires a high standard of accuracy for its deployment in clinics. It is challenging to obtain an accurate system by employing the state-of-the-art object detection methods [16, 23] because there are no specific metrics to minimize false positive or false negative detection. In this study, we proposed a fixed 32-point regression method and simultaneous localization that resolves these limitations. The proposed method automatically identifies the anatomical identifiers of teeth rather than performing each classification as presented in other studies [16, 23]. Thus, the proposed network achieved high accuracy in identification, which is important for the application. The proposed 32-point regression method is similar to the landmark detection approaches [35, 36, 37] that directly estimate the positions of fixed points. Our work combined the point regression scheme with object localization to achieve an accurate detection task with identification.
Instead of employing a single feed-forward network to detect 32 different classes, which contains explicit classification steps, we employed a cascaded framework to implement class-agnostic object localization. The proposed architecture is more efficient than other state-of-the-art methods because there are no additional classification layers for each class.
6 Conclusion
In this study, we proposed a point-wise tooth localization neural network by introducing a spatial DR loss. The proposed network employed center point regression for all the anatomical teeth (i.e., 32 points), which automatically identified each tooth. The regularization loss on the Laplacian of spatial distances improved the detection accuracy based on center points. The final detection was performed using a cascaded, class-agnostic localization neural network with multitask training of center offsets. The experimental results demonstrated that the proposed method outperformed state-of-the-art detection approaches, which typically require an external classification for each class. Our proposed method achieved a precision of 0.997 and a recall of 0.972 for tooth identification, indicating the practical applicability of the proposed system in clinics.
This work was partly supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-01815, Development of AR-based Surgery Toolkit and Applications). This research was also partly supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (Nos. 2017R1D1A1B03034484 and 2017R1A2B3011475).
-  B. Van Ginneken, B. T. H. Romeny, and M. A. Viergever, “Computer-aided diagnosis in chest radiography: a survey,” IEEE Transactions on medical imaging, vol. 20, no. 12, pp. 1228–1241, 2001.
-  K. Doi, “Computer-aided diagnosis in medical imaging: historical review, current status and future potential,” Computerized medical imaging and graphics, vol. 31, no. 4-5, pp. 198–211, 2007.
-  C.-W. Wang, C.-T. Huang, J.-H. Lee, C.-H. Li, S.-W. Chang, M.-J. Siao, T.-M. Lai, B. Ibragimov, T. Vrtovec, O. Ronneberger et al., “A benchmark for comparison of dental radiography analysis algorithms,” Medical image analysis, vol. 31, pp. 63–76, 2016.
-  C. Angelopoulos, S. Thomas, S. Hechler, N. Parissis, and M. Hlavacek, “Comparison between digital panoramic radiography and cone-beam computed tomography for the identification of the mandibular canal as part of presurgical dental implant assessment,” Journal of Oral and Maxillofacial Surgery, vol. 66, no. 10, pp. 2130–2135, 2008.
-  O. Nomir and M. Abdel-Mottaleb, “Human identification from dental x-ray images based on the shape and appearance of the teeth,” IEEE transactions on information forensics and security, vol. 2, no. 2, pp. 188–197, 2007.
-  Y. Miki, C. Muramatsu, T. Hayashi, X. Zhou, T. Hara, A. Katsumata, and H. Fujita, “Classification of teeth in cone-beam ct using deep convolutional neural network,” Computers in biology and medicine, vol. 80, pp. 24–29, 2017.
-  T. M. Tuan et al., “A cooperative semi-supervised fuzzy clustering framework for dental x-ray image segmentation,” Expert Systems with Applications, vol. 46, pp. 380–393, 2016.
-  ——, “Dental segmentation from x-ray images using semi-supervised fuzzy clustering with spatial constraints,” Engineering Applications of Artificial Intelligence, vol. 59, pp. 186–195, 2017.
-  H. Chen, K. Zhang, P. Lyu, H. Li, L. Zhang, J. Wu, and C.-H. Lee, “A deep learning approach to automatic teeth detection and numbering based on object detection in dental periapical films,” Scientific reports, vol. 9, no. 1, pp. 1–11, 2019.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1. IEEE, 2005, pp. 886–893.
-  Y. Zheng, A. Barbu, B. Georgescu, M. Scheuering, and D. Comaniciu, “Four-chamber heart modeling and automatic segmentation for 3-d cardiac ct volumes using marginal space learning and steerable features,” IEEE transactions on medical imaging, vol. 27, no. 11, pp. 1668–1681, 2008.
-  J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
-  R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
-  J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
-  ——, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
-  X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
-  H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
-  K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “Centernet: Keypoint triplets for object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6569–6578.
-  X. Zhou, J. Zhuo, and P. Krahenbuhl, “Bottom-up object detection by grouping extreme and center points,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 850–859.
-  N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Soft-nms–improving object detection with one line of code,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5561–5569.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
-  A. Newell, Z. Huang, and J. Deng, “Associative embedding: End-to-end learning for joint detection and grouping,” in Advances in neural information processing systems, 2017, pp. 2277–2287.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2403–2412.
-  A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European conference on computer vision. Springer, 2016, pp. 483–499.
-  M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
-  K. Zuiderveld, “Contrast limited adaptive histogram equalization,” in Graphics gems IV. Academic Press Professional, Inc., 1994, pp. 474–485.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
-  Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 3476–3483.
-  Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by deep multi-task learning,” in European conference on computer vision. Springer, 2014, pp. 94–108.
-  A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1653–1660.