down-sampled at different resolutions using bicubic interpolation (resized for better visualization). Low-resolution depth images contain little information for the identification of patient and health professionals. Corresponding color images in the second row are shown for better appreciation of the downsampling process.
Modern hospitals could benefit greatly from smart assistance systems that support the workflow by exploiting digital data from equipment and sensors through artificial intelligence and surgical data science [11, 12]. This is illustrated by the recent development of new applications, such as patient activity monitoring inside intensive care units (ICU), staff hand-hygiene recognition, radiation exposure monitoring during hybrid surgery and workflow step recognition in the operating room (OR).
These systems, which have a huge potential to improve safety and care, all rely on machine intelligence using computer vision models to extract semantic information from visual data. In particular, human detection and pose estimation in the operating room [1, 6, 8] is one of the key components needed to develop such applications. However, constant monitoring with cameras raises potential concerns for the privacy of patients and health professionals. Cameras usually capture color images, depth images, or both for visual processing. Color images appear to be the most privacy-intrusive, but even textureless depth images can intrude on privacy when used at sufficiently high resolution [3, 4]. This is particularly relevant in environments where the number of persons is limited and where individuals could therefore be identified more easily. Figure 4 shows depth images at different resolutions. It suggests that low-resolution images could be used for more privacy-compliant computer vision applications and that their recording could be better accepted by clinical institutions. It has previously been shown that activity recognition can be performed on low-resolution depth images captured for the tasks of hand-hygiene classification and ICU activity logging. In this work, we investigate whether low-resolution depth images contain sufficient information for accurate human pose estimation (HPE).
HPE consists of localizing human keypoints in images. Methods for human pose estimation differ for color and depth images, both in terms of model architectures and complexity of training datasets. In the case of color images, deep learning models have recently shown remarkable progress with the help of large-scale, in-the-wild annotated datasets, such as COCO. Deep learning models for HPE can generally be grouped into bottom-up and top-down approaches. Bottom-up approaches first detect the keypoints and then group them to form skeletons, whereas top-down approaches first detect each person using a person detector and then use a single-person pose estimator to estimate the body joints in each detected box. Top-down approaches are often more accurate due to their two-stage design but are slower than bottom-up approaches. For depth images, Shotton et al. use a high-resolution synthetic depth dataset to train the models, while Haque et al. focus on single-person pose estimation using datasets that record actors performing simulated actions. Recently, Srivastav et al. have introduced the MVOR dataset, which contains color and depth images captured in the OR along with ground truth human poses. They have also evaluated recent HPE methods. This is therefore an interesting testbed for multi-person pose estimation on depth data captured during real surgical activities, which we will use in this work.
Current methods for HPE inside the OR have been developed using standard resolution images [1, 8]. We have found that state-of-the-art models, which are trained on the high-resolution images, perform poorly on the corresponding low-resolution images. In this paper, we therefore propose an approach for the human pose estimation problem on low-resolution depth images. To the best of our knowledge, this is the first work that attempts to solve this task.
To train our system, we use a non-annotated dataset of synchronized RGB-D images captured in the OR environment. Unlike conventional approaches, which rely on manual or synthetically rendered annotations that are challenging to generate, we propose to use the detections from a state-of-the-art method applied to the color images as pseudo ground truth for the corresponding depth images. This simple idea turns out to be very effective. Indeed, as our approach only requires a set of RGB-D images at training time, it can easily be retrained in any facility since no annotation process is needed. It can then run around the clock on low-resolution depth images from the same facility. Our HPE approach is a network that integrates super-resolution modules with a 2D multi-person body keypoint estimator based on RTPose. It utilizes intermediate super-resolution feature maps to better learn the high-frequency features. With the proposed architecture, we achieve the same results as a network trained on the standard resolution images and improve by % the results of a baseline method which up-samples the low-resolution images with bicubic interpolation before feeding them to the pose estimation network.
Our approach is inspired by recent developments in the areas of super-resolution and multi-person human pose estimation. We propose to integrate a super-resolution image estimator and a 2D multi-person pose estimator in a joint architecture, illustrated in Figure 3. This architecture is a modification of the RTPose network. Besides yielding competitive results on COCO and MVOR, RTPose has the advantage of performing multi-person pose estimation in a single step, thereby simplifying the integration and training of the super-resolution modules. It is composed of a feature extraction block and a pose estimation block, shown in Figure 3. We introduce a super-resolution block, which not only increases the spatial resolution but also generates super-resolution (SR) feature maps (S1, S2). These intermediate feature maps recover high-frequency details that are lost during the low-resolution (LR) image generation process, and they are used in the pose estimation block for better localization. The super-resolution block uses a multi-stage design, where each stage increases the spatial resolution of the feature maps by a factor of two using the pixel-shuffle algorithm (while reducing the number of channels by a factor of four). During training, a complete SR image is generated to compute the auxiliary loss L_HR, which compares the SR image to the ground truth high-resolution (HR) depth image using the L2 norm. This helps to train the super-resolution block and refines the input to the SR features block. Note that during training, errors from the pose estimation are also back-propagated to these blocks. Furthermore, at test time only LR images are used and no SR images need to be generated by the network, since only the SR feature maps are used.
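To make the pixel-shuffle step concrete, the following is a minimal pure-Python sketch of the rearrangement for an upscaling factor r = 2. It follows the common sub-pixel convolution channel ordering, which is an assumption here, and the function name `pixel_shuffle` and the nested-list tensor layout are ours, not the authors' implementation.

```python
def pixel_shuffle(x, r=2):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list.

    Assumed channel ordering: input channel c*r*r + i*r + j supplies the
    output pixel at row offset i and column offset j inside each r x r cell.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x2 channels become one 2x4 channel: resolution doubles,
# and the channel count is divided by four.
features = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
upsampled = pixel_shuffle(features, r=2)
```

No values are interpolated: the operation only moves channel entries into spatial positions, so the upsampling itself has no learned parameters and any learning happens in the layers that precede it.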
RTPose was originally developed for color images. Since depth images contain fewer texture details, we have made the architecture more computationally efficient by reducing the number of iterative refinement stages from five to three. The network uses two separate branches, one for keypoint localization and another to compute part affinity maps. In our architecture, these two branches consume the three types of features (F, S1, S2), where F denotes the features extracted from the high-resolution feature maps provided by the super-resolution block. The final skeleton is generated from the part affinity and keypoint localization heatmaps using the bipartite graph matching algorithm of RTPose. The losses in the pose estimation network are the same as in RTPose, but now take their input from the SR feature maps (S1, S2). At each stage, two L2 losses are computed: one between the predicted and ground truth part affinity maps, and one between the predicted and ground truth keypoint localization heatmaps. All of these per-stage losses are summed together to form the pose estimation loss L_P. Finally, the total loss is the sum of L_HR and L_P. We have chosen to weigh both terms equally, as we observe that their magnitudes are similar. The complete network is trained end-to-end jointly for both super-resolution and pose estimation.
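The training objective described above can be summarized as follows. The notation is an assumption introduced for illustration: t indexes the refinement stages (T = 3 here), \hat{P}^{t} and \hat{K}^{t} are the predicted part affinity maps and keypoint heatmaps at stage t, P^{*} and K^{*} are their pseudo ground truth counterparts, and \hat{I}_{SR} and I_{HR} are the generated SR image and the HR depth image. Whether the norms are squared, and whether any spatial masking is applied, is not stated in the text; the squared form below is a sketch.

```latex
\begin{align}
L_{HR} &= \lVert \hat{I}_{SR} - I_{HR} \rVert_2^2 \\
L_{PAF}^{t} &= \lVert \hat{P}^{t} - P^{*} \rVert_2^2, \qquad
L_{KP}^{t} = \lVert \hat{K}^{t} - K^{*} \rVert_2^2 \\
L_{P} &= \sum_{t=1}^{T} \left( L_{PAF}^{t} + L_{KP}^{t} \right) \\
L &= L_{HR} + L_{P}
\end{align}
```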
2.2 Ground-truth generation
In the literature, authors have used either manually annotated or synthetically generated datasets to train for HPE on depth images. Manual annotations can be expensive and time-consuming, while synthetic annotations are difficult to generate due to the constraint of realistic rendering and do not always generalize well to real scenarios. Therefore, we use an alternative approach to generate annotations. This approach is based on the observation that RGBD cameras capture synchronized color and depth streams, and that recent HPE methods trained on the COCO dataset work remarkably well on the color images. We therefore use detections from the color images to train the model for the depth images. To facilitate this approach, we collected an unlabeled RGBD dataset containing 80k synchronized color and depth images captured in the OR during real surgical procedures. Then, we used the state-of-the-art person detector Mask-RCNN and the single-person pose estimator MSRA on the color images to generate detections. We filter out the false positives and retain high-quality detections in both stages using thresholds selected from the qualitative results on a small set of images. This approach generates pseudo ground truth automatically, without any human annotation effort. It is therefore scalable and can be deployed to any facility. For human pose estimation on the color images, we choose a two-step method based on Mask-RCNN and MSRA for their state-of-the-art performance on the public COCO dataset. Note that such a two-step method would be less convenient to use as the basis of our depth-based approach, due to the large architectures involved and the fact that super-resolution would need to be integrated into both.
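The two-stage filtering can be sketched as below, using the thresholds reported in the training setup (a person score threshold of 0.7, and at least 4 keypoints with a score greater than 0.35). The dictionary layout, the function name and the treatment of a score exactly equal to 0.7 are assumptions made for illustration; the actual outputs of Mask-RCNN and MSRA are structured differently.

```python
def filter_pseudo_ground_truth(detections, person_thresh=0.7,
                               kp_thresh=0.35, min_keypoints=4):
    """Keep only detections that pass both confidence checks.

    Each detection is a hypothetical dict with:
      "person_score": confidence from the person detector
      "keypoints":    list of (x, y, score) tuples from the pose estimator
    """
    kept = []
    for det in detections:
        # Stage 1: discard low-confidence person boxes (assumed inclusive at 0.7).
        if det["person_score"] < person_thresh:
            continue
        # Stage 2: require enough confidently localized keypoints.
        confident = sum(1 for (_, _, s) in det["keypoints"] if s > kp_thresh)
        if confident >= min_keypoints:
            kept.append(det)
    return kept


detections = [
    {"person_score": 0.95, "keypoints": [(10, 20, 0.9)] * 5},  # kept
    {"person_score": 0.50, "keypoints": [(10, 20, 0.9)] * 5},  # weak person box
    {"person_score": 0.90, "keypoints": [(10, 20, 0.9)] * 3},  # too few keypoints
    {"person_score": 0.90, "keypoints": [(10, 20, 0.2)] * 6},  # keypoints too weak
]
pseudo_gt = filter_pseudo_ground_truth(detections)
```

Skeletons that survive both checks are then used as training targets for the depth frame captured at the same time as the color frame.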
3 Experiments and Results
3.0.1 Training setup:
We use the dataset of 80k images and the pseudo ground truth described in Section 2.2 for training. It contains 20k images from each of four categories, which include images with one, two, three, and four or more persons, respectively. We split the dataset into 77k training and 3k validation images. We use bicubic interpolation to downsample the images to sizes 80x60 and 64x48. To generate the pseudo ground truth, we use a threshold of 0.7 in the person-detector stage and then keep a skeleton only if at least 4 of its keypoints are detected with a score greater than 0.35. We use the PyTorch deep learning framework in our experiments. The depth images are normalized to the range [0, 255], and we train our networks using the stochastic gradient descent optimizer with a momentum of 0.9. The initial learning rate is set to 0.001 with a step decay of 0.1 after 12k iterations, and each model is trained for 32k iterations with a batch size of 12. We use the pre-trained weights from the authors of RTPose to initialize the pose-estimator networks. Note that these weights were originally obtained using the color images from the COCO dataset. For the layers that have been modified in the pose-estimation network and contain a larger number of channels (e.g. to accommodate S1 and S2), we repeat the same weights and perturb them by a small random number. The weights of the super-resolution network are initialized using orthogonal initialization.
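As a small illustration of the learning-rate schedule, the sketch below assumes that the factor of 0.1 is applied every 12k iterations. The text could also be read as a single drop at 12k iterations, so this interpretation and the function name `learning_rate` are assumptions.

```python
def learning_rate(iteration, base_lr=0.001, decay=0.1, step=12_000):
    """Step decay: multiply the base rate by `decay` once per completed `step`."""
    return base_lr * decay ** (iteration // step)


# Learning rate at the start of training, after each assumed decay step,
# and at the last of the 32k iterations.
schedule = {it: learning_rate(it) for it in (0, 11_999, 12_000, 24_000, 31_999)}
```

Under this reading, a 32k-iteration run uses 0.001 for the first 12k iterations, then approximately 0.0001 for the next 12k, and approximately 0.00001 for the remaining 8k.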
3.0.2 Testing setup:
We evaluate our method on the publicly available depth images of the MVOR dataset, which contains images of size 640x480 captured in an OR from 3 different viewpoints during actual clinical interventions. The training dataset comes from the same environment and camera setup but contains data captured on different days. During testing, we use the flip-test, namely we average the original heatmaps with the heatmaps obtained after flipping the images horizontally to refine the predictions. We use the percentage of correct keypoints (PCK) as an evaluation metric, which is widely used to measure the localization accuracy of the detected skeletons in multi-person scenarios.
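For readers unfamiliar with PCK, the sketch below shows the basic computation for one predicted skeleton that has already been matched to a ground truth person. The normalization length `ref_length` (for example, the larger side of the person bounding box) and the tolerance `alpha` are assumptions; the exact values and the matching procedure used for the reported numbers are not given in this section.

```python
import math


def pck(pred, gt, ref_length, alpha=0.2):
    """Fraction of annotated keypoints predicted within alpha * ref_length.

    pred: list of (x, y) predicted keypoints
    gt:   list of (x, y) ground truth keypoints, or None where not annotated
    """
    correct = total = 0
    for (px, py), g in zip(pred, gt):
        if g is None:  # skip keypoints without ground truth
            continue
        total += 1
        if math.hypot(px - g[0], py - g[1]) <= alpha * ref_length:
            correct += 1
    return correct / total if total else 0.0


pred = [(10, 10), (50, 50), (0, 0)]
gt = [(12, 10), (80, 50), None]
score = pck(pred, gt, ref_length=100, alpha=0.2)
```

In this toy case the tolerance is 20 pixels: the first keypoint is 2 pixels off and counts as correct, the second is 30 pixels off and does not, and the third is ignored because it has no annotation.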
We show our results in Table 1. RTPose_640x480, RTPose_80x60, and RTPose_64x48 are baseline RTPose models that do not use any super-resolution and are trained on 640x480 (full-size), 80x60, and 64x48 depth images, respectively. These RTPose variants are the original models modified to take a 1-channel input. The degraded 80x60 and 64x48 images are resampled to the original size using bicubic interpolation to match the input size of the network. DepthPose_80x60 and DepthPose_64x48 are our proposed networks, trained directly on 80x60 and 64x48 low-resolution images. The results show that the DepthPose_64x48 network, which uses 10x downsampled images, performs on par with the baseline trained on full-size images. Accuracy is improved by over 6.5% compared to the baseline RTPose_64x48. DepthPose_80x60 performs even better than RTPose_640x480 (an interesting fact also observed in prior work in the context of activity classification) and is 3.6% better than RTPose_80x60.
We have also evaluated the quality of the pseudo ground truth by running the Mask-RCNN and MSRA models on the color images from MVOR. The resulting PCK value is 76.2, showing that there still exists a gap of around 9% to be filled between the depth and color images. This may also explain the improved results of the DepthPose_80x60 model, which takes advantage of an improved architecture compared to the full-size RTPose_640x480 model. Figure 4 shows some qualitative results of the DepthPose_64x48 model. Additional qualitative comparisons are available in the supplementary material.
3.1.1 Comparative study without SR feature maps:
We also run an experiment to better understand the effect of using super-resolution. Instead of giving the baselines RTPose_80x60 and RTPose_64x48 images that are up-sampled with bicubic interpolation, we train and test these networks with images up-sampled separately using a super-resolution network. The super-resolution network corresponds to the super-resolution block trained independently using the loss L_HR. We observe in Table 1 that this procedure (SR+RTPose) improves the overall accuracy, but yields results inferior to DepthPose by 1.5% and 2.7% for 80x60 and 64x48 images, respectively. This shows that the use of intermediate SR feature maps in the pose estimation network helps to better localize keypoints. Also, SR+RTPose has the disadvantage of explicitly generating super-resolution images, the privacy compliance of which would need to be considered.
In this paper, we present an approach for high-resolution multi-person 2D pose estimation from low-resolution depth images. Our evaluation on the public MVOR dataset shows that even with a 10x subsampling of the depth images, our method achieves results equivalent to a pose estimator trained and tested on the original-size images. Furthermore, we show that by exploiting high-quality pose detections on the color images of a non-annotated RGB-D dataset, we can generate pseudo ground truth for the depth images and train a decent OR pose estimator. These results suggest the high potential of low-resolution images for scaling up and deploying privacy-preserving AI assistance in hospital environments.
This work was supported by French state funds managed by the ANR within the Investissements d’Avenir program under references ANR-16-CE33-0009 (DeepSurg), ANR-11-LABX-0004 (Labex CAMI) and ANR-10-IDEX-0002-02 (IdEx Unistra). The authors would also like to thank the members of the Interventional Radiology Department at University Hospital of Strasbourg for their help in generating the dataset.
- [1] Belagiannis, V., Wang, X., Shitrit, H.B.B., Hashimoto, K., Stauder, R., Aoki, Y., Kranzfelder, M., Schneider, A., Fua, P., Ilic, S., et al.: Parsing human skeletons in an operating room. Machine Vision and Applications 27(7), 1035–1046 (2016)
- [2] Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR. pp. 7291–7299 (2017)
- [3] Cheng, Z., Shi, T., Cui, W., Dong, Y., Fang, X.: 3D face recognition based on Kinect depth data. In: 4th International Conference on Systems and Informatics (ICSAI). pp. 555–559 (2017)
- [4] Chou, E., Tan, M., Zou, C., Guo, M., Haque, A., Milstein, A., Fei-Fei, L.: Privacy-preserving action recognition for smart hospitals using low-resolution depth images. NeurIPS-MLH (2018)
- [5] Haque, A., Guo, M., Alahi, A., Yeung, S., Luo, Z., Rege, A., Jopling, J., Downing, L., Beninati, W., Singh, A., et al.: Towards vision-based smart hospitals: A system for tracking and monitoring hand hygiene compliance. In: Proceedings of Machine Learning for Healthcare. vol. 68 (2017)
- [6] Haque, A., Peng, B., Luo, Z., Alahi, A., Yeung, S., Fei-Fei, L.: Towards viewpoint invariant 3D human pose estimation. In: ECCV. pp. 160–177. Springer (2016)
- [7] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV. pp. 2961–2969 (2017)
- [8] Kadkhodamohammadi, A., Gangi, A., de Mathelin, M., Padoy, N.: Articulated clinician detection using 3D pictorial structures on RGB-D data. Medical Image Analysis 35, 215–224 (2017)
- [9] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. pp. 740–755. Springer (2014)
- [10] Ma, A.J., Rawat, N., Reiter, A., Shrock, C., Zhan, A., Stone, A., Rabiee, A., Griffin, S., Needham, D.M., Saria, S.: Measuring patient mobility in the ICU using a novel noninvasive sensor. Critical Care Medicine 45(4), 630 (2017)
- [11] Maier-Hein, L., Vedula, S., Speidel, S., Navab, N., Kikinis, R., Park, A., Eisenmann, M., Feussner, H., Forestier, G., Giannarou, S., et al.: Surgical data science: enabling next-generation surgery. Nature Biomedical Engineering 1, 691–696 (2017)
- [12] Padoy, N.: Machine and deep learning for workflow recognition during surgery. Minimally Invasive Therapy & Allied Technologies 28(2), 82–90 (2019)
- [13] Rodas, N.L., Barrera, F., Padoy, N.: See it with your own eyes: markerless mobile augmented reality for radiation awareness in the hybrid room. IEEE Transactions on Biomedical Engineering 64(2), 429–440 (2017)
- [14] Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120 (2013)
- [15] Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: CVPR. pp. 1874–1883 (2016)
- [16] Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., Moore, R.: Real-time human pose recognition in parts from single depth images. Communications of the ACM 56(1), 116–124 (2013)
- [17] Srivastav, V., Issenhuth, T., Abdolrahim, K., de Mathelin, M., Gangi, A., Padoy, N.: MVOR: A multi-view RGB-D operating room dataset for 2D and 3D human pose estimation. In: MICCAI-LABELS workshop (2018)
- [18] Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., Padoy, N.: Multi-stream deep architecture for surgical phase recognition on multi-view RGBD videos. In: M2CAI-MICCAI workshop (2016)
- [19] Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: ECCV. pp. 466–481 (2018)
- [20] Yang, Y., Ramanan, D.: Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(12), 2878–2890 (2012)
*** Supplementary Material ***
The pictures below show additional qualitative results of our proposed models, namely DepthPose_80x60 and DepthPose_64x48, compared with the baseline models RTPose_640x480, RTPose_80x60, and RTPose_64x48. We also show the ground truth (GT) on color images for better appreciation of the qualitative results. These results show that DepthPose_80x60 and DepthPose_64x48 are better at removing false positives and spurious detections and improve the part localization (see red and green arrows in the figures).