Privacy-Preserving Action Recognition for Smart Hospitals using Low-Resolution Depth Images

11/25/2018 ∙ by Edward Chou, et al. ∙ 0

Computer-vision hospital systems can greatly assist healthcare workers and improve medical facility treatment, but often face patient resistance due to the perceived intrusiveness and violation of privacy associated with visual surveillance. We downsample video frames to extremely low resolutions to degrade private information from surveillance videos. We measure the amount of activity-recognition information retained in low resolution depth images, and also apply a privately-trained DCSCN super-resolution model to enhance the utility of our images. We implement our techniques with two actual healthcare-surveillance scenarios, hand-hygiene compliance and ICU activity-logging, and show that our privacy-preserving techniques preserve enough information for realistic healthcare tasks.




1 Introduction

Healthcare facilities and services can benefit greatly from automation and machine vision to improve patient care and outcomes. A new paradigm commonly known as “smart hospitals” yu2018smarthospitaliot ; noury2008smarthospitaldaynight ; biswas2006smarthospitalsystem ; twinanda2015smarthospitaldatadriven aims to integrate automation and machine intelligence directly into the healthcare environment by using sensor-collected data to understand and facilitate hospital procedures. A promising approach leverages the information richness of visual data, using cameras and computer vision to collect and analyze patient and healthcare worker activities sanchez2008smarthospitalactivityrecognition . Vision-based activity monitoring has been deployed to track hand-hygiene compliance haque2017handhygiene , perform activity logging in ICU facilities ma2017icumobility ; liu2018icu , and detect anomalous behaviors like falls in senior home facilities luo2018seniorhome , demonstrating the potential of such systems to reduce disease, decrease human workload, and improve patient care.

Although smart hospitals result in better healthcare services, they can also elicit distrust from patients and healthcare workers lin2016iotsmarthomeprivacy . A building filled with cameras performing constant monitoring can appear intrusive and oppressive, and the medical field as a whole is especially concerned with patient and data privacy asghar2017hipaaprivacyhealth . A visual system designed to enforce hand-hygiene compliance will invariably capture auxiliary information which could contain a patient’s identity, their medical condition, and personally embarrassing activities. Though using different data modalities such as depth sensors can alleviate some privacy concerns by avoiding RGB data zhang2012privacyfalldetection , previous works show that common depth sensors like the Kinect capture enough data to perform facial recognition cheng2017facekinectdepth .

One approach for alleviating patient concern over camera intrusiveness is to ensure that cameras capture as little privacy-sensitive information as possible. In this work, we use low-resolution depth images to remove privacy-relevant information while still retaining activity-recognition utility. We train deep learning models to perform hand-hygiene monitoring and activity-logging tasks and measure the accuracy drop due to downsampling the depth data. We then enhance the utility of our downsampled images using super-resolution techniques trained on privacy-safe data, demonstrating a realistic framework for non-intrusive healthcare monitoring.

2 Related Work

The use of extremely low-resolution images to perform activity recognition in a privacy-preserving manner has been explored in previous literature. One work miyazaki2015lowresprivacyconscious proposes low-resolution action recognition by focusing on the shape of the human head to guide body-position estimation. Inverse super-resolution (ISR) ryoo2016lowresegocentric uses a network that generates multiple low-resolution proposals and applies MCMC and entropy-measure approaches to discover the optimal transformation for action recognition. Two similar approaches ryoo2017lowressiamese ; xu2018lowresactiontwostream use two-stream neural networks to aggregate features and create a cross representation between high- and low-resolution images to learn an optimal feature mapping.

Several works also explore low-resolution facial recognition, which could be applicable to a privacy-sensitive context. One such work li2018lowresfacewild attempts to learn a common feature space between low- and high-resolution images using center regularization and GAN-based techniques. Another zangeneh2017lowresfacetwobranch uses a two-branch network to learn a cross representation between low- and high-resolution faces.

Enhancing the quality of low-resolution images is also a well-explored task, with many recent works focused especially on applying deep neural networks for super-resolution yang2018superresdeeplearning . An effective approach yamanaka2017superresskipconnection ; zhang2018superresresidualattention uses skip/residual connections to bypass abundant low-frequency information and focus the model on high-frequency information for training. One work focuses instead on compacting networks and introducing a compression component to perform super-resolution tasks with computational improvements hui2018superresinfodistillation . Another distinct approach uses adversarial training techniques to generate realistic textures and applies the technique to video data perezpellitero2018superrresphotorealistic .

Figure 1: A low-resolution camera is used to monitor a hospital room. Public training data is used to train a super-resolution model that enhances images before they are fed to an action recognition model.

3 Methods

Our framework is illustrated in Figure 1. Low-resolution videos of a hospital room are captured and enhanced with a privacy-preserving DCSCN model, and the enhanced frames are then fed to an action recognition model to perform tasks such as hand-hygiene monitoring or activity logging.

Downsampling: To simulate low-resolution capture, we apply bicubic downsampling to the original depth images at different scales. Other methods to distort images include Gaussian blurring or superpixel clustering butler2015privacyutilityrobots . Bicubic interpolation works well on continuous-tone images keys1981bicubic , making it suitable for depth images, which do not have many sharp edges. We note that we can avoid collecting high-fidelity images altogether by using low-resolution (LR) camera hardware ryoo2016lowresegocentric , which is used in other settings to reduce memory for data storage. Previous work has found that at sufficiently small image sizes there is not enough visual information to discern facial features, providing a general privacy guideline for the level of downsampling required ryoo2016lowresegocentric ; ryoo2017lowressiamese . Starting from 224×224, we downsize our images by 16x to 14×14. We also note that our dataset consists of full-body viewpoints where the face takes up only a small part of each image; we therefore also perform experiments downsampling by 4x to 56×56, which provides weaker privacy assurances.
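The downsampling step can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the authors' code: it assumes the Keys cubic convolution kernel with a = -0.5, widens the kernel by the scale factor to limit aliasing, and clamps at image borders. The function names and the synthetic 224×224 input are hypothetical.

```python
import math


def keys_kernel(x, a=-0.5):
    """Keys cubic convolution kernel; a = -0.5 is an assumed, common choice."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x ** 3 - (a + 3.0) * x ** 2 + 1.0
    if x < 2.0:
        return a * x ** 3 - 5.0 * a * x ** 2 + 8.0 * a * x - 4.0 * a
    return 0.0


def resample_1d(values, out_len):
    """Resample a 1D sequence to out_len samples with a cubic kernel."""
    in_len = len(values)
    scale = in_len / out_len
    stretch = max(scale, 1.0)  # widen the kernel when shrinking
    out = []
    for i in range(out_len):
        center = (i + 0.5) * scale - 0.5  # output pixel centre in input coordinates
        lo = math.floor(center - 2.0 * stretch)
        hi = math.ceil(center + 2.0 * stretch)
        acc, weight_sum = 0.0, 0.0
        for j in range(lo, hi + 1):
            w = keys_kernel((j - center) / stretch)
            if w == 0.0:
                continue
            clamped = min(max(j, 0), in_len - 1)  # repeat edge pixels at borders
            acc += w * values[clamped]
            weight_sum += w
        out.append(acc / weight_sum)  # normalise so flat regions stay flat
    return out


def bicubic_downsample(image, out_h, out_w):
    """Separable resampling: rows first, then columns."""
    rows = [resample_1d(row, out_w) for row in image]
    cols = [resample_1d(list(col), out_h) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


# Synthetic stand-in for a 224x224 depth frame (values are arbitrary).
depth = [[float((r + c) % 7) for c in range(224)] for r in range(224)]
low_16x = bicubic_downsample(depth, 14, 14)  # 16x reduction
low_4x = bicubic_downsample(depth, 56, 56)   # 4x reduction
```

In practice an image library's bicubic resize would normally replace this hand-written loop; the sketch only makes the kernel and the two scale factors explicit.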

Private Super-resolution: To perform super-resolution, we use a state-of-the-art DCSCN super-resolution network developed by Yamanaka et al. yamanaka2017superresskipconnection . The DCSCN is a CNN consisting of a feature extraction network and a reconstruction network, and is trained by feeding in pairs of low-resolution and high-resolution images to find the optimal super-resolution weights. We trained two DCSCN models, one on images downsampled to 14×14 and one on images downsampled to 56×56, using the open-source action recognition dataset NTU RGBD. The NTU RGBD action recognition dataset shahroudy2016nturgbd consists of 56,880 samples containing RGB and depth map videos of 60 distinct actions recorded using Microsoft Kinect v.2. We only use the depth data from the dataset to train our DCSCN models. By training DCSCN models on a public dataset disjoint from our own, we remove the privacy risk of DCSCN learning from information exclusive to our dataset while still learning a good representation for indoor action depth images.
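The construction of low-resolution/high-resolution training pairs from public data only can be sketched as follows. This is an illustrative assumption about the data preparation, not the authors' pipeline: `block_average` is a simple stand-in for the bicubic downsampler, `make_sr_pairs` is a hypothetical helper, and the synthetic 224×224 frames stand in for preprocessed public depth frames.

```python
def block_average(image, factor):
    """Stand-in downsampler (block mean); the paper uses bicubic downsampling."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(0, height - height % factor, factor):
        row = []
        for c in range(0, width - width % factor, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def make_sr_pairs(public_depth_frames, factor, downsample=block_average):
    """Build (low_res, high_res) pairs from public frames only.

    No hospital frame is ever passed in, so the super-resolution model
    cannot memorise details that exist only in the private dataset.
    """
    return [(downsample(frame, factor), frame) for frame in public_depth_frames]


# Two synthetic frames standing in for public depth data.
public_frames = [
    [[float((r * c + k) % 11) for c in range(224)] for r in range(224)]
    for k in range(2)
]
pairs_16x = make_sr_pairs(public_frames, 16)  # 14x14 inputs, 224x224 targets
pairs_4x = make_sr_pairs(public_frames, 4)    # 56x56 inputs, 224x224 targets
```

One model per downsampling factor would then be trained on the corresponding list of pairs, matching the two DCSCN models described above.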

4 Datasets

Hand Hygiene Detection: Significant efforts have been made in developing hand-hygiene compliance technologies to reduce hospital-acquired infections (HAIs) cook2009hai , which affect a large population of hospital visitors and cost the industry billions of dollars a year zimlichman2013hai . Several approaches use RFID and wearable devices granadovillar2013handhygiene with sensor-enabled soap dispensers to log hand sanitization. We focus on a vision-based approach which uses depth data and CNNs to identify dispenser usage events haque2017handhygiene . Images were collected from an acute care pediatric unit and an adult ICU at two hospitals. Depth sensors with a functional range between 0.8 and 4.0 meters were installed near alcohol-based gel dispensers, providing top-down and side views. As outlined by armellino2013handhygiene , the images are positively labelled when a person correctly follows standard hand-hygiene protocol. A group of ten annotators trained on proper hand-hygiene protocol annotated the images, with each depth image analyzed and cross-validated by one to three annotators. The data used for the classifier consists of 113,379 images, of which 11,994 contain people using the dispenser. We used a train/test split of 90/10.
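The dataset figures above imply a strong class imbalance, which a short calculation makes concrete. The sketch below also shows one way to form a 90/10 split; a seeded random shuffle over image indices is an assumption, since the text does not say how the split was drawn, and `split_indices` is a hypothetical helper.

```python
import random

TOTAL_IMAGES = 113_379    # from the text
POSITIVE_IMAGES = 11_994  # images with dispenser usage, from the text

positive_fraction = POSITIVE_IMAGES / TOTAL_IMAGES


def split_indices(n_items, train_fraction=0.9, seed=0):
    """Shuffle indices with a fixed seed and cut at train_fraction."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(TOTAL_IMAGES)
print(f"positive fraction: {positive_fraction:.1%}")
print(f"train: {len(train_idx)}, test: {len(test_idx)}")
```

Only about one image in ten is positive, which is why class balancing matters when training the classifier.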

Figure 2: From left to right: A high res. depth image, a bicubic downsampled image, and a DCSCN enhanced image. DCSCN is trained to sharpen the silhouette of the low res. image.

ICU Activity Logging: In ICUs, fine-grained and reliable recording of occurrences of patient care activities helps improve patient outcomes by enforcing protocol adherence and enabling study of how care activities correlate with care quality schweickert2009early ; team2015early . Computer vision-based approaches can help alleviate the workload on nurses and staff through automated activity detection and logging. We collect a dataset of patient care activities in a simulated ICU room. A depth sensor was used to record four mobility-related patient care activities, "getting in bed", "sitting on a chair", "getting out of bed", and "standing up from a chair", using a side view of the room. The simulation was guided by a clinician to ensure the scenarios covered the diverse patient conditions in ICUs; "getting out of bed", for instance, may involve different numbers of caretakers and vary significantly in duration. The data was examined by the clinician to ensure the activities were conducted following correct protocols. In total, 16,853 seconds of video were collected from 10 actors, comprising 316 activity instances. We use a 90/10 train/test split. A sample frame from our ICU recordings is included in Figure 1.

RGB Images: We did not obtain RGB videos for our healthcare tasks due to the privacy concerns raised by the participating hospitals and clinicians. We find that depth images are adequate for our experimental tasks, as silhouette information is enough to recognize basic actions like dispenser usage or lying in bed. However, RGB videos are critical for finer-grained activities, especially for tasks involving objects such as surgery or X-ray scans. Our proposed combination of low-resolution capture and DCSCN enhancement is compatible with, and perhaps better suited to, RGB videos; many low-resolution works use RGB datasets ryoo2016lowresegocentric ; ryoo2017lowressiamese , and DCSCN techniques were developed primarily for RGB images zhang2018superresresidualattention ; yamanaka2017superresskipconnection . We believe our proposed technique will provide a method to deploy RGB cameras in hospitals with privacy-preserving assurances, adding more capability to smart-hospital systems.

5 Experiments

We perform experiments on both the Hand-Hygiene and ICU tasks with different dimensions and enhancement settings, measuring the accuracy and AUC for each task.
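The two reported metrics can be computed as in the sketch below. It is a minimal standard-library illustration, not the authors' evaluation code: AUC is computed from its rank (Mann-Whitney) formulation with ties counted as half, and the one-vs-rest helper for per-class AUC on the ICU task is an assumption about how the per-class values were obtained.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(binary_labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pairs = sorted(zip(scores, binary_labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # tied scores share one rank
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def one_vs_rest_auc(labels, class_scores, cls):
    """AUC for one class against all others (assumed per-class protocol)."""
    binary = [1 if y == cls else 0 for y in labels]
    return roc_auc(binary, [scores[cls] for scores in class_scores])


# Toy example with made-up scores, only to show the call pattern.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_acc = accuracy(toy_labels, [1 if s >= 0.5 else 0 for s in toy_scores])
```

For the toy data, three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75, while thresholding at 0.5 also gives an accuracy of 0.75.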

Image Dim    DCSCN    Test Acc.    AUC
224×224      No       94.5%        0.987
56×56        No       96.27%       0.992
56×56        Yes      98.24%       0.995
14×14        No       92.59%       0.9735
14×14        Yes      95.87%       0.994

Table 1: Hand Hygiene Results: Results from our ResNet-50 model on the hand hygiene task, where we compare results with and without DCSCN enhancement. Surprisingly, all experiments except unenhanced 14×14 beat the original performance in terms of both test accuracy and AUC. In addition, DCSCN improves the performance for both downsampled dimensions.

Image Dim    DCSCN    Test Acc. (average)    AUC (0)    AUC (1)    AUC (2)    AUC (3)    AUC (4)
224×224      No       68.8%                  0.905      0.825      0.94       0.85       0.845
56×56        No       70.8%                  0.935      0.885      0.805      0.845      0.835
56×56        Yes      72.4%                  0.93       0.865      0.905      0.805      0.9
14×14        No       66.0%                  0.97       0.775      0.89       0.855      0.724
14×14        Yes      62.6%                  0.945      0.8        0.81       0.79       0.725

Table 2: ICU Results: Results from our ResNet-18 model on the ICU task, with columns AUC (n) where n is the ICU action class. The best test accuracy is obtained with DCSCN-enhanced images, and across all dimensions we find at most 0.12 AUC degradation and at most 4.8% test accuracy degradation.

Hand Hygiene: We train an ImageNet-pretrained Deng09imagenet ResNet-50 he2015resnet model. We include examples of downsampled images and the respective DCSCN outputs in Figure 2. Due to the large class imbalance present in our dataset, we perform data augmentation by applying random image transformations and feeding in equal numbers of positive and negative dispenser usage frames during the training phase. As Table 1 shows, scaling down the hand-hygiene dataset does not cause a significant drop in utility. We actually obtain the highest performance with the 56×56 DCSCN-enhanced images at 98.24%, and we exceed the baseline result of 94.5% with the 14×14 DCSCN-enhanced images at 95.87%, which is also slightly better than the performance presented in haque2017handhygiene . Our experiments show that basic downsampled images already preserve a practical amount of utility for each task. For the hand hygiene task, our low-resolution outputs even outperform the original images, possibly due to regularization effects that occur with downsampling.
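Feeding equal numbers of positive and negative frames during training can be sketched as a balanced batch sampler. This is one plausible reading of the balancing step, not the authors' implementation: the generator name, the batch size, and the choice to sample the minority class with replacement are all assumptions, and the random image transformations are omitted because the text does not specify them.

```python
import random


def balanced_batches(labels, batch_size, n_batches, seed=0):
    """Yield index batches holding equal numbers of positive and negative frames."""
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even to split it equally")
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    half = batch_size // 2
    for _ in range(n_batches):
        # Sampling with replacement lets the rarer class fill its half.
        batch = rng.choices(positives, k=half) + rng.choices(negatives, k=half)
        rng.shuffle(batch)
        yield batch


# Toy labels with roughly the 1:9 imbalance described in the text.
toy_labels = [1] * 10 + [0] * 90
toy_batches = list(balanced_batches(toy_labels, batch_size=8, n_batches=5))
```

Each yielded batch contains four positive and four negative indices regardless of how skewed the underlying label list is.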

ICU Logging: We use a ResNet-18 sherstinsky2018rnnlstm to process the ICU actions, using data augmentation to balance our classes. We present our results in Table 2, where classes 0-4 represent "background", "get in bed", "get out of bed", "get in chair", and "get out of chair". For the ICU task, we produce comparable or better performance relative to other works liu2018icu at 68.8% test accuracy. Table 2 shows that scaling down the ICU data also does not cause significant accuracy loss. Although DCSCN does not improve accuracy for 14×14 images and only slightly improves the accuracy for 56×56 images, it improves the AUC for several classes, such as classes 2 and 4 for 56×56 images and classes 1 and 4 for 14×14 images. In addition, Figure 2 shows that the super-resolution enhanced images are more visually interpretable and easier to annotate. Most importantly, we can see that it is visually impossible to discern any personally identifying information from a frame.
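The per-class comparison can be read directly off Table 2. The short sketch below copies the AUC values for the two downsampled dimensions from that table and lists the classes where the DCSCN-enhanced row is higher than the unenhanced one; the dictionary layout and the helper name are illustrative choices.

```python
# AUC values for classes 0-4, copied from Table 2.
TABLE_2_AUC = {
    ("56x56", "No"): [0.935, 0.885, 0.805, 0.845, 0.835],
    ("56x56", "Yes"): [0.93, 0.865, 0.905, 0.805, 0.9],
    ("14x14", "No"): [0.97, 0.775, 0.89, 0.855, 0.724],
    ("14x14", "Yes"): [0.945, 0.8, 0.81, 0.79, 0.725],
}


def improved_classes(dim):
    """Classes whose AUC is higher with DCSCN enhancement at this dimension."""
    plain = TABLE_2_AUC[(dim, "No")]
    enhanced = TABLE_2_AUC[(dim, "Yes")]
    return [cls for cls, (p, e) in enumerate(zip(plain, enhanced)) if e > p]


print(improved_classes("56x56"))  # [2, 4]
print(improved_classes("14x14"))  # [1, 4]
```

The gain for class 4 at 14×14 is only 0.001, so that improvement is marginal compared with the gains at 56×56.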

6 Conclusion

Computer vision in healthcare facilities can greatly aid patient care, as seen in tasks like hand hygiene monitoring and ICU logging, but can attract negative sentiment due to the intrusiveness of surveillance systems. By using downsampled depth images and super-resolution techniques, we can assure a high degree of privacy while preserving enough utility to perform healthcare-relevant action recognition. Our techniques are compatible with RGB images, and we plan to collect and experiment on an RGB healthcare dataset with state-of-the-art low-resolution action recognition techniques. We hope the framework we present can promote the development and acceptance of smart hospitals, and encourage more works to preserve visual privacy in the healthcare domain.