Built-in cameras have become indispensable components of mobile and wearable devices. Smaller, higher-resolution cameras support a number of services and applications such as taking photos, mobile augmented reality, and life-logging on devices like smartphones, Microsoft HoloLens, Google Glass, and Narrative Clip. The trend of embedding cameras in wearables continues to grow; one example is the smart contact lens (http://money.cnn.com/2016/05/12/technology/eyeball-camera-contact-sony/).
However, the ubiquitous presence of cameras and the ease of taking photos and recording video, along with their "always on" and "non-overt act" features, threaten individuals' rights to private or anonymous social lives, raising concerns about visual privacy. More specifically, photos and videos captured without bystanders' permission and then uploaded to social networking sites can be accessed by everyone online, potentially leading to invasion of privacy. Malicious applications on the device may also inadvertently leak captured media data (http://www.infosecurity-magazine.com/news/popular-android-camera-app-leaks/). What makes matters worse is that recognition technologies can link images to specific people, places, and things, thus revealing far more information than expected and making searchable what was not previously considered searchable [4, 5]. All these possible consequences, whether or not people have realized them, may hinder acceptance of advanced wearable consumer products. A representative example is Google Glass, which has been questioned by the US Congressional Bi-Partisan Privacy Caucus and by Data Protection Commissioners around the world over privacy risks to the public [6, 7]. They have raised serious concerns regarding the privacy of non-users/bystanders, asking "How does Google plan to prevent Google Glass from unintentionally collecting data about non-users without consent?" and "Are product lifecycle guidelines and frameworks, such as Privacy by Design, being implemented in connection with its design and commercialization?" Given these legal concerns, we expect that future wearable devices with cameras will be required to implement Privacy by Design before being released to global markets. We therefore base our research on this assumption and aim to develop technology that can enable such a requirement.
In reality, both legal and technical measures have been proposed to address privacy issues raised by unauthorized or unnoticed visual information collection. For instance, Google Glass has been banned at places such as banks, hospitals, and bars (https://www.searchenginejournal.com/top-10-places-that-have-banned-google-glass/66585/). However, prohibiting camera usage does not resolve the issue fundamentally; instead, it sacrifices people's right to capture happy moments even when no bystander is in the background. As a result, there is a growing need for technical solutions that protect individuals' visual privacy in a world where cameras are becoming pervasive. Some recent attempts use visual markers such as QR codes [8, 9] or colorful hints like hats for individuals to actively express their unwillingness to be captured. However, these visual markers suffer from similar limitations. First, people are unlikely to wear a QR code, despite the technical feasibility of these approaches. Moreover, privacy concerns vary widely among individuals, and people's privacy preferences change from time to time, following patterns that static visual markers cannot convey. In fact, what individuals are doing, with whom, and where they are, are all factors that determine whether people think their privacy should be protected. Therefore, we look for a natural, flexible, and fine-grained mechanism for people to express, modify, and control their individualized privacy preferences.
In this paper, we propose a visual privacy protection framework for individuals built on: i) personalized privacy profiles, in which people define their context-dependent privacy preferences using a set of privacy-related factors including location, scene, and others' presence; ii) face features, with which devices locate individuals who request privacy control; and iii) hand gestures, which let people interact with cameras to temporarily change their privacy preferences. Using this framework, the device automatically computes context factors, compares them with people's privacy profiles, and finally enforces privacy protection conforming to people's privacy preferences.
The rest of the paper is organized as follows: in Section II we introduce motivation and challenges of practical visual privacy protection; in Section III we provide a high–level overview of our framework; in Section IV we describe detailed design and implementation of the system; in Section V we present the evaluation results; in Section VI we discuss the related work; and finally, in Section VII we conclude the paper and discuss plans for future work.
II Practical Visual Privacy Protection
The goal of our work is to propose an in situ privacy protection approach conforming to the principle of Privacy by Design: devices with recording capability protect bystanders' visual privacy automatically and immediately according to their privacy preferences. Our work is motivated by findings from recent user studies and faces several challenges that help shape the final design.
II-A Context-dependent Personal Privacy Preferences
Recently, several studies have tried to understand people's attitudes towards the visual privacy concerns raised by pervasive cameras and the rapid development of wearable technologies. Hoyle et al. [11, 12] find that life-loggers are concerned about the privacy of bystanders, and that a combination of factors including time, location, and objects in the photo determines a photo's sensitivity. A user study conducted by Denning et al. also shows that participants' acceptance of being recorded by augmented reality glasses depends on a number of elements, including the place and what they are doing when the recording is taken. Based on the above findings, we draw the following conclusions that motivate our work:
People’s privacy concerns are dependent on context. Although location is an important factor of privacy concerns, what individuals are doing and with whom are more essential and crucial factors that directly relate to privacy.
People's privacy preferences differ from person to person; thus individuals should be able to express their own personal privacy preferences.
People’s privacy preferences may change from time to time, therefore individuals need a way to change such preferences easily.
Generally the public hold positive attitudes towards enforcing privacy protection on images to respect others’ privacy preferences.
II-B Challenges, Principles, and Limitations
Given people's context-dependent privacy concerns and the conclusions above, a practical visual privacy protection framework faces the following challenges:
The first challenge is which elements should be taken into consideration regarding context. A practical protection approach should seek a balance between granularity and representativeness of context, while also taking computational complexity into account. In our current design, we choose location, scene, and the presence of other people as the elements that define context. Location gives an approximate area within which privacy protection is requested. Scene describes a place, and usually indicates what individuals are doing. The presence of other people can be a concern when individuals want to keep their social relationships private.
The second challenge is how to inform cameras of bystanders' privacy preferences. Compared with external markers, individuals' faces are more natural and effective visual cues. To this end, people provide their face features for individual recognition and then set their personal privacy profiles by selecting the context elements they are concerned with.
The third challenge is how people can easily modify their privacy preferences. Bystanders should be able to react to cameras immediately when they find themselves being captured, without the effort of updating their profiles. Therefore, we offer hand gestures for individuals to "speak out" their privacy preferences at the moment of capture. Once a certain gesture is detected, the image is processed to retain or remove the individual's identifiable information accordingly.
The technical focus of our approach is on demonstrating the feasibility of a context-aware and interactive visual privacy protection framework through system design, implementation, and evaluation. However, like other technical solutions [14, 10, 15, 9, 16], our approach only works with compliant devices and users; any non-compliant device can still capture images without respecting people's privacy preferences. We also assume the cloud server that stores user profiles is trusted. Despite these limitations, we hope our work can motivate more complete visual privacy protection approaches and eventually be integrated into default camera subsystems.
III Cardea Overview
In this section, we first introduce some key concepts. Then we discuss major components of Cardea.
III-A Key Concepts
Bystander, Recorder, User, and Cloud
A bystander is a person who may be captured by pervasive cameras. A recorder is a person who holds a device with a built-in camera. A bystander who worries about his visual privacy can use Cardea to express his privacy preferences, and a recorder who respects bystanders' privacy preferences uses the Cardea framework to capture images; both thereby become users of the privacy protection framework. The cloud listens for recorders' requests and ensures captured images comply with registered bystanders' privacy preferences.
Privacy profiles contain context elements, a flag indicating whether hand gestures are enabled, and the protection action.
a) Context Elements
Context elements are factors that reflect people’s visual privacy concerns. Currently, we have considered location, scene, and people in the image. More elements can be included in the profile to describe people’s context–dependent privacy preferences.
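As a concrete illustration, a privacy profile could be represented as a small record like the following sketch. The field names, types, and defaults here are our own hypothetical choices, not Cardea's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PrivacyProfile:
    """Hypothetical sketch of a Cardea privacy profile."""
    user_id: str
    gestures_enabled: bool = True          # honor "Yes"/"No" hand gestures
    # (latitude, longitude, radius in meters) areas where protection applies
    locations: List[Tuple[float, float, float]] = field(default_factory=list)
    scene_groups: List[str] = field(default_factory=list)      # e.g. ["Medical care"]
    sensitive_people: List[str] = field(default_factory=list)  # ids of other users
    action: str = "blur_face"              # protection action to perform

# Example: Alice cares about sensitive scenes and being photographed with Bob.
profile = PrivacyProfile(
    user_id="alice",
    scene_groups=["Medical care", "Scantily clad"],
    sensitive_people=["bob"],
)
```

Additional context elements would simply become additional fields of this record.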
b) Gesture Interaction
We define "Yes" and "No" gestures. A "Yes" gesture is a victory sign, meaning "I would like to appear in the photo." A "No" gesture is an open palm, representing "I do not want to be photographed." Both are consistent with people's usual expectations and are commonly used in daily life to express willingness or unwillingness to be captured by others.
c) Privacy Protection Action
This action refers to the removal of identifiable information. In our implementation, we blur the user's face as an example protection action. Other methods, such as replacing the face with an average face, blurring the whole body, or blending the body region into the background, can be integrated into our framework.
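A minimal sketch of the blurring action is shown below. For illustration it uses a crude pure-NumPy box blur rather than Cardea's actual implementation; a production system would more likely call something like `cv2.GaussianBlur` on the face region:

```python
import numpy as np

def blur_region(img: np.ndarray, x: int, y: int, w: int, h: int,
                passes: int = 15) -> np.ndarray:
    """Return a copy of img with the face box (x, y, w, h) blurred.

    A crude box blur: repeatedly average each pixel in the region with
    its 4-neighborhood. `passes` controls blur strength.
    """
    out = img.astype(np.float32).copy()
    roi = out[y:y + h, x:x + w]
    for _ in range(passes):
        # Pad by edge replication so the region borders stay defined.
        pad_widths = ((1, 1), (1, 1)) + ((0, 0),) * (roi.ndim - 2)
        p = np.pad(roi, pad_widths, mode="edge")
        roi = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
    out[y:y + h, x:x + w] = roi
    return out.astype(img.dtype)
```

Pixels outside the face box are left untouched, so the rest of the photo is preserved.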
III-B System Overview
Cardea is composed of a client app and a cloud server. It relies on data exchange and collaborative computing between the client and cloud sides. The major components and interactions are shown in Figure 1.
Cardea Bystander application: Bystanders use Cardea application to register as users and define their privacy profiles.
A bystander registers through the user interface of the Cardea client application. The application automatically extracts the bystander's face features and uploads them to the cloud to train the face recognition model. After registration, a user can define his context-dependent privacy profile, which is also sent to the cloud and stored in the cloud database for future retrieval. Both features and profiles can be updated.
Cardea Recorder application: Recorders use Cardea application to take images. After capturing an image, context elements will be computed locally on the device. Meanwhile, the application detects faces and extracts face features. As hand gesture recognition is extremely computationally–intensive, the task will be offloaded to the cloud. To lower the risk of privacy leakage during transmission, detected faces will be removed from the image before sending the compressed image data and face features to the cloud.
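The pre-upload redaction step can be sketched as follows. This is a simplified stand-in for Cardea's client code (JPEG compression, e.g. via `cv2.imencode`, would follow on the redacted copy):

```python
import numpy as np

def redact_faces(img: np.ndarray, face_boxes):
    """Return a copy of img with detected face regions zeroed out.

    Raw faces never leave the device; only the compact face features
    are sent to the cloud alongside the redacted, compressed image.
    face_boxes is an iterable of (x, y, w, h) rectangles.
    """
    redacted = img.copy()
    for (x, y, w, h) in face_boxes:
        redacted[y:y + h, x:x + w] = 0
    return redacted
```

Because features are extracted locally before redaction, the cloud can still run recognition without ever seeing face pixels in transit.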
Upon receiving synthesized results from the cloud, the final privacy protection action will be performed on the image according to the computing results from both the local device and remote cloud. Details of how to make privacy protection decisions are discussed in Section IV.
Cardea Cloud Server: The cloud server serves two functions: a) storing users' profiles and training the user recognition model; and b) responding to clients' requests by recognizing users in the image and performing the corresponding tasks.
When the cloud server receives a request from the client application, it first recognizes users in the image using the face recognition model. If any user is recognized, the corresponding privacy profile is retrieved to trigger related tasks. For example, if the "Yes" or "No" gesture is enabled, the gesture recognition task starts. Finally, the computing results on the cloud are synthesized and sent back to the client.
IV Design and Implementation
The client application and cloud server are implemented on Android smartphones and a desktop computer, respectively. Next, we discuss the whole decision-making procedure, shown in Figure 2, which involves both the client application and the cloud server.
After an image is captured using Cardea, we first detect faces and extract face features. If there is any face in the image, we then input the extracted face features into the pre–trained face recognition model to recognize users. If any Cardea user is recognized, his privacy profile will be retrieved. Otherwise, we do nothing to the original image.
The profile describes whether the user enables hand gestures and the location where he is concerned about his visual privacy. The profile may also contain information about selected scenes and other people the user especially cares about. Among these factors, "Yes" and "No" gestures have the highest priority: if a gesture is detected and enabled in the user's profile, the privacy protection action is determined instantly. In this way, users can temporarily modify their privacy preferences for cases in which they are aware of being photographed, without updating their privacy profiles.
If gestures are not enabled or not detected for a user, and the image is captured within the area/location user specifies, we will start tasks of recognizing scene and finding if there is anyone in the image that the user cares about according to his privacy profile. When all the tasks are finished, the final privacy protection decision will be made and performed on the original image automatically to protect user’s visual privacy.
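The decision procedure above can be sketched as a priority cascade. Function and field names here are our own simplification of the workflow in Figure 2, not Cardea's actual code:

```python
def decide_action(profile, gesture, location_match, scene, others_present):
    """Return "protect" or "keep" for one recognized user.

    profile: dict with keys "gestures_enabled", "scene_groups",
             "sensitive_people" (a simplified stand-in for a profile).
    gesture: "yes", "no", or None (detected gesture linked to this user).
    location_match: True if the photo was taken inside the user's area.
    scene: predicted scene group of the image.
    others_present: set of recognized user ids other than this user.
    """
    # Gestures have the highest priority: they override the profile.
    if profile["gestures_enabled"] and gesture == "yes":
        return "keep"
    if profile["gestures_enabled"] and gesture == "no":
        return "protect"
    # Outside the specified area, the profile does not apply.
    if not location_match:
        return "keep"
    # Inside the area: protect if the scene or the company is sensitive.
    if scene in profile["scene_groups"]:
        return "protect"
    if others_present & set(profile["sensitive_people"]):
        return "protect"
    return "keep"
```

A "Yes" gesture thus always wins, while scene and co-presence checks only run for photos taken inside the user's declared area.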
IV-A User Recognition
User recognition identifies users in the image who request visual privacy protection.
Face Detection

Face detection locates face regions in the whole image and is the first step of all face image processing. We use the AdaBoost cascade classifier implemented in the OpenCV library to detect face regions, which runs in real time. We then filter the detected face regions using Dlib to reduce false positives.
Face Feature Extraction
Face feature extraction finds a compact yet discriminative representation that describes a face better than raw pixels. Convolutional Neural Networks (CNNs) have achieved state-of-the-art results on many computer vision tasks, including face recognition, which has been studied for years with different image processing techniques [20, 21]. Compared with conventional features such as Local Binary Patterns, features extracted from CNN models yield much better performance. We therefore use 256-dimensional CNN features extracted from a lightened CNN model, which offers a small model size, fast feature extraction, and a low-dimensional representation. The model is pre-trained on the CASIA-WebFace dataset containing face images from 10575 subjects. We use the output of the "eltwise_fc1" layer of the CNN model as face features. We also experimented with a deeper CNN model with the widely used VGG architecture; however, its computational burden is too heavy compared with the lightened CNN model, with no obvious performance improvement. We use the open source deep learning framework Caffe and its Android library to extract face features with the lightened CNN model on Android smartphones.
Face Recognition

Face recognition identifies users in the image. We train the face recognition model using a Support Vector Machine (SVM), which achieves good results without much training data. Model training is a supervised learning process whose input is face features from Cardea users. We train the SVM with LibSVM using a linear kernel, as the number of training samples from each user is smaller than the dimensionality of the input features.
With the face recognition model, we obtain a prediction for an input face feature vector: a probability for each registered user, where the highest probability indicates the most likely user. We set a probability threshold, and the user is recognized only when the highest probability exceeds this threshold. This avoids cases in which the input face feature of a non-registered bystander is mistakenly recognized as a registered user. The threshold is chosen based on the experiments described in the next section.
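The thresholded, open-set decision can be sketched as follows. The default threshold value here is an arbitrary illustrative placeholder, since Cardea's actual value is chosen experimentally in Section V:

```python
import numpy as np

def recognize(probabilities, user_ids, p_threshold=0.6):
    """Open-set face recognition decision.

    probabilities: per-user probabilities from the SVM (summing to 1).
    Returns the matched user id, or None when the best probability
    falls below the threshold (treated as a non-registered bystander).
    """
    probs = np.asarray(probabilities, dtype=float)
    best = int(np.argmax(probs))
    if probs[best] < p_threshold:
        return None
    return user_ids[best]
```

Raising the threshold rejects more bystanders at the cost of occasionally missing registered users, which is exactly the trade-off evaluated in Figure 4(b).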
IV-B Context-aware Computing
Location, scene, and people in the image are the three context elements defined in users' privacy profiles. We acquire these context elements in different ways after the image is captured; together they determine the final privacy protection action performed on the raw image.
Location provides coarse control for individuals. Users can name a concrete sensitive area in which they may have privacy worries. By specifying a location, such as a campus, the user's privacy control takes effect within that area. We obtain the location directly from GPS when the image is taken.
TABLE I: General scene groups and example scenes

| Group name | Examples |
| --- | --- |
| Shopping | clothing store, market, supermarket |
| Travelling | airport, bus station, subway platform |
| Park & street | downtown, park, street, alley |
| Eating & drinking | bar, bistro, cafeteria, coffee shop, fast-food restaurant, food court |
| Working & study | classroom, conference center, library, office, reading room |
| Scantily clad | beach, swimming pool, water park |
| Medical care | hospital room, nursing home |
| Religion | cathedral, chapel, church, temple |
| Entertainment | amusement park, ballroom, discotheque |
Scene context is a complex concept that not only relates to places but also gives clues about what people may be doing. We summarize the scene categories we consider into general scene groups, so that people can select a general group instead of listing every place they care about. We fine-tune a pre-trained CNN model to perform the scene classification. The detailed procedures are described below.
Data Preparation and Preprocessing
The data for training scene classification comes from the Places2 dataset. At the time Cardea was built, Places2 contained a large set of scene categories with millions of training images. From these categories, we choose a subset and group the chosen categories into the general scene groups listed in Table I based on their contextual similarity. The grouped scenes are either pervasive, common scenes in daily life that may contain many bystanders, or places where people are more likely to have privacy concerns. The subset retains the corresponding Places2 training images, plus a fixed number of validation images per category.
The training procedure is a standard fine-tuning process. We first extract features of all training images in the chosen categories. The features are the output of the "fc7" layer of the pre-trained AlexNet model provided by the Places2 project, used as a feature extractor. With these features, we then train a Softmax classifier over the scene categories.
We evaluate the category classifier on the validation set of the chosen categories. There is no existing benchmark result on the subset we choose, but recent benchmarks report validation accuracy on the new Places2 dataset with 365 categories. Both feature extraction and classifier training are implemented using the Caffe library [26, 30].
To get the probability of each scene group, we simply sum the probabilities of the categories in the same group. The final prediction is the group with the highest probability, which can be seen as a hard-coded clustering process.
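This aggregation step amounts to a sum over each group's member categories. The group membership below is illustrative, taken from Table I:

```python
def group_probabilities(category_probs, groups):
    """Sum per-category softmax probabilities into per-group scores.

    category_probs: dict mapping category name -> probability.
    groups: dict mapping group name -> list of member category names.
    Returns (best_group, per-group probability dict).
    """
    group_probs = {
        g: sum(category_probs.get(c, 0.0) for c in members)
        for g, members in groups.items()
    }
    best = max(group_probs, key=group_probs.get)
    return best, group_probs
```

Because probability mass spread across similar categories (e.g. bar and cafeteria) lands in the same group, the group-level decision is more robust than the category-level one.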
The group prediction accuracy of our scene group model on the validation set is higher than the category-level accuracy: as category prediction probabilities are usually distributed among similar categories belonging to the same group, group prediction results are superior to category prediction results. This is what we desire for privacy protection based on a more general scene context description.
Presence of Other People
The third context element we take into account is people in the image. For example, user Alice can upload face features of Bob, with whom Alice does not want to be photographed. The task is then to determine whether Bob also appears in an image in which Alice is captured.
A simple similarity matching suffices, as the number of potential matches (i.e., people in the image other than Alice herself) is small. We use cosine similarity as the distance metric between a pair of feature vectors. Suppose we have a set of Bob's face features and a face feature extracted from a person P in the image. We compare P's feature with each of Bob's features; if the distance does not exceed the distance threshold, we regard it as a hit. When the hit ratio over Bob's features reaches the ratio threshold, we conclude that P is Bob, and therefore Alice should be protected. The detailed face matching method is described in Algorithm 1.
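Algorithm 1 can be sketched as follows. The default threshold values are illustrative placeholders, since the actual values are determined experimentally in Section V:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def matches(query_feature, enrolled_features,
            dist_threshold=0.5, ratio_threshold=0.5):
    """Return True if the query face matches the enrolled person.

    A pairwise comparison counts as a hit when the cosine distance is
    within dist_threshold; the person matches when the fraction of hits
    over all enrolled features reaches ratio_threshold.
    """
    hits = sum(
        1 for f in enrolled_features
        if cosine_distance(query_feature, f) <= dist_threshold
    )
    return hits / len(enrolled_features) >= ratio_threshold
```

Requiring a hit ratio over several enrolled features, rather than a single nearest neighbor, makes the match robust to one or two noisy feature vectors.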
IV-C Interaction using Hand Gestures
Hand gestures are natural body language. Our goal is to detect and recognize "Yes" and "No" gestures in images. However, hand gesture recognition in images taken by regular cameras against cluttered backgrounds is a difficult task, and conventional skin color-based hand detectors fail dramatically in complex environments. In our design, we use the state-of-the-art object detection framework Faster Region-based CNN (Faster R-CNN) to train a hand gesture detector as described below.
Gesture Recognition Model Training
For the gesture recognition task, we categorize hand gestures into three classes: 1) the "Yes" gesture; 2) the "No" gesture; and 3) the "Natural" gesture, a hand in any other pose. The data used to train the gesture recognition model is composed of "Natural" gesture instances from the VGG hand dataset [32, 33], plus "Yes" and "No" gesture instances from images crawled from Google and Flickr. Each annotation consists of an axis-aligned bounding rectangle along with its gesture class.
With this gesture dataset and the VGG16 pre-trained model provided by the Faster R-CNN library, we fine-tune the "conv3_1" and upper layers, together with the region proposal layers and the Fast R-CNN detection layers; the detailed training procedure follows the original Faster R-CNN work. After a gesture is detected, we link it to the nearest face, which requires users to show their hands near their faces when using gestures.
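Linking a detected gesture to its owner can be sketched as a nearest-center search over the detected face boxes (our own simplified reading of the "nearest face" rule):

```python
def link_gesture_to_face(gesture_box, face_boxes):
    """Return the index of the face box whose center is closest to the
    gesture box's center, or None if no faces were detected.

    Boxes are (x, y, w, h) tuples; distance is squared Euclidean
    between box centers, so no square root is needed.
    """
    if not face_boxes:
        return None
    gx = gesture_box[0] + gesture_box[2] / 2
    gy = gesture_box[1] + gesture_box[3] / 2

    def sq_dist(box):
        fx, fy = box[0] + box[2] / 2, box[1] + box[3] / 2
        return (gx - fx) ** 2 + (gy - fy) ** 2

    return min(range(len(face_boxes)), key=lambda i: sq_dist(face_boxes[i]))
```

The hand-near-face requirement keeps this heuristic reliable even when several people appear in the frame.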
V Evaluation

In this section, we present evaluation results along three axes:
1) vision micro–benchmarks, to evaluate the performance of different computer vision tasks, including face recognition and matching, scene classification, and hand gesture recognition.
2) system overall performance, according to final privacy protection decisions and users’ privacy preferences.
3) runtime and energy consumption, which shows the processing time of each part, and energy consumed on one image.
V-A Accuracy of Face Recognition and Matching
We first select subjects from the LFW dataset with enough images as registered users. Note that subjects in the CASIA-WebFace database used to train the lightened CNN model do not overlap with those in LFW. For each subject, we extract face features from YouTube videos to simulate the process of user registration. These features are then divided into a training set and a validation set. In addition, we collect a user test set and a non-user test set to evaluate face recognition accuracy. The user test set is composed of face feature vectors from online images of all registered users. The non-user test set consists of face feature vectors from subjects in the LFW database whose names start with "A", serving as non-registered bystanders.
Figure 4(a) shows the recognition accuracy on the validation set and the user test set with different numbers of features per person used for training; accuracy refers to the fraction of faces that are correctly recognized. Beyond a modest number of features per person, little improvement comes from more training data, so we train the face recognition model with that amount. Figure 4(b) shows the overall accuracy as the probability threshold varies. For the non-user test set, accuracy means the fraction of faces that are not recognized as registered users; accuracy therefore increases for the non-user test set but decreases for the validation and user test sets as the threshold goes up. To ensure that registered users are correctly recognized and non-registered bystanders are not mistakenly recognized, we choose a threshold that achieves high recognition accuracy for both users and non-users.
To evaluate face matching performance, we again use the user test set. Subjects with enough features form the database group; the rest form the query group. Similar to face recognition accuracy, we break face matching accuracy into two parts: for persons in the database group, features must be matched only to the correct person; for persons not in the database group, features should not be matched to anyone. Figure 5 shows the matching accuracy with cosine distance and Euclidean distance, respectively. The results show that cosine distance achieves better performance for both groups, and the preferred distance and ratio thresholds are chosen accordingly.
Overall, the face recognition and matching methods we employ, with appropriate thresholds, can effectively and efficiently recognize users and match faces in images. More importantly, only a small number of features is needed for training the face recognition model and for the face matching algorithm.
V-B Performance of Scene Classification
We recruited volunteers and asked them to take pictures "in the wild" belonging to the general scene groups. After manually annotating these pictures, we kept the images falling into the scene groups we are interested in. We then predict the scene group of each image using the scene classification model we trained.
The number of images and the classification result of each group are shown in Figure 6(a). True positives (TP) refer to images that are correctly classified, and false negatives (FN) are those classified as other scene groups. Overall recall (TP/(TP+FN)) is high, and most scene groups reach high recall. It is worth mentioning that the recall of the groups Scantily clad (Sc), Medical care (Me), and Religion (Re) is among the highest, providing strong support for protecting users' privacy in sensitive scenes.
We also give the detailed classification confusion matrix in Figure 6(b). It shows that most FN of Eating & drinking (Ea) are classified as Shopping (Sh) or Working & study (Wo), and most FN of Wo are classified as Ea or Sh. The reason is that the boundaries between Sh, Ea, and Wo are not clear: shopping malls have food courts, and people study in coffee shops. The same reason accounts for the confusion between Park & street (Pa) and Travelling (Tr). Moreover, for scene categories such as pub and bar, people may group them into Ea or En. A safe strategy is therefore to select more scene groups, for instance both Ea and En when going to a pub at night.
In general, the evaluation results on images captured "in the wild" demonstrate that most scenes can be correctly classified; the performance is especially satisfactory for sensitive scenes.
V-C Performance of Gesture Recognition
We asked our volunteers to take pictures of each other with different distances, angles, lighting conditions, and backgrounds. We manually annotated all images with hand regions and the scene group each belongs to. In total, we collected hand gesture images containing "Yes" gestures, "No" gestures, and natural hands, covering most of the scene groups. Images that do not belong to our summarized scene groups are categorized into an additional group, Others.
Figure 7(a) and Figure 7(b) show the recall and precision for "Yes" and "No" gestures under different scene groups. The recall and precision of the "Yes" gesture, as well as the precision of the "No" gesture, are high for most scene groups, while the recall of the "No" gesture is lower on average. For Entertainment (En), the recall is low, resulting from dim illumination in most of the images we tested. In general, the performance of gesture recognition does not show any marked correlation with scene group. We therefore further investigated recall in terms of gesture size, as low recall threatens a user's privacy far more than low precision does.
Figure 7(c) plots the recall of gestures with varying hand region sizes. Each image is resized to a fixed pixel budget while keeping its aspect ratio, and hand regions are classified into the size intervals plotted along the x-axis. The results show that "Yes" gestures achieve high recall for all hand region sizes. Recall of "No" gestures, on the other hand, tends to rise with increasing hand region size, which indicates that "No" gesture recognition can be improved with more training data at smaller sizes.
In summary, the performance of gesture recognition demonstrates the feasibility of integrating gesture interaction in Cardea for flexible privacy preference modification. It performs extremely well for “Yes” gesture recognition, and there is room for improvement of “No” gesture recognition with more training samples.
V-D Overall Performance of Cardea Privacy Protection
TABLE II: Overall performance metrics: overall accuracy, protection accuracy, no-protection accuracy, face recognition accuracy, scene classification recall, and "Yes"/"No" gesture recall.
After evaluating each vision task separately, we now present Cardea's overall privacy protection performance. Faces in an image taken using Cardea end up protected (e.g., blurred) or unchanged, correctly or incorrectly, depending on protection decisions made from both the user's privacy profile and the results of the vision tasks. We asked volunteers to register as Cardea users and set their privacy profiles. The face recognition model is now trained on face feature vectors from these volunteers together with the subjects from the LFW dataset. We then took photos and examined the processed images. As we focus on face recognition rather than face detection, we keep only images in which faces have been successfully detected for evaluation.
Table III shows the final privacy protection accuracy, as well as the performance of each vision task. The protection accuracy measures whether faces that require protection are actually protected, and the no-protection accuracy whether faces that do not request protection remain unchanged. Overall, most faces are processed correctly, even though the scene classification recall and "No" gesture recall are imperfect. The reason is that Cardea's protection decision making can compensate for mistakes made in an earlier step. For example, if a user's "No" gesture is missed, his face can still be protected when the predicted scene is selected in his profile.
In summary, Cardea achieves over  accuracy for users in the real world. Improvements in each vision component will directly benefit Cardea’s overall performance in the future.
V-E Runtime and Energy Consumption
We validate the client side implementation on a Samsung Galaxy Note 4 (http://www.gsmarena.com/samsung_galaxy_note_4-6434.php), with a quad–core 2.7 GHz Krait 450 CPU, Qualcomm Snapdragon 805 chipset, 3GB RAM, and a 16 MP, f/2.2 camera. The server is configured with an Intel i7–5820K CPU, 16GB RAM, and a GeForce 980Ti graphics card (6GB RAM). The client and server communicate over a TCP connection via Wi–Fi.
Figure 8 plots the time taken for Cardea to complete the different vision tasks. We take images with 1 face, 2 faces, and 5 faces, at a size of . The images are compressed in JPEG format; on average, about  KB of data is sent. Note that some vision tasks are not triggered in certain situations, according to the decision workflow explained in Section IV. For example, if no user is recognized in the image, none of the other tasks start. For the purpose of measurement, we still activate all tasks to illustrate the runtime difference for images captured with varying numbers of faces.
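The short-circuit behavior just described (downstream tasks skipped when an upstream task finds nothing) can be sketched as a simple dependency-ordered pipeline. The task callables here are stand-ins, not Cardea's actual functions:

```python
def process_image(image, detect_faces, recognize_users,
                  classify_scene, recognize_gestures):
    """Run the vision tasks in dependency order, skipping downstream tasks
    when nothing upstream requires them. Task callables are stand-ins."""
    faces = detect_faces(image)           # always runs, on the smartphone
    if not faces:
        return image                      # no faces: nothing to protect
    users = recognize_users(faces)        # face recognition on the server
    if not users:
        return image                      # no registered user: stop early
    scene = classify_scene(image)         # runs once per image, on-device
    gestures = recognize_gestures(image)  # runs once per image, on server
    return {'users': users, 'scene': scene, 'gestures': gestures}
```

For measurement purposes the paper forces all tasks to run; in normal operation the early returns above keep the common case (no registered bystanders) cheap.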
Among all vision tasks, face processing (i.e., face detection and feature extraction) and scene classification are performed on the smartphone. Face processing takes about  milliseconds for 1 face, and increases by about  milliseconds per additional face; it therefore reaches about  and  milliseconds for 2 faces and 5 faces respectively. This is the most fundamental step, and it runs locally on the smartphone due to privacy considerations. Owing to the real–time OpenCV face detection implementation and the lightened CNN feature extraction model, it takes less than  of the overall runtime. Scene classification takes around  milliseconds per image, as it is performed only once, independent of the number of people in the image. The face recognition and matching tasks on the server side take less than  milliseconds for five people. Though this time grows with an increasing number of people, compared with the other tasks it barely affects the overall runtime. Gesture recognition also runs on the server and takes about  milliseconds. Similar to scene classification, it is performed once on the whole image regardless of the number of faces. According to our measurements, network transmission accounts for the majority of the overall runtime, due to the unstable network environment.
In general, photographers using Cardea to take pictures can get one processed image within  seconds in the heaviest case (i.e., there are  registered users who enable gestures, and scene classification is triggered on the smartphone). Compared with existing image capture platforms that also provide privacy protection, such as I-pic, Cardea offers more efficient image processing functionality.
[Table II: face recognition energy and whole–process energy (uAh), with standard deviations, and the number of images per charge, for 1 face, 2 faces, and 5 faces]
Next, we measure the energy consumption of taking pictures with Cardea on the Galaxy Note 4 using the Monsoon Power Monitor. The images are also taken at a size of  square pixels. The first two columns of Table II show the energy consumption of the face processing part alone, and of the whole process (i.e., from taking a picture to getting the processed image), with 1 face, 2 faces, and 5 faces respectively. The screen stays on during the whole process; therefore a large portion of the energy consumption is due to the always–on screen. Moreover, we observe that the face processing energy is linear in the number of faces. All other parts, including scene classification and sending and receiving data, are independent of the number of faces in the image, which is consistent with the runtime measurements.
Using these energy measurements, we also show Cardea’s capacity on the Galaxy Note 4 in the last column of Table II. The device has a  mAh battery, and can therefore capture about  high–quality images with  faces using Cardea.
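The per-charge capacity in the last column follows from dividing the battery capacity by the whole-process energy per image. A sketch with illustrative numbers (the per-image cost below is hypothetical; the specific values in Table II are not reproduced here):

```python
def images_per_charge(battery_mah, per_image_uah):
    """Estimate how many images can be captured on one battery charge.
    1 mAh = 1000 uAh; per_image_uah is the whole-process energy per image."""
    return int(battery_mah * 1000 / per_image_uah)

# Illustrative only: a 3000 mAh battery and a hypothetical whole-process
# cost of 1500 uAh per image.
print(images_per_charge(3000, 1500))  # → 2000
```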
VI Related Work
Visual privacy protection has attracted extensive attention in recent years due to the increasing popularity of mobile and wearable devices with built–in cameras. As a result, a number of research efforts have proposed solutions to address these privacy issues.
Halderman et al.  propose an approach in which co–located devices cooperatively encrypt data during recording, using short–range wireless communication to exchange public keys and negotiate an encryption key. Only by obtaining permission from everyone who encrypted the recording can one decrypt it. Jana et al. [39, 40] present methods in which third–party applications, such as perceptual and augmented reality applications, have access only to higher–level objects such as a skeleton or a face, instead of raw sensor feeds. Raval et al.  propose a system that lets users mark secure regions that the camera can access, thereby preventing cameras from capturing sensitive information. Unlike the above solutions, our work focuses on protecting bystanders’ privacy by respecting their privacy preferences when they are captured in photos.
To identify individuals who request privacy protection, advanced techniques have been applied. PriSurv  is a video surveillance system that identifies objects using RFID tags to protect their privacy. Courteous Glass  is a wearable camera integrated with a far–infrared imager that turns off recording when new persons or specific gestures are detected. However, these approaches require extra infrastructure or sensors that are not currently available on commodity devices. Our approach instead takes advantage of state–of–the–art computer vision techniques that are reliable and effective.
An interesting recent work, I–pic , allows people to broadcast their privacy preferences and appearance information to nearby devices using BLE. This work can be incorporated into our framework. Furthermore, we specify context elements that have not been considered before, such as the scene and the presence of others. Besides, we provide a convenient mechanism for people to temporarily change their privacy preferences using hand gestures when facing the camera, whereas in I-pic the broadcast data may not be received by those taking the images, or may be outdated by the time it is received.
VII Conclusion and Future Work
In this work, we designed, implemented, and evaluated Cardea, a visual privacy protection system that aims to address individuals’ visual privacy issues caused by pervasive cameras. Cardea leverages state–of–the–art CNN models for feature extraction, visual classification, and recognition. With Cardea, people can express context–dependent privacy preferences in terms of location, scene, and the presence of other people in the image. Besides, people can show “Yes” and “No” gestures to dynamically modify their privacy preferences. We demonstrated the performance of the different vision tasks on pictures “in the wild”; overall, Cardea achieves about  accuracy on users’ requests. We also evaluated the runtime and energy consumption of the Cardea prototype, which demonstrates the feasibility of running Cardea for taking pictures while respecting bystanders’ privacy preferences.
Cardea can be enhanced in several aspects. Future work includes improving scene classification and hand gesture recognition performance, compressing the CNN models to reduce overall runtime, and integrating Cardea with the camera subsystem to enforce privacy protection measures. Moreover, protecting people’s visual privacy in videos is another challenging and significant task; addressing it will make Cardea a complete visual privacy protection framework.
-  “Microsoft Hololens,” http://www.microsoft.com/miscrosoft-hololens/en-us.
-  “Google Glass,” http://www.google.com/glass/start.
-  “Narrative Clip,” http://www.getnarrative.com.
-  R. Shaw, “Recognition markets and visual privacy,” UnBlinking: New Perspectives on Visual Privacy in the 21st Century, 2006.
-  A. Acquisti, R. Gross, and F. Stutzman, “Face recognition and privacy in the age of augmented reality,” Journal of Privacy and Confidentiality, vol. 6, no. 2, p. 1, 2014.
-  “Congress’s letter,” http://blogs.wsj.com/digits/2013/05/16/congress-asks-google-about-glass-privacy.
-  “Data protection authorities’ letter,” https://www.priv.gc.ca/media/nr-c/2013/nr-c_130618_e.asp.
-  C. Bo, G. Shen, J. Liu, X.-Y. Li, Y. Zhang, and F. Zhao, “Privacy. tag: Privacy concern expressed and respected,” in Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems. ACM, 2014, pp. 163–176.
-  F. Roesner, D. Molnar, A. Moshchuk, T. Kohno, and H. J. Wang, “World-driven access control for continuous sensing,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 1169–1181.
-  J. Schiff, M. Meingast, D. K. Mulligan, S. Sastry, and K. Goldberg, “Respectful cameras: Detecting visual markers in real-time to address privacy concerns,” in Protecting Privacy in Video Surveillance. Springer, 2009, pp. 65–89.
-  R. Hoyle, R. Templeman, S. Armes, D. Anthony, D. Crandall, and A. Kapadia, “Privacy behaviors of lifeloggers using wearable cameras,” in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014, pp. 571–582.
-  R. Hoyle, R. Templeman, D. Anthony, D. Crandall, and A. Kapadia, “Sensitive lifelogs: A privacy analysis of photos from wearable cameras,” in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015, pp. 1645–1648.
-  T. Denning, Z. Dehlawi, and T. Kohno, “In situ with bystanders of augmented reality glasses: Perspectives on recording and privacy-mediating technologies,” in Proceedings of the 32nd annual ACM conference on Human factors in computing systems. ACM, 2014, pp. 2377–2386.
-  J. A. Halderman, B. Waters, and E. W. Felten, “Privacy management for portable recording devices,” in Proceedings of the 2004 ACM workshop on Privacy in the electronic society. ACM, 2004, pp. 16–24.
-  J. Jung and M. Philipose, “Courteous glass,” in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. ACM, 2014, pp. 1307–1312.
-  P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee, and T. T. Wu, “I-pic: A platform for privacy-compliant image capture,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys, vol. 16, 2016.
-  P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–154, 2004.
-  “OpenCV,” http://opencv.org/.
-  “Dlib,” http://dlib.net/.
-  W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recognition: A literature survey,” ACM computing surveys (CSUR), vol. 35, no. 4, pp. 399–458, 2003.
-  E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, “Labeled faces in the wild: A survey,” in Advances in Face Detection and Facial Image Analysis. Springer, 2016, pp. 189–248.
-  T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
-  X. Wu, R. He, and Z. Sun, “A lightened cnn for deep face representation,” arXiv preprint arXiv:1511.02683, 2015.
-  “CASIA-WebFace dataset,” http://www.cbsr.ia.as.cn/english/CASIA-WebFace-Database.html.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
-  sh1r0, “Caffe-android-lib,” https://github.com/sh1r0/caffe-android-lib.
-  C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, “Places2 dataset project,” http://places2.csail.mit.edu/index.html.
-  BVLC, “Caffe,” http://caffe.berkeleyvision.org.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  “VGG hand dataset,” http://www.robots.ox.ac.uk/~vgg/data/hands.
-  A. Mittal, A. Zisserman, and P. H. Torr, “Hand detection using multiple proposals.” in BMVC. Citeseer, 2011, pp. 1–11.
-  R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
-  “Faster R-CNN (Python implementation),” https://github.com/rbgirshick/py-faster-rcnn.
-  “LFW dataset,” http://vis-www.cs.umass.edu/lfw.
-  D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
-  “Monsoon Power Monitor,” https://www.msoon.com/LabEquipment/PowerMonitor.
-  S. Jana, A. Narayanan, and V. Shmatikov, “A scanner darkly: Protecting user privacy from perceptual applications,” in Security and Privacy (SP), 2013 IEEE Symposium on. IEEE, 2013, pp. 349–363.
-  S. Jana, D. Molnar, A. Moshchuk, A. Dunn, B. Livshits, H. J. Wang, and E. Ofek, “Enabling fine-grained permissions for augmented reality applications with recognizers,” in Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), 2013, pp. 415–430.
-  N. Raval, A. Srivastava, A. Razeen, K. Lebeck, A. Machanavajjhala, and L. P. Cox, “What you mark is what apps see,” in ACM International Conference on Mobile Systems, Applications, and Services (Mobisys), 2016.
-  K. Chinomi, N. Nitta, Y. Ito, and N. Babaguchi, “Prisurv: privacy protected video surveillance system using adaptive visual abstraction,” in International Conference on Multimedia Modeling. Springer, 2008, pp. 144–154.