Towards Privacy-Preserving Visual Recognition via Adversarial Training: A Pilot Study

07/22/2018 ∙ by Zhenyu Wu, et al. ∙ Adobe, Texas A&M University

This paper aims to improve privacy-preserving visual recognition, an increasingly demanded feature in smart camera applications, by formulating a unique adversarial training framework. The proposed framework explicitly learns a degradation transform for the original video inputs, in order to optimize the trade-off between target task performance and the associated privacy budgets on the degraded video. A notable challenge is that the privacy budget, often defined and measured in task-driven contexts, cannot be reliably indicated using any single model performance, because a strong protection of privacy has to sustain against any possible model that tries to hack privacy information. Such an uncommon situation has motivated us to propose two strategies, i.e., budget model restarting and ensemble, to enhance the generalization of the learned degradation on protecting privacy against unseen hacker models. Novel training strategies, evaluation protocols, and result visualization methods have been designed accordingly. Two experiments on privacy-preserving action recognition, with privacy budgets defined in various ways, manifest the compelling effectiveness of the proposed framework in simultaneously maintaining high target task (action recognition) performance while suppressing the privacy breach risk.


1 Introduction

Smart surveillance or smart home cameras, such as Amazon Echo and Nest Cam, are now found in millions of locations to remotely link users to their homes or offices, providing monitoring services that enhance security and/or notify users of environment changes, as well as lifelogging and intelligent services. Such a prevalence of smart cameras has reinvigorated the privacy debate, since most of them require uploading device-captured visual data to a centralized cloud for analytics. This paper seeks to explore: how can we make sure that those smart computer vision devices are only seeing the things that we want them to see (and how do we define what we want)? Is it at all possible to alleviate the privacy concerns without compromising user convenience?

At first glance, the question itself poses a dilemma: we would like a camera system to recognize important events and assist human daily life by understanding its videos, while preventing it from obtaining sensitive visual information (such as faces) that can intrude on people's privacy. Classical cryptographic solutions secure the communication against unauthorized access from attackers. However, they are not immediately applicable to preventing authorized agents (such as the backend analytics) from abusing the information, which is the source of many privacy breach concerns. The popular concept of differential privacy has been introduced to prevent an adversary from gaining additional knowledge by inclusion/exclusion of a subject, but not from gaining knowledge from the released data itself [8]. In other words, an adversary can still accurately infer sensitive attributes from any sanitized sample available, without violating any of the (proven) properties of differential privacy [18]. It thus becomes a new and appealing problem to find an appropriate transform of the collected raw visual data at the local camera end, so that the transformed data itself only enables certain target tasks while obstructing other undesired privacy-related tasks. Recently, some new video acquisition approaches [3, 9, 47] proposed to intentionally capture or process videos at extremely low resolution to create privacy-preserving "anonymized videos", and showed promising empirical results.

In contrast, we formulate privacy-preserving visual recognition in a unique adversarial training framework. The framework explicitly optimizes the trade-off between target task performance and the associated privacy budget, by learning an active degradation that transforms the video inputs. We investigate a novel way to model the privacy budget in a task-driven context. Different from standard adversarial training where two individual models compete, the privacy budget in our framework cannot simply be defined with one single model, as the ideal protection of privacy has to be universal and model-agnostic, i.e., obstructing every possible model from predicting the privacy information. To resolve this so-called "∀ challenge", we propose two strategies, i.e., restarting and ensembling the budget model(s), to enhance the generalization capability of the learned degradation in defending against unseen models. Novel training strategies and evaluation protocols are proposed accordingly. Two experiments on privacy-preserving action recognition, with privacy budgets defined in different ways, manifest the effectiveness of the proposed framework. With many problems left open and large room for improvement, we hope this pilot study will attract more interest from the community.

2 Related Work

2.1 Privacy Protection in Computer Vision

With cameras pervasive in surveillance and smart home devices, privacy-preserving visual recognition has drawn increasing interest from both industry and academia, since (1) due to their computationally demanding nature, it is often impractical to run visual recognition tasks entirely at the resource-limited local device end, so communicating (part of) the data to the cloud is indispensable; and (2) while traditional privacy concerns mostly arise from the unsecured channel between cloud and device (e.g., malicious third-party eavesdropping), customers now have increasing concerns about sharing their private visual information with the cloud (which might turn malicious itself).

A few cryptographic solutions [13, 66] were developed to locally encrypt visual information in a homomorphic way, i.e., the cryptosystems allow running basic arithmetic classifiers over encrypted data. However, encryption-based solutions incur high computational costs on local platforms, and it is also challenging to generalize such cryptosystems to more complicated classifiers.

[4] combined the detection of regions of interest with encryption techniques to improve privacy while allowing general surveillance to continue. A seemingly reasonable and computationally cheaper option is to extract feature descriptors from raw images and transmit only those features. Unfortunately, a previous study [31] revealed that considerable information about the original images can still be recovered from standard HOG or SIFT features (even though the features look visually distinct from natural images), making them vulnerable to privacy hacking too.

An alternative path toward a privacy-preserving vision system concerns the concept of anonymized videos. Such videos are intentionally captured or processed to be in special low-quality conditions that only allow for the recognition of some target events or activities, while avoiding the unwanted leak of the identity information of the human subjects in the video [3, 9, 47]. Typical examples of anonymized videos are videos made to have extremely low resolution by using low-resolution camera hardware [9], videos processed with image operations like blurring and superpixel clustering [3], or videos given cartoon-like effects with a customized version of mean shift filtering [63]. [41, 42] proposed privacy-preserving optics that filter sensitive information from the incident light-field before sensor measurements are made, via k-anonymity and defocus blur. Earlier work [23] explored privacy-preserving tracking and coarse pose estimation using a network of ceiling-mounted time-of-flight low-resolution sensors.

[58] adopted a network of ceiling-mounted binary passive infrared sensors. However, both works handled only a limited set of activities performed in specific constrained areas of the room. Later, [47] showed that even at extremely low resolutions, reliable action recognition could be achieved by learning appropriate downsampling transforms, with neither unrealistic activity-location assumptions nor extra specific hardware resources. The authors empirically verified that conventional face recognition easily failed on the generated low-resolution videos.

The usage of low-resolution anonymized videos [9, 47] is computationally cheaper, and is also compatible with sensor and bandwidth constraints. However, [9, 47] remain empirical in protecting privacy. In particular, neither were their models learned towards protecting any visual privacy, nor were the privacy-preserving effects carefully analyzed and evaluated. In other words, privacy protection in [9, 47] came as a "side product" of downsampling, and was not a result of any optimization. The authors of [9, 47] also did not extend their efforts to studying deep learning-based recognition, making their task performance less competitive.

Very recently, a few learning-based approaches have come into play to ensure better privacy protection. [53] defined a utility metric and a privacy metric for a task entity, and then designed a data sanitization function to achieve privacy while providing utility. However, they considered only simple sanitization functions such as linear projection and maximum mean discrepancy transformation. In [43], the authors proposed a game-theoretic framework between an obfuscator and an attacker, in order to hide visual secrets in the camera feed without significantly affecting the functionality of the target application. This seems to be the work most relevant to ours; however, [43] only discussed a toy task of hiding QR codes while preserving the overall structure of the image. Another relevant work [18] addressed the optimal utility-privacy tradeoff by formulating it as a min-diff-max optimization problem. Nonetheless, the empirical quantification of privacy budgets in existing works [53, 43, 18] only considered protecting privacy against one hacker model, which is insufficient, as we will explain in Section 3.1.

2.2 Privacy Protection in Social Media and Photo Sharing

User privacy protection is also a topic of extensive interest in the social media field, especially for photo sharing. The most common means to protect user privacy in an uploaded photo is to add empirical obfuscations, such as blurring, mosaicing, or cropping out certain regions (usually faces) [26]. However, extensive research has shown that such empirical means can be easily hacked too [37, 32]. A recent work [38] described a game-theoretical system in which the photo owner and the recognition model strive for the antagonistic goals of disabling/enabling recognition, and better obfuscation methods could be learned from their competition. However, it was only designed to confuse one specific recognition model, via finding its "adversarial perturbations" [36]. That can cause obvious overfitting, as simply switching to another recognition model will likely render the learned perturbations useless; such perturbations cannot even protect privacy from human eyes. Their problem setting thus deviates far from our target problem. Another notable difference is that in social photo sharing, one usually hopes to cause minimal perceptual quality loss to the photos after applying any privacy-preserving transform. The same concern does not exist in our scenario, allowing us to explore much freer, even aggressive, image distortions.

A useful resource for us is [39], which defined concrete privacy attributes and correlated them to image content. The authors categorized possible private information in images and ran a user study to understand privacy preferences. They then provided a sizable set of 22k images annotated with 68 privacy attributes, on which they trained privacy attribute predictors.

2.3 Recognition from Visually Degraded Data

To enable the usage of anonymized videos, one important challenge is to ensure reliable performance of the target tasks on those lower-quality videos, besides suppressing the undesired privacy leak. Among all low visual quality scenarios, visual recognition in low resolution is probably best studied.

[61, 28, 7] showed that low-resolution object recognition can be significantly enhanced through proper pre-training and domain adaptation. Low-resolution action recognition has also drawn growing interest: [46] proposed a two-stream multi-Siamese CNN that learns an embedding space shared by low-resolution videos downsampled in different ways, on top of which a transform-robust action classifier was trained. [6] leveraged a semi-coupled, filter-sharing two-stream network to learn a mapping between the low- and high-resolution feature spaces. In comparison, the "low-quality" anonymized videos in our case are generated by learned and more complicated degradations, rather than simple downsampling [61, 6].

3 Technical Approach

3.1 Problem Definition

Assume our training data X (raw visual data captured by the camera) are associated with a target task T and a privacy budget B. We mathematically express the goal of privacy-preserving visual recognition as below (γ is a weight coefficient):

min_{f_T, f_d}  L_T(f_T(f_d(X)), Y_T) + γ · L_B(f_d(X))        (1)

where f_T denotes the model that performs the target task T on its input data. Since T is usually a supervised task, e.g., action recognition or visual tracking, a label set Y_T is provided on X, and a standard cost function L_T (e.g., softmax cross-entropy) is defined to evaluate the task performance. On the other hand, we need to define a budget cost function L_B to evaluate the privacy leak risk of its input data: the larger L_B, the higher the privacy leak risk. Our goal is to seek an active degradation function f_d that transforms the original X into the common input f_d(X) for both T and B, such that:

  • The target task performance is minimally affected compared to using the raw data, i.e., min_{f_T} L_T(f_T(f_d(X)), Y_T) ≈ min_{f_T} L_T(f_T(X), Y_T).

  • The privacy budget is greatly suppressed compared to raw data, i.e., L_B(f_d(X)) ≪ L_B(X).

The definition of the privacy budget cost L_B is not straightforward. Practically, it needs to be placed in concrete application contexts, often in a task-driven way. For example, in smart workplaces or smart homes with video surveillance, one might often want to avoid a disclosure of the face or identity of persons. Therefore, reducing L_B could be interpreted as suppressing the success rate of identity recognition or verification on the transformed video f_d(X). Other privacy-related attributes, such as race, gender, or age, can be handled similarly. We denote the privacy-related annotations (such as identity labels) as Y_B, and rewrite L_B(f_d(X)) as L_B(f_b(f_d(X)), Y_B), where f_b denotes the budget model that predicts the corresponding privacy information. Different from L_T, minimizing L_B will encourage f_b(f_d(X)) to diverge from Y_B as much as possible.

Such a supervised, task-driven definition of L_B poses at least two-fold challenges: (1) the privacy budget-related annotations Y_B are often less available than target task labels; in particular, it is often challenging to have both Y_T and Y_B ready on the same X; (2) considering the nature of privacy protection, it is not sufficient to merely suppress the success rate of one f_b model. Instead, defining a privacy prediction function family P (all possible models f_b that map f_d(X) to privacy predictions), the ideal privacy protection of f_d(X) should suppress every possible model from P. That diverges from the common supervised training goal, where one only needs to find one model that successfully fulfills the target task. We re-write the general form (1) with the task-driven definition of L_B:

min_{f_T, f_d}  L_T(f_T(f_d(X)), Y_T) + γ · max_{f_b ∈ P} L_B(f_b(f_d(X)), Y_B)        (2)

For the solved f_d, two goals should be simultaneously satisfied: (1) there exists ("∃") at least one function f_T that can predict Y_T from f_d(X) well; (2) for all ("∀") functions f_b ∈ P, none of them (even the best one) can reliably predict Y_B from f_d(X). Most existing works chose an empirical f_d (e.g., simple downsampling) and only solved min_{f_T} L_T(f_T(f_d(X)), Y_T) [9, 61]. [47] essentially solved min_{f_d, f_T} L_T(f_T(f_d(X)), Y_T) to jointly adapt f_d and f_T, after which the authors empirically verified the effect of f_d on L_B (defined as face recognition error rates). Those approaches lack explicit optimization towards privacy budgets and thus have no guaranteed privacy-protection effects.

Comparison to Standard Adversarial Training

The most notable difference between (2) and existing works based on standard adversarial training [43, 38] lies in whether the adversarial perturbations are optimized for "fooling" one specific f_b, or all possible f_b's. We believe the latter to be necessary, as it accounts for the generalization needed to suppress unseen privacy breaches. Moreover, most existing works seek perturbations with minimal human visual impact, e.g., by enforcing a norm constraint in the pixel domain. That is clearly unaligned with our purpose. In fact, our model could be viewed as minimizing the perturbation in the (learned) feature domain of the target utility task.

3.2 Basic Framework

Overview. Figure 1 depicts a model architecture implementing the proposed formulation (2). It first takes the original video data X as the input and passes it through the active degradation module f_d to generate the anonymized video f_d(X). During training, the anonymized video simultaneously goes through a target task model f_T and a privacy prediction model f_b. All three modules, f_d, f_T and f_b, are learnable and can be implemented by neural networks. The entire model is trained under the hybrid loss of L_T and L_B. By tuning the entire pipeline end to end, f_d will find the optimal task-specific transformation, to the advantage of the target task but to the disadvantage of privacy breach, fulfilling the goal of privacy-preserving visual recognition. After training, we can apply the learned active degradation at the local device (e.g., the camera) to convert incoming video to its anonymized version, which is then transmitted to the backend (e.g., the cloud) for target task analysis.
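As a concrete illustration of this pipeline, the following is a minimal PyTorch-style sketch of how f_d, f_T and f_b could be wired together for one forward pass. The class, tensor handling, and frame-wise budget aggregation are illustrative assumptions only, not the exact architectures or code used in Section 4.

```python
import torch.nn as nn


class PrivacyPreservingPipeline(nn.Module):
    """Wires the three learnable modules: f_d (degradation), f_T (target), f_b (budget)."""

    def __init__(self, f_d: nn.Module, f_T: nn.Module, f_b: nn.Module):
        super().__init__()
        self.f_d, self.f_T, self.f_b = f_d, f_T, f_b

    def forward(self, video):
        # video: (batch, channels, time, height, width)
        b, c, t, h, w = video.shape
        frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        degraded_frames = self.f_d(frames)                            # frame-wise "anonymization"
        degraded = degraded_frames.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        target_logits = self.f_T(degraded)                            # e.g., action classes (video CNN)
        budget_logits = self.f_b(degraded_frames)                     # e.g., identity, per frame
        budget_logits = budget_logits.reshape(b, t, -1).mean(dim=1)   # average-pool over frames
        return degraded, target_logits, budget_logits
```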

Figure 1: Basic adversarial training framework for privacy-preserving visual recognition.

The proposed framework leads to an adaptive and end-to-end manageable pipeline for privacy-preserving visual recognition. Its methodology is related to the emerging research on feature disentanglement [64]. That technique leads to non-overlapping groups of factorized latent representations, each of which properly describes information corresponding to particular attributes of interest. It was previously applied to generative models [10, 51] and reinforcement learning [20].

Similar to GANs [16] and other adversarial models, our training is prone to collapse and/or bad local minima. We thus propose a carefully designed training algorithm with a three-module alternating update strategy, explained in the supplementary, which could be interpreted as a three-party game. In principle, we strive to prevent any of the three modules f_d, f_T, and f_b from changing "too quickly", and thus keep monitoring L_T and L_B to decide which of the three modules to update next.

Choices of f_d, f_T and f_b. The choices of the three modules significantly impact the performance. As [47] pointed out, f_d can be constructed as a nonlinear mapping by filtering. The form of f_d can be flexible, and its output does not need to be a natural image. For simplicity, we choose f_d to be a "learnable filtering" in the form of a 2-D convolutional neural network (CNN), whose output f_d(X) is a 2-D feature map of the same resolution as the input video frame. Such a choice is only to facilitate the initial concatenation of building blocks, e.g., f_T and f_b often start from models pre-trained on natural images. Besides, f_d(X) should preferably be compact and light to transmit, considering it will be sent to the cloud through (limited-bandwidth) channels.
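For concreteness, such a "learnable filtering" f_d could be as small as a few convolution layers that keep the input resolution; the toy module below is only an illustrative placeholder (the experiments in Section 4 instead adopt the image transformation network of [24] as f_d).

```python
import torch.nn as nn


class DegradeNet(nn.Module):
    """Toy 'learnable filtering' f_d: a small fully convolutional net whose output keeps the input size."""

    def __init__(self, channels: int = 3, width: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, kernel_size=3, padding=1),  # back to a 3-channel frame-like map
        )

    def forward(self, x):
        # x: (batch, 3, H, W) -> (batch, 3, H, W), same resolution as the input frame
        return self.body(x)
```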

To ensure the effectiveness of f_d, it is necessary to choose sufficiently strong f_T and f_b models and let them compete. We employ state-of-the-art video recognition CNNs for the corresponding tasks, and adapt them to the degraded input using the robust pre-training strategy proposed in [61].

Particular attention should be paid to the budget cost (second term) defined in (2), which we refer to as "the ∀ challenge": if we use an f_b with some pre-defined CNN architecture, how can we be sure that it is the "best possible" privacy prediction model? That is to say, even if we are able to find a function f_d that manages to fail one f_b model, is it possible that some other f_b would still be able to predict Y_B from f_d(X), thus leaking privacy? While it is computationally intractable to exhaustively search over P, a naive empirical solution would be to choose a very strong privacy prediction model as f_b, hoping that an f_d that can confuse this strong one will fool other possible functions as well. However, the resulting f_d may still overfit the artifacts of one specific f_b and fail to generalize. Section 3.3 will introduce two more advanced and feasible recipes.

Choices of L_T and L_B. Without loss of generality, we assume both the target task and privacy prediction to be classification models that output class labels. To optimize the target task performance, L_T can simply be chosen as the KL divergence between the predicted class distribution f_T(f_d(X)) and the ground-truth label Y_T (i.e., the standard cross-entropy loss for one-hot labels).

Choosing L_B is non-standard and tricky, since minimizing the privacy budget must enlarge the divergence between f_b(f_d(X)) and Y_B. One possible choice is the negative KL divergence between the predicted class vector and the ground-truth label, but minimizing a concave function causes severe numerical instabilities (often explosions). Instead, we use the negative entropy of the predicted class vector and minimize it, which encourages "uncertain" predictions. Meanwhile, we use the standard supervised classification loss (cross-entropy against Y_B) when training f_b itself, to ensure a sufficiently strong f_b at initialization (see 4.1.2); this supervised loss also plays a critical role in model restarting (see 3.3).
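A minimal sketch of these loss choices, assuming plain logits as inputs; the function names are ours and the exact loss wiring in the released implementation may differ:

```python
import torch.nn.functional as F


def target_loss(target_logits, y_target):
    # L_T: standard cross-entropy (KL to the one-hot label) for the target task.
    return F.cross_entropy(target_logits, y_target)


def budget_suppression_loss(budget_logits):
    # Negative entropy of f_b's predicted class distribution. Minimizing this w.r.t. f_d
    # pushes f_b towards maximally "uncertain" (near-uniform) predictions.
    log_p = F.log_softmax(budget_logits, dim=1)
    entropy = -(log_p.exp() * log_p).sum(dim=1).mean()
    return -entropy


def budget_supervised_loss(budget_logits, y_privacy):
    # Used only when (pre-)training or restarting f_b itself, to keep it a strong competitor.
    return F.cross_entropy(budget_logits, y_privacy)
```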

3.3 Addressing the ∀ Challenge

To improve the generalization of the learned f_d over all possible f_b ∈ P (i.e., so that privacy cannot be reliably predicted by any model), we hereby discuss two simple, easy-to-implement options. More sophisticated model re-sampling or model-search approaches, e.g., [68], will be explored in future work.

Budget Model Restarting. At certain points of training (e.g., when the privacy budget stops decreasing any further), we replace the current weights in f_b with random weights. Such random re-starting aims to avoid trivial overfitting between f_d and f_b (i.e., f_d being specialized only at confusing the current f_b), without incurring more parameters. We then train the new f_b to be a strong competitor w.r.t. the current f_d: specifically, we freeze the training of f_d and f_T, and switch the loss for f_b to the supervised classification loss on Y_B, until the new f_b has been trained from scratch into a strong privacy prediction model over the current f_d(X). We then resume adversarial training by unfreezing f_d and f_T, and switching the loss used to update f_d back to the negative entropy. This procedure can be repeated several times.

Budget Model Ensemble. The other strategy approximates the continuous P with a discrete set of sample functions. Assuming a budget model ensemble {f_b^i, i = 1, ..., M}, we turn to minimizing the following discretized surrogate of (2):

min_{f_T, f_d}  L_T(f_T(f_d(X)), Y_T) + γ · max_{i ∈ {1, ..., M}} L_B(f_b^i(f_d(X)), Y_B)        (3)

At each iteration (mini-batch), minimizing (3) only suppresses the budget model with the largest cost, i.e., the one "most confident" about its current privacy prediction. The previous basic framework is a special case of (3) with M = 1. The ensemble strategy can easily be combined with re-starting.
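A sketch of the discretized surrogate (3): evaluate the suppression term for every budget model in the ensemble and penalize only the worst (most confident) one. It reuses the loss helpers sketched above; the weight gamma is a placeholder hyper-parameter.

```python
import torch


def surrogate_objective(degraded, target_logits, y_target, budget_models, gamma):
    # First term of (3): target task loss L_T.
    loss = target_loss(target_logits, y_target)
    # Second term of (3): the maximum budget cost over the M ensemble members,
    # i.e., only the most confident privacy predictor is suppressed at this step.
    budget_costs = torch.stack(
        [budget_suppression_loss(f_b(degraded)) for f_b in budget_models])
    return loss + gamma * budget_costs.max()
```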

3.4 Two-Fold Evaluation Protocol

Apart from the training data X, assume we have an evaluation set X_e, accompanied with both target task labels Y_T^e and privacy annotations Y_B^e. Our evaluation is significantly more complicated than in classical visual recognition problems. After applying the learned active degradation, we need to examine two folds: (1) whether the learned target task model maintains satisfactory performance; (2) whether the performance of an arbitrary privacy prediction model deteriorates. The first one follows the standard routine: apply the learned f_d and f_T to X_e, and compute the classification accuracy A_T by comparing f_T(f_d(X_e)) with Y_T^e: the higher the better.

For the second evaluation, it is insufficient to only observe that the learned f_d and f_b lead to poor classification accuracy on X_e, because of the ∀ challenge. In other words, f_d needs to generalize not only in the data space, but also w.r.t. the model space. To empirically verify that f_d prohibits reliable privacy prediction by other possible models, we propose a novel procedure: we first re-sample a different set of N models {f'_b^j, j = 1, ..., N} from P; none of them overlap with the budget models used in training. We then train each of them to predict privacy information over the degraded training data f_d(X), i.e., minimizing the supervised classification loss of f'_b^j(f_d(X)) w.r.t. Y_B, for j = 1, ..., N. Eventually, we apply each trained f'_b^j together with f_d on X_e and compute the classification accuracy A_b^j for the j-th model. The highest accuracy achieved among the N models on X_e, denoted as A_b^N, is by default used to indicate the privacy protection capability of f_d: the lower the better.
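In pseudocode, the second fold of this protocol amounts to re-training N freshly sampled privacy models on the degraded training data (with f_d frozen) and reporting the best accuracy any of them reaches on the degraded evaluation set. The helper names below (sample_fresh_models, train_classifier, accuracy) are placeholders for standard routines:

```python
def evaluate_privacy_protection(f_d, train_set, eval_set,
                                sample_fresh_models, train_classifier, accuracy, N=10):
    """Returns A_b^N: the best privacy-prediction accuracy reached by any of N unseen models."""
    best_acc = 0.0
    for f_b_new in sample_fresh_models(N):        # models never used during adversarial training
        # Re-train the fresh model on degraded training data; f_d stays frozen throughout.
        f_b_new = train_classifier(f_b_new, [(f_d(x), y_b) for x, y_b in train_set])
        # Test it on the degraded evaluation set and keep the highest accuracy.
        best_acc = max(best_acc, accuracy(f_b_new, [(f_d(x), y_b) for x, y_b in eval_set]))
    return best_acc                               # lower means better privacy protection
```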

4 Experiments

We present two experiments on privacy-preserving action recognition as proof-of-concepts for our proposed general framework. For the target task, we choose video-based action recognition, because it is a highly demanded feature in many smart home and smart workplace applications. The definition of privacy varies by context, and we study two settings: (1) avoiding the leak of the identities of the persons present in the videos; and (2) avoiding the leak of multiple privacy attributes, e.g., the crowdsourced attributes studied in [39]. We emphasize that the proposed framework (2) is general and can accommodate a much wider variety of target task and privacy information combinations.

4.1 Identity-Preserving Action Recognition on SBU

Problem Setting

The SBU Kinect Interaction Dataset [67] is a two-person interaction dataset for video-based action recognition, with 8 types of actions and 13 different actor pairs annotated. We define action recognition as the target task T, and the privacy budget task as reducing the correct identification rate of the actor pairs in the same video. We note that this trade-off is highly challenging to achieve. As can be seen from the first table in the supplementary, the actor pair recognition task easily achieves over 98% accuracy on the original dataset and remains robust even when the frames are downsampled 28 times, while the action recognition performance already starts to deteriorate significantly. We compare the following five methods:


  • Method 1 (naive downsampling): using raw RGB frames under different downsampling rates s.

  • Method 2 (proposed w/o re-starting): applying the proposed adversarial training to RGB frames, using a budget model ensemble of size M without restarting.

  • Method 3 (proposed): applying the proposed adversarial training to RGB frames, using a budget model ensemble of size M with restarting.

  • Method 4: detecting and cropping out faces from RGB frames.

  • Method 5: detecting and cropping out whole actor bodies from RGB frames.

Method 1 follows [47], while Methods 4 and 5 are inspired by [26].

Implementation Details

We segment video sequences into groups of 16 frames and use those frame groups as our default input data X. We use the C3D net [60] as the default action recognition model, i.e., f_T. For the identity recognition model f_b, we choose MobileNet [21] to identify the actor pair in each frame, and use average pooling to aggregate the frame-wise predictions. The active degradation module f_d adopts the image transformation network in [24].

The privacy budget is defined to suppress the identity (actor pair) recognition performance on SBU. We first initialize the active degradation module f_d to reconstruct the input. We next take the pre-trained C3D net, concatenate it with f_d, and jointly train them for action recognition on the SBU dataset to initialize f_T. We then freeze them both, and start initializing f_b (MobileNet) for the actor pair identification task by adapting it to the output of the currently trained f_d. Experiments show that such initializations provide robust starting points for the follow-up adversarial training. If budget model restarting is adopted, we "re-start" MobileNet from random initialization every 100 iterations. The number M of ensemble budget models varies from 1 to 18. Different budget models are obtained by setting different depth-multiplier parameters [21] of MobileNet.
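The "initialize f_d as reconstruction of the input" step can be read as a short warm-up stage that makes f_d(X) ≈ X before any adversarial updates; a hedged sketch (the optimizer, learning rate, and use of an L1 loss here are illustrative assumptions):

```python
import torch
import torch.nn.functional as F


def pretrain_degradation_as_reconstruction(f_d, frame_loader, epochs=1, lr=1e-4):
    # Warm-start f_d so that its output stays close to the input frames, giving the
    # pre-trained f_T and f_b a natural-image-like input at the start of adversarial training.
    opt = torch.optim.Adam(f_d.parameters(), lr=lr)
    for _ in range(epochs):
        for frames in frame_loader:                  # frames: (batch, 3, H, W)
            opt.zero_grad()
            loss = F.l1_loss(f_d(frames), frames)    # reconstruction objective
            loss.backward()
            opt.step()
    return f_d
```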

Evaluation Procedure

We follow the procedure described in Section 3.4 for two-fold evaluation on the SBU testing set. For the set of models used in the privacy-protection examination, we sample N = 10 popular image classification CNNs, a list of which can be found in the supplementary. Among them, 8 models start from ImageNet-pretrained versions, including MobileNet (different from those used in training) [21], ResNet [19] and Inception [55]. To eliminate the possibility that the initialization might prohibit privacy prediction, we also intentionally try another 2 models trained from scratch (random initialization). We did not choose any non-CNN image classification model for two reasons: (1) CNNs have state-of-the-art performance and strong fitting capability when re-trained; (2) most non-CNN image classification models rely on effective feature descriptors that are designed for natural images. Since f_d(X)/f_d(X_e) are no longer natural images, the effectiveness of such models is in jeopardy too.

Results and Analysis
Figure 2: Target and Budget Task Performance Trade-off on SBU Dataset.

We present an innovative visualization in Figure 2 to display the trade-off between the action recognition accuracy A_T and the actor pair recognition accuracy A_b^N in an easy-to-interpret way. All accuracy numbers for both task evaluations can be found in the supplementary. To read the figure, note that a desirable trade-off should incur minimal loss of A_T (y-axis) when reducing A_b^N (x-axis). Therefore, a point closer to the upper left corner denotes a more desirable model that achieves a better trade-off, since it incurs less utility performance loss (larger A_T) while suppressing even the best of the unseen privacy prediction models more (smaller A_b^N). For Method 1, a larger marker (circle) size represents a larger downsampling rate s. For Methods 2 and 3, a larger marker (star) size denotes more budget models used in the ensemble (i.e., larger M). Methods 4 and 5 each give a single point. Observations can be summarized below:

  1. Methods 2 and 3 have obvious advantages over naive downsampling (Method 1), achieving a more compelling trade-off between A_T and A_b^N.

  2. Method 3 achieves much better individual model performance than Method 2, showing the significance of re-starting in overcoming over-fitting.

  3. For Method 3 (and similarly for Method 2), adding more models to the budget model ensemble (i.e., increasing M) consistently leads to points closer to the desired upper left corner (despite small randomness caused by training). The results show that the proposed budget model ensemble technique effectively alleviates the ∀ challenge, making f_d generalize better to unseen f_b's.

  4. Neither Method 4 nor Method 5 performs competitively. Cropping out faces adds little protection of identity privacy (its A_b^N is almost the same as using raw RGB frames without downsampling), since body shape and clothing styles also reveal significant amounts of identity information. Cropping out whole bodies not only removes identity information, but also deteriorates action recognition.

In brief, Method 3 (the proposed, with both restarting and ensemble) achieves the most compelling results on SBU, the best (M = 18) reaching A_T = 82.3% and A_b^N = 41.8%, without relying on extra hardware (e.g., a depth camera). A few visualized examples of the learned f_d(X) can be found in Figure 7.

4.2 Protecting Multiple Privacy Attributes on UCF-101/VISPR

Problem Setting

The UCF101 dataset [54] is a more challenging action recognition benchmark with 101 different action classes and much richer scenes than SBU. The newly released Visual Privacy (VISPR) dataset [39] is a collection of 22,167 images annotated with 68 privacy attributes, e.g., semi-nudity, hobbies, face, race, gender, skin color, and so on.

Neither alone can be used for our adversarial training, because neither dataset has the pair of competing tasks/labels. However, we show the possibility of jointly utilizing UCF101 and VISPR to train one f_d function, such that it simultaneously enables effective action recognition (as measured on UCF-101) and suppresses multiple privacy attribute prediction (as measured on VISPR), via a novel cross-dataset training and evaluation strategy.

Cross-Dataset Training and Evaluation

f_d and f_T are the same as in the SBU experiment. Regarding f_b for the task of multiple privacy attribute prediction, we adopt budget models from {ResNet-V1-50, ResNet-V2-50, MobileNet-1.0, MobileNet-0.75} and vary the ensemble size M from 1 to 4. They are trained with a multi-label classification loss with sigmoid activations, identically to [39]. γ is chosen as 0.5. Rather than recognition accuracy, multiple privacy attribute prediction is evaluated by Class-based Mean Average Precision (C-MAP), whose definition can be found in [39].
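For this multi-attribute setting, the supervised budget loss becomes a multi-label sigmoid cross-entropy, and evaluation uses class-based MAP. Below is a hedged sketch; the cMAP helper uses scikit-learn's average_precision_score and may differ in details from the exact protocol of [39].

```python
import numpy as np
import torch.nn.functional as F
from sklearn.metrics import average_precision_score


def multilabel_budget_supervised_loss(attr_logits, attr_targets):
    # attr_targets: float tensor of 0/1 flags, one entry per privacy attribute.
    return F.binary_cross_entropy_with_logits(attr_logits, attr_targets)


def class_based_map(scores: np.ndarray, labels: np.ndarray) -> float:
    # scores, labels: (num_images, num_attributes); average the AP over attribute classes
    # that have at least one positive example.
    aps = [average_precision_score(labels[:, k], scores[:, k])
           for k in range(labels.shape[1]) if labels[:, k].any()]
    return float(np.mean(aps))
```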

The major hassle is the unavailability of Y_T and Y_B on the same dataset: since UCF-101 does not have privacy attributes annotated, we cannot directly perform adversarial training and evaluate privacy protection on it; similarly for VISPR. We notice that [39] trained their model on VISPR to be a privacy predictor for general images. We also visually observe that the VISPR model can correctly detect privacy attributes occurring in UCF-101 videos (examples in the supplementary). Therefore, we hypothesize that the privacy attributes have good "transferability" between UCF-101 and VISPR, and hence use a privacy prediction model trained on VISPR to assess the privacy leak risk on UCF-101.

Instead of using all 68 attributes in [39], we find that many of them rarely appear in UCF-101 (shown in the supplementary). We thus create two subsets for training and evaluating budget models: the VISPR-17 set consists of the 17 attributes that occur most often in UCF-101 together with their associated images in VISPR; the VISPR-7 set is a further subset of VISPR-17, including the 7 privacy attributes (out of the 17) that are most common in smart home settings. Their attribute lists are in the supplementary.

During training, we have two pipelines: one is f_d + f_T, trained on UCF-101 for action recognition; the other is f_d + f_b, trained on VISPR to suppress multiple privacy attribute prediction. The two pipelines share the same parameters for f_d. The initialization and alternating training strategy remain unchanged from SBU. During evaluation, we perform the first part of the two-fold evaluation, i.e., action recognition, on the UCF-101 testing set. We then evaluate the N-model examination of privacy protection on the VISPR-17/7 testing sets. Such cross-dataset training and evaluation sheds light on new possibilities for training privacy-preserving recognition models, even under the practical shortage of datasets annotated for both tasks.
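Schematically, one joint update with the shared f_d could look like the sketch below; the batch formats, the apply_framewise helper, and the optimizer split are assumptions, and the loss helpers reuse the earlier sketches (for multi-label attributes the suppression term would use a per-attribute binary entropy rather than the softmax entropy).

```python
import torch


def cross_dataset_step(f_d, f_T, budget_models, ucf_batch, vispr_batch,
                       opt_util, opt_degrade, gamma=0.5):
    # Pipeline 1 (UCF-101): f_d + f_T, utility loss on action labels.
    clips, action_labels = ucf_batch
    util_loss = target_loss(f_T(apply_framewise(f_d, clips)), action_labels)

    # Pipeline 2 (VISPR): f_d + budget ensemble, suppress the most confident attribute predictor.
    images, _ = vispr_batch
    degraded_images = f_d(images)
    budget_cost = torch.stack(
        [budget_suppression_loss(f_b(degraded_images)) for f_b in budget_models]).max()

    # Both pipelines back-propagate into the SAME f_d parameters.
    loss = util_loss + gamma * budget_cost
    opt_util.zero_grad(); opt_degrade.zero_grad()
    loss.backward()
    opt_util.step(); opt_degrade.step()
    return loss.item()
```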

Figure 3: Performance Trade-off on UCF-101/VISPR. The left panel is for VISPR-17 and the right panel for VISPR-7.
Results and Analysis

We choose Methods 1, 2, and 3 for comparison, defined the same as in SBU. All quantitative results, as well as visualized examples of f_d(X) on UCF-101, are shown in the supplementary. Similarly to the SBU case, simply downsampling video frames (even with the aid of super-resolution, as we tried) does not lead to any competitive trade-off between action recognition (on UCF-101) and privacy prediction suppression (on VISPR). As shown in Figure 3, our proposed adversarial training again leads to more favorable trade-offs on VISPR-17 and VISPR-7, with the major conclusions concurring with SBU: both ensemble and restarting help f_d generalize better against privacy breach.

Figure 5: Example frames after applying the learned degradation on SBU. Panels: (a) Method 2, M=1; (b) Method 2, M=4; (c) Method 2, M=8; (d) Method 2, M=14; (e) Method 3, M=1; (f) Method 3, M=4; (g) Method 3, M=8; (h) Method 3, M=14.

5 Limitations and Discussions

As noted by one anonymous reviewer, a possible alternative to avoid leaking visual privacy to the cloud is to perform action recognition completely at the local device. In comparison, our proposed solution is motivated in at least three ways: i) for a single utility task (which is not limited to action recognition), running f_d on the device is much more compact and efficient than running a full f_T. For example, our f_T model (11-layer C3D net) has over 70 million parameters, while f_d is a much more compact 3-layer CNN with 1.3 million parameters. At inference, the total time cost of running f_T over the SBU testing set is 45 times more than running f_d. Keeping f_T in the cloud also facilitates upgrading to more sophisticated models; ii) the smart home scenario calls for scalability to multiple utility tasks (computer vision functions). It is not economical to load all utility models on the device. Instead, we can train one f_d to work with multiple utility models, and only store and run f_d at the device. More utility models (if they do not overlap with privacy) could possibly be added in the cloud by training on f_d(X); iii) we further point out that the proposed approach can have a wider practical application scope beyond smart home, e.g., de-identified data sharing.

The current pilot study is preliminary in many ways, and there is large room for performance improvement before practical usefulness is achieved. First, the definition of L_B and P is core to the framework. Considering the ∀ challenge, the current budget model ensemble is a rough discretized approximation of P. More elegant ways to tackle this optimization can lead to further breakthroughs in universal privacy protection. Second, adversarial training is well known to be difficult and unstable. Improved training tricks, such as [48], will be useful. Third, the lack of related benchmark datasets, on which the target task and privacy budget are both appropriately defined, has become a bottleneck. We see that more concrete and precise privacy definitions, such as VISPR attributes, can certainly result in better feature disentanglement and better target-privacy performance trade-offs. The current cross-dataset training and evaluation partially alleviates the absence of dedicated datasets. However, the inevitable domain mismatch between two datasets can still hinder the performance. We plan to use crowdsourcing to identify and annotate privacy-related attributes on existing action recognition or other benchmarks, which we hope will help promote this research direction.

References

Appendix A Adversarial Training Algorithm

Algorithm 1 outlines a complete and unified adversarial training algorithm using the ensemble of budget models, with restarting. If we choose M = 1 and skip the restarting step, it reduces to the basic adversarial training framework.

The algorithm can also be viewed as a three-competitor game: f_d as an obfuscator, f_b (or the ensemble {f_b^i}) as an attacker, and f_T as a utilizer. Algorithm 1 then essentially solves the following two optimization problems iteratively (single-f_b case for example):

max_{f_b}  L_B(f_b(f_d(X)), Y_B)        (4)
min_{f_T, f_d}  L_T(f_T(f_d(X)), Y_T) − γ · H(f_b(f_d(X)))        (5)

where both f_T and f_b end with softmax outputs, and H is the entropy function; in practice the attacker step (4) is implemented by minimizing the standard cross-entropy of f_b against Y_B. In the M-ensemble case, (5) will search for the worst-case (most confident) f_b to minimize against.

Given: pre-trained active degradation module f_d, target task module f_T, and budget modules {f_b^i, i = 1, ..., M}
for number of training iterations do
     Sample a mini-batch of k examples {X^(1), ..., X^(k)}
     Update the active degradation module f_d (weights θ_d) with stochastic gradients of the budget
     suppression loss max_i [−H(f_b^i(f_d(X)))] (plus an L1 reconstruction loss term, used only in the SBU experiment)
          ▷ Suppress only the most confident one among all budget models
     while target task validation accuracy < Threshold do
          ▷ Threshold = 90% for SBU and 70% for UCF101/VISPR; avoid too weak a competitor on the f_T side
          Sample a mini-batch of examples {X^(1), ..., X^(k)}
          Update the target task module f_T (weights θ_T) and the active degradation module f_d (weights θ_d)
          with stochastic gradients of L_T(f_T(f_d(X)), Y_T)
     end while
     while budget task training accuracy < Threshold do
          ▷ Threshold = 95% for both datasets; avoid too weak a competitor on the f_b side
          Sample a mini-batch of k examples {X^(1), ..., X^(k)}
          Update the budget modules f_b^i (weights θ_b^i) with stochastic gradients of the supervised
          budget classification loss on Y_B
     end while
     if current training iteration % 100 == 0 then
          ▷ We empirically restart all budget models every 100 iterations
          Restart all budget models, and repeat Algorithm 1 from the beginning.
     end if
end for
Algorithm 1: Adversarial Training for Privacy-Preserving Visual Recognition.
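A hedged Python rendering of Algorithm 1's control flow is given below; the accuracy thresholds and the 100-iteration restart period follow the text, while the data iterator, optimizers, and helper functions (apply_framewise, validation_accuracy, budget_training_accuracy, restart_and_pretrain, plus the loss helpers from Section 3) are placeholders.

```python
import torch
import torch.nn.functional as F


def adversarial_training(f_d, f_T, budget_models, data_iter, val_loader, train_loader,
                         opt_d, opt_T, opt_b, num_iters,
                         target_threshold=0.90, budget_threshold=0.95,
                         restart_period=100, use_l1=False):
    for it in range(num_iters):
        # (1) Update f_d: suppress only the most confident budget model (+ optional L1 term, SBU only).
        clips, y_action, y_privacy = next(data_iter)
        degraded = apply_framewise(f_d, clips)
        costs = torch.stack([budget_suppression_loss(f_b(degraded)) for f_b in budget_models])
        loss_d = costs.max() + (F.l1_loss(degraded, clips) if use_l1 else 0.0)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # (2) Keep the target side strong: train f_T (and f_d) until validation accuracy >= threshold.
        while validation_accuracy(f_T, f_d, val_loader) < target_threshold:
            clips, y_action, _ = next(data_iter)
            loss_T = target_loss(f_T(apply_framewise(f_d, clips)), y_action)
            opt_T.zero_grad(); opt_d.zero_grad(); loss_T.backward(); opt_T.step(); opt_d.step()

        # (3) Keep the budget side strong: supervised training of the budget models with f_d frozen.
        while budget_training_accuracy(budget_models, f_d, train_loader) < budget_threshold:
            clips, _, y_privacy = next(data_iter)
            degraded = apply_framewise(f_d, clips).detach()
            loss_b = sum(budget_supervised_loss(f_b(degraded), y_privacy) for f_b in budget_models)
            opt_b.zero_grad(); loss_b.backward(); opt_b.step()

        # (4) Budget model restarting (Section 3.3), applied every `restart_period` iterations.
        if (it + 1) % restart_period == 0:
            budget_models = restart_and_pretrain(budget_models, f_d, train_loader)

    return f_d, f_T, budget_models
```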

Appendix B Experiments on SBU

B.1 Results for Method 1

The proposed identity-preserving action recognition task on SBU is very challenging, since videos are taken in highly controlled indoor environments and all actors are clearly viewable in the central regions of each frame. The identity recognition task can also utilize information other than faces: the body shape and even clothing colors are invariant for the same actor across different videos/actions, and different actors wear very distinct clothes with different colors and textures. Table 1 displays the trade-off numbers at different downsampling ratios s for Method 1.

B.2 Two-Fold Evaluation Results for Methods 2 and 3

Table 2 displays the detailed numbers for the second part of our proposed two-fold evaluation, with N = 10 models. The top sub-table is for Method 2, and the bottom sub-table is for Method 3. The corresponding action recognition results, i.e., the first part of the two-fold evaluation, are attached as the last row (C3D) of each sub-table.

We make an additional note here: for Methods 1, 4 and 5, the privacy prediction is evaluated using only one model, while for Methods 2 and 3, the privacy suppression effect is evaluated using the highest achievable number among N = 10 different models. Therefore, the evaluation protocol for Methods 2 and 3 is "stricter", and their gain in privacy protection compared to Methods 1, 4, 5 will essentially be "underestimated" if one directly compares accuracy numbers.

B.3 Visualization Examples of Learned Degradations on SBU

Please refer to Figure 7 for visualized examples of the learned f_d(X).

Figure 7: Example frames after applying the learned degradation on SBU. Panels: (a) Method 2, M=1; (b) Method 2, M=2; (c) Method 2, M=4; (d) Method 2, M=6; (e) Method 3, M=1; (f) Method 3, M=2; (g) Method 3, M=4; (h) Method 3, M=6; (i) Method 2, M=8; (j) Method 2, M=10; (k) Method 2, M=12; (l) Method 2, M=14; (m) Method 3, M=8; (n) Method 3, M=10; (o) Method 3, M=12; (p) Method 3, M=14.


Table 1: Action recognition and actor pair recognition accuracies (%) w.r.t. the spatial downsampling ratio s, using the pre-trained C3D net and MobileNet (Method 1, RGB downsampling).

              s=1    s=2    s=3    s=4    s=6    s=8    s=14   s=16   s=28   s=56
Action        88.83  87.90  86.98  81.86  79.53  74.88  65.12  64.37  56.28  33.49
Actor pair    98.87  97.23  96.45  95.50  95.24  94.11  93.94  92.15  90.28  60.93


Table 2: SBU Two-Fold Evaluation. Top sub-table: Method 2 (ensemble without restarting); bottom sub-table: Method 3 (ensemble with restarting). The rows named after image classification CNNs report actor pair recognition accuracies (%) of the N = 10 evaluation models; the C3D row reports the corresponding action recognition accuracy A_T (%).
M=1 M=2 M=4 M=6 M=8 M=10 M=12 M=14 M=16 M=18
resnet_v1_50 70.8 65.4 70.3 67.2 65.1 68.3 65.8 61.7 62.4 59.3
resnet_v1_101 68.3 67.6 71.4 69.4 66.8 69.7 63.0 62.5 59.2 57.0
resnet_v2_50 62.6 62.1 61.9 64.9 63.3 62.3 58.4 61.1 62.9 60.8
resnet_v2_101 69.6 66.9 71.4 68.9 66.1 64.2 65.2 64.9 64.8 60.0
mobilenet_v1_100 73.6 71.8 72.9 65.4 65.7 71.2 67.5 65.4 67.3 63.2
mobilenet_v1_075 71.3 72.4 71.4 70.9 66.5 66.3 66.1 66.3 65.5 61.1
inception_v1 66.7 60.8 66.4 58.9 64.2 60.5 58.5 61.8 57.4 63.5
inception_v2 60.6 61.3 68.7 67.6 60.3 59.1 62.3 61.1 61.6 62.1
mobilenet_v1_050 71.2 70.5 69.6 71.6 67.2 70.6 67.5 65.2 64.4 63.2
mobilenet_v1_025 70.6 71.5 71.9 70.2 66.4 70.7 69.8 65.8 65.5 64.2
C3D 83.2 84.1 82.7 83.6 80.8 88.3 82.7 83.3 83.5 82.6
M=1 M=2 M=4 M=6 M=8 M=10 M=12 M=14 M=16 M=18
resnet_v1_50 55.5 47.2 54.1 46.9 41.9 42.8 44.2 38.4 37.3 32.4
resnet_v1_101 49.7 54.6 40.2 51.2 44.9 57.2 44.7 41.7 42.2 34.5
resnet_v2_50 42.3 49.7 52.9 40.8 42.3 43.8 57.8 40.4 40.9 35.2
resnet_v2_101 54.4 38.9 49.2 44.9 41.5 44.8 44.02 42.0 39.6 50.6
mobilenet_v1_100 60.5 55.8 51.2 49.8 47.7 45.3 42.8 43.1 41.9 41.8
mobilenet_v1_075 58.2 57.9 52.4 51.1 46.9 44.1 45.2 41.8 41.2 40.2
inception_v1 51.3 54.4 45.8 44.9 42.5 41.2 44.8 38.8 35.3 45.8
inception_v2 44.2 38.2 42.4 49.4 45.9 44.3 41.0 42.5 39.4 47.1
mobilenet_v1_050 58.2 56.2 54.6 46.6 43.6 41.2 38.5 39.3 34.2 35.8
mobilenet_v1_025 54.8 54.3 52.9 52.5 43.5 44.7 41.1 42.6 42.5 38.5
C3D 81.7 82.6 78.0 82.8 82.2 82.1 83.5 83.1 82.6 82.3

stands for training from scratch instead of fine-tuning and stands for budget model restarting.

Appendix C Experiments on UCF-101 / VISPR

C.1 "Transferability" Study of Privacy Attributes between UCF-101 and VISPR

C.1.1 Selection of 17 and 7 Privacy Attributes

Figure 8: Attribute-wise occurrence statistics on UCF-101 videos, evaluated using the pretrained privacy prediction model on VISPR.

There are 13,421 videos in the UCF-101 dataset. We evaluate each video using the privacy attribute prediction model pretrained on the VISPR dataset. From the statistics plotted in Figure 8, we observe that 43 attributes can be found at least once in UCF-101 videos, but only 17 out of the 43 occur frequently. These 17 attributes are {age_approx, weight_approx, height_approx, gender, eye_color, hair_color, face_complete, face_partial, semi-nudity, race, color, occupation, hobbies, sports, personal relationship, social relationship, safe}.

Among the 17 frequent attributes, we carefully select 7 privacy attributes that best fit the smart home setting. These 7 attributes are {semi-nudity, occupation, hobbies, sports, personal relationship, social relationship}.

C.1.2 Privacy Attribute Examples in UCF-101

In Figure 9, we show some example frames from UCF101 with privacy attributes predicted using the VISPR-pretrained model. In each example, the right column lists the predicted privacy attributes (as defined in the VISPR dataset [40]) and the associated confidences for the left-column frames, showing a high risk of privacy leak in common daily videos. We qualitatively examined a large number of UCF-101 frames and determined that the privacy attribute predictions are highly reliable.

(a) ApplyLipStick
(b) BabyCrawling
(c) PlayingPiano
(d) ShavingBeard
(e) Situp
(f) YoYo
Figure 9: Privacy attributes prediction on example frames from UCF101. The right column denotes the predicted privacy attributes (as defined in the VISPR dataset [40]) and associated confidences from the left column frames, showing a high risk of privacy leak in daily common videos.
Figure 11: Example frames after applying the learned degradation on UCF-101, with adversarial training on VISPR-17 and VISPR-7. Panels: (a) Method 2, M=1 (VISPR-17); (b) Method 2, M=2 (VISPR-17); (c) Method 2, M=3 (VISPR-17); (d) Method 2, M=4 (VISPR-17); (e) Method 3, M=1 (VISPR-17); (f) Method 3, M=2 (VISPR-17); (g) Method 3, M=3 (VISPR-17); (h) Method 3, M=4 (VISPR-17); (i) Method 2, M=1 (VISPR-7); (j) Method 2, M=2 (VISPR-7); (k) Method 2, M=3 (VISPR-7); (l) Method 2, M=4 (VISPR-7); (m) Method 3, M=1 (VISPR-7); (n) Method 3, M=2 (VISPR-7); (o) Method 3, M=3 (VISPR-7); (p) Method 3, M=4 (VISPR-7).

C.2 UCF-101 / VISPR Two-Fold Evaluation

The trade-off results between UCF-101 and VISPR-17 / VISPR-7 are found in Tables 3 and 4, respectively. Note that for the N = 10 privacy attribute prediction evaluation, the results are reported in class-based MAP (cMAP) rather than recognition accuracy.

M=1  M=1 (w/ restarting)  M=2  M=2 (w/ restarting)  M=3  M=3 (w/ restarting)  M=4  M=4 (w/ restarting)
resnet_v1_50 66.68 63.45 62.12 63.78 65.59 62.12 65.12 59.83
resnet_v1_101 65.78 59.24 62.48 61.29 59.59 61.23 64.21 61.49
resnet_v2_50 62.12 65.28 66.94 62.48 59.59 59.56 62.34 60.47
resnet_v2_101 59.12 61.45 57.59 59.43 58.32 61.43 64.23 59.48
mobilenet_v1_100 63.45 58.48 62.69 61.47 64.39 61.59 65.01 57.43
mobilenet_v1_075 62.23 62.48 64.28 59.47 60.27 58.57 55.48 57.57
inception_v1 58.32 62.49 59.39 64.82 63.57 61.39 63.58 58.46
inception_v2 65.79 61.28 64.52 63.58 60.49 63.58 60.25 59.39
mobilenet_v1_050 65.12 60.25 64.29 59.49 62.48 63.58 63.58 62.06
mobilenet_v1_025 62.54 63.59 62.58 62.46 60.47 59.20 58.27 61.36
C3D 66.58 66.36 64.46 65.27 65.28 65.89 66.59 65.83

stands for training from scratch instead of fine-tuning and stands for budget model restarting

Table 3: UCF-101 / VISPR-17 Two-Fold Evaluation
M=1  M=1 (w/ restarting)  M=2  M=2 (w/ restarting)  M=3  M=3 (w/ restarting)  M=4  M=4 (w/ restarting)
resnet_v1_50 40.68 38.24 38.45 35.67 35.34 32.54 35.58 33.41
resnet_v1_101 32.21 37.69 37.31 36.21 37.35 34.53 37.48 32.67
resnet_v2_50 33.46 37.13 39.94 36.28 32.59 34.13 36.69 33.46
resnet_v2_101 35.25 34.49 32.58 35.38 38.59 35.16 37.24 31.53
mobilenet_v1_100 33.28 35.24 37.54 32.48 31.59 28.36 32.48 29.57
mobilenet_v1_075 28.59 34.58 38.23 31.59 35.38 30.94 29.58 32.58
inception_v1 35.28 37.56 36.84 27.48 29.48 30.48 32.04 34.48
inception_v2 38.47 36.39 35.29 30.92 28.59 33.59 35.38 29.58
mobilenet_v1_050 38.49 28.49 32.56 33.48 31.58 32.58 38.32 33.48
mobilenet_v1_025 35.47 38.42 34.93 31.28 33.37 34.78 33.57 30.08
C3D 65.16 65.58 64.53 66.46 65.38 64.28 64.83 65.37

stands for training from scratch instead of fine-tuning and stands for budget model restarting

Table 4: UCF-101 / VISPR-7 Two-Fold Evaluation

C.3 Visualization Examples of Learned Degradation on UCF-101 / VISPR

For visualized examples of the learned f_d(X), please refer to Figure 11 for VISPR-17 and VISPR-7.