Realistic Full-Body Anonymization with Surface-Guided GANs

01/06/2022
by   Håkon Hukkelås, et al.

Recent work on image anonymization has shown that generative adversarial networks (GANs) can generate near-photorealistic faces to anonymize individuals. However, scaling these networks to the entire human body has remained a challenging and unsolved task. We propose a new anonymization method that generates close-to-photorealistic humans for in-the-wild images. A key part of our design is to guide adversarial nets by dense pixel-to-surface correspondences between an image and a canonical 3D surface. We introduce Variational Surface-Adaptive Modulation (V-SAM), which embeds surface information throughout the generator. Combining this with our novel discriminator surface supervision loss, the generator can synthesize high-quality humans with diverse appearance in complex and varying scenes. We show that surface guidance significantly improves image quality and diversity of samples, yielding a highly practical generator. Finally, we demonstrate that surface-guided anonymization preserves the usability of data for future computer vision development.


Code repository: full_body_anonymization

1 Introduction

State-of-the-art computer vision methods require a substantial amount of data, but collecting this data without anonymization would violate privacy regulations in several regions. Recent work shows that generative adversarial networks (GANs) [12] can realistically anonymize faces, such that the anonymized datasets perform similarly to the original for future computer vision development. However, these methods [49, 21, 35] focus solely on face anonymization, leaving several primary (e.g. ears, gait [23]) and soft identifiers (e.g. gender) on the human body untouched.

Generative adversarial networks excel at synthesizing high-resolution images in many domains, including humans [27]. Despite this success, previous work on full-body generative modeling focuses on simplified tasks, such as motion transfer [6], pose transfer [3, 31], garment swapping [16], or rendering a body with known 3D structure of the scene [54]. These methods do not directly apply to in-the-wild anonymization, as the model has to ensure that the generated human figure is consistent with the environment of the scene. As far as we know, our work is the first to address the task of synthesizing humans into in-the-wild images without simplifying the task (e.g. having a source texture to transfer, or known 3D structure of the scene), although we note that CIAGAN [35] ablates their method for low-resolution human synthesis.

In this work, we propose Surface-Guided GANs that utilize Continuous Surface Embeddings (CSE) [41] to guide the generator with pixel-to-surface correspondences. The compact, high-fidelity and continuous representation of CSE excels for human synthesis, as it allows for simple modeling choices without compromising fine-grained details. This is in contrast to previously adopted conditional representations (e.g. DensePose [15] or segmentation maps), where it is not clear how to perform simple operations (e.g. downsampling) or how to handle border conditions. These operations are straightforward with CSE.
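To make the contrast concrete, the sketch below resizes a continuous embedding map and a discrete semantic map in PyTorch; the 16-dimensional embedding and the tensor shapes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: a continuous CSE embedding map and a 26-part semantic map
# for a 256x256 person crop.
E = torch.randn(1, 16, 256, 256)            # continuous pixel-to-surface embeddings
S = torch.randint(0, 26, (1, 256, 256))     # discrete body-part labels

# The continuous embedding map can be resized like any feature map.
E_small = F.interpolate(E, size=(128, 128), mode="bilinear", align_corners=False)

# A discrete label map cannot be averaged without mixing classes; it has to fall
# back to nearest-neighbour sampling, which loses detail along part borders.
S_small = F.interpolate(S[:, None].float(), size=(128, 128), mode="nearest").long()
```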

Our contributions address the challenge of efficiently utilizing the powerful CSE representation for human synthesis and can be summarized into three points.

First, we propose Variational Surface Adaptive Modulation (V-SAM), which projects the input latent space of the generator to an intermediate surface-adaptive latent space. This allows the generator to directly map the latent factors of variation to relevant surface locations, resulting in a latent space disentangled from the spatial image.

Secondly, we propose Discriminator Surface Supervision, which incentivizes the discriminator to learn pixel-to-surface correspondences. The surface awareness of the discriminator provides higher-fidelity feedback to the generator, which significantly improves image quality. In fact, we show that the surface-aware feedback from the discriminator is a key factor behind the powerful representation learned by V-SAM, where similar semantic-based supervision [48] yields suboptimal results.

Finally, we present a full-body anonymization framework for in-the-wild images that produces close-to-photorealistic results. We demonstrate that surface-guided anonymization significantly improves upon traditional methods (e.g. pixelation) in terms of data usability and privacy. For example, pixelation degrades the person average precision of Mask R-CNN [17] instance segmentation by 14.4 points, whereas surface-guided anonymization yields only a 2.8 point degradation.

2 Related Work

Anonymization of Images

Naive anonymization methods that apply simple image distortions (e.g. blurring) are known to be inadequate for removing privacy-sensitive information [14, 39], and they severely distort the data. Recent work shows that deep generative models can realistically anonymize faces by inpainting [50, 49, 21, 35, 2] or by transforming the original image [10]. These methods demonstrate that retaining the original data distribution is important for future computer vision development (e.g. evaluation of face detection [21]). However, prior work focuses on face anonymization, leaving several primary and secondary identifiers untouched. Some methods anonymize the entire body [35, 4], but they are limited to low-resolution images [35] or generate images with visual artifacts [4].

Conditional Image Synthesis

Current state-of-the-art methods for conditional image synthesis generate highly realistic images in many domains, such as image-to-image translation [22, 48]. An emerging approach is to introduce conditional information to the generator via adaptive modulation (also known as adaptive normalization [19]). This is known to be effective for unconditional synthesis [27], semantic synthesis [43], and style transfer [19]. Adaptive modulation conditions the generator by layer-wise shifting and scaling of its feature maps, where the shifting and scaling parameters are adaptive with respect to the condition. In contrast to prior semantic-modulation methods [43, 51, 52], V-SAM conditions the modulation parameters on dense surface information, and generates global modulation parameters instead of independent layer-wise parameters. Conditional modulation has also been adopted for human synthesis, where prior methods use spatially-invariant [36, 46] or spatially-variant [58, 1] modulation. However, these methods are conditioned on a source human image, making them inadequate for anonymization purposes.

Human Synthesis

Prior work on person image generation often focuses on resynthesizing humans with user-guided input, such as rendering persons in novel poses [3, 31], with different garments [16], or with a new motion [6]. Recent work [7, 30, 47, 40, 13] employs dense pixel-to-surface correspondences in the form of DensePose UV-maps [15]. These methods "fill in" UV texture maps, then render the person in new camera views [7] or poses [30, 47, 40, 13]. In other cases, the aim is to reconstruct the 3D surface along with the texture [38, 45, 54], which can be rendered into the scene given a camera view [54]. A limited amount of work focuses on human synthesis without a source image, where Ma et al. [34] map background, pose and person style into Gaussian variables, enabling synthesis of novel persons. None of the aforementioned methods are directly applicable for human anonymization, as they require information about a source identity, or the camera position to render the person. Additionally, none of them account for modeling variations in the scene.

3 Method

Figure 2: (a) A CSE-detector [41] predicts pixel-to-surface correspondences, represented as a continuous positional embedding $E$. For simplicity, we show the pipeline with a single person, but multi-person detection is done by cropping each person out (see Figure 1). (b) The mapping network $f$ transforms surface locations $E$ and the latent variable $z$ into an intermediate surface-adaptive latent space $\mathcal{W}$ (Sec. 3.1, Sec. 3.2). Then, $w$ controls the generator with pixel-wise modulation and normalization at every convolutional layer. (c) Our FPN-discriminator predicts the surface embedding for each image and optimizes a surface-regression loss ($\mathcal{L}_E$, Sec. 3.3) along with the adversarial loss ($\mathcal{L}_{adv}$).

We describe the anonymization task as a guided inpainting task. The objective of the generator is to inpaint the missing regions of the image $I$, given a binary mask $M$, where $M_i = 0$ indicates that pixel $i$ is missing. For each missing pixel $i$, the continuous surface embedding $E_i$ (the output of a CSE-detector [41]) represents the position of pixel $i$ on a canonical 3D surface (i.e. the position on a "T-shaped" human body). The surface is discretized with 27K vertices, where each vertex has a positional embedding obtained from the pre-trained CSE-detector [41]. From this, pixel-to-vertex correspondences are found by Euclidean nearest-neighbour search between $E_i$ and the vertex embeddings. (Finding the pixel-to-vertex correspondence is not strictly necessary for our method. However, replacing the regressed embedding with the nearest vertex embedding prevents the generator from directly observing embeddings regressed from the original image, which can mitigate identity leaking through CSE embeddings.) Figure 2 shows the overall architecture.
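A minimal sketch of this pixel-to-vertex snapping, assuming the regressed embedding map and the canonical vertex embeddings are available as tensors (function and variable names are ours):

```python
import torch

def snap_to_vertices(E, V, chunk=4096):
    """Replace each regressed pixel embedding with its nearest canonical vertex
    embedding (Euclidean nearest neighbour), so the generator never observes
    embeddings regressed directly from the original image.

    E: (H, W, D) regressed CSE embeddings; V: (N, D) canonical vertex embeddings.
    Returns the snapped (H, W, D) embeddings and the (H, W) vertex indices.
    """
    H, W, D = E.shape
    flat = E.reshape(-1, D)
    idx = torch.empty(flat.shape[0], dtype=torch.long)
    for start in range(0, flat.shape[0], chunk):            # chunk to bound memory use
        dists = torch.cdist(flat[start:start + chunk], V)   # (chunk, N) pairwise distances
        idx[start:start + chunk] = dists.argmin(dim=1)
    return V[idx].reshape(H, W, D), idx.reshape(H, W)
```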

3.1 Surface Adaptive Modulation

Inspired by the effectiveness of semantic-adaptive modulation [43], we introduce Surface Adaptive Modulation (SAM). SAM normalizes and modulates convolutional feature maps with respect to dense pixel-to-surface correspondences between the image and a fixed 3D surface. Given the continuous positional embedding $E$, a non-linear mapping $f$ transforms $E$ to an intermediate surface-adaptive representation $w$;

$$w_i = \begin{cases} f(E_i) & \text{if pixel } i \text{ corresponds to the surface,} \\ \theta & \text{otherwise,} \end{cases} \qquad (1)$$

where $\theta$ is a pixel-independent learned parameter shared by all pixels that do not correspond to the surface. Given $w$, a learned affine operation $A$ transforms $w$ to layer-wise "styles" $(\gamma, \beta)$ (we use the word "style" following prior work [19, 27]) to scale and shift the feature map $x$;

$$\hat{x}_i = \gamma_i \odot x_i + \beta_i, \qquad (\gamma_i, \beta_i) = A(w_i), \qquad (2)$$

where each pixel $i$ is modulated by $(\gamma_i, \beta_i)$ independently. Note that we follow the StyleGAN2 design [28], with modulation before convolution and normalization after.
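The sketch below shows one possible implementation of a SAM layer following Eq. (1)-(2): a pixel-wise mapping network produces w, a learned affine predicts per-pixel scale and shift, and the feature map is modulated before convolution and normalized after. The layer widths, the embedding dimensionality, and the choice of instance normalization are our assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAMBlock(nn.Module):
    """Surface Adaptive Modulation for one convolutional layer (sketch of Eq. 1-2)."""

    def __init__(self, in_ch, out_ch, e_dim=16, w_dim=512, n_map=6):
        super().__init__()
        # Pixel-wise mapping network f: E_i -> w_i, implemented with 1x1 convolutions.
        layers, d = [], e_dim
        for _ in range(n_map):
            layers += [nn.Conv2d(d, w_dim, 1), nn.LeakyReLU(0.2)]
            d = w_dim
        self.mapping = nn.Sequential(*layers)
        self.theta = nn.Parameter(torch.zeros(w_dim))      # Eq. (1): pixels without a surface match
        self.affine = nn.Conv2d(w_dim, 2 * in_ch, 1)       # Eq. (2): w_i -> (gamma_i, beta_i)
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x, E, on_surface):
        # x: (B, in_ch, H, W) features; E: (B, e_dim, H, W) CSE embeddings, assumed
        # resized to this layer's resolution; on_surface: (B, 1, H, W) boolean mask.
        w = self.mapping(E)
        w = torch.where(on_surface, w, self.theta.view(1, -1, 1, 1))   # Eq. (1)
        gamma, beta = self.affine(w).chunk(2, dim=1)
        x = x * (1 + gamma) + beta                  # modulate before convolution
        x = self.conv(x)
        return F.instance_norm(x)                   # normalize after (our choice of norm)
```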

Figure 3: Visualization of the norm of the modulation parameters for SAM with mapping networks $f$ of different depths (a-d), and (e) SPADE [43] with 26 semantic regions. Note that SAM learns much more fine-grained details (e.g. zoom in on the head or fingers) than its semantic counterpart [43].

The global mapping network ($f$) enables the generator to adapt the smooth surface embedding into semantically meaningful surface-adaptive styles, which are not necessarily smooth. For instance, this enables the generator to learn part-wise continuous styles with clearly defined semantic borders (e.g. between two pieces of clothing). We observe that a deeper mapping network learns higher-fidelity modulation parameters (Figure 3), and experimentally show that this leads to improved image quality (see Section 4.1).

Compared to prior semantic-based modulation [43, 51, 52], SAM uses a denser and more informative representation that excels at human synthesis. Semantic-based modulation learns spatially-invariant (but semantic-variant) modulation parameters [52], which is reflected in Figure 3. These spatially-invariant parameters are efficient for natural image synthesis, but translate poorly to the highly fine-grained task of human figure synthesis. In contrast, SAM learns high-fidelity modulation parameters that are independent of pre-defined semantic regions, a trait which is crucial to synthesize realistic humans. Learning this representation through semantic-based modulation is difficult, as it is computationally expensive, requires pre-defined semantic regions, and complicates simple operations (e.g. downsampling of border regions).

3.2 Variational Surface Adaptive Modulation

A key limitation of the modulation method above is that the appearance of the synthesized person varies under affine image transformations. Typically, an image-to-image generator inputs a latent code $z$ directly to a 2D feature map through concatenation or additive noise. However, this entangles the latent code with the spatial feature map, making the appearance of the generated person dependent on its position in the image.

Instead of inputting $z$ to a 2D feature map, we extend SAM to condition the mapping network on $z$; that is, $w_i = f(E_i, z)$. Now, $f$ transforms the latent variable into a surface-adaptive intermediate latent space $\mathcal{W}$, which is modulated onto the spatial feature map. This naive extension of SAM allows the generator to directly relate latent factors of variation to specific positions on the surface, improving the disentangled representation of the generator (ablated in Section 4.1). Furthermore, the pixel-wise design of V-SAM modulates the activations in a manner invariant to image rotation and translation. In effect, this improves the ability of the generator to synthesize the same person independent of its spatial position. Rotational invariance, however, is not retained in the full generator, as the generator itself is not rotationally invariant (although adapting V-SAM with StyleGAN3-R [26] results in a surface-guided rotationally invariant generator).
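A sketch of this extension: the latent code z is broadcast to every pixel and concatenated with the surface embedding before the mapping network, so latent factors attach to surface locations rather than image coordinates. Layer sizes and the residual placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VSAMMapping(nn.Module):
    """Mapping network f(E, z) -> w for V-SAM (sketch)."""

    def __init__(self, e_dim=16, z_dim=512, w_dim=512, n_layers=6):
        super().__init__()
        blocks, d = [], e_dim + z_dim
        for _ in range(n_layers):
            blocks.append(nn.Sequential(nn.Conv2d(d, w_dim, 1), nn.LeakyReLU(0.2)))
            d = w_dim
        self.blocks = nn.ModuleList(blocks)

    def forward(self, E, z):
        # E: (B, e_dim, H, W) surface embeddings; z: (B, z_dim) latent code.
        B, _, H, W = E.shape
        h = torch.cat([E, z.view(B, -1, 1, 1).expand(B, z.shape[1], H, W)], dim=1)
        for i, block in enumerate(self.blocks):
            out = block(h)
            h = out if i == 0 else out + h          # residual connections, as noted in the text
        return h                                     # surface-adaptive latent w: (B, w_dim, H, W)
```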

Our mapping network is similar to the one of StyleGAN [27]: removing $E$ from the input of $f$ would yield a mapping network identical to the one in StyleGAN (except that we use residual connections in $f$ to improve training stability). Furthermore, V-SAM is different from variational semantic-based modulation [51, 62]. Variational semantic modulation [51, 62] uses class-specific latent variables to enable controlled synthesis of different regions, whereas we use a single latent variable for the entire body. Class-specific latent variables do not directly apply to human synthesis, unless one introduces further knowledge of human characteristics into their generation (e.g. the left arm is often similar to the right arm).

3.3 Discriminator Surface Supervision

Supervising the discriminator by teaching it to predict conditional information (instead of inputting it) is known to improve image quality and training stability [48, 42]. Motivated by the effectiveness of semantic supervision [48], we propose a similar objective for surface embeddings.

We formulate the surface embedding prediction as a regression task. We extend the discriminator with an FPN-head that outputs a continuous embedding $\hat{E}_i$ for each pixel $i$. Then, along with the adversarial objective, the discriminator optimizes a masked version of the smooth L1 loss [11];

$$\mathcal{L}_E = \frac{1}{|\Omega|}\sum_{i \in \Omega} \mathrm{smooth}_{L_1}\!\big(\hat{E}_i - E_i\big), \qquad \Omega = \{\, i : M_i = 0 \,\}. \qquad (3)$$

Similarly, the generator objective is extended with the same regression loss with respect to the generated image. Unlike the original CSE loss [41], this objective is simpler, as we assume fixed vertex embeddings that are learned beforehand.
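A sketch of the masked regression objective in Eq. (3), assuming the discriminator's FPN-head returns a per-pixel embedding and the loss is averaged over the missing pixels; the exact averaging is our assumption.

```python
import torch
import torch.nn.functional as F

def surface_supervision_loss(E_hat, E, mask):
    """Masked smooth-L1 surface regression (sketch of Eq. 3).

    E_hat: (B, D, H, W) embeddings predicted by the discriminator's FPN-head.
    E:     (B, D, H, W) target CSE embeddings (snapped to canonical vertices).
    mask:  (B, 1, H, W) 1 for pixels inside the inpainted body region, else 0.
    """
    per_pixel = F.smooth_l1_loss(E_hat, E, reduction="none")   # (B, D, H, W)
    per_pixel = per_pixel.mean(dim=1, keepdim=True)            # average over embedding dims
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1)
```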

Discriminator surface supervision explicitly encourages the discriminator to learn pixel-to-surface correspondences. This yields a discriminator that provides highly detailed gradient signals to the generator, which considerably improves image quality. In comparison to semantic-based supervision [48], surface supervision provides higher-fidelity feedback without relying on pre-defined semantic regions. Finally, we found that additionally predicting "real" and "fake" areas (as in OASIS [48]) negatively affects training stability, and that an FPN-head is more stable to train than a U-Net [44] architecture (as used in [48]).

3.4 The Anonymization Pipeline

Our proposed anonymization framework consists of two stages. First, a CSE-based detector [41] localizes humans and provides a dense 2D-3D correspondence between the 2D image and a fixed 3D human surface. Given a detected human, we zero-out the pixels covering the human body and complete the partial image with our generative model. Note that the masks generated from CSE [41] do not cover areas "outside" of the human body, so we dilate the mask to ensure that it also covers clothing and hair. We extend Equation 1 with an additional pixel-independent learned parameter for the dilated regions (similar to $\theta$), to ensure a smooth transition between known areas and unknown dilated areas without a surface embedding.
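A high-level sketch of this two-stage pipeline; the detector and generator interfaces, the detection keys, and the dilation radius are placeholders rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def anonymize(image, detector, generator, dilation=9):
    """Two-stage anonymization sketch: detect people with a CSE detector,
    zero out the (dilated) body region, and inpaint it with the generator."""
    out = image.clone()                                  # image: (3, H, W) in [0, 1]
    for person in detector(image):                       # per detection: crop box, CSE map, body mask
        y0, y1, x0, x1 = person["y0"], person["y1"], person["x0"], person["x1"]
        E, body_mask = person["embedding"], person["mask"]        # (D, h, w), (h, w) in {0, 1}
        # Dilate the CSE mask so clothing and hair are replaced as well.
        dilated = F.max_pool2d(body_mask[None, None].float(),
                               dilation, stride=1, padding=dilation // 2)[0, 0]
        crop = out[:, y0:y1, x0:x1] * (1 - dilated)      # zero out pixels covering the body
        z = torch.randn(1, 512)                          # sample a new identity
        out[:, y0:y1, x0:x1] = generator(crop[None], E[None], dilated[None, None], z)[0]
    return out
```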

Figure 4: Synthesized images for the different model iterations in Table 1: (a) original, (b) SPADE, (c) Config B, (d) Config D (n=0), (e) Config D, (f) Config E. Appendix D includes random examples.

Figure 5: Diverse synthesis with Config E. (a) is the input, (b) is the generated image with latent truncation (t=0), and (c-e) are samples without truncation. Appendix D includes random examples.

4 Experiments

We validate our design choices in Section 4.1 and compare Surface-Guided GANs to their semantic-based equivalents in Section 4.2. In Section 4.3 we evaluate scene-independent human synthesis on the DeepFashion [33] dataset. Finally, in Section 4.4 we evaluate the impact of anonymization on future computer vision development (Table 6).

Architecture Details

We follow the design principles of StyleGAN2 [28] for our network architectures. The generator is a typical U-Net [44] architecture previously adopted for image-to-image translation [22], and the discriminator is similar to the one of StyleGAN2. The baseline discriminator has 8.5M parameters and the generator has 7.4M. We use the non-saturating adversarial loss [12] and regularize the discriminator with an epsilon penalty [24] and r1-regularization [37]. We mask the r1-regularization by the inpainting mask $M$, similar to [56, 20]. Data augmentation is used for COCO-Body, including horizontal flip, geometric transforms and color transforms. Otherwise, we keep the training setup simple, with no feature matching loss [53], path length regularization [28], nor adaptive data augmentation [25]. For simplicity, we set the dimensionality of $z$ and of the fully-connected layers in $f$ to 512. We use 6 fully-connected layers in $f$, unless stated otherwise. See details in Appendix A.
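For concreteness, a sketch of this training objective: non-saturating loss, epsilon penalty on the real logits, and R1 restricted to the inpainted region. The coefficients and interfaces are placeholders, not the values used in the paper.

```python
import torch
import torch.nn.functional as F

def d_loss(D, real, fake, mask, r1_gamma=10.0, eps_weight=1e-3):
    """Discriminator objective (sketch): non-saturating logistic loss [12],
    epsilon penalty on real logits [24], and R1 [37] masked by the
    inpainting region (as in [56, 20])."""
    real = real.detach().requires_grad_(True)
    real_logits = D(real)
    fake_logits = D(fake.detach())

    loss = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
    loss = loss + eps_weight * real_logits.square().mean()       # epsilon penalty

    grad = torch.autograd.grad(real_logits.sum(), real, create_graph=True)[0]
    r1 = (grad * mask).square().sum(dim=[1, 2, 3]).mean()        # gradients only inside the mask
    return loss + 0.5 * r1_gamma * r1


def g_loss(D, fake):
    """Non-saturating generator loss."""
    return F.softplus(-D(fake)).mean()
```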

Dataset Details

We validate our method on two datasets: a derived version of the COCO dataset [32] (named COCO-Body) for full-body anonymization, and DeepFashion [33] for static scene synthesis. We will open source the CSE-annotations for both datasets.

  • COCO-Body contains cropped images from COCO [32], where a single human is in the center of the image. Each image has automatically annotated CSE embeddings and a boolean mask indicating the area to be replaced. Note that each mask is dilated from the original CSE-embedding to ensure that the mask covers all parts of the body. The dataset contains 43,053 training images and 10,777 validation images, with a resolution of . The annotation process is described in Appendix B.

  • DeepFashion-CSE includes images from the In-shop Clothes Retrieval Benchmark of DeepFashion [33], where we have annotated each image with a CSE embedding. It has 40,625 training images and 10,275 validation images, where each image is downsampled to . The dataset includes some errors in annotations, as no annotation validation is done.

Evaluation Details

We follow typical evaluation practices for generative modeling. We report Fréchet Inception Distance (FID) [18], Learned Perceptual Image Patch Similarity (LPIPS) [60], LPIPS diversity [61], and Perceptual Path Length (PPL) [27]. FID, LPIPS and LPIPS diversity are computed by generating 6 images per validation sample, where the reported LPIPS is the average over samples. In addition, we report face quality by evaluating FID on the face region (details in Appendix A). All metrics for each model are reported in Appendix C.
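A sketch of this sampling protocol using the lpips package (assumed installed); the generator interface and batch keys are placeholders, and LPIPS diversity is computed as the mean pairwise LPIPS between samples, following [61].

```python
import itertools
import torch
import lpips  # pip install lpips (assumed available)

loss_fn = lpips.LPIPS(net="alex")  # expects images scaled to [-1, 1]

@torch.no_grad()
def lpips_diversity(generator, batch, n_samples=6, z_dim=512):
    """Mean pairwise LPIPS between n_samples generations per input (sketch)."""
    samples = [generator(batch["image"], batch["E"], batch["mask"],
                         torch.randn(batch["image"].shape[0], z_dim))
               for _ in range(n_samples)]
    dists = [loss_fn(a, b).mean() for a, b in itertools.combinations(samples, 2)]
    return torch.stack(dists).mean()
```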

Method           LPIPS   FID    PPL    LPIPS Diversity
A  Baseline       0.239   10.8   50.1   0.167
B  + CSE*         0.224   7.3    40.4   0.157
C  + SAM          0.217   4.8    22.0   0.163
D  + V-SAM        0.217   4.5    18.1   0.173
E  + Larger D/G   0.210   4.0    25.1   0.172

Table 1: LPIPS, FID, PPL and LPIPS diversity for various generator (G) and discriminator (D) designs. *CSE information is provided to both G and D, where G receives the CSE embedding by concatenation with the input image.

4.1 Attributes of Surface-Guided GANs

We iteratively develop the baseline architecture to introduce surface guidance. Table 1 (and Figure 4) shows that the addition of discriminator surface supervision (config B) and surface modulation (configs C/D) drastically improves image quality. Config E increases the model size of the generator and discriminator to 39.4M and 34M parameters, respectively. The final generator produces high-quality and diverse results (Figure 5). In addition, the conditional intermediate latent space is amenable to similar techniques as the latent space of StyleGAN [27], e.g. the truncation trick [5] and latent interpolation (ablated in Appendix C). Figure 5 includes generated images with latent truncation.
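A sketch of how the truncation trick carries over to the surface-conditional latent space: interpolate each w towards a mean latent, here estimated per surface embedding map (an assumption of ours); t = 0 corresponds to Figure 5 (b).

```python
import torch

@torch.no_grad()
def truncated_w(mapping, E, z_dim=512, t=0.0, n_avg=100):
    """Truncation trick on the surface-adaptive latent space (sketch).

    t = 0 collapses to the mean latent; t = 1 disables truncation.
    mapping(E, z) is assumed to return the surface-adaptive latent map w.
    """
    w_avg = torch.stack([mapping(E, torch.randn(E.shape[0], z_dim))
                         for _ in range(n_avg)]).mean(dim=0)
    w = mapping(E, torch.randn(E.shape[0], z_dim))
    return w_avg + t * (w - w_avg)
```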

Mapping Network Depth

        SAM (Config C)             V-SAM (Config D)
depth   Face FID   FID    PPL     Face FID   FID    PPL
0       8.5        5.2    35.9    7.9        4.8    25.3
2       7.5        4.9    21.9    7.8        4.5    25.6
4       7.5        5.0    32.8    7.9        4.6    19.8
6       7.8        4.8    22.0    7.1        4.5    18.1

Table 2: FID, PPL and FID of the face region for Config C/D with different numbers of layers ($n$) in the mapping network ($f$).

A deeper mapping network allows the generator to learn finer-grained modulation parameters, which we find to significantly improve image quality (see Table 2). We observe a surprisingly modest improvement in FID as the depth of $f$ increases. However, we qualitatively observe a significant improvement in the quality of fine-grained regions (e.g. the face and fingers, see Figure 4), and speculate that these details are not reflected in FID. Thus, we evaluate FID for the upsampled face region separately, which reflects a significant improvement.

Furthermore, a deeper mapping network allows the generator to better disentangle the latent space (following Karras et al. [27], a "disentangled latent space" is one where the latent factors of variation are separated into linear subspaces), which is reflected by PPL. The improved disentanglement is rooted in two design choices. First, SAM explicitly disentangles the variations of pose into surface-adaptive modulation. Secondly, V-SAM "unwarps" the fixed distribution $p(z)$ into the surface-conditioned distribution $p(w \mid E)$, which allows the generator to more easily control specific areas of the human body, disentangled from the spatial image.

Affine Invariance Studies

V-SAM is invariant to affine image-plane transformations, and thus, improves the ability of the generator to disentangle the latent representation from such transforms. We quantitatively evaluate this with Peak Signal-to-Noise Ratio (PSNR), following the approach in

[59],

(4)

where , E is the spatial embedding, G is the generator, and is the distribution of vertical and horizontal image shifts.  is limited to translate the image by maximum of the image width/height. We similarly evaluate rotational invariance and horizontal flip, where the rotation is limited to .

As reflected in Table 3, V-SAM improves the invariance of the generator to affine transformations. Affine invariance is important for realistic anonymization, as the detector can induce slight shifts across frames.

Affine Transformation
Method Translation Rotation Hflip
Config A 23.4 19.3 19.8
Config B 24.7 20.6 20.8
B + SAM 21.6 19.4 19.0
B + V-SAM 25.9 21.2 21.5
B + SPADE 21.0 19.1 18.9
B + INADE 23.6 20.7 20.2
Table 3: PSNR for different architectures evaluating the invariance to affine transformations and horizontal flip.
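For concreteness, a sketch of measuring the translation term of Eq. (4); the generator interface, the shift range, and the use of wrap-around rolling are placeholders.

```python
import torch

def psnr(a, b, max_val=1.0):
    mse = (a - b).square().mean()
    return 10 * torch.log10(max_val ** 2 / mse)

@torch.no_grad()
def translation_invariance(generator, E, z, max_shift=16):
    """PSNR between a generation from shifted inputs (shifted back afterwards)
    and the un-shifted generation, in the spirit of Eq. (4) and [59] (sketch).
    torch.roll wraps around at the border, a crude stand-in for true translation."""
    dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    reference = generator(E, z)                                        # (B, 3, H, W)
    shifted = generator(torch.roll(E, shifts=(dy, dx), dims=(2, 3)), z)
    realigned = torch.roll(shifted, shifts=(-dy, -dx), dims=(2, 3))
    return psnr(realigned, reference)
```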

Computational Complexity

V-SAM consists of two stages: the mapping network, and layer-wise linear transformations. Each layer-wise transformation is efficiently implemented as a 1x1 convolution. The mapping network is a sequence of fully-connected layers, which can be implemented as 1x1 convolutions by applying them to the spatial embedding map at each pixel $i$. However, in practice, we find the nearest vertex embedding for each regressed embedding $E_i$, and transform the 27K vertex embeddings to $\mathcal{W}$. This results in a mapping network whose cost is independent of image resolution. In addition, if we want to map a static $z$ at inference time for several images, we only need a single forward pass of the mapping network. Finally, we note that the computational complexity is strongly dependent on the dimensionality of $\mathcal{W}$, and we further ablate this in Appendix C.
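A sketch of this inference-time shortcut: run the mapping network once over the canonical vertex embeddings for a fixed z, then build the per-pixel w map by indexing with the pixel-to-vertex correspondences (names and interfaces are ours).

```python
import torch

@torch.no_grad()
def precompute_w(mapping_mlp, vertex_embeddings, z):
    """Run the mapping network once over all canonical vertices (sketch).

    vertex_embeddings: (N, D) with N ~ 27K; z: (z_dim,) a fixed latent code;
    mapping_mlp: the fully-connected view of f, operating on (N, D + z_dim).
    """
    z_rep = z[None].expand(vertex_embeddings.shape[0], -1)
    return mapping_mlp(torch.cat([vertex_embeddings, z_rep], dim=1))   # (N, w_dim)


def w_map_from_indices(w_vertices, vertex_idx):
    """Build the per-pixel modulation map by looking up each pixel's nearest vertex.

    w_vertices: (N, w_dim); vertex_idx: (H, W) indices from nearest-neighbour search.
    Returns an (H, W, w_dim) map, reusable for every image that shares the same z.
    """
    return w_vertices[vertex_idx]
```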

4.2 Semantic vs. Surface Guidance

We now compare Surface-Guided GANs to their semantic-guided counterparts, and highlight that the simple and high-fidelity representation of surface guidance significantly improves image quality.

SAM vs. Semantic Modulation

V-SAM slightly improves image quality compared to semantic-based modulation (Table 4). However, we observe a considerable difference in high-fidelity details (Figure 4), which is reflected by the FID of the face region. Notice that CLADE [52] and INADE [51], which remove the spatial adaptiveness of SPADE [43], significantly degrade image quality, reflecting that purely semantic modulation is not suitable for human synthesis. Furthermore, V-SAM improves affine invariance (Table 3) and generator disentanglement (PPL, Table 4) in comparison to semantic modulation.

Method       LPIPS   FID   PPL    Face FID
CLADE [52]   0.224   5.2   24.1   8.9
INADE [51]   0.225   5.4   23.8   9.4
SPADE [43]   0.219   5.0   23.6   8.7
SAM          0.217   4.8   22.0   7.8
V-SAM        0.217   4.5   18.1   7.1

Table 4: Comparison of V-SAM to semantic modulation methods. All modulation methods are applied on top of config B.

Semantic vs. Surface Supervision

Surface supervision generally improves over the equivalent semantic supervision (i.e. supervising the discriminator with a semantic cross-entropy loss) for both semantic and surface modulation (Table 5). We find it interesting that V-SAM benefits much more from surface supervision than SPADE does. We speculate that this is caused by a shift in the training dynamics, where the discriminator learns fine-grained surface information and easily identifies generated images of SPADE, as they are not coherent with the surface.

Generator Conditioning    Supervision   Face FID   LPIPS Diversity   FID   PPL
Concatenate Semantic*     Semantic      10.9       0.152             7.4   45.1
Concatenate CSE*          CSE           12.3       0.157             7.3   40.4
SPADE [43]                Semantic      9.5        0.142             5.2   14.0
SPADE [43]                CSE           8.7        0.164             5.0   23.6
V-SAM                     Semantic      10.2       0.172             6.0   19.8
V-SAM                     CSE           7.1        0.173             4.5   18.1

Table 5: Comparison of different generator conditioning methods in combination with semantic/surface discriminator supervision. *CSE/Semantic map is concatenated with input image.

4.3 Synthesis of Humans in Static Scenes

Figure 6: V-SAM can transfer attributes between poses by simply sampling the same latent variable $z$. Each row shows synthesized images with the same latent variable, but different input poses.

We demonstrate that V-SAM excels at human synthesis on the DeepFashion [33] dataset. Following the design of SPADE [43], we design a decoder-only generator that synthesizes humans independent of any background image (details in Appendix A). Combining this architecture with V-SAM yields a highly practical generator.

The disentangled and spatially-invariant latent space of V-SAM allows the generator to transfer attributes between poses. By sampling the same latent variable for different poses, V-SAM is able to perform pose/motion transfer of synthesized humans (Figure 6) without any task-specific modeling choices (e.g. including a texture encoder [58]). However, V-SAM is not invariant to 3D transformations that are not parallel to the imaging plane (e.g. changing the depth of the scene). This is reflected in Figure 6, where changing the depth of the scene significantly changes the synthesized person. We believe that combining V-SAM with task-specific modeling choices from the pose/motion transfer literature [58, 36] could resolve these issues.
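A sketch of this attribute transfer: reuse one latent code across different pose embeddings (the generator interface is a placeholder).

```python
import torch

@torch.no_grad()
def transfer_attributes(generator, pose_embeddings, z_dim=512):
    """Render the same appearance for several poses, as in one row of Figure 6
    (sketch; generator(E, z) is an assumed interface)."""
    z = torch.randn(1, z_dim)
    return [generator(E[None], z) for E in pose_embeddings]   # E: (D, H, W) per pose
```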

4.4 Effect of Anonymization for Computer Vision

Figure 7: Different anonymization methods for an image in the COCO [32] validation set: (a) original, (b), (c) pixelation, (d) masked out, (e) ours. Appendix D includes random examples.

Validation Dataset   AP     AP50   AP75   APS    APM    APL    AP (person)
Original             37.2   58.6   39.9   18.6   39.5   53.3   47.7
Mask Out             32.8   52.0   35.1   16.3   34.6   47.3   27.5
Pixelation           32.8   51.8   35.2   16.4   34.6   47.2   33.3
Pixelation           33.4   53.0   35.7   16.7   35.0   48.1   38.4
Ours                 34.6   55.0   37.0   17.1   36.8   50.0   44.9

Table 6: Instance segmentation mask AP on the COCO validation set [32]. The results are from a pre-trained Mask R-CNN [17] R50-FPN-3x from detectron2 [55] evaluated on different anonymized datasets.

We analyze the effect of anonymization on future computer vision development by evaluating a pre-trained Mask R-CNN [17] on the COCO dataset (results on PASCAL VOC [9] are included in Appendix B). We anonymize all individuals that are detected by a pre-trained CSE-detector [41], using all detections above a fixed confidence threshold. We compare our framework to traditional anonymization methods (Figure 7).
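For reference, the traditional baselines in Figure 7 can be sketched in a few lines; the pixelation block size is a placeholder, not the value used in the paper.

```python
import torch
import torch.nn.functional as F

def pixelate(region, block=16):
    """Pixelation baseline: average-pool the region and upsample it back.
    region: (3, h, w) tensor; block size is a placeholder."""
    c, h, w = region.shape
    small = F.adaptive_avg_pool2d(region[None], (max(1, h // block), max(1, w // block)))
    return F.interpolate(small, size=(h, w), mode="nearest")[0]

def mask_out(region):
    """Mask-out baseline: replace the region with a constant."""
    return torch.zeros_like(region)
```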

Our method significantly improves over traditional anonymization (Table 6), including pixelation, which is known to be questionable for anonymization [14, 39]. However, we observe a notable drop in average precision for other object classes, which originates from two sources of error. First, full-body anonymization removes objects that often appear together with human figures. For example, the "tie" class drops from 31% AP to 1% and "toothbrush" drops from 14.6% to 6.2%. Secondly, the detections include false positives, yielding highly corrupted images when these are anonymized. For example, the "zebra" class drops from 56.2% to 48.0%. We observe insignificant degradation for objects that are rarely detected as a person (e.g. car, train, elephant). Finally, surface-guided anonymization improves over traditional techniques for training purposes, which we validate on the anonymized COCO dataset (Appendix B).

5 Conclusion

We present a novel full-body anonymization framework that generates close-to-photorealistic and diverse humans in varying and complex scenes. Our experiments show that guiding adversarial nets with dense pixel-to-surface correspondences strongly improves synthesis of high-fidelity textures for varying poses and scenes. Finally, we demonstrate that our anonymization framework better retains the usability of data for future computer vision development, compared to traditional anonymization.

Limitations

Our contributions significantly improve the usability of anonymized data and generate new identities independent of the original. However, our method has limitations that can compromise the privacy of individuals. As with any anonymization method, our method relies on detection, which is far from perfect (current CSE-based detectors (R-101-FPN-DL-s1x [55]) have an average recall of 96.65% (AR50) for human segmentation on COCO-DensePose [15], which contains primarily high-resolution human figures) and vulnerable to adversarial attacks. Detection is improving every year, and defense against adversarial attacks is currently a large focus in the community [29]. We believe that potential errors in detection can be circumvented with face detection as a fallback.

With the assumption of perfect detections, identification is still possible through gait recognition (when anonymizing videos), or through identity leaks in the CSE-embeddings. We speculate that gait recognition can be mitigated by slightly randomizing the original pose between frames. Furthermore, identity leaking through surface embeddings is possible, as they are regressed from the original image and could include identifying information. We reduce this possibility by discretizing the regressed embedding into one of the 27K vertex-specific embeddings (Section 3).

Surface-guided GANs significantly improve human figure synthesis for in-the-wild image anonymization. Nevertheless, human synthesis is a complicated task, and many of the images generated by our method are recognizable as artificial by a human evaluator. One of the limiting factors of our model is the dataset, where COCO-Body contains roughly 40K images with a large variety. This is relatively small compared to the 70K images in FFHQ [27], which covers a considerably simpler task. Our method applies data augmentation to mitigate this; however, further extension with adaptive augmentation [25] or transfer learning could be fruitful.

Societal Impact

We live in the age of Big Data, where personal information is the business model of many companies. Recently, regulators have introduced legislation that complicates data collection, requiring consent to store any data that contains personal information. This can be viewed as a barrier to research and development, especially for the data-dependent field of computer vision. We present a method that can better preserve the privacy of individuals while retaining the usability of the data. Nevertheless, our work focuses on the synthesis of realistic humans, which has a potential for misuse. The typical example is DeepFakes, where generative models are used to create manipulated content with an intention to misinform. Several solutions have been proposed: the DeepFake Detection Challenge [8] has increased the ability of models to detect manipulated content, and pre-emptive solutions such as model watermarking [57] can mitigate the potential for misuse.

References

  • [1] Badour AlBahar, Jingwan Lu, Jimei Yang, Zhixin Shu, Eli Shechtman, and Jia-Bin Huang. Pose with style: Detail-preserving pose-guided image synthesis with conditional stylegan. arXiv preprint arXiv:2109.06166, 2021.
  • [2] Thangapavithraa Balaji, Patrick Blies, Georg Göri, Raphael Mitsch, Marcel Wasserer, and Torsten Schön. Temporally coherent video anonymization through gan inpainting. arXiv preprint arXiv:2106.02328, 2021.
  • [3] Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. Synthesizing images of humans in unseen poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8340–8348, 2018.
  • [4] Karla Brkic, Ivan Sikiric, Tomislav Hrkac, and Zoran Kalafatic. I know that person: Generative full body and face de-identification of people in images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1319–1328. IEEE, 2017.
  • [5] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations, 2019.
  • [6] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5933–5942, 2019.
  • [7] Bindita Chaudhuri, Nikolaos Sarafianos, Linda Shapiro, and Tony Tung. Semi-supervised synthesis of high-resolution editable textures for 3d humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7991–8000, 2021.
  • [8] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge dataset. arXiv e-prints, pages arXiv–2006, 2020.
  • [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
  • [10] Oran Gafni, Lior Wolf, and Yaniv Taigman. Live face de-identification in video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9378–9387, 2019.
  • [11] Ross Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, dec 2015.
  • [12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [13] Artur Grigorev, Artem Sevastopolsky, Alexander Vakhitov, and Victor Lempitsky. Coordinate-based texture inpainting for pose-guided human image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12135–12144, 2019.
  • [14] R. Gross, L. Sweeney, F. de la Torre, and S. Baker. Model-based face de-identification. In 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06). IEEE, 2006.
  • [15] Riza Alp Guler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, jun 2018.
  • [16] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7543–7552, 2018.
  • [17] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [19] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, oct 2017.
  • [20] Håkon Hukkelås, Frank Lindseth, and Rudolf Mester. Image inpainting with learnable feature imputation. arXiv preprint arXiv:2011.01077, 2020.
  • [21] Håkon Hukkelås, Rudolf Mester, and Frank Lindseth. Deepprivacy: A generative adversarial network for face anonymization. In Advances in Visual Computing, pages 565–578. Springer International Publishing, 2019.
  • [22] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, jul 2017.
  • [23] Anil Jain, Patrick Flynn, and Arun Ross. Handbook of Biometrics. 2008.
  • [24] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
  • [25] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Proc. NeurIPS, 2020.
  • [26] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. arXiv preprint arXiv:2106.12423, 2021.
  • [27] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  • [28] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. arXiv preprint arXiv:1912.04958, 2019.
  • [29] Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, et al. Adversarial attacks and defences competition. In The NIPS’17 Competition: Building Intelligent Systems, pages 195–231. Springer, 2018.
  • [30] Verica Lazova, Eldar Insafutdinov, and Gerard Pons-Moll. 360-degree textures of people in clothing from a single image. In 2019 International Conference on 3D Vision (3DV), pages 643–653. IEEE, 2019.
  • [31] Yining Li, Chen Huang, and Chen Change Loy. Dense intrinsic appearance flow for human pose transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3693–3702, 2019.
  • [32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [33] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [34] Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, and Mario Fritz. Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 99–108, 2018.
  • [35] Maxim Maximov, Ismail Elezi, and Laura Leal-Taixé. Ciagan: Conditional identity anonymization generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5447–5456, 2020.
  • [36] Yifang Men, Yiming Mao, Yuning Jiang, Wei-Ying Ma, and Zhouhui Lian. Controllable person image synthesis with attribute-decomposed gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5084–5093, 2020.
  • [37] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Which training methods for gans do actually converge? In International Conference on Machine Learning (ICML), 2018.
  • [38] Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo Morishima. Siclope: Silhouette-based clothed people. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4480–4490, 2019.
  • [39] Carman Neustaedter, Saul Greenberg, and Michael Boyle. Blur filtration fails to preserve privacy for home-based video conferencing. ACM Transactions on Computer-Human Interaction, 13(1):1–36, mar 2006.
  • [40] Natalia Neverova, Riza Alp Guler, and Iasonas Kokkinos. Dense pose transfer. In Proceedings of the European conference on computer vision (ECCV), pages 123–138, 2018.
  • [41] Natalia Neverova, David Novotny, Marc Szafraniec, Vasil Khalidov, Patrick Labatut, and Andrea Vedaldi. Continuous surface embeddings. Advances in Neural Information Processing Systems, 33, 2020.
  • [42] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2642–2651. JMLR.org, 2017.
  • [43] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, jun 2019.
  • [44] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [45] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2304–2314, 2019.
  • [46] Kripasindhu Sarkar, Vladislav Golyanik, Lingjie Liu, and Christian Theobalt. Style and pose control for image synthesis of humans from a single monocular view. arXiv preprint arXiv:2102.11263, 2021.
  • [47] Kripasindhu Sarkar, Dushyant Mehta, Weipeng Xu, Vladislav Golyanik, and Christian Theobalt. Neural re-rendering of humans from a single image. In European Conference on Computer Vision, pages 596–613. Springer, 2020.
  • [48] Edgar Schönfeld, Vadim Sushko, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You only need adversarial supervision for semantic image synthesis. In International Conference on Learning Representations, 2020.
  • [49] Qianru Sun, Liqian Ma, Seong Joon Oh, Luc Van Gool, Bernt Schiele, and Mario Fritz. Natural and effective obfuscation by head inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5050–5059, 2018.
  • [50] Qianru Sun, Ayush Tewari, Weipeng Xu, Mario Fritz, Christian Theobalt, and Bernt Schiele. A hybrid model for identity obfuscation by face replacement. In Proceedings of the European Conference on Computer Vision (ECCV), pages 553–569, 2018.
  • [51] Zhentao Tan, Menglei Chai, Dongdong Chen, Jing Liao, Qi Chu, Bin Liu, Gang Hua, and Nenghai Yu. Diverse semantic image synthesis via probability distribution modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7962–7971, 2021.
  • [52] Zhentao Tan, Dongdong Chen, Qi Chu, Menglei Chai, Jing Liao, Mingming He, Lu Yuan, Gang Hua, and Nenghai Yu. Efficient semantic image synthesis via class-adaptive normalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021.
  • [53] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, jun 2018.
  • [54] Chung-Yi Weng, Brian Curless, and Ira Kemelmacher-Shlizerman. Vid2actor: Free-viewpoint animatable person synthesis from video in the wild. arXiv preprint arXiv:2012.12884, 2020.
  • [55] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
  • [56] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, jun 2018.
  • [57] Ning Yu, Vladislav Skripniuk, Sahar Abdelnabi, and Mario Fritz. Artificial fingerprinting for generative models: Rooting deepfake attribution in training data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14448–14457, 2021.
  • [58] Jinsong Zhang, Kun Li, Yu-Kun Lai, and Jingyu Yang. Pise: Person image synthesis and editing with decoupled gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7982–7990, 2021.
  • [59] Richard Zhang. Making convolutional networks shift-invariant again. In International conference on machine learning, pages 7324–7334. PMLR, 2019.
  • [60] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • [61] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in neural information processing systems, pages 465–476, 2017.
  • [62] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. Sean: Image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5104–5113, 2020.