We propose a novel architecture which is able to automatically anonymize faces in images while retaining the original data distribution. We ensure total anonymization of all faces in an image by generating images exclusively on privacy-safe information. Our model is based on a conditional generative adversarial network, generating images considering the original pose and image background. The conditional information enables us to generate highly realistic faces with a seamless transition between the generated face and the existing background. Furthermore, we introduce a diverse dataset of human faces, including unconventional poses, occluded faces, and a vast variability in backgrounds. Finally, we present experimental results reflecting the capability of our model to anonymize images while preserving the data distribution, making the data suitable for further training of deep learning models. As far as we know, no other solution has been proposed that guarantees the anonymization of faces while generating realistic images.
Privacy-preserving data processing is becoming more critical every year; however, no suitable solution has been found to anonymize images without degrading the image quality. The General Data Protection Regulation (GDPR) came into effect on 25 May 2018, affecting all processing of personal data across Europe. GDPR requires regular consent from the individual for any use of their personal data. However, if the data does not allow an individual to be identified, companies are free to use the data without consent. To effectively anonymize images, we require a robust model that replaces the original face without destroying the existing data distribution; that is, the output should be a realistic face fitting the given situation.
Anonymizing images, while retaining the original distribution, is a challenging task. The model is required to remove all privacy-sensitive information, generate a highly realistic face, and the transition between original and anonymized parts has to be seamless. This requires a model that can perform complex semantic reasoning to generate a new anonymized face. For practical use, we desire the model to be able to manage a broad diversity of images, poses, backgrounds, and different persons. Our proposed solution can successfully anonymize images in a large variety of cases, and create realistic faces to the given conditional information.
Our proposed model, called DeepPrivacy, is a conditional generative adversarial network [3, 18]. Our generator considers the existing background and a sparse pose annotation to generate realistic anonymized faces. The generator has a U-net architecture that generates images with a resolution of $128 \times 128$. The model is trained with a progressive growing training technique from a starting resolution of $8 \times 8$ to $128 \times 128$, which substantially improves the final image quality and overall training time. By design, our generator never observes the original face, ensuring removal of any privacy-sensitive information.
For practical use, we assume no demanding requirements for the object and keypoint detection methods. Our model requires two simple annotations of the face: (1) a bounding box annotation to identify the privacy-sensitive area, and (2) a sparse pose estimation of the face, containing keypoints for the ears, eyes, nose, and shoulders; in total seven keypoints. This keypoint annotation is identical to what Mask R-CNN provides.
We provide a new dataset of human faces, Flickr Diverse Faces (FDF), which consists of 1.47M faces with a bounding box and keypoint annotation for each face. This dataset covers a large diversity of facial poses, partial occlusions, complex backgrounds, and different persons. We will make this dataset publicly available along with our source code and pre-trained networks (code: www.github.com/hukkelas/DeepPrivacy; FDF dataset: www.github.com/hukkelas/FDF).
We evaluate our model by performing an extensive qualitative and quantitative study of the model’s ability to retain the original data distribution. We anonymize the validation set of the WIDER-Face dataset, then run face detection on the anonymized images to measure the impact of anonymization on Average Precision (AP). DSFD achieves ( out of AP), (), and () of the original AP on the easy, medium, and hard difficulty, respectively. On average, it achieves of the original AP. In contrast, among traditional anonymization techniques, pixelation achieves , heavy blur , and black-out of the original performance. Additionally, we present several ablation experiments that reflect the importance of a large model size and conditional pose information to generate high-quality faces.
In summary, we make the following contributions:
We propose a novel generator architecture to anonymize faces, which ensures 100% removal of privacy-sensitive information in the original face. The generator can generate realistic looking faces that have a seamless transition to the existing background for various sets of poses and contexts.
We provide the FDF dataset, including 1.47M faces with a tight bounding box and keypoint annotation for each face. The dataset covers a considerably larger diversity of faces compared to previous datasets.
De-Identifying Faces: Only a limited number of research studies address the task of removing privacy-sensitive information from an image that includes a face. Typically, the approach chosen is to alter the original image such that all privacy-sensitive information is removed. These methods can be applied to all images; however, there is no assurance that they remove all privacy-sensitive information. Naive methods that apply simple image distortion, such as pixelation and blurring, have been discussed numerous times in the literature [1, 19, 5, 20, 4], but they are inadequate for removing the privacy-sensitive information [4, 19, 20], and they alter the data distribution substantially.
The k-same family of algorithms [4, 11, 20] implements the k-anonymity algorithm for face images. Newton et al. prove that the k-same algorithm can remove all privacy-sensitive information, but the resulting images often contain “ghosting” artifacts due to small alignment errors.
Jourabloo et al. look at the task of de-identifying grayscale images while preserving a large set of facial attributes. This is different from our work, as we do not directly train our generative model to generate faces with similar attributes to the original image. In contrast, our model is able to perform complex semantic reasoning to generate a face that is coherent with the overall context information given to the network, yielding a highly realistic face.
Generative Adversarial Networks (GANs) are a highly successful training architecture for modeling a natural image distribution. GANs enable us to generate new images, often indistinguishable from real data. They have a broad diversity of application areas, including general image generation [2, 12, 13, 30], text-to-photo generation, style transfer [8, 24], and much more. With the numerous contributions since their conception, GANs have gone from a beautiful theoretical idea to a tool we can apply for practical use cases. In our work, we show that GANs are an efficient tool to remove privacy-sensitive information without destroying the original image quality.
Ren et al.  look at the task of anonymizing video data by using GANs. They perform anonymization by altering each pixel in the original image to hide the identity of the individuals. In contrast to their method, we can ensure the removal of all privacy-sensitive information, as our generative model never observes the original face.
Progressive Growing of GANs  propose a novel training technique to generate faces progressively, starting from a resolution of and step-wise increasing it to . This training technique improves the final image quality and overall training time. Our proposed model uses the same training technique; however, we perform several alterations to their original model to convert it to a conditional GAN. With these alterations, we can include conditional information about the context and pose of the face. Our final generator architecture is similar to the one proposed by Isola et al. , but we introduce conditional information in several stages.
Image Inpainting is a closely related task to what we are trying to solve, and it is a widely researched area for generative models [10, 15, 17, 29]. Several research studies have looked at the task of face completion with a generative adversarial network [15, 29]. They mask a specific part of the face and try to complete this part with the conditional information given. From our knowledge, and the qualitative experiments they present in their papers, they are not able to mask a large enough section to remove all privacy-sensitive information. As the masked region grows, it requires a more advanced generative model that understands complex semantic reasoning, making the task considerably harder. Also, their experiments are based on the Celeb-A dataset , primarily consisting of celebrities with low diversity in facial pose, making models trained on this dataset unsuitable for real-world applications.
FDF (Flickr Diverse Faces) is a new dataset of human faces, crawled from the YFCC-100M dataset . It consists of 1.47M human faces with a minimum resolution of , containing facial keypoints and a bounding box annotation for each face. The dataset has a vast diversity in terms of age, ethnicity, facial pose, image background, and face occlusion. Randomly picked examples from the dataset can be seen in Figure 2. The dataset is extracted from scenes related to traffic, sports events, and outside activities. In comparison to the FFHQ  and Celeb-A  datasets, our dataset is more diverse in facial poses and it contains significantly more faces; however, the FFHQ dataset has a higher resolution.
The FDF dataset is a high-quality dataset with few annotation errors. The faces are automatically labeled with state-of-the-art keypoint and bounding box models, and we use a high confidence threshold for both the keypoint and bounding box predictions. The faces are extracted from images in the YFCC-100M dataset. For keypoint estimation, we use Mask R-CNN with a ResNet-50 FPN backbone. For bounding box annotation, we use the Single Shot Scale-invariant Face Detector. To combine the predictions, we match a keypoint detection with a face bounding box if the eye and nose annotations lie within the bounding box. Each bounding box and keypoint detection has a single match, and we match them with a greedy approach based on descending prediction confidence.
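The greedy matching step can be sketched as follows. This is a minimal illustration, not the released implementation: the `Box` and `Pose` containers are hypothetical, and ranking candidate pairs by the product of the two confidences is an assumption, since the text only states that matching proceeds in descending prediction confidence.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical face bounding box with a detector confidence."""
    x0: float
    y0: float
    x1: float
    y1: float
    score: float

    def contains(self, point):
        x, y = point
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass
class Pose:
    """Hypothetical keypoint detection; only eyes and nose are used for matching."""
    left_eye: tuple
    right_eye: tuple
    nose: tuple
    score: float


def greedy_match(boxes, poses):
    """Return (box_index, pose_index) pairs; every index is used at most once."""
    candidates = []
    for bi, box in enumerate(boxes):
        for pi, pose in enumerate(poses):
            inside = all(
                box.contains(p) for p in (pose.left_eye, pose.right_eye, pose.nose)
            )
            if inside:
                # Assumption: rank a pair by the product of both confidences.
                candidates.append((box.score * pose.score, bi, pi))

    candidates.sort(reverse=True)  # most confident pairs first
    used_boxes, used_poses, matches = set(), set(), []
    for _, bi, pi in candidates:
        if bi in used_boxes or pi in used_poses:
            continue  # one of the two already has a better match
        used_boxes.add(bi)
        used_poses.add(pi)
        matches.append((bi, pi))
    return matches
```

Because each box and each pose is consumed once, an overlapping box cannot claim a pose that a more confident box has already taken.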
Our proposed model is a conditional GAN, generating images based on the surrounding of the face and sparse pose information. Figure 1 shows the conditional information given to our network, and Appendix A has a detailed description of the pre-processing steps. We base our model on the one proposed by Karras et al. . Their model is a non-conditional GAN, and we perform several alterations to include conditional information.
We use seven keypoints to describe the pose of the face: left/right eye, left/right ear, left/right shoulder, and nose. To reduce the number of parameters in the network, we pre-process the pose information into a one-hot encoded image of size $K \times M \times M$, where $K$ is the number of keypoints and $M$ is the target resolution.
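A one-hot pose image of this kind holds one channel per keypoint, with a single active pixel at the keypoint location. The sketch below assumes keypoints are given as `(x, y)` coordinates normalized to the face crop and that undetected keypoints are passed as `None`; both conventions, and the function name, are illustrative assumptions.

```python
import numpy as np


def encode_pose(keypoints, resolution):
    """Encode keypoints as a (num_keypoints, resolution, resolution) one-hot image.

    keypoints: sequence of (x, y) pairs in [0, 1) relative to the crop,
               or None for a keypoint that was not detected (assumption).
    """
    pose = np.zeros((len(keypoints), resolution, resolution), dtype=np.float32)
    for k, keypoint in enumerate(keypoints):
        if keypoint is None:
            continue  # leave the channel empty
        x, y = keypoint
        # Clamp so that keypoints slightly outside the crop land on the border.
        col = min(max(int(x * resolution), 0), resolution - 1)
        row = min(max(int(y * resolution), 0), resolution - 1)
        pose[k, row, col] = 1.0
    return pose
```

Calling `encode_pose` once per resolution (8, 16, and so on) yields progressively finer pose maps from the same keypoints.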
The progressive growing training technique is crucial for our model’s success. We apply progressive growing to both the generator and discriminator to grow the networks from a starting resolution of $8 \times 8$. We double the resolution each time we expand our network until we reach the final resolution of $128 \times 128$. The pose information is included for each resolution in the generator and discriminator, making the pose information finer for each increase in resolution.
Figure 3 shows our proposed generator architecture for resolution. Our generator has a U-net  architecture to include background information. The encoder and decoder have the same number of filters in each convolution, but the decoder has an additional bottleneck convolution after each skip connection. This bottleneck design reduces the number of parameters in the decoder significantly. To include the pose information for each resolution, we concatenate the output after each upsampling layer with pose information and the corresponding skip connection. The general layer structure is identical to Karras et al. , where we use pixel replication for upsampling, pixel normalization and LeakyReLU after each convolution, and equalized learning rate instead of careful weight initialization.
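The decoder-side operations described here can be sketched in NumPy. The function names and the `(channels, height, width)` layout are assumptions made for illustration; pixel normalization follows the definition of Karras et al., which scales each pixel's feature vector to unit root-mean-square across channels.

```python
import numpy as np


def upsample_nearest(x):
    """Pixel replication: double the height and width of a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def pixel_norm(x, eps=1e-8):
    """Scale each pixel's feature vector to unit RMS across the channel axis."""
    return x / np.sqrt(np.mean(x ** 2, axis=0, keepdims=True) + eps)


def decoder_input(features, skip, pose):
    """Concatenate upsampled features, the encoder skip connection, and the pose map."""
    upsampled = upsample_nearest(features)
    return np.concatenate([upsampled, skip, pose], axis=0)
```

The concatenation widens the channel axis, which is why a bottleneck convolution after each skip connection helps keep the decoder's parameter count down.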
Progressive Growing: Each time we increase the resolution of the generator, we add two convolutions to the start of the encoder and the end of the decoder. We use a transition phase identical to Karras et al.  for both of these new blocks, making the network stable throughout training. We note that the network is still unstable during the transition phase, but it is significantly better compared to training without progressive growing.
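In the transition phase of Karras et al., the output of a newly added block is blended linearly with the output of the previous, lower-resolution path. A minimal sketch follows; the function names are illustrative, and the transition length of 1.2M images is taken from the training details in the appendix.

```python
import numpy as np


def transition_alpha(images_seen, transition_images=1_200_000):
    """Blend weight that rises linearly from 0 to 1 over the transition phase."""
    return min(1.0, images_seen / transition_images)


def fade_in(old_path, new_path, alpha):
    """Blend the resized old-resolution output with the new block's output."""
    return (1.0 - alpha) * old_path + alpha * new_path
```

At `alpha = 0` the network behaves exactly as before the expansion, and at `alpha = 1` only the new block is used, after which the stabilization phase begins.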
Our proposed discriminator architecture is identical to the one proposed by Karras et al., with a few exceptions. First, we include the background information as conditional input to the start of the discriminator, making the input image have six channels instead of three. Second, we include pose information at each resolution of the discriminator. The pose information is concatenated with the output of each downsampling layer, similar to the decoder in the generator. Finally, we remove the mini-batch standard deviation layer presented by Karras et al., as we find the diversity of our generated faces satisfactory.
The adjustments made to the generator double the total number of parameters in the network. To follow the design lines of Karras et al., we want the number of parameters to be similar for the discriminator and generator. We evaluate two different discriminator models, which we name the deep discriminator and the wide discriminator. The deep discriminator doubles the number of convolutional layers for each resolution. To mimic the skip connections in the generator, we wrap the convolutions for each resolution in residual blocks. The wide discriminator keeps the same architecture; however, we increase the number of filters in each convolutional layer by a factor of .
DeepPrivacy can robustly generate anonymized faces for a vast diversity of poses, backgrounds, and different persons. From qualitative evaluations of our generated results on the WIDER-Face dataset , we find our proposed solution to be robust to a broad diversity of images. Figure 4 shows several results of our proposed solution on the WIDER-Face dataset. Note that the network is trained on the FDF dataset; we do not train on any images in the WIDER-Face dataset.
We evaluate the impact of anonymization on the WIDER-Face  dataset. We measure the AP of a face detection model on the anonymized dataset and compare this to the original dataset. We report the standard metrics for the different difficulties for WIDER-Face. Additionally, we perform several ablation experiments on our proposed FDF dataset.
Our final model is trained for 17 days (40M images), until we observe no qualitative differences between consecutive training iterations. It converges to a Fréchet Inception Distance (FID) of . Specific training details and input pre-processing are given in Appendix A.
|No Anonymization |
|9x9 Gaussian Blur (|
|Heavy Blur (filter size = 30% face width)|
Table 1 shows the AP of different anonymization techniques on the WIDER-Face validation set. In comparison to the original dataset, DeepPrivacy only degrades the AP by , , and on the easy, medium, and hard difficulties, respectively.
We compare DeepPrivacy anonymization to simpler anonymization methods: black-out, pixelation, and blurring. Figure 5 illustrates the different anonymization methods. DeepPrivacy generally achieves a higher AP compared to all other methods, with the exception of pixelation.
Note that pixelation does not affect a majority of the faces in the dataset. For the “hard” challenge, of the faces have a resolution larger than . For the easy and medium challenges, and have a resolution larger than . The observant reader might notice that for the “hard” challenge, pixelation should have no effect; however, the AP is degraded in comparison to the original dataset (see Table 1). We believe that the AP on the “hard” challenge is degraded because anonymizing faces in the easy/medium challenges can affect the detector in images that contain both “hard” and easy/medium faces.
Experiment Details: For the face detector we use the current state-of-the-art, Dual Shot Face Detector (DSFD) . The WIDER-Face dataset has no facial keypoint annotations; therefore, we automatically detect keypoints for each face with the same method as used for the FDF dataset. To match keypoints with a bounding box, we use the same greedy approach as earlier. Mask R-CNN  is not able to detect keypoints for all faces, especially in cases with high occlusion, low resolution, or faces turned away from the camera. Thus, we are only able to anonymize of the faces in the validation set. Of the faces that are not anonymized, are partially occluded, and are heavily occluded. For the remaining non-anonymized faces, has a resolution smaller than . Note that for each experiment in Table 1, we anonymize the same bounding boxes.
We perform several ablation experiments to evaluate the model architecture choices. We report the Fréchet Inception Distance between the original images and the anonymized images for each experiment. We calculate FID from a validation set of faces from the FDF dataset. The results are shown in Table 2 and discussed in detail next.
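For reference, the Fréchet Inception Distance is conventionally computed by fitting a Gaussian to the Inception features of each image set and measuring the Fréchet distance between the two Gaussians. The notation below is introduced here for illustration: $\mu_r, \Sigma_r$ denote the feature mean and covariance of the original images, and $\mu_g, \Sigma_g$ those of the anonymized images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

A lower value indicates that the anonymized images are statistically closer to the originals in feature space.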
Effect of Pose Information: Providing the pose of the face as conditional information improves our model significantly, as seen in Table 2(a). The FDF dataset has a large variance of faces in different poses, and we find it necessary to include sparse pose information to generate realistic faces. In contrast, when trained on the Celeb-A dataset, our model completely ignores the given pose information.
Discriminator Architecture: Table 2(b) compares the quality of images for a deep and a wide discriminator. With a deeper network, the discriminator struggles to converge, leading to poor results. We use no normalization layers in the discriminator, causing deeper networks to suffer from exploding forward passes and vanishing gradients. Brock et al. observe similar results: a deeper network architecture degrades the overall image quality. Note that we also experimented with a discriminator with no modifications to the number of parameters, but with it the model was not able to generate realistic faces.
Model Size: We empirically observe that increasing the number of filters in each convolution improves image quality drastically. As seen in Table 2(c), we train two models with and parameters. In our experiments, increasing the number of parameters improves the image quality. For both experiments, we use the same hyperparameters; the only thing changed is the number of filters in each convolution.
Our method generates high-quality images for a diversity of backgrounds and poses. However, it still struggles in several challenging scenarios. Figure 6 illustrates some of these. These issues can impact the generated image quality, but, by design, our model ensures the removal of all privacy-sensitive information from the face.
Faces occluded with high fidelity objects are extremely challenging when generating a realistic face. For example, in Figure 6, several images have persons covering their faces with hands. To generate a face in this scenario requires complex semantic reasoning, which is still a difficult challenge for GANs.
Handling non-traditional poses can cause our model to generate corrupted faces. We use a sparse pose estimation to describe the face pose, but there is no limitation in our architecture to include a dense pose estimation. A denser pose estimation would, most likely, improve the performance of our model in cases of irregular poses. However, this would set restrictions on the pose estimator and restrict the practical use case of our method.
We propose a conditional generative adversarial network, DeepPrivacy, to anonymize faces in images without destroying the original data distribution. The presented results on the WIDER-Face dataset reflect our model’s capability to generate high-quality images. Also, the diversity of images in the WIDER-Face dataset shows the practical applicability of our model. The current state-of-the-art face detection method can achieve of the original average precision on the anonymized WIDER-Face validation set. In comparison to previous solutions, this is a significant improvement to both the generated image quality and the certainty of anonymization. Furthermore, the presented ablation experiments on the FDF dataset suggest that a larger model size and the inclusion of sparse pose information are necessary to generate high-quality images.
DeepPrivacy is a conceptually simple generative adversarial network, easily extendable for further improvements. Handling irregular poses, difficult occlusions, complex backgrounds, and temporal consistency in videos is still a subject for further work. We believe our contribution will be an inspiration for further work into ensuring privacy in visual data.
We use the same hyperparameters as Karras et al. , except the following: We use a batch size of 256, 256, 128, 72 and 48 for resolution 8, 16, 32, 64, and 128. We use a learning rate of 0.00175 with the Adam optimizer. For each expansion of the network, we have a transition and stabilization phase of 1.2M images each. We use an exponential running average for the weights of the generator as this improves overall image quality . For the running average, we use a decay given by:
where is the batch size. Our final model was trained for 17 days on two NVIDIA V100-32GB GPUs.
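An exponential running average of the generator weights can be sketched as below. The half-life form of the decay and its default of 10,000 images are assumptions chosen for illustration, not the formula used for the reported model; the weights are represented as a plain dictionary of floats or arrays to keep the example framework-independent.

```python
def ema_decay(batch_size, half_life_images=10_000):
    """Hypothetical decay: an old weight's contribution halves every
    `half_life_images` training images (assumed parametrization)."""
    return 0.5 ** (batch_size / half_life_images)


def ema_update(average, current, decay):
    """Return the updated running average of every named weight."""
    return {
        name: decay * average[name] + (1.0 - decay) * current[name]
        for name in average
    }
```

Tying the decay to the batch size keeps the averaging window roughly constant in terms of images seen, even though the batch size changes with resolution.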
Figure 7 shows the input pre-processing pipeline. For each detected face with a bounding box and keypoint detection, we find the smallest possible square bounding box which surrounds the face bounding box. Then, we resize the expanded bounding box to the target size (). We replace the pixels within the face bounding box with a constant pixel value of . Finally, we shift the pixel values to the range .
To utilize tensor cores in NVIDIA’s new Volta architecture, we make several modifications to our network, following the requirements of tensor cores. First, we ensure that each convolutional block uses a number of filters that is divisible by 8. Second, we make certain that the batch size for each GPU is divisible by 8. Further, we use automatic mixed precision for PyTorch to significantly improve our training time. We see an improvement of in terms of training speed with mixed precision training.
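One simple way to satisfy the divisibility requirement is to round a desired filter count up to the next multiple of 8. The helper below is a hypothetical illustration and is not taken from the released code.

```python
def round_up_to_multiple(value, multiple=8):
    """Round a positive integer up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def is_tensor_core_friendly(num_filters, per_gpu_batch_size):
    """Check the two divisibility conditions stated in the text."""
    return num_filters % 8 == 0 and per_gpu_batch_size % 8 == 0
```

For example, a layer planned with 50 filters would be widened to 56, while 64 stays unchanged.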
Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 7354–7563.