Privacy-Preserving Portrait Matting

04/29/2021 ∙ by Jizhizi Li, et al. ∙ The University of Sydney

Recently, there has been an increasing concern about the privacy issue raised by using personally identifiable information in machine learning. However, previous portrait matting methods were all based on identifiable portrait images. To fill the gap, we present P3M-10k in this paper, which is the first large-scale anonymized benchmark for Privacy-Preserving Portrait Matting. P3M-10k consists of 10,000 high-resolution face-blurred portrait images along with high-quality alpha mattes. We systematically evaluate both trimap-free and trimap-based matting methods on P3M-10k and find that existing matting methods show different generalization capabilities when following the Privacy-Preserving Training (PPT) setting, i.e., "training on face-blurred images and testing on arbitrary images". To devise a better trimap-free portrait matting model, we propose P3M-Net, which leverages the power of a unified framework for both semantic perception and detail matting, and specifically emphasizes the interaction between them and the encoder to facilitate the matting process. Extensive experiments on P3M-10k demonstrate that P3M-Net outperforms the state-of-the-art methods in terms of both objective metrics and subjective visual quality. Besides, it shows good generalization capacity under the PPT setting, confirming the value of P3M-10k for facilitating future research and enabling potential real-world applications. The source code and dataset will be made publicly available.


1. Introduction

The success of deep learning in many computer vision and multimedia areas largely relies on large-scale training data. However, for some tasks such as face recognition (masi2018deep), human activity analysis (sun2019deep), speech recognition, and emotion analysis (buechel2017emobank), privacy concerns about the personally identifiable information in the datasets, such as face, gait, and voice, have attracted increasing attention in recent years. Unfortunately, how to alleviate these privacy concerns without affecting model performance remains a challenging and under-explored problem (yang2021study). As an example, portrait matting, which refers to estimating an accurate foreground from a portrait image, also involves privacy issues, since portrait images usually contain identifiable faces, as found in previous matting datasets (dim; hatt; dapm). This issue has attracted growing concern due to the popularity of virtual video meetings during the COVID-19 pandemic, since portrait matting is a key technique in this multimedia application for changing the virtual background. However, all previous portrait matting methods have paid little attention to the privacy issue and adopted intact identifiable portrait images for both training and evaluation, leaving privacy-preserving portrait matting (P3M) as an open problem.

Figure 1. (a) Some anonymized portrait images from our P3M-500-P test set. (b) Some non-anonymized celebrity or back portrait images from our P3M-500-NP test set. We also provide the alpha mattes predicted by our P3M-Net, following the privacy-preserving training setting.

In this paper, we make the first attempt to fill this gap by presenting a large-scale anonymized portrait matting benchmark and investigating the impact of privacy-preserving training (PPT) on portrait matting models. P3M-10k consists of 10,000 high-resolution face-blurred portrait images along with high-quality ground truth alpha mattes. We carefully collect, filter, and annotate a huge number of images with high resolution, diverse foregrounds and backgrounds, and various postures to ensure the diversity, large volume, and high quality of the dataset compared with existing matting datasets (dapm; hatt; lf). Besides, we choose face obfuscation as the privacy protection technique to remove the identifiable face information while retaining fine details such as hairs. We split out 500 images from P3M-10k to serve as a face-blurred validation set, named P3M-500-P. Some examples are shown in Figure 1(a). Furthermore, to evaluate the generalization ability on normal images of matting models trained under the PPT setting, i.e., using only the face-blurred portrait images in the P3M-10k training set, we construct a validation set of 500 images without privacy concerns, named P3M-500-NP. All the images in P3M-500-NP are either frontal images of celebrities or profile/back images without any identifiable faces. Some examples are shown in Figure 1(b).

It is very interesting and of significant practical meaning to see whether privacy-preserving training has a side impact on matting models, since face obfuscation brings noticeable artefacts to the images that are not observed in normal portrait images. We notice that a contemporary work (yang2021study) has shown empirical evidence that face obfuscation only has a minor side impact on object detection and recognition models. However, in the context of portrait matting, where a pixel-wise alpha matte (a soft mask) with fine details is expected to be estimated from a high-resolution portrait image, the impact remains unclear.

In this paper, we systematically evaluate both trimap-based and trimap-free matting methods on the unmodified version and the face-blurred version of P3M-10k, and provide our insights and analysis about the impact. Specifically, we find that for trimap-based matting, where the trimap is used as an auxiliary input, face obfuscation shows little impact on the matting models, i.e., only a slight performance change of models following the PPT setting. As for trimap-free matting, which actually involves two sub-tasks, foreground segmentation and detail matting, we find that methods using a multi-task framework that explicitly models and jointly optimizes both tasks (gfm; hatt) are able to mitigate the impact of face obfuscation to an acceptable level (2% to 5%). Besides, the matting methods that solve the problem through sequential segmentation and matting (shm; dapm) show a significant performance drop, since face obfuscation leads to segmentation errors that mislead the subsequent matting model. Other methods that involve several stages of networks to progressively refine the alpha mattes from coarse to fine are less affected by face obfuscation but still suffer a performance drop due to the lack of explicit semantic guidance. Meanwhile, these methods are somewhat cumbersome due to their tedious training processes.

Based on the above observations, we propose a novel automatic portrait matting network named P3M-Net, which can serve as a strong trimap-free matting baseline for the P3M task (see the results in Figure 1). Technically, we also adopt a multi-task framework like (gfm; hatt) as our basic structure, which learns common visual features through a sharing encoder and task-aware features through a segmentation decoder and a matting decoder. In contrast to previous methods (gfm; hatt), we specifically emphasize the interaction between the two decoders and those between the encoder and decoders. To this end, we devise a Tripartite-Feature Integration (TFI) module for each matting decoder block to effectively fuse the matting decoder features from the previous block, semantic features from the segmentation decoder block, and base visual features from the encoder block. Besides, we devise a Deep Bipartite-Feature Integration (dBFI) module and a Shallow Bipartite-Feature Integration (sBFI) module to leverage deep features with high-level semantics and shallow features with fine details for improving the segmentation decoder and matting decoder, respectively. Experiments on the P3M-10k benchmark demonstrate that P3M-Net achieves a negligible performance drop under the PPT setting and outperforms all previous trimap-free matting methods by a large margin.

To sum up, the contributions of this paper are three-fold. First, to the best of our knowledge, we are the first to study the problem of privacy-preserving portrait matting and establish the largest privacy-preserving portrait matting dataset, P3M-10k, which can serve as a benchmark for P3M. Second, we systematically investigate the impact of face obfuscation on both trimap-based and trimap-free matting models under the privacy-preserving training setting and provide insights about the evaluation protocol, performance analysis, and model design. Third, we propose a novel trimap-free portrait matting network named P3M-Net that follows a multi-task framework and specifically focuses on the interactions between the encoder and decoders. P3M-Net demonstrates its value for privacy-preserving portrait matting and can serve as a strong baseline.

2. Related Work

2.1. Image Matting

Image matting is a typical ill-posed problem of estimating the foreground, background, and alpha matte from a single image. Specifically, portrait matting refers to the image matting task where the input image is a portrait. From the perspective of input, image matting can be divided into two categories, i.e., trimap-based methods and trimap-free methods. Trimap-based matting methods use a user-defined trimap, i.e., a 3-class segmentation map, as an auxiliary input, which provides explicit guidance on the transition area where the alpha matte needs to be estimated. Previous methods include affinity-based methods (levin2007closed; chen2013knn; aksoy2018semantic), sampling-based methods (he2011global; shahrian2013improving), and deep learning based methods (dim; lu2019indices; hou2019context). Besides, there are other methods such as background matting (backgroundmatting) and background matting v2 (backgroundmattingv2) that use other forms of auxiliary inputs, e.g., a background image of the same scene.
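For reference, all of these approaches build on the classical compositing formulation of matting, which expresses each observed pixel as a convex combination of foreground and background; the display below restates this standard formulation, which is not specific to any single cited method:

```latex
% Classical alpha compositing equation underlying image matting:
% each observed color I_p mixes a foreground color F_p and a background
% color B_p according to the alpha value \alpha_p \in [0, 1] at pixel p.
\begin{equation*}
  I_p = \alpha_p F_p + (1 - \alpha_p) B_p
\end{equation*}
```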

To enable automatic (portrait) image matting, recent works (shm; lf; hatt; gfm) tried to estimate the alpha matte directly from a single image without using any auxiliary input, also known as trimap-free methods. For example, DAPM (dapm) and SHM (shm) tackled the task by separating it into two sequential stages, i.e., segmentation and matting. However, the semantic errors produced in the first stage mislead the matting stage and cannot be corrected. LF (lf) and SHMC (shmc) solved the problem by generating a coarse alpha matte first and then refining it. Besides the tedious training process, these methods suffer from ambiguous boundaries due to the lack of explicit semantic guidance. HATT (hatt) and GFM (gfm) proposed to model both the segmentation and matting tasks in a unified multi-task framework, where a sharing encoder is used to learn base visual features and two individual decoders are used to learn task-relevant features. However, HATT (hatt) lacks explicit supervision on the global guidance, while GFM (gfm) lacks modeling of the interactions between both tasks. By contrast, we propose a novel model named P3M-Net, which is also based on the multi-task framework but specifically focuses on modeling the interactions between the encoder and decoders. Besides, we comprehensively investigate their performance under the PPT setting on P3M-10k and provide some useful insights on their model structures.

2.2. Privacy Issues in Visual Tasks

There are two kinds of privacy issues in visual tasks, i.e., private data protection and private content protection in public academic datasets. For the former, there are concerns about information leaks caused by insecure data transfer and membership inference attacks on trained models (shokri2017membership; hisamoto2020membership; fredrikson2015model; carlini2019secret). Privacy-preserving machine learning (PPML) aims to solve these problems via homomorphic encryption (erkin2009privacy; yonetani2017privacy) and differential privacy algorithms (NEURIPS2020_fc4ddc15; abadi2016deep).

For public academic datasets, there is no concern about information leaks, thus PPML is no longer needed. However, there still exists privacy breach incurred by the exposure of personally identifiable information, e.g., faces, license plates, and addresses. It is a common problem in the benchmark datasets of many visual tasks, e.g., object recognition and semantic segmentation. Recently, a contemporary work (yang2021study) has shown empirical evidence that face obfuscation, as an effective data anonymization technique, only has a minor side impact on object detection and recognition models. However, since portrait matting requires estimating a pixel-wise soft mask (alpha matte) for a high-resolution portrait image, the impact of data anonymization such as face obfuscation remains unclear.

2.3. Privacy-preserving Methods

Normally, to protect the private information in public images, a common method is to capture or process the data in special low-quality conditions (dai2015towards; butler2015privacy). For example, Dai et al. captured anonymized video data in extremely low resolution using low-resolution cameras to avoid the leak of personally identifiable information such as frontal faces (dai2015towards). Another way is to add empirical obfuscations (uittenbogaard2019privacy; caesar2020nuscenes; frome2009large), such as blurring and mosaicing at certain regions. Yang et al. used face blurring to obfuscate the faces in the ImageNet dataset (yang2021study). Caesar et al. detected and blurred the license plates in the nuScenes dataset before publishing to avoid privacy concerns (caesar2020nuscenes). For the portrait matting task, all previous benchmarks and methods pay little attention to the privacy issue. By contrast, we make the first attempt to construct a large-scale anonymized dataset for privacy-preserving portrait matting named P3M-10k. Specifically, we use face obfuscation as the privacy-preserving strategy to anonymize the identities in all images. Intuitively, the anonymized images with blurred faces may degrade the performance of matting models due to the domain gap between anonymized training images and normal test images. In this paper, we make the first attempt to investigate the impact of face obfuscation on portrait matting under the PPT setting and identify that it has a negligible impact on trimap-based matting methods but different impacts on trimap-free matting methods, depending on their model structures.

2.4. Matting Datasets

Existing matting datasets either contain only a small number of high-quality images and annotations, or contain images and annotations of low quality. For example, the online benchmark alphamatting (TUW-180666) only provides 27 high-resolution training images and 8 test images, none of which is a portrait image. Composition-1k (dim), the most commonly used dataset, contains 431 foregrounds for training and 20 foregrounds for testing. However, many of them are consecutive video frames, making the dataset less diverse. GFM (gfm) provides 2,000 high-resolution natural images with alpha mattes, but they are all animal images. With respect to portrait matting datasets, DAPM (dapm) provided a large dataset of 2,000 low-resolution portrait images with alpha mattes generated by KNN matting (chen2013knn) and closed-form matting (levin2007closed), whose quality is limited. Late fusion (lf) built a human image matting dataset by combining 228 portrait images from the Internet and 211 human images in Composition-1k. Distinction-646 (hatt) is a portrait image dataset containing 646 distinct human images, but in low resolution. There are also some large-scale portrait datasets, e.g., SHM (shm), SHMC (shmc), and background matting (backgroundmatting; backgroundmattingv2), which are unfortunately not public. Most importantly, no privacy-preserving method is used to anonymize the images in the aforementioned datasets, leaving all the frontal faces exposed. By contrast, we establish the first large-scale matting dataset with 10,000 high-resolution portrait images and high-quality alpha mattes, and anonymize all images using face obfuscation.

Test set Metric Closed (levin2007closed) IFM (aksoy2017designing) KNN (chen2013knn) Compre (shahrian2013improving) Robust (wang2007optimized) Learning (zheng2009learning) Global (he2011global) Shared (gastal2010shared)
B SAD 9.5750 10.887 15.378 8.3208 9.3321 10.248 9.6157 10.553
MSE 0.0214 0.0326 0.0511 0.0194 0.0214 0.0238 0.0242 0.0285
MAD 0.0693 0.0760 0.1087 0.0602 0.0674 0.0737 0.0708 0.0774
N SAD 9.4812 10.793 15.366 8.2295 9.2486 10.151 9.4908 10.386
MSE 0.0210 0.0318 0.0506 0.0191 0.0211 0.0236 0.0236 0.0277
MAD 0.0686 0.0748 0.1078 0.0595 0.0668 0.0729 0.0698 0.0760
Table 1. Results of trimap-based traditional methods on the blurred images (“B”) and normal images (“N”) in P3M-500-P.
Setting B:B B:N N:B N:N
Method SAD MSE MAD SAD MSE MAD SAD MSE MAD SAD MSE MAD
DIM (dim) 4.8906 0.0115 0.0342 4.8940 0.0116 0.0342 4.8050 0.0116 0.0334 4.7941 0.0116 0.0334
AlphaGAN (bmvcLutzAS18) 5.2669 0.0112 0.0373 5.2367 0.0112 0.037 5.7060 0.0120 0.0412 5.6696 0.0119 0.0408
U-Net (GCA) (li2020natural) 4.3212 0.0088 0.0303 4.3119 0.0088 0.0302 4.3263 0.0088 0.0303 4.3181 0.0088 0.0303
GCA (li2020natural) 4.3593 0.0088 0.0307 4.3469 0.0089 0.0306 4.4068 0.0089 0.0310 4.4002 0.0089 0.0310
IndexNet (lu2019indices) 5.1959 0.0156 0.0368 5.2188 0.0158 0.0370 5.8267 0.0202 0.0420 5.8509 0.0204 0.0422
Table 2. Results of trimap-based deep learning methods on P3M-500-P. “B” denotes the blurred images while “N” denotes the normal images. “B:N” denotes training on blurred images and testing on normal images, and vice versa.
Setting B:B B:N N:B N:N
Method SAD MSE MAD SAD MSE MAD SAD MSE MAD SAD MSE MAD
SHM (shm) 21.56 0.0100 0.0125 24.33 0.0116 0.0140 23.91 0.0115 0.0139 17.13 0.0075 0.0099
LF (lf) 42.95 0.0191 0.0250 30.84 0.0129 0.0178 41.01 0.0174 0.0240 31.22 0.0123 0.0181
HATT (hatt) 25.99 0.0054 0.0152 26.5 0.0055 0.0155 35.02 0.0103 0.0204 22.93 0.0040 0.0133
GFM (gfm) 13.20 0.0050 0.0080 13.08 0.0050 0.0080 13.54 0.0048 0.0078 10.73 0.0033 0.0063
BASIC 15.13 0.0058 0.0088 15.52 0.0060 0.0090 24.38 0.0109 0.0141 14.52 0.0054 0.0085
P3M-Net (Ours) 8.73 0.0026 0.0051 9.22 0.0028 0.0053 11.22 0.0040 0.0065 9.06 0.0028 0.0053

Table 3. Results of trimap-free methods on P3M-500-P. Please refer to Table 2 for the meaning of different symbols.

3. A Benchmark for P3M and Beyond

Privacy-preserving portrait matting (P3M) is an important and meaningful topic due to the increasing privacy concerns. In this section, we first define this problem, then establish a large-scale anonymized portrait matting dataset P3M-10k to serve as a benchmark for P3M. A systematic evaluation of the existing trimap-based and trimap-free matting methods on P3M-10k is conducted to investigate the impact of the privacy-preserving training (PPT) setting on different matting models and gain some useful insights.

3.1. PPT setting and P3M-10k dataset

3.1.1. PPT Setting

Due to the privacy concern, we propose the privacy-preserving training (PPT) setting in portrait matting, i.e., training on privacy-preserved images (e.g., processed by face obfuscation) and testing on arbitrary images with or without privacy content. As an initial step towards the privacy-preserving portrait matting problem, we only define the identifiable faces in frontal and some profile portrait images as the private content in this work. Intuitively, the PPT setting is challenging since face obfuscation brings noticeable artefacts to the images that are not observed in normal portrait images, probably resulting in a domain gap between training and testing. Therefore, it is very interesting and of significant practical meaning to see whether the PPT setting has a side impact on the matting models.

Figure 2. (a) Samples from the P3M-10k training set and P3M-500-P test set. (b) Samples from P3M-500-NP test set.

3.1.2. P3M-10k Dataset

To answer the above question, we establish the first large-scale privacy-preserving portrait matting benchmark, named P3M-10k. It contains 10,000 high-resolution portrait images anonymized by face obfuscation, along with high-quality ground truth alpha mattes. Specifically, we carefully collect, filter, and annotate about 10,000 high-resolution images from the Internet with free-use licenses. There are 9,421 images in the training set and 500 images in the test set, denoted as P3M-500-P. In addition, we also collect and annotate another 500 public celebrity images from the Internet without face obfuscation, denoted as P3M-500-NP, to evaluate the performance of matting models trained under the PPT setting on normal portrait images. Some examples are shown in Figure 2.

Our P3M-10k outperforms existing matting datasets in terms of dataset volume, image diversity, privacy preservation, and the use of natural images instead of composited ones. The diversity is reflected not only in the foreground, e.g., half- and full-body, frontal, profile, and back portraits of different genders, races, and ages, but also in the background, e.g., images in P3M-10k are captured in different indoor and outdoor environments with various illumination conditions. Some examples are shown in Figure 2. In addition, we argue that the large volume and high diversity of P3M-10k enable models to train on natural images without the need for image composition using low-resolution background images, which is a common practice in previous works (dim; hatt) to increase data diversity due to the small dataset volume. However, there are obvious composition artefacts in composited images due to the discrepancy between foreground and background images in noise, resolution, and illumination. By contrast, the background in a natural image is compatible with the foreground since they are captured from the same scene. The composition artefacts may have a side impact on the generalization ability of matting models, as shown in (gfm). We leave a systematic investigation of this problem as future work and focus only on the PPT setting in this paper.

Figure 3. Illustration of the face blurring process.

3.2. Privacy-preserving Method in P3M-10k

We propose to use blurring to obfuscate the identifiable faces. Instead of using a face detector to obtain the bounding box of the face and blurring it accordingly as in (yang2021study), we adopt a facial landmark detector to obtain the face mask. This is because, different from the classification and detection tasks in (yang2021study), which may not be sensitive to blurry boundaries, portrait matting requires estimating the foreground alpha matte with clear boundaries, including the transition areas of the face such as the cheek and hair. As shown in Figure 3, after obtaining the landmarks, a pixel-level face mask is automatically generated along the cheek and eyebrow landmarks in step (3). Then, we exclude the transition area as shown in step (4) and generate an adjusted face mask in step (5). Finally, we use Gaussian blur to obfuscate the identifiable faces within the mask, and the final result is shown in step (6). Note that for those images with failed landmark detection, we manually annotate the face mask.
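To make the pipeline above concrete, here is a minimal sketch of the obfuscation step, assuming facial landmarks have already been produced by an off-the-shelf landmark detector; the function name, landmark convention, and kernel size are illustrative assumptions rather than the exact implementation used to build P3M-10k:

```python
import cv2
import numpy as np

def blur_face(image, landmarks, alpha_matte, ksize=51):
    """image: HxWx3 uint8; landmarks: Nx2 int array (cheek + eyebrow points);
    alpha_matte: HxW float in [0, 1] ground-truth matte; ksize: odd Gaussian kernel."""
    h, w = image.shape[:2]

    # Step (3): pixel-level face mask from the convex hull of cheek/eyebrow landmarks.
    face_mask = np.zeros((h, w), dtype=np.uint8)
    hull = cv2.convexHull(landmarks.astype(np.int32))
    cv2.fillConvexPoly(face_mask, hull, 1)

    # Steps (4)-(5): exclude the transition area (0 < alpha < 1), e.g. hair,
    # so that fine matting details remain intact after blurring.
    transition = (alpha_matte > 0) & (alpha_matte < 1)
    face_mask[transition] = 0

    # Step (6): Gaussian-blur the image and paste the blurred pixels back
    # only inside the adjusted face mask.
    blurred = cv2.GaussianBlur(image, (ksize, ksize), 0)
    out = image.copy()
    out[face_mask.astype(bool)] = blurred[face_mask.astype(bool)]
    return out
```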

3.3. Benchmark Setup

3.3.1. Methods

We evaluate both trimap-based and trimap-free matting methods. The full list of methods is shown in Tables 1, 2, and 3.

3.3.2. Evaluation Metrics

We use the common metrics including MSE, SAD, and MAD for evaluation. For trimap-based methods, the metrics are only calculated over the transition area, while for trimap-free methods, they are calculated over the whole image.
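A minimal sketch of these three metrics is given below; only the raw definitions are shown, and any scaling convention applied when reporting numbers in the tables is left out as an assumption:

```python
import numpy as np

def matting_metrics(pred, gt, mask=None):
    """pred, gt: HxW float alpha mattes in [0, 1].
    mask: optional boolean HxW region (e.g. the trimap transition area for
    trimap-based methods); if None, metrics are computed over the whole image."""
    if mask is None:
        mask = np.ones_like(gt, dtype=bool)
    diff = (pred - gt)[mask]
    sad = np.abs(diff).sum()       # sum of absolute differences
    mse = np.square(diff).mean()   # mean squared error
    mad = np.abs(diff).mean()      # mean absolute difference
    return sad, mse, mad
```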

3.3.3. Training and Evaluation Protocols

Four kinds of training and evaluation protocols are proposed, including “trained on blurred images, tested on blurred ones (B:B)”, “trained on blurred images, tested on normal ones (B:N)”, “trained on normal images, tested on blurred ones (N:B)”, and “trained on normal images, tested on normal ones (N:N)”. The first two protocols correspond to the proposed PPT setting. All the methods are trained using the normal or blurred images in the P3M-10k training set and evaluated on the P3M-500-P test set. The only difference between blurred and normal images is whether or not face obfuscation is applied.

3.4. Study on The Impact of PPT

3.4.1. Impact on Trimap-based Traditional Methods

As seen in Table 1, trimap-based traditional methods show negligible performance variance under different training and evaluation protocols, indicating that the PPT setting has little impact on these methods. This observation is reasonable, since traditional methods mainly make predictions based on local pixels in the transition area, where no blurring occurs, although a few sampled neighboring pixels may be blurred. Note that we define the transition area as in previous works but exclude the blurred area.

3.4.2. Impact on Trimap-based Deep Learning Methods

Similar to traditional trimap-based methods, deep learning methods also show very minor changes across different settings, as shown in Table 2. This is because trimap-based deep learning methods use the ground truth trimap as an auxiliary input and focus on estimating the alpha matte of the transition area, probably guiding the model to pay less attention to the blurred areas. In addition, there are also some counter-intuitive observations. When testing on normal images, models trained on the normal training images surprisingly fall behind those trained on the blurred ones. For instance, the SAD of IndexNet on “N:N” is 0.6 higher than its score on “B:N”. Similar results can also be found for AlphaGAN, U-Net, and GCA in Table 2. We suspect that the blurred pixels near the transition area may serve as random noise during training, which makes the model more robust and leads to better generalization.

3.4.3. Impact on Trimap-free Methods

Different from trimap-based methods, trimap-free methods show significant performance changes under the four protocols. The results are shown in Table 3. First, we start with the test set of normal images by comparing results in the “B:N” and “N:N” tracks. Models trained on normal training images (N:N) usually outperform those trained on the blurred ones (B:N), e.g., from 24.33 to 17.13 in SAD for SHM. This observation makes sense since there is a domain gap between blurred images and normal ones due to face obfuscation. By comparison, we find that trimap-free methods show different generalization abilities in dealing with this domain gap. For example, SHM is the worst with a large drop of 7 in SAD, while HATT and GFM only show a drop of less than 3 in SAD. We suspect that an end-to-end multi-task framework can probably mitigate the domain gap issue via joint optimization. By contrast, two-stage methods such as SHM may produce segmentation errors, which mislead the following matting stage and cannot be corrected. To validate this hypothesis, we devise a baseline model called “BASIC” by adopting a similar multi-task framework to HATT and GFM but removing the bells and whistles, i.e., only using a sharing encoder and two individual decoders. As shown in Table 3, its small performance drop (about 1 in SAD) proves its superiority in overcoming the domain gap and supports our hypothesis.

Second, we focus on the test set of blurred images by comparing results in the “B:B” and “N:B” tracks. The performance drop of most methods, e.g., 9.03 in SAD for HATT, proves that without seeing the blurred pattern during training, models cannot generalize well to face-blurred images. It implies the value of the blurred training set in P3M-10k for training models that will be deployed in privacy-preserving scenarios, where faces may be blurred.

Third, we fix the training set to be the blurred one. By comparing the “B:B” and “B:N” tracks, we observe similar performance of most methods under these two settings, e.g., the SADs of HATT are 25.99 and 26.5 on the blurred test set and the normal one, respectively. These results imply that the performance on the blurred test images in P3M-500-P can be used as a reasonable indicator of that on the normal ones.

Figure 4. Diagram of the proposed P3M-Net structure. It adopts a multi-task framework, which consists of a sharing encoder, a segmentation decoder, and a matting decoder. Specifically, a TFI module, a dBFI module, and a sBFI module are devised to model different interactions among the encoder and the two decoders.

4. A Strong Baseline for P3M

4.1. A Multi-task Framework

As discussed in Section 3.4.3, a trimap-free matting model benefits from explicitly modeling both the semantic segmentation and detail matting tasks and jointly optimizing them in an end-to-end multi-task framework. Therefore, we also adopt the multi-task framework, where base visual features are learned from a sharing encoder and task-relevant features are learned from individual decoders, i.e., a semantic decoder and a matting decoder, respectively. For the sharing encoder, we choose a modified version of ResNet-34 (he2016deep) with max pooling layers to serve as our lightweight backbone. Specifically, to make it more suitable for portrait matting and retain more details in the features of shallow layers, we modify the stride of the convolution layers in the ResNet-34 blocks from 2 to 1 and add max pooling layers to down-sample the feature maps accordingly. We also keep the indices of the max pooling operations, which are used by the unpooling layers in the matting decoder. Both the semantic decoder and the matting decoder have five blocks, each of which contains three convolution layers. We then choose different upsampling operations to suit each task: we use bilinear interpolation in the semantic decoder for simplicity, and max unpooling with the indices from the corresponding encoder blocks in the matting decoder to learn features for fine details.
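The following PyTorch sketch illustrates the encoder modification described above, i.e., stride-1 convolutions plus explicit max pooling that returns indices for later unpooling in the matting decoder; the channel widths and block layout are illustrative assumptions, not the exact P3M-Net configuration:

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # stride kept at 1 so shallow features retain spatial detail
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # explicit max pooling performs the downsampling and exposes the indices
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

    def forward(self, x):
        feat = self.conv(x)
        down, indices = self.pool(feat)
        return down, indices  # indices feed the matting decoder's MaxUnpool2d

# In the matting decoder, the stored indices are consumed as:
#   up = nn.MaxUnpool2d(2, stride=2)(decoder_feat, indices)
```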

4.2. TFI: Tripartite-Feature Integration

Most of the previous matting methods either model the interaction between the encoder and decoder, such as the U-Net (unet) style structure in (gfm), or model the interaction between two decoders, such as the attention module in (hatt). In this paper, we comprehensively model all the interactions between the sharing encoder and the two decoders, i.e., 1) a tripartite-feature integration (TFI) module in each matting decoder block to model the interaction among the encoder, the segmentation decoder, and the matting decoder; 2) a deep bipartite-feature integration (dBFI) module to model the interaction between the encoder and the segmentation decoder; and 3) a shallow bipartite-feature integration (sBFI) module to model the interaction between the encoder and the matting decoder.

Specifically, each TFI module has three inputs, i.e., the feature map $D^m_{i-1}$ from the previous matting decoder block, the feature map $D^s_i$ from the same-level semantic decoder block, and the feature map $E_i$ from the symmetrical encoder block, where $i$ stands for the block index and the three feature maps share the same downsample ratio with respect to the input size. For each feature map, we use a convolutional projection layer $P(\cdot)$ for further embedding and channel reduction. We then concatenate the three embedded feature maps and feed them into a convolutional block $C(\cdot)$ containing a convolutional layer, a batch normalization layer, and a ReLU layer. As shown in Eq. 1, the output feature of TFI at block $i$ is

$F^{TFI}_i = C\big(\mathrm{Concat}\big(P(D^m_{i-1}), P(D^s_i), P(E_i)\big)\big).$   (1)
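A minimal PyTorch sketch of a TFI block consistent with Eq. (1) is shown below; the kernel sizes and channel widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TFI(nn.Module):
    def __init__(self, enc_ch, seg_ch, mat_ch, out_ch):
        super().__init__()
        # P(.): per-input projection for embedding and channel reduction
        self.proj_mat = nn.Conv2d(mat_ch, out_ch, kernel_size=1)
        self.proj_seg = nn.Conv2d(seg_ch, out_ch, kernel_size=1)
        self.proj_enc = nn.Conv2d(enc_ch, out_ch, kernel_size=1)
        # C(.): conv + batch norm + ReLU on the concatenated features
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, mat_feat, seg_feat, enc_feat):
        # the three inputs are assumed to share the same spatial resolution
        fused = torch.cat(
            [self.proj_mat(mat_feat), self.proj_seg(seg_feat), self.proj_enc(enc_feat)],
            dim=1,
        )
        return self.fuse(fused)
```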

4.3. sBFI: Shallow Bipartite-Feature Integration

The matting task requires distinguishing fine foreground details from the background. Therefore, the features in the shallow layers of the encoder may be useful since they contain abundant structural detail. To leverage them to improve the matting decoder, we propose the shallow bipartite-feature integration (sBFI) module.

As shown in Figure 4, we use the feature map $E_1$ from the first encoder block as guidance to refine the output feature map $D^m_i$ from the previous matting decoder block, since $E_1$ contains many details and much local structural information. Here, $i$ stands for the block index. Since $E_1$ and $D^m_i$ have different resolutions, we first apply max pooling to $E_1$ to generate a low-resolution feature map matching $D^m_i$. We then feed both feature maps into two projection layers $P(\cdot)$, implemented by convolution layers, for further embedding and channel reduction. Finally, the two feature maps are concatenated and fed into a convolutional block $C(\cdot)$ containing a convolutional layer, a batch normalization layer, and a ReLU layer. As shown in Eq. 2, we adopt the residual learning idea by adding the output feature map back to the input matting decoder feature map $D^m_i$:

$F^{sBFI}_i = C\big(\mathrm{Concat}\big(P(\mathrm{MaxPool}(E_1)), P(D^m_i)\big)\big) + D^m_i.$   (2)

In this way, sBFI helps the matting decoder blocks to focus on fine details guided by $E_1$.
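A minimal PyTorch sketch of an sBFI block consistent with Eq. (2) follows; the pooling ratio and channel widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SBFI(nn.Module):
    def __init__(self, shallow_ch, dec_ch, pool_ratio):
        super().__init__()
        half = dec_ch // 2
        # downsample the shallow encoder feature to the decoder resolution
        self.pool = nn.MaxPool2d(pool_ratio, stride=pool_ratio)
        # P(.): projection layers for embedding and channel reduction
        self.proj_shallow = nn.Conv2d(shallow_ch, half, kernel_size=1)
        self.proj_dec = nn.Conv2d(dec_ch, half, kernel_size=1)
        # C(.): conv + batch norm + ReLU on the concatenated features
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * half, dec_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(dec_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, shallow_feat, dec_feat):
        pooled = self.pool(shallow_feat)
        fused = torch.cat([self.proj_shallow(pooled), self.proj_dec(dec_feat)], dim=1)
        return self.fuse(fused) + dec_feat  # residual connection (Eq. 2)
```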

4.4. dBFI: Deep Bipartite-Feature Integration

Similar to sBFI, features in the encoder can also provide valuable guidance to the segmentation decoder. In contrast to sBFI, we choose the feature map from the last encoder block, since it encodes abundant global semantics.

Specifically, we devise the deep bipartite-feature integration (dBFI) module to fuse the feature map $E_L$ from the last encoder block with the feature map $D^s_j$ from the $j$-th segmentation decoder block, so as to improve the feature representation ability for the high-level semantic segmentation task. Note that since $E_L$ is in low resolution, we apply an upsampling operation to $E_L$ to match the resolution of $D^s_j$. We then feed both feature maps into two projection layers $P(\cdot)$, concatenate them, and feed the result into a convolutional block $C(\cdot)$. We adopt identical structures for $P(\cdot)$ and $C(\cdot)$ as those in sBFI. Similarly, this process can be described as:

$F^{dBFI}_j = C\big(\mathrm{Concat}\big(P(\mathrm{Upsample}(E_L)), P(D^s_j)\big)\big).$   (3)

Note that we reuse the symbols $P(\cdot)$ and $C(\cdot)$ in Eq. 1, Eq. 2, and Eq. 3 for simplicity, although each of them denotes a specific layer (block) in TFI, sBFI, and dBFI, respectively.
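Analogously, a minimal PyTorch sketch of a dBFI block consistent with Eq. (3); the upsampling ratio and channel widths are again illustrative assumptions:

```python
import torch
import torch.nn as nn

class DBFI(nn.Module):
    def __init__(self, deep_ch, dec_ch, up_ratio):
        super().__init__()
        half = dec_ch // 2
        # upsample the deepest encoder feature to the decoder resolution
        self.up = nn.Upsample(scale_factor=up_ratio, mode="bilinear", align_corners=False)
        # P(.): projection layers for embedding and channel reduction
        self.proj_deep = nn.Conv2d(deep_ch, half, kernel_size=1)
        self.proj_dec = nn.Conv2d(dec_ch, half, kernel_size=1)
        # C(.): conv + batch norm + ReLU on the concatenated features
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * half, dec_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(dec_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep_feat, dec_feat):
        upsampled = self.up(deep_feat)
        fused = torch.cat([self.proj_deep(upsampled), self.proj_dec(dec_feat)], dim=1)
        return self.fuse(fused)
```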

4.5. Training Objective

For the segmentation task, we leverage the deep supervision idea and add side losses on the segmentation decoder to stabilize and improve training. Specifically, we apply a convolution layer and an upsampling operation to each output feature map from the dBFI blocks to predict a 3-channel segmentation map at the same resolution as the input, denoted as $\hat{G}$. We then calculate the cross-entropy loss between $\hat{G}$ and the ground truth trimap label $G$ (i.e., its one-hot representation), defined as follows:

$\mathcal{L}_{CE} = -\sum_{c=1}^{C} G_c \log \hat{G}_c,$   (4)

where $G_c$ and $\hat{G}_c$ denote the ground truth and predicted probability of class $c$ at each pixel, and $C$ represents the number of classes in the trimap, i.e., $C = 3$.

Following (gfm), for the matting decoder, we adopt the alpha loss $\mathcal{L}^{T}_{\alpha}$ and the Laplacian loss $\mathcal{L}^{T}_{lap}$ calculated only on the transition region. For the segmentation decoder, we adopt a cross-entropy loss $\mathcal{L}_{CE}$ on its final output. For the final output, we adopt the alpha loss $\mathcal{L}_{\alpha}$, Laplacian loss $\mathcal{L}_{lap}$, and composition loss $\mathcal{L}_{comp}$ calculated on the whole image. The final training objective is a combination of all the aforementioned losses, i.e.,

$\mathcal{L} = \lambda_1 \mathcal{L}_{CE} + \lambda_2 \big(\mathcal{L}^{T}_{\alpha} + \mathcal{L}^{T}_{lap}\big) + \lambda_3 \big(\mathcal{L}_{\alpha} + \mathcal{L}_{lap} + \mathcal{L}_{comp}\big),$   (5)

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are loss weights.
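The sketch below shows how such a weighted objective can be assembled in PyTorch; for brevity, the Laplacian and composition terms are replaced by a simple L1 stand-in, so it only illustrates the weighting structure of Eq. (5), not the exact losses of (gfm), and the default weights are placeholders:

```python
import torch
import torch.nn.functional as F

def mae(pred, gt, mask=None):
    # mean absolute error, optionally restricted to a region
    diff = (pred - gt).abs()
    return diff[mask].mean() if mask is not None else diff.mean()

def total_loss(seg_logits, gt_trimap, alpha_m, alpha_final, gt_alpha,
               transition, lambdas=(1.0, 1.0, 1.0)):
    """seg_logits: Nx3xHxW; gt_trimap: NxHxW long class indices;
    alpha_m / alpha_final: matting-decoder and final alpha predictions (NxHxW);
    transition: boolean NxHxW mask of the trimap transition region."""
    lam_ce, lam_t, lam_f = lambdas
    # segmentation branch: 3-class cross-entropy against the trimap (Eq. 4)
    loss_ce = F.cross_entropy(seg_logits, gt_trimap)
    # matting decoder: loss restricted to the transition region
    loss_trans = mae(alpha_m, gt_alpha, mask=transition)
    # final fused prediction: loss over the whole image
    loss_full = mae(alpha_final, gt_alpha)
    return lam_ce * loss_ce + lam_t * loss_trans + lam_f * loss_full
```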

P3M-500-P P3M-500-NP
Method SAD MSE MAD SAD-T MSE-T MAD-T GRAD CONN SAD MSE MAD SAD-T MSE-T MAD-T GRAD CONN
LF 42.95 0.0191 0.0250 12.43 0.0421 0.0824 42.19 18.80 32.59 0.0131 0.0188 14.53 0.0420 0.0825 31.93 19.50
HATT 25.99 0.0054 0.0152 11.03 0.0377 0.0752 14.91 25.29 30.53 0.0072 0.0176 13.48 0.0403 0.0803 19.88 27.42
SHM 21.56 0.0100 0.0125 9.14 0.0255 0.0545 21.24 17.53 20.77 0.0093 0.0122 9.14 0.0255 0.0545 20.30 17.09
GFM 13.20 0.0050 0.0080 8.84 0.0269 0.0616 12.58 17.75 15.50 0.0056 0.0091 10.16 0.0268 0.0620 14.82 18.03
DIM 4.89 0.0009 0.0029 4.89 0.0115 0.0342 4.48 9.68 5.32 0.0009 0.0031 5.32 0.0094 0.0324 4.70 7.70
P3M-Net 8.73 0.0026 0.0051 6.89 0.0193 0.0478 8.22 13.88 11.23 0.0035 0.0065 7.65 0.0173 0.0466 10.35 12.51
Table 4. Results of P3M-Net and other methods on P3M-500-P and P3M-500-NP. DIM uses ground truth trimap.
Figure 5. Subjective results of different methods on P3M-500-P and P3M-500-NP. Columns from left to right: Image, GT, LF (lf), HATT (hatt), SHM (shm), GFM (gfm), DIM (dim)+Trimap, and our P3M-Net. Please zoom in for more details.

5. Experiments

5.1. Experiment Settings

To compare the proposed P3M-Net with existing trimap-free methods (shm; lf; hatt; gfm), we train them on the P3M-10k face-blurred images and evaluate them on 1) the P3M-500-P face-blurred test set and 2) the P3M-500-NP normal-image test set, following the PPT setting.

Implementation Details. For training P3M-Net, we crop a patch from each image at a size randomly chosen from several pre-defined resolutions and then resize it to a fixed training resolution. We randomly flip the patches for data augmentation. The learning rate is kept fixed during training. We train P3M-Net on a single NVIDIA Tesla V100 GPU with a batch size of 8 for 150 epochs, which takes about 2 days. Inference takes 0.132s per test image. For GFM (gfm) and LF (lf), we use the code provided by the authors. For SHM (shm), HATT (hatt), and DIM (dim), whose codes are not available, we re-implement them.
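A minimal sketch of the described cropping, resizing, and flipping augmentation; the candidate crop sizes and training resolution below are placeholders, since the exact values are not reproduced here:

```python
import random
import cv2
import numpy as np

CANDIDATE_CROPS = [512, 768, 1024]  # hypothetical crop sizes
TRAIN_SIZE = 512                    # hypothetical training resolution

def augment(image, alpha):
    """image: HxWx3 uint8; alpha: HxW float ground-truth matte."""
    h, w = image.shape[:2]
    crop = min(random.choice(CANDIDATE_CROPS), h, w)
    # random crop of size crop x crop, applied identically to image and alpha
    y = random.randint(0, h - crop)
    x = random.randint(0, w - crop)
    image, alpha = image[y:y + crop, x:x + crop], alpha[y:y + crop, x:x + crop]
    # resize to the fixed training resolution
    image = cv2.resize(image, (TRAIN_SIZE, TRAIN_SIZE), interpolation=cv2.INTER_LINEAR)
    alpha = cv2.resize(alpha, (TRAIN_SIZE, TRAIN_SIZE), interpolation=cv2.INTER_LINEAR)
    # random horizontal flip
    if random.random() < 0.5:
        image = np.ascontiguousarray(image[:, ::-1])
        alpha = np.ascontiguousarray(alpha[:, ::-1])
    return image, alpha
```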

Evaluation Metrics. We follow previous works and adopt the evaluation metrics including the sum of absolute differences (SAD), mean squared error (MSE), mean absolute difference (MAD), gradient error (Grad.), and connectivity error (Conn.) (rhemann2009perceptually). We calculate them over the whole image for trimap-free methods. We also report the SAD-T, MSE-T, and MAD-T metrics within the transition area.

5.2. Objective and Subjective Results

The objective and subjective results of different methods are listed in Table 4 and Figure 5. As can be seen, P3M-Net outperforms all the trimap-free methods on all metrics and even achieves results competitive with the trimap-based method DIM (dim), which requires the ground truth trimap as an auxiliary input. These results support the designed integration modules, which are able to model abundant interactions between the encoder and decoders. As for SHM (shm), it has a worse SAD than P3M-Net on both test sets, i.e., 21.56 vs. 8.73 and 20.77 vs. 11.23, due to its stage-wise structure, which produces many segmentation errors. LF (lf) and HATT (hatt) have large errors in the transition area, e.g., 12.43 and 11.03 SAD vs. 6.89 SAD of ours, since they lack explicit semantic guidance for the matting task. As shown in Figure 5, they produce ambiguous segmentation results and inaccurate matting details. GFM (gfm) is able to predict a more accurate semantic mask owing to its multi-task framework. However, it still fails to predict the correct context (the last row) and performs worse than ours, e.g., 13.20 vs. 8.73 in SAD, since it lacks interactions between the encoder and decoders. DIM (dim) has a lower SAD than ours since it uses the ground truth trimap. Nevertheless, P3M-Net still achieves competitive performance in the transition area, e.g., 6.89 vs. 4.89 SAD. It is also noteworthy that, even though they are trained only on the privacy-preserving training set, most methods generalize well to arbitrary images, clearly validating the effectiveness of the proposed P3M-10k and the practical value of the PPT setting. Meanwhile, the performance gap between testing on face-blurred images and normal images, e.g., 8.73 SAD vs. 11.23 SAD for P3M-Net, also implies that more efforts can be made to advance research on privacy-preserving portrait matting.

5.3. Ablation Studies

We conduct ablation studies of P3M-Net on P3M-500-P and P3M-500-NP. As seen from Table 5, the basic multi-task structure without any advanced modules already achieves a fairly good result compared with previous methods (shm; hatt; lf). With TFI, SAD decreases dramatically to 11.32 and 13.70, owing to the valuable semantic features from the encoder and segmentation decoder for matting. Besides, sBFI (dBFI) decreases SAD from 11.32 to 9.47 (9.76) on P3M-500-P and from 13.70 to 12.36 (12.45) on P3M-500-NP, confirming their value in providing useful guidance from relevant visual features. With all three modules, SAD decreases from 15.13 to 8.73 and from 17.01 to 11.23, indicating that our proposed modules bring a relative performance improvement of over 30% in SAD.

P3M-500-P P3M-500-NP
TFI sBFI dBFI SAD MSE MAD SAD MSE MAD
- - - 15.13 0.0058 0.0088 17.01 0.0062 0.0099
✓ - - 11.32 0.0042 0.0066 13.70 0.0052 0.0080
✓ ✓ - 9.47 0.0030 0.0055 12.36 0.0043 0.0072
✓ - ✓ 9.76 0.0031 0.0057 12.45 0.0043 0.0073
✓ ✓ ✓ 8.73 0.0026 0.0051 11.23 0.0035 0.0065
Table 5. Ablation study of P3M-Net on P3M-500-P and P3M-500-NP. A "✓" indicates that the corresponding module is enabled.

6. Conclusions

In this paper, we make the first study of the privacy-preserving portrait matting (P3M) problem in response to increasing privacy concerns. Specifically, we define the privacy-preserving training (PPT) setting and establish the first large-scale anonymized portrait dataset, P3M-10k, consisting of 10,000 face-blurred images and ground truth alpha mattes. We empirically find that the PPT setting has little side impact on trimap-based methods, while trimap-free methods perform differently depending on their model structures. We identify that trimap-free methods using a multi-task framework that explicitly models and jointly optimizes both the segmentation and matting tasks can effectively mitigate the side impact of PPT. Accordingly, we provide a strong baseline model named P3M-Net, which specifically focuses on modeling the interactions between the encoder and decoders, showing promising performance and outperforming all previous trimap-free methods. We hope this study can open a new perspective for portrait matting research and attract more attention from the community to addressing privacy concerns.

References