Designing a lightweight and robust portrait segmentation algorithm is an important task for a wide range of face applications. However, the problem has been treated as a subset of the object segmentation problem and has received less attention in the semantic segmentation field. Portrait segmentation has its own unique requirements. First, because portrait segmentation is performed in the middle of the whole pipeline of many real-world applications, it requires extremely lightweight models. Second, there have not been any public datasets in this domain that contain a sufficient number of images with unbiased statistics. To solve the first problem, we introduce a new extremely lightweight portrait segmentation model, SINet, containing an information blocking decoder and spatial squeeze modules. The information blocking decoder uses confidence estimates to recover local spatial information without spoiling global consistency. The spatial squeeze module uses multiple receptive fields to cope with various sizes of consistency in the image. To tackle the second problem, we propose a simple method to create additional portrait segmentation data, which can improve accuracy on the EG1800 dataset. In our qualitative and quantitative analysis on the EG1800 dataset, we show that our method outperforms various existing lightweight segmentation models. Our method reduces the number of parameters from 2.1M to 86.9K (around a 95.9% reduction), while keeping the accuracy within 1% of the state-of-the-art portrait segmentation method. We also show that our model runs successfully on a real mobile device at 100.6 FPS. In addition, we demonstrate that our method can be used for general semantic segmentation on the Cityscapes dataset. The code is available at https://github.com/HYOJINPARK/ExtPortraitSeg
Developing algorithms targeting face data is considered an important task in the computer vision field, and many related vision algorithms, including detection, recognition, and key-point extraction, are actively studied. Among them, portrait segmentation is commonly used in real-world applications such as background editing, security checks, and face resolution enhancement [19, 29], giving rise to the need for fast and robust segmentation models.
The challenge of the segmentation task is that the model has to solve two contradictory problems simultaneously: (1) handling long-range dependencies or global consistency, and (2) preserving detailed local information. Figure 2 shows two common segmentation errors. First, the blue blob in Figure 2 (b) is classified as foreground, even though it is easily recognized as a wood region. This happens because the segmentation model fails to capture the global context information that would prevent such a wrong prediction. Second, the red blobs in Figure 2 (b) show the model’s failure to accurately segment fine details. The lateral part of the hair needs fine segmentation due to its small size and its color, which is similar to the wood. The model is not able to produce a sharp segmentation because it lacks detailed information about the hair. This is caused by the use of strided convolution or pooling layers. These techniques let the model capture global information by enlarging the receptive field, but they can destroy local information.
Figure 2: (a) Input image. (b) Typical segmentation errors. (c) Ground truth. (d) Example of ours.
Researchers have developed several strategies to solve these problems. The first is to produce multiple receptive fields for each layer. This multi-path structure is able to enhance both global and local information, but comes at the cost of increased latency due to fragmented parallel operations. Another method is to use a two-branch network, which consists of a deeper branch that produces global context and a shallow branch that preserves detailed local features by keeping a high resolution [17, 18, 27]. Even though the shallow branch has few convolutional layers, it is computationally heavy due to its high-resolution feature maps. This method also has to extract features twice, once for each branch.
The portrait segmentation problem comes with a set of additional challenges. The first is the small amount of available data. The EG1800 dataset, an accessible public portrait segmentation dataset, contains only around 1,300 training images, and has large biases with regard to attributes such as race, age, and gender. Second, portrait segmentation is usually used as just one of several steps in real-world applications. Since many of these applications run on mobile devices, the segmentation model needs to be lightweight to ensure real-time speeds. Researchers have developed plenty of lightweight segmentation methods, but most of them are still not lightweight enough for portrait segmentation tasks. For example, PortraitNet, the current state-of-the-art model on the EG1800 dataset, has 2.1M parameters. General lightweight segmentation models such as ESPNetV2 and MobileNet V3 also have more parameters than the model we propose.
In this paper, we propose a new extremely lightweight portrait segmentation model called SINet, with an information blocking decoder structure and spatial squeeze modules (S2-modules). Furthermore, we collect additional portrait segmentation data to overcome the aforementioned dataset problems. The proposed SINet has 86.9K parameters and achieves 100.6 FPS on an iPhone XS without any low-precision floating-point operations or pruning methods. Compared with the baseline model, PortraitNet, which has 2.1M parameters, the accuracy degradation is just under 1% on the EG1800 dataset, as can be seen in Figure 1.
Our contributions can be summarized as follows: (1) We introduce the information blocking scheme to the decoder. It measures the confidence in a low-resolution feature map, and blocks the influence of high-resolution feature maps in highly confident pixels. This prevents noisy information from ruining already certain areas, and allows the model to focus on regions with high uncertainty. We show that this information blocking decoder is robust to translation and can be applied to general segmentation tasks. (2) We propose a spatial squeeze module (S2-module), an efficient multi-path network for feature extraction. Existing multi-path structures deal with the various sizes of long-range dependencies by managing multiple receptive fields. However, this increases latency in real implementations, because the structure is poorly suited to kernel launching and synchronization. To mitigate this problem, we squeeze the spatial resolution of each feature map by average pooling, and show that this is more effective than adopting multi-receptive fields. (3) The public portrait segmentation dataset has a small number of images compared to other segmentation datasets, and is highly biased. We propose a simple and effective data generation method to augment the EG1800 dataset with a significant number of images.
Portrait Application: PortraitFCN+ built a portrait dataset from Flickr and proposed a portrait segmentation model based on FCN. After that, PortraitNet proposed a real-time portrait segmentation model with higher accuracy than PortraitFCN+. Another work integrated two different segmentation schemes from Mask R-CNN and DensePose, and generated matting refinement based on FCN. A further work introduced a boundary-sensitive kernel to enhance semantic boundary shape information. While these works achieved good segmentation results, their models are still too heavy for embedded systems.
Global consistency: Global consistency and long-range dependencies are critical factors for the segmentation task, and models without a large enough receptive field will produce error-prone segmentation maps. One way of creating a large receptive field is to use large kernels. However, this is not suitable for lightweight models because large kernels require a large number of parameters. Another method is to reduce the size of feature maps through downsampling, but this leads to difficulties in segmenting small or narrow objects.
To resolve this problem, dilated convolutions (or atrous convolutions) have been introduced as an effective way to obtain a large receptive field while preserving localization information [28, 2], at the same computational cost as a normal convolution. However, as the dilation rate increases, the number of valid weights decreases, until the operation finally degenerates to a 1×1 convolution. The grid effect also degrades the segmentation result with a checkerboard pattern. Another method is to use spatial pyramid pooling to get a larger receptive field. Spatial pyramid pooling uses different pooling sizes and concatenates the resulting feature maps to obtain a multi-scale receptive field. Similarly, the Atrous Spatial Pyramid Pooling layer replaces the pooling with dilated convolutions to get a multi-scale representation. Some works instead use a multi-path structure for feature extraction to get a multi-scale representation [6, 12, 15]. Each module splits the input feature map and transforms each part with a different dilation rate. This method is well suited for lightweight models, but suffers from high latency. Recently, the asymmetric non-local block was proposed, inspired by the non-local block and spatial pyramid pooling. Because the non-local block calculates all pairwise pixel dependencies, it is computationally heavy. The asymmetric non-local block approximates this calculation with spatial pyramid pooling, but its computational cost is still too large for a lightweight model. More recently, some works have adopted average pooling to reduce complexity further [21, 8].
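The shrinking number of valid weights under large dilation rates can be illustrated with a small sketch. It assumes a square feature map, zero padding, and a kernel centred on one pixel; the function names and the 33×33 map size are illustrative choices, not taken from the paper.

```python
def effective_kernel_size(kernel, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def valid_weights(size, kernel, dilation, y, x):
    """Count kernel taps that fall inside a size x size feature map
    when the dilated kernel is centred at pixel (y, x).

    Taps outside the map only see zero padding, so they contribute nothing.
    """
    half = kernel // 2
    count = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            yy, xx = y + dy * dilation, x + dx * dilation
            if 0 <= yy < size and 0 <= xx < size:
                count += 1
    return count


# Hypothetical 33 x 33 feature map, 3 x 3 kernel centred in the middle.
for rate in (1, 8, 16, 17):
    print(rate, effective_kernel_size(3, rate), valid_weights(33, 3, rate, 16, 16))
```

Once the dilation rate exceeds the distance from the centre to the border, only the centre tap remains inside the map, so the layer behaves like a pointwise convolution at that position.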
Detailed local information: Recovering detailed local information is crucial to generating sharp segmentation maps. Conventionally, an encoder-decoder structure based on deconvolution (or transposed convolution) is applied [10, 13]. By concatenating high-resolution features, these models recover the original resolution step by step. Some works also use global attention for upsampling. The feature pyramid attention uses global pooling to enhance the high-resolution feature map from the low-resolution one. However, the attention vector cannot reflect local information well because of the global pooling. Recently, two-branch methods have been suggested for better segmentation. ContextNet and FastSCNN designed two-path networks in which one branch captures global context and the other captures detailed information. BiSeNet also proposed a similar two-path network for preserving spatial information as well as acquiring a large enough receptive field. However, it needs to calculate features twice, once for each branch.
In this section, we explain the structure of the proposed SINet, which consists of a spatial squeeze module and an information blocking decoder. The spatial squeeze module (S2-module) handles global consistency by using the multi-receptive field scheme, and squeezes the feature resolution to mitigate the high latency of multi-path structures. The information blocking decoder is designed to take only the necessary information from the high-resolution feature maps by utilizing the confidence score of the low-resolution feature maps. The information blocking in the decoder is important for increasing robustness to translation (Section 3.1), and the S2-module can handle global consistency without heavy computation (Section 3.2). We also demonstrate a simple data generation framework to address the lack of data in two situations: 1) having human segmentation ground truths and 2) having only raw images (Section 3.4).
An encoder-decoder structure is the most commonly used structure for segmentation. An encoder extracts semantic features from the incoming images, and a decoder recovers the detailed local information and resolution of the feature map. For designing the decoder, bilinear upsampling or transposed convolution upsampling blocks are commonly used to expand the low-resolution feature maps from the encoder. Recent works [6, 15, 7, 4] also re-use additional high-resolution feature maps from the encoder to produce more accurate segmentation results. To the best of our knowledge, most studies take all the information of the high-resolution feature maps from the encoder by concatenation, element-wise summation, or by enhancing the high-resolution feature maps via attention vectors from the low-resolution ones. However, using the high-resolution feature maps re-introduces nuisance local information that the encoder has already removed. Therefore, we should take only the necessary cues and avoid the nuisance noise.
Here, we introduce a new concept of a decoder structure using information blocking. We measure the confidence score in the low-resolution feature map and block the information flow from the high-resolution feature into the region where the encoder successfully segmented with high confidence. The information blocking process removes nuisance information from the image and makes the high-resolution feature map concentrate only on the low confidence regions.
Figure 3 shows the overall architecture of SINet and the detailed process of the information blocking decoder. The model projects the last set of feature maps of the encoder to the number of classes with a pointwise convolution, and uses bilinear upsampling to match the resolution of the high-resolution target segmentation map. The model employs a softmax function to get the probability of each class and calculates each pixel's confidence score $c$ by taking the maximum value among the class probabilities. Finally, we generate an information blocking map by computing $1 - c$. We perform pointwise multiplication between the information blocking map and the high-resolution feature maps. This ensures that low-confidence regions get more information from the high-resolution feature maps, while high-confidence regions keep their original values in the subsequent pointwise addition operation.
Figure 4 (b) is an example of the information blocking map, and (c) is the confidence map from the model output. As shown in Figure 4 (b), the boundary and clothing have high uncertainty, while the inner parts of the foreground and background already have a high confidence score. This indicates that the high-uncertainty regions need more detailed local information to reduce uncertainty. However, the inner parts of the face, such as the beard and nose, do not need more local information to produce a segmentation map. If the local information were embedded there, it could harm the global consistency by acting as noise. In the final confidence map of the model (Figure 4 (c)), the uncertain region around the boundary has shrunk, and the confidence score of the inner part is greatly improved.
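The gating step of the information blocking decoder can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are nested lists of shape H x W x C, the high-resolution features are assumed to be already projected to C channels, and the gated features are added directly to the upsampled class scores. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def information_blocking(low_res_scores, high_res_feat):
    """Fuse upsampled class scores with gated high-resolution features.

    low_res_scores: H x W x C nested lists, already upsampled to H x W.
    high_res_feat:  H x W x C nested lists (assumed to have C channels).
    Returns (fused, blocking_map).
    """
    fused, blocking_map = [], []
    for score_row, feat_row in zip(low_res_scores, high_res_feat):
        fused_row, gate_row = [], []
        for scores, feat in zip(score_row, feat_row):
            confidence = max(softmax(scores))   # per-pixel confidence score
            gate = 1.0 - confidence             # information blocking value
            gate_row.append(gate)
            # Pointwise multiplication by the gate, then pointwise addition.
            fused_row.append([s + gate * f for s, f in zip(scores, feat)])
        fused.append(fused_row)
        blocking_map.append(gate_row)
    return fused, blocking_map


# One confident pixel and one uncertain pixel (made-up values).
scores = [[[8.0, -8.0], [0.1, -0.1]]]
features = [[[1.0, 1.0], [1.0, 1.0]]]
fused, gate = information_blocking(scores, features)
print(gate)
```

In this toy input the confident pixel receives almost nothing from the high-resolution features, while the uncertain pixel receives roughly 45% of them.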
Figure 5: (a) Spatial squeeze block (S2-block). (b) Spatial squeeze module (S2-module).
A multi-path structure has the advantage of high accuracy with fewer parameters [21, 20, 20, 25], but it suffers from increased latency proportional to the number of sub-paths. The proposed spatial squeeze module (S2-module) resolves this problem, and Figure 5 shows its structure. We utilize average pooling for adjusting the size of the receptive field and reducing the latency.
First, we use a pointwise convolution to reduce the number of feature maps by half. For further reduction of computation, we use a group pointwise convolution with channel shuffle. The reduced feature maps pass through each S2-block, and the results are merged through concatenation. We also adopt a residual connection between the input feature map and the merged feature map. Finally, PReLU is utilized for non-linearity.
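As a small illustration of the channel shuffle that follows the group pointwise convolution, the sketch below reorders a list of channel indices. It assumes the standard ShuffleNet-style shuffle (view the channels as a groups x channels-per-group grid, then transpose); the text does not spell out which variant SINet uses, and the function name is hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so that each group mixes with the others.

    channels: list of per-channel items (indices, or feature maps).
    groups:   number of groups used by the preceding group convolution.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    # Read the (groups x per_group) grid column by column.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))
```

After the shuffle, neighbouring channels come from different groups, so the next grouped operation sees information from every group.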
For the S2-block, we select average pooling rather than dilated convolution to build a multi-receptive field structure for two reasons. First, the latency is affected by the dilation rate, as shown in Table 1, and dilated convolution is not free from the problem of grid effects [15, 22]. Second, the multi-path structure is not friendly to GPU parallel computing. Thus, we squeeze the resolution of each feature map to avoid sacrificing latency. The S2-block squeezes the resolution of a feature map by average pooling, with a kernel size of up to 4. Then, a depthwise separable convolution with a kernel size of 3×3 or 5×5 is used. Between the depthwise convolution and the pointwise convolution, we use a PReLU non-linear activation function. Empirically, placing the pointwise convolution before or after the bilinear upsampling does not have a critical effect on the accuracy. Therefore, we put it before the bilinear upsampling to further reduce computation. We also insert a batch normalization layer after the depthwise convolution and after the bilinear upsampling.
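To see why squeezing the resolution saves computation, the following back-of-the-envelope sketch counts multiply-accumulate operations (MACs) of a depthwise separable convolution before and after average pooling. The feature map size (56×56) and channel count (48) are hypothetical, and the cost of the pooling and upsampling themselves is ignored.

```python
def dsconv_macs(height, width, c_in, c_out, kernel):
    """MACs of a depthwise separable convolution (stride 1, same padding)."""
    depthwise = height * width * c_in * kernel * kernel
    pointwise = height * width * c_in * c_out
    return depthwise + pointwise


def s2_block_macs(height, width, channels, kernel, pool):
    """Rough MACs of the convolutions in one S2-block.

    The feature map is first squeezed by `pool` in both spatial dimensions,
    so the convolutions run on (height // pool) x (width // pool) positions.
    """
    return dsconv_macs(height // pool, width // pool, channels, channels, kernel)


# Hypothetical 56 x 56 map with 48 channels and a 3 x 3 depthwise kernel.
for pool in (1, 2, 4):
    print(pool, s2_block_macs(56, 56, 48, 3, pool))
```

Under these assumptions the convolution cost falls with the square of the pooling size, which is the intuition behind placing the pointwise convolution before the bilinear upsampling as well.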
Table 2: Detailed settings of the SINet encoder. In each bracket, k is the kernel size of the convolution and p is the average pooling size of an S2-block.
| 3 | SB module | [k=3, p=1], [k=5, p=1] |
| 4 | SB module | [k=3, p=1], [k=3, p=1] |
| 5 | DSConv+SE | Concat [2, 4], Down sampling |
| 6 | SB module | [k=3, p=1], [k=5, p=1] |
| 7 | SB module | [k=3, p=1], [k=3, p=1] |
| 8 | SB module | [k=5, p=1], [k=3, p=2] |
| 9 | SB module | [k=5, p=2], [k=3, p=4] |
| 10 | SB module | [k=3, p=1], [k=3, p=1] |
| 11 | SB module | [k=5, p=1], [k=5, p=1] |
| 12 | SB module | [k=3, p=2], [k=3, p=4] |
| 13 | SB module | [k=3, p=1], [k=5, p=2] |
| 14 | 1x1 conv | Concat [5, 13] |
In this part, we explain the overall structure of SINet. SINet uses S2-modules as bottlenecks and depthwise separable convolutions (ds-conv) with stride 2 for reducing the resolution of feature maps. Empirically, applying the S2-module with stride 2 for downsampling improves accuracy, but we found that it has a longer latency than the S2-module with stride 1 under the same output size conditions. Therefore, for downsampling, instead of the S2-module with stride 2, we use ds-conv with Squeeze-and-Excite blocks. For the first bottleneck we use two sequential S2-modules, and for the second bottleneck we use eight. The detailed settings of the S2-modules are described in Table 2. We add a residual connection for each bottleneck, concatenating the bottleneck input with its output. A 1×1 convolution is used for classification, and finally bilinear upsampling is applied to recover the original input resolution.
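We found that a weighted auxiliary loss for the boundary part is helpful in improving the accuracy. The final loss is as follows:

$$B = (f \oplus y^{*}) - (f \ominus y^{*})$$

$$\mathrm{Loss} = \ell\left(y^{*}_{i \in P}, \hat{y}_{i \in P}\right) + \lambda\, \ell\left(y^{*}_{j \in B}, \hat{y}_{j \in B}\right)$$

Here, $f$ is a filter used for the morphological dilation ($\oplus$) and erosion ($\ominus$) operations. $P$ denotes all the pixels of the ground truth, and $B$ denotes the pixels in the boundary area as defined by the morphology operation. $y^{*}_{i}$ is a binary ground truth value and $\hat{y}_{i}$ is a predicted label from a segmentation model. $\ell$ denotes the segmentation loss evaluated over the indicated set of pixels, and $\lambda$ is a hyperparameter that controls the balance between the loss terms.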
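A boundary-weighted loss of this kind can be sketched in plain Python as follows. Several choices here are assumptions made only for illustration: binary cross-entropy as the per-pixel loss, mean reduction over each pixel set, a square morphological filter of radius 1, and a balance weight of 0.5. None of these values are taken from the paper.

```python
import math


def _morph(mask, radius, op):
    """Square (2 * radius + 1) max filter (dilation) or min filter (erosion).

    Windows are clipped at the image border.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = op(window)
    return out


def boundary_mask(gt, radius=1):
    """Boundary region: dilated ground truth minus eroded ground truth."""
    dilated = _morph(gt, radius, max)
    eroded = _morph(gt, radius, min)
    return [[d - e for d, e in zip(d_row, e_row)]
            for d_row, e_row in zip(dilated, eroded)]


def bce(target, prob, eps=1e-7):
    """Binary cross-entropy for one pixel (assumed per-pixel loss)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def boundary_weighted_loss(gt, pred, lam=0.5, radius=1):
    """Mean loss over all pixels plus lam times the mean loss over the boundary."""
    boundary = boundary_mask(gt, radius)
    all_losses, boundary_losses = [], []
    for gt_row, pred_row, b_row in zip(gt, pred, boundary):
        for target, prob, on_boundary in zip(gt_row, pred_row, b_row):
            loss = bce(target, prob)
            all_losses.append(loss)
            if on_boundary:
                boundary_losses.append(loss)
    total = sum(all_losses) / len(all_losses)
    if boundary_losses:
        total += lam * sum(boundary_losses) / len(boundary_losses)
    return total
```

Pixels in the boundary band are counted in both terms, so errors near the silhouette are penalized more heavily than errors deep inside the foreground or background.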
| Method | Parameters (M) | FPS | FLOPs (G) | F1-score | mIoU | mIoU (originally reported) |
| ENet (2016) | 0.355 | 8.06 | 0.346 | 0.917 | 95.16 | 96 |
| BiSeNet (2018) | 0.124 | 2.99 | 2.31 | 0.908 | 94.91 | 95.25 |
Annotating data often comes with high costs, and the annotation time per instance varies a lot depending on the task type. For example, the annotation time per instance for PASCAL VOC is estimated to be 20.0 seconds for image classification and 239.7 seconds for segmentation, an order of magnitude difference. To mitigate the cost of annotation for portrait segmentation, we consider two plausible situations: 1) having images with ground truth human segmentation, and 2) having only raw images. We make use of either an elaborate face detector model (case 1) or a segmentation model (case 2) to generate pseudo ground truths for each situation.
When we have human images and ground truths, the only thing we need is a bounding box around the portrait area. We took images from the Baidu dataset, which contains 5,382 full-body human segmentation images covering various poses, fashions, and backgrounds. To get the bounding box and portrait area, we detect the face location in each image using a face detector. Since the face detector tightly bounds the face region, we increase the bounding box size to include parts of the upper body and background before cropping the image and the ground truth segmentation.
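The box enlargement and cropping step can be sketched as below. The margin factors (one face width on each side, half a face height above, one and a half below) are hypothetical values chosen for illustration; the text does not state how much the box is enlarged. Images and masks are represented as nested lists so that the same crop can be applied to both.

```python
def expand_face_box(face_box, image_w, image_h, side=1.0, top=0.5, bottom=1.5):
    """Grow a tight face box into a portrait box, clipped to the image.

    face_box: (x0, y0, x1, y1) in pixels, as returned by some face detector.
    side, top, bottom: margins as fractions of the face width or height
    (hypothetical defaults).
    """
    x0, y0, x1, y1 = face_box
    w, h = x1 - x0, y1 - y0
    nx0 = max(0, int(round(x0 - side * w)))
    ny0 = max(0, int(round(y0 - top * h)))
    nx1 = min(image_w, int(round(x1 + side * w)))
    ny1 = min(image_h, int(round(y1 + bottom * h)))
    return nx0, ny0, nx1, ny1


def crop(grid, box):
    """Crop a row-major 2-D grid (image or mask) to box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in grid[y0:y1]]


# A 20 x 20 face box inside a 100 x 80 image.
print(expand_face_box((40, 30, 60, 50), image_w=100, image_h=80))
```

Applying `crop` with the same box to the image and to its ground truth mask keeps the two aligned.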
We also create a second augmentation from portrait images scraped from the web, applying a more heavyweight segmentation model to generate pseudo ground truth segmentation masks. This segmentation model consists of a DeepLabv3+ architecture with an SE-ResNeXt-50 backbone. The model is pre-trained on ImageNet and fine-tuned on a proprietary dataset containing around 2,500 fine-grained human segmentation images. The model is trained for general human segmentation rather than for the specific purpose of portrait segmentation.
Finally, human annotators just check the quality of each pseudo ground truth image, removing obvious failure cases. This method reduces the annotation effort per instance from several minutes to 1.25 seconds by transforming the segmentation task into a binary classification task.
We evaluated the proposed method on the public dataset EG1800, which consists of images collected from Flickr with manually annotated labels. The dataset has a total of 1,800 images and is divided into 1,500 train and 300 validation images. However, we could access only 1,309 images for training and 270 for validation, since some of the URLs are broken. We built an additional 10,448 images using the proposed data generation method described in Section 3.4.
We trained our model using the ADAM optimizer with an initial learning rate and weight decay, for a total of 600 epochs. We followed an existing data augmentation method. We used a two-stage training method: for the first 300 epochs, we trained only the encoder. Then, we initialized the encoder with the best parameters from the previous step and trained the overall SINet model for an additional 300 epochs. We evaluated our model, followed by various ablations, using mean intersection over union (mIoU) and the F1-score in the boundary part, and compared it with SOTA portrait segmentation models as well as other lightweight segmentation models. To define the boundary region, we subtract the eroded ground truths from the dilated ground truths using a morphological kernel. We demonstrated the robustness of the information blocking decoder on randomly rotated EG1800 validation images, and the importance of the multi-receptive structure on EG1800 validation images, in Section 4.2. We also showed that the proposed method can be used for general tasks by evaluating it on the Cityscapes dataset.
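For reference, the mIoU metric for a binary mask can be computed as in the sketch below. It assumes that IoU is averaged over the background and foreground classes and that a class absent from both masks is skipped; the exact evaluation protocol used in the experiments is not spelled out here, so treat this as one common convention.

```python
def binary_miou(gt, pred):
    """Mean intersection over union across classes 0 and 1.

    gt, pred: row-major 2-D lists of 0/1 labels with identical shape.
    """
    ious = []
    for cls in (0, 1):
        intersection = union = 0
        for gt_row, pred_row in zip(gt, pred):
            for g, p in zip(gt_row, pred_row):
                intersection += int(g == cls and p == cls)
                union += int(g == cls or p == cls)
        if union:  # skip a class that appears in neither mask
            ious.append(intersection / union)
    return sum(ious) / len(ious)


gt = [[1, 1], [0, 0]]
pred = [[1, 0], [0, 0]]
print(binary_miou(gt, pred))
```

In this toy case the background IoU is 2/3 and the foreground IoU is 1/2, so the mean is 7/12.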
We compared the proposed model to PortraitNet, which has SOTA accuracy in the portrait segmentation field. Since some sample URLs in the EG1800 dataset are missing, we re-trained PortraitNet on the remaining samples of the EG1800 dataset, following the original method in the paper and using the official code. PortraitNet compared its results to BiSeNet and ENet. Therefore, we also re-trained BiSeNet and ENet following the method of PortraitNet for a fair comparison. As shown in Table 3, the accuracies of the re-trained models are slightly decreased due to the reduced size of the training dataset. We measured latency on an Intel Core i5-7200U CPU with the PyTorch framework on an LG gram laptop.
Among the compared methods, DS-ESPNet has the same structure as ESPNet, except that the standard dilated convolutions of the model are replaced with depthwise separable dilated convolutions. For ESPNetV2 (2.0) and ESPNetV2 (1.5), we changed the number of channels of the convolutional layers to reduce the model size, following the official code. We also halved the number of channels of the convolutions in DS-ESPNet (0.5) relative to the original model, to bring it below 0.1M parameters and 0.2G FLOPs. The original ContextNet used 4 pyramid poolings, but we used only 3 due to the small feature map size.
From Table 3, we see that our proposed method achieved comparable or better performance than the other models, while having fewer parameters and FLOPs, and higher FPS. The SOTA PortraitNet showed the highest accuracy in all the experimental results, and achieved even better performance than the heavier BiSeNet. However, PortraitNet requires a large number of parameters, which is a disadvantage for using it on smaller devices. The proposed SINet reduces the number of parameters by 95% and FLOPs by 80% compared to PortraitNet, while maintaining accuracy. ESPNet and ESPNet V2 have similar accuracy, but show a trade-off between the number of parameters and FLOPs: ESPNet V2 has more parameters than ESPNet, but ESPNet needs more FLOPs than ESPNet V2. ENet shows better performance than both models but requires more FLOPs. In our comparison, the proposed method has fewer parameters and FLOPs, but still achieved better accuracy than ESPNet and ESPNet V2. In particular, our SINet has the highest accuracy in the extremely lightweight regime. Figure 6 shows that the quality of our model is superior to other extremely lightweight models.
We compared the execution speed of the proposed model with the SOTA segmentation model MobileNet V3 on an iPhone XS using the CoreML framework. MobileNet V3 runs at 60.7 FPS, and our SINet runs at 100.6 FPS. The FLOPs of MobileNet V3 and SINet are similar, but SINet is much faster than MobileNet V3. We conjecture that the SE block and the h-swish activation function are the main reasons for the increased latency of MobileNet V3. In summary, the proposed SINet showed outstanding performance among the various segmentation models in terms of accuracy and speed.
Information blocking decoder: Table 4 shows the accuracy improvement from using the information blocking decoder. We randomly rotated validation images and evaluated mIoU over the whole image. Reverse IB denotes that we multiply the high-resolution feature maps by the confidence score $c$ instead of $1 - c$, thus enhancing high-confidence pixels rather than low-confidence ones. Remove IB means that we did not use any information blocking, and instead conducted element-wise summation between the low-resolution feature maps and the high-resolution feature maps from a middle layer of the encoder. GAU uses global pooling to enhance the high-resolution feature map from the low-resolution feature map before applying element-wise summation. GAU performs better than Reverse IB and Remove IB, but it still fails to produce a tight boundary and performs worse than IB on translated images. From these results, we can see that the information blocking decoder shows outstanding performance compared to the other methods. Qualitatively, it prevents segmentation errors in the background region, as shown in Figure 7.
Table 5 shows the performance depending on the multi-receptive structure. SINet uses various combinations of kernel sizes for convolution and pooling. For comparison, we re-designed the S2-module to always use the same kernel sizes within the S2-block for all convolutional and pooling layers, respectively. As shown in Table 5, our SINet achieved a higher mIoU and F1-score than the other combinations. Therefore, a multi-receptive field structure has an accuracy advantage over a single-receptive field one.
Table 7: Detailed settings of the encoder for Cityscapes segmentation.
| 2 | DSConv + SE | Down sampling |
| 3 | DSConv + SE | Down sampling |
| 4 | SB module | [k=3, p=1], [k=5, p=1] |
| 5 | SB module | [k=3, p=0], [k=3, p=1] |
| 6 | SB module | [k=3, p=0], [k=3, p=1] |
| 7 | DSConv + SE | Concat [3, 6], Down sampling |
| 8 | SB module | [k=3, p=1], [k=5, p=1] |
| 9 | SB module | [k=3, p=0], [k=3, p=1] |
| 10 | SB module | [k=5, p=1], [k=5, p=4] |
| 11 | SB module | [k=3, p=2], [k=5, p=8] |
| 12 | SB module | [k=3, p=1], [k=5, p=1] |
| 13 | SB module | [k=3, p=1], [k=5, p=1] |
| 14 | SB module | [k=3, p=0], [k=3, p=1] |
| 15 | SB module | [k=5, p=1], [k=5, p=8] |
| 16 | SB module | [k=3, p=2], [k=5, p=4] |
| 17 | SB module | [k=3, p=0], [k=5, p=2] |
| 18 | 1x1 conv | Concat [7, 17] |
We also demonstrate that our proposed method is suitable not only for the binary segmentation problem but also for general segmentation problems by testing the model on the Cityscapes dataset. We slightly increased the number of layers and channels to cope with the increased complexity compared to the binary segmentation task, and we factorized the depthwise convolution in the S2-blocks to reduce the number of parameters. Here, SINet has only 0.12M parameters and 1.2G FLOPs, yet our model showed better accuracy than any other lightweight segmentation model except MobileNet V3 and MobileNet V2. The accuracy of SINet is 2.9% lower than that of MobileNet V3, but its number of parameters and FLOPs are much lower. Table 7 gives the detailed settings of the encoder model for the Cityscapes segmentation.
In this paper, we proposed an extremely lightweight portrait segmentation model, SINet, which consists of an information blocking decoder and spatial squeeze modules. SINet runs on a mobile device at 100.6 FPS while preserving accuracy. The information blocking decoder blocks nuisance information from high-resolution features and induces the model to concentrate more on high-uncertainty regions. The spatial squeeze module has a multi-receptive field structure to handle the various sizes of global consistency in an image. We also proposed a simple data generation framework covering two situations: 1) having human segmentation ground truths and 2) having only raw images. In experiments on the public portrait segmentation dataset, our model obtained outstanding performance compared to existing lightweight segmentation models. Beyond the portrait dataset, the proposed method also reaches reasonable accuracy (66.5%) on Cityscapes with only 0.12M parameters and 1.2G FLOPs.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
ESPNetv2: a light-weight, power efficient, and general purpose convolutional neural network. arXiv preprint arXiv:1811.11431.