MMNet
Code for Towards Real-Time Automatic Portrait Matting on Mobile Devices
We tackle the problem of automatic portrait matting on mobile devices. The proposed model is aimed at attaining real-time inference on mobile devices with minimal degradation of model performance. Our model MMNet, based on multi-branch dilated convolution with linear bottleneck blocks, outperforms the state-of-the-art model and is orders of magnitude faster. The model can be accelerated four times to attain 30 FPS on a Xiaomi Mi 5 device with a moderate increase in the gradient error. Under the same conditions, our model has an order of magnitude fewer parameters and is faster than Mobile DeepLabv3 while maintaining comparable performance. The accompanying implementation can be found at <https://github.com/hyperconnect/MMNet>.
Image matting, the task of predicting the alpha value of the foreground at every pixel, has been studied [12, 13, 15, 8, 14]. An image matting system enables a wide range of applications in computer vision such as color transformation, stylization, and background edits. It is well known, however, that image matting is an ill-posed problem [19], since seven unknown values (three for foreground RGB, three for background RGB and one for alpha) must be inferred from three known RGB values. The most widely used method to alleviate the difficulties of the matting problem is to utilize an additional input which roughly separates an image, such as a trimap [12, 30] or scribbles [19]. A trimap splits an image into three parts: definite foreground, definite background, and ambiguous blended regions. Scribbles, on the other hand, indicate foreground and background with a few strokes. Even though some of the traditional methods [27, 34, 19, 30] work well if additional inputs are provided, it is hard to extend these methods to image and video matting applications which require real-time performance, due to their high computational complexity as well as their dependency on user-interactive inputs.

Other approaches have been studied to automate matting by specifying the object which has to be selected as the foreground [29, 36, 9], for example, portrait matting. Automatic portrait matting showed even better results than the other methods using a trimap [29], but its latency is far too high for a real-time application. Zhu et al. [39] released a lightweight model which can perform automatic matting relatively fast on mobile devices, attaining a latency of 62 ms per image on a Xiaomi Mi 5. However, the gradient error of the lightweight model was more than two times worse than that of the state-of-the-art, which made it less attractive in real-world applications.

In this paper, we propose a compact neural network model for automatic portrait matting which is fast enough to run on mobile devices. The proposed model adopts an encoder-decoder network structure [4] and focuses on devising efficient components of the network. We apply depthwise convolution [31] as the basic convolution operation to extract and downsample features. The depthwise convolution is considerably cheaper than other convolutions, even if we take efficient convolutions such as $1 \times 1$ convolution [16] into account as well. The linear bottleneck structure [26] benefits from the efficiency of depthwise convolutions, boosting the performance while maintaining the latency. Building upon these observations, the encoder block of the proposed model consists of multi-branch dilated convolution with linear bottleneck blocks, which can reduce the model size with the linear bottleneck structure while aggregating multi-scale information with multi-branch dilated convolutions. We introduce the width multiplier, a global variable which enlarges or shrinks the number of channels of a convolution, to control the trade-off between the size and the latency of the model. We incorporate multiple losses into our loss function, including a gradient loss which we propose.

The proposed model shows better performance than the state-of-the-art method while achieving 30 FPS on iPhone 8 without GPU acceleration. We also evaluate the trade-offs between performance, the latency on mobile devices, and the size of the model. Our model can achieve 30 FPS on Google Pixel 1 and Xiaomi Mi 5 using a single core while suffering roughly 10% degradation of gradient error compared to the state-of-the-art. Our contributions are as follows:
We propose a compact network architecture for the automatic portrait matting task which achieves real-time latency on mobile devices.
We explore multiple combinations of input resolution and width multiplier, which can beat strong baselines for automatic portrait matting on mobile devices.
We demonstrate the capability of each component of the model, including the multi-branch dilated convolution with linear bottleneck blocks, the skip connection refinement block and the enhancement block, through ablation studies.
Image matting problems take an input image $I$, which is a mixture of the foreground image $F$ and the background image $B$. Each pixel at the $i$-th position is computed as follows:

$$I_i = \alpha_i F_i + (1 - \alpha_i) B_i \tag{1}$$

where the foreground opacity determines $\alpha_i$. Since all the quantities on the right-hand side of Equation 1 are unknown, the problem is ill-posed. However, we add the assumption that $F$ and $B$ are identical to $I$ in order to reduce the complexity of the problem. Even though the assumption may decrease the performance substantially, the empirical results of our experiments show that this assumption is reasonable considering the latency gain we get.
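As a small illustration of the compositing model in Equation 1, the sketch below blends a single RGB pixel from a foreground color, a background color and an alpha value. The function name `composite` and the sample colors are hypothetical and only serve to show why the problem is under-determined: seven numbers go in, three come out.

```python
def composite(alpha, fg, bg):
    """Blend one RGB pixel: I = alpha * F + (1 - alpha) * B."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# Hypothetical example values: 3 foreground + 3 background + 1 alpha = 7 unknowns
# on the right-hand side, but only the 3 blended values are observed.
foreground = (200.0, 100.0, 0.0)
background = (0.0, 100.0, 200.0)
pixel = composite(0.25, foreground, background)
print(pixel)
```

Matting has to invert this mapping, which is why extra inputs (trimaps, scribbles) or simplifying assumptions are needed.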
Our model follows a standard encoder-decoder architecture that is widely used in semantic segmentation tasks [20, 24, 4]. The encoder successively reduces the size of the input by downsampling and summarizes the spatial information while capturing higher-level semantic information. The decoder, in turn, upsamples the feature map to recover the detailed spatial information and restores the original input resolution. The whole network structure of our model, the mobile matting network (MMNet), is depicted in Figure 2.

Many modern neural network architectures replace a regular convolution with a combination of several cheaper convolutions [11, 35, 32]. Depthwise separable convolution [16, 11] is one example, which consists of a depthwise convolution, applying a single convolutional filter per input channel, and a pointwise convolution ($1 \times 1$ convolution) that accumulates the results. We not only use depthwise separable convolution for some blocks but also adopt the concept of depthwise separable convolution when designing our encoder block. Depthwise convolution, in particular, is used extensively. All convolution operations are followed by batch normalization and a ReLU6 non-linearity, except the linear projection operation that is placed at the end of the encoder block [26].

Due to the linear bottleneck structure, the information flowing from one encoder block to another is projected to a low-dimensional representation. In the encoder block, the information flowing from the lower layers is expanded by the first multi-branch convolutions. The linear bottleneck then compresses the processed features. The data upsampled by the decoder block is concatenated with the refined features coming through a skip connection. The number of channels for each path is kept the same. Table 1 details how much each component expands and compresses the information flow. To control the trade-off between model size and model performance, we adopt the width multiplier [16]. The width multiplier is a global hyperparameter that is multiplied by the number of input and output channels to make the layers thinner or thicker depending on the computational budget.
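To make the cost argument concrete, the following sketch counts the weights of a regular convolution and of a depthwise separable convolution, and shows how a width multiplier changes both. It ignores biases and batch normalization parameters; the helper names and the rounding rule in `scale_channels` are assumptions for illustration, not the rule used in the released implementation.

```python
def regular_conv_params(k, c_in, c_out):
    """Weights of a regular k x k convolution."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    return k * k * c_in + c_in * c_out


def scale_channels(channels, width_multiplier):
    """Apply a width multiplier; rounding to the nearest integer is an assumption."""
    return max(1, round(channels * width_multiplier))


for multiplier in (1.0, 0.5):
    c = scale_channels(40, multiplier)
    print(multiplier, c, regular_conv_params(3, c, c), separable_conv_params(3, c, c))
```

For a 3 x 3 layer with 40 input and 40 output channels the separable form needs 1,960 weights instead of 14,400, and halving the width shrinks the pointwise part roughly fourfold because it depends on the product of input and output channels.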
Name | Component Details | Output Size |
---|---|---|
Initial Block | Conv , S | , 32 |
Encoder 1 | DR , S | , 16 |
Encoder 2 | DR , S | , 24 |
Encoder 3 | DR , S | , 24 |
Encoder 4 | DR , S | , 24 |
Encoder 5 | DR , S | , 40 |
Encoder 6 | DR , S | , 40 |
Encoder 7 | DR , S | , 40 |
Encoder 8 | DR , S | , 40 |
Encoder 9 | DR , S | , 80 |
Encoder 10 | DR , S | , 80 |
Decoder 1 | Upsample (Skip 5) | , 128 |
Decoder 2 | Upsample (Skip 1) | , 80 |
Enhancement 1 | DR , S | , 40 |
Enhancement 2 | DR , S | , 40 |
Decoder 3 | Upsample | , 16 |
Final Block | Conv , Softmax | , 2 |
The MMNet encoder block has a multi-branch dilated convolution structure with a linear bottleneck. The input flows to multiple branches, each of which undergoes channel expansion followed by a strided convolution and a dilated convolution. The dilation rates are different for all branches following rates. Multi-branch dilated convolution amounts to sampling spatial information at different scales. The outputs of the different branches are concatenated to form a tensor containing multi-scale information. Applying encoder blocks in succession allows the network to capture increasingly multi-level information. As the encoder blocks are consecutively applied, we decrease the number of branches in an encoder block, slowly changing the dilation rates from to . A linear bottleneck structure is imposed on the encoder block, where the output of the encoder block is thinner than the intermediate representations. The final convolution after combining the multi-branch information projects the input to a low-dimensional compressed representation. The linear bottleneck is a decomposition of a regular convolution that connects two encoder blocks into two cheaper convolutions with reduced channels. The encoder block is illustrated in Figure 3.

The decoder performs multiple upsampling steps to restore the initial resolution of the input image. To help the decoder restore low-level features from compressed spatial information, skip connections are employed to directly connect the output of a lower-layer encoder to its corresponding decoder [24]. Instead of using the information provided by the corresponding encoder blocks without any modification, we refine it by performing a depthwise separable convolution. The resulting refined information is concatenated with the upsampled information. This specific refinement technique is reminiscent of the refinement module proposed in SharpMask [22, 33]. A decoder block with a refinement block is illustrated in Figure 4. In this work, we connect the feature maps of encoder #1 and encoder #5 to decoder #2 and decoder #1, respectively. In the final decoder block, we perform upsampling instead of the usual to shorten the decoding pipeline.
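The claim that branches with different dilation rates sample spatial information at different scales can be checked with the standard formula for the span of a dilated kernel, `k + (k - 1) * (d - 1)`. The sketch below is only illustrative: the kernel size of 3 and the rates in `HYPOTHETICAL_RATES` are assumptions, since the actual rates are not recoverable from the text here.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span in pixels covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Hypothetical dilation rates, one per branch (not taken from the paper).
HYPOTHETICAL_RATES = (1, 2, 4, 8)
spans = {d: effective_kernel_size(3, d) for d in HYPOTHETICAL_RATES}
print(spans)
```

Every branch uses the same number of weights, yet the branch with the largest rate covers a much wider neighborhood, which is what makes the concatenated output multi-scale.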
As the decoder block keeps upsampling the feature map, there is no way to enhance the predictions of neighboring values. To tackle this problem, we insert two enhancement blocks in the middle of the decoding phase. Rather than designing a new block, we reuse the architecture of the encoder block. The only difference between the enhancement block and the encoder block is that the depthwise convolution with stride two is removed, because the enhancement block should preserve the resolution of the feature map. In the ablation study, we show the effectiveness of the enhancement block.
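The alpha loss and the compositional loss are frequently used in matting tasks. The alpha loss $\mathcal{L}_{\alpha}$ measures the mean absolute difference between the ground truth mask and the mask predicted by the model. The compositional loss $\mathcal{L}_{c}$ measures the mean absolute difference between the values of the ground truth RGB foreground pixels and the model-predicted RGB foreground pixels. The compositional loss penalizes the model when the model incorrectly predicts a pixel with a high value.

$$\mathcal{L}_{\alpha} = \frac{1}{N} \sum_{i=1}^{N} \left| \alpha_i^{gt} - \alpha_i \right| \tag{2}$$

$$\mathcal{L}_{c} = \frac{1}{N} \sum_{i=1}^{N} \left| \alpha_i^{gt} I_i - \alpha_i I_i \right| \tag{3}$$

where $N$ is equal to the width times the height, $N = W \times H$, and $\alpha$ is a vectorized alpha matte where each pixel value is indexed by the subscript $i$. The gt superscript denotes that the alpha matte is from the ground truth. We use the KL divergence between the ground truth $\alpha^{gt}$ and the model-predicted $\alpha$. The KL divergence is defined to be:

$$KL(\alpha^{gt} \,\|\, \alpha) = \sum_{i=1}^{N} \alpha_i^{gt} \log \frac{\alpha_i^{gt}}{\alpha_i} \tag{4}$$

$$= -\sum_{i=1}^{N} \alpha_i^{gt} \log \alpha_i - \left( -\sum_{i=1}^{N} \alpha_i^{gt} \log \alpha_i^{gt} \right) \tag{5}$$

The second term is the entropy of the ground truth alpha matte, which is constant with respect to the model-predicted $\alpha$. Removing the second term leads to optimization of the following loss:

$$\mathcal{L}_{KL} = -\sum_{i=1}^{N} \alpha_i^{gt} \log \alpha_i \tag{6}$$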
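The step from the KL divergence to the cross-entropy style loss relies on the identity KL = cross-entropy minus entropy, where the entropy term does not depend on the prediction. A minimal numeric check is sketched below; treating a pixel as a two-class (foreground, background) distribution and the sample probabilities are assumptions made for illustration.

```python
import math


def cross_entropy(p, q):
    """-sum p log q for two discrete distributions with strictly positive entries."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def entropy(p):
    """-sum p log p."""
    return -sum(pi * math.log(pi) for pi in p)


def kl_divergence(p, q):
    """sum p log(p / q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical (foreground, background) probabilities at a single pixel.
target = (0.7, 0.3)
pred_a = (0.6, 0.4)
pred_b = (0.9, 0.1)

gap = kl_divergence(target, pred_a) - (cross_entropy(target, pred_a) - entropy(target))
print(gap)
```

Because the entropy of the target is the same for every prediction, ranking predictions by KL divergence or by cross-entropy gives the same order and the same gradients.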
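Two additional loss terms are included in the loss function. An auxiliary loss [31] $\mathcal{L}_{aux}$ helps with the gradient flow by including an additional KL divergence loss between the downsampled ground truth mask and the output of encoder block #10. A gradient loss $\mathcal{L}_{grad}$ guides the model to capture fine-grained details in the edges. We use a Sobel-like filter $S$ (7) to create a concatenation of two image derivatives $G(\alpha) = [S * \alpha,\ S^{\top} * \alpha]$, where $*$ is a convolution. The resulting $G(\alpha)$ yields a two-channel output that contains the gradient information along the $x$-axis and the $y$-axis. We apply $G$ to both the ground truth mask and the model-predicted mask to compute the mean absolute differences. The gradient loss is computed as follows:

$$\mathcal{L}_{grad} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert G(\alpha^{gt})_i - G(\alpha)_i \right\rVert_1 \tag{8}$$

$$= \frac{1}{N} \sum_{i=1}^{N} \left( \left| (S * \alpha^{gt})_i - (S * \alpha)_i \right| + \left| (S^{\top} * \alpha^{gt})_i - (S^{\top} * \alpha)_i \right| \right) \tag{9}$$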
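A gradient loss of this kind can be sketched in a few lines of plain Python. The block below is an illustration under stated assumptions, not the authors' implementation: it uses the standard 3 x 3 Sobel kernel (the paper only says "Sobel-like"), its transpose for the second direction, valid-mode filtering without padding, and normalization by the number of filtered pixels.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # assumed kernel, horizontal derivative
SOBEL_Y = [list(col) for col in zip(*SOBEL_X)]   # its transpose, vertical derivative


def filter2d(image, kernel):
    """Valid-mode 2-D cross-correlation on nested lists (no padding)."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(k) for b in range(k))
             for c in range(cols)]
            for r in range(rows)]


def gradient_loss(pred, gt):
    """Sum of absolute derivative differences over both directions, per pixel."""
    total, n_pixels = 0.0, 0
    for kernel in (SOBEL_X, SOBEL_Y):
        grad_pred, grad_gt = filter2d(pred, kernel), filter2d(gt, kernel)
        n_pixels = len(grad_gt) * len(grad_gt[0])
        for row_p, row_g in zip(grad_pred, grad_gt):
            total += sum(abs(p - g) for p, g in zip(row_p, row_g))
    return total / n_pixels


# Hypothetical 4 x 4 masks: a sharp vertical edge versus a blurred prediction.
gt = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
pred = [[0.0, 0.5, 0.5, 1.0] for _ in range(4)]
loss = gradient_loss(pred, gt)
print(loss)
```

Cross-correlation is used instead of a flipped-kernel convolution; for this antisymmetric kernel the two differ only in sign, which the absolute value removes. The blurred edge is penalized even though its mean alpha matches the ground truth, which is the behavior the gradient loss is meant to add.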
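Equation 10 gives the loss function of our proposed network:

$$\mathcal{L} = \beta_1 \mathcal{L}_{\alpha} + \beta_2 \mathcal{L}_{c} + \beta_3 \mathcal{L}_{KL} + \beta_4 \mathcal{L}_{aux} + \beta_5 \mathcal{L}_{grad} \tag{10}$$

where we set the $\beta$ values to control the influence of each loss term. We set them all equal to one for the following experiments.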
Automatic portrait matting takes an input image with a portrait and represents each pixel as a linear mixture of the foreground and the background. We use the data provided by Shen et al. [29], which consists of 2,000 images of resolution , where 1,700 and 300 images are used as the training and testing sets respectively. To overcome the lack of training data, we augment images by utilizing scaling, rotation and left-right flip. First, an image is rescaled to the input size of the model and a random scaling factor is selected from to . The image is then scaled with the selected factor. Rotation by , is applied with a probability of , which means that half of the augmented images are not rotated. Additional cropping is performed to make the size of the image match the input size of the model. Finally, the left-right flip is also applied with a probability of .

To train our model, we optimize it with respect to the loss function in Equation 10 using the Adam optimizer with a batch size of 32 and a fixed learning rate of . Input images were resized to and . The model trained on images is faster but produces worse alpha mattes compared to the model trained on images. Weight decays were set to. All experiments are conducted using TensorFlow [3] on a single Titan V GPU.

Following the work of Zhu et al. [39], we used the gradient error to evaluate our model on the portrait matting problem. The gradient error as a metric, which is different from the gradient loss, is defined as:

$$\text{Gradient error} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \nabla \alpha_i - \nabla \alpha_i^{gt} \right\rVert \tag{11}$$

where $\alpha$ is the alpha matte predicted by the model, $\alpha^{gt}$ is the corresponding ground truth, and $N$ is equal to width $\times$ height. $\nabla$ denotes the differential operator that is computed by convolving the alpha map with first-order Gaussian derivative filters with variance [23]. Another metric we use to evaluate our model is the mean absolute difference (MAD). The MAD is defined as follows:

$$\text{MAD} = \frac{1}{N} \sum_{i=1}^{N} \left| \alpha_i - \alpha_i^{gt} \right| \tag{12}$$
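The MAD metric is simple enough to state as code. The sketch below computes it on flattened alpha mattes; the function name and the four-pixel example are hypothetical, and any scaling factor the tables may apply to the reported values is not modeled.

```python
def mean_absolute_difference(pred, gt):
    """Mean of |pred_i - gt_i| over all pixels of two flattened alpha mattes."""
    if len(pred) != len(gt):
        raise ValueError("alpha mattes must have the same number of pixels")
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


# Hypothetical four-pixel mattes with values in [0, 1].
predicted = [0.0, 0.5, 1.0, 0.75]
ground_truth = [0.0, 1.0, 1.0, 0.5]
mad = mean_absolute_difference(predicted, ground_truth)
print(mad)
```

Unlike the gradient error, MAD weights every pixel equally, so it is dominated by large flat regions rather than by fine detail along hair and edges.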
For a fair comparison with previous methods, we scale the predicted alpha matte to the original size of the input images, in this case, and calculate the evaluation metrics. We compare our model to DAPM [29], LDN+FB [39], and Mobile DeepLabv3 [26]. Mobile DeepLabv3 uses MobileNetV2 as its feature extractor and has its atrous spatial pyramid pooling (ASPP) module removed, as suggested by Sandler et al. [26]. We use Equation 10 to optimize Mobile DeepLabv3 on an equal footing with MMNet, but remove the auxiliary loss since it requires a modification to the network architecture.

Table 2 compares the results of DAPM [29], LDN+FB [39], Mobile DeepLabv3 [26], and the proposed method. Input images were scaled to or , depending on the hyper-parameter. When smaller images are fed into the network, the latency drops considerably at the expense of the quality of the alpha matte. Input images were rescaled back to their original resolutions before evaluation. The gradient error and the latency for DAPM and LDN+FB were reported by Zhu et al. [39]. For a fair comparison, we compute the latency of the models on a Xiaomi Mi 5 device (Qualcomm Snapdragon 820 MSM8996 CPU), as suggested by Zhu et al. [39]. Since Zhu et al. [39] did not report how much CPU resources they used, we measure the latency by restricting the use to a single core. Specifically, we use the TensorFlow Lite [2] benchmark tool to compute the latency of Mobile DeepLabv3 and MMNet by averaging 100 runs of the model inference on a Xiaomi Mi 5 device while restricting the models to use a single thread.

Zhu et al. [39] report that DAPM takes 6 seconds on a computer with a Core E5-2600 @ 2.60 GHz CPU. MMNet-1.0 outperforms DAPM while running orders of magnitude faster on a mobile CPU. When the input image is resized to for faster inference, our model attains real-time inference, surpassing the rate of 30 frames per second. The real-time version of MMNet is still competitive against DAPM with a moderate increase in its gradient error. The visual comparison of alpha mattes in Figure 5 illustrates the qualitative differences between the models. MMNet is better able to reconstruct the finer details compared to other models. Even the real-time version of MMNet produces a reasonable alpha matte despite its reduced capacity.
Method | Time (ms) | Gradient Error |
---|---|---|
Graph-cut trimap | - | 4.93 |
Trimap by [28] | - | 4.61 |
Trimap by FCN [20] | - | 4.14 |
Trimap by DeepLab [6] | - | 3.91 |
Trimap by CRFasRNN [38] | - | 3.56 |
DAPM [29] | - | 3.03 |
LDN+FB [39] | 140 | 7.40 |
MD16-0.75 | 146 | 3.23 |
MD16-1.0 | 203 | 3.22 |
MD16-0.75 | 38 | 3.71 |
MMNet-1.0 | 129 | 2.93 |
MMNet-1.4 | 213 | 2.86 |
MMNet-1.0 | 32 | 3.38 |
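The headline numbers can be checked directly against the two MMNet-1.0 rows of the table above. The short calculation below converts latency to frames per second and derives the speedup and the relative increase in gradient error; the variable names are illustrative, and the assumption is that the 129 ms and 32 ms rows are the same model at the larger and smaller input resolution.

```python
def frames_per_second(latency_ms):
    """Throughput implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


# MMNet-1.0 rows from the table: (latency in ms, gradient error).
full_latency, full_error = 129, 2.93
fast_latency, fast_error = 32, 3.38

fast_fps = frames_per_second(fast_latency)
speedup = full_latency / fast_latency
error_increase = (fast_error - full_error) / full_error

print(fast_fps, round(speedup, 2), round(100 * error_increase, 1))
```

A latency of 32 ms stays under the 33.3 ms budget of 30 FPS, the speedup is about fourfold, and the gradient error rises by roughly 15%, in line with the figures quoted in the text.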
To examine the trade-off between execution time and model performance, we explore the model space by varying the width multiplier values and the input resolution. We compare our model with Mobile DeepLabv3 suggested by Sandler et al. [26]. Table 3 details the result of the experiment. The results are sorted by the latency and models with comparable execution time are clustered using horizontal dividers. We see that our proposed model dominates Mobile DeepLabv3 in all clusters in terms of gradient error. Also, note that the number of parameters differs by an order of magnitude. Requiring a small number of parameters is especially appealing if we target a mobile device since end-users do not have to download a bulky model whenever there is an update of the model.
Method | Time (ms) | Gradient | MAD | Params (M) |
---|---|---|---|---|
MD16-0.75 | 146 | 3.25 | 2.31 | 1.327 |
MMNet-1.00 | 129 | 2.93 | 2.48 | 0.199 |
MD8-0.75 | 113 | 3.53 | 2.61 | 1.327 |
MMNet-0.75 | 90 | 2.99 | 2.65 | 0.127 |
MD16-0.50 | 82 | 3.36 | 2.53 | 0.454 |
MD8-0.50 | 66 | 3.61 | 2.85 | 0.713 |
MMNet-0.50 | 61 | 3.17 | 2.83 | 0.069 |
MMNet-1.40 | 55 | 3.38 | 2.72 | 0.369 |
MD16-1.00 | 53 | 3.68 | 2.88 | 2.142 |
MD8-0.35 | 44 | 3.72 | 3.07 | 0.454 |
MD16-0.75 | 38 | 3.77 | 2.96 | 1.327 |
MMNet-1.00 | 32 | 3.44 | 2.97 | 0.199 |
MMNet-1.00Q | 98 | 2.88 | 2.47 | 0.199 |
Figure 1 plots the trade-off between gradient error and latency on a mobile device. Note that MMNet forms a Pareto front in this space and outperforms other models. Latency comparisons on Pixel 1 and iPhone 8 are included in the supplementary material.
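The Pareto-front claim can be verified from the (latency, gradient error) pairs listed in Table 3. The sketch below keeps every row that no other row beats on both axes. The row data is copied from the table, including the quantized model; names repeat because the input-resolution labels are not available here, and the helper `pareto_front` is written for this illustration only.

```python
# (name, latency in ms, gradient error) rows copied from Table 3.
ROWS = [
    ("MD16-0.75", 146, 3.25), ("MMNet-1.00", 129, 2.93), ("MD8-0.75", 113, 3.53),
    ("MMNet-0.75", 90, 2.99), ("MD16-0.50", 82, 3.36), ("MD8-0.50", 66, 3.61),
    ("MMNet-0.50", 61, 3.17), ("MMNet-1.40", 55, 3.38), ("MD16-1.00", 53, 3.68),
    ("MD8-0.35", 44, 3.72), ("MD16-0.75", 38, 3.77), ("MMNet-1.00", 32, 3.44),
    ("MMNet-1.00Q", 98, 2.88),
]


def pareto_front(rows):
    """Rows not dominated by any other row (lower latency and lower error are better)."""
    front = []
    for name, latency, error in rows:
        dominated = any(
            l2 <= latency and e2 <= error and (l2 < latency or e2 < error)
            for _, l2, e2 in rows
        )
        if not dominated:
            front.append((name, latency, error))
    return sorted(front, key=lambda row: row[1])


front = pareto_front(ROWS)
for row in front:
    print(row)
```

Every surviving row is an MMNet variant, so for any Mobile DeepLabv3 configuration in the table there is an MMNet configuration that is both faster and more accurate in gradient error.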
Our proposed network owes its performance to several building blocks utilized in its model architecture. We analyze the impact of each design choice by performing ablation experiments.
We study the effect of different dilation rates in the encoder block. The proposed model contains multi-branch dilated convolutions in the encoder block. We analyze the impact of this decision by fixing the dilation rates to one.
Whenever there is a skip connection, we have included a refinement block to improve the decoding quality. The refinement block enhances the result of the encoder block by performing depthwise separable convolution followed by batch normalization and a ReLU6 non-linearity. We remove the refinement block and study its impact on the final result.
The enhancement blocks are intended to give the network a layer to improve the final result before its resolutions are fully recovered. We study the effect of the enhancement block by removing it entirely from the network.
Method | Gradient Error |
---|---|
No dilation | 3.25 |
No enhancement in decoding | 3.04 |
No refinement in skip connection | 3.07 |
Proposed model | 2.93 |
Table 4 illustrates the results when different components of the model architecture are modified. We see that all the components contribute to the final performance of the proposed model. When the dilation rate is fixed to one, the network has a hard time generalizing due to its limited effective receptive field. Enhancement and refinement in the decoding phase also boost the network performance.
We demonstrate the full pipeline for training a real-time portrait matting model targeting a mobile platform by incorporating quantization of our model. Quantization of the model parameters and activations reduces the bit-width required by the model. The reduction of bit-width allows one to exploit integer arithmetic to boost the network inference speed. The target model undergoes a quantization-aware training phase via fake quantization [17]. While full-precision weights are maintained, tensors are downcast to fewer bits during the forward pass. On the backward pass, the full-precision weights are updated instead of the downcast tensors from which the gradients are computed. Once the training is complete, quantized models are executed using the TensorFlow Lite framework [2]. Table 3 contains the result of the 8-bit quantized model. The model enjoys a 25% decrease in latency and a better gradient error. The details of quantization are included in the supplementary material.
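The forward-pass behavior of fake quantization can be sketched as a quantize-then-dequantize round trip on a single value. This is a generic illustration of the idea, not the TensorFlow implementation: the function name, the uniform affine grid and the choice of the ReLU6 range [0, 6] as an example are assumptions.

```python
def fake_quantize(x, x_min, x_max, num_bits=8):
    """Snap x to the nearest of 2**num_bits evenly spaced levels in [x_min, x_max].

    The result is still a float, so the rest of the network runs unchanged
    while seeing the rounding error that real quantization would introduce.
    """
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    clamped = min(max(x, x_min), x_max)
    step_index = round((clamped - x_min) / scale)
    return x_min + step_index * scale


# Example range matching a ReLU6 activation (an illustrative choice).
step = 6.0 / 255
samples = [0.0, 0.013, 1.2345, 3.0, 5.999, 7.5]
quantized = [fake_quantize(v, 0.0, 6.0) for v in samples]
print(quantized)
```

During training the rounding is treated as the identity in the backward pass, so the gradients computed from the rounded tensors are applied to the full-precision weights, as described above.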
The image matting task has mostly been approached using sampling-based [12, 13, 15, 27, 34] or propagation-based [8, 14, 19, 30] ideas. Recently, with the success of convolutional neural networks (CNN) in computer vision tasks, there has been a growing number of works utilizing CNNs. Cho et al. [10] proposed an end-to-end network which relies on the outputs of other matting algorithms, such as closed-form matting [19] and KNN matting [8], to produce the final alpha matte. Shen et al. [29] proposed an automatic image matting method leveraging a CNN to create a trimap which is fed to closed-form matting [19], backpropagating the matting error to the trimap network. Xu et al. [36] take the approach further by directly learning the alpha matte. Chen et al. [9] combine trimap generation and alpha matte generation using a fusion module. Many works on image matting focus mainly on achieving higher accuracy rather than real-time inference. Recently, however, researchers have been shifting the focus to networks that accommodate real-time inference [39]. Zhu et al. [39] studied real-time portrait matting on mobile devices, which is directly comparable to our result.

Since the work of Long et al. [20], fully convolutional networks (FCN) have been widely used in various segmentation tasks [37, 18]. Many semantic segmentation networks adopt an encoder-decoder structure [4]. The proposed model uses skip connections to concatenate the output of an encoder block to a decoder block, which is known to improve the results of semantic pixel-wise segmentation tasks [24]. Chen et al. [6] proposed the DeepLab [5, 7] architecture, which extensively uses the ASPP module. The ASPP module aims to solve the problem of efficient upsampling and of handling objects at multiple scales. Our model adopts a multi-branch structure from the Inception network [31], together with dilated convolutions of different dilation rates, which resembles the ASPP module.

One of the most prominent lightweight neural networks is MobileNet and its variants [16, 26]. Depthwise separable convolution was shown to be extremely effective in creating a lightweight network while keeping the accuracy drop to a tolerable level. ENet, an efficient neural network architecture designed for semantic segmentation, was proposed by Paszke et al. [21]. Our work is inspired by the design choices detailed in their work for creating an efficient neural network.

In this work, we have proposed an efficient model for performing the automatic portrait matting task on mobile devices. We were able to accelerate the model four times to achieve 30 FPS on a Xiaomi Mi 5 device with only a 15% increase in the gradient error. Comparison against Mobile DeepLabv3 showed that our model is not only faster when the performance is comparable, but also requires an order of magnitude fewer parameters. Through ablation studies, we have shown that our choice of the multi-branch dilated convolution with a linear bottleneck is essential in maintaining high performance. We also make our implementation available at https://github.com/hyperconnect/MMNet.

A natural extension of our work is to handle the general image matting problem, such as automatic saliency matting. Since we can already achieve real-time inference, it is also natural to extend the work further by tackling the video matting problem. Pushing for real-time inference on mobile devices requires a carefully prepared pipeline for it to work in a real-world setting. Distillation to guide the mobile-friendly model during training and even lower-bit quantization for added speedup are highly desirable.
Tensorflow: A system for large-scale machine learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 2016.
Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.
Conditional random fields as recurrent neural networks. In Proceedings of the International Conference on Computer Vision, 2015.

We used tensorflow.contrib.quantize to quantize our model.
A custom implementation of the resize_bilinear operation, optimized using SIMD instructions, was deployed.
Since we are using fake quantization [17] for quantization-aware training, an additional fake quantization node was inserted after each resize_bilinear operation. The quantized version of softmax provided by TensorFlow Lite is slow for our use case since it is optimized for a classification task. Our formulation allows us to make the assumption that the output has only two channels. Quantizing the values to 8 bits means that there are only 65,536 valid logit pairs. Instead of explicitly computing the softmax, we precompute the values and replace the calculation with a table lookup.
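The table-lookup idea can be sketched as follows. With two 8-bit logits there are 256 x 256 = 65,536 possible input pairs, so the foreground probability for each pair can be computed once and stored. The quantization parameters `scale` and `zero_point`, the (background, foreground) index order and the float-valued table are assumptions for illustration; the deployed kernel may store quantized outputs instead.

```python
import math


def dequantize(q, scale, zero_point):
    """Map an 8-bit code back to a real-valued logit."""
    return scale * (q - zero_point)


def build_softmax_table(scale, zero_point):
    """Foreground probability for every (background, foreground) pair of 8-bit logits."""
    table = []
    for q_bg in range(256):
        for q_fg in range(256):
            diff = dequantize(q_fg, scale, zero_point) - dequantize(q_bg, scale, zero_point)
            # Two-class softmax reduces to a sigmoid of the logit difference.
            table.append(1.0 / (1.0 + math.exp(-diff)))
    return table


def lookup_foreground(table, q_bg, q_fg):
    """Replace the per-pixel softmax with a single indexed read."""
    return table[q_bg * 256 + q_fg]


# Hypothetical quantization parameters.
SCALE, ZERO_POINT = 0.1, 128
TABLE = build_softmax_table(SCALE, ZERO_POINT)
print(len(TABLE), lookup_foreground(TABLE, 100, 200))
```

At inference time each pixel then costs one memory read instead of two exponentials and a division.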
Method | Pixel 1 | Mi 5 | iPhone 8 |
---|---|---|---|
MD16-0.75 | |||
MMNet-1.00 | |||
MD8-0.75 | |||
MMNet-0.75 | |||
MD16-0.50 | |||
MD8-0.50 | |||
MMNet-0.50 | |||
MMNet-1.40 | |||
MD16-1.00 | |||
MD8-0.35 | |||
MD16-0.75 | |||
MMNet-1.00 | |||
MD16-0.75 | - | - | |
MMNet-1.00Q | - |
Table 5 depicts the latency of different models measured on Pixel 1, Xiaomi Mi 5, and iPhone 8. All measurements are performed with the TensorFlow Lite [2] benchmark tool on a mobile device while restricting the models to use a single thread. The mean and the standard deviation obtained from 100 runs are included in the table. The measurements were spaced apart in time to give the device enough time to cool down. A demo video is available at https://github.com/hyperconnect/MMNet.

Output channels of 1x1 convolution:

Name | First | Encoder/Enhancement | Decoder | Refinement | Final |
---|---|---|---|---|---|
Initial Block | |||||
Encoder 1 | |||||
Encoder 2 | |||||
Encoder 3 | |||||
Encoder 4 | |||||
Encoder 5 | |||||
Encoder 6 | |||||
Encoder 7 | |||||
Encoder 8 | |||||
Encoder 9 | |||||
Encoder 10 | |||||
Decoder 1 | |||||
Decoder 2 | |||||
Enhancement 1 | |||||
Enhancement 2 | |||||
Decoder 3 | |||||
Final Block |
Table 6 lists the number of channels used in each component of MMNet. The initial block outputs a 32-channel feature map, as described in the first row. The numbers in the encoder/enhancement columns represent the number of channels returned by the multi-branch convolutions and the final output of the encoder/enhancement block after the concatenation. For example, encoder #6 will receive a channel input which the convolutions in multiple branches each expand to channels. After the multi-branch stage, the outputs are concatenated and convolved by a convolution which compresses the number of channels back to . Whenever there is a skip connection, the output of a decoder block is concatenated with the output of a refinement block. Their respective numbers of channels are given in the decoder rows. The final block returns a two-channel output, one channel each for foreground and background.