## I Introduction

For a color image $I$ with foreground pixels $F$ and background pixels $B$, the alpha matting problem asks to determine opacities $\alpha$ such that the equality

$$I_i = \alpha_i F_i + (1 - \alpha_i) B_i \tag{1}$$

holds for every pixel $i$. Equation 1 is called the compositing equation. Alpha matting can be seen as an attempt to undo the compositing equation. In this work we want to focus on the problem of estimating the foreground pixels $F$ given the image $I$ and the matte $\alpha$. A naive method to compose an image on a new background $B^{\text{new}}$ is to use $I$ in place of $F$, obtaining a new image $I^{\text{new}} = \alpha I + (1 - \alpha) B^{\text{new}}$, but this is only sufficient if $\alpha$ is close to binary, i.e. almost 0 or 1 everywhere. This naive approach results in background colors that bleed through partially transparent regions, as visualized in Figure 1 (c).
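The bleed-through of the naive approach can be illustrated with a single half-transparent pixel. The following is a minimal sketch; the function name `composite` and the example colors are chosen for illustration only.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend foreground over background: I = alpha * F + (1 - alpha) * B."""
    a = alpha[..., np.newaxis]  # broadcast the (h, w) matte over color channels
    return a * foreground + (1.0 - a) * background

# One half-transparent pixel: red foreground photographed on a green background.
F = np.array([[[1.0, 0.0, 0.0]]])
B_old = np.array([[[0.0, 1.0, 0.0]]])
B_new = np.array([[[0.0, 0.0, 1.0]]])
alpha = np.array([[0.5]])

I = composite(F, B_old, alpha)        # observed image, contains some green
naive = composite(I, B_new, alpha)    # naive: use I in place of F
correct = composite(F, B_new, alpha)  # uses the true foreground color

print(naive[0, 0])    # green channel is non-zero: the old background bleeds through
print(correct[0, 0])  # green channel is zero
```

The naive result keeps a quarter of the old green background, which is exactly the artifact that a proper foreground estimate removes.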

## II Related Work

The problem of estimating the alpha matte is well-studied in the literature. Recently, neural network based methods were introduced to estimate the alpha values. Cho et al. [cho2016natural] compute alpha in an end-to-end fashion based on outputs of other methods using a convolutional neural network. Xu et al. [xu2017deep] train an encoder-decoder network to predict alpha and a refinement network to improve the prediction. Lutz et al. [lutz2018alphagan] employ a generative adversarial network. Cai et al. [cai2019disentangled] stack a recurrent neural network onto an autoencoder network to first estimate an optimal ternary segmentation followed by the alpha matte in a multi-task learning setting.

However, these methods were not devised to estimate the foreground colors. In the following, we give a brief overview of methods that are capable of foreground estimation. Hou et al. [hou2019context] train a network for local features and a network for global context information simultaneously. Tang et al. [tang2019learning] train a chain of three neural networks to successively estimate background, foreground and alpha. Levin et al. [levin2007closed], Chen et al. [chen2013knn] and Aksoy et al. [ifm] globally minimize a quadratic energy function based on, respectively, a smoothness prior applied to foreground and background, k-nearest neighbors in color and pixel coordinate space, and a combination thereof. All of those methods either have high memory requirements, high computation times, or both, which motivates the development of a faster method.

## III Method

### III-A Notation

We use $I_i^c$ to denote the intensity of color channel $c$ of image $I$ at index $i$. In addition, we make use of the notation $I_{ix}^c$ by Levin et al. [levin2007closed] to denote the gradient of the image in the x-direction. We use a similar notation for the foreground image $F$ and the background image $B$, respectively.

### III-B Closed-Form Foreground Estimation

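In order to estimate both foreground and background images, Levin et al. [levin2007closed] propose to minimize a cost function for each pixel $i$ and color channel $c$ consisting of three terms, one to constrain the resulting color from the compositing equation (Equation 1) and two to reduce the magnitude of the color gradients of $F$ and $B$ in regions of large $\alpha$-gradients $\alpha_{ix}$ and $\alpha_{iy}$, thereby preserving texture information

$$\text{cost}(F_i^c, B_i^c) = \left(\alpha_i F_i^c + (1 - \alpha_i) B_i^c - I_i^c\right)^2 + |\alpha_{ix}| \left((F_{ix}^c)^2 + (B_{ix}^c)^2\right) + |\alpha_{iy}| \left((F_{iy}^c)^2 + (B_{iy}^c)^2\right). \tag{2}$$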

We find that this closed-form color estimation method can be accelerated greatly with appropriate preconditioning, for example by employing a thresholded incomplete Cholesky decomposition in conjunction with conjugate gradient descent, but solving the resulting -by- linear system still takes in the order of 30 seconds per color channel to converge below a residual error of for an megapixel image on current consumer hardware. This is unsatisfactory for interactive image editing. Our goal for practical applications is a method which runs in a few seconds on multi-megapixel images on common hardware.

### III-C Multi-Level Foreground Estimation

A simplified approach might try to approximate the closed-form cost function by only solving it for a small local region instead of finding a global solution. Unfortunately, this does not work because a local solution barely propagates foreground and background colors into the region with non-binary alpha values, even with many iterations. However, a multi-level approach can alleviate this shortcoming, leading to an efficient method to approximate foreground and background colors.

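To this end, we start with the cost function by [levin2007closed], which we modify for a local image region $N_i$ centered at the pixel $i$ for a fixed color channel $c$. The color gradients are expressed as a sum over the neighboring pixels $j \in N_i$. Furthermore, by adding a regularization factor $\epsilon_r$, we make the problem well-defined in regions with constant alpha values. Otherwise, foreground colors $F$ and background colors $B$ would be unconstrained in regions where the translucency $\alpha$ is 0 or 1 respectively. In addition, we introduce the constant $\omega$ to control the influence of the alpha gradient

$$\text{cost}_{\text{local}}(F_i^c, B_i^c) = \left(\alpha_i F_i^c + (1 - \alpha_i) B_i^c - I_i^c\right)^2 + \sum_{j \in N_i} \left(\epsilon_r + \omega\,|\alpha_i - \alpha_j|\right) \left[(F_i^c - F_j^c)^2 + (B_i^c - B_j^c)^2\right]. \tag{3}$$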

This cost function can be expressed in matrix form as

(4) |

where

is a vector of the foreground and background colors,

is a vector describing how to weight the colors and is a vector of the neighboring foreground and background colors(5) |

Furthermore, is a -by- matrix to broadcast the local foreground and background colors to the size of vector

(6) |

and is a -by- block matrix

(7) |

The top-left and bottom-right blocks of with entries of vector encode the -regularization and -gradient constraints.

The derivative of the cost function with respect to is then

(8) |

Setting the derivative to zero and solving for yields the solution vector

(9) |

The matrix is independent of , which means that it only has to be computed once per pixel and can be reused for each color channel.
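The per-pixel solve can be sketched in a few lines of NumPy. The sketch assumes that the local cost is the squared compositing residual plus, for every neighbor $j$, the squared differences to the neighbor's foreground and background colors weighted by $\epsilon_r + \omega |\alpha_i - \alpha_j|$. Setting the derivative to zero then gives a 2-by-2 linear system whose matrix depends only on alpha values. The function name `solve_local` and the default values of `eps_r` and `omega` are placeholders, not the settings used in the experiments.

```python
import numpy as np

def solve_local(alpha_i, image_i, alpha_nb, F_nb, B_nb, eps_r=1e-5, omega=1.0):
    """Minimize the assumed local cost for one pixel and all color channels.

    alpha_i:  scalar alpha of the center pixel
    image_i:  (3,) color of the center pixel
    alpha_nb: (k,) alpha values of the k neighbors
    F_nb, B_nb: (k, 3) current foreground/background colors of the neighbors
    """
    # Neighbor weights: regularization plus weighted alpha difference.
    w = eps_r + omega * np.abs(alpha_i - alpha_nb)
    a = np.array([alpha_i, 1.0 - alpha_i])
    # 2x2 system matrix; it contains no colors, so all channels share it.
    A = np.outer(a, a) + w.sum() * np.eye(2)
    # Right-hand side with one column per color channel.
    rhs = np.outer(a, image_i) + np.stack([w @ F_nb, w @ B_nb])
    F_i, B_i = np.linalg.solve(A, rhs)
    return F_i, B_i

# Example: a half-transparent pixel whose neighbors already hold the true colors.
red, green = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
F_i, B_i = solve_local(0.5, 0.5 * red + 0.5 * green, np.full(4, 0.5),
                       np.tile(red, (4, 1)), np.tile(green, (4, 1)))
```

Because the system matrix is built from alpha values alone, it is formed once per pixel and the three color channels are solved together as columns of the right-hand side.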

To solve the problem of slow propagation, we employ a multi-level approach. We begin by solving for the foreground image at a low resolution where the slow spatial propagation problem does not exist. It is sufficient to minimize the local cost function iteratively. Next, we solve the problem at a slightly larger scale by using the solution from the smaller scale as initialization. We repeat this process until the original size of the input image is reached.

### III-D Implementation

The input to the multi-level color estimation procedure listed in Algorithm 1 is an RGB image and an alpha matte of resolution pixels.

At the smallest-level, foreground and background images and are initialized to a resolution of pixels. For orientation, see the top of the right pyramid in Figure 2. The values at this point are not important since they will be updated later and converge quickly.

Next, a loop over the image levels is started, and the input image and input alpha matte, as well as the foreground and background images of the previous level, are resized to the current working size. The number of levels is chosen such that the image width and height grow at most by a factor of two between levels to ensure spatial propagation.
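One possible way to pick the level sizes is a geometric schedule, sketched below. The starting size of 1-by-1 pixels, the use of `ceil` for rounding and the function name `level_sizes` are assumptions made for this illustration.

```python
import math

def level_sizes(width, height):
    """Return a list of (w, h) sizes growing geometrically to the full size.

    With n_levels >= log2(max(width, height)), the per-level growth factor
    is at most two in both dimensions.
    """
    n_levels = max(1, math.ceil(math.log2(max(width, height))))
    sizes = []
    for level in range(n_levels + 1):
        w = math.ceil(width ** (level / n_levels))
        h = math.ceil(height ** (level / n_levels))
        sizes.append((w, h))
    return sizes

sizes = level_sizes(640, 480)
print(sizes[0], sizes[-1], len(sizes))
```

Rounding up keeps the growth bound intact, since the next size is never larger than twice the rounded-up previous size.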

At each level and for each iteration, a linear system is constructed for each pixel coordinate from its neighbors (Figure 2, purple, and Algorithm 1, lines 10-24). Coordinates of neighbors which would exceed image bounds are clamped to the valid image region.

The linear system is then solved and applied simultaneously to update all color channels of the current pixel’s foreground and background colors.

It is a good idea to run more iterations at lower resolutions because they are computationally cheaper and the colors are propagated further when the image is resized to a higher resolution. In practice, 2 iterations for medium to high resolutions and 10 iterations for low resolutions are usually sufficient to achieve visually pleasing results, where low is chosen as pixels.
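The whole procedure can be sketched in vectorized NumPy as follows. This is an illustration under several assumptions rather than a reproduction of Algorithm 1: it uses nearest-neighbor resizing, a 4-neighborhood, a 1-by-1 starting level, updates that read only the previous iterate, and placeholder values for `eps_r`, `omega` and `small_size`. The image is assumed to have values in [0, 1].

```python
import math
import numpy as np

def resize_nearest(img, w, h):
    """Nearest-neighbor resize of an (H, W[, C]) array to (h, w[, C])."""
    H, W = img.shape[:2]
    ys = (np.arange(h) * H) // h
    xs = (np.arange(w) * W) // w
    return img[ys[:, None], xs[None, :]]

def shift(arr, dy, dx):
    """Values of the neighbor at offset (dy, dx), clamped to the image bounds."""
    h, w = arr.shape[:2]
    ys = np.clip(np.arange(h) + dy, 0, h - 1)
    xs = np.clip(np.arange(w) + dx, 0, w - 1)
    return arr[ys[:, None], xs[None, :]]

def estimate_foreground_ml(image, alpha, eps_r=1e-5, omega=1.0,
                           n_small=10, n_big=2, small_size=32):
    H, W = alpha.shape
    n_levels = max(1, math.ceil(math.log2(max(W, H))))
    F = np.zeros((1, 1, 3))  # initial values are unimportant
    B = np.zeros((1, 1, 3))
    for level in range(n_levels + 1):
        w = math.ceil(W ** (level / n_levels))
        h = math.ceil(H ** (level / n_levels))
        I = resize_nearest(image, w, h)
        a = resize_nearest(alpha, w, h)
        F = resize_nearest(F, w, h)  # previous level as initialization
        B = resize_nearest(B, w, h)
        # More iterations on small levels, where they are cheap.
        n_iter = n_small if max(w, h) <= small_size else n_big
        for _ in range(n_iter):
            # Entries of the per-pixel 2x2 matrix and its right-hand side.
            a00, a01, a11 = a * a, a * (1 - a), (1 - a) ** 2
            b0 = a[..., None] * I
            b1 = (1 - a)[..., None] * I
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                wgt = eps_r + omega * np.abs(a - shift(a, dy, dx))
                a00 = a00 + wgt
                a11 = a11 + wgt
                b0 = b0 + wgt[..., None] * shift(F, dy, dx)
                b1 = b1 + wgt[..., None] * shift(B, dy, dx)
            # Closed-form inverse of the symmetric 2x2 matrix.
            det = (a00 * a11 - a01 * a01)[..., None]
            F_new = (a11[..., None] * b0 - a01[..., None] * b1) / det
            B_new = (a00[..., None] * b1 - a01[..., None] * b0) / det
            F, B = np.clip(F_new, 0, 1), np.clip(B_new, 0, 1)
    return F, B
```

Updating all pixels from the previous iterate keeps the sketch short and vectorized; an implementation that loops over pixels and updates them in place may converge differently.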

## IV Experiments

### IV-A Other Methods

We compare the quality of the estimated foreground of our method with that of the authors' implementations of closed-form foreground estimation [levin2007closed] and KNN foreground estimation [chen2013knn], as well as with the ground truth, on the dataset [rhemann2009perceptually].

Other methods exist which estimate both alpha and foreground given an input image and a trimap, for example using neural networks [hou2019context, tang2019learning]. However, those methods are not easily comparable since on the one hand, an alpha matte is a more precise input than a trimap, making this a harder problem, but on the other hand, those methods could also trade off error in the alpha matte against error in the foreground estimation. Therefore, we do not include such methods in our evaluation.

Nevertheless, comparing against a neural network-based foreground estimation method is conceptually interesting, which is why we modify the IndexNet [lu2019indices] alpha matting network to predict a foreground estimate instead. We retrained the network on the ground truth dataset by [xu2017deep] using a compositional loss on the unknown image region

(10) |

We otherwise adopt the same training procedure as described by [lu2019indices].

To compare the runtime of our multi-level approach with the other methods, we implement KNN foreground estimation and closed-form foreground estimation using Python with vectorized NumPy and SciPy routines.

### IV-B Dataset

We use the high resolution input images, alpha mattes and foreground images of the dataset by Rhemann et al. [rhemann2009perceptually] to evaluate the quality of the foreground estimation methods. A selected sample of images is displayed in Figure 5.

It should be noted that, due to noise in the data, the ground truth foreground color does not exactly match the color of the corresponding input image, even in regions where the alpha matte is equal to one. Thus, having exactly zero error is not possible for the given data.

The alpha matting dataset by [rhemann2009perceptually] contains ground truth images and foreground images in linear RGB color space without white point, as well as images in sRGB color space with white point adjustment. Although it would be physically more accurate to apply methods in linear RGB space, it is more common to use the sRGB color space instead. Therefore, we transform the linear RGB ground truth foreground images without white point to sRGB space with white point correction.

The white point parameters are unknown. For this reason, we employ an optimization approach to obtain a matrix to transform from into , denoted by -by-3 matrices and of stacked linear RGB color row vectors respectively. To get from , we apply the inverse gamma correction function

(11) |

element-wise to each entry and color value. We minimize the error function

(12) |

arriving at the 3-by-3 white point correction matrix

(13) |

We can thus obtain by multiplying with each 3-by-1 color vector of and finally applying the gamma correction function

(14) |

to transform from to , corresponding to the ground truth . We compare the output of the various methods against . Likewise, we transform to , which corresponds to the input image . It should be noted that the image can not be used in place of because, although the images are often similar, notable differences exist. For example, the background in image 12 of the dataset is different. The gamma parameter was chosen as to minimize over all images.
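The color space conversion described above can be sketched as a linear least-squares fit followed by a gamma curve. The sketch makes several assumptions: a plain power-law gamma instead of the piecewise sRGB curve, a squared-error objective, a row-vector convention (`A @ M`), and hypothetical function names. The gamma value of 2.2 in the example is a placeholder, not the value chosen in the experiments.

```python
import numpy as np

def fit_color_matrix(A, B):
    """Least-squares 3x3 matrix M such that A @ M approximates B.

    A, B: (n, 3) arrays of stacked linear RGB row vectors.
    """
    M, *_ = np.linalg.lstsq(A, B, rcond=None)
    return M

def gamma_encode(x, gamma):
    """Power-law gamma correction (linear -> display values)."""
    return np.clip(x, 0.0, 1.0) ** (1.0 / gamma)

def gamma_decode(x, gamma):
    """Inverse power-law gamma correction (display -> linear values)."""
    return np.clip(x, 0.0, 1.0) ** gamma

# Synthetic check: recover a known color matrix from corresponding colors.
rng = np.random.default_rng(0)
A = rng.random((50, 3))
M_true = np.array([[1.1, 0.0, 0.1],
                   [0.05, 0.9, 0.0],
                   [0.0, 0.1, 1.2]])
M = fit_color_matrix(A, A @ M_true)

# Apply the fitted matrix, then gamma-encode (placeholder gamma).
encoded = gamma_encode(A @ M, 2.2)
```

With real data, `A` and `B` would hold corresponding pixels of the two linear RGB images, and the gamma parameter would be tuned jointly with the fit.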

We deliberately chose not to evaluate the methods on the Composition 1k dataset because the characteristic artifacts suggest that the foreground images were obtained using a variation of closed-form color estimation, which would give an unfair advantage to methods which are conceptually similar, since they are likely to exhibit the same artifacts. On the other hand, the foreground images in the dataset [rhemann2009perceptually] were computed from multiple photos with varying backgrounds, which should not bias the evaluation towards specific methods.

### IV-C Error Measures

We report the sum of absolute differences (SAD) between the ground truth foreground and the estimated foreground weighted by the ground truth alpha matte over the translucent region

(15) |

We also report the mean squared error (MSE) which we weight similarly:

(16) |
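A minimal sketch of alpha-weighted SAD and MSE is given below. It assumes that the translucent region is the set of pixels with alpha strictly between 0 and 1 and omits any constant scale factors; the function name `weighted_errors` is hypothetical.

```python
import numpy as np

def weighted_errors(F_est, F_true, alpha):
    """Alpha-weighted SAD and MSE over the translucent region.

    F_est, F_true: (h, w, 3) foreground images, alpha: (h, w) matte.
    """
    mask = (alpha > 0) & (alpha < 1)            # assumed translucent region
    diff = alpha[..., None] * (F_est - F_true)  # weight color error by alpha
    diff = diff[mask]                           # (n, 3) translucent pixels only
    sad = np.abs(diff).sum()
    mse = (diff ** 2).mean() if diff.size else 0.0
    return sad, mse

# Two translucent pixels with alpha 0.5 and a color error of 1 per channel.
alpha = np.array([[0.0, 0.5], [1.0, 0.5]])
sad, mse = weighted_errors(np.ones((2, 2, 3)), np.zeros((2, 2, 3)), alpha)
print(sad, mse)  # 3.0 0.25
```

Weighting by alpha reduces the influence of pixels where the foreground is barely visible and its true color is therefore poorly constrained.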

Furthermore, Rhemann et al. [rhemann2009perceptually] show that the gradient error of the alpha matte correlates with the perceptual quality. We adapt it to compute the error of the gradient of the foreground (GRAD) as

(17) |

where the gradient image is calculated by first-order Gaussian derivative filters with standard deviation $\sigma$.

### IV-D Qualitative Results

We are mainly interested in the quality of the estimated foreground images where the ground truth alpha matte is not available. For this reason, we compute several alpha mattes using the respective authors' implementations of KNN matting [chen2013knn], IndexNet matting [lu2019indices] and information-flow matting [ifm] (Figure 7, 8, 9, second row). The alpha values estimated by KNN alpha matting are often close to zero or one. Information-flow matting produces smoother values. IndexNet matting appears to struggle with the mesh structure, which consists of quickly alternating bright and dark colors due to reflection.

[Figure panels, left to right: Input, CF, IndexNet, KNN, ML, GT.]

First, we discuss the raw estimated foreground colors without composing them onto a background (Figure 6). It can be seen that both closed-form foreground estimation and the retrained IndexNet do not propagate the foreground color far into the background region. While this is not an issue when compositing the foreground with the ground truth alpha matte, it could be an issue for incorrectly estimated alpha mattes. KNN foreground estimation propagates the colors further, resulting in a background which is slightly tinted with the foreground colors. Lastly, our multi-level method strongly propagates colors, producing a foreground estimate which is suitable even for inaccurate alpha estimates.

To evaluate the qualitative results, we compose the estimated foreground images onto a white background to make it easier to see if any traces of the background color are showing through and have not been removed satisfactorily.

In the case where the KNN alpha matte is used as input, the estimated foreground colors are usually too dark due to the almost-binary nature of the alpha matte. This can be observed across all tested foreground estimation methods (Figure 7).

For the IndexNet alpha matte, the green and blue background color still shines through due to artifacts in the alpha matte. This effect is greatly diminished for our method due to the strong propagation of foreground colors into background regions (Figure 8, last row).

Information-flow alpha matting slightly overestimates the alpha matte for the wire mesh image (Figure 9, third column), resulting in a green-colored mesh for closed-form foreground estimation as well as dark blotches for the other methods. Otherwise, all methods produce acceptable results.

[Figures 7, 8 and 9: each figure shows, from top to bottom, the input image, the input alpha matte, and the results of CF, IndexNet, KNN and ML (ours).]

Table I: SAD, MSE and GRAD errors of the estimated foreground, averaged over the dataset. Each group of four rows corresponds to one of the four input alpha mattes.

| Alpha | Foreground | SAD | MSE | GRAD |
|---|---|---|---|---|
| | Multi-Level (Ours) | 20.9 | 1.44 | 8.89 |
| | Closed-Form (Levin) | 21.1 | 1.34 | 8.13 |
| | IndexNet (Lu) | 28.8 | 2.33 | 11.1 |
| | KNN (Chen) | 32.0 | 3.25 | 16.1 |
| | | | | |
| | Multi-Level (Ours) | 31.8 | 2.5 | 11.5 |
| | Closed-Form (Levin) | 36.6 | 3.51 | 14.2 |
| | IndexNet (Lu) | 38.3 | 3.9 | 14.5 |
| | KNN (Chen) | 34.6 | 3.22 | 13.0 |
| | | | | |
| | Multi-Level (Ours) | 47.9 | 5.66 | 15.8 |
| | Closed-Form (Levin) | 59.0 | 8.03 | 21.5 |
| | IndexNet (Lu) | 62.6 | 8.65 | 21.4 |
| | KNN (Chen) | 37.1 | 3.81 | 16.9 |
| | | | | |
| | Multi-Level (Ours) | 31.6 | 2.44 | 11.4 |
| | Closed-Form (Levin) | 37.7 | 3.98 | 15.3 |
| | IndexNet (Lu) | 36.4 | 3.93 | 15.7 |
| | KNN (Chen) | 33.7 | 2.97 | 13.6 |

### IV-E Quantitative Results

Figure 3 and Figure 4 visualize the sum of absolute differences (Equation 15) of the estimated foreground for each method applied to the ground truth alpha matte and KNN alpha matte respectively. The plots show that our method produces small errors not only when being applied to the ground truth, but also in the more realistic case when the alpha matte needs to be estimated.

Table I shows the SAD, MSE and gradient error measures averaged over the dataset by Rhemann et al. [rhemann2009perceptually]. Our multi-level method performs best with respect to SAD and gradient error for three of the four input alpha mattes. We point out that SAD is more perceptually relevant compared to MSE for image similarity [sinha2011perceptually] and image restoration [zhao2016loss]. The gradient error has been shown to be superior to both measures in the case of alpha matting [rhemann2009perceptually].

### IV-F Influence of Regularization

We evaluate the influence of regularization on the error of the estimated foreground color (Figure 10) and make two key observations.

Firstly, the alpha gradient term by itself does not contribute much to the overall mean squared error, since the difference between weighting it with either or is small.

Secondly, a pronounced minimum exists with respect to the regularization factor when the alpha gradient term has a small but non-zero contribution.

Based on those observations, we choose a regularization factor of and weight the alpha gradient term by for all experiments.

### IV-G Runtime and Memory Usage

Table II: Computational runtime of the different methods on both hardware setups.

| Setup | Method | Time [s] | Std. dev. [s] |
|---|---|---|---|
| HPC | Multi-Level (Ours) | 2.04 | 0.296 |
| | Closed-Form [levin2007closed] | 26.3 | 5.48 |
| | IndexNet [lu2019indices] | 74.5 | 10.1 |
| | KNN [chen2013knn] | 38.2 | 6.47 |
| MacBook | Multi-Level (Ours) | 1.48 | |
| | Closed-Form [levin2007closed] | | |
| | IndexNet [lu2019indices] | – | – |
| | KNN [chen2013knn] | | |

We measure the runtime on two different hardware setups. The first is a high-performance computer with an Intel Xeon Gold 6134 CPU (3.20 GHz) and 196 GB memory. We also run the experiments on a MacBook Pro 2019 with an Intel Core i5 (1.40 GHz) and 8 GB memory to compare to a setup that is more realistic for everyday image processing. To ensure comparability between the different methods, we perform all computations on the CPU on a single thread. Table II compares the computational runtime of the different methods. Our method runs faster than the next best method by over an order of magnitude on both setups.

In addition, we compare the running time over different image sizes (Figure 11). We can observe that, while all three methods scale roughly linearly with the image size, ours has a significantly lower constant factor.

Table III: Memory usage of the different methods.

| Method | Memory [MB] | Data Type |
|---|---|---|
| Multi-Level (Ours) | 1 182 | 64-bit float |
| Closed-Form [levin2007closed] | | 64-bit float |
| IndexNet [lu2019indices] | | 32-bit float |
| KNN [chen2013knn] | | 64-bit float |

Table III shows the memory usage of the different methods. The IndexNet model requires the most memory by far, even though its underlying data type is only half as large as that of the other methods. Therefore, this method cannot be evaluated on high resolution images on the second setup. The closed-form and KNN approaches require significantly less memory, but still several gigabytes. Finally, our multi-level approach is even more frugal in memory usage, requiring less than a sixth of the memory of the next best method.

## V Conclusion

Our proposed multi-level approach clearly outperforms all existing approaches in terms of computational runtime and memory requirements on different hardware setups while being competitive with more computationally expensive methods with respect to the quality of the estimated foreground. Additionally, our method is robust to inaccurate alpha matte estimates. This is a useful property because, for many applications, ground truth alpha mattes are not available. In this case, our method often outperforms other methods with respect to various error measures.

We have shown that our approach scales excellently with the input image size, which allows estimating the foreground of megapixel images on consumer hardware in reasonable time.

Implementations of multi-level foreground estimation for both CPU and GPU are available in the open source PyMatting library [germer2020pymatting].