Automatic Foreground Extraction using Multi-Agent Consensus Equilibrium

08/24/2018 ∙ by Xiran Wang, et al. ∙ HypeVR / Purdue University

While foreground extraction is fundamental to virtual reality systems and has been studied for decades, the majority of professional software tools today still rely substantially on human intervention, e.g., providing trimaps or labeling key frames. This is not only time-consuming, but is also sensitive to human error. In this paper, we present a fully automatic foreground extraction algorithm that does not require any trimap or scribble. Our solution is based on a newly developed concept called Multi-Agent Consensus Equilibrium (MACE), a framework that allows us to integrate multiple sources of expertise to produce an overall superior result. The MACE framework consists of three agents: (1) a new dual-layer closed-form matting agent to estimate the foreground mask using the color image and a background image; (2) a background probability estimator using color difference and object segmentation; (3) a total variation minimization agent to control the smoothness of the foreground masks. We show how these agents are constructed and how their interactions lead to better performance. We evaluate the proposed algorithm by comparing it to several state-of-the-art methods. On the real datasets we tested, our results show less error than the other methods.







I Introduction

I-a Motivation and Scope

The proliferation of virtual reality displays and rendering technology over the past decade is rapidly changing the landscape of cinematography [1, 2, 3, 4]. From traditional motion pictures to recent 3D animation, it is safe to argue that the next wave in the film-making industry is immersive experience, e.g., head-mounted virtual reality displays. In order to offer sufficient content to these displays, data has to be acquired in special ways, e.g., using 360-degree volumetric imagers [5]. Typically, such videos are high-definition, full frame rate at 60 fps, and are captured using as many as 100 cameras simultaneously. This is an enormous amount of data: a five-minute video sequence in this configuration already amounts to more than one million images. Efficient processing of these images is therefore critical.

[Panels: Alpha Matting | Ours | Background Subtraction]
Fig. 1: Differences between alpha matting, background subtraction, and the subject of this paper. For each subfigure, [Top Left] represents the input raw color image, [Bottom Left] is the side information required, [Right] is the output. For alpha matting, we need a human-labeled trimap for each frame of the video. For background subtraction, we need a stack of adjacent frames. For our method, we need a static plate image.

Method       | Alpha Matting          | Ours                   | Background Subtraction
Goal         | foreground estimation  | foreground estimation  | moving object detection
Given        | input + trimap         | input + plate          | input sequence
Accuracy     | high                   | high                   | low (binary)
Background   | no                     | static                 | dynamic
Motion bias  | no                     | no                     | yes
TABLE I: Comparison of three different segmentation problems

The subject of this paper is the alpha matting / background subtraction problem arising from the virtual reality content creation process. The goal of alpha matting / background subtraction is to segment the foreground object from an image so that the virtual background can be substituted. However, compared to the classical alpha matting problem [6, 7, 8, 9, 10] and the classical background subtraction problem [11, 12, 13], our work lies in the middle ground. Classical alpha matting requires humans to provide trimaps, hence it is not fully automatic. Background subtraction is fully automatic, but it produces significantly lower quality results than alpha matting. Our proposed solution achieves performance comparable to alpha matting while maintaining the autonomy of background subtraction. Figure 1 and Table I outline the key differences between this paper and the previous work in the literature. The major assumption we make is that a static background image, called the plate image, is available. This plate image can be easily obtained in a filming setting by using the first few frames before an object enters the scene.

(a) Input (b) Plate (c) Frame diff. (d) Trimap (e) DCNN[10] (f) Ours
Fig. 2: Three common issues of automatic foreground extraction. Case I: vibrating background; notice the small vibration of the leaves in the background. Case II: similar foreground / background color; notice the missing parts of the body of the man, and the excessively large uncertainty region of the trimap. Case III: auto-exposure; notice the false alarms in the background of the frame difference map. We compare our method with DCNN [10], a semi-supervised alpha matting method using the generated trimaps. The video data of Case III is from [14].

I-B Challenges of the Plate Images

Readers at this point may argue that the plate assumption is strong: if the plate is available, it seems that a standard frame difference with morphological operations (e.g., erosion / dilation) would be enough to provide a trimap, and thus a sufficiently powerful alpha matting algorithm would work. However, except for synthetic videos, plate images are never perfect. Below are three typical problems:


  • Background vibration. While we assume that the plate does not contain large moving objects, small vibration of the background generally exists. Figure 2 Case I shows an example where the background tree vibrates.

  • Color similarity. When foreground color is very similar to the background color, the trimap generated will have false alarms and misses. Figure 2 Case II shows an example where the cloth of the man has a similar color to the wall.

  • Auto-exposure. If auto-exposure is used, the background intensity will change over time. Figure 2 Case III shows an example where the background cabinet becomes dimmer when the man leaves the room.

As shown in the examples, errors in the frame difference easily translate into false alarms and misses in the trimap. While we can enlarge the uncertainty region of the trimap to rely more on the color constancy model of the alpha matting, alpha matting generally performs worse as the uncertainty region grows. We have also tested more advanced background estimation algorithms, e.g., [15] in OpenCV, but the results are similar or sometimes even worse.

(a) Input (b) Frame difference (c) Trimap (d) Closed-form matting [16] (e) Spectral matting [6] (f) Learning-based matting [7] (g) K-nearest neighbors [8] (h) Comprehensive sampling [9] (i) DCNN [10] (j) Ours (without trimap)
Fig. 3: Comparison with existing alpha-matting algorithms on real images with a frame-difference based trimap. (a) Input image. (b) Frame difference. (c) Trimap generated by morphological operations (dilation / erosion) on the binary mask. (d)-(i) Alpha matting algorithms. (j) Proposed method. This sequence is from the dataset of [14].

I-C Related Work

We briefly describe the available methods in the literature.


  • Alpha Matting. The classical alpha matting formulates the problem as minimizing an energy function associated with the foreground and background color, for example, Poisson matting [17], closed-form matting [16], shared matting [18], Bayesian matting [19], and robust matting [20]. More recently, a number of deep neural network based approaches have been proposed, e.g., [21] and [10]. One big drawback of the classical alpha matting algorithms is that they require error-free trimaps. Figure 3 illustrates the performance of various alpha matting algorithms applied to a multi-object-moving scene. In this example, we use frame difference to construct a trimap and then apply various alpha matting algorithms to estimate the alpha matte. Although the performance is reasonable, we observe that small errors in the trimap cause major issues in the resulting alpha matte.

  • Video Matting. Video matting algorithms extend alpha matting by considering the temporal dimension [22, 23, 24]. The temporal consistency is handled by introducing constraints or regularization functions. Trimaps are still needed for these algorithms, which could become intractable when the video is long. Alternative methods such as [25, 26, 27] identify key frames of the video and propagate the user labeled alpha mattes to adjacent frames. However, the propagation is often error-prone especially in the case of occlusion.

  • Trimap Generation. In the absence of trimaps, there are multi-stage methods to first create the trimap and then perform alpha matting [28]. However, these methods still require initial segmentations using, e.g., GrabCut [29]. Other methods [30, 31] require additional sensor data, e.g., depth measurements, which are not always available.

  • Background Subtraction. Background subtraction methods range from the simple frame difference method to the more sophisticated mixture models [11]. These methods also include classical video segmentation methods, e.g., using saliency [12, 13]. Most background subtraction methods are used to track objects rather than to extract alpha mattes. They are fully automatic and run in real time, but the foreground masks they generate are usually of low quality.

I-D Contributions

This paper contributes to the alpha matting and background subtraction literature in two ways.


  • Automatic Alpha Matting. We provide a fully automatic solution to the alpha matting problem assuming that a plate image is available. To the best of our knowledge, this is the first of its kind in the literature.

  • Multi-Agent Consensus Equilibrium. Our solution leverages a newly developed framework called Multi-Agent Consensus Equilibrium (MACE) [32], which is a generalization of the Plug-and-Play ADMM [33, 34, 35, 36, 32]. We present three customized agents for the alpha matting problem. We demonstrate, for the first time, how MACE can be used to tackle non-traditional image restoration problems.

In order to explain all the essential concepts, we organize the paper so that the general framework and the specific components are addressed in two different sections. In Section II we describe the MACE framework: its derivation, the intuition, and the algorithm. Then in Section III we go into the details of each component of the MACE framework that is specific to foreground extraction. Experimental results are presented in Section IV.

II Multi-Agent Consensus Equilibrium

Our proposed method is based on the Multi-Agent Consensus Equilibrium (MACE), recently developed by Buzzard et al. [32]. In this section, we describe the key components of MACE and briefly discuss why it works.

II-A ADMM

The starting point of MACE is the alternating direction method of multipliers (ADMM) algorithm [37]. The ADMM algorithm aims at solving a constrained minimization:

$(\widehat{x}, \widehat{v}) = \operatorname*{argmin}_{x,v} \; f(x) + g(v), \quad \text{subject to } x = v,$ (1)

where $f$ and $g$ are mappings, typically a forward model describing the image formation process and a prior distribution of the latent image. ADMM solves the problem through a sequence of subproblems:

$x^{(k+1)} = \operatorname*{argmin}_{x} \; f(x) + \tfrac{\rho}{2}\|x - (v^{(k)} - u^{(k)})\|^2,$ (2a)
$v^{(k+1)} = \operatorname*{argmin}_{v} \; g(v) + \tfrac{\rho}{2}\|v - (x^{(k+1)} + u^{(k)})\|^2,$ (2b)
$u^{(k+1)} = u^{(k)} + (x^{(k+1)} - v^{(k+1)}).$ (2c)

In the last equation (2c), the vector $u^{(k)}$ is the (scaled) Lagrange multiplier associated with the constraint. Under mild conditions, e.g., when $f$ and $g$ are convex, closed, and proper, global convergence of the algorithm can be proved [33]. Recent studies show that ADMM converges even for some non-convex functions [34].

When $f$ and $g$ are convex, the minimizations in (2a) and (2b) are known as the proximal maps of $f$ and $g$, respectively [38]. If we define the proximal maps as

$F(z) = \operatorname*{argmin}_{x} \; f(x) + \tfrac{\rho}{2}\|x - z\|^2, \qquad G(z) = \operatorname*{argmin}_{v} \; g(v) + \tfrac{\rho}{2}\|v - z\|^2,$ (3)

then it is not difficult to see that at the optimal point, (2a) and (2b) become

$x^* = F(v^* - u^*),$ (4a)
$v^* = G(x^* + u^*),$ (4b)

where $(x^*, v^*)$ are the solutions to the original constrained optimization in (1). Equations (4a) and (4b) show that the solution can now be considered as a fixed point of the system of equations.

Rewriting (2a)-(2c) in terms of (4a) and (4b) allows us to consider agents that are not necessarily proximal maps, i.e., the underlying $f$ or $g$ is not convex or may not even be expressible as an optimization. One example is to use an off-the-shelf image denoiser for $G$, e.g., BM3D, non-local means, or a neural network denoiser. Such an algorithm is known as the Plug-and-Play ADMM [33, 34, 35] (with variations thereafter [36, 32]).
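As a toy illustration of the Plug-and-Play idea (not the paper's implementation), the sketch below plugs an arbitrary denoiser into the ADMM iterations of a trivial quadratic data term; the linear shrinkage "denoiser" stands in for BM3D or non-local means.

```python
import numpy as np

def pnp_admm(y, denoise, rho=1.0, iters=100):
    """Toy Plug-and-Play ADMM sketch: the inversion step is the proximal
    map of f(x) = 0.5*||x - y||^2, and the prior step is a plugged-in
    denoiser that need not correspond to any explicit function g."""
    x = np.zeros_like(y)
    v = np.zeros_like(y)
    u = np.zeros_like(y)
    for _ in range(iters):
        x = (y + rho * (v - u)) / (1.0 + rho)  # prox of 0.5*||x - y||^2
        v = denoise(x + u)                     # plugged-in denoiser
        u = u + (x - v)                        # multiplier update (2c)
    return v

# A linear shrinkage toward zero stands in for a real image denoiser.
shrink = lambda z: 0.9 * z
```

For this shrinkage (the proximal map of a small quadratic penalty), the iteration converges to the corresponding regularized least-squares solution, illustrating that the fixed point, not the intermediate iterates, is what the plugged-in agent shapes.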

II-B MACE and Intuition

MACE generalizes the above ADMM formulation. Instead of minimizing a sum of two functions, MACE minimizes a sum of $N$ functions $f_1, \ldots, f_N$:

$\widehat{x} = \operatorname*{argmin}_{x} \; \sum_{i=1}^{N} f_i(x).$ (5)

In this case, the equations in (4a)-(4b) are generalized to

$F_i(x^* + u_i^*) = x^*, \quad i = 1, \ldots, N, \qquad \text{and} \qquad \sum_{i=1}^{N} u_i^* = 0.$ (6)
What does (6) buy us? Intuitively, (6) suggests that in a system containing $N$ agents, each agent will create a tension $u_i^*$. For example, if $F_1$ is an inversion step whereas $F_2$ is a denoising step, then $F_1$ will not agree with $F_2$, because $F_1$ tends to recover details whereas $F_2$ tends to smooth them out. The agents will reach an equilibrium state where the sum of the tensions is zero. This explains the name "consensus equilibrium", as the algorithm is seeking a consensus among all the agents.

What does the equilibrium solution look like? The following theorem, shown in [32], connects the equilibrium condition to the fixed point of an iterative algorithm.

Theorem 1 (MACE solution [32]).

Let $u^* = [u_1^*; \ldots; u_N^*]$ and $z_i^* = x^* + u_i^*$. The consensus equilibrium $(x^*, u^*)$ is a solution to the MACE equation (6) if and only if the points $z^* = [z_1^*; \ldots; z_N^*]$ satisfy

$(2G - I)(2F - I)z^* = z^*,$ (8)

where $F$ and $G$ are mappings defined as

$F(z) = [F_1(z_1); \ldots; F_N(z_N)], \qquad G(z) = [\bar{z}; \ldots; \bar{z}],$ (9)

where $\bar{z} = \frac{1}{N}\sum_{i=1}^{N} z_i$ is the average of $z_1, \ldots, z_N$.

1: Initialize $z^{(0)} = [z_1^{(0)}; \ldots; z_N^{(0)}]$.
2: for $k = 0, 1, 2, \ldots$ do
3:     % Perform agent updates
4:     $w^{(k)} = (2F - I)(z^{(k)})$
5:     % Perform the data aggregation
6:     $z^{(k+1)} = (2G - I)(w^{(k)})$
7: end for
8: Output $x^* = \bar{z}^{(k)}$.
Algorithm 1 MACE Algorithm

Theorem 1 provides a full characterization of the MACE solution. The operator $G$ in Theorem 1 is a consensus agent that takes a set of inputs $z_1, \ldots, z_N$ and maps them to their average $\bar{z}$. In fact, we can show that $G$ is a projection and that $(2G - I)$ is its own inverse [32]. As a result, (8) is equivalent to $(2F - I)z^* = (2G - I)z^*$. That is, we want the individual agents to match the consensus agent so that the equilibrium $F(z^*) = G(z^*)$ holds.

The MACE algorithm is illustrated in Algorithm 1. According to (8), $z^*$ is a fixed point of the set of equilibrium equations. The fixed point can be found by iteratively updating

$z^{(k+1)} = (2G - I)(2F - I)z^{(k)}.$ (10)

Therefore, the algorithmic steps are no more complicated than updating the individual agents in parallel, and then aggregating the results through $G$.
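In code, the fixed-point iteration can be sketched as follows. The agents here are toy proximal maps, and the averaged (Mann) update weight `rho` is an assumed implementation detail, not a value from the paper.

```python
import numpy as np

def mace(agents, z0, rho=0.5, iters=200):
    """Sketch of the MACE iteration: stacked agent update F, averaging
    (consensus) operator G, and the averaged fixed-point update
    z <- (1 - rho) z + rho (2G - I)(2F - I) z."""
    z = [z0.copy() for _ in agents]                  # one state per agent
    for _ in range(iters):
        w = [2 * Fi(zi) - zi for Fi, zi in zip(agents, z)]   # (2F - I)z
        wbar = sum(w) / len(w)                               # G(w)
        z = [(1 - rho) * zi + rho * (2 * wbar - wi)          # (2G - I)w
             for zi, wi in zip(z, w)]
    xs = [Fi(zi) for Fi, zi in zip(agents, z)]
    return sum(xs) / len(xs)                         # consensus estimate

# Two toy proximal-map agents pulling toward different targets:
prox_a = lambda z: (z + 1.0) / 2.0   # prox of 0.5*||x - 1||^2
prox_b = lambda z: (z + 3.0) / 2.0   # prox of 0.5*||x - 3||^2
```

For these two agents the equilibrium of (6) sits at the midpoint of the two targets, with equal and opposite tensions, which the iteration recovers.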

The convergence of MACE is guaranteed when $T = (2G - I)(2F - I)$ is non-expansive [32], as summarized in the proposition below.

Proposition 1.

Let $F$ and $G$ be defined as in (9), and let $T = (2G - I)(2F - I)$. Then the following results hold:

  1. $F$ is firmly non-expansive if all $F_i$'s are firmly non-expansive.

  2. $G$ is firmly non-expansive.

  3. $T$ is non-expansive if $F$ and $G$ are firmly non-expansive.

Proof: See Appendix. ∎

III Designing MACE Agents

Having described the MACE framework, we now discuss how each agent is designed for our problem.

III-A Agent 1: Dual-Layer Closed-Form Matting

The first agent we use in MACE is a modified version of the classic closed-form matting. More precisely, we define the agent as

$F_1(z) = \operatorname*{argmin}_{\alpha} \; \alpha^T \bar{L} \alpha + \tfrac{\rho}{2}(\alpha - z)^T D (\alpha - z),$ (13)

where $\bar{L}$ and $D$ are matrices that will be explained below, and the constant $\rho$ is a parameter.

Review of Closed-Form Matting. To understand the meaning of (13), we recall that classical closed-form matting is an algorithm that tries to solve

$J(\alpha, a, b) = \sum_{j} \Big( \sum_{i \in w_j} (\alpha_i - a_j^T I_i - b_j)^2 + \epsilon\, a_j^T a_j \Big).$ (14)

Here, $(a_j, b_j)$ are the linear combination coefficients of the color line model $\alpha_i \approx a_j^T I_i + b_j$, and $\alpha_i$ is the alpha matte value of the $i$-th pixel [16]. The set $w_j$ is a window around pixel $j$. With some algebra, we can show that the marginalized energy function $J(\alpha) = \min_{a,b} J(\alpha, a, b)$ is equivalent to

$J(\alpha) = \alpha^T L \alpha,$ (15)

where $L$ is the so-called matting Laplacian matrix. When a trimap is given, we can regularize (15) by minimizing the overall energy function:

$\widehat{\alpha} = \operatorname*{argmin}_{\alpha} \; \alpha^T L \alpha + \lambda (\alpha - g)^T \widetilde{D} (\alpha - g),$ (16)

where $\widetilde{D}$ is a binary diagonal matrix with entries being one for pixels that are labeled in the trimap, and zero otherwise. The vector $g$ contains the specified alpha values given by the trimap. Thus, for large $\lambda$, the minimization in (16) will force the solution to satisfy the constraints given by the trimap.
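Numerically, a minimization of this form reduces to a sparse symmetric linear system. Below is a minimal sketch, with a 1-D second-difference matrix standing in for the matting Laplacian and two "trimap-labeled" endpoint pixels; it is an illustration of the structure of the solve, not the paper's implementation.

```python
import numpy as np
from scipy.sparse import diags, eye
from scipy.sparse.linalg import spsolve

# Minimizing a'L a + lam*(a - g)' D (a - g) gives the linear system
# (L + lam*D) a = lam*D g.  Here L is a toy 1-D second-difference
# matrix, NOT the true matting Laplacian.
n, lam = 10, 100.0
L = (2 * eye(n) - diags([1.0] * (n - 1), 1) - diags([1.0] * (n - 1), -1)).tocsr()
d = np.zeros(n); d[[0, n - 1]] = 1.0     # labeled pixels (trimap mask)
g = np.zeros(n); g[n - 1] = 1.0          # their prescribed alpha values
alpha = spsolve((L + lam * diags(d)).tocsc(), lam * d * g)
```

The solution ramps smoothly from the background label (0) at one end to the foreground label (1) at the other, which is exactly the smoothing-plus-constraints behavior the matting Laplacian provides in 2-D.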

Dual-Layer Matting Laplacian $\bar{L}$. In the presence of the plate image, we have two pieces of complementary information: the color image $I$ containing the foreground object, and the plate image $P$. Correspondingly, we have the alpha matte $\alpha$ for $I$, and the all-zero alpha matte for $P$. When $P$ is given, we can redefine the color line model as

$\alpha_i \approx a_j^T I_i + b_j, \qquad 0 \approx a_j^T P_i + b_j, \qquad i \in w_j.$ (17)

In other words, we ask the coefficients $(a_j, b_j)$ to fit simultaneously the actual image $I$ and the plate image $P$. When (17) is assumed, the energy function becomes

$\bar{J}(\alpha, a, b) = \sum_{j} \Big( \sum_{i \in w_j} (\alpha_i - a_j^T I_i - b_j)^2 + \eta \sum_{i \in w_j} (a_j^T P_i + b_j)^2 + \epsilon\, a_j^T a_j \Big),$ (18)

where we added a constant $\eta$ to regulate the relative emphasis between $I$ and $P$.

Theorem 2.

The marginal energy function

$\bar{J}(\alpha) = \min_{a,b} \bar{J}(\alpha, a, b)$ (19)

can be equivalently expressed as $\bar{J}(\alpha) = \alpha^T \bar{L} \alpha$, where $\bar{L}$ is the modified matting Laplacian. Its $(i,j)$-th element is a closed-form expression involving the Kronecker delta $\delta_{ij}$, the color vector $I_i$ at the $i$-th pixel, and a weighted mean vector and covariance matrix computed over both $I$ and $P$ within each window.

Proof: See Appendix. ∎

Because of the plate term in (18), the modified matting Laplacian $\bar{L}$ is positive definite (see Appendix for the proof), whereas the original $L$ in (15) is only positive semi-definite.

Dual-Layer Regularization $D$. The diagonal regularization matrix $D$ in (13) is reminiscent of the binary matrix $\widetilde{D}$ in (16), but $D$ is defined through a sigmoid function applied to the input $z$. To be more precise, we define $D = \mathrm{diag}(d_1, \ldots, d_n)$, where

$d_i = \frac{1}{1 + e^{-\kappa(z_i - \tau)}},$ (23)

and $z_i$ is the $i$-th element of the vector $z$, which is the argument of $F_1$. The scalar constant $\kappa$ is a user-defined parameter specifying the stiffness of the sigmoid function, and $\tau$ is another user-defined parameter specifying the center of the transient. Typical values of $(\kappa, \tau)$ for our MACE framework are listed in Table II.

A closer inspection of $D$ and $\widetilde{D}$ reveals that $\widetilde{D}$ performs a hard threshold whereas $D$ performs a soft threshold. In fact, the matrix $\widetilde{D}$ has diagonal entries

$\widetilde{d}_i = \begin{cases} 1, & z_i \ge \tau_1 \text{ or } z_i \le \tau_0, \\ 0, & \text{otherwise}, \end{cases}$ (24)

for two cutoff values $\tau_0$ and $\tau_1$. This hard threshold is equivalent to the soft threshold in (23) when $\kappa \to \infty$.

There are a few reasons why (23) is preferred over (24), especially when we have the plate image. First, the soft threshold in (23) tolerates more error in $z$, because the values of $z$ represent the probability of having foreground pixels. Second, the one-sided threshold in (23) ensures that the background portion of the image is handled by the plate image rather than the input $z$. This is usually beneficial when the plate is reasonably accurate.
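The two gating strategies can be sketched side by side; the stiffness, center, and cutoff values below are stand-ins, not the paper's settings.

```python
import numpy as np

def soft_gate(z, kappa=30.0, tau=0.8):
    """Sigmoid gating in the style of (23): a smooth, one-sided weight
    that rises as the foreground probability z crosses tau."""
    return 1.0 / (1.0 + np.exp(-kappa * (np.asarray(z, dtype=float) - tau)))

def hard_gate(z, lo=0.2, hi=0.8):
    """Binary counterpart in the style of (24): only confidently labeled
    pixels (very low or very high z) receive a nonzero weight."""
    z = np.asarray(z, dtype=float)
    return ((z <= lo) | (z >= hi)).astype(float)
```

As the stiffness grows, the sigmoid approaches a step, so the hard gate is the limiting case of the soft one; at moderate stiffness, pixels with intermediate probabilities still contribute partially, which is the error tolerance discussed above.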

III-B Agent 2: Background Estimator

Our second agent is a background estimator, defined as

$F_2(z) = \operatorname*{argmin}_{v} \; \theta\|v - \widetilde{y}\|^2 + \tfrac{\rho}{2}\|v - z\|^2 + \beta\, v^T(\mathbf{1} - v),$ (25)

where $\widetilde{y}$ is an initial background estimate defined below. The reason for introducing $F_2$ is that in $F_1$, the matrix $D$ is determined by the current estimate $z$. While $D$ handles part of the error in $z$, large missing pixels and false alarms can still cause problems, especially in the interior regions. The goal of $F_2$ is to complement $F_1$ for these interior regions.

Initial Background Estimate $\widetilde{y}$. Let us take a look at (25). The first two terms are quadratic: given a fixed initial estimate $\widetilde{y}$ and the current input $z$, $F_2$ returns a linearly combined estimate between $\widetilde{y}$ and $z$. The initial estimate $\widetilde{y}$ consists of two parts:

$\widetilde{y} = r \odot q,$ (26)

where $\odot$ denotes elementwise multiplication. The first term $r$ is the color term, measuring the similarity between foreground and background colors. The second term $q$ is the edge term, measuring the likelihood of foreground edges relative to background edges. In the following we discuss these two terms one by one.

Defining the Color Term $r$. We define $r$ by measuring the distance between a color pixel $I_i$ and the corresponding plate pixel $P_i$.

In order to improve the robustness of $r$ against noise and illumination fluctuation, we compute the distance as a bilateral weighted average over a small neighborhood:

$d_i = \frac{\sum_{j \in \Omega_i} w_{ij} \|I_j - P_j\|^2}{\sum_{j \in \Omega_i} w_{ij}},$ (27)

where $\Omega_i$ specifies a small window around pixel $i$. The bilateral weight $w_{ij}$ is defined as

$w_{ij} = s_{ij}\, c_{ij},$ (28)
$s_{ij} = \exp\!\big(-\|x_i - x_j\|^2 / h_s^2\big), \qquad c_{ij} = \exp\!\big(-\|I_i - I_j\|^2 / h_r^2\big).$ (29)

Here, $x_i$ denotes the spatial coordinate of pixel $i$, $I_j$ denotes the $j$-th color pixel of the color image $I$, and $(h_s, h_r)$ are the parameters controlling the bilateral weight strength. The typical values of $h_s$ and $h_r$ are both 5.

We now need a mapping which maps the distance $d_i$ to a number in $[0,1]$ so that the term $r$ makes sense. To this end, we choose a simple Gaussian function:

$r_i = \exp\!\Big(-\frac{d_i}{2\sigma^2}\Big),$ (30)

where $\sigma$ is a user-tunable parameter. We tested other possible mappings, such as the sigmoid function and the cumulative distribution function of a Gaussian, but did not see significant difference compared to (30). The typical value for $\sigma$ is 10.
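A direct (unvectorized) sketch of the bilateral-weighted distance and its Gaussian mapping follows; the exact normalization, and the convention of reporting `1 - exp(...)` as a foreground probability, are assumptions of this sketch rather than the paper's definitions.

```python
import numpy as np

def color_term(frame, plate, window=2, hs=5.0, hr=5.0, sigma=10.0):
    """Sketch: bilateral-weighted squared color distance between frame
    and plate, mapped through a Gaussian.  Returns a map that is near 0
    where the frame matches the plate and near 1 where it differs."""
    H, W, _ = frame.shape
    d = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            y0, y1 = max(0, i - window), min(H, i + window + 1)
            x0, x1 = max(0, j - window), min(W, j + window + 1)
            patch = frame[y0:y1, x0:x1].reshape(-1, 3)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            spatial = ((yy - i) ** 2 + (xx - j) ** 2).ravel()
            colors = np.sum((patch - frame[i, j]) ** 2, axis=1)
            w = np.exp(-spatial / hs**2 - colors / hr**2)   # bilateral weight
            pdiff = np.sum(
                (patch - plate[y0:y1, x0:x1].reshape(-1, 3)) ** 2, axis=1)
            d[i, j] = np.sum(w * pdiff) / np.sum(w)         # weighted distance
    return 1.0 - np.exp(-d / sigma**2)                      # assumed convention
```

The bilateral weighting means an isolated noisy pixel inherits the (small) distance of its neighborhood, which is the robustness property the averaging is after.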

Fig. 4: Illustration of how to construct the color term $r$. We compute the distance between the input image and the plate, using a bilateral weight to improve robustness. The resulting map represents the probability of having a foreground pixel.

Defining the Edge Term . The color term is able to capture most of the difference between the image and the plate. However, it also generates false alarms if there is illumination change. For example, shadow due to the foreground object is often falsely labeled as foreground. See the shadow near the foot in Figure 4.

In order to reduce the false alarms due to minor illumination changes, we first create a "super-pixel" mask by grouping similar colors. Our super-pixels are generated by applying a standard flood-fill algorithm [39] to the image $I$. This gives us a partition of the image as

$I = \bigcup_{k} S_k,$ (31)

where the $S_k$'s are the super-pixel index sets. The plate image is partitioned using the same super-pixel indices.

While generating the super-pixels, we also compute the gradients of $I$ and $P$ at every pixel $i$. Specifically, we define $\nabla I_i = [\nabla_x I_i, \nabla_y I_i]$ and $\nabla P_i = [\nabla_x P_i, \nabla_y P_i]$, where $\nabla_x$ (and $\nabla_y$) denote the two-tap horizontal (and vertical) finite difference at the $i$-th pixel. To measure how far $\nabla I_i$ is from $\nabla P_i$, we compute

$e_i = \|\nabla I_i - \nabla P_i\|.$ (32)

Thus, $e_i$ is small for background regions because $\nabla I_i \approx \nabla P_i$, but is large when there is a foreground pixel at $i$. If we apply a threshold after computing $e_i$, i.e., set $e_i = 0$ whenever $e_i$ falls below a threshold, then shadows can be removed, as their gradients are weak.

Now that we have computed $e_i$, we still need to map it back to a quantity similar to the alpha matte. To this end, we compute a normalization term

$\kappa_i = \max\big(\|\nabla I_i\|, \|\nabla P_i\|\big),$ (33)

and normalize $e_i$ by

$q_i = \frac{\sum_{j \in S_k} \kappa_j \, \mathbb{1}\{e_j > \tau_1\}}{\sum_{j \in S_k} \kappa_j \, \mathbb{1}\{\kappa_j > \tau_2\}}, \qquad i \in S_k,$ (34)

where $\mathbb{1}\{\cdot\}$ denotes the indicator function, and $\tau_1$ and $\tau_2$ are thresholds. In essence, (34) says that in the $k$-th super-pixel $S_k$, we count the number of edges that have a strong difference between $I$ and $P$. However, we do not want to count every pixel, but only pixels that already contain strong edges, either in $I$ or in $P$. Thus, we take a weighted average using $\kappa_j$ as the weight. This defines $q_i$, as the weighted average is shared among all pixels in the super-pixel $S_k$. Figure 5 shows a pictorial illustration.
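Given a super-pixel labeling (e.g., from flood fill), the edge term can be sketched as below; the gradient operator (`np.gradient` instead of a two-tap difference) and the two thresholds are stand-ins, not the paper's choices.

```python
import numpy as np

def edge_term(frame_gray, plate_gray, labels, t1=0.1, t2=0.05):
    """Sketch: per super-pixel, a weighted average of "strong gradient
    difference" indicators, weighted by edge strength, so weak shadow
    edges are ignored and the value is shared across the super-pixel."""
    gI = np.hypot(*np.gradient(frame_gray))       # |grad I|
    gP = np.hypot(*np.gradient(plate_gray))       # |grad P|
    e = np.abs(gI - gP)                           # gradient difference
    kappa = np.maximum(gI, gP)                    # edge-strength weight
    out = np.zeros_like(e)
    for s in np.unique(labels):                   # one value per super-pixel
        m = labels == s
        denom = (kappa[m] * (kappa[m] > t2)).sum()
        num = (kappa[m] * (e[m] > t1)).sum()
        out[m] = num / denom if denom > 0 else 0.0
    return out
```

Because the average is shared across each super-pixel, a foreground region whose boundary edges differ strongly from the plate gets marked in its interior as well, which is how the edge term fills the holes the color term leaves behind.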

Fig. 5: Illustration of how to construct the edge term $q$.
Fig. 6: Comparison between the color term $r$, the edge term $q$, and their product $\widetilde{y}$.

Why is $q$ helpful? If we look at $r$ and $q$ in Figure 6, we see that the foreground pixels of $r$ and $q$ coincide, while the background pixels roughly cancel each other. The reason is that while $r$ creates weak holes in the foreground, $q$ fills the gaps by ensuring the foreground is marked.

Regularization. The last term in (25) is a regularization forcing the solution toward either 0 or 1. The effect of this term can be seen from the fact that $v^T(\mathbf{1} - v)$ is a symmetric concave quadratic with value zero at $v = \mathbf{0}$ or $v = \mathbf{1}$. Therefore, it penalizes solutions that are away from 0 or 1. For a sufficiently small regularization weight, one can show that the Hessian matrix of the overall objective is positive semidefinite, so the objective in (25) remains strongly convex.

III-C Agent 3: Total Variation Denoising

The third agent we use in this paper is total variation denoising:

$F_3(z) = \operatorname*{argmin}_{v} \; \lambda \|v\|_{\mathrm{TV}} + \tfrac{\rho}{2}\|v - z\|^2,$ (35)

where $\lambda$ is a parameter. The norm $\|\cdot\|_{\mathrm{TV}}$ is defined in space-time:

$\|v\|_{\mathrm{TV}} = \sum_{x,y,t} \sqrt{\beta_x (\nabla_x v)^2 + \beta_y (\nabla_y v)^2 + \beta_t (\nabla_t v)^2},$ (36)

where $(\beta_x, \beta_y, \beta_t)$ controls the relative strength of the gradient in each direction. In this paper, for spatial total variation we set the temporal weight $\beta_t$ to zero, and for spatial-temporal total variation we use a nonzero $\beta_t$.
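A sketch of the weighted space-time total variation on a video volume `a[t, y, x]` follows, using forward differences and an anisotropic (L1) flavor; the per-direction weights are placeholders, and setting the temporal weight to zero recovers purely spatial TV.

```python
import numpy as np

def tv_spacetime(a, bx=1.0, by=1.0, bt=0.25):
    """Sketch of a weighted space-time TV semi-norm on a volume
    a[t, y, x]; anisotropic (L1) flavor with forward differences.
    The weights (bx, by, bt) are illustrative placeholders."""
    dx = np.abs(np.diff(a, axis=2)).sum()   # horizontal gradients
    dy = np.abs(np.diff(a, axis=1)).sum()   # vertical gradients
    dt = np.abs(np.diff(a, axis=0)).sum()   # temporal gradients
    return bx * dx + by * dy + bt * dt
```

A proximal solver for (35) would minimize this semi-norm plus a quadratic; the sketch only evaluates the semi-norm, which is enough to see how the temporal weight trades per-frame smoothness against temporal flicker.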

A denoising agent is used in the MACE framework because we want to ensure smoothness of the resulting matte. The choice of total variation denoising is a balance between complexity and performance. Users can use stronger denoisers such as BM3D, but these patch-based image denoising algorithms rely on a patch matching procedure, and so they tend to under-smooth repeated patterns of false alarms / misses. Neural network denoisers are better candidates, but they need to be trained with the specifically distorted alpha mattes. From our experience, we do not see any particular advantage of using CNN-based denoisers. Figure 7 shows a comparison.

(a) Input (b) TV [40] (c) BM3D [41] (d) IRCNN [42]
Fig. 7: Comparison of different denoisers used in MACE. Shown are the results when MACE converges. The shadow near the foot is a typical source of false alarms, one which many denoisers cannot handle.

III-D Parameters and Runtime

The typical values for the parameters of the proposed method are presented in Table II. Most of them are rarely changed. The parameter controlling the denoising strength of Agent 3 has a default value of 0.05; increasing it produces more binary results with clearer boundaries. The thresholds determining the edge term in Agent 2 are fixed, while $\sigma$ determines the color term in Agent 2: a large $\sigma$ produces fewer false negatives but more false positives. Overall, the performance is reasonably stable with respect to these parameters.

Values: 0.01, 2, 4, 0.05, 0.01, 0.02, 10
TABLE II: Typical values for parameters

In terms of runtime, the most time-consuming part is Agent 1, because we need to solve a large-scale sparse least squares problem; its runtime is determined by the number of foreground pixels. Table III shows the runtime on the sequences we tested. In generating these results, we used unoptimized MATLAB code on an Intel i7-4770K. The typical runtime is about 1-3 minutes per frame, which could be further improved by porting the algorithm to C++ or GPUs. From our experience working with professional artists, even with professional film production software, e.g., NUKE, it takes about 15 minutes to produce a ground-truth label using the plate and temporal cues. Therefore, the runtime benefit offered by our algorithm is substantial.

Sequence        Resolution  FGD %    time/Fr (sec)  Indoor/Outdoor
Purdue Dataset:
  Book          540x960     19.75%   231.0          outdoor
  Building      632x1012     4.03%   170.8          outdoor
  Coach         790x1264     4.68%   396.1          outdoor
  Studio        480x270     55.10%    58.3          indoor
  Road          675x1175     1.03%   232.9          outdoor
  Tackle        501x1676     4.80%   210.1          outdoor
  Gravel        790x1536     2.53%   280.1          outdoor
  Office        623x1229     3.47%   185.3          indoor
Public Dataset [14]:
  Bootstrap     480x640     13.28%   109.1          indoor
  Cespatx       480x640     10.31%   106.4          indoor
  DCam          480x640     12.23%   123.6          indoor
  Gen           480x640     10.23%   100.4          indoor
  Multipeople   480x640      9.04%    99.5          indoor
  Shadow        480x640     11.97%   115.2          indoor
TABLE III: Description of the video sequences used in our experiments. Additional per-sequence attributes discussed in the text include the presence of shadow, lighting issues, background vibration, camouflage, green screen, and ground truth.

IV Experimental Results

IV-A Dataset

To evaluate the proposed method, we create a Purdue dataset containing 8 video sequences using the HypeVR Inc. 360-degree camera shown in Figure 8. The original images are captured at a frame rate of 48 fps, and are then downsampled and cropped to speed up the matting process. In addition to these videos, we also include 6 video sequences from a public dataset [14], for a total of 14 video sequences. Snapshots of the sequences are shown in Figure 9. All video sequences are captured without camera motion. Plate images are available, either during the first or the last few frames of each video. To enable objective evaluation, for each video sequence we randomly select 10 frames and manually generate the ground truths, giving a total of 140 frames with ground truth.

Fig. 8: [Left] The camera system we used for this paper. [Right] Display and headset to view the processed content.
(a) Snapshots of the Purdue Dataset.
(b) Snapshots of a public dataset [14]
Fig. 9: Snapshots of the videos we use in the experiment. Top row: Building, Coach, Studio, Road, Tackle, Gravel, Office, Book. Bottom row: Bootstrap, Cespatx, Dcam, Gen, MP, Shadow.

The characteristics of the dataset are summarized in Table III. The Purdue dataset has various resolutions, whereas the public dataset has a single resolution of 480x640. The foreground percentage of the Purdue videos ranges from 1.03% to 55.10%, whereas the public dataset sequences have similar foreground percentages of around 10%. The runtime of the algorithm (per frame) is determined by the resolution and the foreground percentage. In terms of content, the Purdue dataset focuses on outdoor scenes, whereas the public dataset is indoor only. The shadow column indicates the presence of shadow. Lighting issues include illumination changes due to auto-exposure and auto-white-balance. Background vibration only applies to outdoor scenes where the background objects have minor movements, e.g., moving grass or tree branches. The camouflage column indicates similarity in color between the foreground and background, a common problem for most sequences. The green screen column shows which sequences have green screens to mimic the common chroma-keying environment.

IV-B Competing Methods

We categorize the competing methods into three different categories. The key ideas of the competing methods are summarized in Table IV.

Method        Automated?  Key idea
BSVS [25]     semi-auto   bilateral space video segmentation
Matting [43]  trimap      trimap generation + alpha matting
Grabcut [29]  full-auto   iterative graph cuts video segmentation
ViBe [44]     full-auto   pixel-model based background subtraction
NLVS [45]     full-auto   non-local voting video segmentation
Pbas [15]     full-auto   non-parametric background subtraction
TABLE IV: Description of the competing methods.
Fig. 10: Comparison with competing methods: BSVS [25], Trimap + DCNN [10], Grabcut [29], ViBe [44], NLVS [45], and Pbas [15].
  • Fully automatic methods: We consider two background subtraction algorithms, the Pixel-based adaptive segmenter (Pbas) [15] and the Visual background extractor (ViBe) [44]; one video segmentation method, Non-local consensus voting (NLVS) [45]; and a modified version of Grabcut [29]. Pbas [15] and ViBe [44] run in real time, but the masks they generate have low quality. NLVS [45] uses optical flow as its main cue and exploits similarities among super-pixels, so it suffers from errors introduced in the super-pixel step. The modified version of Grabcut [29] uses the plate image and no longer requires user input. It performs reasonably well when the background and foreground have different colors, but suffers when the colors are similar.

  • Semi-automatic method: We consider the Bilateral space video segmentation (BSVS) [25], a semi-supervised video segmentation algorithm using the bilateral space. BSVS is semi-supervised as it requires the user to provide ground truth labels for key frames.

  • Alpha matting using trimaps: We consider one of the state-of-the-art alpha matting algorithms using a CNN [43]. The trimaps are generated by taking the frame difference between the plate and color images, followed by morphological and thresholding operations.

We emphasize that while these competing methods are possibly the closest to our proposed method, they are considerably different. For example, BSVS [25] does not require a plate image, but it requires key frames. NLVS [45] also does not require a plate image, but it cannot handle videos with relatively static objects, as it relies on optical flow.

IV-C Results

We use intersection over union (IoU) as the evaluation metric. The IoU consists of two parts, the intersection $\mathcal{I} = \sum_i \min(\widehat{\alpha}_i, \alpha_i)$ and the union $\mathcal{U} = \sum_i \max(\widehat{\alpha}_i, \alpha_i)$, with $\mathrm{IoU} = \mathcal{I}/\mathcal{U}$. Here, $\widehat{\alpha}_i$ is the $i$-th pixel of the estimated alpha matte, and $\alpha_i$ is that of the ground truth.
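One plausible reading of the metric treats the intersection and union as elementwise min and max over the (soft) mattes; this reading, sketched below, is an assumption, and a binarized variant would threshold both mattes first.

```python
import numpy as np

def iou(alpha_est, alpha_gt):
    """Soft IoU between two alpha mattes: elementwise min over elementwise
    max.  A binarized IoU would instead threshold both mattes (e.g., at
    0.5) before counting intersection and union pixels."""
    inter = np.minimum(alpha_est, alpha_gt).sum()
    union = np.maximum(alpha_est, alpha_gt).sum()
    return inter / union if union > 0 else 1.0
```

The soft version rewards partial agreement on boundary pixels, which matters for mattes whose values are fractional precisely where segmentation methods disagree.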

Comparison with fully automatic methods: The results are shown in Figure 10, where we plot the overall IoU for different methods. It is clear from the chart that the proposed method performs the best. Among the video sequences, the competing methods NLVS and Pbas perform worst on Book, Bootstrap, Coach, and Studio. In these sequences, the foreground object is stationary or only has rotational movement, so it is treated as background.

Comparison with the semi-automatic method: We compare our method with BSVS [25]. Since BSVS requires ground-truth key frames, in this experiment we provide 10 key frames for each video sequence. When testing, we feed the video sequence and the key frames to the algorithm. The output is a processed sequence; note that even the key frames are modified, because the algorithm propagates error.

The results of this experiment show that the proposed algorithm is on average better than BSVS even though BSVS has ground truth key frames. For the Office sequence, BSVS scores an IoU of 0.77 while the proposed method scores 0.95. BSVS performs poorly here because the background is complex and its color is sometimes similar to that of the foreground; even with key frame ground truth, BSVS is still confused. A similar observation can be made for Gravel. BSVS performs better on the Shadow sequence, scoring 0.96 compared with 0.90 for the proposed method, because the key frame ground truth helps prevent mislabeling the shadows as foreground. On average, the proposed method scores 0.93, which is higher than the 0.86 scored by BSVS.

Comparison with trimap + alpha-matting methods: In this experiment we compare with several state-of-the-art alpha matting algorithms. The visual comparison is shown in Figure 3, and the IoU of DCNN [10] is shown in Figure 10. To make these methods work, careful tuning during the trimap generation stage is required.

Figure 3 and Figure 10 show that most alpha matting algorithms suffer from false alarms near the boundary, e.g., spectral matting [6], closed-form matting [16], learning-based matting [7] and comprehensive matting [9]. The more recent methods such as K-nearest neighbors matting [8] and DCNN [10] have comparable amounts of false alarms and misses. Yet their overall performance is still worse than that of the proposed method.
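For reference, the false-alarm and miss rates discussed here can be computed from binary masks as, e.g.:

```python
import numpy as np

def fa_miss(est, gt):
    """False-alarm rate (fraction of background pixels marked foreground)
    and miss rate (fraction of foreground pixels missed); est, gt boolean."""
    fa = np.logical_and(est, ~gt).sum() / max((~gt).sum(), 1)
    miss = np.logical_and(~est, gt).sum() / max(gt.sum(), 1)
    return fa, miss
```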

Fig. 11: Office sequence results. (a) Input. (b) Ground truth. (c) Ours. (d) BSVS [25]. (e) Trimap + DCNN [10]. (f) Gcut [29]. (g) ViBe [44]. (h) NLVS [45]. (i) Pbas [15].
Fig. 12: Ablation study of the algorithm. We show the IoU scores obtained by eliminating one of the agents, and by replacing the denoising agent with other denoisers.

IV-D Ablation test and MACE with various denoisers

Since the proposed framework contains three different agents, we conduct an ablation study to verify the relative importance of the individual agents. To do so, we remove one of the three agents while keeping the other two fixed. The result is shown in Figure 12. Not surprisingly, the matting agent has the largest impact on the performance, followed by the background estimator and the denoiser. The drop in performance is most significant for hard sequences such as Book, which contains a moving background, and Road, which has strong color similarity between foreground and background. The proposed method scores 0.98 on Book, whereas in the absence of Agent 1 or Agent 3 it scores only 0.38 or 0.48, respectively. The drop is even larger for Road: the proposed method scores 0.91, but only 0.06 without Agent 1 and 0.11 without Agent 3.
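To make the role of the agents concrete, here is a toy scalar sketch of the MACE fixed point, computed by Mann-averaged (2G - I)(2F - I) iterations with G the consensus average. The quadratic proximal agents below are illustrative stand-ins for the paper's three agents, and an ablation corresponds to dropping an entry from the agent list:

```python
import numpy as np

def mace(agents, x0, rho=0.5, iters=200):
    """Toy MACE loop: v <- (1 - rho) v + rho (2G - I)(2F - I) v."""
    n = len(agents)
    v = np.full(n, float(x0))
    for _ in range(iters):
        Fv = np.array([f(vi) for f, vi in zip(agents, v)])  # each agent acts
        z = 2.0 * Fv - v                                    # reflected agents
        v = (1 - rho) * v + rho * (2.0 * z.mean() - z)      # reflected average
    return Fv.mean()                                        # consensus estimate

# Proximal agents of f_i(x) = (x - a_i)^2 / 2; their consensus equilibrium
# is the minimizer of the summed objectives, i.e., the mean of the a_i.
agents = [lambda v, a=a: (v + a) / 2.0 for a in (0.0, 2.0, 4.0)]
```

With these agents, `mace(agents, x0=10.0)` converges to 2.0, the mean of the targets, illustrating how the equilibrium balances all agents at once.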

In this ablation study, we also observe spikes of error for some scenes when the background estimator is absent from MACE. Without the term from the background estimator, the result looks grayish instead of close-to-binary, and this behavior produces the error spikes. For example, the performance drops from 0.94 for the proposed method to 0.72 when Agent 2 is absent. For green-screen scenes such as Tackle and Coach, the initial estimate is already close to the ground truth, so the performance does not suffer when the matting or denoising agent is absent.

For Agent 3, the denoiser, we observe that the total variation denoiser leads to the best performance for MACE. In the visual comparison shown in Figure 7, IRCNN [42] produces more detailed boundaries but fails to remove false alarms near the feet, whereas BM3D [41] removes false alarms better but produces less detailed boundaries. TV produces a more balanced result. As shown in Figure 12, BM3D performs similarly to IRCNN, scoring 0.85 on average, while TV performs the best with an average score of 0.93. BM3D and IRCNN score the worst on the Road sequence, as they fail to remove false alarms caused by moving tree branches in the background: these false alarms form a repetitive pattern that confuses BM3D, which denoises by block matching.
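As a reference point, total variation denoising in the Rudin-Osher-Fatemi sense can be sketched with smoothed gradient descent; the parameters are illustrative, and this pedagogical implementation is not the solver used in the experiments:

```python
import numpy as np

def tv_denoise(img, lam=0.1, step=0.1, iters=100, eps=1e-8):
    """Gradient descent on ||u - img||^2 / 2 + lam * TV_eps(u),
    with a smoothed TV term and periodic boundary handling."""
    u = img.astype(float).copy()
    for _ in range(iters):
        ux = np.roll(u, -1, axis=1) - u          # forward differences
        uy = np.roll(u, -1, axis=0) - u
        mag = np.sqrt(ux**2 + uy**2 + eps)       # smoothed gradient magnitude
        px, py = ux / mag, uy / mag
        div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        u -= step * ((u - img) - lam * div)      # data term + TV flow
    return u
```

Larger `lam` pushes the output further toward piecewise-constant, which is what drives a grayish mask toward a close-to-binary one.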

V Limitations and Discussion

While the proposed method outperforms the state-of-the-art methods, it has several limitations.


  • Quality of Plate Image. The plate assumption may not hold when the background moves substantially. In that case, a more complex background model that includes dynamic information is needed. However, for a non-stationary background, additional mechanisms are required to handle local errors and to maintain temporal consistency.

  • Loss of Fine Details. In our proposed method, fine details such as hair are compromised for robustness. Figure 13 illustrates an example. In some videos, the foreground and background colors are similar. This creates holes in the initial estimate, which can be filled by a strong denoiser such as total variation. However, total variation is known to oversmooth fine details. To mitigate this issue, an additional post-processing step using alpha matting can bring back the details around the boundary.

  • Strong Shadows. Strong shadows are sometimes treated as foreground, as shown in Figure 14. This is caused by the lack of shadow modeling in the problem formulation. The edge-based initial estimate can resolve the shadow issue to some extent, but not when the shadow is very strong. We tested a few off-the-shelf shadow removal algorithms [46, 47, 48], but in general they do not help, because shadows in our dataset can fall on the foreground object, which should not be removed.
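Regarding the first limitation, one simple dynamic-plate extension, a running-average update applied only outside the detected foreground, could be sketched as below; this is our own illustrative assumption, not part of the proposed method:

```python
import numpy as np

def update_plate(plate, frame, fg_mask, alpha=0.05):
    """Blend the new frame into the plate where no foreground was detected;
    pixels under the foreground mask keep the old plate value."""
    plate = plate.astype(float)
    blend = (1 - alpha) * plate + alpha * frame.astype(float)
    return np.where(fg_mask, plate, blend)
```

Such a model would absorb slow background changes, but it would still need the additional local-error and temporal-consistency handling mentioned above.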

Fig. 13: Limitation 1: Loss of fine details. (a) Color input. (b) Our result. (c) Improving our result by generating a trimap from (b). (d) Post-processed result by alpha matting using (b).
Fig. 14: Limitation 2: Strong shadows. When shadows are strong, they are easily misclassified as foreground.

VI Conclusion

We presented a new foreground extraction algorithm based on the multi-agent consensus equilibrium (MACE) framework. MACE is an information fusion framework that integrates multiple weak experts to produce a strong estimator. Equipped with three customized agents, a dual-layer closed-form matting agent, a background estimation agent and a total variation denoising agent, MACE produces substantially better foreground masks than state-of-the-art algorithms. MACE is fully automatic, meaning that no human intervention is required; this is a significant advantage over semi-supervised methods that require trimaps or scribbles. In its current form, MACE can handle minor variations in the background plate image, illumination changes and weak shadows. Extreme cases, e.g., large background movement or strong shadows, can still cause MACE to fail, but these could potentially be addressed by improving the background and shadow models.

VII Acknowledgement

The authors thank the anonymous reviewers for their valuable feedback. This work is supported, in part, by the National Science Foundation under grants CCF-1763896 and CCF-1718007.

VIII Appendix

Viii-a Proof of Theorem 2


We start by writing (18) in the matrix form


and denotes the index of the -th pixel in the neighborhood . The difference from classic closed-form matting [16] lies in the new terms , and (i.e., the second row of the quadratic function above).



Using the fact that , we can find the solution of the least-squares optimization:


We now need to simplify the term . First, observe that

where we define the terms , and . Then, by applying the block inverse identity, we have


where we further define and .

Substituting (38) back into , and using (39), we have



The -th element of is therefore


Summing the terms over each , we finally obtain

VIII-B Proof: is positive definite


Recall the definition of :

Based on Theorem 2, we have