I Introduction
I-A Motivation and Scope
The proliferation of virtual reality displays and rendering technology over the past decade is rapidly changing the landscape of cinematography [1, 2, 3, 4]. From traditional motion pictures to recent 3D animation, it is safe to argue that the next wave in the filmmaking industry is the immersive experience, e.g., head-mounted virtual reality displays. In order to offer sufficient content to these displays, data has to be acquired in special ways, e.g., using 360-degree volumetric imagers [5]. Typically, such videos are high-definition, at a full frame rate of 60 fps, and are captured using as many as 100 cameras simultaneously. This is an enormous amount of data: a five-minute video sequence using the above configuration already amounts to more than one million images. Efficient processing of these images is therefore critical.
Table I: Key differences between alpha matting, our method, and background subtraction.

Method      | Alpha Matting         | Ours                  | Background Subtraction
Goal        | foreground estimation | foreground estimation | moving object detection
Given       | input + trimap        | input + plate         | input sequence
Accuracy    | high                  | high                  | low (binary)
Background  | no                    | static                | dynamic
Motion bias | no                    | no                    | yes
The subject of this paper is the alpha matting / background subtraction problem arising from the virtual reality content creation process. The goal of alpha matting / background subtraction is to segment the foreground object from an image so that a virtual background can be substituted. Compared with the classical alpha matting problem [6, 7, 8, 9, 10] and the classical background subtraction problem [11, 12, 13], our work lies in the middle ground. Classical alpha matting requires humans to provide trimaps, hence it is not fully automatic. Background subtraction is fully automatic, but it produces significantly lower quality results than alpha matting. Our proposed solution achieves performance comparable to alpha matting while maintaining the same level of autonomy as background subtraction. Figure 1 and Table I outline the key differences between this paper and previous work in the literature. The major assumption we make is that a static background image, called the plate image, is available. This plate image can easily be obtained in a filming setting by using the first few frames before an object enters the scene.
Figure 2: Three typical challenges of plate images (Cases I, II, III). Panels: (a) Input, (b) Plate, (c) Frame diff., (d) Trimap, (e) DCNN [10], (f) Ours.
I-B Challenges of the Plate Images
Readers may argue at this point that the plate assumption is strong: if the plate is available, it seems that a standard frame difference with morphological operations (e.g., erosion / dilation) would be enough to provide a trimap, and thus a sufficiently powerful alpha matting algorithm would work. However, except for synthetic videos, plate images are never perfect. Below are three typical problems:


Background vibration. While we assume that the plate does not contain large moving objects, small vibrations of the background generally exist. Figure 2 Case I shows an example where the background tree vibrates.

Color similarity. When the foreground color is very similar to the background color, the generated trimap will have false alarms and misses. Figure 2 Case II shows an example where the man's clothing has a color similar to the wall.

Auto-exposure. If auto-exposure is used, the background intensity will change over time. Figure 2 Case III shows an example where the background cabinet becomes dimmer as the man leaves the room.
As shown in the examples, errors in the frame difference translate easily into false alarms and misses in the trimap. While we can enlarge the uncertainty region of the trimap to rely more on the color constancy model of alpha matting, alpha matting generally performs worse as the uncertainty region grows. We have also tested more advanced background estimation algorithms, e.g., [15] in OpenCV, but the results are similar or sometimes even worse.
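To make the baseline concrete, the frame-difference pipeline discussed above can be sketched as follows. This is a minimal illustration, not the paper's code: the threshold and structuring-element size are hypothetical, and the erosion / dilation are implemented naively with wrap-around shifts.

```python
import numpy as np

def shift_min(mask, k):
    """Binary erosion with a (2k+1)x(2k+1) square structuring element.

    np.roll wraps at image borders, which is acceptable for this sketch."""
    out = mask.copy()
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out &= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def shift_max(mask, k):
    """Binary dilation with a (2k+1)x(2k+1) square structuring element."""
    out = mask.copy()
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def trimap_from_plate(frame, plate, thresh=0.1, k=2):
    """Trimap from frame difference: 0 = background, 1 = foreground,
    0.5 = unknown band around the difference region."""
    diff = np.abs(frame - plate).mean(axis=-1) > thresh  # frame difference
    fg = shift_min(diff, k)                              # eroded: confident FG
    unknown = shift_max(diff, k) & ~fg                   # dilated band
    tri = np.zeros(diff.shape)
    tri[fg] = 1.0
    tri[unknown] = 0.5
    return tri

# Toy example: an 8x8 object appears in an otherwise unchanged scene.
frame = np.zeros((20, 20, 3))
plate = np.zeros((20, 20, 3))
frame[6:14, 6:14, :] = 1.0
tri = trimap_from_plate(frame, plate)
```

As the section argues, any error in `diff` (vibration, camouflage, exposure drift) propagates directly into the trimap, which is what the proposed method is designed to avoid.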
Figure 3: (a) Input, (b) Frame difference, (c) Trimap, (d) Closed-form matting [16], (e) Spectral matting [6], (f) Learning-based matting [7], (g) K-nearest neighbors [8], (h) Comprehensive sampling [9], (i) DCNN [10], (j) Ours (without trimap).
I-C Related Work
We briefly describe the available methods in the literature.


Alpha Matting. Classical alpha matting formulates the problem as minimizing an energy function associated with the foreground and background colors; examples include Poisson matting [17], closed-form matting [16], shared matting [18], Bayesian matting [19], and robust matting [20]. More recently, a number of deep neural network based approaches have been proposed, e.g., [21] and [10]. One big drawback of the classical alpha matting algorithms is that they require error-free trimaps. Figure 3 illustrates the performance of various alpha matting algorithms applied to a scene with multiple moving objects. In this example, we use frame difference to construct a trimap and then apply various alpha matting algorithms to estimate the alpha matte. Although the performance is reasonable, we observe that small errors in the trimap cause major issues in the resulting alpha matte.
Video Matting. Video matting algorithms extend alpha matting by considering the temporal dimension [22, 23, 24]. Temporal consistency is handled by introducing constraints or regularization functions. Trimaps are still needed for these algorithms, which can become intractable when the video is long. Alternative methods such as [25, 26, 27] identify key frames of the video and propagate user-labeled alpha mattes to adjacent frames. However, the propagation is often error-prone, especially in the case of occlusion.

Trimap Generation. In the absence of trimaps, there are multi-stage methods that first create the trimap and then perform alpha matting [28]. However, these methods still require initial segmentations using, e.g., GrabCut [29]. Other methods [30, 31] require additional sensor data, e.g., depth measurements, which are not always available.

Background Subtraction. Background subtraction methods range from simple frame difference to more sophisticated mixture models [11]. They also include classical video segmentation methods, e.g., those based on saliency [12, 13]. Most background subtraction methods are used to track objects rather than to extract alpha mattes. They are fully automated and run in real time, but the foreground masks they generate are usually of low quality.
I-D Contributions
This paper contributes to the alpha matting and background subtraction literature in two ways.


Automatic Alpha Matting. We provide a fully automatic solution to the alpha matting problem assuming that a plate image is available. To the best of our knowledge, this is the first of its kind in the literature.

Multi-Agent Consensus Equilibrium. Our solution leverages a newly developed framework called multi-agent consensus equilibrium (MACE) [32], a generalization of the Plug-and-Play ADMM [33, 34, 35, 36, 32]. We present three customized agents for the alpha matting problem and demonstrate, for the first time, how MACE can be used to tackle non-traditional image restoration problems.
In order to explain all the essential concepts, we organize the paper so that the general framework and the specific components are addressed in two different sections. In Section II we describe the MACE framework, including its derivation, intuition, and algorithm. In Section III we go into the details of each component of the MACE framework that is specific to foreground extraction. Experimental results are presented in Section IV.
II Multi-Agent Consensus Equilibrium
Our proposed method is based on the MultiAgent Consensus Equilibrium (MACE), recently developed by Buzzard et al. [32]. In this section, we describe the key components of MACE and briefly discuss why it works.
II-A ADMM
The starting point of MACE is the alternating direction method of multipliers (ADMM) algorithm [37]. The ADMM algorithm aims at solving a constrained minimization

\min_{x, v} \; f(x) + g(v), \quad \text{subject to } x = v,   (1)

where f and g are mappings, typically a forward model describing the image formation process and a prior distribution of the latent image. ADMM solves the problem through a sequence of subproblems:
x^{(k+1)} = \arg\min_x \; f(x) + (\rho/2) \| x - (v^{(k)} - u^{(k)}) \|^2,   (2a)
v^{(k+1)} = \arg\min_v \; g(v) + (\rho/2) \| v - (x^{(k+1)} + u^{(k)}) \|^2,   (2b)
u^{(k+1)} = u^{(k)} + (x^{(k+1)} - v^{(k+1)}).   (2c)
In the last equation (2c), the vector u is the (scaled) Lagrange multiplier associated with the constraint. Under mild conditions, e.g., when f and g are convex, closed, and proper, global convergence of the algorithm can be proved [33]. Recent studies show that ADMM converges even for some nonconvex functions [34]. When f and g are convex, the minimizations in (2a) and (2b) are known as the proximal maps of f and g, respectively [38]. If we define the proximal maps as
F(z) = \arg\min_x \; f(x) + (\rho/2)\|x - z\|^2, \quad G(z) = \arg\min_v \; g(v) + (\rho/2)\|v - z\|^2,   (3)
then it is not difficult to see that at the optimal point, (2a) and (2b) become
x^* = F(x^* - u^*),   (4a)
x^* = G(x^* + u^*),   (4b)
where (x^*, u^*) are the solutions to the original constrained optimization in (1). Equations (4a) and (4b) show that the solution can now be viewed as a fixed point of a system of equations.
Rewriting (2a)-(2c) in terms of (4a) and (4b) allows us to consider agents that are not necessarily proximal maps, i.e., the underlying function is not convex or the agent may not even be expressible as an optimization. One example is to use an off-the-shelf image denoiser for G, e.g., BM3D, non-local means, or a neural network denoiser. Such an algorithm is known as the Plug-and-Play ADMM [33, 34, 35] (with variations thereafter [36, 32]).
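As a toy illustration of the Plug-and-Play idea (not this paper's implementation), the following sketch runs the iterations (2a)-(2c) with the v-update replaced by an off-the-shelf smoother. The quadratic data term and the 3-tap box "denoiser" are stand-ins chosen so the loop is easy to verify.

```python
import numpy as np

def pnp_admm(y, denoiser, rho=1.0, iters=100):
    """Plug-and-Play ADMM for 0.5*||x - y||^2 plus an implicit denoiser prior.

    The v-update, normally the proximal map of the prior, is replaced by an
    arbitrary denoiser -- the essence of Plug-and-Play ADMM."""
    x = np.zeros_like(y)
    v = np.zeros_like(y)
    u = np.zeros_like(y)
    for _ in range(iters):
        x = (y + rho * (v - u)) / (1.0 + rho)  # prox of the data term, (2a)
        v = denoiser(x + u)                    # plugged-in denoiser, (2b)
        u = u + x - v                          # multiplier update, (2c)
    return x

def box_denoise(z):
    """A toy 'denoiser': 3-tap moving average (stand-in for BM3D etc.)."""
    zp = np.pad(z, 1, mode="edge")
    return (zp[:-2] + zp[1:-1] + zp[2:]) / 3.0

rng = np.random.default_rng(0)
y = np.linspace(0.0, 1.0, 64) + 0.1 * rng.standard_normal(64)
x_hat = pnp_admm(y, box_denoise)
```

For this particular choice (quadratic data term, rho = 1, linear denoiser), the fixed point can be shown to coincide with the denoised observation, which makes the loop easy to sanity-check.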
II-B MACE and Intuition
MACE generalizes the above ADMM formulation. Instead of minimizing a sum of two functions, MACE minimizes a sum of N functions f_1, ..., f_N:

\min_x \; \sum_{i=1}^N f_i(x).   (5)
In this case, the equations in (4a)-(4b) are generalized to

F_i(x^* + u_i^*) = x^*, \; i = 1, \dots, N, \quad \text{and} \quad \sum_{i=1}^N u_i^* = 0.   (6)
What does (6) buy us? Intuitively, (6) suggests that in a system containing N agents, each agent creates a tension u_i^*. For example, if F_1 is an inversion step whereas F_2 is a denoising step, then F_1 will not agree with F_2, because F_1 tends to recover details while F_2 tends to smooth them out. The agents reach an equilibrium state where the sum of the tensions is zero. This explains the name "consensus equilibrium", as the algorithm is seeking a consensus among all the agents.
What does the equilibrium solution look like? The following theorem, shown in [32], connects the equilibrium condition to the fixed point of an iterative algorithm.
Theorem 1 (MACE solution [32]).
Let v = [v_1; \dots; v_N]. The consensus equilibrium (x^*, u_1^*, \dots, u_N^*) is a solution to the MACE equation (6) if and only if the points v_i^* = x^* + u_i^* satisfy

\bar{v}^* = x^*,   (7)
(2G - I)(2F - I) v^* = v^*,   (8)

where F and G are mappings defined as

F(v) = [F_1(v_1); F_2(v_2); \dots; F_N(v_N)],   (9)
G(v) = [\bar{v}; \bar{v}; \dots; \bar{v}],   (10)

where \bar{v} is the average of (v_1, \dots, v_N):

\bar{v} = (1/N) \sum_{i=1}^N v_i.   (11)
Theorem 1 provides a full characterization of the MACE solution. The operator G in Theorem 1 is a consensus agent that takes a set of inputs (v_1, ..., v_N) and maps them to their average \bar{v}. In fact, we can show that G is a projection and that (2G - I) is its own inverse [32]. As a result, (8) is equivalent to (2F - I)v^* = (2G - I)v^*. That is, we want the individual agents to match the consensus agent so that the equilibrium holds: F(v^*) = G(v^*).
The MACE algorithm is illustrated in Algorithm 1. According to (8), v^* is a fixed point of the set of equilibrium equations. Finding the fixed point can be done by iteratively updating v^{(t)} through the procedure

v^{(t+1)} = (2G - I)(2F - I) v^{(t)}.   (12)

Therefore, the algorithmic steps are no more complicated than updating the individual agents F_i in parallel and then aggregating the results through G.
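The update (12) can be sketched in a few lines. The toy agents below are proximal maps of simple quadratics (an assumption for illustration, not the agents used in this paper), for which the consensus minimizer is known in closed form, so the fixed point can be checked. A Mann-averaged update is used, which is the standard way to guarantee convergence for nonexpansive maps.

```python
import numpy as np

def mace_solve(agents, n, dim, iters=200, gamma=0.5):
    """Mann-averaged fixed-point iteration for (12):
    v <- (1 - gamma) v + gamma (2G - I)(2F - I) v.

    agents: list of N maps F_i, each acting on a dim-vector.
    Returns the consensus estimate x* = mean_i v_i* at convergence."""
    v = np.zeros((n, dim))
    for _ in range(iters):
        fv = np.stack([F(v[i]) for i, F in enumerate(agents)])  # F(v)
        r = 2.0 * fv - v                                        # (2F - I) v
        rbar = r.mean(axis=0, keepdims=True)                    # average, i.e. G(r)
        tv = 2.0 * rbar - r                                     # (2G - I) r
        v = (1.0 - gamma) * v + gamma * tv                      # Mann update
    return v.mean(axis=0)

# Toy agents: proximal maps of f_i(x) = 0.5 * ||x - c_i||^2, for which
# F_i(z) = (c_i + rho * z) / (1 + rho). The minimizer of sum_i f_i is the
# mean of the c_i, so the consensus solution is known exactly.
rho = 1.0
c = [np.array([1.0]), np.array([2.0]), np.array([6.0])]
agents = [lambda z, ci=ci: (ci + rho * z) / (1.0 + rho) for ci in c]
x_star = mace_solve(agents, n=3, dim=1)  # should approach [3.0]
```

Note that only the averaging step couples the agents; the N agent evaluations are independent and can run in parallel, exactly as the text describes.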
The convergence of MACE is guaranteed when T = (2G - I)(2F - I) is nonexpansive [32], as summarized in the proposition below.
Proposition 1.
Let F and G be defined as in (9) and (10), and let T = (2G - I)(2F - I). Then the following results hold:

- F is firmly nonexpansive if all F_i's are firmly nonexpansive.
- G is firmly nonexpansive.
- T is nonexpansive if F and G are firmly nonexpansive.
Proof.
See Appendix. ∎
III Designing MACE Agents
Having described the MACE framework, we now discuss how each agent is designed for our problem.
III-A Agent 1: Dual-Layer Closed-Form Matting
The first agent we use in MACE is a modified version of the classical closed-form matting. More precisely, we define the agent as

F_1(z) = \arg\min_\alpha \; \alpha^T \bar{L} \alpha + (\rho/2)(\alpha - z)^T \bar{R} (\alpha - z),   (13)

where \bar{L} and \bar{R} are matrices that will be explained below. The constant \rho is a parameter.
Review of Closed-Form Matting. To understand the meaning of (13), recall that classical closed-form matting solves

\min_{\alpha, a, b} \; \sum_j \Big[ \sum_{i \in w_j} (\alpha_i - a_j^T I_i - b_j)^2 + \epsilon \|a_j\|^2 \Big],   (14)

where (a_j, b_j) are the linear combination coefficients of the color line model \alpha_i \approx a_j^T I_i + b_j, and \alpha_i is the alpha matte value of the i-th pixel [16]. Here, w_j is a small window around pixel j. With some algebra, we can show that the marginalized energy function is equivalent to

E(\alpha) = \alpha^T L \alpha,   (15)

where L is the so-called matting Laplacian matrix. When a trimap is given, we can regularize E(\alpha) by minimizing the overall energy function

\min_\alpha \; \alpha^T L \alpha + \lambda (\alpha - \beta)^T D (\alpha - \beta),   (16)

where D is a binary diagonal matrix whose entries are one for pixels labeled in the trimap and zero otherwise, and the vector \beta contains the alpha values specified by the trimap. Thus, for large \lambda, the minimization in (16) forces the solution to satisfy the constraints given by the trimap.
Dual-Layer Matting Laplacian \bar{L}. In the presence of the plate image, we have two pieces of complementary information: the color image I containing the foreground object, and the plate image P. Correspondingly, we have the alpha matte \alpha for I and the alpha matte \alpha^P for P. When \alpha^P is given, we can redefine the color line model as

\alpha_i = a_j^T I_i + b_j \quad \text{and} \quad \alpha_i^P = a_j^T P_i + b_j.   (17)

In other words, we ask the coefficients (a_j, b_j) to fit simultaneously the actual image I and the plate image P. When (17) is assumed, the energy function becomes

\sum_j \Big[ \sum_{i \in w_j} (\alpha_i - a_j^T I_i - b_j)^2 + \eta (\alpha_i^P - a_j^T P_i - b_j)^2 + \epsilon \|a_j\|^2 \Big],   (18)

where we added a constant \eta to regulate the relative emphasis between I and P.
Theorem 2.
The marginal energy function
(19) 
can be equivalently expressed as a quadratic form \alpha^T \bar{L} \alpha, where \bar{L} is the modified matting Laplacian, with the (i, j)-th element
(20) 
Here, \delta_{ij} is the Kronecker delta, and I_i is the color vector at the i-th pixel. The vector appearing in (20) is defined as
(21) 
and the matrix is
(22) 
Proof.
See Appendix. ∎
Because of the plate term in (18), the modified matting Laplacian is positive definite (see the Appendix for a proof), whereas the original matting Laplacian in (15) is only positive semidefinite.
Dual-Layer Regularization \bar{R}. The diagonal regularization matrix \bar{R} in (13) is reminiscent of the binary matrix D in (16), but \bar{R} is defined through a sigmoid function applied to the input z. To be more precise, we define \bar{R} = \mathrm{diag}(r_1, \dots, r_n), where

r_i = 1 / (1 + \exp(-a (z_i - \tau))),   (23)

and z_i is the i-th element of the vector z, which is the argument of F_1(z). The scalar constant a is a user-defined parameter specifying the stiffness of the sigmoid function, and \tau is another user-defined parameter specifying the center of the transition. Typical values for our MACE framework are listed in Table II.
A closer inspection of D and \bar{R} reveals that D performs a hard threshold whereas \bar{R} performs a soft threshold. In fact, the matrix D has diagonal entries

d_i = 1 if z_i > \tau_1 or z_i < \tau_0, and d_i = 0 otherwise,   (24)

for two cutoff values \tau_0 and \tau_1. This hard threshold coincides with the soft threshold in (23) as the stiffness a grows.

There are a few reasons why (23) is preferred over (24), especially when we have the plate image. First, the soft threshold in (23) tolerates more of the error present in z, because the values r_i represent the probability of a pixel being foreground. Second, the one-sided threshold in (23) ensures that the background portion of the image is handled by the plate image rather than by the input z. This is usually beneficial when the plate is reasonably accurate.
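The soft/hard threshold relationship can be verified numerically. In this sketch the stiffness and center values are illustrative (the paper's settings are in Table II), and the hard threshold is one-sided by default:

```python
import numpy as np

def soft_weight(z, a=40.0, tau=0.5):
    """Sigmoid weight in the spirit of (23): values in (0, 1).

    The exponent is clipped to avoid overflow warnings at extreme stiffness."""
    return 1.0 / (1.0 + np.exp(np.clip(-a * (z - tau), -500.0, 500.0)))

def hard_weight(z, tau0=-np.inf, tau1=0.5):
    """Hard-threshold counterpart in the spirit of (24): binary labels for
    pixels outside the band [tau0, tau1] (one-sided when tau0 = -inf)."""
    return ((z > tau1) | (z < tau0)).astype(float)

# Away from the transition point, the sigmoid approaches the hard
# threshold as the stiffness a grows.
z = np.linspace(0.0, 1.0, 101)
away = z[np.abs(z - 0.5) > 0.015]  # exclude the transition itself
gap = np.max(np.abs(soft_weight(away, a=2000.0) - hard_weight(away)))
```

The soft version degrades gracefully near the decision boundary, which is the tolerance-to-error property argued for above.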
III-B Agent 2: Background Estimator
Our second agent is a background estimator, defined as

F_2(z) = \arg\min_\alpha \; \|\alpha - \tilde{\alpha}\|^2 + (\rho/2)\|\alpha - z\|^2 + \gamma \, \alpha^T(1 - \alpha).   (25)

The reason for introducing F_2 is that in F_1, the matrix \bar{R} is determined by the current estimate z. While \bar{R} handles part of the error in z, large missing regions and false alarms can still cause problems, especially in the interior regions. The goal of F_2 is to complement F_1 for these interior regions.

Initial Background Estimate \tilde{\alpha}. Let us take a closer look at (25). The first two terms are quadratic. The interpretation is that, given some fixed initial estimate \tilde{\alpha} and the current input z, F_2 returns a linearly combined estimate between \tilde{\alpha} and z. The initial estimate \tilde{\alpha} consists of two parts:

\tilde{\alpha} = \tilde{\alpha}_c \odot \tilde{\alpha}_e,   (26)

where \odot denotes elementwise multiplication. The first term \tilde{\alpha}_c is the color term, measuring the similarity between foreground and background colors. The second term \tilde{\alpha}_e is the edge term, measuring the likelihood of foreground edges relative to background edges. In what follows we discuss these two terms one by one.
Defining the Color Term \tilde{\alpha}_c. We define the color term by measuring the distance d_i between a color pixel I_i and its corresponding plate pixel P_i. Ideally, this distance should cleanly separate pixels whose color matches the plate from those that do not.
In order to improve the robustness of d_i against noise and illumination fluctuation, we modify d_i by using a bilateral weighted average over a small neighborhood:

d_i = \frac{\sum_{j \in \Omega_i} w_{ij} \|I_j - P_j\|}{\sum_{j \in \Omega_i} w_{ij}},   (27)

where \Omega_i specifies a small window around pixel i. The bilateral weight w_{ij} is defined as

w_{ij} = w_{ij}^s \cdot w_{ij}^r,   (28)

where

w_{ij}^s = \exp(-\|x_i - x_j\|^2 / (2 h_s^2)), \quad w_{ij}^r = \exp(-\|I_i - I_j\|^2 / (2 h_r^2)).   (29)

Here, x_i denotes the spatial coordinate of pixel i, I_i denotes the i-th color pixel of the color image I, and (h_s, h_r) are the parameters controlling the bilateral weight strength. The typical values of h_s and h_r are both 5.
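A direct (unvectorized) sketch of the bilateral-weighted distance is shown below, using gray-scale images for simplicity; the window size is illustrative. Note that h_s = h_r = 5 follows the text, which presumably assumes 8-bit intensities; with [0, 1] toy data the range weight is nearly uniform, which is harmless for this demonstration.

```python
import numpy as np

def bilateral_distance(img, plate, half=2, hs=5.0, hr=5.0):
    """Bilateral-weighted average of |I_j - P_j| over a small window,
    in the spirit of (27)-(29). Gray-scale images for simplicity."""
    H, W = img.shape
    d = np.zeros((H, W))
    raw = np.abs(img - plate)  # per-pixel image-to-plate distance
    for i in range(H):
        for j in range(W):
            y0, y1 = max(0, i - half), min(H, i + half + 1)
            x0, x1 = max(0, j - half), min(W, j + half + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # spatial weight (29), left factor
            ws = np.exp(-((yy - i) ** 2 + (xx - j) ** 2) / (2 * hs ** 2))
            # range weight (29), right factor
            wr = np.exp(-((img[y0:y1, x0:x1] - img[i, j]) ** 2) / (2 * hr ** 2))
            w = ws * wr
            d[i, j] = (w * raw[y0:y1, x0:x1]).sum() / w.sum()
    return d

# Toy example: a bright square appears over an all-zero plate.
img = np.zeros((10, 10))
img[3:7, 3:7] = 1.0
plate = np.zeros((10, 10))
d = bilateral_distance(img, plate)
```

The weighted average suppresses isolated noisy differences while keeping the response high inside genuinely changed regions.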
We now need a mapping which converts the distance d_i into a number in [0, 1], so that the resulting term makes sense as an alpha-like quantity. To this end, we choose a simple Gaussian function:

(30)

where \sigma is a user-tunable parameter. We tested other possible mappings, such as the sigmoid function and the cumulative distribution function of a Gaussian, but we did not see significant differences compared to (30). The typical value for \sigma is 10.

Defining the Edge Term \tilde{\alpha}_e. The color term captures most of the difference between the image and the plate. However, it also generates false alarms when there are illumination changes. For example, shadows cast by the foreground object are often falsely labeled as foreground. See the shadow near the foot in Figure 4.
In order to reduce the false alarms caused by minor illumination changes, we first create a "superpixel" mask by grouping similar colors. Our superpixels are generated by applying a standard flood-fill algorithm [39] to the image I. This gives us a partition of the image as

I = S_1 \cup S_2 \cup \dots \cup S_K,   (31)

where {S_k} are the superpixel index sets. The plate image P is partitioned using the same superpixel indices.
While generating the superpixels, we also compute the gradients of I and P at every pixel, using two-tap horizontal and vertical finite differences. To measure how far the image gradient is from the plate gradient, we compute

(32)

Thus, the measure is small for background regions, where the image gradients and the plate gradients coincide, but large when there is a foreground pixel. If we apply a thresholding operation afterwards, i.e., set the response to zero whenever it falls below a threshold, then shadows can be removed, as their gradients are weak.
Having computed the per-pixel gradient difference, we still need to map it back to a quantity similar to the alpha matte. To this end, we compute a normalization term

(33)

and normalize by

(34)

where 1{\cdot} denotes the indicator function and the remaining constants are thresholds. In essence, (34) says that within the k-th superpixel S_k, we count the number of edges that differ strongly between I and P. However, we do not want to count every pixel, but only pixels that already contain strong edges, either in I or in P. Thus, we take a weighted average using the edge strength as the weight. This defines the edge term \tilde{\alpha}_e, as the weighted average is shared among all pixels in the superpixel S_k. Figure 5 shows a pictorial illustration.
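A rough sketch of the edge term is given below. The superpixel labels are assumed precomputed (e.g., by flood fill), and the edge threshold is hypothetical; the exact weighting in (32)-(34) may differ in detail from this sketch.

```python
import numpy as np

def edge_term(img, plate, labels, t_edge=0.1):
    """Per-superpixel edge-disagreement score, in the spirit of (32)-(34).

    labels: integer superpixel index map of the same shape as img."""
    def grads(x):
        # two-tap horizontal and vertical finite differences
        gx = np.diff(x, axis=1, append=x[:, -1:])
        gy = np.diff(x, axis=0, append=x[-1:, :])
        return gx, gy

    gix, giy = grads(img)
    gpx, gpy = grads(plate)
    # per-pixel gradient disagreement between image and plate
    r = np.abs(gix - gpx) + np.abs(giy - gpy)
    # weight: only count pixels that already contain strong edges in I or P
    w = np.maximum(np.abs(gix) + np.abs(giy), np.abs(gpx) + np.abs(gpy))

    out = np.zeros(img.shape)
    for k in np.unique(labels):
        m = labels == k
        denom = w[m].sum()
        # weighted fraction of edges that disagree strongly, shared by
        # every pixel in the superpixel
        out[m] = (w[m] * (r[m] > t_edge)).sum() / denom if denom > 1e-12 else 0.0
    return out

# Toy example: a new object in the left superpixel, nothing in the right.
img = np.zeros((10, 10))
img[2:5, 2:5] = 1.0
plate = np.zeros((10, 10))
labels = np.zeros((10, 10), dtype=int)
labels[:, 5:] = 1
score = edge_term(img, plate, labels)
```

Because shadows change intensity but introduce few new edges, their gradient disagreement stays below the threshold and the corresponding superpixels score low, as the text argues.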
Why is the edge term helpful? If we look at \tilde{\alpha}_c and \tilde{\alpha}_e in Figure 6, we see that the foreground pixels of \tilde{\alpha}_c and \tilde{\alpha}_e coincide, while the background pixels roughly cancel each other. The reason is that while \tilde{\alpha}_c creates weak holes in the foreground, \tilde{\alpha}_e fills the gap by ensuring that the foreground is marked.
Regularization. The last term in (25) is a regularization that forces the solution toward either 0 or 1. The effect of this term can be seen from the fact that \alpha^T(1 - \alpha) is a symmetric concave quadratic function whose value is zero when every entry of \alpha is 0 or 1. Therefore, it penalizes solutions that are away from 0 and 1. For a suitable range of the regularization weight, one can show that the Hessian matrix of the overall objective is positive semidefinite, and moreover that the objective is strongly convex.
III-C Agent 3: Total Variation Denoising
The third agent we use in this paper is a total variation denoiser:

F_3(z) = \arg\min_\alpha \; \beta \|\alpha\|_{TV} + (\rho/2)\|\alpha - z\|^2,   (35)

where \beta is a parameter. The norm \|\cdot\|_{TV} is defined in space-time:

(36)

with weights controlling the relative strength of the gradient in each direction. In this paper, for spatial total variation we use equal horizontal and vertical weights, and for spatial-temporal total variation we additionally include a temporal gradient term.
A denoising agent is used in the MACE framework because we want to ensure smoothness of the resulting matte. The choice of the total variation denoiser is a balance between complexity and performance. Users can use stronger denoisers such as BM3D. However, these patch-based image denoising algorithms rely on a patch-matching procedure, so they tend to under-smooth repeated patterns of false alarms / misses. Neural network denoisers are better candidates, but they need to be trained on the specifically distorted alpha mattes. From our experience, we do not see any particular advantage of using CNN-based denoisers. Figure 7 shows a comparison.
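As a stand-in for the TV agent, the following sketch minimizes a smoothed (Charbonnier) approximation of a TV-regularized objective by gradient descent. This is a simplification for illustration: production implementations of (35) would use a proximal solver such as Chambolle's algorithm, and the weights below are hypothetical.

```python
import numpy as np

def tv_eps(x, eps=1e-2):
    """Smoothed (Charbonnier) anisotropic total variation of a 2D array."""
    dh = np.diff(x, axis=1)
    dv = np.diff(x, axis=0)
    return np.sqrt(dh ** 2 + eps).sum() + np.sqrt(dv ** 2 + eps).sum()

def tv_denoise(z, lam=0.2, step=0.05, iters=400, eps=1e-2):
    """Gradient descent on 0.5*||x - z||^2 + lam * TV_eps(x).

    The step size is chosen small enough for monotone descent given the
    smoothing parameter eps."""
    x = z.copy()
    for _ in range(iters):
        dh = np.diff(x, axis=1)
        dv = np.diff(x, axis=0)
        wh = dh / np.sqrt(dh ** 2 + eps)  # derivative of sqrt(dh^2 + eps)
        wv = dv / np.sqrt(dv ** 2 + eps)
        g = x - z                          # gradient of the data term
        g[:, 1:] += lam * wh
        g[:, :-1] -= lam * wh
        g[1:, :] += lam * wv
        g[:-1, :] -= lam * wv
        x = x - step * g
    return x

# Toy example: a noisy vertical step edge (a crude stand-in for a matte).
rng = np.random.default_rng(1)
clean = np.zeros((20, 20))
clean[:, 10:] = 1.0
noisy = clean + 0.1 * rng.standard_normal((20, 20))
denoised = tv_denoise(noisy)
```

The example illustrates why TV suits mattes: the piecewise-constant prior removes isolated false alarms while the step edge (the matte boundary) is largely preserved.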
III-D Parameters and Runtime
Typical parameter values for the proposed method are presented in Table II. Most parameters are rarely changed. One controls the denoising strength of Agent 3; another, with a default value of 0.05, produces more binary results with clearer boundaries as it increases. Two parameters determine the edge term in Agent 2 and are fixed, while another determines the color term in Agent 2: a larger value produces fewer false negatives but more false positives. Overall, the performance is reasonably stable with respect to these parameters.
Parameter  
Value  0.01  2  4  0.05  0.01  0.02  10 
In terms of runtime, the most time-consuming part is Agent 1, because we need to solve a large-scale sparse least-squares problem; its runtime is determined by the number of foreground pixels. Table III shows the runtime for the sequences we tested. In generating these results, we used unoptimized MATLAB code on an Intel i7-4770K. The typical runtime is a few minutes per frame, which can be further improved by porting the algorithm to C++ or GPUs. From our experience working with professional artists, even with professional film production software, e.g., NUKE, it takes 15 minutes to label a single ground-truth frame using the plate and temporal cues. Therefore, the runtime benefit offered by our algorithm is substantial.
Table III: Characteristics of the tested sequences. Additional attribute columns (shadow, lighting issues, background vibration, camouflage, green screen, ground truth) are discussed in the text.

Purdue Dataset
Sequence    | Resolution | FGD %  | time/Fr (sec) | Indoor/Outdoor
Book        | 540x960    | 19.75% | 231           | outdoor
Building    | 632x1012   | 4.03%  | 170.8         | outdoor
Coach       | 790x1264   | 4.68%  | 396.1         | outdoor
Studio      | 480x270    | 55.10% | 58.3          | indoor
Road        | 675x1175   | 1.03%  | 232.9         | outdoor
Tackle      | 501x1676   | 4.80%  | 210.1         | outdoor
Gravel      | 790x1536   | 2.53%  | 280.1         | outdoor
Office      | 623x1229   | 3.47%  | 185.3         | indoor

Public Dataset [14]
Bootstrap   | 480x640    | 13.28% | 109.1         | indoor
Cespatx     | 480x640    | 10.31% | 106.4         | indoor
DCam        | 480x640    | 12.23% | 123.6         | indoor
Gen         | 480x640    | 10.23% | 100.4         | indoor
Multipeople | 480x640    | 9.04%  | 99.5          | indoor
Shadow      | 480x640    | 11.97% | 115.2         | indoor
IV Experimental Results
IV-A Dataset
To evaluate the proposed method, we create a Purdue dataset containing 8 video sequences captured using the HypeVR Inc. 360-degree camera shown in Figure 8. The original images are captured at a frame rate of 48 fps and are then downsampled and cropped to speed up the matting process. In addition to these videos, we also include 6 video sequences from a public dataset [14], making a total of 14 video sequences. Snapshots of the sequences are shown in Figure 9. All video sequences are captured without camera motion. Plate images are available, either during the first or the last few frames of each video. To enable objective evaluation, for each video sequence we randomly select 10 frames and manually generate the ground truths. In total, there are 140 frames with ground truth.
Figure 9: (a) Snapshots of the Purdue dataset; (b) snapshots of the public dataset [14].
The characteristics of the dataset are summarized in Table III. The Purdue dataset has various resolutions, whereas the public dataset has a single resolution. The foreground percentage of the Purdue dataset videos ranges from 1.03% to 55.10%, whereas the public dataset sequences have similar foreground percentages of roughly 10%. The per-frame runtime of the algorithm is determined by the resolution and the foreground percentage. In terms of content, the Purdue dataset focuses on outdoor scenes, whereas the public dataset is indoor only. The shadow column indicates the presence of shadows. Lighting issues include illumination changes due to auto-exposure and auto-white-balance. Background vibration applies only to outdoor scenes where the background objects have minor movements, e.g., moving grass or tree branches. The camouflage column indicates color similarity between foreground and background, which is a common problem for most sequences. The green screen column shows which sequences include green screens to mimic a common chroma-keying environment.
IV-B Competing Methods
We categorize the competing methods into three different categories. The key ideas of the competing methods are summarized in Table IV.
Table IV: Key ideas of the competing methods.

Methods       | Automated? | Key idea
BSVS [25]     | semi-auto  | bilateral space video segmentation
Matting [43]  | trimap     | trimap generation + alpha matting
Grabcut [29]  | full-auto  | iterative graph cuts video segmentation
ViBe [44]     | full-auto  | pixel-model-based background subtraction
NLVS [45]     | full-auto  | non-local voting video segmentation
Pbas [15]     | full-auto  | non-parametric background subtraction

Fully-automatic methods: We consider two background subtraction algorithms, the pixel-based adaptive segmenter (Pbas) [15] and the visual background extractor (ViBe) [44]; one video segmentation method, non-local consensus voting (NLVS) [45]; and a modified version of GrabCut [29]. Pbas [15] and ViBe [44] run in real time, but the masks they generate have low quality. NLVS [45] uses optical flow as its main cue and exploits similarities among superpixels; thus it suffers from errors introduced in the superpixel step. The modified version of GrabCut [29] uses the plate image and no longer requires user input. It performs reasonably well when the background and foreground have different colors, but suffers when the foreground and background colors are similar.

Semi-automatic method: We consider bilateral space video segmentation (BSVS) [25], a semi-supervised video segmentation algorithm operating in the bilateral space. BSVS is semi-supervised because it requires the user to provide ground-truth labels for key frames.

Alpha matting using trimaps: We consider a state-of-the-art alpha matting algorithm based on a CNN [43]. The trimaps are generated by applying a frame difference between the plate and color images, followed by morphological and thresholding operations.
We emphasize that while these competing methods are arguably the closest to our proposed method, they are considerably different. For example, BSVS [25] does not require a plate image, but it requires key frames. NLVS [45] also does not require a plate image, but it cannot handle videos with relatively stationary objects, because it relies on optical flow.
IV-C Results
We use intersection over union (IoU) as the evaluation metric. The IoU consists of two parts, the intersection I = \sum_i \min(\alpha_i, \bar{\alpha}_i) and the union U = \sum_i \max(\alpha_i, \bar{\alpha}_i), and is given by IoU = I / U. In these expressions, \alpha_i is the i-th pixel of the estimated alpha matte, and \bar{\alpha}_i is that of the ground truth.
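A minimal implementation of this metric (using the min/max soft extension of binary IoU; the paper's exact definition may differ in detail):

```python
import numpy as np

def soft_iou(est, gt):
    """Soft IoU of two alpha mattes: sum of pixelwise min over sum of
    pixelwise max. Reduces to standard IoU for binary masks."""
    inter = np.minimum(est, gt).sum()
    union = np.maximum(est, gt).sum()
    return inter / union if union > 0 else 1.0

# Tiny example: est = [0.5, 1, 0], gt = [1, 1, 0]
est = np.array([0.5, 1.0, 0.0])
gt = np.array([1.0, 1.0, 0.0])
val = soft_iou(est, gt)  # (0.5 + 1.0) / (1.0 + 1.0) = 0.75
```

For binary masks the min/max sums reduce to the usual intersection and union pixel counts.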
Comparison with fully-automatic methods: The results are shown in Figure 10, where we plot the overall IoU for the different methods. The chart shows that the proposed method performs the best. Among the video sequences, the competing methods NLVS and Pbas perform worst on Book, Bootstrap, Coach, and Studio. In these sequences, the foreground object is stationary or has only rotational movement, so it is treated as background.
Comparison with the semi-automatic method: We compare our method with BSVS [25]. Since BSVS requires ground-truth key frames, in this experiment we use 10 key frames for each video sequence. When testing, we feed the video sequence and the key frames to the algorithm. The output is a processed sequence in which even the key frames are modified, because the algorithm propagates errors.
The results of this experiment show that the proposed algorithm is on average better than BSVS, even though BSVS has ground-truth key frames. For the Office sequence, BSVS scores an IoU of 0.77, while the proposed method scores 0.95. BSVS performs poorly because the background is complex and sometimes similar to the foreground color; even with key-frame ground truth, BSVS is still confused. A similar observation holds for Gravel. BSVS performs better on the Shadow sequence, scoring 0.96 compared with 0.90 for the proposed method, because the key-frame ground truth helps prevent mislabeling the shadows as foreground. On average, the proposed method scores 0.93, which is higher than the 0.86 scored by BSVS.
Comparison with trimap + alpha-matting methods: In this experiment we compare with several state-of-the-art alpha matting algorithms. The visual comparison is shown in Figure 3, and the IoU value of DCNN [10] is shown in Figure 10. In order to make this approach work, careful tuning of the trimap generation stage is required.
Figure 3 and Figure 10 show that most alpha matting algorithms suffer from false alarms near the boundary, e.g., spectral matting [6], closed-form matting [16], learning-based matting [7], and comprehensive sampling [9]. The more recent methods, such as K-nearest neighbors matting [8] and DCNN [10], have comparable amounts of false alarms and misses. Yet the overall performance is still worse than the proposed method.
IV-D Ablation Study and MACE with Various Denoisers
Since the proposed framework contains three different agents, we conduct an ablation study to verify the relative importance of the individual agents. To do so, we remove one of the three agents while keeping the other two fixed. The result is shown in Figure 12. Not surprisingly, the matting agent has the most impact on the performance, followed by the background estimator and the denoiser. The drop in performance is most significant for hard sequences such as Book, which contains a moving background, and Road, which contains strong color similarity between foreground and background. The proposed method scores 0.98 for Book, while in the absence of Agent 1 or Agent 3 it scores only 0.38 and 0.48, respectively. The performance drop is even larger for Road, with the proposed method scoring 0.91, versus 0.06 without Agent 1 and 0.11 without Agent 3.
In this ablation study, we also observe spikes of error for some scenes when the background estimator is absent from MACE. This is because, without the binarizing term from the background estimator, the result looks grayish instead of close-to-binary, and this behavior leads to the error spikes. For example, the performance drops from 0.94 for the proposed method to 0.72 when Agent 2 is removed. For green screen scenes like Tackle and Coach, since the initial estimate is already close to the ground truth, the performance does not suffer when the matting or denoiser agents are absent.
For Agent 3, the denoiser, we observe that the total variation denoiser leads to the best performance within MACE. In the visual comparison shown in Figure 7, IRCNN [42] produces more detailed boundaries but fails to remove false alarms near the feet. BM3D [41] removes false alarms better but produces less detailed boundaries. TV, on the other hand, produces a more balanced result. As shown in Figure 12, BM3D performs similarly to IRCNN, scoring 0.85 on average, while TV performs the best with an average score of 0.93. BM3D and IRCNN score the worst on the Road sequence, as they fail to remove false alarms caused by moving tree branches in the background: the false alarms form a repetitive pattern, which confuses BM3D because it denoises by block matching.
V Limitations and Discussion
While the proposed method demonstrates superior performance compared with state-of-the-art methods, it also has several limitations.


Quality of Plate Image. The plate assumption may not hold when the background moves substantially. In that case, a more complex background model that includes dynamic information is needed. However, if the background is non-stationary, additional designs are needed to handle local errors and temporal consistency.

Loss of Fine Details. In our proposed method, fine details such as hair are compromised for robustness. Figure 13 illustrates an example. In some videos, the foreground and background colors are similar. This creates holes in the initial estimate, which can be filled by a strong denoiser such as total variation. However, total variation is known to oversmooth fine details. To mitigate this issue, an additional post-processing step using alpha matting could bring back the details around the boundary.

Strong Shadows. Strong shadows are sometimes treated as foreground, as shown in Figure 14. This is caused by the lack of shadow modeling in the problem formulation. The edge-based initial estimate can resolve the shadow issue to some extent, but not when the shadow is very strong. We tested a few off-the-shelf shadow removal algorithms [46, 47, 48], but generally they do not help, because shadows in our dataset can be cast on the foreground object itself, where they should not be removed.
VI Conclusion
We presented a new foreground extraction algorithm based on the multi-agent consensus equilibrium (MACE) framework. MACE is an information fusion framework that integrates multiple weak experts to produce a strong estimator. Equipped with three customized agents, a dual-layer closed-form matting agent, a background estimation agent, and a total variation denoising agent, MACE offers substantially better foreground masks than state-of-the-art algorithms. MACE is fully automatic, meaning that no human intervention is required; this is a significant advantage over semi-supervised methods, which require trimaps or scribbles. In its current form, MACE is able to handle minor variations in the background plate image, illumination changes, and weak shadows. Extreme cases, e.g., large background movement or strong shadows, can still cause MACE to fail, but these could potentially be overcome by improving the background and shadow models.
VII Acknowledgment
The authors thank the anonymous reviewers for their valuable feedback. This work is supported, in part, by the National Science Foundation under grants CCF-1763896 and CCF-1718007.
VIII Appendix
VIII-A Proof of Theorem 2
Proof.
We start by writing (18) in the matrix form
where
and the subscript denotes the index of the i-th pixel in the neighborhood. The difference from classical closed-form matting [16] is the new plate terms (i.e., the second row of the quadratic function above).
Denote
(37) 
and using this fact, we can obtain the solution of the least-squares optimization:
(38) 
We now need to simplify the term . First, observe that
where we define the terms , and . Then, by applying the block inverse identity, we have
(39) 
where we further define and .
Substituting (38) back and using (39), we have
where
(40) 
The th element of is therefore
(41) 
Adding terms in each , we finally obtain
∎
VIII-B Proof: The Modified Matting Laplacian is Positive Definite
Proof.
Recall the definition of :
Based on Theorem 2 we have,