1 Introduction
Reflections and transparency are prevalent in real scenes, and are typically viewed as undesirable. Unfortunately, it is nontrivial to remove them. The observed image can be generally modeled as a linear combination of a transmitted layer (which contains the scene of interest) and a secondary layer (which contains the reflection or transparency). Typical examples include a picture behind a glass cover and a scene blocked by a sheer curtain. Extracting the transmitted layer from the observed image is an inherently ill-posed problem: there are two unknown layers but only one equation. To make this underconstrained problem more tractable, existing solutions either impose additional priors (e.g., through user inputs or spatial regularization) [16, 17] or use more constraints (e.g., by capturing more photographs) [29, 31, 18, 10].
In this paper, we present a new computational imaging solution by exploiting emerging light field imaging techniques. A light field (LF) captures an array of images from a grid of viewpoints. It can be viewed as a single-shot multi-view imaging system. The multi-view attribute enables reliable depth estimation [11, 32, 14, 6] that eliminates the need for the homography assumption in [29, 9, 31, 10]. Our technique begins with estimating an initial disparity map using SIFT flow [20]. We then warp all LF views to the reference view (in our case, the central camera) to form an image stack. We show that the image stack exhibits a low-rank property, and we apply Robust Principal Component Analysis (RPCA) for simultaneous layer separation and disparity refinement.
A unique advantage of our LF-based solution is that we can represent scene geometry as a single disparity map under which the resulting warped image stack will be low-rank. In contrast, the warped image stack in previous multi-view approaches is only low-rank when scene geometry is planar (via homographic warping on the cropped common region), and these approaches can break down on complex scenes (Fig. 4 and 5). We conduct experiments on both synthetic and real data. In particular, we construct a mini LF array that is portable and can be controlled by a single tablet. Results on static and dynamic scenes show that our technique is robust and reliable and can handle a broad range of challenging layer separation problems.
2 Related Work
The problem of image layer separation is ill-posed, and typically relies on additional priors or constraints. Earlier approaches rely on user inputs to provide priors on the two layers. Levin et al. [16] develop a user-assisted system to label image gradients as belonging to one of the two layers. An automatic method can then be used to search for a decomposition that minimizes the total amount of edges and corners, using a database of natural image patches [17].
To automate the layer separation process, more recent techniques use multiple images, either from a fixed viewpoint with varying camera settings (such as flash, focus, and polarization), or from multiple viewpoints through the use of a handheld camera [13, 29, 9, 31, 28, 18, 10]. In the case of the fixed viewpoint approach, [8, 27, 15] exploit the effect of reflection under different rotation angles of a polarizer. Agrawal et al. [4] show how a flash/no-flash image pair can be used to remove both reflections and highlights through gradient filtering and integration. Schechner et al. [26] propose to vary the focus of the camera for eliminating reflection artifacts. The use of different modes of capture is complementary to our technique.
Methods for separating layers using multiple-viewpoint images are based on the intuition that the transmitted layer and reflection undergo different motions under changing views. Szeliski et al. [29] propose to separate the two layers by estimating global and local motions. Gai et al. [9] study the statistics of natural images to extract both the motions of the two layers and their mixing coefficients. In a similar vein, Tsin et al. [31] assume locally planar motion and require dense image capture to estimate both the depth and appearance of each layer through EPI analysis. Sinha et al. [28] speed up the process by adopting piecewise planar scene models and extend semi-global matching [13] for reliable layer separation. More recently, Guo et al. [10] correlate all images through homography and then conduct low-rank decomposition to effectively separate the reflection layer from the transmitted layer. Although these techniques are effective, the requirement of capturing multiple and often many images of the scene from different viewpoints, and hence at different time instances, significantly limits their applicability. Further, there is an implicit assumption that the scene is mostly planar and can be rectified via a homography.
We seek a single-shot solution through LF imaging. The concept of LF imaging can be traced back to integral photography by Lippmann [19], in which a lenslet array is used to emulate acquisition of multiple viewpoints [3, 24, 21]. Handheld plenoptic cameras are now commercially available [22] and mobile camera arrays [1, 2] will be on the market soon. In our experiments, we use a mini LF camera array to support on-site acquisition. Techniques that capitalize on the availability of such cameras include [12, 11] (variational shape from LF data), [33] (line-assisted stereo matching), [30] (depth estimation of glossy surfaces), and [6] (robust stereo matching).
3 Problem Formulation
In our work, we capture the LF of the scene (transmitted layer) that has been superimposed with a secondary layer (e.g., reflection). The inputs are LF images from different viewpoints, and we take the central view as the reference view. Our goal is to separate the layers for the reference view by exploring redundant information that is available from the other views. To account for scene appearance in all the views, we estimate the disparity map of the transmitted layer; this map is used to align all the LF views with respect to the reference to facilitate layer separation. The disparity map estimation and layer separation steps are done iteratively.
We first explain our notation. Our LF consists of a 2D grid of viewpoints, each capturing an image at a fixed resolution. Each 2D sub-aperture image is unrolled into a 1D image vector, and an index map gives each view's position within the 2D viewpoint grid. We assume the views are uniformly sampled horizontally and vertically with an identical baseline, and the disparity map is defined for the reference view with respect to its one-hop neighbor view. Warping a view to the reference using this disparity map gives a warped image. As with the input images, the transmitted and secondary layer images are also unrolled into 1D row vectors. Given the disparity map, we can compute all the warped views and stack them to form a matrix. Each warped LF image then contains the warped transmitted and secondary layers, and we can similarly stack the warped transmitted layers and the warped secondary layers into two matrices. Fig. 2 illustrates the warping process.
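The warping and stacking step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `bilinear_sample` and `warp_stack` are hypothetical, each view is a grayscale array, each view's grid position is given as an integer offset `(du, dv)` from the reference view, and the sign convention of the shift is assumed.

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample img at real-valued (xs, ys), clamping coordinates at the border."""
    h, w = img.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx = xs - x0
    fy = ys - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bottom = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bottom

def warp_stack(views, offsets, disparity):
    """Warp every view to the reference and stack the results as row vectors.

    views:     list of HxW arrays (sub-aperture images)
    offsets:   list of (du, dv) grid offsets from the reference view (assumed)
    disparity: HxW disparity of the reference w.r.t. its one-hop neighbor
    """
    h, w = disparity.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    rows = []
    for img, (du, dv) in zip(views, offsets):
        # A view that is (du, dv) hops away is displaced by (du, dv) * disparity.
        warped = bilinear_sample(img, xs + du * disparity, ys + dv * disparity)
        rows.append(warped.ravel())  # unroll the 2D image into a 1D row vector
    return np.vstack(rows)           # one row per view
```

With a correct disparity map, the rows that come from the transmitted layer are nearly identical, which is what makes the stacked matrix close to rank one.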
Our goal is to recover the transmitted layer, the secondary layer, and the disparity map from a single equation. Since this problem is ill-posed, we need to impose additional constraints as in [10]. First, the transmitted layer should be the same in every view after disparity warping to the reference view, and therefore its stacked matrix should be of low rank. In contrast, the warped secondary layer should have low pixel-wise coherence across views because it is warped using the disparity of the transmitted layer rather than its own disparity map, and therefore its stacked matrix should be sparse. In addition, the transmitted and secondary layers should be independent and their gradients sparse. Putting all these together, we formulate the layer separation problem as energy minimization:
(1)  
where the three measures used are the rank function, the ℓ0 norm, and the Frobenius norm respectively; an intermediate variable is introduced for refining the disparity map; the product between the layer gradients is element-wise; and the gradient is a finite difference operator applied to an image in both the x and y directions.
In this formulation, the first term forces the rank of the transmitted-layer matrix to be low. The second and third terms force the gradients of the two layers to be mutually independent. The fourth and fifth terms impose the sparse gradient prior on natural images. The last two terms employ TV to refine the disparity map. We choose TV instead of TV as the regularization term for two reasons. First, a disparity map is largely piecewise constant. Second, the norm measure is commonly used for evaluating the percentage of bad pixels on disparity maps [25]. Therefore, can be interpreted as the convexification of bad pixel percentage in . We further impose hard constraints that and be nonnegative (). The optimization problem, however, is NP-hard. We follow [5] to solve an alternative convex relaxation problem:
(2)  
where the nuclear norm replaces the rank function and the ℓ1 norm replaces the ℓ0 norm in Eq. 1.
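To make the convex relaxation concrete, the sketch below solves the basic RPCA problem of [5], splitting a matrix into a low-rank part (nuclear norm) plus a sparse part (ℓ1 norm) with an inexact augmented Lagrange multiplier scheme. It is only an illustration of the relaxation: it omits the gradient, TV, and disparity terms of Eq. 2, and the function names, the default weight `lam`, and the penalty schedule (growth factor 1.5) are assumptions taken from common RPCA practice rather than from this paper.

```python
import numpy as np

def soft_threshold(x, tau):
    """Element-wise shrinkage: the proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def singular_value_threshold(x, tau):
    """Shrink singular values: the proximal operator of the nuclear norm."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt

def rpca_ialm(m, lam=None, tol=1e-7, max_iter=200):
    """Split m into low-rank + sparse by minimizing ||L||_* + lam * ||S||_1."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(m.shape))      # common default weight (assumed)
    spectral = np.linalg.norm(m, 2)
    y = m / max(spectral, np.abs(m).max() / lam)  # initial Lagrange multiplier
    mu = 1.25 / spectral                          # initial penalty parameter
    low = np.zeros_like(m)
    sparse = np.zeros_like(m)
    for _ in range(max_iter):
        low = singular_value_threshold(m - sparse + y / mu, 1.0 / mu)
        sparse = soft_threshold(m - low + y / mu, lam / mu)
        residual = m - low - sparse
        y = y + mu * residual                     # multiplier update
        mu = mu * 1.5                             # increase the penalty
        if np.linalg.norm(residual) <= tol * np.linalg.norm(m):
            break
    return low, sparse
```

In the setting of this paper, the input would be the stack of warped views, the low-rank output would play the role of the transmitted layer, and the sparse output would play the role of the secondary layer.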
The new formulation now allows convex optimization. However, the 3D warping function is still highly nonlinear. In order to linearize the warping function, we further formulate as:
(3) 
where is the image pixel coordinate. In order to convert the objective function into a convex model, we follow [11] to linearize the warped images using first order Taylor approximation on disparity at iteration . For each image, we have:
(4) 
where is
(5) 
Letting , we rewrite the constraint in Eq. 2 as:
(6) 
where , , and is the standard basis for . The constraint can be regarded as linearizing the 3D warping operation with respect to the disparity map.
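The idea behind the linearization can be illustrated on a single image row. The sketch below is a simplified 1D analogue and not the formulation of Eqs. 4 to 6: it approximates a row warped by disparity `d` using the row warped by the current estimate `d0` plus a first-order correction along the image gradient. The function names and the view offset `k` are hypothetical.

```python
import numpy as np

def warp_row(row, shift):
    """Resample a 1D signal at x + shift with linear interpolation."""
    x = np.arange(row.size, dtype=float)
    return np.interp(x + shift, x, row)

def linearized_warp(row, k, d0, d):
    """First-order Taylor approximation of warp_row(row, k * d) around d0.

    k:  offset of the view from the reference, in hops (assumed scalar)
    d0: current disparity estimate (scalar or per-pixel array)
    d:  disparity at which the warp is approximated
    """
    grad = np.gradient(row)  # finite-difference image gradient
    # I(x + k d) ~ I(x + k d0) + k * I'(x + k d0) * (d - d0)
    return warp_row(row, k * d0) + k * warp_row(grad, k * d0) * (d - d0)
```

Because the approximation is linear in `d`, it can be placed inside a convex program. It is only accurate near `d0`, which is one reason the estimate is refined over several iterations.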
Finally, we combine all priors to simultaneously solve for the transmitted and secondary layers as well as the disparity map by solving the following convex optimization problem:
(7)  
4 Optimization
In this section, we describe how to optimize the objective function defined in Eq. 7. The algorithm is outlined in Algorithm 1 and illustrated in Fig. 3.
4.1 Initialization
Our approach starts by warping the subaperture images to the center view. Previous studies assume global parametric motion (e.g., homographies [29, 9, 10]). Despite its computational efficiency and robustness, this approach is unable to handle more complex parallax. In reality, the transmitted layer is unlikely to be planar and a dense 3D reconstruction would be needed for warping the images. Conceptually, we can apply LF stereo matching such as [32, 14, 6] to first estimate the 3D geometry. However, with the secondary layer corrupting the transmitted layer, direct depth estimation incurs significant errors. In our implementation, we use SIFT flow [20] for correspondence, since it has been shown to be effective for registering reflective scenes [18, 23].
Similar to optical flow, SIFT flow only allows descriptors to be matched along the flow vector, which is composed of horizontal and vertical components. This fits our model well, since the relative motion between the sub-aperture images and the reference image should approximately follow the flow. The initial disparity is then obtained by averaging local flows, i.e.,
(8) 
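One plausible way to average local flows into a single disparity map is sketched below. Since the exact form of Eq. 8 is not reproduced here, this is an assumption and not necessarily the authors' formula: the flow from a view that is `(du, dv)` hops away from the reference is assumed to be roughly `disparity * (du, dv)`, so each view yields one estimate by projecting its flow onto its offset direction, and the estimates are averaged. The function name `initial_disparity` is hypothetical.

```python
import numpy as np

def initial_disparity(flows, offsets):
    """Average per-view disparity estimates derived from dense flow fields.

    flows:   list of (fx, fy) pairs, each an HxW array of flow components
    offsets: list of (du, dv) grid offsets from the reference view (assumed)
    """
    estimates = []
    for (fx, fy), (du, dv) in zip(flows, offsets):
        if du == 0 and dv == 0:
            continue  # the reference view carries no disparity information
        # Least-squares fit of flow = d * (du, dv) at every pixel.
        estimates.append((fx * du + fy * dv) / float(du * du + dv * dv))
    return np.mean(estimates, axis=0)
```

The projection discards flow components perpendicular to the expected direction, which gives some tolerance to matches that stray off the epipolar direction.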
4.2 Iterative Optimization
Given the initial disparity estimation, we use the recently proposed Augmented Lagrange Multiplier (ALM) with Alternating Direction Minimizing (ADM) strategy [10] to optimize the objective function in Eq. 7. Specifically, we can separate the objective into individual subproblems by introducing five auxiliary variables: . We also use an intermediate variable to represent . Under our formulation, the augmented Lagrangian function can now be represented as:
(9)  
where , is a positive scalar, and are Lagrange multipliers. The goal of ALM is to find a saddle point of , which approximates the solution of the original problem. We adopt the alternating direction method to iteratively solve the subproblems. The solutions and steps for each subproblem are listed in the Appendix (attached as supplementary material).
Once we obtain the solutions at each iteration, we further update the multipliers as:
(10)  
Algorithm 1 shows the complete process. The algorithm terminates when the change of the objective function between two consecutive iterations is very small (0.1 in our experiments). The inner loop terminates when or the maximum number of iterations is reached.
5 Experiments
We have conducted experiments on both synthetic and real data. All experiments are conducted on an Intel i7 PC (3.2GHz CPU, 16GB RAM) with the same set of parameters. We compared our results to two state-of-the-art techniques, [18] and [10], using the authors’ source code with default parameters.
We first add synthetic reflections by superposing an additional layer on the Stanford LF images [7]. The resolution of the synthetic images is of and the motion of the additive layer is set to 20 pixels between adjacent views, opposite to the camera motion. Fig. 4 shows that our technique outperforms these alternative solutions in both accuracy and visual quality. This illustrates the importance of recovering the 3D shape of the transmitted layer. The multi-image technique of [10] uses homography (i.e., planes) as priors to register multiple images onto a common viewpoint. In our examples (e.g., the Stanford Bunny), the transmitted layer is non-planar and exhibits complex depth variations. As a result, [10] produces relatively large errors and ghosting artifacts due to image misalignment. In contrast, our technique has significantly fewer artifacts while recovering a relatively high quality disparity map. To illustrate the limitation of homography in transforming 3D scenes, we compare the transformed layers shown in Fig. 5. Disparity-based warping produces more consistency than homography on the transmitted layer.
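The synthetic compositing step can be sketched as below. This is a hedged illustration, not the script used for the experiments: the function name is hypothetical, views are grayscale arrays with integer grid offsets `(du, dv)` from the reference, the layers are simply added without a blending weight, the sign of the shift is assumed, and `np.roll` wraps the reflection around the border instead of cropping it.

```python
import numpy as np

def add_synthetic_reflection(transmitted_views, offsets, reflection, step=20):
    """Superpose a shifted copy of a reflection image on every LF view.

    step: motion of the additive layer between adjacent views, in pixels
    """
    composites = []
    for view, (du, dv) in zip(transmitted_views, offsets):
        # Shift opposite to the camera offset (assumed sign convention).
        shifted = np.roll(reflection, shift=(-step * dv, -step * du), axis=(0, 1))
        composites.append(view + shifted)
    return composites
```

A motion of 20 pixels per view is much larger than the disparity of the transmitted layer, so the reflection stays misaligned after the views are warped with the transmitted layer's disparity.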
The technique of [18] is most similar to ours. It also models the transformation of the transmitted layers across different views as a flow field and uses SIFT flow for image warping. Therefore it is expected to better handle a non-planar transmitted layer, as shown in column 3 in Fig. 4. However, it computes the flow field only once (at the beginning). Consequently, the separation quality is heavily reliant on the quality of flow initialization. For example, the bunny on the transmitted layer appears blurred in Fig. 4 since the initial disparity estimation is erroneous.
By comparison, our technique incorporates disparity estimation and layer separation into an iterative joint optimization framework. The benefits of our technique can be seen in Fig. 4, with better detail recovery and better overall quality of layer separation.
For real experiments, we need to capture LF images with a reasonable baseline between adjacent viewpoints. We did not use the Lytro [22] because it has an ultra-small baseline that limits its working range to only about 6 inches, whereas existing camera arrays are too bulky for practical use. We built our own portable LF array consisting of Microsoft LifeCam HD-6000 USB cameras on a 3D-printed grid (Fig. 1). The resolution of each camera is , and the baseline can be set to either 1, 2, or 3 inches. To capture static scenes, we connect all cameras to a Keynice H1088 10-port hub powered by an Anker Astro Pro2 external battery pack. A single HP Stream 7 tablet is used to trigger individual cameras and store data. It takes around 1 second to take all 9 shots at full resolution. To capture dynamic scenes, we connect the cameras to a workstation equipped with 3 PCIE USB 3.0 adaptors, each having 4 dedicated 5Gbps channels. This configuration allows us to record HD (720p) LF videos at 30 fps. We pre-calibrate the cameras using the technique described in [34].
For validation, we captured some scenes with a reflective layer and others with a translucent layer. We first capture an LF of a painting within a glass frame using the 3-inch baseline. This is a typical problem that [10] aims to solve. Our method produces comparable results. However, it is worth noting that [10] requires users to manually find four corresponding corners in a view for computing the homography. We instead automatically compute the disparity map without any user input. In the second example, we capture a figurine behind a translucent layer of cloth using the 1-inch baseline. Our method is able to reliably recover the 3D geometry of the figurine as well as remove the effect of the cloth layer. To use [10], we select four feature points on the images and approximate a homography for warping the images. Their results exhibit clear visual artifacts due to their inability to account for arbitrary depth variation.
Next, we capture three objects made of different materials behind a reflective glass. This emulates the museum setting of photographing 3D artifacts. These objects, especially the toy truck, have clear depth variations and the parallax across the LF views violates the homography model. Consequently, both the recovered transmitted layer and the secondary layer from [10] exhibit ghosting artifacts due to misalignment of views. The technique of [18] partially reduces these artifacts, as the initial SIFT flow registers the images better. However, the SIFT flow still deviates considerably from the actual disparity map, and their results exhibit artifacts on heavily saturated regions due to misalignment.
Our technique is able to generate better results. More importantly, with the help of the disparity map, we are able to align the views and eliminate most of the reflection layers while preserving fine geometric details and texture, as seen in Fig. 6. Our layer separation solution also produces a high quality 3D depth map, with which we can perform image-based rendering (IBR) effects such as depth-guided refocusing (Fig. 7) on the transmitted layer. Fig. 8 shows our results on a dynamic scene with a toy truck moving behind glass. The bottom row shows results of removing the fast moving reflection. To the best of our knowledge, our solution is the first to perform reliable layer separation on dynamic scenes.
We tested our LF camera in a variety of environments, and found that the 1-inch baseline provides enough view changes for almost all practical scenes that are 4-6 feet away. Also, a 9-view LF is sufficient for nearly all cases. More views would further strengthen the low-rank constraint in RPCA optimization but are also more computationally expensive. Our method takes about 7 minutes on average to process one LF video frame (containing 9 views at a resolution of ). The code of [10] takes about 3 minutes to finish an image sequence of the same size. The author of [18] reports a running time of about 5 minutes for an image sequence containing up to 5 images.
As with previous techniques, we assume that the transmitted layer is dominant with the contribution of the secondary layer being relatively small. This ensures that the SIFT flow algorithm will mostly choose feature points from the transmitted layer to produce mostly correct warping. If the assumption is violated, the detected feature points will come from a mixed pool of two layers. Since our iterative refinement process is local, it may not be able to overcome the large errors.
We experimented on a synthetic scene dataset where we control the blending of the two layers with a blending parameter . We apply our layer separation technique for different values of . We compute the percentage of incorrectly recovered pixels in both layers where we use 0.1 (for intensity range [0, 1]) as the threshold to determine if a recovered pixel is incorrect. Fig. 9 shows the layer separation accuracy versus . For small (e.g., in range ), we are able to obtain good results. The performance significantly degrades when is above and our algorithm fails when is above .
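The accuracy measure used in this experiment can be written down directly. The sketch below counts a recovered pixel as incorrect when its absolute error exceeds 0.1 on a [0, 1] intensity scale and reports the percentage of such pixels. The function name `bad_pixel_percentage` is hypothetical, and how the scores of the two layers are combined is not specified here.

```python
import numpy as np

def bad_pixel_percentage(recovered, truth, threshold=0.1):
    """Percentage of pixels whose absolute error exceeds the threshold.

    Both images are expected to use the intensity range [0, 1].
    """
    recovered = np.asarray(recovered, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return 100.0 * np.mean(np.abs(recovered - truth) > threshold)
```

The measure would be applied once to the recovered transmitted layer and once to the recovered secondary layer, each against its ground truth.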
6 Conclusion
We have presented a novel technique that automatically separates the transmitted and secondary layers. At the core of our technique is the use of light field imaging to acquire multi-view images. With approximate scene depth of the transmitted layer, we can warp all light field views to the reference view to form an image stack. The corresponding transmitted stack is expected to be of low rank, while the secondary layer is of low coherence and hence sparse. We start with SIFT flow to generate the initial depth map and then apply an iterative optimization scheme based on Robust PCA (RPCA) for layer separation and depth map refinement. It is worth noting that our technique handles dynamic scenes (e.g., removing reflections from video), which would be almost impossible for traditional methods using an unstructured collection of viewpoints.
An implicit assumption of our technique is that the transmitted layer is predominant so that SIFT flow can produce a reliable initial estimation of the disparity map of the transmitted layer. We plan to investigate the structure of the secondary layer to relax this assumption. We would also like to try our technique on the Pelican [1] or Light [2] mobile LF cameras, which are expected to be on the market soon, and compare their results with those from our light field setup. Another interesting direction is to estimate the 3D shape of the secondary layer as well, by reformulating our problem using two unknown disparity maps.
References
 [1] Pelican imaging mobile camera array. https://www.pelicanimaging.com, 2013 (Accessed on March, 2015).
 [2] Light mobile camera array. https://light.co/, 2015 (Accessed on March, 2015).
 [3] E. H. Adelson and J. Y. A. Wang. Single lens stereo with a plenoptic camera. IEEE TPAMI, 14(2):99–106, 1992.
 [4] A. Agrawal, R. Raskar, S. K. Nayar, and Y. Li. Removing photography artifacts using gradient projection and flash-exposure sampling. In ACM TOG, volume 24, pages 828–835, 2005.
 [5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.
 [6] C. Chen, H. Lin, Z. Yu, S. B. Kang, and J. Yu. Light field stereo matching using bilateral statistics of surface cameras. In CVPR, 2014.
 [7] S. U. Computer Graphics Laboratory. The (new) stanford light field archive. http://lightfield.stanford.edu/, 2008 (Accessed on March, 2015).
 [8] H. Farid and E. H. Adelson. Separating reflections from images by use of independent component analysis. JOSA A, 16(9):2136–2145, 1999.
 [9] K. Gai, Z. Shi, and C. Zhang. Blind separation of superimposed moving images using image statistics. IEEE TPAMI, 34(1):19–32, 2012.
 [10] X. Guo, X. Cao, and Y. Ma. Robust separation of reflection from multiple images. In CVPR, pages 2195–2202, 2014.
 [11] S. Heber and T. Pock. Shape from light field meets robust pca. In ECCV, pages 751–767. 2014.
 [12] S. Heber, R. Ranftl, and T. Pock. Variational shape from light field. In International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2013.
 [13] H. Hirschmuller. Stereo processing by semi-global matching and mutual information. IEEE TPAMI, 30(2):328–341, 2008.
 [14] C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. H. Gross. Scene reconstruction from high spatio-angular resolution light fields. ACM Trans. Graph., 32(4):73, 2013.
 [15] N. Kong, Y.-W. Tai, and S. Y. Shin. High-quality reflection separation using polarized images. IEEE TIP, 20(12):3393–3405, 2011.
 [16] A. Levin and Y. Weiss. User assisted separation of reflections from a single image using a sparsity prior. IEEE TPAMI, 29(9):1647–1654, 2007.
 [17] A. Levin, A. Zomet, and Y. Weiss. Separating reflections from a single image using local features. In CVPR, volume 1, pages I–306, 2004.
 [18] Y. Li and M. S. Brown. Exploiting reflection change for automatic reflection removal. In ICCV, pages 2432–2439, 2013.
 [19] G. Lippmann. La photographie intégrale. Comptes-Rendus, Académie des Sciences, 146:446–451, 1908.
 [20] C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. IEEE TPAMI, 33(5):978–994, 2011.
 [21] A. Lumsdaine and T. Georgiev. The focused plenoptic camera. In IEEE ICCP, pages 1–8, 2009.
 [22] Lytro. Lytro camera. https://www.lytro.com, 2012 (Accessed on March, 2015).
 [23] K. Maeno, H. Nagahara, A. Shimada, and R.-i. Taniguchi. Light field distortion feature for transparent object recognition. In CVPR, pages 2786–2793, 2013.
 [24] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan. Light field photography with a handheld plenoptic camera. Computer Science Technical Report CSTR, 2(11), 2005.
 [25] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47(1-3):7–42, 2002.
 [26] Y. Y. Schechner, N. Kiryati, and R. Basri. Separation of transparent layers using focus. IJCV, 39(1):25–39, 2000.
 [27] Y. Y. Schechner, J. Shamir, and N. Kiryati. Polarization and statistical analysis of scenes containing a semi-reflector. JOSA A, 17(2):276–284, 2000.
 [28] S. N. Sinha, J. Kopf, M. Goesele, D. Scharstein, and R. Szeliski. Image-based rendering for scenes with reflections. ACM Trans. Graph., 31(4):100, 2012.
 [29] R. Szeliski, S. Avidan, and P. Anandan. Layer extraction from multiple images containing reflections and transparency. In CVPR, volume 1, pages 246–253, 2000.
 [30] M. W. Tao, T.-C. Wang, J. Malik, and R. Ramamoorthi. Depth estimation for glossy surfaces with light-field cameras. In ECCV Workshop on Light Fields for Computer Vision. 2014.
 [31] Y. Tsin, S. B. Kang, and R. Szeliski. Stereo matching with linear superposition of layers. IEEE TPAMI, 28(2):290–301, 2006.
 [32] S. Wanner and B. Goldluecke. Variational light field analysis for disparity estimation and super-resolution. 2013.
 [33] Z. Yu, X. Guo, H. Ling, A. Lumsdaine, and J. Yu. Line-assisted light field triangulation and stereo matching. In ICCV, 2013.
 [34] Z. Zhang. A flexible new technique for camera calibration. IEEE TPAMI, 22(11):1330–1334, 2000.