The Domain Transform Solver

05/11/2018 ∙ by Akash Bapat, et al. ∙ University of North Carolina at Chapel Hill

We present a framework for edge-aware optimization that is an order of magnitude faster than the state of the art while having comparable performance. Our key insight is that the optimization can be formulated by leveraging properties of the domain transform, a method for edge-aware filtering that defines a distance-preserving 1D mapping of the input space. This enables our method to improve performance for a variety of problems including stereo, depth super-resolution, and render from defocus, while keeping the computational complexity linear in the number of pixels. Our method is highly parallelizable and adaptable, and it has demonstrable scalability with respect to image resolution.




1 Introduction

Edge-aware optimization is a widely utilized tool in computer vision. It is applied to a large variety of tasks, including semantic segmentation [3], stereo [4], recoloration [5], and optical flow [6]. This has been motivated by the intuition that similar-looking pixels should have similar properties. For this reason, a wide variety of edge-aware filtering algorithms have been developed, including the bilateral filter [7], anisotropic diffusion [8], and edge-avoiding wavelets [9], all of which identify similar-looking pixels. However, using such filters in optimization frameworks typically leads to slow algorithms, and while high-level groupings like super-pixels can be used to compensate for this sluggishness [10], the color-space clusterings of such approaches are not guaranteed to respect the semantics of the underlying domain, which often leads to processing artifacts.

(a) Low-res. depthmap
(b) Our 16x upsampled result
(c) Color image
(d) Target image
(e) Our result
Figure 1: Our domain transform solver can tackle a variety of problems. (a,b) shows our result for depth super-resolution using a high resolution color image as a reference. (c,d) shows a color image from the Middlebury dataset [11] with an initialization obtained from MC-CNN [12], which is then refined to obtain our result (e) that is blurred in an edge-aware sense.

We propose a general optimization framework that directly operates in the pixel space while maintaining distances in the combined color and pixel space with an edge-aware regularizer. The framework can be applied to a variety of optimization problems, as we demonstrate in Fig. 1 and Sec. 3. Our method achieves competitive performance in applications like stereo optimization (Sec. 3.3), rendering from defocus (Sec. 3.4), and depth super-resolution (Sec. 3.5), while running substantially faster than existing solvers. This speed advantage becomes more pronounced with increasing image resolution, as well as with a growing number of image channels. At the same time, our approach is independent of blur kernel sizes, which is not the case for existing bilateral solvers. This becomes crucial for applications where the data is of high resolution and high dimensionality, for example satellite imagery, where a single image typically has a resolution of more than 100 million pixels with as many as 16 spectral bands.

The remainder of the paper is organized as follows: Sec. 2 describes the traditional approaches the computer vision community has developed for edge-aware filtering and optimization. Sec. 3 derives our domain transform solver (DTS) optimization framework and highlights its similarities as well as dissimilarities with previous work. We also describe how our framework can be adapted for various vision tasks. In Sec. 4, we provide quantitative evaluation as well as validation of the timing performance. Finally, we conclude in Sec. 5 and provide some future directions for expanding on our approach.

2 Related Work

Next, we briefly review the most relevant prior work related to edge-based filtering and optimization, namely: implementations for bilateral filters, optimizations leveraging superpixel segmentation, machine learning for edge-aware filtering, the domain transform and its filtering applications, and bilateral solvers.

Bilateral filters

The bilateral filter was introduced by Tomasi and Manduchi [7] and was one of the first edge-aware blurring techniques. The major bottleneck of bilateral filtering is that it is costly to compute, especially with large blur windows. Since its invention, multiple approaches have been introduced to speed up the bilateral filter [13], [14], [15], [16], [17]: Durand and Dorsey approximate the bilateral filter by a piece-wise linear approximation [18]. Pham and van Vliet proposed to approximate the bilateral filter using two 1-D bilateral kernels [19]. Paris and Durand proposed to treat the image as a 5-D function of color and pixel space and then apply 1-D blur kernels in this high-dimensional space [20]. These approximations to the bilateral filter, or the use of 1-D kernels in higher-dimensional spaces, enable decoupling the 2-D adaptive bilateral kernel into 1-D kernels, reducing the computational cost significantly. When used as a post-processing step, the bilateral filter removes noise in homogeneous regions but is sensitive to artifacts such as salt and pepper noise [21]. Our method emphasizes edge-aware concepts in the same spirit as bilateral filters, but our formulation is fundamentally different in that it provides a generalized framework for domain transform optimization.


To combat issues of computational complexity during bilateral optimizations, several approaches leverage superpixels, i.e., they group pixels together based on appearance and location. Superpixel extraction algorithms like SLIC [22] are often used in optimization problems for two major reasons: 1) They reduce the number of variables in the optimization, and 2) they adhere to color and (implicitly) object boundaries. In one application, Bódis-Szomorú et al. [23] use sparse Structure-from-Motion (SfM) data, image gradients, and superpixels for surface reconstruction to ensure that the edges of triangles are aligned to the edges in the image. Lu et al. [10] use SLIC superpixels to enforce spatially consistent depths in a PatchMatch-based matching framework [24] to estimate stereo. Using superpixels inherently assumes local consistency and perfect segmentation, which often does not hold in practice. For example, in stereo algorithms, superpixels may cover regions with similar color but drastically different depths. At such regions, the pixels in a superpixel are incorrectly grouped because they have different depths. Although algorithm parameters can be tuned, the trade-off in coherence versus conciseness is a limiting factor in the utility of superpixel approaches.

Machine learning for edge-awareness

Yan et al. [25] used support vector machines (SVM) to mimic a bilateral filter by using the exponential of spatial and color distances as feature vectors to represent each pixel. Traditionally, conditional random fields (CRFs) are used for enforcing pair-wise pixel smoothness via the Potts potential. For example, Krähenbühl and Koltun [3] proposed to use the permutohedral lattice data structure [14], which is typically used in fast bilateral implementations, to accelerate inference in a fully connected CRF by using Gaussian distances in space and color. With the explosion of compute capacity and convolutional models in the vision community, there are also deep-learning methods that attempt to achieve edge-aware filtering. Chen et al. [26] presented DeepLab to perform semantic segmentation; there, they use the fully connected CRF from Krähenbühl and Koltun [3] on top of their convolutional neural network (CNN) to improve the localization of object boundaries, which typically suffers in a CNN setting due to multiple max-pooling layers and the use of low-resolution images. Xu et al. [27] presented a framework to learn edge-aware operators from the data to mimic various traditional handcrafted filters like the bilateral, weighted median, and weighted least squares filters [28]. However, machine learning approaches require large amounts of training data specific to a task, as well as significant compute power, while our approach works without any task-dependent training and runs efficiently on a single GPU.

The domain transform

Gastal and Oliveira [1] introduced the domain transform, a novel and efficient method for edge-aware filtering that is akin to bilateral filters. The domain transform is defined as a 1-D isometric transformation of a multi-valued 1-D function such that the distances in the range and domain are preserved. (See Sec. 3.2 for more details.) When applied to a 1-D image with multiple color channels, the transformation maps the distances in color and pixel space into a 1-D distance in the transformed space. When the scalar distance is measured in the transformed space, it is equivalent to measuring the vector distance in [R,G,B,X] space. This has the benefit of dimensionality reduction, leading to a fast edge-aware filtering technique which respects edges in color while blurring nearby pixels. To apply the domain transform to a 2-D image, the authors apply two passes, once in the X direction and once in Y. Applying the domain transform to an image results in a filtering effect, while in our case we optimize according to an objective function. Chen et al. [2] proposed to perform edge-aware semantic segmentation using deep learning and use the domain transform filter in the end-to-end training of their deep-learning framework. They also alter the definition of what is considered an ‘edge’ by learning an edge prediction network, and they then use the learned edge map in a domain transform. Their application of the domain transform is in the form of a filter, and hence is similar to one iteration of our method. We use the domain transform iteratively within an optimization framework because it provides an efficient way to compute the local edge-aware mean.

Bilateral solvers

More recently, Barron et al. [29] suggested viewing a color image as a function of the 5-D space [Y,U,V,X,Y], which they call the ‘bilateral space’, to estimate stereo for rendering defocus blur. They proposed to transform the stereo optimization problem by expressing the problem variables in the bilateral space and optimizing in this new space. We will refer to this method as BL-Stereo. Barron and Poole’s [30] Bilateral Solver (BL-Solver), on the other hand, solves a linear optimization problem in the bilateral space, which is different from BL-Stereo. In this setting, they require a target map to enhance, as well as a confidence map for the target map. The linearization of the problem allows them to converge to the solution faster. (See Sec. 3.1 for more details.) Both of these approaches quantize the 5-D space into a grid, where the grid size is governed by the blur kernel sizes. This reduces the number of optimization variables and hence the complexity, leading to fast runtimes.

Our work is closely related to Barron et al. [29] and Barron and Poole [30] in that we are targeting the same goal of developing general solvers that are edge-aware and fast. The gridding strategy of these previous methods scales well with higher blur amounts/windows. However, using higher blur windows is not a scalable option as image resolution increases, especially in large-resolution imagery where it is important to maintain fine details, such as multi-camera capture for virtual reality and satellite imagery. In contrast, our method does not require large blur kernels to be efficient. Our method operates on the pixels themselves, and hence inherently has a large number of optimization variables. Despite this, our approach is inherently parallelizable, making it easy to implement on GPUs. Our method does not depend on the blur kernel size, and hence scales well to higher image resolutions with both large and small blur kernels.

In the following sections, we present our general optimization framework and demonstrate its performance on a variety of problems. We present quantitative evaluations to show the competitive accuracy of our method, as well as the significant speed-up that it provides.

3 Approach

Edge-aware filtering techniques like the bilateral filter smooth similar looking regions of the image while preserving crisp edges. This is especially useful for smoothing depthmaps, where we want to preserve sharp discontinuities in depth by not filtering across depth edges while smoothing out planar regions. Although edge-aware filtering has been used for stereo as a post processing step [12], as well as during optimization [4], this approach increases the computational complexity of the algorithm substantially due to the data-dependent smoothing kernel. We consider an algorithm a filtering technique when the filter operates on the input image to produce an output image. On the other hand, we consider an algorithm a solver when it uses one or more input images and optimizes towards a goal defined by a cost/loss function.

3.1 Optimization framework

First, we introduce our efficient domain transform solver (DTS), which leverages an efficient way of expressing distances using isometry. The DTS solves the following optimization problem:

$$\min_{x} \;\; \lambda \sum_i \Big( x_i - \sum_{j \in \mathcal{N}_i} W_{ij}\, x_j \Big)^2 + \sum_i c_i\, (x_i - t_i)^2 + \sum_m \lambda_m f_m(x) \qquad (1)$$

Here, the $x_i$ are the values we want to estimate, e.g., disparity and color, at the $i$-th pixel of an image. The initial target estimate $t_i$ with a confidence $c_i$ are also given for the $i$-th pixel. This optimization objective has an edge-aware regularizer $\big(x_i - \sum_{j \in \mathcal{N}_i} W_{ij} x_j\big)^2$, which forces the $x_i$ to be similar to the mean of the neighborhood $\mathcal{N}_i$, computed in an edge-aware sense. Hence, the neighborhood’s size changes for each $x_i$ according to the image content. Intuitively, by forcing $x_i$ to be similar to the edge-aware mean $\mu_i$, we emulate the bilateral filter’s properties so that $x_i$ is similar to the other $x_j$ which contribute to the mean only when $i$ and $j$ are similar in color and close in pixel distance. This edge-aware mean is $\mu_i = \sum_{j \in \mathcal{N}_i} W_{ij} x_j$, where $W_{ij}$ takes into account the pixel color similarity as well as pixel distance between pixel $i$ and $j$; see Sec. 3.2 for a derivation of $W_{ij}$. We compute $\mu_i$ using the domain transform, which enables us to evaluate our pair-wise regularizer faster than traditional approaches; see Sec. 3.2 for more details. $f_m$ is an application-dependent term with a weighting factor of $\lambda_m$. For example, for stereo, $f_m$ could be the photometric matching cost for the left-right image pair.

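In all applications, our method aims to solve Eq.(1). The minimum at the point of the solution necessarily has a zero derivative. Hence, we next seek to characterize this minimum in order to leverage it later in our proposed approach. For simplicity, we first only investigate a simplified version of Eq.(1) that does not contain the problem-specific term $f_m$. This simplified version can be written as follows:

$$\min_{x} \;\; \lambda \sum_i (x_i - \mu_i)^2 + \sum_i c_i\, (x_i - t_i)^2, \qquad \mu_i = \sum_{j \in \mathcal{N}_i} W_{ij}\, x_j \qquad (2)$$

Taking the gradient of Eq.(2) with respect to $x_i$, with the edge-aware mean $\mu_i$ treated as fixed, and setting it to zero provides

$$2\lambda\, (x_i - \mu_i) + 2 c_i\, (x_i - t_i) = 0 \qquad (3)$$

Hence, at the minima of Eq.(2) we have

$$x_i = \frac{\lambda\, \mu_i + c_i\, t_i}{\lambda + c_i} \qquad (4)$$

Now, we highlight the relation to the optimization function of the BL-Solver. Its optimization objective is

$$\min_{x} \;\; \frac{\lambda}{2} \sum_{i,j} \hat{W}_{ij}\, (x_i - x_j)^2 + \sum_i c_i\, (x_i - t_i)^2 \qquad (5)$$

Inspecting the derivative at the minimum as we did for Eq.(2) requires us to compute the gradient of Eq.(5) with respect to $x_i$ and set it to zero:

$$2\lambda \sum_j \hat{W}_{ij}\, (x_i - x_j) + 2 c_i\, (x_i - t_i) = 0 \;\;\Longrightarrow\;\; x_i = \frac{\lambda \sum_j \hat{W}_{ij}\, x_j + c_i\, t_i}{\lambda \sum_j \hat{W}_{ij} + c_i} \qquad (6)$$

The extra factor of 2 with $\lambda$ is due to the fact that we have to consider the terms when the roles of $i$ and $j$ are exchanged: for symmetric $\hat{W}$, each $x_i$ appears in both the $(i,j)$ and the $(j,i)$ term of the pair-wise sum. The solutions in Eq.(4) and Eq.(6) look very similar. The major difference is that in Eq.(6), after dividing through by the local support, the contribution of confidence scores is weighted by $1/\sum_j \hat{W}_{ij}$ and hence it is edge-aware. We also weigh the confidence $c_i$ during gradient descent updates by the inverse of the local support to mimic its effect in Eq.(6), which provides less weight to target $t_i$ when we have a large support via the similarities expressed by $\hat{W}_{ij}$. We compute the confidence scores $c_i$ by estimating the variance of the $t_i$ in an edge-aware sense using the domain transform as suggested in [30]:

$$c_i = \exp\!\left( -\frac{\mathrm{DT}(t^2)_i - \mathrm{DT}(t)_i^{\,2}}{2\sigma_c^2} \right) \qquad (7)$$

In this formulation, the domain transform $\mathrm{DT}(\cdot)$ is treated as a local estimate of the mean in an edge-aware sense, while $\sigma_c$ scales the variance to get confidence scores.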
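As a rough illustration of variance-based confidence, the sketch below maps a local variance estimate to a score through a Gaussian falloff. It makes several assumptions that are not taken from the paper: `local_mean` is a plain moving average standing in for the edge-aware mean, `sigma_c` is a hypothetical scale parameter, and the exponential form is one common way to turn a variance into a score in [0, 1].

```python
import math

def local_mean(values, radius=1):
    """Plain moving average; a hypothetical stand-in for the edge-aware mean."""
    n = len(values)
    means = []
    for i in range(n):
        window = values[max(0, i - radius):min(n, i + radius + 1)]
        means.append(sum(window) / len(window))
    return means

def confidence_from_variance(target, sigma_c=1.0):
    """Assumed form: confidence = exp(-local variance / (2 * sigma_c^2))."""
    mean_t = local_mean(target)
    mean_t2 = local_mean([t * t for t in target])
    conf = []
    for m, m2 in zip(mean_t, mean_t2):
        variance = max(m2 - m * m, 0.0)  # clamp tiny negative round-off
        conf.append(math.exp(-variance / (2.0 * sigma_c ** 2)))
    return conf

# A flat disparity profile with a single outlier at index 4.
target = [2.0, 2.0, 2.0, 2.0, 9.0, 2.0, 2.0]
conf = confidence_from_variance(target)
```

Pixels in the flat region keep full confidence, while pixels whose window contains the outlier receive a confidence close to zero, so the target is trusted less there.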

In summary, Eq.(2) and Eq.(5) have the same optimal solution, but the solution of Eq.(2) can be computed significantly faster by leveraging parallel computations. The reason is that we replace the pair-wise term in Eq.(5) by the local edge-aware mean, which we can compute in an efficient manner (see Sec. 3.2 for details), and we weigh the contribution of the targets by adapting the input confidence according to the local support.
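The iteration implied by the simplified objective can be sketched on a 1-D signal as follows. This is a minimal illustration under stated assumptions, not the authors' GPU implementation: the objective is taken to be `lam * (x_i - mu_i)^2 + c_i * (x_i - t_i)^2` with `mu_i` held fixed within each step, `edge_aware_mean` is a hypothetical windowed average weighted by a guide signal (standing in for the domain-transform mean), and the per-pixel step is normalized by `2 * (lam + c_i)` so that each update is a damped move toward `(lam * mu_i + c_i * t_i) / (lam + c_i)`.

```python
import math

def edge_aware_mean(x, guide, radius=2, sigma_r=0.1):
    """Hypothetical stand-in for the domain-transform mean: a windowed
    average whose weights fall off with differences in the guide signal."""
    n = len(x)
    means = []
    for i in range(n):
        num = den = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            w = math.exp(-((guide[i] - guide[j]) ** 2) / (2.0 * sigma_r ** 2))
            num += w * x[j]
            den += w
        means.append(num / den)
    return means

def solve(target, conf, guide, lam=1.0, step=0.99, iters=200):
    """Damped iteration toward x_i = (lam*mu_i + c_i*t_i) / (lam + c_i)."""
    x = list(target)
    for _ in range(iters):
        mu = edge_aware_mean(x, guide)  # recomputed from the current estimate
        x = [
            (1.0 - step) * xi + step * (lam * m + c * t) / (lam + c)
            for xi, m, c, t in zip(x, mu, conf, target)
        ]
    return x

# Guide with one sharp edge; target is the same step plus alternating noise.
guide = [0.0] * 8 + [1.0] * 8
noise = [0.1 if i % 2 == 0 else -0.1 for i in range(16)]
target = [g + e for g, e in zip(guide, noise)]
result = solve(target, conf=[0.1] * 16, guide=guide)
```

Every pixel update depends only on the previous iterate, which is why this kind of scheme parallelizes naturally; the example should smooth the noise on each side of the step while leaving the jump between indices 7 and 8 intact.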

3.2 Domain transform

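Gastal and Oliveira [1] define an isometric transformation, which they call the domain transform (DT), for a 1-D multi-valued function $I: \Omega \to \mathbb{R}^c$, $\Omega \subset \mathbb{R}$, by treating $I$ as a curve $C$ in $\mathbb{R}^{c+1}$. The domain transform is such that it preserves distances between two points on the curve under a given norm. Unlike Gastal and Oliveira [1], we use the $\ell_2$ norm to define the distances, and hence we derive the domain transform here, which satisfies the constraint $|\mathrm{DT}(x+h) - \mathrm{DT}(x)| = \|C(x+h) - C(x)\|_2$ for the nearest neighbors $x$ and $x+h$. This derivation follows closely Sec. 4 of Gastal and Oliveira [1]. Using the shorthand notation $\mathrm{DT}'(x)$ and $I_k'(x)$ for derivatives with respect to $x$, and assuming a small shift $h$ in $x$, we can express the distance in pixels and color equal to the distance of the transform as follows:

$$\big(\mathrm{DT}(x+h) - \mathrm{DT}(x)\big)^2 = h^2 + \sum_{k=1}^{c} \big(I_k(x+h) - I_k(x)\big)^2 \;\;\xrightarrow{\;h \to 0\;}\;\; \mathrm{DT}'(x)^2 = 1 + \sum_{k=1}^{c} I_k'(x)^2 \qquad (8)$$

Taking the square root and constraining $\mathrm{DT}$ to be monotonic to avoid negative roots, followed by integrating both sides, we obtain

$$\mathrm{DT}(u) = \int_0^u \sqrt{1 + \sum_{k=1}^{c} I_k'(x)^2}\; dx \qquad (9)$$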

Using this definition of the domain transform of the 4-D space with the curve C defined by RGB color and X denoting the domain, we can express the edge-aware mean as follows:


where , and $\delta$ represents Dirac’s delta function. To see the relation with the simple domain transform blurring [1], note that setting the confidence scores $c_i$ to zero in Eq.(1) will lead to the same solution as the domain transform filtering. Similarly, setting $f_m$ to zero, and $W_{ij}$ to Gaussian weights in color and space will lead to bilateral filtering. Note that the above derivation is isometric since the function $I$ is multi-valued but with a 1-dimensional domain $\Omega$. When the domain is extended to 2-D, the exact isometry is no longer valid, and Gastal and Oliveira [1] use alternating passes by separately considering the image as a function of X and then Y.
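To make the 1-D construction concrete, the sketch below accumulates an $\ell_2$ arc length along one scanline and then averages values over a fixed radius in the transformed coordinate. It is an illustration under assumptions rather than the paper's implementation: `color_scale` is a hypothetical factor that controls how strongly color differences stretch the domain, the normalized box window in the transformed domain is one simple choice of weights, and the quadratic-time neighbor search is used only for clarity.

```python
import math

def domain_transform_l2(pixels, color_scale=1.0):
    """Cumulative l2 arc length along a scanline.

    pixels: list of per-pixel channel tuples, e.g. (R, G, B) in [0, 1].
    color_scale: hypothetical weight on color differences.
    """
    dt = [0.0]
    for prev, cur in zip(pixels, pixels[1:]):
        color_sq = sum((color_scale * (a - b)) ** 2 for a, b in zip(cur, prev))
        # Unit pixel spacing contributes 1 under the square root.
        dt.append(dt[-1] + math.sqrt(1.0 + color_sq))
    return dt

def edge_aware_mean(values, dt, radius):
    """Average values whose transformed coordinates lie within `radius`.

    Naive O(n^2) search, kept simple for readability.
    """
    means = []
    for ti in dt:
        support = [v for v, tj in zip(values, dt) if abs(tj - ti) <= radius]
        means.append(sum(support) / len(support))
    return means

# Four dark pixels followed by four bright pixels: one strong color edge.
pixels = [(0.0, 0.0, 0.0)] * 4 + [(1.0, 1.0, 1.0)] * 4
dt = domain_transform_l2(pixels, color_scale=10.0)
values = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
means = edge_aware_mean(values, dt, radius=1.5)
```

Because the color edge inserts a large gap in the transformed coordinate, pixels on opposite sides of it never fall inside the same window, so the averages on each side are computed independently.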

Now, all the terms except the application-specific terms in Eq.(1) are defined. In the following, we present how we apply our optimization framework to a variety of application scenarios, where we adapt Eq.(1) by changing the application-specific function $f_m$.

3.3 Stereo optimization

Stereo estimation is a well-studied problem [31, 32] in which the task is to estimate a matching correspondence of pixels in the left image to the pixels in the right image. This matching correspondence defines the disparity of the pixels and in turn the depth, and when done for each pixel provides us with a disparity map. Typically, dense search is done along the row of a rectified pair by matching the pixel color similarity known as photometric matching cost. In the following, we refine a disparity map. We obtain the disparity map from MC-CNN [12], which acts as our target (Fig. 2(c)), for which we calculate a confidence score (shown in Fig. 2(d)) using Eq.(7). We use the left color image to define and compute the edge-aware mean and optimize the disparities to obtain a disparity map that is smooth at homogeneous regions but has crisp edges (Fig. 2(e)). Similar to our proposed solver, Barron and Poole [30] show that the BL-Solver works well for a wide variety of optimization problems including stereo. When they apply the BL-Solver to the stereo problem, they achieve faster convergence compared to BL-Stereo because they neglect the physical implication of changing the disparity. In other words, if an optimizer changes the estimate of disparity at a point in left image, this gets reflected in a change in the color of its matching pixel in the right image. Here, we present a method for solving for the disparity in an edge-aware sense while having a photometric penalty for the left-right matching. Our loss for stereo optimization is as follows:


where is the left image and is the right image of the stereo pair. For robust optimization, we use a Charbonnier loss with on the target term , which has been shown to be effective for optical flow [33]. We use Zbontar and LeCun’s MC-CNN [12] as the target for our stereo optimization.
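For reference, the Charbonnier penalty is commonly written as $\rho(r) = \sqrt{r^2 + \epsilon^2}$, a smooth approximation of the absolute value. The sketch below shows the penalty and its derivative; the value of `eps` is a placeholder chosen for illustration and is not the setting used in the paper.

```python
import math

def charbonnier(residual, eps=1e-3):
    """Charbonnier penalty sqrt(r^2 + eps^2); eps is a placeholder value."""
    return math.sqrt(residual * residual + eps * eps)

def charbonnier_grad(residual, eps=1e-3):
    """Derivative r / sqrt(r^2 + eps^2); its magnitude never exceeds 1."""
    return residual / math.sqrt(residual * residual + eps * eps)

# Large residuals are penalized roughly linearly, so a single bad target
# value pulls on the solution far less than it would under a squared loss.
small = charbonnier(0.01)
large = charbonnier(100.0)
```

Since the derivative saturates, outliers in the target contribute a bounded gradient, which is the usual motivation for using this penalty in robust optimization.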

(a) Color image
(b) GT disparity
(c) Target image
(d) Confidence
(e) Our result
(f) Color image
(g) GT disparity
(h) Target image
(i) Our result
Figure 2: Stereo Optimization: The top row shows our result in (e) which is computed using the color image (a) used to define color distance in the domain transform, and target (c) disparity obtained from MC-CNN [12]. The confidence map (d) is used to weigh the target disparity in the optimization (Eq.(1)). (f-i) show a zoomed area of (a-c,e). Notice in the zoomed regions that our results are aligned to the edges of the color image.

Next, we will detail the application of our method to the problem of rendering defocus from depth, which is another application heavily relying on accurate depth edges.

3.4 Synthetic defocus from depth

Interest in creating synthetic defocus from depth is growing, with phones like the Google Pixel 2 and the OnePlus 5T providing a portrait mode where the shallow depth of field effect is mimicked through the estimation of depth. BL-Stereo’s synthetic defocus method is used as part of the Lens Blur feature on Google’s phones [29]. We use our stereo optimization from Sec. 3.3 to estimate depth maps, which retain sharp discontinuities at color edges. Figs. 3, 4, and 5 show the original color image and the defocus rendering produced by using our estimated depthmaps and the ground-truth depthmaps for scenes in the Middlebury dataset [11]. As our stereo optimization is edge-aware, the defocus rendering maintains high quality even at the edges. Notice that in the insets of Fig. 4, MC-CNN has jarring artifacts, especially at the edges, while the rendering using our estimated depthmap is smoother. In the Jadeplant scene shown in Fig. 3, the background is in focus, and for the same scene in Fig. 4 the blue block in the front is kept in focus. In the Playroom scene illustrated in Fig. 5, the front chairs are chosen to be in focus. To render the synthetic defocus, we used the algorithm described in Sec. 6 of the supplementary material of Barron et al. [29]. This shows that our results are qualitatively better than MC-CNN, and the most noticeable improvements are because we optimize in an edge-aware sense.

Figure 3: Render from defocus for the Middlebury Jadeplant scene. (a) Original color image. (b) Our result where the background is in focus. (c) Result computed using the ground-truth disparity.
Figure 4: Render from defocus for the Middlebury Jadeplant scene. (a) Results obtained using the MC-CNN depthmap where the front blue block is in focus. (b) Our result. (c) Result computed using the ground-truth disparity. The inset shows details and highlights improvements around the edges.
Figure 5: Render from defocus for the Middlebury Playroom scene. (a) Original color image. (b) Our result where the chairs in the front are in focus. (c) Result computed using the ground-truth disparity.

3.5 Depth super-resolution

The availability of cheap commodity depth sensors like the Microsoft Kinect, Asus Xtion, and Intel RealSense has spurred many avenues of research, including depth super-resolution. Depth super-resolution is important for sensors like these because, often, the color camera is of high resolution, but the depth camera/projector has low resolution, which leads to crude depth maps [34]. Ferstl et al. [35] adapted the Middlebury dataset for the depth super-resolution task to create a benchmark, on which we evaluate our method here. For this task, we use simple bicubic interpolation for upsampling the low-resolution depth map and use this map as a target in our optimization; we use the high-resolution color image to compute the domain transform based edge-aware mean and obtain our optimized result (Fig. 6(c)). We follow Barron and Poole [30] by setting the confidence scores using a Gaussian bump model to represent the contribution of each pixel to the nearby upsampled pixels. We do not use any additional application-specific penalty terms in Eq.(1) for this task.


(a) Color image.


(b) GT disparity


(c) Our result


(d) Barron and Poole [30].
Figure 6: Depth super-resolution: (a) shows original color image, (b) ground truth disparity, (c) our optimized disparity and (d) results using BL-Solver obtained from the author’s website [36]. The inset highlights the details and the amount of smoothness we obtain in homogeneous regions while being edge-aware.

4 Experiments

We now present quantitative evaluation of our framework as well as timing performance.

Stereo Optimization

For the quantitative evaluation of our method, we use the Middlebury dataset [11]. Barron and Poole [30] used MC-CNN [12] as their initialization, and for a fair comparison we also use it as our target disparity map. Table 1 shows our results for the training set, where we present the mean absolute error (MAE), root mean square error (RMSE), time per megapixel, and time normalized by the number of disparity hypotheses, for non-occluded regions and for all pixels. All of these values were determined by the Middlebury evaluation website. For BL-Solver and our method, the timing columns show the additional time spent processing the MC-CNN disparity maps, with the total time including MC-CNN in parentheses. Note that we obtain a substantial reduction in error compared to MC-CNN at a marginal overhead in time, and we have similar performance to Barron and Poole [30], especially in non-occluded pixels, while running in only a fraction of their time. The results for the test set show that our method achieves comparable results in non-occluded regions, while having significant computational savings. We used , with RGB colors normalized to a range of [0,1], , and . These parameters were found to work best via a grid search on the Middlebury training data. We ran a gradient descent algorithm for 3000 iterations in this experiment with a step size of 0.99 times the gradient. Fig. 7 and Fig. 8 show zoomed regions from the Jadeplant and Pipes scenes to highlight that we improve the target disparity maps from MC-CNN [12] to estimate sharp depth edges.
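For readers who want to reproduce this kind of comparison locally, the two error measures can be computed as sketched below. The function name and the toy disparities are made up for illustration, and the sketch applies a plain unweighted average; it does not model any internal weighting used by the Middlebury evaluation.

```python
import math

def mae_rmse(estimate, ground_truth, mask=None):
    """Return (MAE, RMSE) over the pixels selected by `mask` (all if None)."""
    if mask is None:
        mask = [True] * len(estimate)
    errors = [e - g for e, g, m in zip(estimate, ground_truth, mask) if m]
    mae = sum(abs(d) for d in errors) / len(errors)
    rmse = math.sqrt(sum(d * d for d in errors) / len(errors))
    return mae, rmse

# Toy disparities in pixels; the last pixel plays the role of an occlusion.
estimate = [10.0, 12.0, 15.0, 40.0]
ground_truth = [10.0, 11.0, 15.0, 20.0]
non_occluded = [True, True, True, False]

mae_all, rmse_all = mae_rmse(estimate, ground_truth)
mae_noc, rmse_noc = mae_rmse(estimate, ground_truth, non_occluded)
```

The single large error in the occluded pixel dominates the "all" numbers, especially the RMSE, which mirrors why the non-occluded and all-pixel columns in such tables can differ so much.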

Algorithm                   MAE (px)          RMSE (px)         time/MP (s)    time/GD (s)
                            no occ   all      no occ   all
MC-CNN [12]                 3.81     11.8     18.0     36.6     83.3           259
MC-CNN + BL-Solver [30]     2.60     6.66     10.2     20.9     42.7 (126)     153 (412)
MC-CNN + DTS (ours)         3.02     9.12     10.8     27.4     5.9 (89.2)     19 (278)

MC-CNN [12]                 3.82     17.9     21.3     55.0     112            254
MC-CNN + BL-Solver [30]     2.67     8.19     15.0     29.9     28 (140)       91 (345)
MC-CNN + DTS (ours)         3.78     14.6     17.6     43.4     10 (122)       23 (277)
Table 1: Performance comparison on training images from the Middlebury dataset. The timing for BL-Solver and our method shows the additional time spent in processing MC-CNN, and the total value in parentheses. Our method takes a fraction of the time taken by Barron and Poole [30] to obtain a significant reduction in error versus MC-CNN.
Figure 7: Example stereo optimization results for a closeup of the Middlebury Jadeplant scene. (a) Color image. (b) Ground-truth disparity. (c) Target obtained using MC-CNN [12]. (d) Our result.
Figure 8: Example stereo optimization results for a closeup of the Middlebury Pipes scene. (a) Color image. (b) Ground-truth disparity. (c) Target obtained using MC-CNN [12]. (d) Our result.

Depth super-resolution

We use the dataset introduced by Ferstl et al. [35] to evaluate our method for depth super-resolution. This dataset consists of three scenes (Art, Books, and Moebius) with added noise at 2, 4, 8, and 16x levels of upsampling. We used where is the amount of upsampling. We used with RGB colors normalized to a range of [0,1], , and 10 iterations of the gradient descent with a step size of 0.99. In Table 2, we present the RMS and mean geometric errors for each scene. Data for the bicubic and BL-Solver results were produced by using the data and code provided by Barron et al. [36]. We also used the same code to evaluate our method. Our method and the BL-Solver used the bicubic upsampling as the target image. Our method is 10x faster than Barron and Poole [30] while achieving comparable performance on most images, especially images which have higher upsampling factors. Our time is the average over all images and includes the 0.007 seconds required for bicubic upsampling.

                  Art                        Books                      Moebius                    Avg.    Time
                  2x    4x    8x    16x      2x    4x    8x    16x      2x    4x    8x    16x      (px)    (s)
Bicubic           5.32  6.07  7.27  9.59     5.00  5.15  5.45  5.97     5.34  5.51  5.68  6.11     5.94    0.007
BL-Solver [30]    3.02  3.91  5.14  7.47     1.41  1.86  2.42  3.34     1.39  1.82  2.40  3.26     2.75    0.234
DTS (Ours)        4.58  5.11  5.81  7.69     1.94  2.34  2.85  3.74     1.97  2.34  2.89  3.89     3.43    0.0215
Table 2: Performance of DTS on the depth super-resolution task. Our method is 10x faster than BL-Solver while having comparable performance in most images, especially images with higher upsampling factors.
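The summary columns of Table 2 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the Avg. column is a geometric mean over the twelve per-scene errors, which is an inference from the text's mention of mean geometric errors and from the numbers themselves rather than something the table states; the speed-up is simply the ratio of the two reported times.

```python
import math

# Per-scene errors copied from Table 2 (Art, Books, Moebius at 2x to 16x).
bl_solver = [3.02, 3.91, 5.14, 7.47, 1.41, 1.86, 2.42, 3.34,
             1.39, 1.82, 2.40, 3.26]
dts = [4.58, 5.11, 5.81, 7.69, 1.94, 2.34, 2.85, 3.74,
       1.97, 2.34, 2.89, 3.89]

def geometric_mean(values):
    """exp of the mean log; assumed here to be how the Avg. column is formed."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

avg_bl = geometric_mean(bl_solver)
avg_dts = geometric_mean(dts)

# Ratio of the reported average runtimes in seconds.
speedup = 0.234 / 0.0215
```

Comparing `avg_bl` and `avg_dts` against the Avg. column, and `speedup` against the 10x claim in the caption, gives a quick sanity check on the transcription of the table.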


Now we present how our method scales with increasing image resolution and increasing blur kernel sizes. Our method scales linearly with the number of pixels in the image. Fig. 9(a) shows the dependence of time in seconds on the number of pixels in the image. We use the training images from the Middlebury dataset and only show the time consumed by DTS for the stereo task at 3000 iterations.

Figure 9: Our method takes almost the same amount of time even when there is a large difference in blur kernel size, and it linearly scales with image resolution. (a) Time taken for different images in the Middlebury training dataset, plotted against megapixels. (b) Average time taken by DTS for stereo optimization at different spatial blur values. (c) Average time taken by DTS for stereo optimization at different color blur values. All times are at 3000 iterations.

Timing vs blur kernels

The time taken by our method remains mostly constant across vastly different blur kernel sizes. Fig. 9(b) shows the average time taken when we change the spatial blur while keeping the color blur constant. Fig. 9(c) shows the average time taken at different values of the color blur with the spatial blur held constant. All times are at 3000 iterations. There is only a weak dependence of time on the blur kernel sizes: the range of kernel sizes is large, hence the small time changes are negligible in practice.


The number of iterations in the gradient descent scheme affects the accuracy. Here, we study this effect for the training images of the Middlebury dataset. Fig. 10 (a) and (b) show the MAE and RMSE, which we calculated over all pixels, including occlusions, in all the training images in the dataset. The result at 3000 iterations corresponds to the one shown in Table 1, but the numbers differ because the Middlebury evaluation internally uses a weighting scheme, which is not used here. Both MAE and RMSE drop quickly within approximately the first 300 iterations, and after that the gains are smaller. The time taken grows linearly with the number of iterations; see Fig. 10(c). This allows us to easily trade off resolution, quality, and run time depending on the application.

(a) MAE pixel error
(b) RMSE pixel error
(c) Avg. time(s) vs iterations
Figure 10: Dependence of MAE, RMSE, and time on the number of iterations. (a-b) The errors reduce quickly for the first 300 iterations and then reduce gradually with further iterations. (c) The average time taken increases linearly. This provides a trade-off that can be chosen according to the application.
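The unweighted error measures used above can be written down in a few lines. The sketch below assumes NumPy and disparity maps stored as arrays; the function name `mae_rmse` is hypothetical, and the step that skips pixels whose ground truth is not finite is an assumption about how unknown ground-truth values are stored, not something stated in the text. No occlusion mask is applied, so occluded pixels with valid ground truth are included.

```python
import numpy as np

def mae_rmse(pred, gt):
    """Unweighted mean absolute error and root-mean-square error
    between a predicted and a ground-truth disparity map.

    Assumption: pixels with non-finite ground truth are unknown
    and are skipped; all other pixels count equally.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = np.isfinite(gt)
    err = pred[valid] - gt[valid]
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return mae, rmse

# Tiny example: one pixel is off by 4 disparity levels.
pred = np.array([[1.0, 2.0], [3.0, 5.0]])
gt = np.array([[1.0, 2.0], [3.0, 1.0]])
mae, rmse = mae_rmse(pred, gt)
```

Because RMSE squares each error before averaging, it is affected more strongly than MAE by a few large outliers, which is one reason to report both.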

5 Conclusion

We have presented a novel edge-aware solver that achieves scalable performance across a variety of tasks. Our method is an order of magnitude faster than the state of the art while performing at comparable accuracy. The approach is highly parallelizable and scales well with respect to image resolution, and unlike existing methods, its run time is nearly independent of the blur kernel size. A future step is to extend our approach to a multi-GPU setting and to use advanced optimization methods such as conjugate gradient descent to obtain faster convergence.


  • [1] Gastal, E.S., Oliveira, M.M.: Domain transform for edge-aware image and video processing. In: ACM Transactions on Graphics (ToG). Volume 30., ACM (2011)  69
  • [2] Chen, L.C., Barron, J.T., Papandreou, G., Murphy, K., Yuille, A.L.: Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4545–4554

  • [3] Krähenbühl, P., Koltun, V.: Efficient inference in fully connected crfs with gaussian edge potentials. In: Advances in neural information processing systems. (2011) 109–117
  • [4] Bleyer, M., Rhemann, C., Rother, C.: Patchmatch stereo - stereo matching with slanted support windows. In: BMVC. Volume 11. (2011) 1–11
  • [5] Levin, A., Lischinski, D., Weiss, Y.: Colorization using optimization. In: ACM Transactions on Graphics (ToG). Volume 23., ACM (2004) 689–694
  • [6] Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpolation of correspondences for optical flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1164–1172
  • [7] Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Computer Vision, 1998. Sixth International Conference on, IEEE (1998) 839–846
  • [8] Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on pattern analysis and machine intelligence 12(7) (1990) 629–639
  • [9] Fattal, R.: Edge-avoiding wavelets and their applications. ACM Transactions on Graphics (TOG) 28(3) (2009)  22
  • [10] Lu, J., Yang, H., Min, D., Do, M.N.: Patch match filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, IEEE (2013) 1854–1861
  • [11] Scharstein, D., Hirschmüller, H., Kitajima, Y., Krathwohl, G., Nešić, N., Wang, X., Westling, P.: High-resolution stereo datasets with subpixel-accurate ground truth. In: German Conference on Pattern Recognition, Springer (2014) 31–42
  • [12] Zbontar, J., LeCun, Y.: Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research 17(1-32) (2016)  2
  • [13] Weiss, B.: Fast median and bilateral filtering. In: ACM Transactions on Graphics (TOG). Volume 25., ACM (2006) 519–526
  • [14] Adams, A., Baek, J., Davis, M.A.: Fast high-dimensional filtering using the permutohedral lattice. In: Computer Graphics Forum. Volume 29., Wiley Online Library (2010) 753–762
  • [15] Chen, J., Paris, S., Durand, F.: Real-time edge-aware image processing with the bilateral grid. In: ACM Transactions on Graphics (TOG). Volume 26., ACM (2007) 103
  • [16] Yang, Q., Ahuja, N., Tan, K.H.: Constant time median and bilateral filtering. International Journal of Computer Vision 112(3) (2015) 307–318
  • [17] Elad, M.: On the origin of the bilateral filter and ways to improve it. IEEE Transactions on image processing 11(10) (2002) 1141–1151
  • [18] Durand, F., Dorsey, J.: Fast bilateral filtering for the display of high-dynamic-range images. In: ACM transactions on graphics (TOG). Volume 21., ACM (2002) 257–266
  • [19] Pham, T.Q., Van Vliet, L.J.: Separable bilateral filtering for fast video preprocessing. In: Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, IEEE (2005) 4–pp
  • [20] Paris, S., Durand, F.: A fast approximation of the bilateral filter using a signal processing approach. In: European conference on computer vision, Springer (2006) 568–580
  • [21] Zhang, M., Gunturk, B.K.: Multiresolution bilateral filtering for image denoising. IEEE Transactions on image processing 17(12) (2008) 2324–2333
  • [22] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: Slic superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence 34(11) (2012) 2274–2282
  • [23] Bódis-Szomorú, A., Riemenschneider, H., Van Gool, L.: Superpixel meshes for fast edge-preserving surface reconstruction. In: Proceedings CVPR 2015. (2015) 2011–2020
  • [24] Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.B.: Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics-TOG 28(3) (2009)  24
  • [25] Yang, Q., Wang, S., Ahuja, N.: Svm for edge-preserving filtering. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE (2010) 1775–1782
  • [26] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016)
  • [27] Xu, L., Ren, J., Yan, Q., Liao, R., Jia, J.: Deep edge-aware filters. In: International Conference on Machine Learning. (2015) 1669–1678
  • [28] Farbman, Z., Fattal, R., Lischinski, D., Szeliski, R.: Edge-preserving decompositions for multi-scale tone and detail manipulation. In: ACM Transactions on Graphics (TOG). Volume 27., ACM (2008)  67
  • [29] Barron, J.T., Adams, A., Shih, Y., Hernández, C.: Fast bilateral-space stereo for synthetic defocus. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 4466–4474
  • [30] Barron, J.T., Poole, B.: The fast bilateral solver. In: European Conference on Computer Vision, Springer (2016) 617–632
  • [31] Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision 47(1-3) (2002) 7–42
  • [32] Hirschmuller, H., Scharstein, D.: Evaluation of cost functions for stereo matching. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, IEEE (2007) 1–8
  • [33] Sun, D., Roth, S., Black, M.J.: Secrets of optical flow estimation and their principles. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE (2010) 2432–2439
  • [34] Khoshelham, K., Elberink, S.O.: Accuracy and resolution of kinect depth data for indoor mapping applications. Sensors 12(2) (2012) 1437–1454
  • [35] Ferstl, D., Reinbacher, C., Ranftl, R., Rüther, M., Bischof, H.: Image guided depth upsampling using anisotropic total generalized variation. In: Computer Vision (ICCV), 2013 IEEE International Conference on, IEEE (2013) 993–1000
  • [36] Jon, B.: (2008) Online; accessed 8-March-2018.