1. Introduction
The goal of image smoothing is to eliminate unimportant fine-scale details while maintaining primary image structures. This technique has a wide range of applications in computer vision and graphics, such as tone mapping, detail enhancement, and image abstraction.
Image smoothing has been extensively studied in the past. The early literature was dominated by filtering-based approaches [Perona and Malik, 1990; Tomasi, 1998; Weiss, 2006; Paris and Durand, 2006; Fattal, 2009; Gastal and Oliveira, 2011; Chen et al., 2007] due to their simplicity and efficiency. In recent years, smoothing algorithms based on global optimization have gained much popularity due to their superior smoothing results [Farbman et al., 2008; Min et al., 2014; Xu et al., 2011; Bi et al., 2015; Xu et al., 2012; Liu et al., 2017]. Despite the great improvements, however, their smoothing results are still not perfect, and no existing algorithm can serve as an image smoothing panacea for various applications. Moreover, these approaches are often very time-consuming. With the increasing power of modern GPUs and the enormous growth of deep convolutional neural networks (CNNs), there is an emerging interest in employing CNNs as surrogate smoothers in lieu of the costly optimization-based approaches [Xu et al., 2015; Liu et al., 2016; Fan et al., 2017; Chen et al., 2017a]. These methods train CNNs in a fully supervised manner where the target outputs are generated by existing smoothing methods. While substantial speedup can be achieved, they still produce (approximations of) extant smoothing effects.
In this work, we seek to generate flexible and superior smoothing effects by directly learning from data. We leverage a CNN to do so, such that our method not only produces learned smoothing effects that are more appealing, but also runs fast. However, the desired smoothing results (ground-truth labels) for supervising the training are difficult to obtain. Dense manual labeling of a large volume of training images is costly and cumbersome. To circumvent this issue, we design the training signal as an energy function, similar to the optimization-based methods, and train our method in an unsupervised, label-free setting.
We carefully designed our energy function to achieve quality smoothing effects in a unified unsupervised-learning framework. First, to explicitly fortify important image structures that may be weakened by the flattening operator, we include a criterion that minimizes the masked quadratic difference between two guidance maps computed from the input image and the smoothed estimate respectively. The guidance maps are formulated as edge responses, and the masks are computed using simple edge detection heuristics and can be manually modified further if desired. Second, we identified that many previous methods apply a fixed $L_p$ norm flattening criterion across the entire image, which may be detrimental to the smoothing quality. We instead introduce a spatially-adaptive $L_p$ flattening criterion whereby the specific value of $p$ is varied across images in accordance with the guidance maps. Given that a large $p$ tends to smooth out edges while a small $p$ largely preserves them, the guidance maps allow different image regions to receive the regularization most appropriate for their local structural conditions. Importantly, we can apply application-specific guidance maps, which allow for the seamless implementation of multiple different flattening effects.
We test our method on edge-preserving smoothing and various applications including image abstraction, pencil sketching, detail magnification, texture removal and content-aware image manipulation to show its effectiveness and versatility. Broadly speaking, the contributions of this paper can be distilled as follows:

We introduce an unsupervised learning framework for image smoothing. Unlike previous methods, we do not need ground-truth labels for training and we can jointly learn from any sufficiently diverse corpus of images to achieve the desired performance.

We are able to implement multiple different image smoothing solutions in a single framework, and obtain results comparable with or better than previous methods. To achieve this, we propose a novel image smoothing objective function built upon a spatially-adaptive $L_p$ flattening criterion and an edge-preserving regularizer.

Our new method is based on a convolutional neural network and its computational footprint is far below that of most previous methods. For example, processing a 1280×720 image takes only 5 ms on a modern GPU.


[Figure 2: smoothing results on the paraglider example. Panels: Input, Ours, SGF, SDF, $L_1$, BTLF, FGS, RGF, RTV, $L_0$, WLS, BLF.]
2. Related Work
As a fundamental tool for many computer vision and graphics applications, image smoothing has been extensively studied in the past decades. Filtering-based approaches, such as anisotropic diffusion [Perona and Malik, 1990], the bilateral filter [Tomasi, 1998] and many others [Weiss, 2006; Paris and Durand, 2006; Fattal, 2009; Kass and Solomon, 2010; Gastal and Oliveira, 2011; Chen et al., 2007], have been the dominant image smoothing solutions for a long time. The core idea of such methods lies in filtering each pixel with its local spatial neighbourhood, and these methods are usually very efficient.
Recently, algorithms using mathematical optimization for image smoothing have gained popularity due to their robustness, flexibility and, more importantly, the superiority of their smoothing results. For example, [Farbman et al., 2008] proposed an edge-preserving operator in a weighted least squares (WLS) optimization framework, which prevents local image regions from being oversharpened with an $L_2$ norm. Similar schemes have been achieved more efficiently by [Min et al., 2014; Liu et al., 2017]. These works are devoted to extracting and manipulating image details using image smoothing for various applications such as detail enhancement and HDR tone mapping.
On the other hand, [Xu et al., 2011] proposed a sparse gradient counting scheme in an optimization framework by minimizing the $L_0$ norm. The method of [Bi et al., 2015] aimed at producing almost ideally flattened images where sharp edges are also well preserved with an $L_1$ norm. These methods are particularly well-suited for preserving or enhancing sharp edges and removing low-amplitude details. They can be useful for some stylization effects or intrinsic image decompositions.
In the aforementioned applications, the image smoothing algorithms typically exploit only gradient magnitude as the main cue to discriminate primary image structures from details. [Xu et al., 2012] presented specifically designed relative total variation measures to extract meaningful structure, and [Ham et al., 2015] fused appropriate structures of static and dynamic guidance images into a nonconvex regularizer. Their goal is to remove fine-scale repetitive textures where the local gradient can still be significant, which cannot be easily achieved by the aforementioned smoothing approaches.
This regularization idea can also be interpreted as an image prior formulated in a deep network [Ulyanov et al., 2018] or an image denoising engine [Romano et al., 2017]. Discussions about $L_p$ norm regularization can also be found in [Prasath et al., 2015; Chung and Vese, 2009; Bach et al., 2012]. Interestingly, [Mrázek et al., 2006] observe that even image filters can all be derived from the minimization of a single energy functional with data and smoothness terms.
Most of the optimization-based approaches are time-consuming, as they typically require solving large-scale linear systems (or others). Therefore, some recent methods such as [Xu et al., 2015; Liu et al., 2016; Fan et al., 2017; Chen et al., 2017a; Gharbi et al., 2017] were proposed to speed up existing smoothing operators. These methods train a deep neural network using ground-truth smoothed images generated by existing smoothing algorithms. In contrast, our neural network is trained by optimizing an objective function through the network in an unsupervised fashion. Note that these previous deep models and ours are fundamentally different in many respects: target goal, training data, essential algorithm logic, etc. Since they aim at approximating traditional image smoothing algorithms, while ours creates novel and unique smoothing effects, our results are not directly comparable to theirs in terms of the quality of the smoothed images.
Deep learning has been applied to many image manipulation tasks [Chen et al., 2017b; Fan et al., 2018; He et al., 2018; Chen et al., 2017, 2018]. However, most previous works treat deep learning as a regression or classification tool. In this paper, we apply a deep neural network as an optimization solver in a label-free setup.
3. Approach
In this section, we introduce our proposed formulation, including an edge-preserving criterion in Section 3.1 and a spatially adaptive $L_p$ flattening criterion in Section 3.2, which account for structure preservation and detail elimination respectively. We then describe how deep learning is leveraged for optimizing the proposed objective in Section 3.3.
3.1. Objective Function Definition
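Image smoothing aims at diminishing unimportant image details while maintaining primary image structures. To achieve this using energy minimization, our overall energy function for image smoothing is formulated as
$$\mathcal{E} = \mathcal{E}_d + \lambda_f \cdot \mathcal{E}_f + \lambda_e \cdot \mathcal{E}_e, \tag{1}$$
where $\mathcal{E}_d$ is the data term, $\mathcal{E}_f$ is the regularization term and $\mathcal{E}_e$ is the edge-preserving term. $\lambda_f$ and $\lambda_e$ are constant balancing weights.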
The data term minimizes the difference between the input image and the smoothed image to ensure structural similarity. Denoting the input image by $I$ and the output image by $T$, both in RGB color space, a simple data term can be defined as
$$\mathcal{E}_d = \frac{1}{N} \sum_{i=1}^{N} \| T_i - I_i \|_2^2, \tag{2}$$
where $i$ denotes the pixel index and $N$ is the total number of pixels.
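As an illustration of Equations 1 and 2, the following is a minimal NumPy sketch rather than the authors' implementation. The function names `data_term` and `total_energy`, the argument names, and the height × width × channels array layout are assumptions made for this example; the flattening and edge-preserving energies are taken as already-computed numbers.

```python
import numpy as np

def data_term(T, I):
    """Mean over pixels of the squared L2 color difference (cf. Equation 2)."""
    T = np.asarray(T, dtype=np.float64)
    I = np.asarray(I, dtype=np.float64)
    n_pixels = I.shape[0] * I.shape[1]  # N: number of pixels, not of array entries
    return float(np.sum((T - I) ** 2) / n_pixels)

def total_energy(e_d, e_f, e_e, lambda_f, lambda_e):
    """Weighted sum of data, flattening and edge-preserving terms (cf. Equation 1)."""
    return e_d + lambda_f * e_f + lambda_e * e_e

# Usage: an output identical to the input has zero data energy.
I = np.zeros((4, 4, 3))
print(data_term(I, I))        # zero for identical images
print(data_term(I + 0.1, I))  # grows with the color offset
```

Dividing by the pixel count rather than the number of array entries keeps the sketch in line with the per-pixel squared norm in the text.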
Some important edges may be missed or weakened during the smoothing process since the goal of color flattening naturally conflicts with edge preserving to some extent. To address this issue, we propose an explicit edge-preserving criterion which preserves important edge pixels.
Before presenting this criterion, we first introduce the concept of a guidance image, which is formulated as the edge response of an image in appearance. A simple form of edge response is the local gradient magnitude:
$$E_i(I) = \sum_{j \in \mathcal{N}(i)} \Big| \sum_{c} \left( I_{i,c} - I_{j,c} \right) \Big|, \tag{3}$$
where $\mathcal{N}(i)$ denotes the neighborhood of point $i$ and $c$ denotes the color channel of the input image $I$. A similar guidance edge map of the output smooth image $T$ can also be calculated as $E(T)$.
Our edge-preserving criterion is defined by minimizing the quadratic difference between the edge responses of the guidance edge images $E(I)$ and $E(T)$. Let $B$ be a binary map where $B_i = 1$ indicates an important edge point and $B_i = 0$ otherwise. Our edge-preserving term is defined as
$$\mathcal{E}_e = \frac{1}{N_e} \sum_{i=1}^{N} B_i \cdot \| E_i(T) - E_i(I) \|_2^2, \tag{4}$$
where $N_e = \sum_{i=1}^{N} B_i$ is the total number of important edge points.
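The guidance map of Equation 3 and the masked term of Equation 4 can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: the neighborhood is taken to be the 4-connected one, borders are handled by replicating edge pixels, and the names `edge_response` and `edge_preserving_term` are hypothetical.

```python
import numpy as np

def edge_response(img):
    """Sum over 4-connected neighbors of |sum_c (I_ic - I_jc)| (cf. Equation 3)."""
    img = np.asarray(img, dtype=np.float64)
    channel_sum = img.sum(axis=2)                # sum over color channels
    padded = np.pad(channel_sum, 1, mode="edge")  # assumption: replicate borders
    h, w = channel_sum.shape
    resp = np.zeros((h, w))
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        neighbor = padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        resp += np.abs(channel_sum - neighbor)
    return resp

def edge_preserving_term(T, I, B):
    """Masked mean squared difference of edge responses (cf. Equation 4)."""
    B = np.asarray(B, dtype=np.float64)
    n_edges = B.sum()
    if n_edges == 0:
        return 0.0  # no important edges marked: nothing to preserve
    diff = edge_response(T) - edge_response(I)
    return float(np.sum(B * diff ** 2) / n_edges)

# Usage: a vertical step edge that is completely flattened in the output.
I = np.zeros((4, 4, 3))
I[:, 2:, :] = 1.0
B = edge_response(I) > 0          # stand-in mask; the paper uses an edge detector
T = np.zeros_like(I)
print(edge_preserving_term(T, I, B))
```

The mask here is derived from the input's own edge response only to keep the example self-contained; in the text the binary map comes from a separate edge detection step.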
The definition of “important edges” is subjective and varies across applications. The ideal way to obtain the binary maps $B$ would be manual labeling according to user preference. However, pixel-level manual labeling is rather labor-intensive. In this paper, we leverage a heuristic yet effective method to detect edges. Since this process is not our main contribution, we defer the detailed description of this edge detector to the supplemental material. A few examples of the detected major image structures are shown in Section 7.1. Note also that any more sophisticated edge detection algorithm can be incorporated according to user preference.
Given sufficient training images with classified edge points, the deep network will implicitly learn the edge importance through minimizing the edge-preserving term and reflect such information in the smooth images. Figure 3 demonstrates an extremely difficult case of a parachute. As can be seen, without the edge-preserving criterion, some thin yet semantically important structures like the white rope are smoothed out in the output images. In contrast, the result optimized with our full criteria maintains these structures very well.



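[Figure 3: ablation examples, each shown as Input, ablated result, Ours. One example each for removing the $L_{p_{large}}$ norm, removing the $L_{p_{small}}$ norm, and removing the edge-preserving (EP) criterion.]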

3.2. Dynamic Spatially-Variant Flattening Criterion
We now present our new smoothness/flattening term with spatially-variant $L_p$ norms on the image in order to gain better quality and more flexibility.
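To remove unwanted image details, the smoothing or flattening term advocates smoothness for the solution by penalizing the color differences between adjacent pixels:
$$\mathcal{E}_f = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_h(i)} w_{i,j} \cdot \| T_i - T_j \|_{p_i}^{p_i}, \tag{5}$$
where $\mathcal{N}_h(i)$ denotes the adjacent pixels of $i$ in its $h \times h$ window, $w_{i,j}$ denotes the weight for the pixel pair and $\| \cdot \|_{p}^{p}$ is an $L_p$ norm. (With slight abuse of terminology, we use $L_p$ norm to refer to the $L_p$ norm raised to the $p$-th power, i.e., it indicates $\| \cdot \|_p^p$ as opposed to $\| \cdot \|_p$.)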
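The weight $w_{i,j}$ can be calculated from either color affinity or spatial affinity (or their combination), which are defined respectively as
$$w^r_{i,j} = \exp\left( - \frac{\sum_{c} (I_{i,c} - I_{j,c})^2}{2 \sigma_r^2} \right), \tag{6}$$
$$w^s_{i,j} = \exp\left( - \frac{(x_i - x_j)^2 + (y_i - y_j)^2}{2 \sigma_s^2} \right), \tag{7}$$
where $\sigma_r$ and $\sigma_s$ are the standard deviations for the Gaussian kernels computed in color space and spatial space respectively, $c$ denotes the image channel (in this paper we use the YUV color space to compute weights in Equation 6), and $x$ and $y$ denote pixel coordinates.
Determining the image regions for different regularizers is not trivial. To help locate these regions in our algorithm, we leverage the guidance images to define the value of $p_i$ and its corresponding weight $w_{i,j}$ for each image pixel as
$$(p_i,\; w_{i,j}) = \begin{cases} (p_{large},\; w^s_{i,j}) & \text{if } E_i(I) < c_1 \text{ and } E_i(T) - E_i(I) > c_2, \\ (p_{small},\; w^r_{i,j}) & \text{otherwise,} \end{cases} \tag{8}$$
where $p_{large}$ and $p_{small}$ are two representative values for $p$, and $c_1$ and $c_2$ denote two positive thresholds. We set $p_{large} = 2$ and $p_{small} = 0.8$ throughout this paper. It can be seen that the $p$ value distribution is not determined a priori from the input image, but is conditioned on the output image. We explain this strategy in the following two points.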
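To make the per-pixel switching of Equations 5 to 8 concrete, here is a small NumPy sketch under several assumptions that are not from the paper: the function names are hypothetical, the color affinity is computed directly on the given channels instead of YUV, the window is described by a `radius` instead of the full window size, and the scale factor described later in this section is applied to the spatial-affinity weight as `alpha`. All numeric settings are passed in by the caller.

```python
import numpy as np

def select_norm(E_I, E_T, c1, c2, p_large, p_small):
    """Per-pixel exponent: p_large where a low input edge response is
    strongly heightened in the output, p_small elsewhere (cf. Equation 8)."""
    use_large = (E_I < c1) & (E_T - E_I > c2)
    return np.where(use_large, p_large, p_small), use_large

def flattening_term(T, I, p_map, use_large, sigma_r, sigma_s, alpha, radius):
    """Weighted sum of |T_i - T_j|^{p_i} over a square window (cf. Equation 5)."""
    T = np.asarray(T, dtype=np.float64)
    I = np.asarray(I, dtype=np.float64)
    H, W, _ = T.shape
    total = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if (dy == 0 and dx == 0) or abs(dy) >= H or abs(dx) >= W:
                continue
            # Pixels i (yi, xi) whose neighbor j = i + (dy, dx) lies inside the image.
            yi = slice(max(0, -dy), H - max(0, dy))
            yj = slice(max(0, dy), H - max(0, -dy))
            xi = slice(max(0, -dx), W - max(0, dx))
            xj = slice(max(0, dx), W - max(0, -dx))
            color_d2 = np.sum((I[yi, xi] - I[yj, xj]) ** 2, axis=2)
            w_color = np.exp(-color_d2 / (2 * sigma_r ** 2))                       # Equation 6
            w_space = alpha * np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))  # Equation 7, scaled
            w = np.where(use_large[yi, xi], w_space, w_color)
            p = p_map[yi, xi][..., None]
            penalty = (np.abs(T[yi, xi] - T[yj, xj]) ** p).sum(axis=2)
            total += np.sum(w * penalty)
    return total / (H * W)
```

A typical call would compute `E_I` and `E_T` with an edge-response function, pass them to `select_norm`, and feed the resulting `p_map` and `use_large` mask into `flattening_term`. Because the mask depends on the current output `T`, it would be recomputed at every optimization step.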
Suppressing artifacts caused by single regularizer
The intuition behind Equation 8 is that when we minimize the energy function, the $L_{p_{small}}$ norm is applied until some oversharpened spurious structures appear in the output image due to the piecewise constant effect caused by the $L_{p_{small}}$ regularizer, at which time the $L_{p_{large}}$ norm is applied to suppress the artifact. These spurious structures are identified as those whose edge response at pixel $i$ is low on the original image (as characterized by $E_i(I) < c_1$) but significantly heightened on the output image (per $E_i(T) - E_i(I) > c_2$). In Figure 3, we demonstrate a few smoothing results optimized with our objective function. As can be seen, without the $L_{p_{large}}$ norm, the method achieves strong smoothing effects but also yields staircasing artifacts on the lady’s cheek and shoulder. On the other hand, without the $L_{p_{small}}$ norm, the optimized image is very blurry and many important structures are not well preserved due to the $L_{p_{large}}$ regularizer. In contrast, the results optimized with our proposed full criteria are much more visually pleasing.
Enabling different applications via specialized guidance images
Our guidance image and spatially-variant flattening norm also enable us to achieve flexible smoothing effects for different applications. For example, if the goal is to remove a certain type of image structure such as small-scale textures, we can simply eliminate all the edge points belonging to these textures in the guidance image by setting their values to zero. This way, the edge responses in these regions of the output image will always be larger, and the $L_{p_{large}}$ norm will be applied to remove these textures. Later we show two such applications: texture removal (Section 7.3) and content-aware image manipulation (Section 7.4). All other results shown in this paper are obtained with raw guidance images computed via Equation 3.
Note that we adopt spatial affinity to calculate the weights for regions with an $L_{p_{large}}$ norm, as it is more effective for edge suppression. Color affinity is utilized for $L_{p_{small}}$ norm regions for a better flattening effect. Since the $L_{p_{large}}$ and $L_{p_{small}}$ norms regularize the images differently, we amplify the weight of the $L_{p_{large}}$ norm with a scale factor $\alpha$ for balance. We empirically determine the two values of $p$, which represent the regularization for strong flattening and blurring effects in a general sense. They are replaceable with other alternatives.
It can be seen that our spatially variant $L_p$ norm is not fixed, but changes dynamically during the iterative optimization (training) procedure based on the output image. Although we do not provide a theoretical proof of convergence, we have found empirically that such a procedure converges and the $p$ value distribution stabilizes in the end. We note that [Zhu et al., 2014] also employ data-guided sparsity in their work, but their regularization is static while ours changes dynamically with the output image.
3.3. Deep Learning based Optimization
As the whole objective function is differentiable with respect to the output smooth image, we implement it as the loss layer in a deep learning framework. The loss function is optimized with gradient descent through a deep neural network. The whole training process is unsupervised and uses a large number of unlabeled natural images. The deep network implicitly learns the optimization procedure, and once the network is trained, it only requires one forward pass to predict the smooth image without further optimization steps.
We now introduce the architecture of our deep neural network, which is used for minimizing the defined energy function. Inspired by the previous work [Yu and Koltun, 2016], which enlarges the receptive field with dilated convolutions for semantic segmentation, and [Kim et al., 2016], which uses a very deep convolutional neural network equipped with residual learning for super-resolution, we design a fully convolutional network (FCN) equipped with dilated convolutions and skip connections for our task.
Basic structure description
Figure 4 is a schematic description of our FCN. The network contains 26 convolution layers, all of which use 3×3 convolution kernels and output 64 feature maps (except for the last one, which produces a 3-channel image). All the convolution operations are followed by batch normalization [Ioffe and Szegedy, 2015] and ReLU activation except for the last layer. The third conv layer downsamples the feature maps by half using a stride of 2, and the third-from-last layer is a deconvolution (aka fractionally-strided convolution) layer recovering the original image size. The middle 20 conv layers are organized as 10 residual blocks [He et al., 2016]. A full description of the detailed network structure is presented in the supplemental material.
Large receptive field in dilated convolutions
As image smoothing requires contextual information across wide regions, we increase the receptive field of our FCN by using dilated convolutions with exponentially increasing dilation factors except for the last residual block, similar to [Yu and Koltun, 2016]. Specifically, any two consecutive residual blocks share one dilation factor, which is doubled in the next two residual blocks. This is an effective and efficient strategy to increase the receptive field without sacrificing image resolution: with exponentially increasing dilation factors, all the points in an $n \times n$ grid can be reached from any location in a logarithmic number of steps, $\log(n)$. Similar strategies have been used in parallel GPU implementations of some traditional algorithms, such as Voronoi diagrams [Rong and Tan, 2006] and PatchMatch [Fan et al., 2015].
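The effect of the doubling schedule on the receptive field can be checked with a few lines of arithmetic. The sketch below assumes, beyond what the text states, that the first pair of blocks uses dilation 1 and that the last residual block falls back to dilation 1; the exact factors are given only in the supplemental material. It counts only the 20 convolutions inside the residual blocks, measured on the downsampled feature map, and the function names are hypothetical.

```python
def dilation_schedule(num_blocks=10):
    """One factor per residual block: shared by two consecutive blocks, then doubled.
    Assumption: the last block uses dilation 1."""
    factors = [2 ** (b // 2) for b in range(num_blocks - 1)]
    return factors + [1]

def receptive_field(dilations, convs_per_block=2, kernel=3):
    """Receptive field (in feature-map pixels) of stacked dilated convolutions.
    Each k x k convolution with dilation d widens the field by (k - 1) * d."""
    rf = 1
    for d in dilations:
        rf += convs_per_block * (kernel - 1) * d
    return rf

schedule = dilation_schedule()
print(schedule)                      # [1, 1, 2, 2, 4, 4, 8, 8, 16, 1]
print(receptive_field(schedule))     # 189
print(receptive_field([1] * 10))     # 41 without dilation
```

Under these assumptions the dilated stack covers 189 feature-map pixels per side versus 41 for the same depth without dilation, which illustrates why dilation is preferred over adding layers or further downsampling.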
Residual image learning
In the image smoothing task, the input and output images are highly correlated. In order to ease the learning, instead of directly predicting a smoothed image we predict a residual image and generate the final result via pointwise summation of the residual image and the raw input image. Such a residual image learning design avoids the color attenuation issue observed in previous works (e.g., [Kim et al., 2016]).
4. Implementation Details
Our FCN and energy function are implemented in the Torch framework and optimized with mini-batch gradient descent. The batch size is set to 1. The network weights are randomly initialized using the method of [He et al., 2015]. The Adam [Kingma and Ba, 2015] algorithm is used for training with the learning rate set to 0.01. We train the network for 30 epochs, which takes about 8 hours on an NVIDIA GeForce 1080 GPU.
Training and testing data
Since our network does not require ground-truth smooth images for training, any image can be used to train it. For better generalization to natural images, we use the PASCAL VOC dataset [Everingham et al., 2010], which contains about 17,000 images, to train the network. These images were collected from the Flickr photo-sharing website and exhibit a wide range of scenes under a large variety of viewing conditions. We crop the images to a size of 224×224 to accelerate the training process without jeopardizing the smoothing quality. Once the network is trained, we run it on images outside of PASCAL VOC and evaluate the results.
Parameter specifics:
The parameters in our proposed objective function are specified by default as follows: 0.1 ($\sigma_r$), 7 ($\sigma_s$), 1 ($\lambda_f$), 0.1 ($\lambda_e$), 5 ($\alpha$), 20 ($c_1$), 10 ($c_2$), 21 ($h$). To achieve the optimal performance on each individual application, a small subset of parameters may be tweaked, which is discussed in Section 7. However, note that these parameters are only tuned based on the application type, not on any particular image. All the images in the same application shown in both the paper and the supplemental material are generated with the same set of parameter values.
5. Method Analysis and Discussion





[Figure 5: smooth images optimized by different solvers. Panels: Input, Adam, IRLS, Ours.]

In this section, we first compare the smooth images optimized with our deep learning solver and traditional numerical solvers, followed by the analysis of convergence of different optimizers and the potential reason why the deep learning solver achieves more visually pleasing results than the others for our problem.

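[Figure 6: loss curves of the different solvers for three fixed $p$ distributions: $L_2$ norm, $L_{0.8}$ norm, and half/half.]
[Figure 7: visual results of the different solvers. Rows: $L_2$ norm, $L_{0.8}$ norm, half/half, adaptive $L_p$ norm. Columns: IRLS, Adam, Ours, Ours (overfit).]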

5.1. Visual results of different optimizers
To verify the efficacy of our deep learning solver, as opposed to directly minimizing Equation 1 using a traditional optimization algorithm, we compare it against two popular representative approaches, Adam and IRLS. Adam [Kingma and Ba, 2015] is a stochastic gradient descent-based optimization method that can be generically applied to nonconvex differentiable functions. Since our proposed objective is differentiable (although $L_p$ norms are not differentiable on a set of measure zero, in such cases subgradients can easily serve as a natural surrogate), Adam is a very straightforward approach for minimizing it, at least locally. Likewise, iteratively reweighted least squares (IRLS) [Holland and Welsch, 1977] represents a classical tool for minimizing energy functions with non-quadratic forms. Several image smoothing papers [Min et al., 2014; Xu et al., 2012] also employ IRLS allied with objectives regularized by $L_p$ norms.
To utilize IRLS for optimization, a tight quadratic upper bound of the energy has to be defined, which is trivial to accomplish for $L_p$ terms, as has been done in the past. However, the proposed non-quadratic edge-preserving criterion cannot be bounded in this way, making IRLS problematic. Therefore, for the results reported in this section, the energy function is formed from only the data term and the spatially adaptive flattening term for a fair comparison across all three methods. Smooth images optimized by the different solvers are shown in Figure 5. Note that Adam, as a gradient-based method, requires 100 iterations to converge, while IRLS only requires about 10 iterations given that it applies second-order information in optimizing the quadratic upper bound.
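The quadratic upper bound used by IRLS can be illustrated on a toy problem. The sketch below is a simplified 1D analogue, not the 2D solver compared in this section: it minimizes a data term plus an $L_p$ penalty on neighboring differences, and at each step replaces the penalty $(d^2 + \epsilon)^{p/2}$ by the quadratic bound with weight $\frac{p}{2}(d_0^2 + \epsilon)^{p/2 - 1}$, which is valid for $0 < p \le 2$. The function names, the smoothing constant `eps`, and the dense linear solve are assumptions made for brevity.

```python
import numpy as np

def smoothed_energy(t, s, lam, p, eps):
    """Data term plus eps-smoothed L_p penalty on neighboring differences."""
    d = np.diff(t)
    return float(np.sum((t - s) ** 2) + lam * np.sum((d ** 2 + eps) ** (p / 2)))

def irls_smooth_1d(s, lam, p, iters=10, eps=1e-6):
    """IRLS for a 1D signal: each step minimizes a quadratic upper bound."""
    s = np.asarray(s, dtype=np.float64)
    n = s.size
    D = np.diff(np.eye(n), axis=0)  # (n-1, n) forward-difference operator
    t = s.copy()
    energies = [smoothed_energy(t, s, lam, p, eps)]
    for _ in range(iters):
        d = D @ t
        # Weights of the quadratic majorizer at the current estimate.
        w = (p / 2) * (d ** 2 + eps) ** (p / 2 - 1)
        A = np.eye(n) + lam * D.T @ (w[:, None] * D)
        t = np.linalg.solve(A, s)   # exact minimizer of the quadratic bound
        energies.append(smoothed_energy(t, s, lam, p, eps))
    return t, energies

# Usage: a step with a small ripple; the energy never increases across iterations.
signal = np.concatenate([np.zeros(8), np.ones(8)]) + 0.05 * np.sin(np.arange(16))
smooth, energies = irls_smooth_1d(signal, lam=0.5, p=0.8)
print(energies[0], energies[-1])
```

For `p = 2` the weights are constant, so the first linear solve already gives the minimizer, which matches the one-step behavior expected of IRLS on a quadratic objective.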
In general, both the Adam and IRLS results are less satisfactory than those of our deep neural network solver. For example, with Adam some spurious staircasing edges are still generated as undesirable visual artifacts. Likewise, for IRLS we observe oversharpening side effects, although in places not quite as severe as with Adam. Also, the magnified local regions shown at the bottom of each figure display areas where the color intensity varies gradually, and both Adam and IRLS fail to smooth these areas well.
5.2. Performance analysis of different optimizers on fixed distribution
We now analyze the performance of these three optimizers by comparing their convergence curves. Since our proposed objective function changes adaptively based on the output images (per Equations 5 and 8), comparing the convergence trends of the different optimizers on it is not straightforward. Thus we test three representative loss functions with a fixed $p$ distribution map in $\mathcal{E}_f$, whose optimization difficulty gradually increases. They are variants of our objective function, where we replace the adaptively changing regularizer with only the $L_2$ norm, only the $L_{0.8}$ norm, or a fixed combination of the two. Following the previous ablation study, we disable the edge-preserving term for IRLS. The loss values are averaged over 40 test images and are shown in Figure 6. The corresponding visual results are shown in Figure 7. Note that these loss curves do not illustrate the training process of our method; for our method, the plotted values are constants computed on the testing images.
With the fixed $L_2$ norm in $\mathcal{E}_f$, the whole objective function becomes convex and quadratic. Thus IRLS is able to achieve the optimal result in one step. Adam is slightly better than our learned deep network. However, the smoothing results of both methods are relatively far from optimal. According to Figure 7, IRLS produces the blurriest images regularized by the $L_2$ norm. When the objective function contains the $L_{0.8}$ norm in $\mathcal{E}_f$, it becomes nonconvex. From the middle loss plot, we can see that IRLS and Adam achieve similar energy values in the end, and the deep learning solver does not obtain a loss value as low as theirs. Figure 7 shows that the smoothing results of all the methods appear to be more piecewise constant. Finally, we demonstrate a case where the flattening criterion uses one of the two norms in the left half of the image and the other in the right half. Different from our dynamically changing norm, the $p$ distribution in this case is fixed for all images and iterations. Figure 6 shows that IRLS achieves the lowest energy value (to make the results more presentable, we slightly modified the objective function for IRLS and only show its final loss in this case), followed by Adam and the deep learning solver.
To understand why the traditional optimizers achieve lower energy values on the above three objective functions, we first illustrate the workflow of both the traditional numerical solvers and our deep learning solver in Figure 8. As can be seen, given a particular image, traditional numerical solvers work by iteratively optimizing this single image. In contrast, the deep learning solver directly predicts its result from a one-forward-pass mapping, which is learned from a large corpus of training data without any prior information on this specific image. Thus traditional optimizers have an advantage in achieving lower energy with a fixed $p$ distribution, as verified by the above three loss curves.
To further test the above hypothesis, we overfit our network by training on a single image, and compare both the final loss and the visual results. This way, the deep learning solver works similarly to traditional solvers. As can be seen, our deep learning solver with overfitting is able to achieve much lower energy, especially in one of the single-norm cases, where our overfitting results are almost identical to those of Adam. Likewise, Figure 7 shows that the visual results obtained with the overfitting solver are much closer to those of IRLS and Adam. For example, there is a clear separation line between the two regularized regions for the half/half case, which is not present in the results of the original deep learning solver.
Note the loss functions defined in this subsection are used to analyze the performance of different solvers. They are not the actual loss function used for our image smoothing task.
5.3. Performance analysis of different optimizers on our dynamic distribution
In our proposed flattening criterion, the $p$ distribution changes adaptively based on the output images. The whole objective function is highly complex and the loss curve is not guaranteed to converge. Therefore, comparing the loss curves of the different solvers is less informative. The last row of Figure 7 shows an additional set of results obtained under our adaptively changing $L_p$ norm. It can be observed that staircasing artifacts exist in the results of IRLS, Adam, and the overfitting-based deep learning solver alike. In contrast, our deep learning solver generates very smooth results with no such artifacts.
Given each specific $p$ distribution map in each iteration, the traditional numerical solvers still tend to “overfit” that unique distribution map for its optimal results, which accordingly produces spurious edges that separate the different regularizations, just as in the half/half case. In contrast, the disadvantage of the deep learning solver exposed in Section 5.2 becomes an advantage in the presence of a dynamically changing norm distribution. Benefiting from the large corpus of training data, the deep learning solver incorporates the learned implicit combination of the $L_2$ and $L_{0.8}$ norms into the deep network and reflects this combination, instead of a fixed regularizer, in each pixel of the smoothed images. It is able to generate more visually pleasing results, as shown in both Figures 5 and 7.
Therefore, we argue that what matters for solving the proposed objective function and obtaining better smoothing results is the joint optimization over a large corpus of images, instead of over any particular image. In this specific problem, the deep learning solver plays a critical role. Note that although many empirical experiments have been conducted above, rigorous theoretical analysis is still lacking. Understanding and explaining deep neural networks remain open problems for follow-up research.
6. Experimental Results
In this section, we first conduct ablation studies to analyze the influence of the parameters and the network structure on the results. Afterwards, we compare our results with previous methods in both visual quality and running time.
[Figure 9: results under different parameter settings and without residual learning. Panels include Input and w/o Residual Learning. Caption: By adjusting the weights of the energy terms, we can gradually change the smoothness strength. As can be seen on the left, the result predicted by the model without the residual connection exhibits noticeable color attenuation, which does not exist in our results with residual image learning. Photo courtesy of Flickr user Rachel Hinman.]
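Table 1. Running time in seconds at three image resolutions. Traditional smoothing algorithms: SGF through $L_1$; approximation CNNs: Xu, Liu, Fan; ours: GPU and CPU.
Resolution | SGF | BLF | RGF | Tree | $L_0$ | RTV | WLS | SDF | $L_1$ | Xu | Liu | Fan | Ours (GPU) | Ours (CPU)
QVGA (320×240) | 0.05 | 0.03 | 0.22 | 0.05 | 0.17 | 0.41 | 0.70 | 4.99 | 32.18 | 0.23 | 0.07 | 0.008 | 0.003 | 0.010
VGA (640×480) | 0.15 | 0.12 | 0.73 | 0.42 | 0.66 | 1.80 | 3.34 | 19.19 | 212.07 | 0.76 | 0.14 | 0.009 | 0.004 | 0.011
720p (1280×720) | 0.25 | 0.34 | 1.87 | 2.08 | 2.43 | 5.74 | 13.26 | 66.14 | 904.36 | 2.16 | 0.33 | 0.010 | 0.005 | 0.012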
6.1. Ablation Study
Effect of parameter control
In our method, the main parameters to tune are $\lambda_f$ and $\lambda_e$. Here we analyze the results of our network trained under different settings of these two parameters; a group of such visual results is shown in Figure 9. As can be seen from the first row, altering the weight $\lambda_e$ of the edge-preserving term influences the image structures. In the second row, tweaking the weight $\lambda_f$ of the flattening term controls the smoothness of the predicted images. While enhancing the smoothness with a large $\lambda_f$, we observe gradually destroyed structures, e.g. the ground tiles.
Effect of residual image learning
We analyze the importance of residual image learning by comparing the results with and without this component in our network. As shown in Figure 9, the smooth image generated without residual learning exhibits obvious color degradation: it appears more orange than the raw input image. In contrast, the smooth images predicted by our complete network with residual image learning do not have this issue. For the image smoothing task, the input and output images should be highly correlated. However, it can be difficult to maintain the color information of the image after many convolution operations in a deep neural network like ours. Thus we propose to learn the residual image and combine it with the input image to resolve this issue.
[Figure: image abstraction and pencil drawing results. Panels: Input, WLS, Ours.]
6.2. Comparison with Previous Methods
We compare the proposed method with previous ones in terms of the smoothing effect and speed. More comparisons on different applications can be found in Section 7.
Smoothing effect comparison
Figure 2 compares the smoothing results of our method and ten existing methods: [Zhang et al., 2015] (SGF), [Ham et al., 2015] (SDF), [Bi et al., 2015] (L1), [Cho et al., 2014] (BTLF), [Min et al., 2014] (FGS), [Zhang et al., 2014] (RGF), [Xu et al., 2012] (RTV), [Xu et al., 2011] (L0), [Farbman et al., 2008] (WLS) and [Tomasi, 1998] (BLF). Note that these algorithms may be designed for different applications, so their goals may differ slightly. On this difficult example, our method produces an outstanding edge-preserving flattening result: it not only achieves pleasing flattening effects in regions of low amplitude (e.g., the sea), but also preserves the high-contrast structures well, especially the thin but salient edges (e.g., the ropes of the paraglider).
To our knowledge, there is no benchmark or dataset for quantitatively evaluating the performance of image smoothing algorithms; visual inspection is still the principal means of evaluation. To demonstrate the robustness of our method and its good performance on natural images with vastly different contents and capturing conditions, we present visual results on over 100 images in the supplemental material, without any parameter tweaking for particular images.
Running time comparison
Table 1 compares the running time of our method and some previous methods, including both traditional image smoothing algorithms [Zhang et al., 2015; Tomasi, 1998; Zhang et al., 2014; Bao et al., 2014; Xu et al., 2011; Xu et al., 2012; Farbman et al., 2008; Ham et al., 2015; Bi et al., 2015] and some recent methods [Xu et al., 2015; Liu et al., 2016; Fan et al., 2017] that apply neural networks to approximate the results of existing smoothing algorithms. Traditional image smoothing methods are based on either filtering techniques [Ham et al., 2015; Tomasi, 1998; Zhang et al., 2014; Bao et al., 2014] or mathematical optimization [Bi et al., 2015; Xu et al., 2011; Farbman et al., 2008; Xu et al., 2012; Ham et al., 2015]. While the latter category has drawn much attention in recent years and often produces high-quality results, the optimization procedure (e.g., solving large-scale linear systems iteratively) can be very time-consuming. For example, the state-of-the-art method of [Bi et al., 2015] takes about 15 minutes to process a 1280×720 image. Compared to these methods, ours runs significantly faster: it takes only a few milliseconds for a 1280×720 image with the aid of a GPU. Even on a CPU, with an efficient parallel implementation of our network structure (our CPU version is implemented in MXNet with the NNPACK module), it still runs in at most tens of milliseconds, facilitating real-time applications.
Compared to the neural network approximators [Xu et al., 2015; Liu et al., 2016; Fan et al., 2017], our method not only generates novel, unique smoothing effects that allow better results in various applications (see Section 7), but also runs faster. (The running time of [Fan et al., 2017] reported in their paper includes both generating images of particular sizes (per Table 1) and running the network; we excluded the former for a fair comparison, so their numbers here are lower than reported.) Note that, besides running a neural network, [Xu et al., 2015] also employs a post-processing optimization step and [Liu et al., 2016] leverages a recursive 1D filter, both of which slow down the whole algorithm.
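To put the numbers quoted above in perspective, the short sketch below computes slowdown factors relative to our GPU timing for a 1280×720 image. The values are copied from the 720p row of Table 1; the unit is assumed to be seconds, which agrees with the "about 15 minutes" quoted for [Bi et al., 2015]. The helper name `speedups` is illustrative only.

```python
# Running times for a 1280x720 image, copied from the 720p row of Table 1.
# Seconds are an assumed unit (904.36 s matches the "about 15 minutes" in the text).
TIMES_720P = {
    "Bi et al. 2015": 904.36,
    "Xu et al. 2015": 2.16,
    "Liu et al. 2016": 0.33,
    "Fan et al. 2017": 0.010,
    "Ours (CPU)": 0.012,
    "Ours (GPU)": 0.005,
}


def speedups(times, reference="Ours (GPU)"):
    """Return how many times slower each method is than the reference entry."""
    base = times[reference]
    return {name: t / base for name, t in times.items() if name != reference}


# Print the methods from slowest to fastest.
for name, factor in sorted(speedups(TIMES_720P).items(), key=lambda kv: -kv[1]):
    print(f"{name:>16}: {factor:10.1f}x slower than Ours (GPU)")
```

Under these assumptions the optimization-based method of [Bi et al., 2015] is roughly five orders of magnitude slower than the GPU version, while the fastest approximation network is within a small constant factor.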
7. Applications
In this section, we demonstrate the effectiveness and flexibility of our image smoothing algorithm with a range of applications, including image abstraction, pencil sketching, detail magnification, texture removal and content-aware image manipulation. The tailored methods for these applications mainly differ in the guidance edge maps used in training the network: the former three applications use the local gradient-based edge map (per Equation 3) as in the previous experiments, while the latter two modify it to achieve particular effects. For all the applications we use the images in the PASCAL VOC dataset [Everingham et al., 2010] for training.
[Figure: detail magnification results. Top row: input image and smoothing results of BLF, WLS, L0, FGS and ours. Bottom row: enhancement results of LLF, BLF, WLS, L0, FGS and ours.]
7.1. Image Abstraction and Pencil Sketching
Edge-preserving image smoothing can be used to stylize images. For example, [Winnemöller et al., 2006] proposed to abstract imagery by simplifying the low-amplitude details and increasing the contrast of visually important structures with difference-of-Gaussian edges. Following the previous works of [Farbman et al., 2008; Xu et al., 2011], we replace the iterative bilateral filter in [Winnemöller et al., 2006] with our learned edge-preserving image smoother, and decorate the extracted edges with random sketches in different directions to generate pencil drawing pictures. Furthermore, the smoothed images are combined with the pencil drawing picture to generate an abstract look [Lu et al., 2012].
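The abstraction idea described above (smooth first, then emphasize difference-of-Gaussian edges) can be sketched on a single scanline. This is a simplified illustration, not the pipeline of [Winnemöller et al., 2006] or ours: the `smoother` argument is a hypothetical placeholder for the learned smoother, the parameters `sigma`, `ratio` and `edge_gain` are assumed values, and edges are simply darkened where the response is negative.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at three standard deviations."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Gaussian-blur a 1D signal, replicating the border samples."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [
        sum(w * signal[min(last, max(0, i + k - radius))] for k, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def dog_edges(signal, sigma=1.0, ratio=1.6):
    """Difference-of-Gaussians response: narrow blur minus wide blur."""
    narrow = blur(signal, sigma)
    wide = blur(signal, ratio * sigma)
    return [a - b for a, b in zip(narrow, wide)]


def abstract(signal, smoother, edge_gain=2.0):
    """Smooth the signal, then darken it where the DoG response is negative."""
    smoothed = smoother(signal)
    edges = dog_edges(smoothed)
    return [max(0.0, s + edge_gain * min(0.0, e)) for s, e in zip(smoothed, edges)]


# A step edge between a dark and a bright region.
scanline = [0.2] * 10 + [0.8] * 10
# The identity function stands in for the learned smoother (hypothetical).
result = abstract(scanline, smoother=lambda s: list(s))
```

In this sketch the flat regions are left untouched, while the dark side of the step is darkened further, which increases the local contrast at the edge.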
Figure 10 presents the image abstraction and pencil sketching results of different methods on two examples. Our method clearly excels at preserving important image structures, thanks to the proposed energy function, which has an explicit edge-preserving constraint. For example, the lamp wires in the first example can be clearly seen in our smoothing results, while they are not well preserved by other methods. Note that these structures are semantically meaningful; without them the images look strange. The abstraction and pencil sketching results of our method are clearly more satisfactory. In the second example, the tree branches are small and thin, but are still visually prominent in the image. Our method preserves the tree structure well, while [Farbman et al., 2008] only preserved the limbs and [Farbman et al., 2008] broke some thin branches into pieces.
Note that, to further overcome the oversharpened effects, we expand the image region regularized by the L2 norm to its surrounding 7×7 pixel neighbourhood.
[Figure: texture removal results. Panels: Input, RTV, RGF, BTLF, SGF, SDF, Ours.]
7.2. Detail Magnification
The effect of image detail magnification can be achieved by superposing a smooth base layer and an enhanced detail layer, the latter of which can be obtained by image smoothing algorithms. After extracting the smooth layer, the detail layer is obtained as the difference between the original image and the smooth layer; it is then enhanced and added back to the smooth layer to generate the final result. An ideal smoothing algorithm for this task should neither blur nor oversharpen the salient image structures [Farbman et al., 2008], as either operation can lead to “ringing” artifacts in the residual image, resulting in halos or gradient reversals in the detail-enhanced images. Developing such a smoothing filter is challenging, as it is difficult to determine which edges to preserve and which to diminish while avoiding both oversharpening and oversmoothing these edges.
Figure 11 presents the results on one such example obtained by our method and previous ones [Xu et al., 2011; Farbman et al., 2008; Tomasi, 1998; Min et al., 2014; Paris et al., 2011]. It can be observed that, in the smoothed images, the methods of [Xu et al., 2011; Tomasi, 1998; Min et al., 2014] sharpened some edges that are blurry in the original image because they are out of focus. As a result, conspicuous gradient reversal artifacts can be observed at the top of the enhanced images. In contrast, [Farbman et al., 2008; Paris et al., 2011] and our method produce better results without noticeable artifacts. Note that the method of [Farbman et al., 2008] applies an L2 regularizer over the entire image in its smoothness term to avoid oversharpening the structures. In our approach, the edge-preserving term enforces a strong similarity between the major image structures of the input and output images by minimizing their quadratic differences, preventing the edges from being significantly blurred or excessively sharpened. Moreover, the L2 norm is also partially applied to the potentially oversharpened regions to better avoid gradient reversal artifacts in the exaggerated image. As such, high-quality detail magnification results can be obtained, as shown in Figure 11.
Note that gradient reversal artifacts are very likely to occur even if the smooth image is only slightly oversharpened, whether judged by numerical analysis or by visual perception. Such a case is almost unavoidable for edge-preserving filters that apply strong regularization, since they always tend to oversharpen the image to some degree. Therefore, we do not claim perfect detail exaggeration results, but we are able to outperform most previous algorithms that pursue strong smoothing effects [Tomasi, 1998; Min et al., 2014; Xu et al., 2011] with only a little effort in tweaking our objective function.
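The base/detail decomposition and the gradient reversal problem can be reproduced on a toy scanline. The sketch below is only an illustration under assumed values: the two base layers are hand-made stand-ins for the output of a smoother, and the gain of 3 is arbitrary.

```python
def magnify_details(image, base, gain=3.0):
    """Detail magnification: boost (image - base) and add it back to the base layer."""
    return [b + gain * (x - b) for x, b in zip(image, base)]


def has_gradient_reversal(image, result):
    """True if some neighbouring pair changes in opposite directions in image and result."""
    pairs = zip(zip(image, image[1:]), zip(result, result[1:]))
    return any((x1 - x0) * (r1 - r0) < 0 for (x0, x1), (r0, r1) in pairs)


# A blurry (out-of-focus) edge plus a small fine-scale texture.
edge = [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
texture = [0.0, 0.03125, -0.03125, 0.03125, -0.03125, 0.0, 0.0]
scanline = [e + t for e, t in zip(edge, texture)]

# Two hypothetical base layers: one keeps the soft edge, one sharpens it into a step.
faithful_base = list(edge)
oversharpened_base = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

enhanced_ok = magnify_details(scanline, faithful_base)
enhanced_bad = magnify_details(scanline, oversharpened_base)

print(has_gradient_reversal(scanline, enhanced_ok))   # False
print(has_gradient_reversal(scanline, enhanced_bad))  # True
```

With the faithful base only the texture is amplified and the edge stays monotonic. With the oversharpened base the detail layer absorbs part of the edge itself, so the enhanced signal overshoots to about 1.59 and then dips to about 0.16 across the edge, which is the gradient reversal discussed above. Values are not clamped here so that the overshoot remains visible.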
To avoid the gradient reversal effects that are usually caused by oversharpening the smooth layer, we increase the balance weight (), and release the constraint on selection ().
[Figure: content-aware image manipulation. Top: input, background-smoothed result and saliency map. Bottom: input, foreground-enhanced result and saliency map.]
7.3. Texture Removal
The texture removal task we consider here aims at removing fine-scale repetitive patterns from primary image structures. In this task, the smoothing algorithms should be made scale-aware, as the textures to be removed may also have local gradients.
Our method can be easily tailored for this task. To grant a network the ability to distinguish fine-scale textures from primary image structures and smooth them out after training, we can simply set the edge responses of the texture points on the guidance image to zero. This way, the corresponding edge responses on the guidance map of the output image will always be larger. Thus, with a slight modification of the constraint of Equation 8, a smoothness regularizer can be easily enforced on the texture regions, such that the network will learn to diminish them. The way to obtain the texture structure is elaborated in the supplemental material.
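The guidance modification described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the texture mask is assumed to be given, and the comparison rule in `flattening_region` is an assumed reading of the constraint of Equation 8 (a pixel is flattened when its output edge response exceeds the guidance response).

```python
def mask_texture_edges(edge_map, texture_mask):
    """Zero the edge responses at texture points of a guidance edge map.

    edge_map:     2D list of non-negative edge responses.
    texture_mask: 2D list of booleans, True where a pixel belongs to texture.
    """
    return [
        [0.0 if is_texture else response for response, is_texture in zip(edge_row, mask_row)]
        for edge_row, mask_row in zip(edge_map, texture_mask)
    ]


def flattening_region(output_edges, guidance_edges):
    """Pixels whose output edge response exceeds the guidance response (assumed rule)."""
    return [
        [out > ref for out, ref in zip(out_row, ref_row)]
        for out_row, ref_row in zip(output_edges, guidance_edges)
    ]


edge_map = [[0.9, 0.4], [0.0, 0.6]]
texture_mask = [[False, True], [False, True]]
guidance = mask_texture_edges(edge_map, texture_mask)

# Edge responses of a (hypothetical) network output that still contains texture.
output_edges = [[0.8, 0.3], [0.0, 0.5]]
region = flattening_region(output_edges, guidance)
```

Because the guidance response at texture points is zero, any remaining texture in the output exceeds it, so under the assumed rule exactly the texture pixels are selected for flattening while the structure edge in the top-left pixel is left alone.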
Figure 12 shows two examples that contain different types of texture patterns. We compare our results against the state-of-the-art methods of [Xu et al., 2012; Zhang et al., 2014; Cho et al., 2014; Ham et al., 2015; Zhang et al., 2015] that address the texture removal problem. It can be seen that both [Zhang et al., 2014] and [Cho et al., 2014] tend to blur some large-scale major structures, while the method of [Ham et al., 2015] produces some noisy structure boundaries. Compared to these methods, our method obtains superior results.
Since this task aims at diminishing textures that are very possibly locally salient, we enlarge the weight () and limit the smooth region ().
7.4. Content-Aware Image Manipulation
Different from traditional methods, our proposed algorithm enables content-aware image processing, i.e., smoothing a particular category of objects in the image.
In this section, we use the image saliency cue to demonstrate content-aware image manipulation with our method. For example, the proposed objective function can be slightly modified to achieve background smoothing, i.e., smoothing out the background regions to highlight the foreground (the salient objects). To this end, we mask out the edge responses of the background regions in the guidance image via the binary saliency masks obtained by a recent salient object detection algorithm [Zhang et al., 2017]. By feeding the modified guidance image to the proposed objective function, the norm regularizer can be applied to the background regions during training. Afterwards, we can even set the smoothing weights of the foreground regions to relatively small values or even zero to keep the foreground unmodified. Figure 13 presents some example results of our method, from which we can see that our trained network is capable of content-aware image smoothing.
Alternatively, our algorithm is also able to smooth out foreground regions via a similar strategy, such that a foreground enhancement effect can be achieved via the approach described in Section 7.2. Figure 13 demonstrates visually pleasing exaggeration effects for the foreground objects obtained with our approach.
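The region selection and per-pixel smoothing weights described above can be sketched as follows. The names and default values are hypothetical, and the saliency mask is assumed to be a binary map with 1 marking salient (foreground) pixels; switching `smooth` between "background" and "foreground" covers both strategies.

```python
def region_to_smooth(saliency_mask, smooth="background"):
    """Boolean map, True at the pixels that should be smoothed.

    saliency_mask: 2D list of 0/1 values, 1 marking salient (foreground) pixels.
    smooth:        "background" or "foreground".
    """
    if smooth not in ("background", "foreground"):
        raise ValueError("smooth must be 'background' or 'foreground'")
    salient_target = smooth == "foreground"
    return [[bool(value) == salient_target for value in row] for row in saliency_mask]


def smoothing_weights(region, inside=1.0, outside=0.0):
    """Per-pixel smoothing weights: `inside` in the region, `outside` elsewhere."""
    return [[inside if selected else outside for selected in row] for row in region]


saliency = [[0, 1, 1], [0, 0, 1]]
background = region_to_smooth(saliency, "background")
foreground = region_to_smooth(saliency, "foreground")

# Smooth the background only; a zero weight keeps the foreground unmodified.
weights = smoothing_weights(background, inside=1.0, outside=0.0)
```

Setting `outside` to a small positive value instead of zero corresponds to smoothing the other region lightly rather than leaving it untouched.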
Note that in this application, the smoothness effects and saliency information are jointly learned within our network, and the latter is reflected in the predicted smooth image. All these results are obtained solely by our trained network without any pre- or post-processing. We set the parameter in Equation 5 to 5 to limit the smoothness to either the foreground or the background region.
8. Conclusion
In this paper, we have presented an unsupervised learning approach for the task of image smoothing. We introduced a novel image smoothing objective function built upon the mixture of a spatially-adaptive flattening criterion and an edge-preserving regularizer. These criteria not only lead to state-of-the-art smoothing effects, as demonstrated in our experiments, but also grant our method the flexibility to obtain different smoothing effects within a single framework. We have also shown that training a deep neural network on a large corpus of raw images without ground truth labels can adequately solve the underlying minimization problem and generate impressive results. Moreover, the end-to-end mapping from a single input image to its smoothed counterpart can be computed efficiently by the neural network on both GPU and CPU, and the experiments have shown that our method runs orders of magnitude faster than traditional methods. We foresee a wide range of applications that can benefit from our new pipeline.
8.1. Limitations and Future work
Our algorithm relies on some additional information to optimize the objective function during training, such as the detected structures or textures. Currently we employ some simple heuristic methods to detect the structures, and imperfect detection can influence the smoothing results. Figure 14 shows an example where our algorithm fails to extract some detailed textures perfectly. This issue can be mitigated by introducing a moderate amount of human interaction to refine the structure maps of the training data, or by synthesizing training images from separate textures and clean images, similar to [Lu et al., 2018]. Developing more advanced detection algorithms is also part of our future work.
Due to the adaptively changing and spatially variant flattening term and the extra input information required for optimization, the optimization is complex and very challenging for traditional numerical solvers. To our knowledge, this is the first attempt at treating a deep network as a numerical solver in the image smoothing field. In the future, we would also like to explore more complex and different tasks, such as multi-image or video processing.
Acknowledgements.
The authors would also like to thank the anonymous reviewers for their valuable comments and helpful suggestions. This work is supported by the National 973 Program under Grant No. 2015CB352501, and NSFC-ISF under Grant No. 61561146397.
References
 Bach et al. [2012] Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al. 2012. Structured sparsity through convex optimization. Statist. Sci. 27, 4 (2012), 450–468.
 Bao et al. [2014] Linchao Bao, Yibing Song, Qingxiong Yang, Hao Yuan, and Gang Wang. 2014. Tree filtering: Efficient structure-preserving smoothing with a minimum spanning tree. IEEE Transactions on Image Processing 23, 2 (2014), 555–569.
 Bi et al. [2015] S. Bi, X. Han, and Y. Yu. 2015. An L1 image transform for edge-preserving smoothing and scene-level intrinsic decomposition. ACM Transactions on Graphics 34, 4 (2015), 78.
 Chen et al. [2017] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. 2017. Coherent online video style transfer. In Proc. Intl. Conf. Computer Vision (ICCV).
 Chen et al. [2017b] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. 2017b. Stylebank: An explicit representation for neural image style transfer. In Proc. CVPR, Vol. 1. 4.
 Chen et al. [2018] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. 2018. Stereoscopic neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 10.
 Chen et al. [2007] Jiawen Chen, Sylvain Paris, and Frédo Durand. 2007. Real-time edge-aware image processing with the bilateral grid. ACM Transactions on Graphics 26, 3 (2007), 103.
 Chen et al. [2017a] Qifeng Chen, Jia Xu, and Vladlen Koltun. 2017a. Fast image processing with fully-convolutional networks. In International Conference on Computer Vision. 2497–2506.
 Cho et al. [2014] Hojin Cho, Hyunjoon Lee, Henry Kang, and Seungyong Lee. 2014. Bilateral texture filtering. ACM Transactions on Graphics 33, 4 (2014), 128.
 Chung and Vese [2009] Ginmo Chung and Luminita A Vese. 2009. Image segmentation using a multilayer level-set approach. Computing and Visualization in Science 12, 6 (2009), 267–285.
 Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88, 2 (2010), 303–338.
 Fan et al. [2017] Qingnan Fan, Jiaolong Yang, Gang Hua, Baoquan Chen, and David Wipf. 2017. A generic deep architecture for single image reflection removal and image smoothing. In International Conference on Computer Vision. 3238–3247.
 Fan et al. [2018] Qingnan Fan, Jiaolong Yang, Gang Hua, Baoquan Chen, and David Wipf. 2018. Revisiting Deep Intrinsic Image Decompositions. In Proceedings of the IEEE conference on computer vision and pattern recognition.
 Fan et al. [2015] Qingnan Fan, Fan Zhong, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. 2015. JumpCut: non-successive mask transfer and interpolation for video cutout. ACM Transactions on Graphics 34, 6 (2015), 195.
 Farbman et al. [2008] Zeev Farbman, Raanan Fattal, Dani Lischinski, and Richard Szeliski. 2008. Edge-preserving decompositions for multi-scale tone and detail manipulation. ACM Transactions on Graphics 27, 3 (2008), 67.
 Fattal [2009] Raanan Fattal. 2009. Edge-avoiding wavelets and their applications. ACM Transactions on Graphics 28, 3 (2009), 22.
 Gastal and Oliveira [2011] Eduardo SL Gastal and Manuel M Oliveira. 2011. Domain transform for edge-aware image and video processing. In ACM Transactions on Graphics, Vol. 30. ACM, 69.
 Gharbi et al. [2017] Michaël Gharbi, Jiawen Chen, Jonathan T Barron, Samuel W Hasinoff, and Frédo Durand. 2017. Deep bilateral learning for real-time image enhancement. ACM Transactions on Graphics (TOG) 36, 4 (2017), 118.
 Ham et al. [2015] Bumsub Ham, Minsu Cho, and Jean Ponce. 2015. Robust image filtering using joint static and dynamic guidance. In IEEE Conference on Computer Vision and Pattern Recognition. 4823–4831.
 He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In IEEE International Conference on Computer Vision. 1026–1034.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
 He et al. [2018] Mingming He, Dongdong Chen, Jing Liao, Pedro V Sander, and Lu Yuan. 2018. Deep Exemplar-based Colorization. ACM Transactions on Graphics (Proc. of Siggraph 2018) (2018).
 Holland and Welsch [1977] Paul W Holland and Roy E Welsch. 1977. Robust regression using iteratively reweighted least-squares. Communications in Statistics-Theory and Methods 6, 9 (1977), 813–827.
 Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning. 448–456.
 Kass and Solomon [2010] Michael Kass and Justin Solomon. 2010. Smoothed local histogram filters. In ACM Transactions on Graphics, Vol. 29. 100.
 Kim et al. [2016] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. 2016. Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition. 1646–1654.
 Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations (2015).
 Liu et al. [2016] Sifei Liu, Jinshan Pan, and Ming-Hsuan Yang. 2016. Learning recursive filters for low-level vision via a hybrid neural network. In European Conference on Computer Vision. 560–576.
 Liu et al. [2017] Wei Liu, Xiaogang Chen, Chuanhua Shen, Zhi Liu, and Jie Yang. 2017. Semi-Global Weighted Least Squares in Image Filtering. In IEEE International Conference on Computer Vision.
 Lu et al. [2012] Cewu Lu, Li Xu, and Jiaya Jia. 2012. Combining sketch and tone for pencil drawing production. In Proceedings of the Symposium on Non-Photorealistic Animation and Rendering. Eurographics Association, 65–73.
 Lu et al. [2018] Kaiyue Lu, Shaodi You, and Nick Barnes. 2018. Deep Texture and Structure Aware Filtering Network for Image Smoothing. European Conference on Computer Vision (ECCV).
 Min et al. [2014] Dongbo Min, Sunghwan Choi, Jiangbo Lu, Bumsub Ham, Kwanghoon Sohn, and Minh N Do. 2014. Fast global image smoothing based on weighted least squares. IEEE Transactions on Image Processing 23, 12 (2014), 5638–5653.
 Mrázek et al. [2006] Pavel Mrázek, Joachim Weickert, and Andres Bruhn. 2006. On robust estimation and smoothing with spatial and tonal kernels. In Geometric properties for incomplete data. Springer, 335–352.
 Paris and Durand [2006] Sylvain Paris and Frédo Durand. 2006. A fast approximation of the bilateral filter using a signal processing approach. In European Conference on Computer Vision. 568–580.
 Paris et al. [2011] Sylvain Paris, Samuel W Hasinoff, and Jan Kautz. 2011. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. ACM Trans. Graph. 30, 4 (2011), 68–1.
 Perona and Malik [1990] Pietro Perona and Jitendra Malik. 1990. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 7 (1990), 629–639.
 Prasath et al. [2015] VB Surya Prasath, Dmitry Vorotnikov, Rengarajan Pelapur, Shani Jose, Guna Seetharaman, and Kannappan Palaniappan. 2015. Multiscale Tikhonov-total variation image restoration using spatially varying edge coherence exponent. IEEE Transactions on Image Processing 24, 12 (2015), 5220–5235.
 Romano et al. [2017] Yaniv Romano, Michael Elad, and Peyman Milanfar. 2017. The little engine that could: Regularization by denoising (RED). SIAM Journal on Imaging Sciences 10, 4 (2017), 1804–1844.
 Rong and Tan [2006] Guodong Rong and Tiow-Seng Tan. 2006. Jump flooding in GPU with applications to Voronoi diagram and distance transform. In The 2006 symposium on Interactive 3D graphics and games. 109–116.
 Tomasi [1998] Carlo Tomasi. 1998. Bilateral filtering for gray and color images. In International Conference on Computer Vision. IEEE, 839–846.
 Ulyanov et al. [2018] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2018. Deep Image Prior. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).
 Weiss [2006] Ben Weiss. 2006. Fast median and bilateral filtering. ACM Transactions on Graphics 25, 3 (2006), 519–526.
 Winnemöller et al. [2006] Holger Winnemöller, Sven C Olsen, and Bruce Gooch. 2006. Real-time video abstraction. In ACM Transactions On Graphics, Vol. 25. 1221–1226.
 Xu et al. [2011] Li Xu, Cewu Lu, Yi Xu, and Jiaya Jia. 2011. Image smoothing via L0 gradient minimization. In ACM Transactions on Graphics, Vol. 30. 174.
 Xu et al. [2015] Li Xu, Jimmy SJ. Ren, Qiong Yan, Renjie Liao, and Jiaya Jia. 2015. Deep edge-aware filters. In International Conference on Machine Learning. 1669–1678.
 Xu et al. [2012] Li Xu, Qiong Yan, Yang Xia, and Jiaya Jia. 2012. Structure extraction from texture via relative total variation. ACM Transactions on Graphics 31, 6 (2012), 139.
 Yu and Koltun [2016] Fisher Yu and Vladlen Koltun. 2016. Multiscale context aggregation by dilated convolutions. International Conference on Learning Representations.
 Zhang et al. [2015] Feihu Zhang, Longquan Dai, Shiming Xiang, and Xiaopeng Zhang. 2015. Segment graph based image filtering: fast structure-preserving smoothing. In IEEE International Conference on Computer Vision. 361–369.
 Zhang et al. [2017] Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Xiang Ruan. 2017. Amulet: Aggregating multi-level convolutional features for salient object detection. In International Conference on Computer Vision. 202–211.
 Zhang et al. [2014] Qi Zhang, Xiaoyong Shen, Li Xu, and Jiaya Jia. 2014. Rolling Guidance Filter. In European Conference on Computer Vision. 815–830.
 Zhu et al. [2014] Feiyun Zhu, Ying Wang, Bin Fan, Shiming Xiang, Geofeng Meng, and Chunhong Pan. 2014. Spectral unmixing via data-guided sparsity. IEEE Transactions on Image Processing 23, 12 (2014), 5412–5427.