Deep Image Smoothing based on Texture and Structure Guidance

12/07/2017 ∙ by Kaiyue Lu, et al. ∙ CSIRO

Image smoothing is a fundamental task in computer vision, which aims to retain salient structures and remove insignificant textures. In this paper, we tackle the natural deficiency of existing methods, namely that they cannot properly distinguish textures and structures with similar low-level appearance. While deep learning approaches have addressed preserving structures, they do not yet properly address textures. To this end, we build a texture prediction network (TPN) that learns from a variety of natural textures. We then combine this with a structure prediction network (SPN) so that the final double-guided filtering network (DGFN) is informed where the textures to remove are ("texture-awareness") and where the structures to preserve are ("structure-awareness"). The proposed model is easy to implement and shows excellent performance on real images in the wild as well as our synthetic dataset.




1 Introduction

Image smoothing, a fundamental technology in image processing and computer vision, aims to clean images by retaining salient structures (to the structure-only image) and removing insignificant textures (to the texture-only image), with various applications including denoising [1], detail enhancement [2], image abstraction [3] and segmentation [4].

There are mainly two types of methods for image smoothing: (1) kernel-based methods, that calculate the average of the neighborhood for texture pixels while trying to retain the original value for structural pixels, such as the guided filter (GF) [5], rolling guidance filter (RGF) [6], segment graph filter (SGF) [7] and so on; and (2) separation-based methods, which decompose the image into a structure layer and a texture layer, such as relative total variation (RTV) [8], fast L0 [9], and static and dynamic guidance filter (SDF) [10, 11]. Traditional approaches rely on hand-crafted features and/or prior knowledge to distinguish textures from structures. These features are based entirely on low-level appearance, and generally assume that structures always have larger gradients, and textures are just smaller oscillations in color intensities.

Figure 1: (a) Texture in natural images is often hard to identify due to spatial distortion and high contrast. (b) Illustration of learning “texture awareness”. We generate training data by adding spatial and color variations to natural texture patterns and blending them with structure-only images, and then use the result to train a multi-scale texture network with texture ground-truth. We test the network on both generated data and natural images. (c) Our proposed deep filtering network is composed of a texture prediction network (TPN) for predicting textures (white stripes with high-contrast); a structure prediction network (SPN) for extracting structures (the giraffe’s boundary, which has relatively low contrast to the background); and a texture and structure aware filtering network (TSAFN) for image smoothing. (d)-(i) Existing methods cannot distinguish low-contrast structures from high-contrast textures effectively.

In fact, it is quite difficult to identify textures. The main reasons are twofold: (1) textures are essentially repeated patterns regularly or irregularly distributed within object structures, and they may show significant spatial distortions in an image (as shown in Fig. 1(a)), making it impossible to fully define them mathematically; (2) in some images there are strong textures with large gradients and color contrast to the background, which are easy to confuse with structures (such as the white stripes on the giraffe’s body in Fig. 1(c)). We see from Fig. 1 that GF, RGF, SGF, fast L0, and SDF perform poorly on the giraffe image. The textures are either not removed at all, or suppressed with the structure severely blurred. This is because the hand-crafted nature of these filters makes them less robust when applied to various types of textures, and also leads to poor discrimination of textures and structures. Some other methods [12, 13, 14, 15, 16, 17, 18] take advantage of deep neural networks, and aim for better performance by extracting richer information. However, existing networks use the output of various hand-crafted filters as ground-truth during training. These deep learning approaches are thus limited by the shortcomings of hand-crafted filters, and cannot learn how to effectively distinguish textures from structures.

A recently-proposed double-guided filter (DGF) [19] addresses this issue by introducing the idea of “texture guidance”, which infers the location of texture, and combines it with “structure guidance” to achieve both goals of texture removal and structure preservation. However, DGF uses a hand-crafted separation-based algorithm called Structure Gradient and Texture Decorrelating (SGTD) [20] to construct the texture confidence map, and therefore still cannot overcome the inherent deficiency of hand-crafted filters. We argue that this is not true “texture awareness”, because in many cases, some structures are inevitably blurred when the filter tries to remove strong textures after several iterations. As can be seen in Fig. 1(i), although the stripe textures are largely smoothed out, the structure of the giraffe is unexpectedly blurred, especially around the head and the tail (red boxes).

In this paper, we hold the idea that “texture awareness” should reflect both the texture region (where the texture is) and texture magnitude (texture with high contrast to the background is harder to remove). Thus, we take advantage of deep learning and propose a texture prediction network (TPN) that aims to learn textures from natural images. However, since there are no available datasets containing natural images with labeled texture regions, we make use of texture-only datasets [21, 22]. The process of learning “texture awareness” is shown in Fig. 1(b). Specifically, we generate the training data by adding spatial and color variations to natural texture patterns and blending them with the structure-only image. Then we construct a multi-scale network (containing different levels of contextual information) and train it on these images with texture ground-truth (G.T. in short). The proposed TPN is able to predict textures through a full consideration of both high-level statistics, e.g., repetition, tiling, and spatially varying distortion, and low-level appearance, e.g., gradient. The network achieves good performance on our generated testing data, and can also generalize well to natural images, effectively locating texture regions and measuring texture magnitude by assigning different confidences, as shown in Fig. 1(b). More details can be found in Section 3.

For the full problem, we are inspired by the idea of “double guidance” introduced in [19] and propose a deep neural network based filter that learns to predict textures to remove (“texture-awareness” by our TPN) and structures to preserve (“structure-awareness” by HED semantic edge detection [23]). This is an end-to-end image smoothing architecture which we refer to as “Texture and Structure Aware Filtering Network” (TSAFN), as shown in Fig. 1(c). The network is trained with our own generated dataset. Different from the work in [19], we propose more effective methods for generating texture and structure guidance, and replace the hand-crafted kernel filter with a deep learning based one to achieve a more consistent and effective combination of these two types of guidance. Experimental results show that our proposed filter outperforms DGF [19] significantly in terms of both effectiveness and efficiency, achieves state-of-the-art performance on our dataset, and generalizes well to natural images.

The main contributions of this paper are: (1) To the best of our knowledge, we are the first to propose deep neural networks to robustly predict textures in natural images. (2) We present a large dataset that enables training texture prediction and image smoothing. (3) We propose an end-to-end deep neural network for image smoothing that achieves both “structure-awareness” and “texture-awareness”, and outperforms existing methods on challenging natural images.

2 Related Work

Texture extraction from structures

The basic assumption of this type of work is that an image can be decomposed into structure and texture layers (the structure layer is a smoothed version of the input and contains salient structures, while the texture layer contains insignificant details or textures). The pioneering work, Total Variation [24], aims to minimize the quadratic difference between the input and output images to maintain structure consistency, with the gradient loss as an additional penalty. Later works retain the quadratic form and propose other regularizer terms or features (gradient loss is still necessary to keep the structures as sharp as possible), such as weighted least squares (WLS) [25], L0 norm smoothing [26, 9], L1 norm smoothing [27], local extrema [28], and structure gradient and texture decorrelating (SGTD) [20]. Other works focus on accelerating the optimization [29] or improving existing algorithms [30]. There are two general issues that have not been handled effectively in existing work. Firstly, as they are largely dependent on gradient information, these methods lack discrimination of textures and structures, especially when they have similar low-level appearance, particularly in terms of scale or magnitude. Secondly, all the objective functions are manually defined, and may not be adaptive and robust to the huge variety of possible textures, especially in natural images.
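For concreteness, objectives of this family are commonly written in the following form. The notation is ours and is only a sketch: $I$ is the input, $S$ the structure layer to be recovered, $i$ indexes pixels, and $\lambda$ is a trade-off weight; individual methods differ in the regularizer that replaces the gradient term.

```latex
\min_{S} \; \sum_{i} \left( S_i - I_i \right)^2 \; + \; \lambda \sum_{i} \left| \nabla S_i \right|
```

The first term keeps the output close to the input, and the second penalizes oscillations. Because the penalty depends only on gradient magnitude, a strong texture and a strong structure are treated alike, which is the first issue described above.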

Image smoothing with guidance

The guidance image can provide structure information to help repair and sharpen structures in the target image. Since adding guidance into separation-based methods may make it harder to optimize, this idea is more widely used in kernel-based methods. Static guidance refers to the use of a fixed guidance image, such as the bilateral filter [31], joint bilateral filter [32], and guided filter [5]. To make the guidance more structure-aware, existing filters also employ techniques such as leverage tree distance [33], superpixels [7], region covariances [34], co-occurrence matrix [35], propagation distance [36], multipoint estimation [37], fully connected regions [38] and edge maps [39, 40, 41]. In contrast, dynamic guidance methods update the guidance image to suppress more details [6, 10, 11] by iteratively refining the target image. Overall, the aforementioned guidance methods only address structure information, or assume that structures and textures can be sufficiently distinguished with a single guidance. However, in most cases, structures and textures interfere with each other severely. Lu et al. [19] address this issue by introducing the concept of “texture guidance”, which infers texture regions by normalizing the texture layer separated by SGTD [20] to construct the texture confidence map. They then naively combine it with structure guidance to form a double-guided kernel filter. However, this method is still largely dependent on hand-crafted features (in particular it relies on the hand-crafted SGTD to infer textures, which is not robust in essence). Structures may be blurred when the filter tries to smooth out strong textures after several iterations.

Deep image smoothing

Deep learning has been widely used in low-level vision tasks, such as super resolution [42], deblurring [43] and dehazing [44]. Compared with non-learning approaches, deep learning is able to extract richer information from images. In image smoothing, current deep filtering models all focus on approximating and accelerating existing non-learned filters. [12] is the pioneering paper, where the learning is performed on the gradient domain and the output is reconstructed from the refined gradients produced by the deep network. Liu et al. [13] take advantage of both convolutional networks (for perceiving salient structures) and recurrent networks (for producing smoothing output in a data-driven manner). Li et al. [14] fuse the features from the original input and guidance image together and then produce the guided smoothing result (this work is mainly for upsampling). Fan et al. [15] first construct a network called E-CNN to predict the edge/structure confidence map based on gradients, and then use it to guide the filtering network called I-CNN. Similar work can be found in [17] by the same authors. Most recent works mainly focus on extracting richer information from input images ([18] introduces a convolutional neural pyramid to extract features of different scales, and [16] utilizes context aggregation networks to include more contextual information) and yielding more satisfying results. One common issue is that all of these approaches have to take the output of existing filters as ground-truth and cannot function as independent filters. Their focus is limited to how closely they can reproduce the filter being learned, and how much they can accelerate its computation. This deviates from the task of image smoothing itself. Moreover, since these methods aim to mimic existing filters, they are unable to overcome the deficiency of those filters in discriminating textures.

3 Texture Prediction

In this section, we give insights on textures in natural images, which inspire the design of the texture prediction network (TPN) and the dataset for training.

3.1 What is texture?

Appearance of texture

It is well known that many different types of textures occur in nature and it is difficult to fully define them mathematically. Generally speaking, textures are repeated patterns regularly or irregularly distributed within object structures. For example, in Fig. 1(c), the white stripes on the giraffe’s surface are recognized as textures. In Fig. 2, textures are widely spread in the image on clothes, books, and the table cloth. For cognition and vision tasks, an intuitive observation is that the removal of these textures will not affect the spatial structure of objects. Thus, they can be removed by image smoothing as a preprocessing step for other visual tasks.

Figure 2: Close observation of structures and textures. In contrast with the assumptions used in existing methods, large gradients do not necessarily indicate structures (IV), and small gradients may also belong to structures (III). The challenge to distinguish them motivates us to propose two independent texture and structure guidances.

Textures do not necessarily have small gradients

Existing methods generally assume that textures are minor oscillations and have small gradients. Thus, they can easily hand-craft the filter or loss function. However, in many cases, textures may also have large gradients, e.g., the white stripes on the giraffe’s body in Fig. 1(b), and the stripes occurring on the books in close-up IV of Fig. 2(c). Therefore, defining textures purely based on local contrast is insufficient.
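The point can be illustrated with a toy one-dimensional example. The sketch below is ours, not part of the proposed method: the intensity profiles and the threshold of 0.5 are hypothetical values chosen only to show that a gradient threshold labels a high-contrast stripe texture as structure while missing a low-contrast object boundary.

```python
def gradient_magnitudes(signal):
    """Absolute forward differences of a 1-D intensity profile."""
    return [abs(b - a) for a, b in zip(signal, signal[1:])]


def is_structure(gradient, threshold=0.5):
    """Gradient-only rule: anything above the threshold counts as structure."""
    return gradient > threshold


# Hypothetical intensity profiles in [0, 1].
stripes = [0.1, 0.9] * 4               # high-contrast repeated texture
weak_edge = [0.45] * 4 + [0.55] * 4    # low-contrast object boundary

texture_peak = max(gradient_magnitudes(stripes))
structure_peak = max(gradient_magnitudes(weak_edge))

print("texture labelled as structure:", is_structure(texture_peak))
print("true boundary labelled as structure:", is_structure(structure_peak))
```

Under this rule the texture is kept and the true boundary is discarded, which is the opposite of the desired behavior.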

Mathematically modeling texture repetition is non-trivial

By definition, textures are patterns with spatial repetitions. However, modeling and describing the repetition is non-trivial due to the existence of various distortions (see Fig. 1(a)).

Learn to predict textures

To tackle these issues, we take advantage of deep neural networks. Provided sufficient training examples are available, the network is able to learn to predict textures without explicit modeling.

3.2 Dataset Generation

We aim to provide a dataset so that a deep network can learn to predict textures. Ideally, we would like to learn directly from natural images. However, manually annotating pixel-wise labels plus alpha-matting would be prohibitively costly. Moreover, it would require a full range of textures, each with a full range of distortions in a broad array of natural scenes. Therefore, we propose a strategy to generate the training and testing data. Later, we will demonstrate that the proposed network is able to predict textures in the wild successfully.

We observe that cartoon images have only structural edges filled with pure color, and can be safely considered “structure-only images”. Specifically, we select 174 cartoon images from the Internet and 233 different types of natural texture-only images from public datasets [21, 22]. The data generation process is illustrated in Fig. 3(a). Note that texture images in these datasets show textures only and all have simple backgrounds, so that separating them from the colored background is simple and efficient even using Relative Total Variation (RTV) [8]. The texture layer separated by RTV is normalized to $[0, 1]$.

Figure 3: Illustration of dataset generation. We blend-in natural texture patterns to structure-only images, adding spatial and color variation to increase texture diversity.

Texture itself can be irregular, and textures in the wild may be distorted because of geometric projection. This arises because textures can appear on planar surfaces that are not orthogonal to the viewing direction, as well as being projected onto objects with complex 3D surfaces. Therefore, we apply both spatial and color variation to the regular textures during dataset generation. As shown in Fig. 3(a), we blend the texture into the structure-only image. In detail, we rescale all the texture images to a fixed size and extract texture patterns with RTV. We model spatial variation, capturing projected texture at patch level, by performing geometric transforms including rotation, scaling, shearing, and linear and non-linear distortion. We randomly select the geometric transform and the parameters for the operation. (The detailed process can be found in the supplementary material, and we will provide the data generation code upon publication.) Based on the deformed result, we generate a binary mask $M$.
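As an illustration of the linear part of this spatial variation, the sketch below samples a random transform composed of rotation, scaling, and shearing and applies it to patch coordinates. It is not the authors' generation code: the function names and all parameter ranges are assumptions made for the example, and the non-linear distortion is not covered.

```python
import math
import random


def random_affine(rng, max_angle=math.pi / 6, scale_range=(0.8, 1.25), max_shear=0.3):
    """Sample a 2x2 matrix equal to scale * rotation * horizontal shear.

    The parameter ranges are illustrative assumptions, not values from the paper.
    """
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(*scale_range)
    shear = rng.uniform(-max_shear, max_shear)
    c, s = math.cos(angle), math.sin(angle)
    # [[c, -s], [s, c]] @ [[1, shear], [0, 1]], multiplied by the scale factor
    return [
        [scale * c, scale * (c * shear - s)],
        [scale * s, scale * (s * shear + c)],
    ]


def warp_points(points, matrix, center):
    """Apply the 2x2 transform to (x, y) points about a fixed center."""
    cx, cy = center
    warped = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        warped.append((
            cx + matrix[0][0] * dx + matrix[0][1] * dy,
            cy + matrix[1][0] * dx + matrix[1][1] * dy,
        ))
    return warped


rng = random.Random(0)
matrix = random_affine(rng)
corners = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
print(warp_points(corners, matrix, center=(5.0, 5.0)))
```

Sampling a fresh matrix for every patch means that two patches cut from the same original pattern are warped differently, which is one source of the texture diversity described above.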

As for color variation, given the structure-only image $S$, the value of pixel $i$ in the $j$-th channel of the generated image $I$, denoted $I_i^{(j)}$, is determined by both $S_i^{(j)}$ and the binary mask $M$. If $M_i = 1$, $I_i^{(j)}$ is generated randomly within a range controlled by a parameter that is empirically set as 0.75. Otherwise, $I_i^{(j)} = S_i^{(j)}$. We repeat this by sliding the mask over the whole image without overlapping. The ground-truth texture confidence $T^*_i$ is calculated by averaging the values of the three channels of the texture layer:

$$T^*_i = \sigma\left(\frac{1}{3}\sum_{j=1}^{3} L_i^{(j)}\right),$$

where $L_i^{(j)}$ is the value of pixel $i$ in the $j$-th channel of the texture layer and $\sigma$ is the sigmoid function to scale the value to $[0, 1]$. We use this color variation to generate significant contrast between the textures and the background. Using this method, it is unlikely that two images have the same textures even when the textures come from the same original pattern. Fig. 3(b) shows eight generated image patches.
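The ground-truth computation described above (channel average followed by a sigmoid) can be sketched as follows. This is a minimal illustration under our own assumptions: the texture layer is represented as nested lists of (r, g, b) tuples, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def texture_confidence(texture_layer):
    """Map an H x W grid of (r, g, b) texture values to an H x W confidence grid.

    Each confidence is the sigmoid of the mean of the three channel values.
    """
    return [[sigmoid(sum(pixel) / 3.0) for pixel in row] for row in texture_layer]


# Hypothetical 1 x 3 texture layer: no texture, weak texture, strong texture.
layer = [[(0.0, 0.0, 0.0), (0.2, 0.1, 0.3), (0.9, 1.0, 0.8)]]
print(texture_confidence(layer))
```

Pixels with a larger average texture value receive a higher confidence, so the map encodes texture magnitude as well as texture location.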

Finally, we generate 30,000 images in total (a handful of low-quality images have been manually removed). For ground-truth, besides the purely-clean structure-only images, we also provide binary structure maps and texture confidence maps of all the generated images. (More examples in the dataset are provided in the supplementary material, and the dataset will be available to the public upon publication.)

Figure 4: Our proposed network architecture. The outputs of the texture prediction network (TPN) and structure prediction network (SPN) are concatenated with the original input, and then fed to the texture and structure aware filtering network (TSAFN) to produce the final smoothing result. ($k$, $k$, $c$, $s$) for a convolutional layer means the kernel is $k \times k$ in size with $c$ feature maps, and the stride is $s$.


3.3 Texture prediction network

Network design

We propose the texture prediction network (TPN), which is trained on our generated dataset. Considering that textures have various colors, scales, and shapes, we employ a multi-scale learning strategy. Specifically, we apply 1/2, 1/4, and 1/8 down-sampling to the input respectively. For each image, we use 3 convolutional layers for feature extraction, with the same kernel size and different numbers of feature maps. Then, all the feature maps are resized to the original input size and concatenated to form a 16-channel feature map. They are further convolved with one more convolutional layer to yield the final 1-channel result. Note that each convolutional layer is followed by ReLU except for the output layer, which is followed by a sigmoid activation function to scale the values to $[0, 1]$. The architecture of TPN is shown in Fig. 4. Consequently, given the input image $I$, the predicted texture guidance $\tilde{T}$ is obtained by:

$$\tilde{T} = g(I),$$

where $g$ denotes the mapping computed by TPN.
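The multi-scale data flow (down-sample, extract features, resize back, stack) can be sketched without any deep learning library. The code below is only an illustration of that flow under our own assumptions: average pooling and nearest-neighbour resizing stand in for whatever resampling the network uses, the `extract` argument is a hypothetical placeholder for the three convolutional layers, and the channel counts of the real network are not modelled.

```python
def downsample(img, factor):
    """Average-pool a 2-D list by an integer factor (sizes assumed divisible)."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y * factor + dy][x * factor + dx]
                for dy in range(factor) for dx in range(factor)) / factor ** 2
            for x in range(w // factor)
        ]
        for y in range(h // factor)
    ]


def upsample(img, factor):
    """Nearest-neighbour resize of a 2-D list by an integer factor."""
    return [
        [img[y // factor][x // factor] for x in range(len(img[0]) * factor)]
        for y in range(len(img) * factor)
    ]


def multi_scale_stack(img, factors=(2, 4, 8), extract=lambda x: x):
    """Return one full-resolution map per scale, ready to be concatenated."""
    return [upsample(extract(downsample(img, f)), f) for f in factors]


image = [[float(x + y) for x in range(8)] for y in range(8)]
maps = multi_scale_stack(image)
print(len(maps), len(maps[0]), len(maps[0][0]))
```

Coarser scales summarize larger neighbourhoods, which is how the stacked maps carry different levels of contextual information to the final layer.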


Network training

The network is trained by minimizing the mean squared error (MSE) between the predicted texture guidance map and the ground-truth:

$$\ell_T(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left\| \tilde{T}_i - T^*_i \right\|_2^2,$$

where $N$ is the number of pixels in the image, $*$ denotes the ground-truth, and $\theta$ represents parameters. More training details can be found in the experiment section.
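A minimal sketch of this loss for a single-channel map follows. The nested-list representation and the function name are our own choices for illustration; a real implementation would use the tensor operations of the training framework.

```python
def mse_loss(pred, target):
    """Mean squared error over all pixels of two equally sized 2-D maps."""
    n_pixels = sum(len(row) for row in pred)
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return total / n_pixels


predicted = [[0.0, 1.0], [0.5, 0.5]]
ground_truth = [[0.0, 0.0], [0.5, 1.0]]
print(mse_loss(predicted, ground_truth))  # (0 + 1 + 0 + 0.25) / 4 = 0.3125
```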

Figure 5: Texture prediction results. First row: input (including both generated and natural images). Second row: texture extraction results by RTV [8] (we compare it because we use it to extract textures from texture-only images). Third row: texture prediction results by our proposed TPN. The network is able to find textures in both generated and natural images effectively, and indicate the magnitude of textures by assigning pixel-level confidence. RTV performs worse in extracting textures because just like other hand-crafted filters, it also assumes structures have large gradients and has poor discrimination of strong textures, especially in more complicated scenes.

Texture prediction results

We present the texture prediction results on our generated images in Fig. 5(a) and natural images in Fig. 5(b). The network is able to find textures in both the generated and natural images effectively, and indicate the magnitude of textures by assigning pixel-level confidence (the third row). For comparison, we also list the texture extraction results from these examples by RTV [8] in the second row. RTV performs worse on the more complex scenes, and some structures are unexpectedly visible in the texture layer (red arrows). This is because just like other hand-crafted filters, RTV also assumes structures have large gradients and hence has poor discrimination of strong textures.

4 Texture and Structure Aware Filtering Network

As shown in Fig. 4, our deep filtering network consists of three parts:

  1. Texture prediction network TPN, that constructs texture guidance to indicate texture regions and magnitude (texture confidence).

  2. Structure prediction network SPN, that constructs structure guidance to indicate meaningful structures (structure confidence).

  3. Texture and structure aware filtering network TSAFN, that concatenates the two guidance images with the original input and generates the smoothing output.

Since TPN has been discussed in the previous section, we give more details about SPN and TSAFN in the following.

4.1 Structure prediction network

Structure information is an essential cue for image smoothing, as it tells the filter which boundaries should be preserved. The ideal structure guidance would give high confidence to meaningful structures, regardless of gradient intensity. We utilize the recently-proposed holistically-nested edge detection (HED) [23] as the structure prediction network (SPN). The structure guidance $\tilde{E}$ is obtained from the side outputs $\tilde{E}^{(m)}$ of the network, where $\tilde{E}^{(m)}$ is the side output from the $m$-th stage (each stage contains several convolutional and pooling layers). The final loss is denoted as $\ell_E$. Please refer to the original paper [23] for more details.

4.2 Texture and structure aware filtering network

Network design

Once the structure and texture guidance are generated, the texture and structure aware filtering network (TSAFN) concatenates them with the input to form a 5-channel tensor. TSAFN consists of 4 layers. We set a relatively large kernel in the first layer to take more original information into account. The kernel size decreases in the following two layers and is increased again in the last layer. The first three layers are followed by ReLU, while the last layer has no activation function (transforming the tensor into the 3-channel output). Empirically, we remove all the pooling layers, the same as [12, 14, 15, 16]. We set the filtering network without any guidance as the baseline. The whole process can be denoted as:

$$\tilde{I} = h(I, \tilde{T}, \tilde{E}),$$

where $h$ denotes TSAFN, $I$ is the input, and $\tilde{T}$ and $\tilde{E}$ are the texture and structure guidance, respectively.
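The 5-channel input is simply the three color channels of the image followed by the two single-channel guidance maps. The sketch below shows that concatenation; the data layout (nested lists of tuples), the channel order, and the function name are assumptions made for illustration.

```python
def concat_guidance(image, texture, structure):
    """Stack an H x W RGB image with two H x W guidance maps.

    Returns an H x W grid of 5-tuples: (r, g, b, texture, structure).
    """
    return [
        [tuple(rgb) + (t, e) for rgb, t, e in zip(image_row, texture_row, structure_row)]
        for image_row, texture_row, structure_row in zip(image, texture, structure)
    ]


image = [[(0.2, 0.4, 0.6), (0.9, 0.9, 0.9)]]   # 1 x 2 RGB image
texture = [[0.8, 0.1]]                          # texture confidence per pixel
structure = [[0.0, 1.0]]                        # structure confidence per pixel
print(concat_guidance(image, texture, structure))
```

The baseline without guidance corresponds to feeding only the first three channels, and the single-guidance variants to a 4-channel input.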


Network training

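The network is trained by minimizing the MSE between the smoothing output $\tilde{I}$ and the ground-truth structure-only image $I^*$:

$$\ell_I = \frac{1}{N}\sum_{i=1}^{N}\left\| \tilde{I}_i - I^*_i \right\|_2^2,$$

where $N$ is the number of pixels in the image. More details can be found in the experiment section.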

5 Experiments and Analysis

In this section, we demonstrate the effectiveness of our proposed deep image smoothing network through experiments on both generated and natural images.

Environment setup

We construct the networks in TensorFlow [45], and train and test all the data on a single NVIDIA Titan X graphics card.


Because there is no existing texture removal dataset, we perform training using our generated images. More specifically, we select 19,505 images (65%) from the dataset for training, 2,998 (10%) for validation, and 7,497 (25%) for testing (all test images are resized to the same size). There is no overlap of structure-only images between the training, validation and testing samples.
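One way to guarantee that no structure-only image appears in more than one subset is to split at the level of source images rather than generated images. The sketch below illustrates this idea; it is an assumption about how such a split could be made, not a description of the procedure actually used, and the function and variable names are hypothetical.

```python
import random


def split_by_source(sample_to_source, ratios=(0.65, 0.10, 0.25), seed=0):
    """Assign whole source images to train/val/test so that no source is shared."""
    sources = sorted(set(sample_to_source.values()))
    random.Random(seed).shuffle(sources)
    n_train = round(ratios[0] * len(sources))
    n_val = round(ratios[1] * len(sources))

    bucket = {}
    for k, source in enumerate(sources):
        if k < n_train:
            bucket[source] = "train"
        elif k < n_train + n_val:
            bucket[source] = "val"
        else:
            bucket[source] = "test"

    splits = {"train": [], "val": [], "test": []}
    for sample, source in sample_to_source.items():
        splits[bucket[source]].append(sample)
    return splits


# Hypothetical toy data: 200 generated images derived from 20 structure-only images.
mapping = {image_id: image_id % 20 for image_id in range(200)}
splits = split_by_source(mapping)
print({name: len(items) for name, items in splits.items()})
```

With this scheme the ratios hold exactly only when every source contributes the same number of generated images; otherwise they are approximate.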


We first train the three networks separately. 300,000 fixed-size patches are randomly and sparsely collected from training images. We use gradient descent with a learning rate of 0.0001 and a momentum of 0.9. Finally, we perform fine-tuning by jointly training the whole network with a smaller learning rate of 0.00001 and the same momentum of 0.9. The fine-tuning loss is

$$\ell = \ell_I + \lambda_T\,\ell_T + \lambda_E\,\ell_E,$$

where $\ell_T$, $\ell_E$, and $\ell_I$ are the losses of TPN, SPN, and TSAFN, respectively, and the weights $\lambda_T$ and $\lambda_E$ are set empirically.
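For reference, one step of gradient descent with momentum can be sketched as below. The classical (heavy-ball) form is assumed here; the exact variant implemented by the training framework may differ in how the learning rate enters the velocity. The defaults mirror the learning rate and momentum stated above, and the function name is ours.

```python
def momentum_step(params, grads, velocity, lr=1e-4, momentum=0.9):
    """One classical-momentum update; returns the new parameters and velocity."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy example with an exaggerated learning rate so the effect is visible.
params, velocity = [1.0], [0.0]
for _ in range(2):
    params, velocity = momentum_step(params, [2.0], velocity, lr=0.1)
print(params, velocity)
```

In the second step the update is larger than in the first although the gradient is unchanged, because the velocity accumulates past gradients. For fine-tuning, the same step would be applied with `lr=1e-5`.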

Figure 6: Smoothing results on generated images. Our filter can smooth out various types of textures while preserving structures more effectively than other approaches.

5.1 Existing methods to compare

Traditional hand-crafted methods

We compare our filter with 2 classical filters: Total Variation (TV) [24], bilateral filter (BLF) [31], and 9 state-of-the-art filters: L0 [26], Relative Total Variation (RTV) [8], guided filter (GF) [5], Structure Gradient and Texture Decorrelation (SGTD) [20], rolling guidance filter (RGF) [6], fast L0 [9], segment graph filter (SGF) [7], static and dynamic filter (SDF) [11], double-guided filter (DGF) [19]. Note that, BLF, GF, RGF, SGF, DGF are kernel-based, while TV, L0, RTV, SGTD, fast L0, SDF are separation-based. We use the default parameters defined in the open-source code for each method.

Deep learning based methods

We select 5 state-of-the-art deep filtering models: deep edge-aware filter (DEAF) [12], deep joint filter (DJF) [14], deep recursive filter (DRF) [13], deep fast filter (DFF) [16], and cascaded edge and image learning network (CEILNet) [15]. We retrain all the models with our dataset.

5.2 Results

Quantitative results on generated images

We first compare the average MSE, PSNR, SSIM [46], and processing time (in seconds) of 11 hand-crafted filters on our testing data in Table 1. Our method achieves the smallest MSE (closest to ground-truth), largest PSNR and SSIM (removing textures and preserving main structures most effectively), and lowest running time, indicating its superiority in both effectiveness and efficiency. Note that although the double-guided filter (DGF) [19] achieves better quantitative results than other hand-crafted approaches, it runs extremely slowly (more than 50 times slower than ours). This may be due to the complex process of generating two guidances, and the inefficiency of the kernel operation. We also compare the quantitative results on different deep models trained and tested on our dataset in Table 2. Our model achieves the best MSE, PSNR and SSIM, with comparable efficiency to the other methods. We additionally select 4 state-of-the-art methods (SDF [11], DGF [19], DFF [16], and CEILNet [15]) for visual comparison in Fig. 6. The textures in the first example have relatively large scale. SDF, DGF, and CEILNet attempt to remove these textures but the structures are blurred severely as a penalty. In the second example, the textures are densely distributed and have relatively large contrast. SDF performs badly in this example due to the poor texture discrimination. DGF and CEILNet can suppress these textures, but the structures are blurred. Although DFF is able to smooth out almost all the textures, the final results show unexpected artifacts and color shift, and look less similar to the ground-truth than ours. Only our filter performs well in both examples.
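The two pixel-wise metrics can be sketched as follows. This is an illustrative implementation under the assumption that intensities lie in $[0, 1]$ (so the peak value is 1); the function names and the flat-list image representation are ours. SSIM is omitted because it requires windowed statistics.

```python
import math


def mse(output, reference):
    """Mean squared error between two equally long flat lists of intensities."""
    return sum((o - r) ** 2 for o, r in zip(output, reference)) / len(output)


def psnr(output, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    error = mse(output, reference)
    if error == 0:
        return float("inf")
    return 10.0 * math.log10(peak ** 2 / error)


smoothed = [0.0, 0.5, 1.0, 0.5]
ground_truth = [0.1, 0.5, 0.9, 0.5]
print(mse(smoothed, ground_truth), psnr(smoothed, ground_truth))
```

Note that when both metrics are averaged over a test set, the mean PSNR is in general not equal to the PSNR computed from the mean MSE, so the two columns of a results table cannot be converted into each other directly.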

Figure 7: Smoothing results on natural images. The first example shows the ability of weak structure preservation and enhancement in textured scenes. The next four examples present various texture types with different shapes, contrast, and distortion. Our filter performs consistently better than state-of-the-art methods in all the examples, demonstrating its superiority in image smoothing and good generality in processing natural images.

Method        MSE     PSNR   SSIM    Time (s)
TV [24]       0.2791  11.33  0.6817  2.44
BLF [31]      0.3131  10.89  0.6109  4.31
L0 [26]       0.2271  14.62  0.7133  0.94
RTV [8]       0.2388  14.07  0.7239  1.23
GF [5]        0.2557  12.22  0.6948  0.83
SGTD [20]     0.1951  16.14  0.7538  1.59
RGF [6]       0.2094  15.73  0.7173  0.87
Fast L0 [9]   0.2068  15.50  0.7359  1.36
SGF [7]       0.2446  13.92  0.7002  2.26
SDF [10]      0.1665  16.82  0.7633  3.71
DGF [19]      0.1247  17.89  0.7552  8.66
Ours          0.0051  25.07  0.9152  0.16
Table 1: Quantitative evaluation of different hand-crafted filters tested on our dataset

Method        MSE     PSNR   SSIM    Time (s)
DEAF [12]     0.0297  20.62  0.8071  0.35
DJF [14]      0.0352  19.01  0.7884  0.28
DRF [13]      0.0285  21.14  0.8263  0.12
DFF [16]      0.0172  22.21  0.8675  0.07
CEILNet [15]  0.0156  22.65  0.8712  0.13
Ours          0.0051  25.07  0.9152  0.16
Table 2: Quantitative evaluation of deep models trained and tested on our dataset

Guidance                              MSE     PSNR   SSIM
No guidance (Baseline)                0.0316  20.32  0.7934
Only structure guidance               0.0215  21.71  0.8671
Only texture guidance                 0.0118  23.23  0.8201
Double guidance (trained separately)  0.0059  24.78  0.9078
Double guidance (fine-tuned)          0.0051  25.07  0.9152
Table 3: Ablation study of image smoothing effects with no guidance, only structure guidance, only texture guidance, and double guidance (trained separately and fine-tuned)

Qualitative comparison on real images in the wild

We visually compare smoothing results of 5 challenging natural images with SDF [11], DGF [19], DFF [16], and CEILNet [15] in Fig. 7. In the first example, the leopard is covered with black texture, and it has relatively low contrast to the background (weak structure). Only our filter smooths out all the textures while effectively preserving and enhancing the structure. The next four examples present various texture types with different shapes, contrast, and distortion. Our filter performs consistently well in both preserving structures and removing textures. We analyze the last challenging vase example in more detail. The vase is covered with strong dotted textures, densely distributed on the surface. SDF fails to remove these textures since they are regarded as structures with large gradients. DGF smooths out the black dots more effectively but the entire image looks blurry. This is because just as [19] points out, a larger kernel size and more iterations are required to remove more textures, resulting in the blurred structure as a penalty. Also, the naive combination of structure and texture kernels makes the filter not robust to various types of textures, in which case the structure may not always be retained well even with the proper structure guidance. The two deep filters do not demonstrate much improvement over the hand-crafted approaches because “texture-awareness” is not specially emphasized in their network design, necessitating a trade-off between structure preservation and texture removal. Only our filter removes all the textures without blurring the main structure.

Figure 8: Image smoothing results with no guidance, single guidance, double guidance (trained separately, and fine-tuned). With only structure guidance, the main structures are retained as well as the textures. With only texture guidance, all the textures are smoothed out but the structures are severely blurred. The result with double guidance performs well in both structure preservation and texture removal. Fine-tuning the whole network can further improve the performance.

Ablation study of each guidance

To investigate the effect of guidance, we train the filtering network with no guidance, only structure guidance, only texture guidance, and double guidance respectively. We list the average MSE, PSNR, and SSIM of the testing results compared with ground-truth in Table 3, which shows that the results with double guidance have smaller MSE, larger PSNR, and larger SSIM than the other settings. Also, the fine-tuning process improves the filtering network. Further, we show two natural images in Fig. 8. Compared with the baseline without guidance, the result with only structure guidance retains more structure, but also more texture (mainly because HED may itself be negatively affected by strong textures, resulting in a larger MSE loss when training the network). In contrast, the structures are severely blurred with only texture guidance, even though most textures are removed. Combining both structure and texture guidance produces a better result. Fine-tuning further improves the result (in the red rectangle of the first example, the structures are sharper; in the second example, the textures within the red region are further suppressed). All the observations are consistent with the quantitative evaluation in Table 3.
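For readers who want to reproduce this kind of comparison, the following is a minimal sketch of how MSE and PSNR can be computed for one smoothed output against its ground truth. It assumes images stored as NumPy arrays with values in [0, 1]; the function names and the `peak` parameter are illustrative and not taken from the paper. SSIM [46] is more involved and is omitted here.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error between two images of the same shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean((pred - target) ** 2))


def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images).

    `peak` is the maximum possible pixel value; 1.0 is an assumption that
    matches images normalised to [0, 1].
    """
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / err))
```

Averaging these values over all test images gives one row of a table such as Table 3. A lower MSE and a higher PSNR both indicate an output closer to the ground truth.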

6 Conclusion

In this paper, we propose an end-to-end texture and structure aware filtering network that is able to smooth images with both “texture-awareness” and “structure-awareness”. The “texture-awareness” benefits from the newly-proposed texture prediction network. To facilitate training, we blend natural textures into structure-only cartoon images with spatial and color variations. The “structure-awareness” is realized by semantic edge detection. Experiments show that the texture network can detect textures effectively, and that our filtering network outperforms other kernel-based, separation-based, and learning-based filters on both generated images and natural images. The network structure is intuitive and easy to implement, and achieves excellent smoothing ability with comparable efficiency to state-of-the-art methods.


  • [1] Gu, S., Zhang, L., Zuo, W., Feng, X.: Weighted nuclear norm minimization with application to image denoising. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 2862–2869

  • [2] Fattal, R., Agrawala, M., Rusinkiewicz, S.: Multiscale shape and detail enhancement from multi-light image collections. ACM Trans. Graph. 26(3) (2007)  51
  • [3] Winnemöller, H., Olsen, S.C., Gooch, B.: Real-time video abstraction. In: ACM Transactions On Graphics (TOG). Volume 25., ACM (2006) 1221–1226
  • [4] Wang, Y., He, C.: Image segmentation algorithm by piecewise smooth approximation. EURASIP Journal on Image and Video Processing 2012(1) (2012)  16
  • [5] He, K., Sun, J., Tang, X.: Guided image filtering. IEEE transactions on pattern analysis and machine intelligence 35(6) (2013) 1397–1409
  • [6] Zhang, Q., Shen, X., Xu, L., Jia, J.: Rolling guidance filter. In: European Conference on Computer Vision, Springer (2014) 815–830
  • [7] Zhang, F., Dai, L., Xiang, S., Zhang, X.: Segment graph based image filtering: Fast structure-preserving smoothing. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 361–369
  • [8] Xu, L., Yan, Q., Xia, Y., Jia, J.: Structure extraction from texture via relative total variation. ACM Transactions on Graphics (TOG) 31(6) (2012) 139
  • [9] Nguyen, R.M., Brown, M.S.: Fast and effective l0 gradient minimization by region fusion. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 208–216
  • [10] Ham, B., Cho, M., Ponce, J.: Robust image filtering using joint static and dynamic guidance. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 4823–4831
  • [11] Ham, B., Cho, M., Ponce, J.: Robust guided image filtering using nonconvex potentials. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)
  • [12] Xu, L., Ren, J., Yan, Q., Liao, R., Jia, J.: Deep edge-aware filters. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15). (2015) 1669–1678

  • [13] Liu, S., Pan, J., Yang, M.H.: Learning recursive filters for low-level vision via a hybrid neural network. In: European Conference on Computer Vision, Springer (2016) 560–576
  • [14] Li, Y., Huang, J.B., Ahuja, N., Yang, M.H.: Deep joint image filtering. In: European Conference on Computer Vision, Springer (2016) 154–169
  • [15] Fan, Q., Yang, J., Hua, G., Chen, B., Wipf, D.: A generic deep architecture for single image reflection removal and image smoothing. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [16] Chen, Q., Xu, J., Koltun, V.: Fast image processing with fully-convolutional networks. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [17] Fan, Q., Wipf, D.P., Hua, G., Chen, B.: Revisiting deep image smoothing and intrinsic image decomposition. CoRR abs/1701.02965 (2017)
  • [18] Shen, X., Chen, Y., Tao, X., Jia, J.: Convolutional neural pyramid for image processing. CoRR abs/1704.02071 (2017)
  • [19] Lu, K., You, S., Barnes, N.: Double-guided filtering: Image smoothing with structure and texture guidance. In: The IEEE International Conference on Digital Image Computing: Techniques and Applications (DICTA). (Dec 2017)
  • [20] Liu, Q., Liu, J., Dong, P., Liang, D.: Sgtd: Structure gradient and texture decorrelating regularization for image decomposition. In: Proceedings of the IEEE International Conference on Computer Vision. (2013) 1081–1088
  • [21] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). (2014)
  • [22] Dana, K.J., Van Ginneken, B., Nayar, S.K., Koenderink, J.J.: Reflectance and texture of real-world surfaces. ACM Transactions On Graphics (TOG) 18(1) (1999) 1–34
  • [23] Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE international conference on computer vision. (2015) 1395–1403
  • [24] Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60(1-4) (1992) 259–268
  • [25] Farbman, Z., Fattal, R., Lischinski, D., Szeliski, R.: Edge-preserving decompositions for multi-scale tone and detail manipulation. In: ACM Transactions on Graphics (TOG). Volume 27., ACM (2008)  67
  • [26] Xu, L., Lu, C., Xu, Y., Jia, J.: Image smoothing via l 0 gradient minimization. In: ACM Transactions on Graphics (TOG). Volume 30., ACM (2011) 174
  • [27] Bi, S., Han, X., Yu, Y.: An l 1 image transform for edge-preserving smoothing and scene-level intrinsic decomposition. ACM Transactions on Graphics (TOG) 34(4) (2015)  78
  • [28] Subr, K., Soler, C., Durand, F.: Edge-preserving multiscale image decomposition based on local extrema. ACM Transactions on Graphics (TOG) 28(5) (2009) 147
  • [29] Buades, A., Le, T.M., Morel, J.M., Vese, L.A.: Fast cartoon+ texture image filters. IEEE Transactions on Image Processing 19(8) (2010) 1978–1986
  • [30] Liu, W., Chen, X., Shen, C., Liu, Z., Yang, J.: Semi-global weighted least squares in image filtering. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [31] Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Computer Vision, 1998. Sixth International Conference on, IEEE (1998) 839–846
  • [32] Petschnigg, G., Szeliski, R., Agrawala, M., Cohen, M., Hoppe, H., Toyama, K.: Digital photography with flash and no-flash image pairs. ACM transactions on graphics (TOG) 23(3) (2004) 664–672
  • [33] Bao, L., Song, Y., Yang, Q., Yuan, H., Wang, G.: Tree filtering: Efficient structure-preserving smoothing with a minimum spanning tree. IEEE Transactions on Image Processing 23(2) (2014) 555–569
  • [34] Karacan, L., Erdem, E., Erdem, A.: Structure-preserving image smoothing via region covariances. ACM Transactions on Graphics (TOG) 32(6) (2013) 176
  • [35] Jevnisek, R.J., Avidan, S.: Co-occurrence filter. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017)
  • [36] Rick Chang, J.H., Frank Wang, Y.C.: Propagated image filtering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 10–18
  • [37] Tan, X., Sun, C., Pham, T.D.: Multipoint filtering with local polynomial approximation and range guidance. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 2941–2948
  • [38] Dai, L., Yuan, M., Zhang, F., Zhang, X.: Fully connected guided image filtering. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 352–360
  • [39] Yang, Q.: Semantic filtering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4517–4526
  • [40] Cho, H., Lee, H., Kang, H., Lee, S.: Bilateral texture filtering. ACM Transactions on Graphics (TOG) 33(4) (2014) 128
  • [41] Zang, Y., Huang, H., Zhang, L.: Guided adaptive image smoothing via directional anisotropic structure measurement. IEEE transactions on visualization and computer graphics 21(9) (2015) 1015–1027
  • [42] Tai, Y., Yang, J., Liu, X.: Image super-resolution via deep recursive residual network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017)
  • [43] Nah, S., Hyun Kim, T., Mu Lee, K.: Deep multi-scale convolutional neural network for dynamic scene deblurring. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017)
  • [44] Cai, B., Xu, X., Jia, K., Qing, C., Tao, D.: Dehazenet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing 25(11) (2016) 5187–5198
  • [45] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
  • [46] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4) (2004) 600–612
  • [47] Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge university press (2003)

7 Overview

The purpose of this supplementary material is to provide more analysis of our method and experimental results. Specifically, we first give more details about adding spatial variation to training data. Then, we provide the details about how to train the structure prediction network. After that, we give more examples of image smoothing results, as well as qualitative comparison with other methods. We also present a challenging case and analyze potential reasons behind it. Finally, we apply our method to, and show results for, three typical applications of image smoothing.

8 Details of Adding Spatial Variations to Training Data

Textures can appear in spatially varying forms in natural images, as the texture is formed over objects and projected by the imaging process. We add this property when generating our training data. We mainly use four types of geometric transformation in the generating process: scaling, shearing, rotation, and free-form distortion. According to the geometric theory of computer vision [47], combining the first three operations yields a weak perspective projection, which is consistent with the formation of natural images by cameras (assuming the camera is not too close to the scene). Free-form distortion is used to represent projection onto more arbitrary surfaces such as the body of a giraffe, or texture printed on a sheet of waving material.

Suppose the original coordinate is $(x, y)$ and the transformed one is $(x', y')$. The formulas are given below. We also give a stripe texture example for illustration in Fig. 9.

Scaling:

resizes an image in the $x$ or/and $y$ directions:

$$x' = s_x x, \qquad y' = s_y y$$

In our implementation, both $s_x$ and $s_y$ are randomly generated within a fixed range, as shown in Fig. 9(b).

Shearing:

stretches the image in the $x$ or/and $y$ directions:

  • $x$ direction: $x' = x + h_x y, \quad y' = y$

  • $y$ direction: $x' = x, \quad y' = y + h_y x$

In our implementation, $h_x$ and $h_y$ are randomly generated within a fixed range, as shown in Fig. 9(c).

Figure 9: Illustration of spatial variations.

Rotation:

rotates an image in the image plane by an angle $\theta$:

$$x' = x\cos\theta - y\sin\theta, \qquad y' = x\sin\theta + y\cos\theta$$

In our implementation, $\theta$ is randomly generated within a fixed range, as shown in Fig. 9(d).

Free-form distortion:

we add a free-form distortion in spatial coordinates to make the textures look more like those in natural scenes. In our method, we randomly switch pixel values within a kernel of size $k$.

In our implementation, $k$ is randomly selected, as shown in Fig. 9(e).

One possible combined result is shown in Fig. 9(f). Compared with the original input, it has more variation in spatial coordinates, which is closer to some of the more distorted textures that we see in nature.
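The four spatial variations described in this section can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the nearest-neighbour resampling with wrap-around at the borders, and all parameter values in the usage lines are hypothetical. The free-form distortion is approximated by copying, for each pixel, a randomly chosen pixel from its k×k neighbourhood, a simple stand-in for the pixel switching described above.

```python
import numpy as np


def scaling(sx, sy):
    """x' = sx * x, y' = sy * y."""
    return np.array([[sx, 0.0], [0.0, sy]])


def shearing(hx, hy):
    """x' = x + hx * y, y' = y + hy * x."""
    return np.array([[1.0, hx], [hy, 1.0]])


def rotation(theta):
    """Counter-clockwise rotation by theta (radians) in (x, y) coordinates."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def warp(image, A):
    """Apply the 2x2 linear map A about the image centre.

    Uses inverse mapping with nearest-neighbour sampling; source coordinates
    wrap around, which treats the texture as tileable (an assumption).
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    coords = np.stack([xs - cx, ys - cy], axis=0).reshape(2, -1)
    src = np.linalg.inv(A) @ coords
    src_x = np.rint(src[0] + cx).astype(int) % w
    src_y = np.rint(src[1] + cy).astype(int) % h
    return image[src_y, src_x].reshape(image.shape)


def free_form_distortion(image, k, rng):
    """Replace each pixel by a random pixel from its k x k neighbourhood."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    r = k // 2
    dy = rng.integers(-r, r + 1, size=(h, w))
    dx = rng.integers(-r, r + 1, size=(h, w))
    return image[np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)]


# Usage: distort a synthetic stripe texture (all values are hypothetical).
rng = np.random.default_rng(0)
stripes = np.tile((np.arange(64) // 4 % 2).astype(float), (64, 1))
A = rotation(np.deg2rad(20.0)) @ shearing(0.2, 0.0) @ scaling(1.3, 0.8)
distorted = free_form_distortion(warp(stripes, A), k=3, rng=rng)
```

Composing the three matrices before resampling means the texture is interpolated only once, which avoids accumulating resampling error across the individual transformations.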

9 Details of Training Structure Prediction Network

Network architecture

We use the HED [23] architecture to construct the structure prediction network (SPN). As shown in Fig. 3, SPN has 5 stages (each stage contains several convolutional and pooling layers) and is embedded in VGGNet (it trims VGGNet by adding a side output to the last convolutional layer at each stage, and replaces the fully connected layers with fully convolutional layers at the last stage). The side outputs are then fused to form the final output. The whole process can be expressed as the following function:


$$E = \mathrm{fuse}(s_1, s_2, \dots, s_5)$$

where $s_i$ is the side output from the $i$-th stage. Please find more details about the architecture in the original paper [23] and its published code.

Network training

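During training, we randomly and sparsely collect 300,000 fixed-size patches from the 25,176 training images (as mentioned in the main body). Since we have binary edge maps, we follow the steps in [23] to re-train the network by considering both side output loss and fusion loss. The side loss of the $i$-th stage is defined as

$$\ell_{side}^{(i)}(W) = -(1-\beta) \sum_{j \in Y_+} \log \Pr(y_j = 1 \mid X; W) - \beta \sum_{j \in Y_-} \log \Pr(y_j = 0 \mid X; W)$$

where $Y_+$ and $Y_-$ denote the edge (1) and non-edge (0) ground-truth labels respectively, $\beta = |Y_+| / |Y|$ represents the proportion of edge labels, $\Pr(\cdot \mid X; W)$ is the probability predicted by the $i$-th side output for input $X$, and $W$ is the set of parameters. The total side output loss is the sum of five stages:

$$L_{side}(W) = \sum_{i=1}^{5} \ell_{side}^{(i)}(W)$$

The fusion loss is calculated by the cross entropy loss between the fused image $\hat{Y}_{fuse}$ and ground-truth $Y$:

$$L_{fuse}(W) = \mathrm{CE}(\hat{Y}_{fuse}, Y)$$

The total loss is the combination of side output loss and fusion loss:

$$L(W) = L_{side}(W) + L_{fuse}(W)$$

Instead of the Adam optimizer, we use the gradient descent algorithm with momentum (learning rate 0.0001, and momentum 0.9).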
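As a rough illustration of how the side output loss and fusion loss fit together, the sketch below computes a class-balanced cross entropy for each of the five side outputs and a plain cross entropy for the fused output, then sums them. It assumes probability maps and binary labels stored as NumPy arrays; the function names, the `eps` clipping, and the specific weighting (edge terms weighted by the non-edge proportion and vice versa, in the spirit of [23]) are assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np


def balanced_bce(prob, label, eps=1e-7):
    """Class-balanced cross entropy for one edge probability map.

    beta is the proportion of edge pixels. Edge terms are weighted by
    (1 - beta) and non-edge terms by beta, so the rare edge class is
    not swamped by the abundant non-edge class.
    """
    prob = np.clip(np.asarray(prob, dtype=np.float64), eps, 1.0 - eps)
    label = np.asarray(label, dtype=np.float64)
    beta = label.mean()
    pos = -(1.0 - beta) * np.sum(label * np.log(prob))
    neg = -beta * np.sum((1.0 - label) * np.log(1.0 - prob))
    return float(pos + neg)


def cross_entropy(prob, label, eps=1e-7):
    """Plain (unweighted) binary cross entropy, averaged over pixels."""
    prob = np.clip(np.asarray(prob, dtype=np.float64), eps, 1.0 - eps)
    label = np.asarray(label, dtype=np.float64)
    return float(-np.mean(label * np.log(prob)
                          + (1.0 - label) * np.log(1.0 - prob)))


def total_loss(side_probs, fused_prob, label):
    """Sum of the side losses over all stages plus the fusion loss."""
    side = sum(balanced_bce(p, label) for p in side_probs)
    return side + cross_entropy(fused_prob, label)
```

In an actual training setup these quantities would be expressed in the deep learning framework's own tensor operations so that gradients can flow; the NumPy version is only meant to make the structure of the objective explicit.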

10 Additional Experiments on Texture Extraction

Texture extraction results on our test data

As shown in Fig. 3, when generating the data, we also provide ground-truth for texture prediction. Thus, we can investigate the texture extraction abilities of different methods by comparing the extracted textures with texture ground-truth (we actually aim to compare our single TPN with other methods). We compare our method with 6 typical texture separation algorithms: Total Variation (TV) [24], L0 [26], Relative Total Variation (RTV) [8], Structure Gradient and Texture Decorrelation (SGTD) [20], fast L0 [9], and static and dynamic filter (SDF) [11], and normalize their texture layers as the final results. We report the average MSE of different methods tested on our 7,497 testing samples in Table 4. Our TPN achieves the smallest MSE among all the methods, showing its superiority in extracting textures.
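The evaluation protocol above can be sketched as follows for a separation-based method. The sketch assumes that the texture layer is the input minus the method's structure output, and that "normalize" means min-max scaling to [0, 1]; both are assumptions, as are the function names.

```python
import numpy as np


def normalize(x, eps=1e-12):
    """Min-max scale an array to [0, 1] (a constant array maps to zeros)."""
    x = np.asarray(x, dtype=np.float64)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + eps)


def texture_layer(image, structure):
    """Normalised texture layer: the input minus its structure layer."""
    image = np.asarray(image, dtype=np.float64)
    structure = np.asarray(structure, dtype=np.float64)
    return normalize(image - structure)


def average_mse(predictions, ground_truths):
    """Average per-image MSE over a test set."""
    errors = [np.mean((np.asarray(p) - np.asarray(g)) ** 2)
              for p, g in zip(predictions, ground_truths)]
    return float(np.mean(errors))
```

Normalising every method's texture layer to the same range keeps the MSE comparison from being dominated by differences in output scale between methods.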

| Methods | TV [24] | L0 [26] | RTV [8] | SGTD [20] | Fast L0 [9] | SDF [11] | Ours |
|---|---|---|---|---|---|---|---|
| MSE (texture extraction) | 0.2175 | 0.2246 | 0.1954 | 0.1315 | 0.2369 | 0.1738 | 0.0196 |
| MSE (image smoothing) | 0.2791 | 0.2271 | 0.2388 | 0.1951 | 0.2068 | 0.1665 | 0.0051 |

Table 4: Quantitative evaluation of texture extraction results tested on our dataset

| Methods | TV [24] | L0 [26] | RTV [8] | SGTD [20] | Fast L0 [9] | SDF [11] | Ours |
|---|---|---|---|---|---|---|---|
| MSE (texture extraction) | 0.2331 | 0.2494 | 0.2017 | 0.1608 | 0.2433 | 0.1795 | 0.0212 |
| MSE (image smoothing) | 0.2880 | 0.2539 | 0.2375 | 0.1974 | 0.2342 | 0.2190 | 0.0074 |

Table 5: Quantitative evaluation of texture extraction results tested on 100 new images

Texture extraction results on a new dataset

To further verify the generality of our proposed TPN to different types of textures, we build another small dataset specifically for this test. We select 50 natural texture images from another public dataset (from the Signal and Image Processing Institute, University of Southern California), and 100 other structure-only cartoon images from the Internet. We blend these new textures into the structure-only images in the same way as described in the main body. In Table 5, we report the average MSE tested on the 100 new images. Our TPN again achieves the smallest MSE, indicating its adaptability to different types of textures. This result also helps explain why our TPN and filtering networks generalize well to natural image processing. We also give two examples from the new dataset for qualitative comparison in Fig. 10.

Figure 10: Image smoothing results on new images.

11 Ablation Study

In this section, we investigate how much each of the two parts of the guidance contributes to the smoothing effect by comparing four settings: no guidance, only texture guidance, only structure guidance, and double guidance.

Training and validation loss

We train the four networks (without guidance, only with structure guidance, only with texture guidance, and with double guidance) separately, and plot the MSE loss over 100 epochs in Fig. 11. Compared with the results without guidance or with single guidance, the loss with double guidance is the smallest in both the training and validation processes. Both parts of the guidance therefore make an important contribution to overall performance, which further indicates the effectiveness of applying double guidance to image smoothing.

Qualitative comparison

We show several examples, including both generated (Fig. 12) and natural images (Fig. 13), to visually compare the smoothing results with different guidance. Overall, compared with the baseline (without any guidance), the results with only structure guidance retain structures, but also the edges of some strong textures. In contrast, the results with only texture guidance smooth out textures, both strong and weak, more effectively. However, the main structures are obviously blurred. With double guidance, the filter takes advantage of the two properties and performs well in both preserving structures and removing textures.

Figure 11: (a) Training and (b) validation MSE loss of the four networks (without guidance, only texture guidance, only structure guidance, and double guidance). Overall, the loss with double guidance is the smallest in both the training and validation processes. It further indicates the effectiveness of using double guidance rather than single guidance or no guidance in image smoothing.

12 Comparison with Other Methods

We visually compare our smoothing results with Total Variation (TV) [24], L0 [26], Relative Total Variation (RTV) [8], guided filter (GF) [5], Structure Gradient and Texture Decorrelation (SGTD) [20], rolling guidance filter (RGF) [6], fast L0 [9], segment graph filter (SGF) [7], static and dynamic filter (SDF) [10], and double-guided filter (DGF) [19]. We use the default parameters defined in their open-source code.

Fig. 14 and Fig. 15 show image smoothing results on our generated and natural images respectively. They show that our filter performs consistently well in both circumstances in terms of structure preservation and texture removal.

To investigate the image smoothing performance of different deep models [12, 14, 13, 16, 15], we additionally give two challenging examples in Fig. 16. Note that all the models are trained on our dataset. It turns out that our model can remove textures while preserving structures more effectively.

13 Challenging Case

We give a challenging case in Fig. 17, where the eyes, nose, and number of the runner are totally removed as textures, even though they carry important semantic meaning in the real world. The HED we use for constructing structure guidance pays more attention to the object boundary than to details within the object, so it does not give reasonable confidence to these important details. Our texture prediction network cannot distinguish them either. Thus, there is still a long way to go before achieving ideal smoothing results.

14 Applications

Image smoothing is a fundamental technology in image processing and computer vision with a broad range of applications. In the following, we mainly study three typical applications: image abstraction, detail enhancement, and edge detection.

Image abstraction

Image abstraction aims to create a cartoon-like style from an input image. We use the method in [3] for image abstraction, which involves smoothing the input and retaining main structures, detecting difference-of-Gaussian edges, and abstracting the image with soft color quantization. Fig. 18 lists four examples, where we study the abstraction results of the original input and the smoothed image respectively. After smoothing, the abstraction results show less noise and fewer artifacts. Further, the structures are sharpened, indicating the effectiveness of image smoothing.
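To make the soft color quantization step more concrete, the sketch below follows the luminance quantization of [3] as we understand it: each value is snapped to its nearest quantization level and then offset by a tanh ramp, so transitions between levels are smooth rather than abrupt. It assumes luminance values in [0, 1]; the function name, the number of bins, and the sharpness value are hypothetical choices, not settings from the paper.

```python
import numpy as np


def soft_quantize(lum, n_bins=8, sharpness=10.0):
    """Soft luminance quantization with a tanh transition between levels.

    q_nearest is the closest quantization level, `width` is the spacing
    between levels, and `sharpness` controls how abrupt the transition
    from one level to the next is (larger values approach hard steps).
    """
    lum = np.asarray(lum, dtype=np.float64)
    width = 1.0 / n_bins
    q_nearest = np.round(lum / width) * width
    return q_nearest + 0.5 * width * np.tanh(sharpness * (lum - q_nearest))
```

Applying this to the luminance channel of the smoothed image, rather than to the raw input, is what keeps small texture oscillations from producing spurious quantization boundaries.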

Detail enhancement

Suppose $I$ is the input image, and $S$ is the smoothed output. Detail enhancement amplifies the detail layer $I - S$ by a factor $\lambda$ that controls the extent of the boost. The results with different methods are shown in Fig. 19. Our method is able to boost the details without affecting the overall color tone and without causing halos near structures.
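One common way to write detail enhancement is to add an amplified detail layer back onto the smoothed image. The sketch below uses that form, which is an assumption and not necessarily the exact definition used in the paper; the function name and the default factor are hypothetical, and images are assumed to lie in [0, 1].

```python
import numpy as np


def enhance_details(image, smoothed, lam=2.0):
    """Boost details as smoothed + lam * (image - smoothed).

    With lam = 1 the input is returned unchanged, lam > 1 amplifies
    details, and lam = 0 returns the smoothed image. The result is
    clipped to the valid [0, 1] range.
    """
    image = np.asarray(image, dtype=np.float64)
    smoothed = np.asarray(smoothed, dtype=np.float64)
    detail = image - smoothed
    return np.clip(smoothed + lam * detail, 0.0, 1.0)
```

Because the detail layer is the difference between the input and the smoothed output, any structure that the filter blurs leaks into the detail layer and is amplified too, which is what produces halos near edges when the smoothing is not structure-preserving.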

Edge detection

Image smoothing can also function as an essential pre-processing step in many other visual tasks, like edge detection. In Fig. 20, we show the outcome of applying Canny edge detection to the original input and its smoothed version for the different guidance components. It is clear that with image smoothing, the Canny edges are clearer and more refined with less influence by insignificant details. We expect image smoothing to play a more significant role in other tasks. We will focus on this in future work.

Figure 12: Image smoothing results with different guidance on generated images.

Figure 13: Image smoothing results with different guidance on natural images.

Figure 14: Generated image smoothing results.

Figure 15: Natural image smoothing results.

Figure 16: Image smoothing results with different deep networks trained on our dataset. Our model performs better in removing textures and preserving structures at the same time.

Figure 17: Challenging case. The number, eyes, and nose of the runner are smoothed out. Ideally, these should be preserved as they have significant semantic meaning.

Figure 18: Image abstraction results. Compared with abstraction applied directly to the original input, image smoothing helps to suppress more noise and artifacts, and to sharpen the structures.

Figure 19: Detail enhancement results. Our filter can boost the details without affecting the overall color tone or causing halos near structures.

Figure 20: Canny edge detection results. After smoothing, the edges are clearer and more refined with less influence by insignificant details.