Concurrently Extrapolating and Interpolating Networks for Continuous Model Generation

01/12/2020 ∙ by Lijun Zhao, et al. ∙ BEIJING JIAOTONG UNIVERSITY

Most deep image smoothing operators must be retrained whenever a different set of explicit structure-texture pairs is used as label images, that is, once for each algorithm and each parameter configuration. This training strategy is slow and consumes considerable computing resources. To address this issue, we generalize continuous network interpolation into a more powerful model generation tool, and then propose a simple yet effective model generation strategy that forms a sequence of models from only one set of specific-effect label images. To learn image smoothing operators precisely, we present a double-state aggregation (DSA) module, which can be easily inserted into most current network architectures. Based on this module, we design a double-state aggregation neural network with a local feature aggregation block and a nonlocal feature aggregation block to obtain operators with large expression capacity. Objective and visual experimental results show that the proposed method can produce a series of continuous models and achieves better performance than several state-of-the-art image smoothing methods.




I Introduction

To intelligently analyze image content and precisely identify scene objects, image boundaries provide many vital clues, which can generally be obtained via edge detection techniques [1, 2, 3, 4]. However, applying these techniques directly tends to pick up small, unimportant discontinuities with strong gradients, which degrades the performance of high-level real-world applications that rely on edge detection to extract low-level structural features. To resolve this problem, image structure extraction, namely, image smoothing, which eliminates repeated texture elements, is usually applied as a prefiltering operation ahead of edge detection.

To the best of our knowledge, image smoothing, which is also called texture removal, has very wide application prospects in image processing, computer graphics, pattern recognition, computational vision, etc. [5, 6, 7]. When image smoothing and texture removal are conducted simultaneously, the two tasks can be considered a special case of image decomposition. Separating undesirable textures from salient structures is a challenging and fundamentally ill-posed issue [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] since there is no clear definition of textures and structures. In fact, prominent structures may consist of small textures at a certain scale when viewed from a distance, while some of these small textures may become remarkable structures when observed up close. Consequently, it is hard to clearly distinguish them during image smoothing. The key challenges of texture removal or image smoothing can be summarized as follows: (i) Distinguishing texture and structure is fundamentally ill-posed since there is no specific definition of them; (ii) Obtaining ground-truth labels with human-perceptual scores is time-consuming and laborious, which limits the amount of available training data; (iii) Domain gaps exist between synthetic datasets (made by combining texture and cartoon-like images) and real image datasets when implicitly differentiating salient structures from fine details/textures; and (iv) When viewing the edge information of the same region at different scales, different people may regard it as either texture or structure. Since these challenging problems remain open, they deserve deeper study, and further contributions to image smoothing are needed in both theory and practice.

Classical convex optimization underlies many classical approaches that achieve excellent image smoothing performance, but their computational complexity tends to be extremely high, especially when the optimal solution is found iteratively [5, 6, 7, 8]. Several recent works [1, 4, 14, 21, 22, 23, 24] have shown that deep learning techniques are making great progress on low-level and high-level computer vision tasks. However, what a deep neural network learns depends entirely on the specific input and output labels; therefore, the system can only produce filtered images with one specific effect. In practice, a learned image smoothing model is required to generate filtering effects of different magnitudes. However, most existing deep learning approaches cannot provide this functionality.

Although all of the problems discussed above require deep study, several fundamental issues need more urgent solutions before most CNN-based approaches can be used in practical applications, including the structure design of a deep image smoothing network with a strong image mapping capacity, the training strategy, continuous model generation, etc. In this paper, we mainly study continuous network interpolation and the learning of multiple image smoothing operators without training multiple times. Our contributions are summarized as follows:

  • We generalize neural network interpolation as a more powerful model generation tool by tuning network weighting parameters. This tool facilitates the option of choosing a certain desired effect.

  • A simple yet effective model generation strategy is proposed to form a sequence of models with a single training, which only requires a set of specific-effect label images.

  • We propose a double-state aggregation (DSA) module to fuse the information of different stages or states, which can be easily inserted into most of the current neural network architectures.

  • We design a DSA image decomposition network with a local feature aggregation block and a nonlocal feature aggregation block to obtain deep image operators with large expression capacity by optimization learning.

  • Through many objective and visual quality comparisons, it is demonstrated that the proposed method achieves better performance than many state-of-the-art image smoothing methods.

The rest of this paper is organized as follows: 1) The related works are given in Section II, and the problem formulation is presented in Section III; 2) The proposed generalized multiple-model generation framework is introduced in Section IV; 3) We detail the structures of double-state aggregation module in Section V, after which we provide a detailed description of double-state aggregation neural network in Section VI; and 4) The experimental simulation is presented in Section VII, followed by the conclusion in Section VIII.

II Related works

As discussed above, image smoothing poses many thorny problems. To address them, we first review traditional image smoothing techniques and deep image smoothing operators. Since deep network interpolation (DNI) plays a prominent part in learning image smoothing operators for continuous model generation, we also review the literature on this topic.

II-A Traditional image smoothing

To remove the low contrast edges and maintain remarkable boundaries, Xu et al. formulated the image smoothing issue as a global localization problem of important edges with a strategy of counting the number of major boundary pixels [5], which can suppress weak-gradient details. However, the smoothing strategy of this method was intrinsically to remove small nonzero gradients, which unavoidably led to the retention of certain isolated pixels with strong gradient boundary information. Following this work, Cheng et al. proposed a new approximation algorithm [15] to minimize the L0 gradient for two tasks of image smoothing and surface smoothing [5]. Later, by collecting and observing 200 images with types of structure-plus-texture, Xu et al. first verified that an inherent feature difference exists between textural and structural regions, which were discriminated by a measure of relative total variation (RTV). According to this measure, a global objective function was built with the data term and RTV regularizer [6], whose solution is obtained by an iterative numerical solver. Most recently, Guo et al. introduced a concept of relative structure to identify the nature of mutual structure considering inconsistent structure as well as flat structure [19], which was formulated as a nonconvex optimization problem similar to the relative total variation [6]. At the same time, Li et al. presented an efficient guided image smoothing by soft clustering as a kind of catalyzer to promote smoothing, which considered the spatial dependency between neighboring pixels as done in bilateral filtering.

Although these methods can resolve the problem of image smoothing, they cannot separate the input image into a group of images with structures at different scales. Since structures at different scales provide nonlocal clues about scene content, it is of great importance to represent image structures at multiple scales. Observing that Gaussian filtering can remove small structure edges but blurs large-scale structures, Zhang et al. employed Gaussian filtering as an engine to distinguish image structure scales by dynamically rolling guidance filtering (RGF) [7]. However, RGF has difficulty accurately locating object discontinuities for scale-aware filtering at each scale. To tackle this issue, Zhao et al. leveraged a local activity measurement, that is, a clipped and normalized variance or standard deviation, to drive the relative total variation (RTV) for smoothing, namely, LAD-RTVs [8], which can obtain multiscale representation images with sharper boundary preservation than RGF. Borrowing some rules from the rolling guidance filter [7], Ham et al. jointly used image structural information from the static image and dynamically updated the image for a wide variety of applications, including image decomposition, flash/nonflash denoising, and depth super-resolution, etc.


II-B Deep image smoothing operators

Most optimization-based approaches are computed by iterative updating, so they can hardly perform fast image filtering within a given time, and their parallel acceleration cannot be realized on graphics processing unit (GPU) hardware. In contrast, DNN learning-based filtering methods run quickly with the aid of the GPU. Xu et al. first proposed a learning system of boundary protection smoothing operators based on a DNN [9]. This system only needed the input and output of each operator to learn the corresponding filtering models, without considering nonconvex optimization or the underlying theory of image smoothing.

Unlike two-stage image filtering [9], Zhao et al. proposed to learn deep image filtering operators for simultaneous image smoothing and edge detection [14]. As in [14], since the image texture smoother and the structural edge extractor help each other achieve better performance than a single image processing task, Guo et al. considered them jointly, following a principle of iteratively extracting salient edges and then removing the fine details based on the salient edges [17]. To carry out edge-preserving filtering in real time, Liu et al. treated this class of low-level vision issues as recursive image filtering [10], whose network was built upon a deep convolutional neural network as well as recurrent neural networks. Following the key idea of the rolling guidance filter, Li et al. cast the image structure-texture separation task as deep joint image filtering by feeding the output from the previous iteration as the input of the current iteration [11]. Similar to [11], Pan et al. directly applied the trained models of depth image denoising to scale-aware filtering and removed the small-scale structural details of the input images according to the rolling guidance strategy [13].

For general deep image smoothing operators, the neural network parameters must be trained under strong supervision with explicit structure-texture pairs as label images. This explicit supervision means that deep operators can only predict smoothed images of a fixed style. To address this problem, Kim et al. used a DNN to obtain a deep variational prior and inserted it into an iterative smoothing process by using a fast alternating minimization algorithm [16]. For learning-based texture estimation, Lu et al. formed a large dataset by merging numerous natural texture images with clean structure-only images [25]. They developed a deep texture prediction network and a semantic structure prediction network to accurately distinguish texture from structure for structure-aware preservation.

Since no public dataset was generally available for an objective comparison of different algorithms, Zhu et al. established a new dataset to form a benchmark [12]. Using this dataset, a new class of losses, the weighted root mean squared error and the weighted mean absolute error, was introduced to train a general model by measuring the distances between the predicted image and a series of smoothed image labels, which were selected by 14 volunteers from several algorithms with different parameter settings. However, only one deep image smoothing operator can be learned in [12] when using this dataset, which is labeled by humans according to visual perception. Although the above methods can learn deep image operators by data-driven training, it is troublesome to train a new model every time the hyperparameters of the same algorithm change.

Fig. 1: The visualization of the smoothed image predicted with the models -0, -1 and -2 generated by concurrent extrapolation and interpolation (CEI) tool, when two models and are given ( is a control parameter to adjust the smoothness of generated image).

II-C Deep network interpolation

In the early days, Upchurch et al. linearly interpolated pretrained deep convolutional features for automatic high-level semantic transformation [26], which can be coarsely called a special kind of network interpolation. Later, Su et al. modified standard convolutions as a pixel-adaptive convolution (PAC) operation to spatially vary the convolution kernel for effectively learning guidance information [27]. After comparing the operations, it was found that they explore the harmonization of convolutional features, but they are two different tasks, among which the PAC is not designed for DNI.

To avoid repetitive training, Fan et al. first presented a mechanism of learning multiple deep parameterized image operators at the same time, which dynamically updated the weights of a deep base network according to a weight learning network [28]. To further reduce the computational cost, they also extended it to dynamically change the weights of only a single layer in the base network. In [29], Kim et al. systematically reported a CNN-based operator by carefully exploring its working principles. In the same period, Wang et al. gave a simple yet efficient strategy for DNI [30], in which two or more correlated networks were linearly interpolated to smoothly control diverse imagery effects, because almost all previous works targeted learning a deterministic mapping for one desired imagery effect. They applied DNI widely to super-resolution, image restoration, JPEG artifact removal, image-to-image translation, and style transfer, etc.

Similarly, He et al. designed an adaptive feature modification, that is, AdaFM-Net, and it modified the channel-wise feature by adjusting an interpolation coefficient of a basic model and a modulation layer between a start and an end level [31], since it is not easy to generalize deep neural network models toward continuously restoring contaminated images with unseen distortion-levels. The PAC and AdaFM-Net share some ideological similarities, for instance, both of them use a convolution layer to adapt it to a specific domain, but there are prominent differences between them. The former focuses on learning spatially adaptive guidance for deep joint image filtering, while the latter mainly studies a generalized model for continuously unseen distortion-level restoration without the need of an additional training stage [27]. Similar to AdaFM-Net, Wang et al. designed a CFSNet for image restoration, which controls latent convolutional features by adaptively learning coupling coefficients of diverse layers and channels [21].

Fig. 2: The visualization of the smoothed image predicted with the models generated by concurrent extrapolation and interpolation (CEI) tool, when only a set of specific-effect label images is given.

III Problem formulation

In theory, it is mathematically assumed that a captured image is able to be decomposed as a linear combination of a texture layer and a structure layer , i.e., . Given a natural image, an infinite number of solutions can be obtained according to a variety of priori constraints. In [12], all the ground-truth smoothing results obtained by using many classically advanced image smoothing algorithms for training and testing are manually selected by 14 volunteers. According to human subjective perception, a high weight is assigned for each high-quality ground-truth smoothing image to form the quantitative measures [12]. Although a fixed-effect human-perceptual model is capable of being trained by using this dataset as the training one, there is no other available human-perceptual model with continuous imagery generation; therefore, the needs of different users cannot be met. Consequently, we should extend this fixed human-perceptual model to multiple models to satisfy various requirements. When directly training different models with different perceptual labels, we must have more ground-truth smoothing images with various assigned scores after these images are repeatedly watched and assessed by many volunteers. However, it is almost impossible to carefully and accurately label each image, especially for many datasets with millions of images, which is a labor-intensive and time-consuming process. Meanwhile, there is no efficient tool to quantify the human consistent assessment when continuous-perception scores are required.

To resolve these problems, we should fully take account of multiple model generation techniques such as the DNI, when putting forward an image smoothing approach based on artificial neural network. Given two correlated models based on a deep neural network, parameterized by various groups of parameters and , a model generation of the DNI technique [30], whose parameters are , can be formulated as:


where is a weighting coefficient to control continuously specific degrees between imagery effect- and imagery effect-, whose corresponding models are labeled as and , e.g., image smoothness for texture removal. When any one model such as is trained with a group of specific-effect training datasets, the other model can be fine-tuned to adapt to a new group of specific-effect training datasets and vice versa.
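A minimal sketch of the parameter interpolation described above, assuming the usual DNI convention that the new parameters are a convex combination of the two given parameter sets. The names `interpolate_params`, `theta_a`, `theta_b`, and `alpha` are illustrative, and the choice of which model receives the weight `alpha` is an assumption made here rather than something recovered from the text.

```python
def interpolate_params(theta_a, theta_b, alpha):
    """Blend two parameter dictionaries layer by layer.

    theta_a, theta_b: dicts mapping a layer name to a flat list of floats;
    both models must share one architecture (same keys, same lengths).
    alpha: weighting coefficient; alpha = 1 returns theta_a, alpha = 0 theta_b
    (an assumed convention).
    """
    if theta_a.keys() != theta_b.keys():
        raise ValueError("models must share the same layers")
    blended = {}
    for name in theta_a:
        wa, wb = theta_a[name], theta_b[name]
        if len(wa) != len(wb):
            raise ValueError(f"shape mismatch in layer {name!r}")
        blended[name] = [alpha * a + (1.0 - alpha) * b for a, b in zip(wa, wb)]
    return blended


# Toy parameters for two correlated models (hypothetical values).
theta_a = {"conv1.weight": [0.0, 2.0], "conv1.bias": [1.0]}
theta_b = {"conv1.weight": [4.0, 6.0], "conv1.bias": [3.0]}
theta_mid = interpolate_params(theta_a, theta_b, 0.5)
```

Because only the stored weights are combined, generating a new model this way needs no additional training pass.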

Fig. 3: The diagram of the generalized multiple model generation (Each pentagonal star represents a specific model using the same network architecture and , , , and are some examples of continuous modes).

As exhibited above, the new model has the same network structure as the models and , which has layers. The intermediate-effect feature maps can be generated by linear parameter interpolation, when the feature maps in the -th layer are fed into the -th layer of the interpolated model. The network interpolation process can be represented as follows:


in which is the convolution operation. After simplification, we can obtain:


This equation can be further rewritten as:


where and are, respectively, the predicted results from and using convolutional kernels of and . From this equation, it is easy to see that the intermediate-effect images can be obtained by linear weighting. When each layer is handled by a piecewise linear activation function, e.g., the rectified linear unit (ReLU) function or its variants, such as the parametric ReLU and leaky ReLU functions, the interpolated network can, to a certain extent, maintain linearity in the output of each layer, which finally leads to continuous intermediate-effect images created by the new interpolated networks when continuously varying the trade-off parameter . The majority of the above deep methods solely consider deep network interpolation, and almost no work studies deep network extrapolation. Given only a set of specific-effect label images, there is no available technique for concurrent network interpolation and extrapolation. Thus, it is necessary to conduct a more in-depth study of continuous model generation.
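The linearity argument for a single convolution layer can be checked numerically. The sketch below uses a 1-D convolution for brevity and hypothetical kernels, biases, and coefficient; it compares interpolating the parameters before convolving against interpolating the two outputs after convolving. It covers only one layer without an activation, so it does not show the approximate behaviour of a full nonlinear network.

```python
def conv1d(signal, kernel, bias=0.0):
    """Valid cross-correlation of a 1-D signal with a kernel, plus a bias."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]


def blend(u, v, alpha):
    """Elementwise alpha * u + (1 - alpha) * v."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(u, v)]


x = [1.0, 2.0, 4.0, 8.0, 16.0]          # shared input feature map
k_a, b_a = [1.0, 0.0, -1.0], 0.5        # hypothetical layer of the first model
k_b, b_b = [0.25, 0.5, 0.25], -0.5      # hypothetical layer of the second model
alpha = 0.25

# Route 1: interpolate kernel and bias, then convolve once.
out_param = conv1d(x, blend(k_a, k_b, alpha), alpha * b_a + (1.0 - alpha) * b_b)
# Route 2: convolve with each model, then interpolate the two outputs.
out_feat = blend(conv1d(x, k_a, b_a), conv1d(x, k_b, b_b), alpha)
```

Both routes give the same feature map because convolution is linear in its kernel and bias when the input is held fixed.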

IV Proposed Generalized Multiple Model Generation Framework

Currently, many image smoothing operators are always trained repetitively, when different explicit structure-texture pairs are employed as label images for each algorithm with different parameters. This kind of training often takes several days or even two weeks, and it also consumes massive equipment resources. To resolve these challenging issues, we generalize the DNI technique as a more powerful model generation tool, namely, the concurrent extrapolation and interpolation (CEI) tool, to obtain more novel models. In other words, these models can produce a sequence of new images, in which the predicted effects are less than/more than the two given models, rather than only continuous imagery intermediate-effect transition, in comparison to the DNI technique. Our CEI tool refers to concurrent deep network extrapolation (DNE), which has a forward DNE mode and back DNE mode , and deep network interpolation to generate continuous models, as shown in Fig. 3, whose predicted images are depicted in Fig. 1 and can be written as:


Given two models (less smooth) and (more smooth) of image smoothing as well as an extrapolating parameter or , for instance, we apply the forward DNE mode, which means that predicted images become smoother when gradually increases. To clearly observe it, we can rewrite the forward DNE part of the above equation as , in which is far greater than 1 since is restricted between 0 and 1. As a result, the effect predicted by the generated model , whose smoothness is denoted as , is smoother than with a smoothness of , that is, . Similarly, we can obtain for back DNE mode. As described in the last section, when applying deep networks interpolation.
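As a rough illustration of how interpolation and the two extrapolation modes can share one formula, the sketch below places every generated model on the straight line through two parameter vectors. The single coefficient `t` is a simplification assumed here (0 to 1 interpolates, above 1 plays the role of the forward mode, below 0 the back mode); the paper's own parametrization is not recoverable from this text. The 3-tap kernels and the test signal are hypothetical.

```python
def point_on_line(theta_a, theta_b, t):
    """Parameters at position t on the line through theta_a (t=0) and theta_b (t=1)."""
    return [a + t * (b - a) for a, b in zip(theta_a, theta_b)]


def apply_filter(signal, kernel):
    """Valid cross-correlation of a 1-D signal with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def total_variation(signal):
    """Sum of absolute neighbour differences; lower means smoother."""
    return sum(abs(q - p) for p, q in zip(signal, signal[1:]))


x = [0.0, 1.0] * 4               # strongly oscillating toy "texture"
theta_a = [0.05, 0.9, 0.05]      # less smooth model (hypothetical kernel)
theta_b = [0.1, 0.8, 0.1]        # more smooth model (hypothetical kernel)

tv = {
    t: total_variation(apply_filter(x, point_on_line(theta_a, theta_b, t)))
    for t in (-1.0, 0.0, 0.5, 1.0, 2.0)
}
```

In this toy case the total variation of the output decreases monotonically as `t` grows, so `t = 2` is smoother than the smoother given model and `t = -1` is less smooth than the other one.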

Since there is no further training of deep extrapolated networks from the given models, some problems such as color drift may appear in the predicted image, when directly using the above tool. To alleviate these problems, we first interpolate the intermediate-effect imagery with the generated model between the imagery effect- from a given model and imagery effect- from a given model . Since the similarity between the models and / is higher than that of the models and , we use the models and / to form continuous models for deep network extrapolation. That is, as shown in Fig. 3, we can reformulate it as follows:


in which we propose a two-step (TS) deep network extrapolation to predict a series of images, as shown in Fig. 1. To apply this equation for the two-step deep network extrapolation, we should first obtain the intermediate model via the deep network interpolation operation, after which the deep network extrapolation operation is carried out to form extrapolated models. In actual usage, we can directly use the formula in Eq. 9 rather than generating extrapolated models step by step.
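Why a single formula can stand in for the two steps can be seen from a short derivation. The notation is an assumption introduced for illustration, since the original symbols are not recoverable here: $\theta_A$ and $\theta_B$ are the parameters of the two given models, $\beta \in (0, 1)$ is the interpolation coefficient, and $s > 1$ is a hypothetical extrapolation coefficient applied to the pair formed by the intermediate model and $\theta_B$.

```latex
\begin{aligned}
\theta_{\mathrm{mid}} &= \theta_A + \beta\,(\theta_B - \theta_A), \\
\theta_{\mathrm{ext}} &= \theta_{\mathrm{mid}} + s\,(\theta_B - \theta_{\mathrm{mid}}) \\
 &= \theta_A + \bigl[\beta + s\,(1 - \beta)\bigr]\,(\theta_B - \theta_A).
\end{aligned}
```

Under these assumptions the composed coefficient $\beta + s(1 - \beta)$ exceeds 1 whenever $s > 1$, so the result lies beyond $\theta_B$ on the same line, and the intermediate model never has to be stored. The same argument applied to the pair $(\theta_{\mathrm{mid}}, \theta_A)$ covers the opposite direction.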

Fig. 4: The diagram of double-state aggregation module (DSA module, in which , and are two inputs and the output of this module respectively).

This tool solely supports image smoothing operator learning to obtain two models from two given sets of datasets, so we cannot interpolate a series of new models directly when only a set of specific-effect label images is given. To be capable of learning multiple model generation operators for this case, we propose a simple yet effective model generation strategy to form a sequence of models, that is, mapping the input image back to the input image to continue training a new operator, after learning an operator of a set of specific-effect labels. Then, we can use these models to concurrently extrapolate and interpolate networks to obtain new models, whose predicted images are shown in Fig. 2, toward continuous imagery transition for image smoothing.
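The single-label-set strategy can be sketched on a toy problem, assuming a 3-tap linear filter in place of the network and plain gradient descent in place of the actual optimizer. The first model is fitted to the only available label set; the second is initialized from the first and fine-tuned to map the input back to itself. All names and hyperparameters (`fit`, `lr`, `steps`, the kernel that generates the labels) are hypothetical, and zero initialization replaces random initialization to keep the run deterministic.

```python
import random


def apply_filter(signal, kernel):
    """Valid cross-correlation of a 1-D signal with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def fit(kernel, signal, target, steps=300, lr=0.5):
    """Gradient descent on the mean squared error, starting from `kernel`."""
    kernel = list(kernel)
    n = len(target)
    for _ in range(steps):
        pred = apply_filter(signal, kernel)
        grad = [
            2.0 / n * sum((pred[i] - target[i]) * signal[i + j] for i in range(n))
            for j in range(len(kernel))
        ]
        kernel = [w - lr * g for w, g in zip(kernel, grad)]
    return kernel


random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(200)]

# The only label set available: one specific smoothing effect.
labels = apply_filter(x, [0.25, 0.5, 0.25])
# "Input mapped back to input": the input cropped to the valid output size.
identity_target = x[1:-1]

theta_smooth = fit([0.0, 0.0, 0.0], x, labels)        # learn the labelled effect
theta_input = fit(theta_smooth, x, identity_target)   # fine-tune toward identity
```

The two resulting kernels then serve as the pair of models between and beyond which new models are generated.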

Fig. 5: The diagram for the proposed deep double-state aggregation neural network (DSAN).

V Double-State aggregation (DSA) module

In the past years, deep cascaded networks have generally been built from a sequence of convolution layers and activation functions [32, 33]. However, since the derivative through successive convolution layers can become exponentially large or small, the gradients cannot adequately propagate from very deep layers to the shallow layers, and thus such a cascaded network cannot be well trained for task prediction without expert tricks, including pretraining, layer-by-layer initialization, etc. [34]. To relieve these issues, a series of network structures with short-cut or residual connections, such as ResNet and Inception-ResNet, have been explored to improve the accuracy of high-level and low-level computational vision task prediction [24, 23]. These networks focus on better gradient backpropagation and residual learning. However, they overlook the importance of fusing features from different stages or different layers.

Although the recurrent neural network (RNN) and its variants, such as long short-term memory (LSTM) and the gated recurrent unit (GRU), have two inputs and two outputs, these recurrent structures are designed for sequence data and are unsuitable for image data. To fully leverage diverse features at different stages or layers, we propose a double-state aggregation module, as shown in Fig. 4. Different from the RNN and its variants, our module modulates the two input states into multiple states and combines them by summation, after which these states are activated to obtain triple residual states. Finally, these triple residual states are summed to form the final fusion output of our DSA module.

Given two input states: and , four pairs of dual convolution summation (DCS) units fuse these inputs together by the convolutional operation to permute and combine these input features in the form of addition. Here, to avoid wasting the convolutional features with negative values, the outputs of the first DCS unit in Fig. 4 are multiplied by and before they are activated by the leaky ReLU function (LRelu, ), that is,


in which , and are two weights and the bias of the first DCS unit. After the activation of feature maps, they are respectively multiplied by tanh () activated output features and of the second DCS unit and third DCS unit in an elementwise manner, and then we can obtain and , that is,


in which , and are two weights and the bias of the 2nd/3rd DCS unit. Additionally, denotes the elementwise multiplication. In the meantime, the outputs of the last DCS unit are activated by the sigmoid function , which is multiplied by tanh-activated input-b features in an elementwise manner. This operation can be written as:


in which , and are two weights and the bias of the last DCS unit, respectively. Finally, three kinds of features are combined utilizing addition, which can be written as:


in which is the final output of the module.
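The data flow of the module can be sketched as follows, with several assumptions stated up front because the equations are not recoverable here: each DCS unit is reduced to a per-element affine sum (a stand-in for 1x1 convolutions on single-channel features), the two factors applied to the first DCS output are taken to be +1 and -1, the leaky ReLU slope is set to 0.2, and "tanh-activated input-b features" is read as tanh applied to the second input. The function names are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the slope value is an assumption."""
    return x if x >= 0.0 else slope * x


def dcs(a, b, w_a, w_b, bias):
    """Dual convolution summation, reduced here to a per-element affine sum."""
    return [w_a * ai + w_b * bi + bias for ai, bi in zip(a, b)]


def dsa(a, b, params, slope=0.2):
    """Fuse two states a and b; params holds four (w_a, w_b, bias) triples."""
    s1 = dcs(a, b, *params[0])
    g2 = [math.tanh(v) for v in dcs(a, b, *params[1])]
    g3 = [math.tanh(v) for v in dcs(a, b, *params[2])]
    g4 = [1.0 / (1.0 + math.exp(-v)) for v in dcs(a, b, *params[3])]
    out = []
    for i in range(len(a)):
        r1 = leaky_relu(s1[i], slope) * g2[i]    # positive branch (assumed +1 factor)
        r2 = leaky_relu(-s1[i], slope) * g3[i]   # negated branch (assumed -1 factor)
        r3 = g4[i] * math.tanh(b[i])             # sigmoid gate on tanh of input b
        out.append(r1 + r2 + r3)                 # sum of the three residual states
    return out


state_a = [0.5, -1.0, 2.0]
state_b = [1.0, 0.0, -0.5]
weights = [(0.3, -0.2, 0.1), (0.5, 0.5, 0.0), (-0.4, 0.6, 0.0), (0.2, 0.2, -0.1)]
fused = dsa(state_a, state_b, weights)
```

With all weights and biases set to zero, the first two branches vanish and the output reduces to half of tanh of the second input, which is a convenient sanity check for this particular sketch.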

VI Deep Double-State Aggregation Neural Network

In this paper, we design a deep double-state aggregation neural network (DSAN) to learn image smoothing operators, as depicted in Fig. 5. It is composed of four parts: input feature initialization (IFI) block, a local feature double-state aggregation (LFA) block, a non-local feature double-state aggregation (NLA) block, and an upsampling-reorganization (URO) block. Since the IFI block is responsible for extracting -channel feature maps from the input image , only one convolutional layer with a spatial kernel size of is used to obtain them, followed by the leaky ReLU function , which can be written as , in which is the parameter of the IFI block. After feature initialization, we use a local feature aggregation block and a nonlocal feature aggregation block to obtain local and nonlocal features from distinct receptive field regions. As described above, we present a double-state aggregation (DSA) module, which can efficiently fuse features of different-stages or different-layers. Note that it can be easily inserted into most current network architectures. In the following, we will introduce the proposed LFA block, NLA block and URO block in detail.

VI-A Local feature double-state aggregation (LFA) block

Before introducing the LFA block, the ResNet-like structure with our DSA module is first defined as Res-DSA, which is marked as , as shown in Fig. 5. When the Res-DSA structure is used for the -th time, this process is denoted as . In the Res-DSA, the short-cut connection and two consecutive convolutional layers form two states as the inputs and of a DSA module. Different from ResNet, we use our DSA module to merge the two states, rather than summing them directly. In the LFA block, two Res-DSA structures are first cascaded to extract local features, and they produce outputs at different stages of the LFA block, from which we obtain two local states and . Finally, these two states are merged by a new DSA module for different-stage information (DSI) aggregation, which is presented as:

Fig. 6: The diagram of multiple stages atrous-convolution (MSAC).

VI-B Nonlocal feature double-state aggregation (NLA) block

Since the receptive field of the LFA block is limited to local regions, we use multiple-stage atrous-convolution (MSAC) to obtain nonlocal features, which is inspired by [22]. We denote this operation as when this network structure is used for the -th time. Within the MSAC structure, three parallel LReLU-activated atrous convolution layers with a spacing of 1, 4, and 8 between kernel elements are followed by a concatenation operation and a standard convolution layer with a spatial size of 1x1; this unit, namely, parallel connected atrous convolution (P-CAN), can capture features over a larger field than several standard convolutions, as shown in Fig. 6. To nonlocally perceive spatial correlations, we use three cascaded P-CAN units to extract features. Before concatenating the multiple-stage features from these three P-CAN units, we use three standard convolution layers with a spatial size of 1x1 to rearrange these features respectively. Finally, we use standard convolution layers to shrink the channel number after concatenation.
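A small 1-D sketch of one P-CAN unit, assuming a kernel size of 3 (the text gives only the spacings), zero padding that keeps the output length, a leaky ReLU slope of 0.2, and a 1x1 convolution reduced to a weighted sum over the three branches. The helper `effective_size` uses the standard relation between kernel size, dilation, and the span a dilated kernel covers. All function names are hypothetical.

```python
def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the slope value is an assumption."""
    return x if x >= 0.0 else slope * x


def atrous_conv1d(signal, kernel, dilation):
    """Same-size dilated cross-correlation with zero padding (odd kernel length)."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - half) * dilation   # taps are `dilation` samples apart
            if 0 <= idx < n:
                acc += w * signal[idx]
        out.append(acc)
    return out


def effective_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def p_can(signal, kernels, mix, dilations=(1, 4, 8)):
    """Three parallel activated atrous branches fused by a per-position weighted sum."""
    branches = [
        [leaky_relu(v) for v in atrous_conv1d(signal, k, d)]
        for k, d in zip(kernels, dilations)
    ]
    return [
        sum(m * branch[i] for m, branch in zip(mix, branches))
        for i in range(len(signal))
    ]


impulse = [0.0] * 17
impulse[8] = 1.0
ones = [1.0, 1.0, 1.0]
response = atrous_conv1d(impulse, ones, 4)
fused = p_can(impulse, [ones, ones, ones], mix=[1.0, 1.0, 1.0])
spans = [effective_size(3, d) for d in (1, 4, 8)]
```

For a 3-tap kernel the three branches span 3, 9, and 17 positions, which is how a single unit reaches well beyond the reach of one standard 3-tap convolution.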

However, global feature extraction on full-resolution feature maps has high computational complexity and takes a very large amount of memory. As a result, the spatial size of the input feature maps of should be reduced by a downsampling operation , that is, using a standard convolution layer with a stride of , whose operation can be written as: . After that, we combine the MSAC together with Res-DSA as:


By combining MSAC and Res-DSA twice, two nonlocal states are formed as the inputs of a DSA module for DSI aggregation, that is,


VI-C Upsampling-Reorganization (URO) block

After local and nonlocal feature aggregation, the outputs of the LFA block and the NLA block are concatenated and then fused by our URO block. Since the spatial size of the feature map is smaller than that of , we cannot directly concatenate them along the channel dimension. Thus, a transposed convolution is used to upsample to full resolution, which can be written as: . After the concatenation of and , we use a standard convolution to halve the channel number of the concatenated feature maps, . To reorganize these features , we use the Res-DSA, which is followed by a convolution layer that reconstructs an output , which is the final predicted image .
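The shape bookkeeping behind the downsampling in the NLA block and the upsampling in the URO block can be sketched as follows. This is only an illustration under stated assumptions: the stride (2), kernel sizes, padding, and channel width are hypothetical values chosen for the example, since the paper's values are not recoverable from this section. The size formulas are the standard ones for convolution and transposed convolution with dilation 1.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length of a standard convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def transposed_conv_out_size(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# All numbers below are illustrative assumptions, not values from the paper.
HEIGHT, WIDTH = 400, 528  # the test images are resized to 528*400
CHANNELS = 64             # hypothetical feature width

# NLA branch: an assumed stride-2 3x3 convolution halves the resolution.
low_h = conv_out_size(HEIGHT, kernel=3, stride=2, padding=1)
low_w = conv_out_size(WIDTH, kernel=3, stride=2, padding=1)

# URO block: an assumed 4x4 transposed convolution with stride 2 restores it.
up_h = transposed_conv_out_size(low_h, kernel=4, stride=2, padding=1)
up_w = transposed_conv_out_size(low_w, kernel=4, stride=2, padding=1)

concat_channels = CHANNELS + CHANNELS  # LFA output + upsampled NLA output
fused_channels = concat_channels // 2  # the convolution halves the channels

print((low_h, low_w), (up_h, up_w), concat_channels, fused_channels)
# (200, 264) (400, 528) 128 64
```

Only after the transposed convolution do the two branches share the same spatial size, which is what makes concatenation along the channel dimension possible.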

Learning Strategy B B B B B A2B B2A2B
DSI aggregation N Y N N Y Y Y
TABLE I: The settings of our entire DSAN model (specific-effect B) and its series of variant models (Y=yes, N=no).
Learning Strategy B2A B2A B2A A A B2A A2B2A
DSI aggregation N Y N N Y Y Y
TABLE II: The settings of our entire DSAN model (specific-effect A) and its series of variant models (Y=yes, N=no).
1: Given two sets of image pairs: and ; given the network structure of our DSAN
2: Two specific-effect models and the corresponding continuous interpolated or extrapolated models: , and ;
3: Randomly initialize the weights of each convolutional layer in the DSAN network for a specific effect ;
4: Optimize the loss function of Eq. (17) with for a specific effect to get a start model of ;
5: Initialize the network for a specific effect using the parameters of ;
6: Optimize the loss function of Eq. (17) with for a specific effect to get a model of ;
7: Initialize the network for a specific effect using the parameters of ;
8: Optimize the loss function of Eq. (17) with for a specific effect to get a model of ;
9: Extrapolate and interpolate simultaneously for new continuous models using the proposed CEI tool according to Eq. (9);
10: return , and ;
Algorithm 1 Learning two deep image smoothing operators and generating continuous models with the proposed CEI tool

VI-D Learning strategy

Given a pair of input images with a size of , our target is to learn a DSAN network for a specific image smoothing operator, e.g., to predict a specific effect- image using as its label counterpart. The loss function for training our framework can be formulated as:


in which is the L1 norm, is the pixel set of , and is the structural similarity index (SSIM) computed at each pixel between and . In addition, is a predefined trade-off hyperparameter that balances the L1-restricted data loss and the SSIM loss when learning an image smoothing operator.
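To make the two terms described above concrete, here is a minimal sketch of an L1 data term combined with an SSIM term through a trade-off weight. It is not the paper's Eq. (17): the exact form is not recoverable here, so the combination `L1 + trade_off * (1 - SSIM)` is an assumption, as is the use of a single global SSIM window instead of a per-pixel SSIM map. All function and parameter names are hypothetical, and images are flattened lists of intensities in [0, 1].

```python
def l1_loss(pred, label):
    """Mean absolute difference between prediction and label."""
    return sum(abs(p - q) for p, q in zip(pred, label)) / len(pred)


def global_ssim(pred, label, data_range=1.0):
    """SSIM over one global window (simplification of the per-pixel SSIM)."""
    c1 = (0.01 * data_range) * (0.01 * data_range)
    c2 = (0.03 * data_range) * (0.03 * data_range)
    n = len(pred)
    mu_p = sum(pred) / n
    mu_l = sum(label) / n
    var_p = sum((p - mu_p) * (p - mu_p) for p in pred) / n
    var_l = sum((q - mu_l) * (q - mu_l) for q in label) / n
    cov = sum((p - mu_p) * (q - mu_l) for p, q in zip(pred, label)) / n
    numerator = (2 * mu_p * mu_l + c1) * (2 * cov + c2)
    denominator = (mu_p * mu_p + mu_l * mu_l + c1) * (var_p + var_l + c2)
    return numerator / denominator


def smoothing_loss(pred, label, trade_off=0.5):
    """Assumed form: L1 data loss plus weighted SSIM loss (1 - SSIM)."""
    if len(pred) != len(label):
        raise ValueError("prediction and label must have the same length")
    return l1_loss(pred, label) + trade_off * (1.0 - global_ssim(pred, label))


prediction = [0.1, 0.4, 0.7, 0.9]
label = [0.2, 0.3, 0.8, 0.7]
print(smoothing_loss(prediction, label))
```

Setting `trade_off=0` in this sketch corresponds to the L1-only training that the experiments call DSAN(w/o SSIM loss).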

Recently, it has been shown that fine-tuning from a pretrained model can greatly improve prediction accuracy [35, 36]. To learn two correlated image smoothing operators and using the same network structure, for example, once we have obtained a CNN model for a specific effect by learning, all of the parameters of the operator can be initialized by the model of and then trained to obtain a model of a specific effect . Compared with , the learned is trained from a pretrained model rather than from scratch; thus, when the parameters of are used as the weight initialization of the network, a better model of the specific effect can be further trained to obtain a new model of . Given two learned models that share the same network structure, a group of continuous-effect operators can be simultaneously extrapolated and interpolated using our CEI tool according to Eq. (9). For clarity, we summarize this procedure in Algorithm 1. For ease of expression, we denote the learning strategy in Algorithm 1 as A2B2A. Similarly, we use B2A to denote the strategy in which we first obtain a CNN model of , which is followed by training a new model . If is learned by using the parameters of as its initialization after the learning of B2A, we refer to this strategy as B2A2B.
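The chained initialization and the model generation step described above can be sketched in a few lines. This is an illustration under assumptions, not the paper's CEI tool: Eq. (9) is not reproduced in this section, so the sketch assumes a linear combination of two parameter sets in the style of DNI [30], with the coefficient `alpha` allowed to leave [0, 1] for extrapolation. The names `parse_strategy` and `combine_models` and the toy parameter dictionaries are hypothetical.

```python
def parse_strategy(name):
    """Split a strategy name such as 'A2B2A' into (effect, init_source) stages.

    init_source is None for random initialization, otherwise the effect whose
    trained parameters are used to initialize the current stage.
    """
    stages = []
    previous = None
    for effect in name.split("2"):
        stages.append((effect, previous))
        previous = effect
    return stages


def combine_models(params_a, params_b, alpha):
    """Assumed combination: theta = alpha * theta_A + (1 - alpha) * theta_B.

    0 <= alpha <= 1 interpolates between the two models; alpha outside that
    range extrapolates beyond one of them.
    """
    if params_a.keys() != params_b.keys():
        raise ValueError("both models must share the same network structure")
    return {
        name: [alpha * a + (1.0 - alpha) * b
               for a, b in zip(params_a[name], params_b[name])]
        for name in params_a
    }


# Toy parameter sets standing in for two trained models of effects A and B.
model_a = {"conv1.weight": [1.0, 2.0]}
model_b = {"conv1.weight": [3.0, 6.0]}

print(parse_strategy("A2B2A"))
for alpha in (-0.5, 0.0, 0.5, 1.0, 1.5):
    print(alpha, combine_models(model_a, model_b, alpha))
```

Because the combination is applied parameter by parameter, both models must come from the same network structure, which is why the text trains the two effects with a shared DSAN architecture.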

VII Experimental Simulation

To verify the effectiveness of the proposed method, we give a qualitative and quantitative evaluation of image smoothing. PSNR and SSIM are calculated as quality metrics to compare the performance of the comparative methods. As in [37], SSIM is converted to decibels, denoted as SSIM(D), according to $\mathrm{SSIM(D)} = -10\log_{10}(1-\mathrm{SSIM})$, to make the quality factor legible, because SSIM values always tend to be close to each other.
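The two reported metrics can be sketched as follows. The decibel conversion assumes the form used in [37], namely -10 log10(1 - SSIM); the function names and the flattened-list image representation with intensities in [0, 1] are assumptions made for this illustration.

```python
import math


def psnr(pred, label, peak=1.0):
    """Peak signal-to-noise ratio in dB for flattened images."""
    mse = sum((p - q) * (p - q) for p, q in zip(pred, label)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def ssim_to_db(ssim):
    """Convert an SSIM value to decibels: -10 * log10(1 - SSIM)."""
    if ssim >= 1.0:
        return math.inf
    return -10.0 * math.log10(1.0 - ssim)


# SSIM values that look almost identical become easy to tell apart in dB.
for value in (0.99, 0.995, 0.999):
    print(value, round(ssim_to_db(value), 2))
```

On this scale, an SSIM of 0.99 maps to about 20 dB and 0.999 to about 30 dB, which is why small SSIM differences become legible.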

Fig. 7: The objective comparison between our entire DSAN model and its series of variant models.

VII-A Implementation Details

To train our model, we use the 400 training images from [12], and the 100 testing images of [12] (denoted as TIP100) are used to validate the efficiency of the proposed method. Meanwhile, 126 images are picked from the testing dataset of CUFED5 [38]. To simplify testing, we resize the 226 testing images to 528*400; the resized images are available online. To augment the training data, we randomly flip each image horizontally and vertically when training all the models. The proposed networks are trained using the Adam optimizer with $\beta_1$=0.1 and $\beta_2$=0.999. The initial learning rate is set to . The learning rate is decreased in a stepped-descent manner, with an attenuation factor of 0.5 and a decay period of 100. We implemented our deep model learning in the PyTorch framework, and the models are trained using an NVIDIA GeForce RTX 2080 Ti GPU. It takes approximately ten hours to train a single DSAN model.
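The stepped-descent schedule described above can be written as a one-line rule. The sketch below assumes the decay period is counted in epochs (the unit is not stated) and uses a placeholder initial learning rate, since the paper's value is not recoverable here; `stepped_lr` is a hypothetical helper name.

```python
def stepped_lr(initial_lr, epoch, decay_period=100, decay_factor=0.5):
    """Learning rate after `epoch` epochs of stepped decay."""
    return initial_lr * decay_factor ** (epoch // decay_period)


INITIAL_LR = 1e-4  # placeholder value, not taken from the paper

for epoch in (0, 99, 100, 250, 400):
    print(epoch, stepped_lr(INITIAL_LR, epoch))
```

In PyTorch, the same behavior is provided by `torch.optim.lr_scheduler.StepLR` with `step_size=100` and `gamma=0.5`, called once per epoch.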

VII-B Ablation Study

To demonstrate the rationality of the network structure, we give a series of performance comparisons of our double-state aggregation neural network when some components are discarded or replaced with the popular ResNet. Here, the specific effects (A/B) of the L0 gradient minimization (denoted as L0GM) are learned, with the smoothness-controlling parameter of set as . At the same time, we let of the L0GM method be 2, which is a recommended value that controls the boundary sharpness of natural images. To clearly observe and compare the differences between our DSAN and its variants, TABLE I and TABLE II are provided. Note that the models of our DSAN and its variants retain similar numbers of network parameters.

The objective quality comparison of each model on the TIP2019 dataset is given in Fig. 7, from which it can be found that, without the restriction of the SSIM loss during training, both the SSIM(D) and the PSNR of the entire DSAN model drop by up to 0.69dB/0.43dB and 1.3/0.78, compared to DSAN2, when using the L0GM method to generate label images with . The performance of DSAN’s variants such as DSAN1(B), DSAN2(B) and DSAN3(B) degrades when components such as the MSAC, Res-DSA, and DSI aggregation are removed from the network structure of the proposed DSAN. From the objective comparisons between DSAN5 and DSAN (or between DSAN6 and DSAN), it is apparent that the performance of the proposed method can be greatly improved by the learning strategy described in Algorithm 1.

To-be-learned Models L0GM(0.00431) L0GM(0.02)
Dataset Method/Measurement PSNR SSIM(D) PSNR SSIM(D)
DEAF [9] 31.25 14.40
PR2019 [14] 32.62 18.63
PAMI2019 [28] 38.26 17.77 35.66 16.36
TIP100 VDCNN(TIP2019) [12] 37.62 15.62 31.65 11.59
ResNet(TIP2019) [12] 41.17 21.19 35.05 17.42
DnCNN(19,256) [39] 41.05 20.56 33.91 15.44
DSAN(w/o SSIM loss) 41.94 21.55 36.07 18.21
DSAN 42.37 22.08 36.29 18.63
DEAF [9] 31.22 13.85
PR2019 [14] 31.60 14.99
PAMI2019 [28] 39.13 17.45 35.93 16.20
CUFED5 VDCNN(TIP2019) [12] 37.25 13.68 31.34 10.42
ResNet(TIP2019) [12] 41.39 19.79 35.00 16.29
DnCNN(19,256) [39] 40.99 18.54 33.61 13.95
DSAN(w/o SSIM loss) 42.22 20.41 36.16 17.06
DSAN 42.67 20.60 36.39 17.21
TABLE III: The objective quality comparison of image smoothing results predicted by different learned operators (COLOR has the best performance, COLOR is the second one, and COLOR is the third one).

VII-C Quality Comparison

To validate the efficiency of the proposed DSAN model for image smoothing, we compare our entire DSAN model and DSAN(w/o SSIM loss) with several existing state-of-the-art approaches, namely DEAF [9], PR2019 [14], PAMI2019 [28], VDCNN(TIP2019) [12], ResNet(TIP2019) [12] and DnCNN(19,256) [39], in terms of PSNR and SSIM(D), as displayed in TABLE III. In this table, DSAN(w/o SSIM loss) is trained without the SSIM loss, in contrast to DSAN.

Fig. 8: The visual comparison of image smoothing results when using different network architectures.

TABLE III shows that our DSAN consistently achieves the highest objective quality measurements for image smoothing, while DSAN(w/o SSIM loss) ranks second among all compared approaches. In other words, these objective results reflect, to a certain extent, that the design of the DSAN network architecture is rational and that the proposed algorithm is effective in training. Meanwhile, ResNet(TIP2019) [12] in most cases performs better than DEAF [9], PR2019 [14], PAMI2019 [28], VDCNN(TIP2019) [12], and DnCNN(19,256) [39], since ResNet(TIP2019) [12] uses residual convolution (Res-conv) with skip connections, which eases gradient backpropagation and thus helps the network converge quickly. Similar to ResNet(TIP2019), VDCNN(TIP2019) [12] and DnCNN(19,256) [39] use residual reconstruction in a skip-connection manner, predicting only the residual information rather than predicting the output directly with the cascaded convolutional network. However, they have less capacity than ResNet(TIP2019), which uses Res-conv as its fundamental building block, since VDCNN(TIP2019) [12] and DnCNN(19,256) [39] use the skip connection only once. The network of PAMI2019 [28] does not use residual reconstruction; it is a cascaded neural network composed of three standard convolutional layers and seven Res-conv blocks that are followed by three standard convolutional layers. Its predicted images are less similar to the ground-truth images than those of ResNet(TIP2019), although PAMI2019 [28] achieves higher performance than VDCNN(TIP2019) [12] in terms of PSNR and SSIM(D). Among these methods, VDCNN(TIP2019) [12] has the worst performance on SSIM(D) measurements; that is, the other methods better preserve edge structures. In addition, all of the other methods achieve higher PSNR than DEAF [9].

Fig. 9: The visualization of smoothed images (in the first row) predicted with models generated by our CEI tool and the residuals (in the second row) between these images and the input image, when only a set of specific-effect label images generated from [12] is given for training.

To compare these methods more clearly, we also give a visual comparison in Fig. 8. From this figure, it can be seen that the image smoothing results of our DSAN(w/o SSIM loss) and our DSAN are the most similar to the ground truth among all methods, while DSAN preserves image structural contours better than DSAN(w/o SSIM loss). At the same time, the results predicted by DnCNN(19,256) [39] and VDCNN(TIP2019) [12] are less smooth than those of all the other comparative approaches, since the architectures of DnCNN(19,256) [39] and VDCNN(TIP2019) [12] directly cascade convolutional layers one by one. Among these methods, DEAF [9] is a special approach: it first uses a convolutional neural network to estimate smoothed image gradients, and these gradients are then used to remove image textures by a traditional optimization technique. Consequently, the results of DEAF [9] are affected by both the estimated gradients and the corresponding optimization technique. Fig. 8 shows that the smoothed image predicted by DEAF [9] has some false-boundary artifacts around the image’s outermost regions, which do not appear in the images derived from the other methods.

As discussed above, F. Zhu et al. [12] provided a benchmark for learning both VDCNN and ResNet models with subjectively perceptive weighted loss functions for edge-preserving image smoothing, in which seven classical image smoothing algorithms are used to produce the ground-truth label images. Although the trained VDCNN and ResNet models from [12] obtain satisfactory results, they can only produce fixed-effect smoothed images and cannot produce a range of related results to meet the requirements of different users. To generate a series of smoothed images, we use the images predicted by the ResNet model of [12] as our labels to train the proposed DSAN network. The visual results of the continuous models generated by our model generation tool are shown in Fig. 9, from which it can be seen that the proposed framework produces a series of continuous models by simultaneously extrapolating and interpolating neural networks for image smoothing. Compared with DNI [30], the advantage of the proposed model generation tool lies in its capability to create a group of new images with continuous effects when only the image dataset and its corresponding label images with a specific effect are given. In this case, the original DNI fails to form multiple continuous models. It is also noteworthy that our model generation tool can produce more models than DNI because it concurrently extrapolates and interpolates networks, whereas DNI can only obtain intermediate-effect images, which restricts its practical application.

VIII Conclusion

In this paper, we propose a powerful model generation tool by generalizing continuous network interpolation. Meanwhile, a simple yet effective model generation strategy is given to form a sequence of models when only a set of specific-effect label images is provided. In addition, we present a double-state aggregation (DSA) module to learn image smoothing operators, which can be easily inserted into most current network architectures. Based on this module, we propose a double-state aggregation neural network structure with large expression capacity to learn image smoothing operators. To validate the rationality of our network design, we conduct many experiments and provide the experimental results for our entire DSAN model and its series of variant models. Numerous objective and visual experimental results show that the proposed method outperforms several state-of-the-art methods in terms of PSNR and SSIM. Note that the proposed method is not restricted to texture removal and has wide practical applications, including image style transfer and image-to-image translation.


  • [1] Y. Aksoy, T. H. Oh, S. Paris, M. Pollefeys, and W. Matusik, “Semantic soft segmentation,” ACM Transactions on Graphics, vol. 37, no. 4, p. 72, 2018.
  • [2] D. Acuna, A. Kar, and S. Fidler, “Devil is in the edges: Learning semantic boundaries from noisy annotations,” in IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, Jun. 2019.
  • [3] J. He, S. Zhang, M. Yang, Y. Shan, and T. Huang, “Bi-Directional Cascade Network for Perceptual Edge Detection,” in IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun. 2019.
  • [4] L. C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille, “Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform,” in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun. 2016.
  • [5] L. Xu, C. Lu, Y. Xu, and J. Jia, “Image smoothing via L0 gradient minimization,” ACM Transactions on Graphics, vol. 30, no. 6, pp. 1–12, 2011.
  • [6] L. Xu, Q. Yan, Y. Xia, and J. Jia, “Structure extraction from texture via relative total variation,” ACM Transactions on Graphics, vol. 31, no. 6, pp. 1–10, 2012.
  • [7] Q. Zhang, X. Shen, L. Xu, and J. Jia, “Rolling guidance filter,” in European conference on computer vision, Zurich, Sep. 2014.
  • [8] L. Zhao, H. Bai, J. Liang, A. Wang, B. Zeng, and Y. Zhao, “Local activity-driven structural-preserving filtering for noise removal and image smoothing,” Signal Processing, vol. 157, pp. 62–72, 2019.
  • [9] L. Xu, J. Ren, Q. Yan, R. Liao, and J. Jia, “Deep edge-aware filters,” in International Conference on Machine Learning, Lille, Jun. 2015.
  • [10] S. Liu, J. Pan, and M. H. Yang, “Learning recursive filters for low-level vision via a hybrid neural network,” in European conference on computer vision, Amsterdam, Oct. 2016.
  • [11] Y. Li, J. B. Huang, N. Ahuja, and M. H. Yang, “Joint image filtering with deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1909–1923, 2019.
  • [12] F. Zhu, Z. Liang, X. Jia, L. Zhang, and Y. Yu, “A Benchmark for Edge-Preserving Image Smoothing,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3556–3570, 2019.
  • [13] J. Pan, J. Dong, J. S. Ren, L. Lin, J. Tang, and M. H. Yang, “Spatially Variant Linear Representation Models for Joint Filtering,” in IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, Jun. 2019.
  • [14] L. Zhao, H. Bai, J. Liang, B. Zeng, A. Wang, and Y. Zhao, “Simultaneous color-depth super-resolution with conditional generative adversarial networks,” Pattern Recognition, vol. 88, pp. 356–369, 2019.
  • [15] X. Cheng, M. Zeng, and X. Liu, “Feature-preserving filtering with L0 gradient minimization,” Computers & Graphics, vol. 38, pp. 150–157, 2014.
  • [16] Y. Kim, B. Ham, M. N. Do, and K. Sohn, “Structure-Texture Image Decomposition Using Deep Variational Priors,” IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 2692–2704, 2018.
  • [17] X. Guo, S. Li, L. Li, and J. Zhang, “Structure-Texture Decomposition via Joint Structure Discovery and Texture Smoothing,” in IEEE International Conference on Multimedia and Expo, San Diego, Jul. 2018.
  • [18] B. Ham, M. Cho, and J. Ponce, “Robust guided image filtering using nonconvex potentials,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 1, pp. 192–207, 2018.
  • [19] X. Guo, Y. Li, J. Ma, and H. Ling, “Mutually guided image filtering,” IEEE transactions on pattern analysis and machine intelligence, vol. PP, no. 99, pp. 1–1, 2018.
  • [20] L. Li, X. Guo, W. Feng, and J. Zhang, “Soft clustering guided image smoothing,” in IEEE International Conference on Multimedia and Expo, San Diego, Jul. 2018.
  • [21] W. Wang, R. Guo, Y. Tian, and W. Yang, “CFSNet: Toward a Controllable Feature Space for Image Restoration,” in arXiv:1904.00634, 2019.
  • [22] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in International Conference on Learning Representations, San Juan, May 2016.
  • [23] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, Feb. 2017.
  • [24] Z. Wu, C. Shen, and A. van den Hengel, “Wider or deeper: Revisiting the resnet model for visual recognition,” Pattern Recognition, vol. 90, pp. 119–133, 2019.
  • [25] K. Lu, S. You, and N. Barnes, “Deep Texture and Structure Aware Filtering Network for Image Smoothing,” in European conference on computer vision, Munich, Sep. 2018.
  • [26] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger, “Deep feature interpolation for image content changes,” in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul. 2017.
  • [27] H. Su, V. Jampani, D. Sun, O. Gallo, and J. Kautz, “Pixel-Adaptive Convolutional Neural Networks,” in IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun. 2019.
  • [28] Q. Fan, D. Chen, L. Yuan, G. Hua, N. Yu, and B. Chen, “A General Decoupled Learning Framework for Parameterized Image Operators,” IEEE transactions on pattern analysis and machine intelligence, vol. PP, no. 99, pp. 1–1, 2019.
  • [29] S. W. Kim, S. J. Cho, K. H. Uhm, S. W. Ji, S. W. Lee, and S. J. Ko, “Evaluating Parameterization Methods for Convolutional Neural Network (CNN)-Based Image Operators,” in IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun. 2019.
  • [30] X. Wang, K. Yu, C. Dong, X. Tang, and C. C. Loy, “Deep Network Interpolation for Continuous Imagery Effect Transition,” in IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun. 2019.
  • [31] J. He, C. Dong, and Y. Qiao, “Modulating Image Restoration with Continual Levels via Adaptive Feature Modification Layers,” in IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun. 2019.
  • [32] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Neural Information Processing Systems, Montréal, Dec. 2015.
  • [33] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Multiple description convolutional neural networks for image compression,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2494–2508, 2019.
  • [34] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
  • [35] K. Yanai and Y. Kawano, “Food image recognition using deep convolutional network with pre-training and fine-tuning,” in IEEE International Conference on Multimedia and Expo Workshops, Long Beach, Jun. 2015.
  • [36] Y. Wang, H. Bai, L. Zhao, and Y. Zhao, “Cascaded reconstruction network for compressive image sensing,” EURASIP Journal on Image and Video Processing, vol. 1, no. 77, pp. 1–16, 2018.
  • [37] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in International Conference on Learning Representations, Vancouver, Apr. 2018.
  • [38] Z. Zhang, Z. Wang, Z. Lin, and H. Qi, “Image Super-Resolution by Neural Texture Transfer,” in IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun. 2019.
  • [39] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.