Wide Activation for Efficient and Accurate Image Super-Resolution

08/27/2018 ∙ by Jiahui Yu, et al. ∙ Adobe ∙ University of Illinois at Urbana-Champaign ∙ ByteDance Inc. ∙ Snap Inc. ∙ Stevens Institute of Technology

In this report we demonstrate that, with the same parameter and computational budgets, models with wider features before ReLU activation perform significantly better for single image super-resolution (SISR). The resulting SR residual network has a slim identity mapping pathway with wider (2× to 4×) channels before activation in each residual block. To widen activation further (6× to 9×) without computational overhead, we introduce linear low-rank convolution into SR networks and achieve even better accuracy-efficiency tradeoffs. In addition, we find that training with weight normalization leads to better accuracy for deep super-resolution networks than batch normalization or no normalization. Our proposed SR network WDSR achieves better results on the large-scale DIV2K image super-resolution benchmark in terms of PSNR with the same or lower computational complexity. Based on WDSR, our method also won first place in the NTIRE 2018 Challenge on Single Image Super-Resolution in all three realistic tracks. Experiments and ablation studies support the importance of wide activation for image super-resolution. Code is released at: https://github.com/JiahuiYu/wdsr_ntire2018


1 Introduction

Deep convolutional neural networks (CNNs) have been successfully applied to the task of single image super-resolution (SISR) [kim2016accurate, lim2017enhanced, liu2016robust, 2018arXiv180208797Z]. SISR aims to recover a high-resolution (HR) image from its low-resolution (LR) counterpart (typically a bicubic downsampled version of the HR image). It has many applications in security, surveillance, satellite and medical imaging [peled2001superresolution, thornton2006sub] and can serve as a built-in module for other image restoration or recognition tasks [fan2018wide, liu2017robust, wang2016studying, yu2018free, yu2018generative].

Previous image super-resolution networks, including SRCNN [dong2014learning], FSRCNN [dong2016accelerating] and ESPCN [shi2016real], used relatively shallow convolutional neural networks (with depths from 3 to 5). They are inferior in accuracy to later deep SR networks (e.g., VDSR [kim2016accurate], SRResNet [ledig2016photo] and EDSR [lim2017enhanced]). Increasing depth benefits representation power [cohen2016expressive, eldan2016power, liang2016deep, scarselli1998universal] but at the same time under-uses the feature information from shallow layers (which usually represent low-level features). To address this issue, methods including SRDenseNet [tong2017image], RDN [2018arXiv180208797Z] and MemNet [tai2017memnet] introduce various skip connections and concatenation operations between shallow and deep layers, forming holistic structures for image super-resolution.

In this work we address this issue from a different perspective. Instead of adding various shortcut connections, we conjecture that the non-linear ReLUs impede information flow from shallow layers to deeper ones [sandler2018inverted]. Starting from a residual SR network, we demonstrate that, without additional parameters or computation, simply expanding features before ReLU activation leads to significant improvements for single image super-resolution, beating SR networks with complicated skip connections and concatenations, including SRDenseNet [tong2017image] and MemNet [tai2017memnet]. The intuition behind our work is that expanding features before ReLU allows more information to pass through while still keeping the high non-linearity of deep neural networks. Thus low-level SR features from shallow layers may propagate more easily to the final layer for better dense pixel value predictions.

Figure 1: Left: vanilla residual block. Middle: WDSR-A, a residual block with wide activation. Right: WDSR-B, a residual block with wider activation and linear low-rank convolution. We demonstrate different residual building blocks for image super-resolution networks. Compared with the vanilla residual blocks used in EDSR [lim2017enhanced], we introduce WDSR-A, which has a slim identity mapping pathway with wider (2× to 4×) channels before activation in each residual block. We further introduce WDSR-B, which uses a linear low-rank convolution stack and widens activation even more (6× to 9×) without computational overhead. In WDSR-A and WDSR-B, all ReLU activation layers are applied only between two wide features (features with larger channel numbers).

The central idea of wide activation leads us to explore efficient ways to expand features before ReLU, since simply adding more parameters is inefficient for real-time image SR scenarios [goto2014super]. We first introduce the SR residual network WDSR-A, which has a slim identity mapping pathway with wider (2× to 4×) channels before activation in each residual block. However, when the expansion ratio is above 4, the channels of the identity mapping pathway have to be slimmed further, and we find that this dramatically deteriorates accuracy. Thus, as the second step, we keep the channel numbers of the identity mapping pathway constant and explore more efficient ways to expand features. We first consider group convolution [xie2017aggregated] and depthwise separable convolution [chollet2016xception]. However, we find that both perform unsatisfactorily for the task of image super-resolution. To this end, we propose linear low-rank convolution, which factorizes a large convolution kernel into two low-rank convolution kernels. With wider activation and linear low-rank convolutions, we construct our SR network WDSR-B. It has even wider activation (6× to 9×) without additional parameters or computation, and further boosts accuracy for image super-resolution. WDSR-A and WDSR-B are illustrated in Figure 1. Experiments show that networks with wider activation consistently beat their baselines under different parameter budgets.

Additionally, we find that training with weight normalization [salimans2016weight] leads to better accuracy for deep super-resolution networks than batch normalization [ioffe2015batch] or no normalization. Previous works including EDSR [lim2017enhanced], BTSRN [fan2017balanced] and RDN [2018arXiv180208797Z] found that batch normalization [ioffe2015batch] deteriorates the accuracy of image super-resolution, which is also confirmed in our experiments. We provide three intuitions and related experiments showing that batch normalization is not suitable for training SR networks, due to 1) mini-batch dependency, 2) different formulations in training and inference and 3) strong regularization side-effects. However, with the increasing depth of neural networks for SR (e.g. MDSR [lim2017enhanced] has depth around 180), networks without batch normalization become difficult to train. To this end, we introduce weight normalization for training deep SR networks. Weight normalization enables us to train SR networks with an order of magnitude higher learning rate, leading to both faster convergence and better performance.

In summary, our contributions are as follows. 1) We demonstrate that in residual networks for SISR, wider activation performs better at the same parameter complexity. Without additional computational overhead, we propose the network WDSR-A, which has wider (2× to 4×) activation for better performance. 2) To further improve efficiency, we also propose linear low-rank convolution as the basic building block of our SR network WDSR-B. It enables even wider activation (6× to 9×) without additional parameters or computation, and further boosts accuracy. 3) We suggest that batch normalization [ioffe2015batch] is not suitable for training deep SR networks, and introduce weight normalization [salimans2016weight] for faster convergence and better accuracy. 4) We train the proposed WDSR-A and WDSR-B, built on the principle of wide activation, with weight normalization, and achieve better results on the large-scale DIV2K image super-resolution benchmark. Our method also won first place in the NTIRE 2018 Challenge on Single Image Super-Resolution in all three realistic tracks.

2 Related Work

2.1 Super-Resolution Networks

Deep learning-based methods for single image super-resolution significantly outperform conventional ones [park2003super, yang2010image] in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). SRCNN [dong2014learning] was the first work utilizing an end-to-end convolutional neural network as a mapping function from LR images to their HR counterparts. Since then, various convolutional neural network architectures were proposed for improving the accuracy and efficiency. In this section, we review these approaches under several groups.

Upsampling layers. Super-resolution involves an upsampling operation on image resolution. The first super-resolution network, SRCNN [dong2014learning], applied convolution layers to the pre-upscaled LR image. This is inefficient because all convolutional layers have to compute on the high-resolution feature space, requiring $S^2$ times the computation of the low-resolution space, where $S$ is the upscaling factor. To accelerate processing without loss of accuracy, FSRCNN [dong2016accelerating] used a parametric deconvolution layer at the end of the SR network, making all convolution layers compute on the LR feature space. Another, non-parametric, efficient alternative is pixel shuffling [shi2016real] (a.k.a. sub-pixel convolution). Pixel shuffling is also believed to introduce fewer checkerboard artifacts [odena2016deconvolution] than the deconvolutional layer.
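To make the pixel shuffling operation concrete, here is a minimal pure-Python sketch of the rearrangement. It assumes the channel ordering convention used by PyTorch's `PixelShuffle` (an assumption of this sketch, not something stated in the text), and the function name `pixel_shuffle` is illustrative.

```python
def pixel_shuffle(features, scale):
    """Rearrange a [C*scale*scale][H][W] nested list into [C][H*scale][W*scale].

    Assumed convention: input channel c*scale*scale + i*scale + j supplies the
    output pixel at offset (i, j) inside each scale x scale block of channel c.
    """
    channels_in = len(features)
    if channels_in % (scale * scale) != 0:
        raise ValueError("channel count must be divisible by scale**2")
    channels_out = channels_in // (scale * scale)
    height, width = len(features[0]), len(features[0][0])
    out = [[[0] * (width * scale) for _ in range(height * scale)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(scale):
            for j in range(scale):
                src = features[c * scale * scale + i * scale + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * scale + i][w * scale + j] = src[h][w]
    return out


# Four LR channels of size 1x1 become one 2x2 HR patch at scale 2.
lr_features = [[[1]], [[2]], [[3]], [[4]]]
hr_patch = pixel_shuffle(lr_features, 2)
print(hr_patch)  # [[[1, 2], [3, 4]]]
```

Because the rearrangement has no weights, every convolution before it runs at LR resolution, which is where the computational saving comes from.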

Very deep and recursive neural networks. The depth of neural networks is of central importance for deep learning [he2016deep, simonyan2014very, szegedy2017inception]. This has also been confirmed experimentally for the single image super-resolution task [fan2017balanced, kim2016accurate, ledig2016photo, lim2017enhanced, tai2017memnet, tong2017image, 2018arXiv180208797Z]. These very deep networks (usually more than 10 layers) stack many small-kernel (i.e., $3 \times 3$) convolutions and have higher accuracy than shallow ones [dong2016accelerating, shi2016real]. However, the increasing depth of convolutional neural networks introduces over-parameterization and makes training difficult. To address these issues, recursive neural networks [kim2016deeply, tai2017image], which re-use weights repeatedly, have been proposed.

Skip connections. On one hand, deeper neural networks perform better in various tasks [simonyan2014very]; on the other hand, low-level features are also important for the image super-resolution task [2018arXiv180208797Z]. To address this contradiction, VDSR [kim2016accurate] proposed a very deep VGG-like [simonyan2014very] network with a global residual connection (i.e. an identity skip connection) for SISR. SRResNet [ledig2016photo] proposed a ResNet-like [he2016deep] network. Densely connected networks [huang2017densely] are also adapted for SISR in SRDenseNet [tong2017image]. MemNet [tai2017memnet] integrated skip connections and recursive units for low-level image restoration tasks. To further exploit the hierarchical features from all the convolutional layers, residual dense networks (RDN) [2018arXiv180208797Z] were proposed. All these works benefit from additional skip connections between different levels of features in deep neural networks.

Normalization layers. As image super-resolution networks go deeper and deeper (from 3-layer SRCNN [dong2014learning] to 160-layer MDSR [lim2017enhanced]), training becomes more difficult. Batch normalization layers are one of the cures for this problem in many tasks [he2016deep, szegedy2017inception]. They were also introduced into SISR networks in SRResNet [ledig2016photo]. However, it has been found empirically that batch normalization [ioffe2015batch] hinders the accuracy of image super-resolution. Thus, recent image SR networks [fan2017balanced, lim2017enhanced, 2018arXiv180208797Z] abandon batch normalization.

2.2 Parameter-Efficient Convolutions

In this subsection, we also review several related methods proposed for improving efficiency of convolutions.

Flattened convolution. Flattened convolutions [jin2014flattened] consist of a consecutive sequence of one-dimensional filters across all directions in 3D space (lateral, vertical and horizontal) to approximate conventional convolutions. The number of parameters in a flattened convolution decreases from $XYC$ to $X + Y + C$, where $C$ is the number of input planes, and $X$ and $Y$ denote the filter width and height.

Group convolution. Group convolutions [xie2017aggregated] divide features into groups channel-wise and perform convolutions inside each group individually, followed by a concatenation to form the final output. In group convolutions, the number of parameters can be reduced by $g$ times, where $g$ is the number of groups. Group convolutions are key components of many efficient models (e.g. ResNeXt [xie2017aggregated]).

Depthwise separable convolution. Depthwise separable convolution is a stack of a depthwise convolution (i.e. a spatial convolution performed independently over each channel of an input) followed by a pointwise convolution (i.e. a 1x1 convolution) without non-linearities. It can also be viewed as a specific type of group convolution where the number of groups is the number of channels. The depthwise separable convolution forms the basic architecture of many efficient models including Xception [chollet2016xception], MobileNet [howard2017mobilenets] and MobileNetV2 [2018arXiv180104381S].
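The parameter savings of these convolution variants can be checked with simple counts. The sketch below ignores biases and uses illustrative function names; the channel numbers in the example are arbitrary choices for this sketch, not values from the experiments.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (biases ignored)."""
    return c_in * c_out * k * k


def group_conv_params(c_in, c_out, k, groups):
    """Weights of a group convolution: each group is a smaller standard conv."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by the number of groups")
    return groups * conv_params(c_in // groups, c_out // groups, k)


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) plus 1x1 pointwise conv."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = conv_params(64, 64, 3)
grouped = group_conv_params(64, 64, 3, groups=4)
separable = depthwise_separable_params(64, 64, 3)
print(standard, grouped, separable)  # 36864 9216 4672
```

Setting `groups` equal to the channel count in `group_conv_params` recovers the depthwise part of the separable convolution, matching the view of depthwise convolution as a special case of group convolution.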

Inverted residuals. Another work [2018arXiv180104381S] expands features before activation for image recognition tasks (named inverted residuals). The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. The inverted residual shares similar merits with our proposed wide activation. However, we found that the inverted residual proposed in [2018arXiv180104381S] performs unsatisfactorily on the task of image SR. In this work we mainly explore different network architectures to improve the accuracy and efficiency of image super-resolution with the central idea of wide activation.

3 Proposed Methods

3.1 Wide Activation: WDSR-A

In this part, we describe how we expand features before the ReLU activation layer without computational overhead. We consider the effects of wide activation inside a residual block. A naive way is to directly increase the channel numbers of all features. However, this proves nothing except that more parameters lead to better performance. Thus, in this section, we design our SR network to study the importance of wide features before activation under the same parameter and computational budgets. Our first step towards wide activation is extremely simple: we slim the features of the residual identity mapping pathway while expanding the features before activation, as shown in Figure 1.

Two-layer residual blocks are specifically studied, following the baseline EDSR [lim2017enhanced]. Assume the width of the identity mapping pathway (Fig. 2) is $w_1$ and the width before activation inside the residual block is $w_2$. We introduce the expansion factor before activation as $r$, thus $w_2 = r \times w_1$. In vanilla residual networks (e.g., those used in EDSR and MDSR) we have $w_2 = w_1$, and the number of parameters is $2 \times w_1^2 \times k^2$ in each residual block, where $k$ is the kernel size. The computational complexity (Mult-Add operations) is a constant scaling of the parameter count when we fix the input patch size. To have the same complexity with new widths $\hat{w}_1$ and $\hat{w}_2 = r \times \hat{w}_1$, we need $w_1^2 = \hat{w}_1 \times \hat{w}_2 = r \times \hat{w}_1^2$, so the residual identity mapping pathway needs to be slimmed by a factor of $\sqrt{r}$ while the activation is expanded by a factor of $\sqrt{r}$.
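The budget-preserving trade described above can be sketched numerically. Writing `w1` for the identity pathway width, `r` for the expansion factor and `k` for the kernel size (symbol names assumed for this sketch), slimming the identity pathway by a factor of sqrt(r) while widening the activation by sqrt(r) leaves the per-block parameter count unchanged. Widths are kept as floats here; a real network would round them to integers.

```python
import math


def vanilla_block_params(w1, k=3):
    """Two k x k convs, both w1 -> w1 (biases ignored)."""
    return 2 * w1 * w1 * k * k


def wide_block_params(w1_hat, w2_hat, k=3):
    """Two k x k convs: w1_hat -> w2_hat, ReLU, then w2_hat -> w1_hat."""
    return 2 * w1_hat * w2_hat * k * k


def wdsr_a_widths(w1, r):
    """Slim the identity pathway by sqrt(r); the activation is then r times wider than it."""
    w1_hat = w1 / math.sqrt(r)
    w2_hat = r * w1_hat
    return w1_hat, w2_hat


w1_hat, w2_hat = wdsr_a_widths(64, 4)
print(w1_hat, w2_hat)  # 32.0 128.0
print(vanilla_block_params(64), wide_block_params(w1_hat, w2_hat))
```

For `w1 = 64` and `r = 4`, the identity pathway shrinks to 32 channels and the activation grows to 128 channels, with the same parameter count as the vanilla 64-channel block.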

This simple idea forms our first widely-activated SR network, WDSR-A. Experiments show that WDSR-A is extremely effective for improving the accuracy of SISR when $r$ is between 2 and 4. However, for $r$ larger than this threshold the performance drops quickly. This is likely because the identity mapping pathway becomes too slim. For example, in our baseline EDSR (16 residual blocks with 64 filters) for $\times 3$ super-resolution, when $r$ is beyond 6, $\hat{w}_1$ will be even smaller than the final HR image representation space $S^2 \times 3$ (we use pixel shuffle as the upsampling layer), where $S$ is the scaling factor and 3 represents RGB. Thus we seek parameter-efficient convolutions to further improve accuracy and efficiency with wider activation.
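The point at which the identity pathway becomes narrower than the HR representation can be located with a quick calculation. This sketch assumes a baseline width of 64, a scale of 3 and the budget-preserving rule that the identity width is the baseline width divided by sqrt(r); the function names are illustrative.

```python
import math


def identity_width(w1, r):
    """Identity pathway width after slimming by sqrt(r) at a fixed parameter budget."""
    return w1 / math.sqrt(r)


def hr_representation_size(scale, colors=3):
    """Channels needed before pixel shuffle to produce an RGB image at the given scale."""
    return scale * scale * colors


target = hr_representation_size(3)
for r in range(2, 10):
    width = identity_width(64, r)
    status = "narrower than" if width < target else "at least"
    print(f"r={r}: identity width {width:.1f} is {status} {target}")
```

Under these assumptions the identity width falls below 27 channels once `r` reaches 6.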

3.2 Efficient Wider Activation: WDSR-B

To address the above limitation, we keep the channel numbers of the identity mapping pathway constant and explore more efficient ways to expand features. Specifically, we consider $1 \times 1$ convolutions. $1 \times 1$ convolutions are widely used for channel number expansion or reduction in ResNets [he2016deep], ResNeXts [xie2017aggregated] and MobileNetV2 [2018arXiv180104381S]. In WDSR-B (Fig. 1) we first expand the channel numbers using a $1 \times 1$ convolution and then apply the non-linearity (ReLU) after this convolution layer. We further propose an efficient linear low-rank convolution, which factorizes a large convolution kernel into two low-rank convolution kernels. It is a stack of one $1 \times 1$ convolution to reduce the number of channels and one $3 \times 3$ convolution to perform spatial feature extraction. We find that adding ReLU activation inside the linear low-rank convolution significantly reduces accuracy, which also supports the wide activation hypothesis.
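A rough parameter count shows why this block can afford a much wider activation. The sketch below assumes the block is a 1x1 expansion, a ReLU, a 1x1 reduction to a low-rank width and a 3x3 convolution back to the identity width. The low-rank width of 51 channels is a hypothetical value chosen for illustration, not one given in the text, and biases are ignored.

```python
def vanilla_block_params(w1, k=3):
    """Two k x k convs, both w1 -> w1."""
    return 2 * w1 * w1 * k * k


def wdsr_b_block_params(w1, r, low_rank_width, k=3):
    """Parameter count of the assumed WDSR-B style block."""
    expand = w1 * (r * w1)                 # 1x1 conv: w1 -> r*w1, followed by ReLU
    reduce = (r * w1) * low_rank_width     # 1x1 conv: r*w1 -> low_rank_width, no ReLU
    spatial = low_rank_width * w1 * k * k  # k x k conv: low_rank_width -> w1, no ReLU
    return expand + reduce + spatial


baseline = vanilla_block_params(64)
wide = wdsr_b_block_params(64, r=6, low_rank_width=51)
print(baseline, wide)  # 73728 73536
```

With these illustrative numbers, a 6× activation fits inside the budget of the vanilla block while the identity pathway stays at 64 channels. Since no non-linearity sits between the 1x1 reduction and the 3x3 convolution, the two together remain a single linear map of limited rank.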

3.3 Weight Normalization vs. Batch Normalization

In this part, we analyze the different purposes and effects of batch normalization (BN) [ioffe2015batch] and weight normalization (WN) [salimans2016weight]. We offer three intuitions for why batch normalization is not appropriate for image SR tasks. Then we demonstrate that weight normalization does not share these drawbacks of BN, and that it can be used effectively to ease the training difficulty of deep SR networks.

Batch normalization

BN re-calibrates the mean and variance of intermediate features to solve the problem of internal covariate shift [ioffe2015batch] in training deep neural networks. It has different formulations in training and testing. For simplicity, here we ignore the re-scaling and re-centering learnable parameters of BN. During training, features in each layer are normalized with the mean and variance of the current training mini-batch:

$$\hat{x}_B = \frac{x_B - E_B[x_B]}{\sqrt{\mathrm{Var}_B[x_B] + \epsilon}} \tag{1}$$

where $x_B$ is the features of the current training batch and $\epsilon$ is a small value (e.g. 1e-5) to avoid division by zero. The first-order and second-order statistics are then updated to global statistics in a moving-average way:

$$E[x] \leftarrow E_B[x_B] \tag{2}$$
$$\mathrm{Var}[x] \leftarrow \mathrm{Var}_B[x_B] \tag{3}$$

where $\leftarrow$ means assigning by moving average. During inference, these global statistics are used instead to normalize the features:

$$\hat{x}_{test} = \frac{x_{test} - E[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \tag{4}$$

As the formulations of BN show, it causes the following problems. 1) For image super-resolution, commonly only small image patches and a small mini-batch size (e.g. 16) are used to speed up training [fan2017balanced, kim2016accurate, ledig2016photo, lim2017enhanced, tai2017memnet, tong2017image, 2018arXiv180208797Z]. The mean and variance of small image patches therefore differ a lot among mini-batches, making these statistics unstable, as demonstrated in the experiments section. 2) BN is also believed to act as a regularizer and in some cases can eliminate the need for Dropout [ioffe2015batch]. However, SR networks are rarely observed to overfit on training datasets. Instead, many kinds of regularizers, for example weight decay and dropout, are not adopted in SR networks [fan2017balanced, kim2016accurate, ledig2016photo, lim2017enhanced, tai2017memnet, tong2017image, 2018arXiv180208797Z]. 3) Unlike image classification tasks, where softmax (scale-invariant) is used at the end of the network to make predictions, in image SR the different formulations of training and testing may deteriorate the accuracy of dense pixel value predictions.
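The mini-batch dependency and the training/inference mismatch can be seen in a scalar toy version of BN. This is a minimal sketch that, like the text, ignores the learnable scale and shift; the momentum value of 0.1 and the function names are assumptions made for illustration.

```python
import math

EPS = 1e-5  # small constant to avoid division by zero


def batch_stats(xs):
    """Mean and (biased) variance of one mini-batch of scalar features."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var


def bn_train(xs):
    """Training mode: normalize with the statistics of the current mini-batch."""
    mean, var = batch_stats(xs)
    return [(x - mean) / math.sqrt(var + EPS) for x in xs]


def update_running(running, batch_value, momentum=0.1):
    """Moving-average update of a global statistic (momentum is an assumed value)."""
    return (1 - momentum) * running + momentum * batch_value


def bn_eval(xs, running_mean, running_var):
    """Inference mode: normalize with the accumulated global statistics."""
    return [(x - running_mean) / math.sqrt(running_var + EPS) for x in xs]


# The same feature value 2.0 is normalized differently depending on its batch.
print(bn_train([0.0, 2.0])[1])  # close to +1
print(bn_train([2.0, 4.0])[0])  # close to -1
```

In training mode the output for a given input depends on what else is in the mini-batch, while `bn_eval` applies one fixed affine map, so the two modes generally disagree on the same input.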

Weight normalization

Weight normalization, on the other hand, is a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. It does not introduce dependencies between the examples in a mini-batch, and has the same formulation in training and testing. Assume the output $y$ has the form:

$$y = w \cdot x + b \tag{5}$$

where $w$ is a k-dimensional weight vector, $b$ is a scalar bias term, and $x$ is a k-dimensional vector of input features. WN re-parameterizes the weight vectors in terms of the new parameters using

$$w = \frac{g}{\|v\|} v \tag{6}$$

where $v$ is a k-dimensional vector, $g$ is a scalar, and $\|v\|$ denotes the Euclidean norm of $v$. With this formulation, we have $\|w\| = g$, independent of the parameters $v$. As shown in [salimans2016weight], decoupling length and direction speeds up the convergence of deep neural networks. More importantly, for image SR, it does not introduce the troubles of BN described above, since it is just a reparameterization technique and has exactly the same representation ability.
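The reparameterization can be checked on a small vector. This minimal sketch uses `v` for the direction parameters and `g` for the length parameter, and verifies that the resulting weight norm equals `g` regardless of the scale of `v`; the function names are illustrative.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v||, so that ||w|| equals g."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm for vi in v]


def neuron_output(w, x, b):
    """Linear unit y = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


w = weight_norm([3.0, 4.0], g=2.0)
w_rescaled = weight_norm([30.0, 40.0], g=2.0)  # same direction, different length of v
print(w, w_rescaled)
print(math.sqrt(sum(wi * wi for wi in w)))  # approximately 2.0
```

Nothing in this computation depends on the other examples in a mini-batch, and the same formula is used in training and testing.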

It is also noteworthy that introducing WN allows training with a higher learning rate (i.e. $10\times$), and improves both training and testing accuracy.

3.4 Network Structure

Figure 2: Demonstration of our simplified SR network compared with EDSR [lim2017enhanced].

In this part, we overview the WDSR network architectures. We made two major modifications based on EDSR [lim2017enhanced] super-resolution network.

Global residual pathway. Firstly, we find that the global residual pathway is a linear stack of several convolution layers, which is computationally expensive. We argue that these linear convolutions are redundant (Fig. 2) and can be absorbed into the residual body to some extent. Thus, we slightly modify the network structure and use a single convolution layer that directly takes the LR RGB image/patch as input and outputs the $3S^2$-channel features of its HR counterpart, where $S$ is the scale. This results in fewer parameters and less computation. In our experiments we have not found any accuracy drop with this simpler form.

Upsampling layer. Unlike previous state-of-the-art methods [lim2017enhanced, 2018arXiv180208797Z], where one or more convolutional layers are inserted after upsampling, our proposed WDSR extracts all features in the low-resolution stage (Fig. 2). Empirically, we find that this does not affect the accuracy of SR networks while improving speed by a large margin.

4 Experimental Results

We train our models on the DIV2K dataset [timofte2017ntire] since the dataset is relatively large and contains high-quality (2K resolution) images. The default splits of the DIV2K dataset consist of 800 training images, 100 validation images and 100 testing images. We use the 800 training images for training and 10 validation images for validation during training. The trained models are evaluated on the 100 validation images of the DIV2K dataset (testing images are not publicly available). We mainly measure PSNR in RGB space. The ADAM optimizer [kingma2014adam] is used. The batch size is set to 16. The learning rate is initialized to the maximum convergent value ($10^{-4}$ for models without weight normalization and $10^{-3}$ for models with weight normalization). The learning rate is halved at regular iteration intervals.
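The evaluation metric and the step-wise learning rate decay can be sketched as follows. The peak value of 255 and the halving interval passed in as `halve_every` are assumptions of this sketch, since the text does not fix them here; the function names are illustrative.

```python
import math


def halved_learning_rate(initial_lr, iteration, halve_every):
    """Learning rate after halving once per `halve_every` iterations (hypothetical interval)."""
    return initial_lr * 0.5 ** (iteration // halve_every)


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB over flat sequences of pixel values.

    For an RGB image, pass all three channels flattened into one sequence.
    """
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of values")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


print(halved_learning_rate(1e-3, iteration=2500, halve_every=1000))
print(psnr([10.0, 20.0, 30.0], [11.0, 19.0, 31.0]))
```

With an interval of 1000 iterations, iteration 2500 has seen two halvings, so the rate is one quarter of its initial value.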

We crop RGB patches from each HR image and its bicubic downsampled image as training output-input pairs. Training data is augmented with random horizontal flips and rotations, following common data augmentation methods [fan2017balanced, lim2017enhanced]. During training, the mean RGB values of the DIV2K training images are also subtracted from the input images.
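When augmenting SR training pairs, the same transform has to be applied to the LR patch and its HR target so that they stay aligned. The sketch below illustrates this on single-channel nested lists. The flip probability of 0.5 and the restriction to rotations by multiples of 90 degrees are assumptions, as the text only says "random horizontal flips and rotations".

```python
import random


def hflip(img):
    """Mirror an [H][W] nested list left to right."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate an [H][W] nested list by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment_pair(lr, hr, rng):
    """Apply one randomly chosen flip/rotation identically to the LR and HR patch."""
    if rng.random() < 0.5:  # assumed flip probability
        lr, hr = hflip(lr), hflip(hr)
    for _ in range(rng.randrange(4)):  # 0, 1, 2 or 3 quarter turns
        lr, hr = rot90(lr), rot90(hr)
    return lr, hr


rng = random.Random(0)
lr_patch = [[1, 2], [3, 4]]
hr_patch = [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
aug_lr, aug_hr = augment_pair(lr_patch, hr_patch, rng)
print(aug_lr)
print(aug_hr)
```

Drawing the random choices once per pair, rather than once per image, is what keeps each augmented LR patch a valid downsampled version of its augmented HR patch.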

4.1 Wide and Efficient Wider Activation

In this part, we show results of baseline model EDSR [lim2017enhanced] and our proposed WDSR-A and WDSR-B for the task of image bicubic x2 super-resolution on DIV2K dataset. To ensure fairness, each model is evaluated at different parameters and computational budgets by controlling the number of residual blocks with fixed number of channels. The results are shown in Table 1. We compare each model with its number of residual blocks. The results suggest that our proposed WDSR-A and WDSR-B have better accuracy and efficiency than EDSR [lim2017enhanced]. WDSR-B with wider activation also has better or similar performance compared with WDSR-A, which supports our wide activation hypothesis and demonstrates the effectiveness of our proposed linear low-rank convolution.

Residual Blocks   Network   Parameters   DIV2K (val) PSNR
1                 EDSR      0.26M        33.210
1                 WDSR-A    0.08M        33.323
1                 WDSR-B    0.08M        33.434
3                 EDSR      0.41M        34.043
3                 WDSR-A    0.23M        34.163
3                 WDSR-B    0.23M        34.205
5                 EDSR      0.56M        34.284
5                 WDSR-A    0.37M        34.388
5                 WDSR-B    0.37M        34.409
8                 EDSR      0.78M        34.457
8                 WDSR-A    0.60M        34.541
8                 WDSR-B    0.60M        34.536

Table 1: Model comparisons at different parameter budgets, obtained by controlling the number of residual blocks with a fixed number of channels. We mainly compare the number of parameters and validation PSNR to measure efficiency and accuracy.

4.2 Normalization Layers

Figure 3: Training L1 loss and validation PSNR of same model trained with weight normalization, batch normalization or no normalization.

We also demonstrate the effectiveness of weight normalization for improved training of SR networks. We compare the training and testing accuracy (PSNR) when training the same model with different normalization methods, i.e. weight normalization, batch normalization or no normalization. The results in Figure 3 show that the model trained with weight normalization converges faster and reaches better accuracy. The model trained with batch normalization is unstable during testing, which is likely due to the different formulations of BN in training and testing.

Figure 4: Training L1 loss and validation PSNR of model trained with batch normalization but different learning rates.

To further study whether this is because the learning rate is too large for models trained with batch normalization, we also train the same model with different learning rates. The results are shown in Figure 4. Even with a learning rate at which the training curves are stable, the validation PSNR is still not stable across training.

5 Conclusions

In this report, we introduce two super-resolution networks, WDSR-A and WDSR-B, based on the central idea of wide activation. We demonstrate in our experiments that, with the same parameter and computational complexity, models with wider features before ReLU activation have better accuracy for single image super-resolution. We also find that training with weight normalization leads to better accuracy for deep super-resolution networks than batch normalization or no normalization. The proposed methods may also help other low-level image restoration tasks such as denoising and dehazing.