Can Maxout Units Downsize Restoration Networks? - Single Image Super-Resolution Using Lightweight CNN with Maxout Units

11/07/2017 ∙ by Jae-Seok Choi, et al. ∙ KAIST Department of Mathematical Sciences

Rectified linear units (ReLU) are well known to help achieve faster convergence and thus higher performance in many deep-learning-based applications. However, networks with ReLU tend to perform poorly when the number of filter parameters is constrained to be small. To overcome this, we propose a novel network utilizing maxout units (MU) and show its effectiveness for super-resolution (SR) applications. In classification problems, MU are generally thought to require doubled filter sizes in order to generate feature maps of the same size. In this paper, we first reveal that, in restoration problems, MU can instead allow the filter sizes to be halved, leading to more compact networks. To show this, our SR network is designed with MU but without increasing the filter sizes, and it outperforms state-of-the-art SR methods with a smaller number of filter parameters. To the best of our knowledge, we are the first to incorporate MU into SR applications and to show promising performance results. In MU, the feature maps from a previous convolutional layer are divided into two parts along the channel dimension, which are then compared element-wise, and only their maximum values are passed to the next layer. Along with analyzing some interesting properties of MU, we further investigate other variants of MU and their effects. In addition, while ReLU has trouble learning in networks with a very small number of convolutional filter parameters, MU does not. For SR applications, our MU-based network reconstructs high-resolution images with quality comparable to previous deep-learning-based SR methods, using fewer filter parameters.


1 Introduction

Super-resolution (SR) methods aim to reconstruct high-resolution (HR) image or video contents from their low-resolution (LR) versions. The SR problem is known to be highly ill-posed: without proper prior information, a single LR input can correspond to multiple HR versions [39]. As the role of SR has recently become crucial in various areas such as up-scaling full high-definition (FHD) content to 4K [6], it is important to develop SR algorithms that can generate HR contents of superior visual quality while maintaining reasonable complexity and a moderate number of parameters.

1.1 Related work

SR methods can be divided into two families according to their input types: single image SR (SISR) and video SR. While video SR can exploit both spatial and temporal information, SISR utilizes only the spatial information within a given LR image, making the SR problem more difficult [12, 31]. In this paper, we mainly focus on SISR.

Figure 1: Comparison of PSNR performance versus the number of filter parameters for two toy SR networks with ReLU [29] and maxout units (MU) [15], respectively. The network with MU shows higher performance than the conventional network with ReLU, especially when the number of parameters is small. This makes MU suitable for application platforms with limited resources, such as mobile devices.

Various SR methods have employed the following techniques for reconstructing high-quality HR images: sparse representation [20, 22, 39], linear mappings [6, 8, 36, 41, 42], self-examples [12, 13, 14, 38], and neural networks [7, 10, 11, 21, 25, 26, 31, 34, 35]. Sparse-representation-based SR methods [20, 22, 39] undergo heavy computation to calculate the sparse representation of an LR patch over a pre-trained and complex LR dictionary. The resulting sparse representation is then applied to a corresponding HR dictionary to reconstruct its HR version. Some SR methods [12, 13, 14, 38] extract LR-to-HR mappings by searching for patches similar to the current patch (self-examples) inside a self-dictionary. Linear-mapping-based SR (LMSR) methods [6, 8, 36, 41, 42] have been proposed to obtain HR images of comparable quality but with much lower computational complexity. The adjusted anchored neighborhood regression (A+, APLUS) method [36] searches for the best linear mapping for each LR patch, based on its correlation with dictionary sets pre-trained as in [39]. Choi et al. [6, 8] employ simple edge classification to find suitable linear mappings, which are applied directly to small LR patches to reconstruct their HR versions.

Recently, SR methods using convolutional neural networks (CNN) [7, 10, 11, 21, 25, 26, 31, 34, 35] have shown high PSNR performance. Dong et al. [10] first utilized a 3-layered CNN for SR (SRCNN), and reported a remarkable performance jump compared to previous SR methods. Kim et al. [21] proposed a very deep 20-layered CNN (VDSR) with gradient clipping and residual learning, yielding reconstructed HR images of even higher PSNR than SRCNN. Shi et al. [31] proposed a network structure in which features are extracted in LR space, and the feature maps of the last layer are up-scaled to HR space using a sub-pixel convolution layer. Recursive convolutions were also used in [34] to lower the number of parameters. Ledig et al. [25] presented two SR network structures: a network using residual units to maximize PSNR performance (SRResNet), and a network using generative adversarial networks for perceptual improvement (SRGAN). Lately, some SR methods using very deep networks with large numbers of parameters [7, 26, 35] have been proposed in the NTIRE2017 Challenge [35], achieving state-of-the-art PSNR performance.

In these deep-learning-based SR methods, rectified linear units (ReLU) [29] are used to obtain nonlinearity between two adjacent convolutional layers. ReLU is a simple function: an identity mapping for positive values and zero for negative values. Unlike a sigmoid or hyperbolic tangent, ReLU does not suffer from the vanishing gradient problem. With ReLU, networks can learn piece-wise linear mappings between LR and HR images, which results in mappings of high visual quality and faster training convergence. There are other nonlinear activation functions such as leaky ReLU (LReLU) [27], parametric ReLU [16] and exponential linear units (ELU) [9], but unlike ReLU they are not often used in regression problems. While LReLU replaces the zero part of ReLU with a linear function of a certain small slope, parametric ReLU parameterizes this slope so that a network can learn it. ELU is designed to push mean unit activations closer to zero for faster learning.

1.2 Motivations and contributions

One major reason for the high performance of neural networks in many applications [7, 10, 11, 21, 25, 26, 31, 34, 35] is the use of ReLU [29] and its successors [27]. These nonlinear units were first introduced in classification papers [2, 9, 16, 18, 19, 24, 27, 29] and were subsequently reused for regression problems such as SR. It can easily be noticed that, while the ReLU and LReLU functions have been frequently used in SR, it is hard to find other types of activation functions [9]. This is because the others tend to distort the scales of input values (more in Section 3.3), so networks with these functions generate HR results of lower quality than those with ReLU. The same phenomenon can be observed with normalization layers such as batch normalization [18] and layer normalization [2], and it has been reported that these normalization layers degrade performance when used in regression problems [7, 26].

In this paper, we tackle some limitations of ReLU: i) ReLU produces feature maps with many zeros, whose number is not controllable; ii) as a consequence, learning with ReLU tends to collapse in networks with very deep layers without some help such as identity mappings [17]; and iii) there could be a way to make use of those empty zero values so that the number of channels can be reduced for lower memory consumption and less computation.

Maxout units (MU) [15] are activation units that can overcome the aforementioned limitations. MU were first introduced for various classification problems [5, 15, 33]. Goodfellow et al. [15] proposed MU and used them in conjunction with dropout [32] in a multi-layer perceptron (MLP), showing competitive classification results compared to those obtained with conventional ReLU [29]. In [33], MU were used for speech recognition, and networks with MU were reported to converge about three times faster in training with comparable performance. In addition, Chang et al. [5] reported a network-in-network structure using MU for classification, which was able to mitigate the vanishing gradient problem that can occur when using ReLU. Although networks using MU are known to work well in high-level vision tasks, only a few works [4] have employed MU for regression problems. In this paper, we develop and present a novel SR network incorporating MU. Our contributions are as follows:

  • Contrary to the common belief that the number of filter parameters needs to be doubled when using MU, we first reveal that MU can be incorporated effectively into restoration problems. In our SR network with MU, the number of channels of the input feature maps to each convolution is halved while still producing good results, leading to less memory usage and lower computational cost.

  • We provide an in-depth analysis of networks using the basic MU, and further investigate other MU variants, showing their effectiveness on the SR application.

Various experimental results show that our SR networks incorporating MU as activation functions are able to reconstruct HR images of quality competitive with those using ReLU. Figure 1 shows a comparison of PSNR performance versus the number of parameters for two toy networks with ReLU and MU, respectively. Both networks share the same 6-layered SR structure, differing only in the type of activation function used.

2 Maxout units

First, let us denote the outputs of the l-th convolutional layer as $\mathbf{x}_l$, where the network has $L$ convolutional layers. We also denote the outputs of the activation function applied to $\mathbf{x}_l$ as $\mathbf{y}_l$.

2.1 Conventional nonlinear activation functions

Many SR methods [7, 10, 11, 21, 25, 26, 31, 34, 35] use ReLU [29] as the activation function between every two convolutional layers to obtain a highly nonlinear mapping between LR and HR. After each ReLU, the negative part of the feature maps becomes zero as

$\mathbf{y}_l = \max(\mathbf{x}_l, 0)$    (1)

where $\max(\cdot)$ computes the maximum of its two inputs in an element-wise fashion. The negative parts, where the inputs become zero, provide nonlinearity, while the positive parts allow fast learning since the derivative there is unity. However, very deep or narrow networks may have difficulty learning when too many values fall into the negative range and become zero. While ReLU variants such as LReLU [27] and ELU [9] try to overcome this limitation by modifying the negative part, they still offer little control over the ratio of negative values.
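For reference, here is a minimal NumPy sketch of the three conventional activation functions discussed above (ReLU, LReLU and ELU). It is our own illustration of the standard definitions, not code from the paper, and the slope and alpha values are hypothetical defaults.

```python
import numpy as np

def relu(x):
    # Identity for positive values, zero otherwise (Eq. 1).
    return np.maximum(x, 0.0)

def lrelu(x, slope=0.01):
    # Leaky ReLU: replaces the zero part with a small linear slope.
    return np.where(x > 0.0, x, slope * x)

def elu(x, alpha=1.0):
    # ELU: smooth negative part that saturates at -alpha.
    return np.where(x > 0.0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-2.0, 2.0, 5)
print(relu(x), lrelu(x), elu(x), sep="\n")
```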

Figure 2: Block diagrams for various activation functions used in networks: conventional units (ReLU [29], LReLU [27], ELU [9]), MU [15], MU-D [15], MU-M, MU-S and MU-R.

2.2 Maxout unit

To overcome these limitations, we propose an SR network structure incorporating MU.

2.2.1 Maxout.

MU [15] compute the maximum over a vector of arbitrary length. Here, we use a special case of MU in which the feature maps $\mathbf{x}_l$ are halved along the channel dimension into two parts $\mathbf{x}_l^{(1)}$ and $\mathbf{x}_l^{(2)}$, and the element-wise maximum of these two parts is calculated as

$\mathbf{y}_l = \max(\mathbf{x}_l^{(1)}, \mathbf{x}_l^{(2)})$    (2)
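To make the channel-halving behavior of Eq. (2) concrete, below is a minimal NumPy sketch (our own illustration, not the authors' code) of the MU activation compared with plain ReLU; the tensor layout (N, H, W, C) is an assumption made for the example.

```python
import numpy as np

def relu(x):
    # Standard ReLU: keeps the channel count, zeroes out negatives (Eq. 1).
    return np.maximum(x, 0.0)

def maxout(x):
    # MU as used in this paper (Eq. 2): split the feature maps into two halves
    # along the channel axis and keep the element-wise maximum, so the output
    # has half as many channels as the input.
    assert x.shape[-1] % 2 == 0, "channel count must be even"
    x1, x2 = np.split(x, 2, axis=-1)
    return np.maximum(x1, x2)

# Toy feature maps in (N, H, W, C) layout.
feat = np.random.randn(1, 8, 8, 64).astype(np.float32)
print(relu(feat).shape)    # (1, 8, 8, 64) -> channel count preserved
print(maxout(feat).shape)  # (1, 8, 8, 32) -> channel count halved
```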

2.2.2 Difference of two MU.

In [15], a difference of two MU was also introduced, based on the proposition that any continuous piece-wise linear function can be expressed as the difference of two convex piece-wise linear functions. In this paper, we use the form

$\mathbf{y}_l = \max(\mathbf{x}_l^{(1)}, \mathbf{x}_l^{(2)}) - \max(\mathbf{x}_l^{(3)}, \mathbf{x}_l^{(4)})$    (3)

where $\mathbf{x}_l$ is equally divided along the channel dimension into four parts $\mathbf{x}_l^{(1)}$, $\mathbf{x}_l^{(2)}$, $\mathbf{x}_l^{(3)}$ and $\mathbf{x}_l^{(4)}$. Note that after this activation function, the number of channels of the input feature maps is reduced to a quarter. We denote this MU variant as MU-D. Incorporating a simple max function between two sets of feature maps provides nonlinearity with various properties, as follows:

  • MU simply transfers feature map values from the input layer to the next, acting as the linear parts of ReLU. In backpropagation, error gradients simply flow to the selected values (maximum).

  • Because MU does not treat negative and positive values differently, unlike ReLU, the outputs of MU always carry actual feature values, reducing the chance of producing many close-to-zero values in the feature maps and of failing to learn.

  • In narrow networks where the number of channels of feature maps is small, the MU allows for stable learning, while networks with ReLU may converge poorly.

  • MU always ensures 50% sparsity: the larger 50% of the feature map values are always selected and transmitted to the next layer, while the other 50% are discarded. In backpropagation, 50% of the paths are therefore always alive for error gradients to be back-propagated.

  • As the output of MU contains only 50% of the previous feature map values, the number of convolutional filter parameters in the next layer can be reduced by half, lowering both computation time and memory consumption (a short parameter-count sketch follows this list). Similarly, unlike ReLU, MU compresses the given feature maps by not transmitting the smaller, often close-to-zero, values. In doing so, network compactness is improved while the needed information is preserved.
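As a quick illustration of the parameter savings described in the last item, the sketch below counts the weights of a 3×3 convolution that follows either a ReLU (channel count preserved) or an MU (channel count halved); the layer sizes are hypothetical and chosen only for the example.

```python
def conv_params(in_ch, out_ch, k=3):
    # Number of weights in a k x k convolution (biases ignored for simplicity).
    return k * k * in_ch * out_ch

# A layer producing 64 feature maps, followed by another 64-filter 3x3 convolution.
after_relu = conv_params(in_ch=64, out_ch=64)   # ReLU keeps all 64 channels
after_mu   = conv_params(in_ch=32, out_ch=64)   # MU halves the input to 32 channels

print(after_relu, after_mu)  # 36864 vs. 18432 -> the next layer is half the size
```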

We demonstrate the effectiveness of MU through various experiments in Section 3. Based on the properties of MU, we further investigate other variants of MU.

2.3 MU variants

Starting from MU, variants can be designed that preserve similar properties: minimum, sorting and recursive.

2.3.1 Minimum.

Instead of using the max function, one can design an activation function with the min function as

$\mathbf{y}_l = \min(\mathbf{x}_l^{(1)}, \mathbf{x}_l^{(2)})$    (4)

where $\min(\cdot)$ computes the minimum of its two inputs in an element-wise fashion. In training, this variant behaves similarly to the original MU. We denote this MU variant as MU-M.

2.3.2 Sorting.

If we want to maintain the size of the feature maps, as ReLU does, we can employ both the max and min functions in one activation function as

$\mathbf{y}_l = \mathrm{cat}\big(\max(\mathbf{x}_l^{(1)}, \mathbf{x}_l^{(2)}),\ \min(\mathbf{x}_l^{(1)}, \mathbf{x}_l^{(2)})\big)$    (5)

where $\mathrm{cat}(\cdot)$ concatenates its inputs along the channel dimension. We denote this MU variant as MU-S.

2.3.3 Recursive.

By applying MU recursively n times before the convolutions of the next layer, we can enforce even more sparsity (e.g. 75% for n = 2), producing further reduced feature maps as outputs. This can be expressed as

$\mathbf{y}_l = \mathrm{MU}^{(n)}(\mathbf{x}_l)$    (6)

where $\mathrm{MU}^{(n)}(\cdot)$ indicates n-times repeated MU, whose output has its number of channels reduced by a factor of $2^n$. We denote this MU variant as MU-R.
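For concreteness, here is a minimal NumPy sketch (our own illustration, not the authors' code) of the three variants MU-M, MU-S and MU-R alongside the basic MU, showing how each changes the channel count.

```python
import numpy as np

def split_halves(x):
    # Split feature maps into two equal halves along the channel axis.
    return np.split(x, 2, axis=-1)

def mu(x):                      # MU, Eq. (2): channels halved
    x1, x2 = split_halves(x)
    return np.maximum(x1, x2)

def mu_m(x):                    # MU-M, Eq. (4): element-wise minimum, channels halved
    x1, x2 = split_halves(x)
    return np.minimum(x1, x2)

def mu_s(x):                    # MU-S, Eq. (5): max and min concatenated, channels preserved
    x1, x2 = split_halves(x)
    return np.concatenate([np.maximum(x1, x2), np.minimum(x1, x2)], axis=-1)

def mu_r(x, n=2):               # MU-R, Eq. (6): MU applied n times, channels reduced by 2**n
    for _ in range(n):
        x = mu(x)
    return x

feat = np.random.randn(1, 8, 8, 64)
print(mu(feat).shape, mu_m(feat).shape, mu_s(feat).shape, mu_r(feat).shape)
# (1, 8, 8, 32) (1, 8, 8, 32) (1, 8, 8, 64) (1, 8, 8, 16)
```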

Figure 2 illustrates the various activation functions, including MU and the MU variants. Through additional experiments using these variants, we confirmed that networks with the variants can also be trained well, as shown in Section 3.

Figure 3: Our proposed SR network, which incorporates MU as activation functions, residual learning between an interpolated image and the target HR image, and a sub-pixel convolution layer for up-scaling.

2.4 Network details

By incorporating MU and its variants, we propose multiple network structures as shown in Figure 3, and show their performance for SR applications.

2.4.1 Toy networks.

In order to conduct many quick validation experiments comparing the effects of multiple activation function variants including MU, we use a baseline toy network structure shared across all types of activation functions. The toy networks were trained using the smaller training dataset from [39]. Each toy network consists of three types of layers: 6 layers of 3×3 convolutions, one type of activation function, and one sub-pixel convolution layer [31] at the end for up-scaling. For the convolutional layers, we simply use a kernel size of 3×3 and zero-pad the input feature maps before convolution, so that the spatial size of the feature maps is preserved until the last sub-pixel convolution layer. The experimental results obtained using the toy networks are presented in Figures 1, 6 and 7 and Table 2.
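The sub-pixel convolution layer used at the end of these networks rearranges an H×W×(C·r²) feature map into an rH×rW×C output. Below is a minimal NumPy sketch of that depth-to-space rearrangement under our own layout assumption of (N, H, W, C); it is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def depth_to_space(x, r):
    # Rearrange (N, H, W, C*r*r) feature maps into (N, H*r, W*r, C),
    # i.e. the pixel-shuffle step of a sub-pixel convolution layer.
    n, h, w, c = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c // (r * r)
    x = x.reshape(n, h, w, r, r, c_out)
    x = x.transpose(0, 1, 3, 2, 4, 5)        # interleave the r x r blocks spatially
    return x.reshape(n, h * r, w * r, c_out)

# For a scale factor of 4 and a single output (Y) channel, the last convolution
# would output 4*4 = 16 channels per LR pixel.
lr_feat = np.random.randn(1, 10, 10, 16)
print(depth_to_space(lr_feat, r=4).shape)    # (1, 40, 40, 1)
```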

2.4.2 ESPCN-MU.

For comparison, several state-of-the-art SR network structures [21, 31] are implemented as stated in their papers, but using MU and with some modifications. Our first SR network using MU is based on ESPCN [31]. We replace all ReLU layers in [31] with MU. The 5-3-3 model of [31] is also used in our network, with 64 filters for the first convolutional layer and 32 filters for the second. Note that, due to MU's characteristic of halving the number of channels after activation, the number of filter parameters of our network is reduced to almost half of that of ESPCN [31]. In addition, we learn the residual between the original HR images and interpolated LR images as in [21], but we use nearest-neighbor interpolation instead of bicubic to make the SR problem harder and thus to focus mainly on the capability of the different activation functions; doing so also makes the networks converge faster. Due to its small number of parameters, we train it with the small training set [39], yet it still produces SR results comparable to [31].
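The following sketch shows how an ESPCN-MU-style model could be assembled in TensorFlow/Keras under our reading of the description above (5-3-3 kernels, 64 and 32 filters, MU halving the channels, and sub-pixel up-scaling); the exact layer settings are assumptions for illustration, not the authors' released code, and the residual branch to the nearest-neighbor-interpolated input is omitted for brevity.

```python
import tensorflow as tf

def mu(x):
    # Maxout unit: element-wise max of the two channel halves (Eq. 2).
    x1, x2 = tf.split(x, 2, axis=-1)
    return tf.maximum(x1, x2)

def espcn_mu(scale=4):
    # 5-3-3 structure as in ESPCN, with MU halving the channels after each activation.
    inp = tf.keras.Input(shape=(None, None, 1))          # Y channel of the LR image
    x = tf.keras.layers.Conv2D(64, 5, padding='same')(inp)
    x = tf.keras.layers.Lambda(mu)(x)                    # 64 -> 32 channels
    x = tf.keras.layers.Conv2D(32, 3, padding='same')(x)
    x = tf.keras.layers.Lambda(mu)(x)                    # 32 -> 16 channels
    x = tf.keras.layers.Conv2D(scale * scale, 3, padding='same')(x)
    out = tf.keras.layers.Lambda(
        lambda t: tf.nn.depth_to_space(t, scale))(x)     # sub-pixel up-scaling
    return tf.keras.Model(inp, out)

model = espcn_mu(scale=4)
model.summary()  # parameter count is roughly half that of a ReLU-based ESPCN
```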

Methods        | Bicubic      | SRCNN [11]   | ESPCN [31]  | ESPCN [31]   | VDSR [21]    | SRResNet [25]
# of Params    | -            | 57K          | 25K         | 25K          | 665K         | 923K
Training Sets  | -            | ImageNet     | 91          | ImageNet     | 291          | ImageNet
Set5  ×3       | 30.40/0.8687 | 32.75/0.9095 | 32.39/-     | 33.00/0.9121 | 33.66/0.9213 | -
Set5  ×4       | 28.43/0.8109 | 30.49/0.8634 | -           | 30.76/0.8679 | 31.35/0.8838 | 32.06/0.8927
Set14 ×3       | 27.55/0.7741 | 29.30/0.8219 | 28.97/-     | 29.51/0.8247 | 29.77/0.8314 | -
Set14 ×4       | 26.01/0.7023 | 27.50/0.7517 | -           | 27.75/0.7580 | 28.01/0.7674 | 28.59/0.7811
B100  ×3       | 27.21/0.7389 | 28.41/0.7867 | -           | -            | 28.82/0.7976 | -
B100  ×4       | 25.96/0.6678 | 26.90/0.7107 | -           | -            | 27.29/0.7251 | 27.60/0.7361
*Results for the 9-5-5 model of SRCNN and results of ESPCN using ReLU are reported.

Methods        | ESPCN-MU     | VDSR-MU      | DNSR
# of Params    | 13K          | 338K         | 133K
Training Sets  | 91           | 291          | 291
Set5  ×3       | 32.85/0.9118 | 33.92/0.9231 | 33.80/0.9224
Set5  ×4       | 30.57/0.8667 | 31.61/0.8861 | 31.57/0.8858
Set14 ×3       | 29.40/0.8222 | 29.99/0.8346 | 29.95/0.8338
Set14 ×4       | 27.61/0.7547 | 28.21/0.7713 | 28.21/0.7714
B100  ×3       | 28.40/0.7853 | 28.87/0.7989 | 28.82/0.7980
B100  ×4       | 26.91/0.7114 | 27.31/0.7262 | 27.30/0.7260

Table 1: Average performance comparison (PSNR (dB) / SSIM) of various SR methods for scale factors of 3 and 4.

2.4.3 VDSR-MU.

In addition, we propose another SR network using MU, based on the 20-layered VDSR [21]. Similar to [21], 20 convolutional layers with 3×3-sized filters are used in our network, and we replace all ReLU layers in [21] with MU. As with ESPCN-MU, the number of filter parameters of our network is reduced to almost half of that of VDSR [21]. We also use nearest-neighbor interpolation instead of bicubic, and a sub-pixel convolution layer [31] for faster computation. Due to its larger number of parameters, our VDSR-MU network was trained using the larger dataset combining [39] and [28], as in VDSR [21].

2.4.4 DNSR.

We also present a deeper and narrower version of VDSR-MU, called DNSR. While VDSR-MU has 20 layers with 64 channels, DNSR has 30 layers (deeper) with 32 channels (narrower). Due to its deeper structure, we also employ residual units [17] in DNSR for stable learning. DNSR has a smaller total number of filter parameters, about 1/5 of that of VDSR [21] and about 1/2.6 of that of VDSR-MU, while showing PSNR performance similar to VDSR-MU.

3 Experiment results

We now demonstrate the effectiveness of MU and its variants in SR framework on popular image datasets, compared to conventional SR deep networks with common nonlinear activation functions, including ReLU.

3.1 Experiment settings

3.1.1 Datasets.

Two popular datasets [39, 28] were used for training the networks. The images in these datasets were used as the original HR images. Before being given to the networks, the LR-HR training images are normalized to the range [0, 1], and 0.5 is then subtracted from the LR training images so that they are roughly zero-mean. The LR input images were created from the HR images by applying nearest-neighbor interpolation. The SR process is applied only to the Y channel of the YCbCr color space, and the chroma components, Cb and Cr, are up-scaled using simple bicubic interpolation. When comparing SR output images with the original HR images, performance measures such as PSNR are computed on the Y channel.
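A minimal sketch of this preprocessing is given below, with nearest-neighbor downsampling standing in for the LR image creation and with our own assumptions about array layout; it is an illustration, not the authors' pipeline.

```python
import numpy as np

def make_lr_nearest(hr, scale):
    # Create an LR image by nearest-neighbor downsampling of the HR image.
    return hr[::scale, ::scale]

def preprocess_pair(hr_y, scale):
    # hr_y: HR luminance (Y) channel as a uint8 array with values in [0, 255].
    hr = hr_y.astype(np.float32) / 255.0     # normalize to [0, 1]
    lr = make_lr_nearest(hr, scale)
    lr = lr - 0.5                            # roughly zero-mean LR network input
    return lr, hr                            # network input / training target

hr_y = np.random.randint(0, 256, size=(160, 160), dtype=np.uint8)
lr, hr = preprocess_pair(hr_y, scale=4)
print(lr.shape, hr.shape)  # (40, 40) (160, 160)
```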

The training set of 91 images [39] has frequently been used in various SR methods [10, 21, 31, 39]. The dataset consists of small-resolution images with a variety of texture types. In our experiments, this smaller training set was used for the various toy networks, in order to conduct many experiments quickly, and it was also used for our ESPCN-MU.

The Berkeley Segmentation Dataset (BSD) [28] has also often been used in SR work [21, 31]. This dataset includes 200 training images and 100 testing images for segmentation. As in VDSR [21], we use the 200 training images of BSD together with the 91 images from [39] for training. This larger set was used for training VDSR-MU and DNSR.

For testing, three popular benchmark datasets including Set5 [3], Set14 [40] and BSD100 [28] were used.

3.1.2 Training.

We trained all the networks using ADAM [23] optimization, with the hyper-parameters other than the initial learning rate left at their defaults. We employed the uniform weight initialization technique of [19] for training. All the networks, including our proposed networks with MU, were implemented using TensorFlow [1], a deep learning toolbox for Python, and were trained and tested on an Nvidia Titan Xp GPU.

The toy networks were trained with the learning rate lowered by a factor of 10 once partway through training. The mini-batch size was set to 2, weight decay was not used, and simple data augmentation with flipping and rotation was applied. For the sub-images, LR-HR training image pairs were randomly cropped to a size of 40×40 for a scale factor of 4.

Our ESPCN-MU, VDSR-MU and DNSR networks were likewise trained with the learning rate lowered by a factor of 10 once partway through training. The mini-batch size was set to 4, and weight decay was not used. To create sub-images for training, LR-HR training image pairs were randomly cropped to sizes of 75×75 and 76×76 in HR space for scale factors of 3 and 4, respectively. We apply various data augmentations to the HR images, such as flipping, rotation, mirroring, and randomly multiplying their intensities by a value in the range from 0.8 to 1.2. The data augmentations are performed on the fly for every epoch during training to reduce overfitting.
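A minimal sketch of these on-the-fly augmentations (flips, rotations, mirroring and random intensity scaling) is shown below; the function and its random choices are our own illustration, not the authors' exact pipeline.

```python
import numpy as np

def augment(hr, rng):
    # Randomly flip / mirror the HR patch.
    if rng.random() < 0.5:
        hr = np.flipud(hr)
    if rng.random() < 0.5:
        hr = np.fliplr(hr)
    # Random 90-degree rotation.
    hr = np.rot90(hr, k=rng.integers(0, 4))
    # Random intensity scaling in [0.8, 1.2], clipped back to the valid range.
    hr = np.clip(hr * rng.uniform(0.8, 1.2), 0.0, 1.0)
    return hr

rng = np.random.default_rng(0)
patch = rng.random((76, 76))          # a 76x76 HR sub-image for scale factor 4
print(augment(patch, rng).shape)      # (76, 76)
```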

3.2 SR results

First, we show SR results using our three proposed SR networks, ESPCN-MU, VDSR-MU and DNSR, and compare them with state-of-the-art methods, including SRCNN [11], ESPCN [31], VDSR [21] and SRResNet [25]. Table 1 summarizes the performance of all the SR methods, including their numbers of filter parameters, the training sets they used, and their PSNR and SSIM [37] values for scale factors of 3 and 4 on three popular testing datasets. For SRCNN [11], the reported results of the 9-5-5 model are shown. For ESPCN [31], the reported results using ReLU for two different training datasets are shown in Table 1. The PSNR/SSIM values of the conventional SR methods in Table 1 are either those reported in their respective papers or calculated directly from their publicly available result images. Figures 4 and 5 show reconstructed HR images and magnified portions of baby and zebra, respectively, using various SR methods for a scale factor of 4.

Figure 4: Reconstructed HR images of baby using various SR methods for a scale factor of 4.
Figure 5: Reconstructed HR images of zebra using various SR methods for a scale factor of 4.

3.2.1 SR performance.

As shown in Table 1, SRResNet [25], the SR network with the largest number of filter parameters (about 900K), which was trained using ImageNet [30], shows the highest PSNR and SSIM performance among the various SR methods. Our proposed VDSR-MU and DNSR show the second and third highest performance with only 338K and 133K parameters, respectively, outperforming most of the conventional SR methods except SRResNet. Our networks using MU are thus highly parameter-efficient, showing reasonable PSNR performance with far fewer parameters than the other SR methods. As shown in Figures 4 and 5, the quality of the HR images reconstructed by our VDSR-MU and DNSR is comparable to that of SRResNet [25]. In particular, VDSR-MU and DNSR reconstruct clearly discernible stripes on the zebra, as shown in Figure 5-(i) and (j), comparable to SRResNet in Figure 5-(g), while the other SR methods fail to do so.

3.2.2 ESPCN-MU.

To show the effectiveness of using MU in SR, we compare two similar networks: ESPCN [31] and our ESPCN-MU. As shown in Table 1, ESPCN-MU has only about 13K parameters, which is almost half the number of parameters of ESPCN [31]. Although trained using only the 91 images, our ESPCN-MU outperforms ESPCN [31] trained with the same 91 training images, and shows comparable performance even to the ESPCN [31] trained on a larger set of images from ImageNet [30].

Figure 6: Average ratios of activated neurons for each feature map in two toy networks using ReLU and MU, for Bird and Butterfly, respectively (100% activation is shown in white, 0% in black and 50% in gray). Rows and columns indicate channels and layers, respectively.

3.2.3 VDSR-MU.

Similar to ESPCN-MU, our VDSR-MU outperforms VDSR [21] in terms of PSNR and SSIM, but with a much lower number of parameters. Our VDSR-MU network has about 338K parameters, which is about half the number of parameters of VDSR [21]. Note that both networks were trained using the same 291 images [39, 28].

3.2.4 DNSR.

Our deeper and narrower version of VDSR-MU, DNSR, has about 2/5 of the number of filter parameters of VDSR-MU, which corresponds to about 1/5 of VDSR [21] and 1/7 of SRResNet [25]. Even with this low number of parameters, DNSR reconstructs HR images comparable to VDSR-MU and outperforms most of the conventional SR methods.

Activation Function | Number of Params | Training PSNR (dB) | Testing PSNR (dB)
ReLU [29]           | 7.1K             | 28.22              | 29.81
LReLU [27]          | 7.1K             | 28.19              | 29.78
ELU [9]             | 7.1K             | 27.91              | 29.30
MU [15]             | 6K               | 28.42              | 30.07
MU-D [15]           | 6.4K             | 28.46              | 30.07
MU-M                | 6K               | 28.43              | 30.05
MU-S                | 7.1K             | 28.46              | 30.08
MU-R                | 6.4K             | 28.38              | 29.98

Table 2: Training and testing PSNR (dB) performance after the same initial number of training iterations for networks with various activation functions (the convolutional filter sizes are adjusted so that the total numbers of parameters are similar).

3.3 Discussions

We also conducted experiments on toy networks using various activation functions, including ReLU, MU and the MU variants. We show potential advantages of MU over the units used in conventional SR methods by analyzing parameter-versus-PSNR performance and the activation rates in the feature maps.

3.3.1 MU versus ReLU.

Figure 1 compares PSNR performance versus the number of parameters for two toy networks with ReLU and MU, respectively. Both networks share the same 6-layered SR structure, differing only in the type of activation function used. The number of parameters in each sub-test is controlled by adjusting the number of convolution filters. As shown, the PSNR gap between the networks using ReLU and MU becomes larger as the number of parameters decreases. This indicates that in narrow networks, where the number of channels in the feature maps is small, MU allows for stable learning, while ReLU converges to a worse point. We argue that, because MU does not treat negative and positive values differently as ReLU does, the outputs of MU always carry actual feature values, reducing the chance of producing many close-to-zero values in the feature maps and of failing to learn.

Figure 6 shows the average ratios of activated neurons for each feature map in two toy networks using ReLU and MU, for Bird and Butterfly, respectively (100% activation is shown in white, 0% in black and 50% in gray). The rows and columns indicate channels and layers, respectively. It is interesting to see that the activations after MU are sparser than those of ReLU, which supports the effectiveness of MU in SR. Note that, since the maximum values between the two sets of feature maps are always passed to the next layer, the feature maps after MU are always 100% activated, with half the number of feature maps. Figure 6 also suggests that MU may be related to network pruning; exploring this connection remains future work.
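For reference, here is a small sketch of how such per-channel activation ratios could be computed from a feature map; counting a neuron as activated when its post-activation value is non-zero is our own assumption, and this is not the authors' measurement script.

```python
import numpy as np

def activation_ratios(feat, eps=1e-8):
    # feat: post-activation feature maps with shape (H, W, C).
    # Returns, per channel, the fraction of spatial positions whose
    # magnitude exceeds eps (i.e. "activated" neurons).
    active = np.abs(feat) > eps
    return active.mean(axis=(0, 1))

# Example: ReLU feature maps typically contain many exact zeros,
# so their per-channel ratios fall below 1.0.
conv_out = np.random.randn(32, 32, 16)
relu_out = np.maximum(conv_out, 0.0)
print(activation_ratios(relu_out))  # 16 values, each around 0.5 for random inputs
```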

3.3.2 MU variants.

Table 2 shows the training and testing PSNR performance, after the same initial number of training iterations, for toy networks with various activation functions. Note that the size of the convolutional filters has been adjusted for each network to yield a similar total number of parameters. Figure 7 presents a PSNR-versus-iteration plot for the networks with the various activation functions from Table 2. It can be seen in Figure 7 that the networks with MU and the MU variants converge faster than those with ReLU and the ReLU variants. Note that the network with ELU has difficulty training on the SR problem, contrary to its reported performance in classification papers. This may be because ELU tends to distort the scales of the input values, which is undesirable in regression problems such as SR. Overall, the networks with MU and its variants achieve higher PSNR values with fewer parameters.

Figure 7: A PSNR-versus-iteration plot for networks with various activation functions from Table 2.

4 Conclusion

The proposed SR networks showed superior PSNR performance compared to baseline networks using ReLU and other activation functions. SR networks using MU tend to produce higher PSNR results with a smaller number of convolutional filter parameters, which is desirable for computational platforms with limited resources. We showed that MU can be used in regression problems, especially SR, and that they have potential for further extension to new types of activation functions for other applications.

References

  • [1] Abadi, M., Barham, P., Chen, J., et al.: Tensorflow: A system for large-scale machine learning. In: OSDI. vol. 16, pp. 265–283 (2016)

  • [2] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
  • [3] Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding (2012)
  • [4] Cai, B., Xu, X., Jia, K., Qing, C., Tao, D.: Dehazenet: An end-to-end system for single image haze removal. IEEE Trans. Image Proc. 25(11), 5187–5198 (2016)
  • [5] Chang, J.R., Chen, Y.S.: Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583 (2015)
  • [6] Choi, J.S., Kim, M.: Super-interpolation with edge-orientation-based mapping kernels for low complex 2× upscaling. IEEE Trans. Image Proc. 25(1), 469–483 (2016)
  • [7] Choi, J.S., Kim, M.: A deep convolutional neural network with selection units for super-resolution. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. pp. 1150–1156 (2017)
  • [8] Choi, J.S., Kim, M.: Single image super-resolution using global regression based on multiple local linear mappings. IEEE Trans. Image Proc. 26(3), 1300–1314 (2017)
  • [9] Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289 (2015)
  • [10] Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: European Conf. Comp. Vis. pp. 184–199 (2014)
  • [11] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016)
  • [12] Freedman, G., Fattal, R.: Image and video upscaling from local self-examples. ACM Trans. on Graph. 30(2),  12 (2011)
  • [13] Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. IEEE Comp. Graph. and App. 22(2), 56–65 (2002)
  • [14] Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: IEEE Int. Conf. Comp. Vis. pp. 349–356 (2009)
  • [15] Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. arXiv preprint arXiv:1302.4389 (2013)
  • [16] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 1026–1034 (2015)
  • [17] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conf. Comp. Vis. pp. 630–645 (2016)
  • [18] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Int. Conf. Mach. Learn. pp. 448–456 (2015)
  • [19] Jia, Y., Shelhamer, E., Donahue, J., et al.: Caffe: Convolutional architecture for fast feature embedding. In: Proc. ACM Int. Conf. Mul. pp. 675–678 (2014)

  • [20] Jianchao, Y., Wright, J., Huang, T., Ma, Y.: Image super-resolution as sparse representation of raw image patches. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. pp. 1–8 (2008)
  • [21] Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. pp. 1646–1654 (2016)
  • [22] Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior. IEEE Trans. Pattern Anal. Mach. Intell. 32(6), 1127–1133 (2010)
  • [23] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [24] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Adv. Neural Info. Proc. Sys. pp. 1097–1105 (2012)
  • [25] Ledig, C., Theis, L., Huszár, F., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. vol. 2, p. 4 (2017)
  • [26] Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. vol. 1, p. 3 (2017)
  • [27] Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. Int. Conf. Mach. Learn. vol. 30, p. 3 (2013)
  • [28] Martin, D., Fowlkes, C., et al.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. IEEE Int. Conf. Comp. Vis. vol. 2, pp. 416–423 (2001)
  • [29] Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proc. Int. Conf. Mach. Learn. pp. 807–814 (2010)

  • [30] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. Journal Comp. Vis. 115(3), 211–252 (2015)
  • [31] Shi, W., Caballero, J., Huszár, F., et al.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. pp. 1874–1883 (2016)
  • [32] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Mach. Learn. Research 15(1), 1929–1958 (2014)
  • [33] Swietojanski, P., Li, J., Huang, J.T.: Investigation of maxout networks for speech recognition. In: IEEE Int. Conf. Acoust. Speech Signal Proc. pp. 7649–7653 (2014)
  • [34] Tai, Y., Yang, J., Liu, X.: Image super-resolution via deep recursive residual network. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. vol. 1 (2017)
  • [35] Timofte, R., Agustsson, E., Van Gool, L., et al.: Ntire 2017 challenge on single image super-resolution: Methods and results. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. pp. 1110–1121 (2017)
  • [36] Timofte, R., De Smet, V., Van Gool, L.: A+: Adjusted anchored neighborhood regression for fast super-resolution. In: Proc. Asian Conf. Comp. Vis. pp. 111–126 (2014)
  • [37] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Proc. 13(4), 600–612 (2004)
  • [38] Yang, C.Y., Huang, J.B., Yang, M.H.: Exploiting self-similarities for single frame super-resolution. In: Proc. Asian Conf. Comp. Vis. pp. 497–510 (2010)
  • [39] Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Trans. Image Proc. 19(11), 2861–2873 (2010)
  • [40] Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: Int. Conf. Curves and Surfaces. pp. 711–730 (2010)
  • [41] Zhang, K., Gao, X., Tao, D., Li, X.: Single image super-resolution with non-local means and steering kernel regression. IEEE Trans. Image Proc. 21(11), 4544–4556 (2012)
  • [42] Zhang, K., Tao, D., Gao, X., Li, X., Xiong, Z.: Learning multiple linear mappings for efficient single image super-resolution. IEEE Trans. Image Proc. 24(3), 846–861 (2015)