1 Introduction
Super-resolution (SR) methods aim to reconstruct high-resolution (HR) image or video contents from their low-resolution (LR) versions. The SR problem is known to be highly ill-posed: without proper prior information, an LR input can correspond to multiple HR versions [39]. As SR has recently become crucial in various areas such as upscaling full high-definition (FHD) content to 4K [6], it is important to develop SR algorithms capable of generating HR contents of superior visual quality while maintaining reasonable complexity and a moderate number of parameters.
1.1 Related work
SR methods can be divided into two families according to their input types: single image SR (SISR) and video SR. While both spatial and temporal information can be used in video SR, SISR utilizes only spatial information within given LR images, making the SR problem more difficult [12, 31]. In this paper, we mainly focus on SISR.
Various SR methods have employed the following techniques for reconstructing high-quality HR images: sparse representation [20, 22, 39], linear mappings [8, 36, 41, 42, 6], self-examples [12, 13, 14, 38], and neural networks [7, 10, 11, 21, 25, 26, 31, 34, 35]. Sparse-representation-based SR methods [20, 22, 39] undergo heavy computations to calculate the sparse representation of an LR patch over a pre-trained and complex LR dictionary. The resultant sparse representation is then applied to a corresponding HR dictionary to reconstruct the patch's HR version. Some SR methods [12, 13, 14, 38] extracted LR-to-HR mappings by searching for patches similar to the current patch (self-examples) inside its self-dictionary. Linear-mapping-based SR (LMSR) methods [6, 8, 36, 41, 42] have been proposed to obtain HR images of comparable quality but with much lower computational complexity. The adjusted anchored neighborhood regression (A+) method [36] searches for the best linear mapping for each LR patch, based on the correlation with pre-trained dictionary sets from [39]. Choi et al. [6, 8] employ simple edge classification to find suitable linear mappings, which are applied directly to small LR patches to reconstruct their HR versions.
Recently, SR methods using convolutional neural networks (CNN) [7, 10, 11, 21, 25, 26, 31, 34, 35] have shown high PSNR performance. Dong et al. [10] first utilized a 3-layered CNN for SR (SRCNN), and reported a remarkable performance jump compared to previous SR methods. Kim et al. [21] proposed a very deep 20-layered CNN (VDSR) with gradient clipping and residual learning, yielding reconstructed HR images of even higher PSNR than SRCNN. Shi et al. [31] proposed a network structure where features are extracted in LR space; the feature maps at the last layer are upscaled to HR space using a sub-pixel convolution layer. Recursive convolutions were also used in [34] to lower the number of parameters. Ledig et al. [25] presented two SR network structures: a network using residual units to maximize PSNR performance (SRResNet), and a network using generative adversarial networks for perceptual improvement (SRGAN). Lately, some SR methods using very deep networks with large numbers of parameters [7, 26, 35] have been proposed in the NTIRE2017 Challenge [35], achieving state-of-the-art PSNR performance.
In these deep-learning-based SR methods, rectified linear units (ReLU) [29] are used to obtain nonlinearity between two adjacent convolutional layers. ReLU is a simple function with an identity mapping for positive values and 0 for negative values. Unlike a sigmoid or hyperbolic tangent, ReLU does not suffer from the vanishing gradient problem. By using ReLU, networks can learn piecewise linear mappings between LR and HR images, which results in mappings of high visual quality and faster training convergence. There are other nonlinear activation functions such as leaky ReLU (LReLU) [27], parametric ReLU [16] and exponential linear units (ELU) [9], but unlike ReLU they are not often used in regression problems. While LReLU replaces the zero part of ReLU with a linear function of a certain small gradient, parametric ReLU parameterizes this gradient value so that a network can learn it. ELU has been designed to push mean unit activations closer to zero for faster learning.
1.2 Motivations and contributions
One major reason for the high performance of neural networks in many applications [7, 10, 11, 21, 25, 26, 31, 34, 35] would be the use of ReLU [29] and its successors [27]. These nonlinear units were first introduced in classification papers [2, 9, 16, 18, 19, 24, 27, 29], and were subsequently reused for regression problems such as SR. It can easily be noticed that while the ReLU and LReLU functions have been used frequently in SR, other types of activation functions [9] are rarely found. This is because they tend to distort the scales of input values (more in Section 3.3), and thus networks with these functions generate HR results of lower quality compared to those with ReLU. This phenomenon can also be observed in normalization layers such as batch normalization [18] and layer normalization [2], and there have been some reports that these normalization layers degrade performance when used in regression problems [7, 26].
In this paper, we try to tackle some limitations of ReLU: i) ReLU produces feature maps with many zeros, whose number is not controllable; ii) therefore, learning with ReLU tends to collapse in a network with very deep layers without some help such as identity mappings [17]; and iii) there could be a way to make use of those empty zero values so that we may be able to reduce the number of channels for lower memory consumption and fewer computations.
Maxout units (MU) [15] are activation units that could overcome the aforementioned limitations. MU were first introduced for various classification problems [5, 15, 33]. Goodfellow et al. [15] proposed MU and used them in conjunction with dropout [32] in a multi-layer perceptron (MLP), showing competitive classification results compared to those using conventional ReLU [29]. In [33], MU were used for speech recognition, where networks with MU were reported to converge about three times faster in training with comparable performance. In addition, Chang et al. [5] reported a network-in-network structure using MU for classification, which was able to mitigate the vanishing gradient problem that can occur when using ReLU. Although networks using MU are known to work well in high-level vision areas, only a few works [4] have employed MU for regression problems. In this paper, we develop and present a novel SR network incorporating MU. Our contributions are as follows:
Contrary to the common belief that the number of parameters must be doubled when using MU, we first show that MU can be effectively incorporated into restoration problems. In our SR network with MU, the number of channels of input feature maps is halved while still producing good results, leading to less memory usage and lower computational cost.

We present an in-depth analysis of networks using basic MU, and further investigate other MU variants, showing their effectiveness for the SR application.
Various experimental results show that our SR networks incorporating MU as activation functions are able to reconstruct HR images of competitive quality compared to those using ReLU. Figure 1 compares PSNR performance versus the number of parameters for two toy network examples with ReLU and MU, respectively. Both networks share the same 6-layered SR structure, except for the type of activation function used.
2 Maxout units
First, let us denote the outputs of the l-th convolutional layer as x^l, where the network has L convolutional layers. Also, we denote the outputs of an activation function applied to x^l as \hat{x}^l.
2.1 Conventional nonlinear activation functions
Many SR methods [7, 10, 11, 21, 25, 26, 31, 34, 35] use ReLU [29] as the activation function between every two convolutional layers to obtain high nonlinearity between LR and HR. After each ReLU, the negative part of the feature maps becomes zero as

\hat{x}^l = \max(x^l, 0),    (1)

where max() is a function that calculates the maximum values between two inputs in an element-wise fashion. The negative parts, where inputs become zero, ensure nonlinearity, while the positive parts allow for fast learning as the derivative there is unity. However, very deep or narrow networks may have difficulty in learning when too many values fall into the negative range and become zero. While ReLU variants such as LReLU [27] and ELU [9] try to overcome this limitation by modifying the negative parts, they still have little control over the ratio of zeroed values.
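As a concrete illustration of Eq. (1) and the uncontrollable zero ratio discussed above, the NumPy sketch below (illustrative only, not part of the proposed networks) applies ReLU to a toy feature vector and measures how many activations are zeroed:

```python
import numpy as np

def relu(x):
    """Eq. (1): element-wise max(x, 0)."""
    return np.maximum(x, 0.0)

# Toy feature map: one spatial position with 8 channels.
x = np.array([-2.0, -0.5, 0.3, 1.2, -1.1, 0.0, 2.4, -0.7])
y = relu(x)

# The number of zeroed activations depends entirely on the input
# distribution -- the network has no direct control over this ratio.
zero_ratio = np.mean(y == 0.0)
print(y)           # 5 of the 8 activations are zeroed
print(zero_ratio)  # 0.625
```

For inputs drawn from a zero-mean distribution roughly half the activations vanish, and nothing in Eq. (1) lets the network steer that fraction.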
2.2 Maxout unit
To overcome these limitations, we propose an SR network structure incorporating MU.
2.2.1 Maxout.
MU [15] computes the maximum of a vector of any length. Here, we use a special case of MU, where the feature maps x^l are halved along the channel dimension into two parts x^l_1 and x^l_2, and the element-wise maximum of these two parts is calculated as

\hat{x}^l = \max(x^l_1, x^l_2).    (2)
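The channel-halving MU of Eq. (2) can be sketched in NumPy as follows (an illustrative stand-in for the paper's actual TensorFlow implementation; the array shapes and function name are our own):

```python
import numpy as np

def maxout(x):
    """Eq. (2): split the feature maps in half along the channel axis
    and take the element-wise maximum of the two halves.
    x: (H, W, C) with C even; returns (H, W, C // 2)."""
    c = x.shape[-1]
    assert c % 2 == 0, "channel count must be even"
    x1, x2 = x[..., : c // 2], x[..., c // 2 :]
    return np.maximum(x1, x2)

x = np.random.randn(8, 8, 64)   # a 64-channel feature map
y = maxout(x)
print(y.shape)                  # (8, 8, 32) -- channel count halved
```

Because the output has half the channels, the next convolution needs only half the input-channel parameters, which is where the savings discussed later come from.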
2.2.2 Difference of two MU.
In [15], a difference of two MU was also introduced with the proposition that any continuous piecewise linear function can be expressed as a difference of two convex piecewise linear functions. In this paper, we use the form

\hat{x}^l = \max(x^l_1, x^l_2) - \max(x^l_3, x^l_4),    (3)

where x^l is equally divided along the channel dimension into four parts x^l_1, x^l_2, x^l_3 and x^l_4. Note that after this activation function, the number of channels of the input feature maps is reduced to a quarter. We denote this MU variant as MU-D. Incorporating a simple max function between two sets of feature maps provides nonlinearity with the following properties:

MU simply transfers feature map values from the input layer to the next, acting as the linear parts of ReLU. In backpropagation, error gradients simply flow to the selected (maximum) values.

Because MU, unlike ReLU, does not distinguish between negative and positive values, the outputs of MU always carry actual feature values, alleviating the chance of creating many close-to-zero values in feature maps and failing in learning.

In narrow networks where the number of channels of feature maps is small, the MU allows for stable learning, while networks with ReLU may converge poorly.

MU always ensures 50% sparsity: the larger 50% of the feature map values are always selected and transmitted to the next layer, while the other 50% are discarded. In backpropagation, 50% of the paths thus always remain alive for error gradients to be backpropagated.

As the output of MU contains only 50% of the previous feature map values, the number of convolutional filter parameters in the next layer can be reduced by half, lowering both computation time and memory consumption. Moreover, unlike ReLU, MU is able to compress the given feature maps by stopping the transmission of close-to-zero values; in doing so, network compactness is improved while the needed information is preserved.
We demonstrate the effectiveness of MU through various experiments in Section 3. Based on the properties of MU, we further investigate other variants of MU.
2.3 MU variants
Starting from MU, variants can be designed that preserve similar properties: minimum, sorting and recursive.
2.3.1 Minimum.
Instead of using the max function, one can design an activation function with the min function as

\hat{x}^l = \min(x^l_1, x^l_2),    (4)

where min() is a function that calculates the minimum values between two inputs in an element-wise fashion. In training, this variant behaves similarly to the original MU. We denote this MU variant as MU-M.
2.3.2 Sorting.
If we are to maintain the size of the feature maps as ReLU does, we can employ both the max and min functions in one activation function as

\hat{x}^l = \mathrm{cat}(\max(x^l_1, x^l_2), \min(x^l_1, x^l_2)),    (5)

where cat() is a function that concatenates all inputs along the channel dimension. We denote this MU variant as MU-S.
2.3.3 Recursive.
By using MU recursively n times before applying the convolutions of the next layer, we can enforce even more sparsity, e.g. 75%, resulting in further reduced feature maps as outputs. This can be expressed as

\hat{x}^l = \mathrm{MU}^n(x^l),    (6)

where MU^n indicates the n-times repeated MU, whose output channels are reduced by a factor of 2^n. We denote this MU variant as MU-R.
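The four variants of Eqs. (3)-(6) can be sketched in NumPy under the same channel-splitting convention as before; the helper names (mu_d, mu_m, mu_s, mu_r) are ours, and only the channel arithmetic is meant to match the text:

```python
import numpy as np

def _split(x, n):
    """Split feature maps into n equal parts along the channel axis."""
    return np.split(x, n, axis=-1)

def mu(x):                      # Eq. (2): MU, C -> C/2
    x1, x2 = _split(x, 2)
    return np.maximum(x1, x2)

def mu_d(x):                    # Eq. (3): difference of two MU, C -> C/4
    x1, x2, x3, x4 = _split(x, 4)
    return np.maximum(x1, x2) - np.maximum(x3, x4)

def mu_m(x):                    # Eq. (4): element-wise minimum, C -> C/2
    x1, x2 = _split(x, 2)
    return np.minimum(x1, x2)

def mu_s(x):                    # Eq. (5): concat of max and min, C -> C
    x1, x2 = _split(x, 2)
    return np.concatenate([np.maximum(x1, x2), np.minimum(x1, x2)], axis=-1)

def mu_r(x, n):                 # Eq. (6): n-times repeated MU, C -> C / 2^n
    for _ in range(n):
        x = mu(x)
    return x

x = np.random.randn(8, 8, 64)
print(mu(x).shape, mu_d(x).shape, mu_m(x).shape, mu_s(x).shape, mu_r(x, 2).shape)
# (8, 8, 32) (8, 8, 16) (8, 8, 32) (8, 8, 64) (8, 8, 16)
```

Note how only MU-S preserves the channel count (matching ReLU), while MU-R with n = 2 keeps a quarter of the channels, i.e. 75% sparsity.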
Figure 2 illustrates the various activation functions, including MU and its variants. Through additional experiments using the MU variants, we confirmed that networks with these variants can be trained well, as shown in Section 3.
2.4 Network details
By incorporating MU and its variants, we propose multiple network structures as shown in Figure 3, and show their performance for SR applications.
2.4.1 Toy networks.
In order to conduct many quick validations comparing the effects of multiple activation function variants including MU, we present a baseline toy network structure that is shared for testing all types of activation functions. The toy networks were trained using the smaller training dataset from [39]. Our toy networks include three types of layers: 6 layers of 3×3 convolutions, one type of activation function, and one sub-pixel convolution layer [31] at the end for upscaling. For the convolutional layers, we simply use a kernel size of 3×3, where input feature maps are padded with zeros before convolution so that the size of the feature maps is preserved until the last sub-pixel convolution layer. The experimental results obtained using the toy networks are presented throughout Figures 1, 6, 7 and Table 2.
2.4.2 ESPCN-MU.
For comparison, several state-of-the-art SR network structures [31, 21] are implemented as stated in their papers, but using MU and some modifications. Our first SR network using MU is based on ESPCN [31]. We replace all ReLU layers in [31] with MU. The 5-3-3 model [31] is also used in our network, with 64 filters for the first convolutional layer and 32 filters for the second. Note that due to MU's characteristic of halving the number of channels after activation, the number of filter parameters of our network is reduced almost by half compared to that of ESPCN [31]. In addition, we aim to learn the residual between original HR images and interpolated LR images as in [21], but we use nearest-neighbor interpolation instead of bicubic to make the SR problem harder and thus focus mainly on the capability of each type of activation function. In doing so, the networks converge faster. Due to its small number of parameters, we utilize a small training dataset [39], but still produce SR results comparable to [31].
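The roughly halved parameter count (25K for ESPCN vs. 13K for ESPCN-MU in Table 1) can be reproduced with simple arithmetic. The sketch below is our own reconstruction, assuming a scale factor of 4 (so the sub-pixel layer takes r² = 16 output channels) and biases included in the count:

```python
def conv_params(k, c_in, c_out):
    """Weights plus biases of one k x k convolutional layer."""
    return k * k * c_in * c_out + c_out

r = 4  # assumed upscaling factor; sub-pixel conv outputs r*r channels

# ESPCN (5-3-3 model, 64/32 filters; ReLU keeps channel counts intact)
espcn = (conv_params(5, 1, 64)
         + conv_params(3, 64, 32)
         + conv_params(3, 32, r * r))

# ESPCN-MU: each MU halves the channels entering the next convolution
espcn_mu = (conv_params(5, 1, 64)
            + conv_params(3, 64 // 2, 32)
            + conv_params(3, 32 // 2, r * r))

print(espcn, espcn_mu)  # 24752 13232 -- roughly the 25K vs. 13K of Table 1
```

The savings come entirely from the halved input-channel counts of the second and third convolutions; the first layer is unchanged.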
Methods | Bicubic | SRCNN [11] | ESPCN [31] | ESPCN [31] | VDSR [21] | SRResNet [25]
# of Params | – | 57K | 25K | 25K | 665K | 923K
Training Sets | – | ImageNet | 91 | ImageNet | 291 | ImageNet
Testing | Scale | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM
Set5 | 3 | 30.40 / 0.8687 | 32.75 / 0.9095 | 32.39 / – | 33.00 / 0.9121 | 33.66 / 0.9213 | –
Set5 | 4 | 28.43 / 0.8109 | 30.49 / 0.8634 | – | 30.76 / 0.8679 | 31.35 / 0.8838 | 32.06 / 0.8927
Set14 | 3 | 27.55 / 0.7741 | 29.30 / 0.8219 | 28.97 / – | 29.51 / 0.8247 | 29.77 / 0.8314 | –
Set14 | 4 | 26.01 / 0.7023 | 27.50 / 0.7517 | – | 27.75 / 0.7580 | 28.01 / 0.7674 | 28.59 / 0.7811
B100 | 3 | 27.21 / 0.7389 | 28.41 / 0.7867 | – | – | 28.82 / 0.7976 | –
B100 | 4 | 25.96 / 0.6678 | 26.90 / 0.7107 | – | – | 27.29 / 0.7251 | 27.60 / 0.7361
*Results for the 9-5-5 model of SRCNN and results of ESPCN using ReLU are reported.
Methods | ESPCN-MU | VDSR-MU | DNSR
# of Params | 13K | 338K | 133K
Training Sets | 91 | 291 | 291
Testing | Scale | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM
Set5 | 3 | 32.85 / 0.9118 | 33.92 / 0.9231 | 33.80 / 0.9224
Set5 | 4 | 30.57 / 0.8667 | 31.61 / 0.8861 | 31.57 / 0.8858
Set14 | 3 | 29.40 / 0.8222 | 29.99 / 0.8346 | 29.95 / 0.8338
Set14 | 4 | 27.61 / 0.7547 | 28.21 / 0.7713 | 28.21 / 0.7714
B100 | 3 | 28.40 / 0.7853 | 28.87 / 0.7989 | 28.82 / 0.7980
B100 | 4 | 26.91 / 0.7114 | 27.31 / 0.7262 | 27.30 / 0.7260
2.4.3 VDSR-MU.
In addition, we propose another SR network using MU, based on the 20-layered VDSR [21]. Similar to [21], 20 convolutional layers with 3×3-sized filters are used in our network. We replace all ReLU layers in [21] with MU. Similar to ESPCN-MU, the number of filter parameters of our network is reduced almost by half compared to that of VDSR [21]. Also, we use nearest-neighbor interpolation instead of bicubic, and a sub-pixel convolution layer [31] for faster computation. Due to its large number of parameters, our VDSR-MU network was trained using a larger dataset combining [39] and [28], as in VDSR [21].
2.4.4 DNSR.
We also present a deeper and narrower version of VDSR-MU, called DNSR. While VDSR-MU has 20 layers with 64 channels, our DNSR has 30 layers (deeper) with 32 channels (narrower). Due to its deeper structure, we also employ residual units [17] in DNSR for stable learning. DNSR holds a smaller number of total filter parameters, about 1/5 of that of VDSR [21] and about 1/2.6 of that of VDSR-MU, while showing PSNR performance similar to VDSR-MU.
3 Experiment results
We now demonstrate the effectiveness of MU and its variants in the SR framework on popular image datasets, compared to conventional deep SR networks with common nonlinear activation functions, including ReLU.
3.1 Experiment settings
3.1.1 Datasets.
Two popular datasets [39, 28] were used for training the networks. Images in the datasets were used as the original HR images, and LR input images were created from these HR images by applying nearest-neighbor interpolation. Before being given to the networks, LR-HR training images are normalized between 0 and 1, and then 0.5 is subtracted from the LR training images to give them zero mean. The SR process is applied only to the Y channel of the YCbCr color space; the chroma components, Cb and Cr, are upscaled using simple bicubic interpolation. When comparing SR output images with the original HR images, performance measures such as PSNR were computed on the Y channel.
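A minimal sketch of the Y-channel evaluation described above. The exact luma coefficients used are not stated in the text, so the full-range BT.601 weights below are an assumption, and the function names are our own:

```python
import numpy as np

def rgb_to_y(rgb):
    """Luma from RGB in [0, 1]; full-range BT.601 weights (an assumption here)."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def psnr_y(hr_rgb, sr_rgb):
    """PSNR computed on the Y channel only, for signals normalized to [0, 1]."""
    mse = np.mean((rgb_to_y(hr_rgb) - rgb_to_y(sr_rgb)) ** 2)
    return 10.0 * np.log10(1.0 / mse)

hr = np.zeros((4, 4, 3))
sr = np.full((4, 4, 3), 0.1)   # a constant error of 0.1 in every channel
print(psnr_y(hr, sr))          # 20 dB, since the MSE on Y is 0.01
```

Evaluating on Y only follows the common SR convention that luma carries most of the structural detail, while chroma is perceptually less critical.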
The training set of 91 images [39] has frequently been used in various SR methods [39, 10, 21, 31]. The dataset consists of small-resolution images with a variety of texture types. In our experiments, this smaller training set was used for the various toy networks in order to conduct many fast experiments, and was also used for our ESPCN-MU.
The Berkeley Segmentation Dataset (BSD) [28] has also often been used in SR works [21, 31]. This dataset includes 200 training images and 100 testing images for segmentation. As in VDSR [21], we utilize the 200 training images of BSD together with the 91 images from [39] for training. This larger set was used for training VDSR-MU and DNSR.
3.1.2 Training.
We trained all the networks using ADAM [23] optimization with an initial learning rate of and the other hyper-parameters as defaults. We employed the uniform weight initialization technique of [19] for training. All the networks, including our proposed networks with MU, were implemented using TensorFlow [1], a deep learning toolbox for Python, and were trained/tested on an Nvidia Titan Xp GPU.
The toy networks were trained for iterations, where the learning rate was lowered by a factor of 10 after iterations. The mini-batch size was set to 2, weight decay was not used, and simple data augmentation with flips and rotations was used. For sub-images, LR-HR training image pairs were randomly cropped to the size of 40×40 for a scale factor of 4.
Our ESPCN-MU, VDSR-MU and DNSR networks were trained for iterations, where the learning rate was lowered by a factor of 10 after iterations. The mini-batch size was set to 4, and weight decay was not used. To create sub-images for training, LR-HR training image pairs were randomly cropped to sizes of 75×75 and 76×76 in HR space for scale factors of 3 and 4, respectively. We apply various data augmentations to the HR images, such as flipping, rotating, mirroring, and randomly multiplying their intensities by a value in the range from 0.8 to 1.2. Data augmentation is done on the fly every epoch during training to reduce overfitting.
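The on-the-fly augmentation described above could be sketched as follows. Applying the same transform to the LR and HR crops of a pair, and clipping after the intensity scaling, are our assumptions; the crop sizes in the usage line match the scale-4 setting:

```python
import numpy as np

rng = np.random.default_rng()

def augment_pair(lr, hr):
    """On-the-fly augmentation of one LR-HR crop pair (square 2-D arrays in [0, 1]):
    random flips, a random 90-degree rotation, and random intensity scaling
    in [0.8, 1.2]. The same transform is applied to both crops."""
    if rng.random() < 0.5:                    # horizontal flip
        lr, hr = lr[:, ::-1], hr[:, ::-1]
    if rng.random() < 0.5:                    # vertical flip (mirroring)
        lr, hr = lr[::-1, :], hr[::-1, :]
    k = int(rng.integers(4))                  # rotate by k * 90 degrees
    lr, hr = np.rot90(lr, k), np.rot90(hr, k)
    s = rng.uniform(0.8, 1.2)                 # random intensity multiplication
    return np.clip(lr * s, 0.0, 1.0), np.clip(hr * s, 0.0, 1.0)

# e.g. a 76x76 HR crop and its 19x19 LR counterpart for a scale factor of 4
lr_crop, hr_crop = np.random.rand(19, 19), np.random.rand(76, 76)
lr_aug, hr_aug = augment_pair(lr_crop, hr_crop)
```

Sampling a fresh transform per pair and per epoch means the network rarely sees the exact same training sample twice, which is what reduces overfitting here.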
3.2 SR results
First, we show SR results using our three proposed SR networks, ESPCN-MU, VDSR-MU and DNSR, and compare them with state-of-the-art methods, including SRCNN [11], ESPCN [31], VDSR [21] and SRResNet [25]. Table 1 summarizes performance details for all the SR methods, including their numbers of filter parameters, the training sets used, and PSNR and SSIM [37] values for scale factors of 3 and 4, tested on three popular testing datasets. For SRCNN [11], the reported results of the 9-5-5 model are shown. For ESPCN [31], the reported results using ReLU for two different training datasets are shown in Table 1. The PSNR/SSIM values for the conventional SR methods in Table 1 are either the ones reported in their respective papers or directly calculated from their publicly available result images online. Figures 4 and 5 show reconstructed HR images and magnified portions of baby and zebra, respectively, using various SR methods for a scale factor of 4.
3.2.1 SR performance.
As shown in Table 1, SRResNet [25], the SR network with the largest number of filter parameters (about 900K), trained using ImageNet [30], shows the highest PSNR and SSIM performance among the various SR methods. Our proposed VDSR-MU and DNSR show the second and third highest performance with only 338K and 133K parameters, respectively, outperforming most of the conventional SR methods except SRResNet. It can be seen that our networks using MU are efficient, with far fewer parameters than other SR methods, while showing reasonable PSNR performance. As shown in Figures 4 and 5, the quality of the HR images reconstructed by our VDSR-MU and DNSR is comparable to that of SRResNet [25]. In particular, our VDSR-MU and DNSR were able to reconstruct clearly discerned stripes of zebra, as shown in Figure 5(i) and (j), which are comparable to Figure 5(g) of SRResNet, while other SR methods fail to do so.
3.2.2 ESPCN-MU.
In order to show the effectiveness of using MU in SR, we compare two similar networks: ESPCN [31] and our ESPCN-MU. As shown in Table 1, the number of parameters of ESPCN-MU is only about 13K, almost half the number of parameters of ESPCN [31]. While our network was trained using only the 91 images, our ESPCN-MU outperforms ESPCN [31] trained with the same 91 training images, and shows comparable performance even to ESPCN [31] trained using a larger set of images from ImageNet [30].
3.2.3 VDSR-MU.
Similar to our ESPCN-MU, our VDSR-MU outperforms VDSR [21] in terms of PSNR and SSIM, but with a much lower number of parameters. Our VDSR-MU network has about 338K parameters, about half the number of parameters of VDSR [21]. Note that both networks were trained using the same 291 images [39, 28].
3.2.4 DNSR.
Our deeper and narrower version of VDSR-MU, DNSR, has about 2/5 the number of filter parameters of VDSR-MU, which is about 1/5 of VDSR [21] and 1/7 of SRResNet [25]. Even with this low number of parameters, our DNSR network is able to reconstruct HR images comparable to those of VDSR-MU, and outperforms most of the conventional SR methods.
Activation Function | Size of Conv Filters | Number of Params | Training PSNR | Testing PSNR
ReLU [29] |  | 7.1K | 28.22 | 29.81
LReLU [27] |  | 7.1K | 28.19 | 29.78
ELU [9] |  | 7.1K | 27.91 | 29.30
MU [15] |  | 6K | 28.42 | 30.07
MU-D [15] |  | 6.4K | 28.46 | 30.07
MU-M |  | 6K | 28.43 | 30.05
MU-S |  | 7.1K | 28.46 | 30.08
MU-R |  | 6.4K | 28.38 | 29.98
3.3 Discussions
We also conducted experiments on toy networks using various activation functions, including ReLU, MU and MU variants. We show potential properties of MU compared to units used in conventional SR methods by analyzing parameter-vs.-PSNR performance and by showing activation rates in feature maps.
3.3.1 MU versus ReLU.
Figure 1 compares PSNR performance versus the number of parameters for two toy networks with ReLU and MU, respectively. Both networks share the same 6-layered SR structure, except for the type of activation function used. The number of parameters for each sub-test is controlled by adjusting the number of convolution filters. As shown, the PSNR performance gap between the networks using ReLU and MU becomes larger as the number of parameters decreases. This indicates that in narrow networks where the number of channels of feature maps is small, MU allows for stable learning, while ReLU converges towards a worse point. We argue that, because MU unlike ReLU does not distinguish between negative and positive values, the outputs of MU always carry actual feature values, alleviating the chance of creating many close-to-zero values in feature maps and failing in learning.
Figure 6 shows the average ratios of activated neurons for each feature map in the two toy networks using ReLU and MU for Bird and Butterfly, respectively (e.g. 100% activation is colored white, 0% black and 50% gray). The rows and columns indicate channels and layers, respectively. It is interesting to see that activations after MU are sparser than those after ReLU, which supports the effectiveness of MU in SR. Note that since the maximum values between two feature maps are always passed to the next layer, the feature maps after MU are always 100% activated with half the number of channels. Figure 6 may also suggest that MU is related to network pruning; this remains as our future work.
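Activation-rate maps like those in Figure 6 can be approximated with a sketch such as the one below. The function names are ours, and counting MU "activation" as which half of the channels wins the pairwise max is our own interpretation:

```python
import numpy as np

def activation_ratio_relu(feats):
    """Average fraction of non-zero activations per channel after ReLU.
    feats: list of (H, W, C) pre-activation feature maps (one per test image)."""
    return np.mean([(np.maximum(f, 0) > 0).mean(axis=(0, 1)) for f in feats], axis=0)

def selection_ratio_mu(feats):
    """How often each input channel wins the pairwise max in MU (Eq. (2))."""
    ratios = []
    for f in feats:
        c = f.shape[-1] // 2
        win1 = (f[..., :c] >= f[..., c:]).mean(axis=(0, 1))
        ratios.append(np.concatenate([win1, 1.0 - win1]))
    return np.mean(ratios, axis=0)

feats = [np.random.randn(8, 8, 64) for _ in range(5)]  # stand-in feature maps
print(activation_ratio_relu(feats).round(2))  # near 0.5 per channel for Gaussian inputs
```

Stacking such per-channel ratios over the layers yields exactly the kind of channel-by-layer gray map shown in Figure 6.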
3.3.2 MU variants.
Table 2 shows training and testing PSNR performance after the first iterations for toy networks with various activation functions. Note that the size of the convolutional filters has been adjusted for each network to yield a similar total number of parameters. Figure 7 presents a PSNR-versus-iteration plot for the networks with the various activation functions from Table 2. It can be seen in Figure 7 that the networks with MU and the MU variants converge faster than those with ReLU and the ReLU variants. Note that the network with ELU has difficulty training on the SR problem, contrary to its performance reported in classification papers. This may be due to the fact that ELU tends to distort the scales of input values, which is undesirable in regression problems such as SR. Overall, the networks with MU and its variants show higher PSNR values with fewer parameters.
4 Conclusion
The proposed SR networks showed superior PSNR performance compared to the baseline networks using ReLU and other activation functions. The SR networks using MU tend to produce higher PSNR results with a smaller number of convolutional filter parameters, which is desirable for computational platforms with limited resources. We showed that MU can be used in regression problems, especially SR, and that they have further potential for extension to new types of activation functions for other applications.
References

[1] Abadi, M., Barham, P., Chen, J., et al.: Tensorflow: A system for large-scale machine learning. In: OSDI. vol. 16, pp. 265–283 (2016)
[2] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
[3] Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding (2012)
[4] Cai, B., Xu, X., Jia, K., Qing, C., Tao, D.: Dehazenet: An end-to-end system for single image haze removal. IEEE Trans. Image Proc. 25(11), 5187–5198 (2016)
[5] Chang, J.R., Chen, Y.S.: Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583 (2015)
[6] Choi, J.S., Kim, M.: Super-interpolation with edge-orientation-based mapping kernels for low complex 2× upscaling. IEEE Trans. Image Proc. 25(1), 469–483 (2016)
[7] Choi, J.S., Kim, M.: A deep convolutional neural network with selection units for super-resolution. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. pp. 1150–1156 (2017)
[8] Choi, J.S., Kim, M.: Single image super-resolution using global regression based on multiple local linear mappings. IEEE Trans. Image Proc. 26(3), 1300–1314 (2017)
[9] Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
[10] Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: European Conf. Comp. Vis. pp. 184–199 (2014)
[11] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016)
[12] Freedman, G., Fattal, R.: Image and video upscaling from local self-examples. ACM Trans. Graph. 30(2), 12 (2011)
[13] Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. IEEE Comp. Graph. and App. 22(2), 56–65 (2002)
[14] Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: IEEE Int. Conf. Comp. Vis. pp. 349–356 (2009)
[15] Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. arXiv preprint arXiv:1302.4389 (2013)
[16] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 1026–1034 (2015)
[17] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conf. Comp. Vis. pp. 630–645 (2016)
[18] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Int. Conf. Mach. Learn. pp. 448–456 (2015)
[19] Jia, Y., Shelhamer, E., Donahue, J., et al.: Caffe: Convolutional architecture for fast feature embedding. In: Proc. ACM Int. Conf. Mul. pp. 675–678 (2014)
[20] Jianchao, Y., Wright, J., Huang, T., Ma, Y.: Image super-resolution as sparse representation of raw image patches. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. pp. 1–8 (2008)
[21] Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. pp. 1646–1654 (2016)
[22] Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior. IEEE Trans. Pattern Anal. Mach. Intell. 32(6), 1127–1133 (2010)
[23] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[24] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Adv. Neural Info. Proc. Sys. pp. 1097–1105 (2012)
[25] Ledig, C., Theis, L., Huszár, F., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. vol. 2, p. 4 (2017)
[26] Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. vol. 1, p. 3 (2017)
[27] Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. Int. Conf. Mach. Learn. vol. 30, p. 3 (2013)
[28] Martin, D., Fowlkes, C., et al.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. IEEE Int. Conf. Comp. Vis. vol. 2, pp. 416–423 (2001)
[29] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proc. Int. Conf. Mach. Learn. pp. 807–814 (2010)
[30] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. Int. Journal Comp. Vis. 115(3), 211–252 (2015)
[31] Shi, W., Caballero, J., Huszár, F., et al.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. pp. 1874–1883 (2016)
[32] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Mach. Learn. Research 15(1), 1929–1958 (2014)
[33] Swietojanski, P., Li, J., Huang, J.T.: Investigation of maxout networks for speech recognition. In: IEEE Int. Conf. Acoust. Speech Signal Proc. pp. 7649–7653 (2014)
[34] Tai, Y., Yang, J., Liu, X.: Image super-resolution via deep recursive residual network. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. vol. 1 (2017)
[35] Timofte, R., Agustsson, E., Van Gool, L., et al.: NTIRE 2017 challenge on single image super-resolution: Methods and results. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. pp. 1110–1121 (2017)
[36] Timofte, R., De Smet, V., Van Gool, L.: A+: Adjusted anchored neighborhood regression for fast super-resolution. In: Proc. Asian Conf. Comp. Vis. pp. 111–126 (2014)
[37] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Proc. 13(4), 600–612 (2004)
[38] Yang, C.Y., Huang, J.B., Yang, M.H.: Exploiting self-similarities for single frame super-resolution. In: Proc. Asian Conf. Comp. Vis. pp. 497–510 (2010)
[39] Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Trans. Image Proc. 19(11), 2861–2873 (2010)
[40] Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: Int. Conf. Curves and Surfaces. pp. 711–730 (2010)
[41] Zhang, K., Gao, X., Tao, D., Li, X.: Single image super-resolution with non-local means and steering kernel regression. IEEE Trans. Image Proc. 21(11), 4544–4556 (2012)
[42] Zhang, K., Tao, D., Gao, X., Li, X., Xiong, Z.: Learning multiple linear mappings for efficient single image super-resolution. IEEE Trans. Image Proc. 24(3), 846–861 (2015)