Super-resolution (SR) methods aim to reconstruct high-resolution (HR) image or video contents from their low-resolution (LR) versions. The SR problem is known to be highly ill-posed: without proper prior information, a single LR input can correspond to multiple plausible HR versions. As SR has recently become crucial in various areas such as up-scaling full-high-definition (FHD) content to 4K, it is important to develop SR algorithms that can generate HR contents with superior visual quality while maintaining reasonable complexity and moderate numbers of parameters.
1.1 Related work
SR methods can be divided into two families according to their input types: single image SR (SISR) and video SR. While both spatial and temporal information can be used in video SR, SISR utilizes only spatial information within given LR images, making the SR problem more difficult [12, 31]. In this paper, we mainly focus on SISR.
Various SR methods employed the following techniques in reconstructing HR images of high quality: sparse representation [20, 22, 39], linear mappings [6, 8, 36, 41, 42], self-examples [12, 13, 14, 38], and neural networks [7, 10, 11, 21, 25, 26, 31, 34, 35]. Sparse-representation-based SR methods [20, 22, 39] undergo heavy computations to calculate the sparse representation of an LR patch over a pre-trained and complex LR dictionary. The resultant sparse representation is then applied to a corresponding HR dictionary to reconstruct its HR version. Some SR methods [12, 13, 14, 38] extracted LR-to-HR mappings by searching for patches similar to the current patch (self-examples) inside its self-dictionary. Linear-mapping-based SR methods (LMSR) [6, 8, 36, 41, 42] have been proposed to obtain HR images of comparable quality but with much lower computational complexity. The adjusted anchored neighborhood regression (A+) method [36] searches for the best linear mapping for each LR patch, based on its correlation with pre-trained dictionary sets. Choi et al. [6, 8] employ simple edge classification to find suitable linear mappings, which are applied directly to small LR patches to reconstruct their HR versions.
Recently, SR methods using convolutional neural networks (CNN) [7, 10, 11, 21, 25, 26, 31, 34, 35] have shown high PSNR performance. Dong et al. [10, 11] first utilized a 3-layered CNN for SR (SRCNN), and reported a remarkable performance jump compared to previous SR methods. Kim et al. [21] then proposed a very deep 20-layered CNN (VDSR) with gradient clipping and residual learning, yielding reconstructed HR images of even higher PSNR than SRCNN. Shi et al. [31] proposed a network structure where features are extracted in LR space, and the feature maps at the last layer are up-scaled to HR space using a sub-pixel convolution layer. Recursive convolutions were also used in [34] to lower the number of parameters. Ledig et al. [25] presented two SR network structures: a network using residual units to maximize PSNR performance (SRResNet), and a network using generative adversarial networks for perceptual improvement (SRGAN). Lately, some SR methods using very deep networks [7, 26, 35] with large numbers of parameters have been proposed in the NTIRE 2017 Challenge [35], achieving state-of-the-art PSNR performance.
In these deep learning-based SR methods, rectified linear units (ReLU) [29] are used to obtain nonlinearity between two adjacent convolutional layers. ReLU is a simple function that applies an identity mapping to positive values and maps negative values to zero. Unlike a sigmoid or hyperbolic tangent, ReLU does not suffer from the gradient vanishing problem. By using ReLU, networks can learn piece-wise linear mappings between LR and HR images, which results in mappings of high visual quality and faster training convergence. There are other nonlinear activation functions such as leaky ReLU (LReLU) [27], parametric ReLU [16] and exponential linear units (ELU) [9], but unlike ReLU they are not often used in regression problems. While LReLU replaces the zero part of ReLU with a linearity of a certain small slope, parametric ReLU parameterizes this slope so that a network can learn it. ELU is designed to push mean unit activations closer to zero for faster learning.
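As a quick reference for the comparisons that follow, the activation functions discussed above can be sketched with NumPy. This is a minimal illustration only; the LReLU slope of 0.01 and the ELU alpha of 1.0 are common defaults, not values prescribed by this paper.

```python
import numpy as np

def relu(x):
    # Identity for positive values, zero for negative values.
    return np.maximum(x, 0.0)

def lrelu(x, slope=0.01):
    # Leaky ReLU: a small linear slope on the negative part.
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # ELU: smooth exponential negative part that pushes mean activations toward zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))   # negative inputs become 0
print(lrelu(x))  # negative inputs are scaled by 0.01
print(elu(x))    # negative inputs saturate toward -alpha
```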
1.2 Motivations and contributions
One major reason for the high performance of neural networks in many applications [7, 10, 11, 21, 25, 26, 31, 34, 35] is the use of ReLU [29] and its successors. These nonlinear units were first introduced in classification papers [2, 9, 16, 18, 19, 24, 27, 29], and were subsequently reused for regression problems such as SR. It can easily be noticed that while ReLU and LReLU functions have been frequently used in SR, it is hard to find other types of activation functions in SR networks. This is because such functions tend to distort the scales of input values (more in Section 3.3), and thus networks using them generate HR results of lower quality compared to those with ReLU. This phenomenon can also be observed in normalization layers such as batch normalization [18] and layer normalization [2], and there have been reports that these normalization layers degrade performance when used in regression problems [7, 26].
In this paper, we try to tackle some limitations of ReLU: i) ReLU produces feature maps with many zeros, whose number is not controllable; ii) therefore, learning with ReLU tends to collapse in a network with very deep layers without some help such as identity mappings [17]; and iii) there could be a way to make use of those empty zero values so that the number of channels can be reduced for lower memory consumption and fewer computations.
Maxout units (MU) [15] are activation units that can overcome the aforementioned limitations. MU were first introduced in various classification problems [5, 15, 33]. Goodfellow et al. [15] proposed MU and used them in conjunction with dropout [32] in a multi-layer perceptron (MLP), showing competitive classification results compared to those using conventional ReLU. In [33], MU were used for speech recognition, and networks with MU were reported to converge about three times faster in training with comparable performance. In addition, Chang et al. [5] reported a network-in-network structure using MU for classification, which was able to mitigate the vanishing gradient problem that can occur when using ReLU. Although networks using MU are known to work well in high-level vision areas, only a few works [4] have employed MU for regression problems. In this paper, we develop and present a novel SR network incorporating MU. Our contributions are as follows:
Contrary to the common thought that the number of parameters needs to be doubled when using MU, we first reveal that MU can effectively be incorporated into restoration problems. In our SR network with MU, the number of channels of the input feature maps is halved while still yielding good results, leading to less memory usage and lower computational costs.
We present a deep analysis of networks using basic MU, and further investigate other MU variants, showing their effectiveness for the SR application.
Various experimental results show that our SR networks incorporating MU as activation functions are able to reconstruct HR images of competitive quality compared to those using ReLU. Figure 1 shows a comparison of PSNR performance versus the number of parameters for two toy network examples with ReLU and MU, respectively. Both networks share the same 6-layered SR structure, except for the type of activation functions used.
2 Maxout units
First, let us denote the outputs of the l-th convolutional layer as x_l, where the network has L convolutional layers in total. Also, we denote the outputs of an activation function applied to x_l as y_l.
2.1 Conventional nonlinear activation functions
Many SR methods [7, 10, 11, 21, 25, 26, 31, 34, 35] often use ReLU [29] as the activation function between every two convolutional layers to obtain high nonlinearity between LR and HR. After each ReLU, the negative part of the feature maps becomes zero as

y_l = max(x_l, 0),

where max() calculates maximum values between its two inputs in an element-wise fashion. The negative parts, where inputs become zero, ensure nonlinearity, while the positive parts allow for fast learning as the derivative there is unity. However, very deep or narrow networks may have difficulty in learning when too many values fall into the negative range and become zero. While ReLU variants such as LReLU [27] and ELU [9] try to overcome this limitation by modifying the negative parts, they still have little control over the ratio of negative values.
2.2 Maxout unit
To overcome these limitations, we propose an SR network structure incorporating MU.
2.2.1 Basic MU.
The original MU [15] computes the maximum over a vector of any length. Here, we use a special case of MU, where the feature maps x_l are halved along the channel dimension into two parts x_l^1 and x_l^2, and the element-wise maximum of these two parts is calculated as:

y_l = max(x_l^1, x_l^2).
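This halve-and-max operation can be sketched in NumPy as follows. It is a minimal illustration of the channel split; the channel-last (H, W, C) layout is an assumption made here for concreteness.

```python
import numpy as np

def maxout(x):
    # x: feature maps of shape (H, W, C) with an even channel count C.
    # Split along the channel axis into two halves and take the element-wise max.
    c = x.shape[-1]
    assert c % 2 == 0, "channel count must be even"
    x1, x2 = x[..., :c // 2], x[..., c // 2:]
    return np.maximum(x1, x2)

x = np.random.randn(8, 8, 64)
y = maxout(x)
print(y.shape)  # (8, 8, 32): channel count halved
```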
2.2.2 Difference of two MU.
A difference of two MU was also introduced in [15], with the proposition that any continuous piece-wise linear function can be expressed as a difference of two convex piece-wise linear functions. In this paper, we use the form:

y_l = max(x_l^1, x_l^2) - max(x_l^3, x_l^4),

where x_l is equally divided into four parts x_l^1, x_l^2, x_l^3 and x_l^4. Note that after this activation function, the number of channels of the input feature maps is reduced to a quarter. We denote this MU variant as MU-D. Incorporating a simple max function between two sets of feature maps provides nonlinearity with various properties, as follows:
- MU simply transfers feature map values from the input layer to the next, acting like the linear parts of ReLU. In backpropagation, error gradients simply flow to the selected (maximum) values.
- Because MU does not distinguish negative from positive values unlike ReLU, the outputs of MU always have certain values, reducing the chance of creating many close-to-zero values in the feature maps and failing in learning.
- In narrow networks, where the number of channels of the feature maps is small, MU allows for stable learning, while networks with ReLU may converge poorly.
- MU always ensures 50% sparsity: 50% of the feature map values (the larger ones) are always selected and transmitted to the next layer, while the other 50% are not used. In backpropagation, there are always 50% of the paths alive for error gradients to be back-propagated.
- As the output of MU contains only 50% of the previous feature map values, the number of convolutional filter parameters in the next layer can be reduced by half, lowering both computation time and memory consumption. In addition, unlike ReLU, MU is able to compress the given feature maps by stopping the transmission of close-to-zero values, improving network compactness while preserving the needed information.
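The parameter saving in the layer following an MU can be checked with simple arithmetic. This is a sketch only; biases are ignored, and the 3×3 kernel and 64-channel width follow the networks described later in Section 2.4.

```python
def conv_params(k, c_in, c_out):
    # Number of weights in a k x k convolution layer, biases ignored.
    return k * k * c_in * c_out

# With ReLU, a 64-channel feature map stays 64 channels wide,
# so the next 3x3 convolution with 64 filters needs:
relu_next = conv_params(3, 64, 64)

# With MU, the 64 channels are halved to 32 before the next convolution:
mu_next = conv_params(3, 32, 64)

print(relu_next, mu_next)  # 36864 18432: the MU path needs half the weights
```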
We demonstrate the effectiveness of MU through various experiments in Section 3. Based on the properties of MU, we further investigate other variants of MU.
2.3 MU variants
From MU, variants can be designed while preserving similar properties: minimum (MU-M), sorting (MU-S), and recursive (MU-R).
Instead of using the max function, one can design an activation function with the min function as

y_l = min(x_l^1, x_l^2),

where min() calculates minimum values between its two inputs in an element-wise fashion. In training, this variant works similarly to the original MU. We denote this MU variant as MU-M.
If we are to maintain the size of the feature maps as ReLU does, we can employ both the max and min functions in one activation function as

y_l = cat(max(x_l^1, x_l^2), min(x_l^1, x_l^2)),

where cat() concatenates all inputs along the channel dimension. We denote this MU variant as MU-S.
By applying MU recursively n times before the convolutions of the next layer, we can enforce further sparsity, e.g. 75%, resulting in reduced feature maps as outputs. This can be expressed as

y_l = MU^n(x_l),

where MU^n indicates the n-times repeated MU, whose output channels are reduced by a factor of 2^n. We denote this MU variant as MU-R.
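The variants can be sketched alongside the basic MU as follows. This is a minimal NumPy illustration; the channel-last layout is again an assumption.

```python
import numpy as np

def maxout(x):
    # Basic MU: element-wise maximum of the two channel halves.
    c = x.shape[-1]
    x1, x2 = x[..., :c // 2], x[..., c // 2:]
    return np.maximum(x1, x2)

def mu_m(x):
    # MU-M: element-wise minimum of the two channel halves.
    c = x.shape[-1]
    x1, x2 = x[..., :c // 2], x[..., c // 2:]
    return np.minimum(x1, x2)

def mu_s(x):
    # MU-S: concatenate max and min, preserving the channel count.
    return np.concatenate([maxout(x), mu_m(x)], axis=-1)

def mu_d(x):
    # MU-D: difference of two MU over four equal channel parts.
    x1, x2, x3, x4 = np.split(x, 4, axis=-1)
    return np.maximum(x1, x2) - np.maximum(x3, x4)

def mu_r(x, n):
    # MU-R: apply MU recursively n times; channels shrink by a factor of 2**n.
    for _ in range(n):
        x = maxout(x)
    return x

x = np.random.randn(8, 8, 64)
print(mu_s(x).shape)     # (8, 8, 64): channel count preserved
print(mu_d(x).shape)     # (8, 8, 16): channel count quartered
print(mu_r(x, 2).shape)  # (8, 8, 16): 75% of channels removed
```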
Figure 2 illustrates the various activation functions, including MU and MU variants. Through additional experiments using the MU variants, we confirmed that networks with the variants could be trained well as shown in Section 3.
2.4 Network details
By incorporating MU and its variants, we propose multiple network structures as shown in Figure 3, and show their performance for SR applications.
2.4.1 Toy networks.
In order to conduct many quick validations comparing the effects of multiple activation function variants including MU, we present a baseline toy network structure that is shared for testing all types of activation functions. The toy networks were trained using a smaller training dataset [39]. Our toy networks include three types of layers: six 3×3 convolutional layers, one type of activation function, and one sub-pixel convolution layer [31] at the end for up-scaling. For the convolutional layers, we simply use a kernel size of 3×3, where the input feature maps are zero-padded before convolution so that the size of the feature maps is preserved until the last sub-pixel convolution layer. The experimental results obtained using the toy networks are presented in Figures 1, 6 and 7 and Table 2.
For comparison, several state-of-the-art SR network structures [21, 31] are implemented as stated in their papers, but using MU and some modifications. Our first SR network using MU is based on ESPCN [31]. We replace all ReLU layers in [31] with MU. A 5-3-3 model [31] is also used in our network, with 64 filters for the first convolutional layer and 32 filters for the second. Note that due to the characteristic of MU that the number of channels is halved after activation, the number of filter parameters of our network is reduced to almost half of that of ESPCN [31]. In addition, we learn the residual between the original HR images and interpolated LR images as in [21], but we use nearest-neighbor interpolation instead of bicubic to make the SR problem harder and thus focus mainly on the capability of the activation function types. In doing so, the networks also converge faster. Due to its small number of parameters, our network is trained on a small training data set [39], but still produces SR results comparable to those of [31].
Table 1 (excerpt): numbers of filter parameters for the compared SR methods and our networks.

| Methods | Bicubic | SRCNN* [11] | ESPCN* [31] | ESPCN* [31] | VDSR [21] | SRResNet [25] |
|---|---|---|---|---|---|---|
| # of Params | - | 57K | 25K | 25K | 665K | 923K |

*Results for the 9-5-5 model of SRCNN and results of ESPCN using ReLU are reported.

| Methods (ours) | ESPCN-MU | VDSR-MU | DNSR |
|---|---|---|---|
| # of Params | 13K | 338K | 133K |
In addition, we propose another SR network using MU based on the 20-layered VDSR [21]. Similar to [21], 20 convolutional layers with 3×3-sized filters are used in our network, and we replace all ReLU layers in [21] with MU. As with ESPCN-MU, the number of filter parameters of our network is reduced to almost half of that of VDSR [21]. Also, we use nearest-neighbor interpolation instead of bicubic, and a sub-pixel convolution layer [31] for faster computation. Due to its large number of parameters, our VDSR-MU network was trained using a larger data set combining [39] and [28], as in VDSR [21].
We also present a deeper and narrower version of VDSR-MU, called DNSR. While VDSR-MU has 20 layers with 64 channels, DNSR has 30 layers (deeper) with 32 channels (narrower). Due to its deeper structure, we also employ residual units [17] in DNSR for stable learning. DNSR has a smaller number of total filter parameters, about 1/5 of that of VDSR [21] and about 1/2.6 of that of VDSR-MU, while showing PSNR performance similar to VDSR-MU.
3 Experiment results
We now demonstrate the effectiveness of MU and its variants in the SR framework on popular image datasets, compared to conventional SR deep networks with common nonlinear activation functions, including ReLU.
3.1 Experiment settings
Two popular datasets [39, 28] were used for training the networks, with their images used as the original HR images. Before being fed into the networks, LR-HR training images are normalized to [0, 1], and then 0.5 is subtracted from the LR training images to give them a zero mean. LR input images were created from these HR images by applying nearest-neighbor interpolation. The SR process is applied only on the Y channel of the YCbCr color space, and the chroma components, Cb and Cr, are up-scaled using simple bicubic interpolation. When comparing SR output images with the original HR images, performance measures such as PSNR were computed on the Y channel.
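The preprocessing steps above can be sketched as follows. This is a hedged illustration, not the authors' exact code: the BT.601 luma weights are a common convention for the RGB-to-Y conversion and are an assumption here, and the nearest-neighbor downscaling is implemented as simple decimation.

```python
import numpy as np

def rgb_to_y(img):
    # Luma channel; BT.601 weights are an assumed convention here.
    return 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]

def make_lr(hr_y, scale):
    # Nearest-neighbor downscaling, sketched as keeping every `scale`-th pixel.
    return hr_y[::scale, ::scale]

def preprocess_pair(hr_rgb, scale):
    # Normalize to [0, 1]; the LR input is additionally zero-centered.
    hr_y = rgb_to_y(hr_rgb.astype(np.float64) / 255.0)
    lr_y = make_lr(hr_y, scale) - 0.5
    return lr_y, hr_y

hr = (np.random.rand(80, 80, 3) * 255).astype(np.uint8)
lr_y, hr_y = preprocess_pair(hr, scale=4)
print(lr_y.shape, hr_y.shape)  # (20, 20) (80, 80)
```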
The training set of 91 images [39] has frequently been used in various SR methods [10, 21, 31, 39]. The dataset consists of images of small resolutions but with a variety of texture types. In our experiments, this smaller training set was used for the various toy networks in order to conduct many fast experiments, and was also used for our ESPCN-MU.
The Berkeley Segmentation Dataset (BSD) [28] has also often been used in SR works [21, 31]. This dataset includes 200 training images and 100 testing images for segmentation. As in VDSR [21], we utilize the 200 training images of BSD together with the 91 images from [39] for training. This larger set was used for training VDSR-MU and DNSR. All the networks, including our proposed networks with MU, were implemented using TensorFlow [1], a deep learning toolbox for Python, and were trained and tested on an Nvidia Titan Xp GPU.
The toy networks were trained for iterations, where the learning rate was lowered by a factor of 10 after iterations. The mini-batch size was set to 2, weight decay was not used, and simple data augmentation with flips and rotations was applied. For sub-images, LR-HR training image pairs were randomly cropped to the size of 40×40 for a scale factor of 4.
Our ESPCN-MU, VDSR-MU and DNSR networks were trained for iterations, where the learning rate was lowered by a factor of 10 after iterations. The mini-batch size was set to 4, and weight decay was not used. To create sub-images for training, LR-HR training image pairs were randomly cropped to the sizes of 75×75 and 76×76 in HR space for scale factors of 3 and 4, respectively. We apply various data augmentations to the HR images, such as flipping, rotating, mirroring, and randomly multiplying their intensities by a value in the range from 0.8 to 1.2. Data augmentations are done on the fly at every epoch in training to reduce overfitting.
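The on-the-fly augmentation described above can be sketched as follows. The exact sampling scheme (50% flip probability, uniform choice of rotation) is an assumption for illustration, not a detail stated in the paper.

```python
import numpy as np

def augment(patch, rng):
    # Random vertical flip and horizontal mirror.
    if rng.random() < 0.5:
        patch = np.flipud(patch)
    if rng.random() < 0.5:
        patch = np.fliplr(patch)
    # Random rotation by a multiple of 90 degrees.
    patch = np.rot90(patch, k=rng.integers(0, 4))
    # Random intensity scaling in [0.8, 1.2], clipped back to [0, 1].
    patch = np.clip(patch * rng.uniform(0.8, 1.2), 0.0, 1.0)
    return patch

rng = np.random.default_rng(0)
hr_patch = np.random.rand(76, 76)
aug = augment(hr_patch, rng)
print(aug.shape)  # (76, 76)
```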
3.2 SR results
First, we show SR results using our three proposed SR networks, ESPCN-MU, VDSR-MU and DNSR, and compare them with state-of-the-art methods, including SRCNN [11], ESPCN [31], VDSR [21] and SRResNet [25]. Table 1 summarizes performance details for all the SR methods, including their numbers of filter parameters, the training sets used, and PSNR and SSIM [37] values for scale factors of 3 and 4, tested on three popular testing datasets. For SRCNN [11], the reported results of the 9-5-5 model are shown. For ESPCN [31], the reported results using ReLU for two different training datasets are shown in Table 1. The PSNR/SSIM values for the conventional SR methods in Table 1 are either the ones reported in their respective papers, or directly calculated from their publicly available result images online. Figures 4 and 5 show the reconstructed HR images and their magnified portions of baby and zebra, respectively, using various SR methods for a scale factor of 4.
3.2.1 SR performance.
As shown in Table 1, SRResNet [25], the SR network with the largest number of filter parameters (about 900K), trained using ImageNet [30], shows the highest PSNR and SSIM performance among the various SR methods. Our proposed VDSR-MU and DNSR show the second and third highest performance with only 338K and 133K parameters, respectively, outperforming most of the conventional SR methods except SRResNet. Our networks using MU thus offer good efficiency with far fewer parameters than other SR methods, while showing reasonable PSNR performance. As shown in Figures 4 and 5, the quality of the reconstructed HR images using our VDSR-MU and DNSR is comparable to that of SRResNet [25]. In particular, our VDSR-MU and DNSR were able to reconstruct clearly discerned stripes of zebra, as shown in Figure 5-(i) and (j), which are comparable to the SRResNet result in Figure 5-(g), while the other SR methods fail to do so.
To show the effectiveness of using MU in SR, we compare two similar networks: ESPCN [31] and our ESPCN-MU. As shown in Table 1, the number of parameters of ESPCN-MU is only about 13K, almost half the number of parameters of ESPCN [31]. Trained on the same 91 images, our ESPCN-MU outperforms ESPCN [31] trained with the 91 training images, and shows comparable performance even to the ESPCN [31] that was trained using a larger set of images from ImageNet [30].
Similarly to ESPCN-MU, our VDSR-MU outperforms VDSR [21] in terms of PSNR and SSIM, but with a much lower number of parameters. Our VDSR-MU network has about 338K parameters, about half the number of parameters of VDSR [21]. Note that both networks were trained using the same 291 images [28, 39].
Our deeper and narrower version of VDSR-MU, DNSR, has about 2/5 of the number of filter parameters of VDSR-MU, which is about 1/5 of that of VDSR [21] and 1/7 of that of SRResNet [25]. Even with this low number of parameters, our DNSR network was able to reconstruct HR images comparable to VDSR-MU, and outperforms most of the conventional SR methods.
Table 2 (column headers): | Activation Function | Size of Conv Filters | Number of Params | Training PSNR | Testing PSNR |
3.3 Analyses on toy networks
We also conducted experiments on toy networks using various activation functions including ReLU, MU and the MU variants. We show the potential properties of MU compared to the units used in conventional SR methods, by analyzing parameter-versus-PSNR performance and by showing activation ratios in feature maps.
3.3.1 MU versus ReLU.
Figure 1 shows a comparison of PSNR performance versus the number of parameters for two toy networks with ReLU and MU, respectively. Both networks share the same 6-layered SR structure, except for the type of activation functions used. The number of parameters for each subtest is controlled by adjusting the number of convolutional filters. As shown, the PSNR performance gap between the networks using ReLU and MU becomes larger as the number of parameters decreases. This indicates that in narrow networks, where the number of channels of the feature maps is small, MU allows for stable learning, while ReLU converges towards a worse point. We argue that because MU does not distinguish negative from positive values unlike ReLU, the outputs of MU always have certain values, reducing the chance of creating many close-to-zero values in the feature maps and failing in learning.
Figure 6 shows the average ratios of activated neurons for each feature map in the two toy networks using ReLU and MU for Bird and Butterfly, respectively (100% activation is colored white, 0% black, and 50% gray). The rows and columns indicate channels and layers, respectively. It is interesting to see that the activations after MU are sparser than those of ReLU, which supports the effectiveness of MU in SR. Note that since the maximum values between two feature maps are always passed to the next layer, the feature maps after MU are always 100% activated with half the number of feature maps. Figure 6 also suggests that MU may be related to network pruning, which remains as our future work.
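The activation ratios visualized in Figure 6 can be measured with a simple per-channel count. This is a sketch under stated assumptions: `feat` is a hypothetical post-activation feature map, and a value is treated as activated when it is nonzero.

```python
import numpy as np

def activation_ratio(feat):
    # Fraction of nonzero entries per channel; returns shape (C,).
    mask = feat != 0
    return mask.mean(axis=(0, 1))

pre = np.random.randn(32, 32, 8)

# After ReLU, roughly half of random pre-activations are zeroed out:
post_relu = np.maximum(pre, 0.0)
print(activation_ratio(post_relu).round(2))

# After MU, the surviving half-width map is (almost surely) fully nonzero:
c = pre.shape[-1]
post_mu = np.maximum(pre[..., :c // 2], pre[..., c // 2:])
print(activation_ratio(post_mu).round(2))  # close to 1.0 for each channel
```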
3.3.2 MU variants.
Table 2 shows the training and testing PSNR performance after the first iterations for toy networks with various activation functions. Note that the size of the convolutional filters has been adjusted for each network to yield a similar number of total parameters. Figure 7 presents a PSNR-versus-iteration plot for the networks with the various activation functions from Table 2. It can be seen in Figure 7 that the networks with MU and the MU variants converge faster than those with ReLU and the ReLU variants. Note that the network with ELU has difficulty training on the SR problem, contrary to its performance reported in classification papers. This may be because ELU tends to distort the scales of input values, which is undesirable in regression problems such as SR. Overall, the networks with MU and its variants show higher PSNR values with fewer parameters.
4 Conclusion
The proposed SR networks showed superior PSNR performance compared to the base networks using ReLU and other activation functions. The SR networks using MU tend to produce higher PSNR results with a smaller number of convolutional filter parameters, which is desirable for computational platforms with limited resources. We showed that MU can be used in regression problems, especially SR, and that they have potential for further extension to new types of activation functions for other applications.
References

-  Abadi, M., Barham, P., Chen, J., et al.: Tensorflow: A system for large-scale machine learning. In: OSDI. vol. 16, pp. 265–283 (2016)
-  Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
-  Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding (2012)
-  Cai, B., Xu, X., Jia, K., Qing, C., Tao, D.: Dehazenet: An end-to-end system for single image haze removal. IEEE Trans. Image Proc. 25(11), 5187–5198 (2016)
-  Chang, J.R., Chen, Y.S.: Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583 (2015)
-  Choi, J.S., Kim, M.: Super-interpolation with edge-orientation-based mapping kernels for low complex 2× upscaling. IEEE Trans. Image Proc. 25(1), 469–483 (2016)
-  Choi, J.S., Kim, M.: A deep convolutional neural network with selection units for super-resolution. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. pp. 1150–1156 (2017)
-  Choi, J.S., Kim, M.: Single image super-resolution using global regression based on multiple local linear mappings. IEEE Trans. Image Proc. 26(3), 1300–1314 (2017)
-  Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289 (2015)
-  Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: European Conf. Comp. Vis. pp. 184–199 (2014)
-  Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016)
-  Freedman, G., Fattal, R.: Image and video upscaling from local self-examples. ACM Trans. on Graph. 30(2), 12 (2011)
-  Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. IEEE Comp. Graph. and App. 22(2), 56–65 (2002)
-  Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: IEEE Int. Conf. Comp. Vis. pp. 349–356 (2009)
-  Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. arXiv preprint arXiv:1302.4389 (2013)
-  He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 1026–1034 (2015)
-  He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conf. Comp. Vis. pp. 630–645 (2016)
-  Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Int. Conf. Mach. Learn. pp. 448–456 (2015)
-  Jia, Y., Shelhamer, E., Donahue, J., et al.: Caffe: Convolutional architecture for fast feature embedding. In: Proc. ACM Int. Conf. Mul. pp. 675–678 (2014)
-  Jianchao, Y., Wright, J., Huang, T., Ma, Y.: Image super-resolution as sparse representation of raw image patches. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. pp. 1–8 (2008)
-  Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. pp. 1646–1654 (2016)
-  Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior. IEEE Trans. Pattern Anal. Mach. Intell. 32(6), 1127–1133 (2010)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Adv. Neural Info. Proc. Sys. pp. 1097–1105 (2012)
-  Ledig, C., Theis, L., Huszár, F., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. vol. 2, p. 4 (2017)
-  Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. vol. 1, p. 3 (2017)
-  Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. Int. Conf. Mach. Learn. vol. 30, p. 3 (2013)
-  Martin, D., Fowlkes, C., et al.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. IEEE Int. Conf. Comp. Vis. vol. 2, pp. 416–423 (2001)
-  Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proc. Int. Conf. Mach. Learn. pp. 807–814 (2010)
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. Journal Comp. Vis. 115(3), 211–252 (2015)
-  Shi, W., Caballero, J., Huszár, F., et al.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. pp. 1874–1883 (2016)
-  Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Mach. Learn. Research 15(1), 1929–1958 (2014)
-  Swietojanski, P., Li, J., Huang, J.T.: Investigation of maxout networks for speech recognition. In: IEEE Int. Conf. Acoust. Speech Signal Proc. pp. 7649–7653 (2014)
-  Tai, Y., Yang, J., Liu, X.: Image super-resolution via deep recursive residual network. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. vol. 1 (2017)
-  Timofte, R., Agustsson, E., Van Gool, L., et al.: Ntire 2017 challenge on single image super-resolution: Methods and results. In: Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops. pp. 1110–1121 (2017)
-  Timofte, R., De Smet, V., Van Gool, L.: A+: Adjusted anchored neighborhood regression for fast super-resolution. In: Proc. Asian Conf. Comp. Vis. pp. 111–126 (2014)
-  Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Proc. 13(4), 600–612 (2004)
-  Yang, C.Y., Huang, J.B., Yang, M.H.: Exploiting self-similarities for single frame super-resolution. In: Proc. Asian Conf. Comp. Vis. pp. 497–510 (2010)
-  Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Trans. Image Proc. 19(11), 2861–2873 (2010)
-  Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: Int. Conf. Curves and Surfaces. pp. 711–730 (2010)
-  Zhang, K., Gao, X., Tao, D., Li, X.: Single image super-resolution with non-local means and steering kernel regression. IEEE Trans. Image Proc. 21(11), 4544–4556 (2012)
-  Zhang, K., Tao, D., Gao, X., Li, X., Xiong, Z.: Learning multiple linear mappings for efficient single image super-resolution. IEEE Trans. Image Proc. 24(3), 846–861 (2015)