Given its importance, super-resolution has attracted ample attention in the image processing and computer vision community. Early approaches to super-resolution are often based upon the rationale that higher-resolution images have a frequency domain representation whose higher-order components are greater than their lower-resolution analogues. Thus, methods such as that in  exploited the shift and aliasing properties of the Fourier transform to recover a super-resolved image. Kim et al.  extended the method in  to settings where noise and spatial blurring are present in the input image. In a related development, in , super-resolution in the frequency domain is effected using Tikhonov regularization.
Alternative approaches, however, effect super-resolution by aggregating multiple frames with complementary spatial information or by relating the higher-resolved image to the lower resolution one by a sparse linear system. For instance, Baker and Kanade  formulated the problem in a regularization setting where the examples are constructed using a pyramid approach. Protter et al. 
used block matching to estimate a motion model and use exemplars to recover super-resolved videos. Yanget al.  used sparse coding to perform super-resolution by learning a dictionary that can then be used to produce the output image, by linearly combining learned exemplars.
Note that, the idea of super-solution “by example” can be viewed as hinging on the idea of learning functions so as to map a lower-resolution image to a higher-resolved one using exemplar pairs. This is right at the center of the philosophy driving deep convolutional networks, where the net is often considered to learn a non-linear mapping between the input and the output. In fact, Dong et al. present in  a deep convolutional network for single-image super-resolution which is equivalent to the sparse coding approach in . In a similar development, Kim et al.  present a deep convolutional network inspired by VGG-net . The network in  is comprised of 20 layers so as to exploit the image context across image regions. In , a multi-frame deep network for video super-resolution is presented. The network employs motion compensated frames as input and single-image pre-training.
Here, we present a computationally efficient frequency domain deep network for super-resolution which, at input, takes the frequency domain representation of the lower resolution image and, at output, recovers the residual in the frequency domain. This residual can then be added to the lower resolution input image to obtain the super-resolved image. The network presented here is somewhat related to those above, but there are a number of important differences with respect to other approaches. Its important to note that:
Following our frequency domain interpretation of deep networks, the convolutions in other networks are treated here as multiplications. This has the well known advantages of lower computational cost and added computational efficiency.
Following the frequency domain treatment to the problem as presented here, the non-linearity in the network is given by convolutions in the frequency domain. This contrasts with the work in , which employs spectral pooling instead.
10], we can learn the analogue convolutional parameters in the frequency domain in a manner akin to that used by CNNs in the spatial domain.
We use residual training since its not only known to deal with the vanishing gradients well and often improve convergence, but its also particularly well suited to our net. This is following the notion that the lower resolved image lacks the higher frequencies in high-resolution imagery and, thus, these can be learned by the network based on the residual.
Finally, we employ the Hartley transform as an alternative to the Fourier transform so as to avoid the need to process imaginary numbers.
3 Spatial and Frequency Domains
3.1 Convolutional Neural Networks
To better understand the relationship between these networks in the spatial domain and ours, which operates in the frequency domain, recall that the two-dimensional discrete convolutions at the layer can be expressed as
where denotes the feature map delivered by the previous layer, i.e. that indexed , in the network and is the convolutional kernel of order in the corresponding layer. In the equation above, we have used and as the spatial coordinates (rows and columns) in the image under consideration.
At each layer, this convolutional stage is followed by an activation function which induces the non-linearity in the behavior of the network. Nonetheless there are a number of choices of activation function, i.e. sigmoid, binary, identity, etc., the most widely used one is ReLU, which can be expressed mathematically using a product with a Heaviside function as follows
where is the Heaviside function which yields if , if and if . In the expressions above, and so as to be consistent with Equation 1, we have used as the input to the rectifier unit.
3.2 Fourier Transform
Equations 1 and 3.1 hint at the use of the convolution theorem of the Fourier transform to obtain an analogue of spatial domain CNNs in the frequency domain. Further, the convolution theorem as widely known for the Fourier transform has similar instantiations for other closely related integral transforms. For now, and for the sake of clarity, we will focus on the Fourier transform. Later on in the paper, we will elaborate further on the use of the Hartley transform as an alternative to the Fourier transform in our implementation.
The convolution theorem states that given two functions in spatial (time) domain, their convolution is given by their point-wise multiplication in the frequency domain. For the sake of consistency with Equations 1 and 3.1, let and be the two spatial domain functions under consideration and denote the Fourier transform operator as , then we have
where denotes point-wise product.
Moreover, the converse relation also holds, whereby a product in the spatial domain becomes a convolution in the frequency domain. i.e.
where we have employed to denote a generic activation function which, in our case can also be learnt. Note that the second line in the equation above follows from substituting Equation 3 into the first line.
4 Network Structure
Note that, in the Equation 3.2, the term acts as a “weighting” operation in the frequency domain. That is, through the point-wise product operation, it curtails or amplifies the frequency domain components of . These frequency weighting operations take place of the original convolution operation in spatial domain convolutional neural networks such as ImageNet . Similarly, the nonlinear operator given by the rectifier units in CNNs is now substituted, in the frequency domain, by a convolution. This can be viewed as a smoothing or regularization operation in the frequency domain.
In Figure 1, we show a single-layer instantiation of our frequency domain network. Note that, at input, the image is transformed into the frequency domain. Once this has been effected, the frequency weighting step takes place, i.e. the pointwise multiplication operation, and then a convolution is applied. Once the outputs of all the convolutional outputs are obtained, they are summed together by an additive layer and the inverse of the frequency domain transform is applied so as to obtain the final result. Its worth noting that we have not incorporated a spectral or spatial pooling layer in our network. This is not a major issue in our approach since these pooling layers are often used for classification tasks  whereas in other applications, such as super-resolution, pooling is seldom used .
As mentioned above, the product can be viewed as a frequency weighting operation equivalent to the convolution operation in time domain. As before, consider a feature map at a given layer in the network and the convolutional kernel .
For the layer under consideration, the product will take the frequency domain of the feature map as an input and point-wise multiply it by a wight matrix given by the values of . In practice, both, and
can be viewed as matrices which are the same size. This is important since it permits us to pose the problem of performing the forward pass and backpropagation steps in a manner analogous to that used in CNNs operating in the time domain.
To see this more clearly, denote as the input matrix corresponding to to the th layer of our frequency domain network. Similarly, let the weight matrix corresponding to the coefficients of be . The output of the product of the two matrices is another matrix, which we denote and whose entries indexed are given given by
where and are the entries indexed of the matrices and , respectively, and we have introduced the bias matrix with entries .
Moreover, the Fourier transform of an image, being real and non-negative, is conjugate-symmetric222It can be shown in a straightforward manner that this symmetry property also applies to the Hartley transform.. This is important since, by noting that should be Hermitian, we can reduce the number of learnt weights by half.
As shown in Figure 1, once the weighting operation is effected, a convolution in the frequency domain, analogous to the rectification operation in the spatial domain is applied. This is inspired upon Equation 3.2, which hints at the notion that we can express the ReLU as a product between a Heaviside function and its argument. Again, in practice, this can be expressed as follows
Where is the corresponding entry of the matrix as presented in the previous section and are the coefficients of the matrix containing the values of .
From the equation above is straightforward to note that the entries of the matrix are a linear combination of the values of where the matrix can be viewed as a kernel that can be learnt. Thus, here, we consider the entries of as parameters that can be updated at each back-propagation step. This, in turn, allows us to learn both, the weights in as well as the parameters in .
4.3 Additive Layer
Recall that, in applications such as super-resolution, frequency domain approaches aim at recovering or predicting a whole frequency map corresponding to either the enhanced or super-resolved image. As a result, instead of a prediction layer, our network adds the output of all the network feature maps into a single one at output and then applies a final frequency weighting operation. This additive layer can be expressed as follows
where is the number of layers in the network, denotes the Hadamard (entrywise) product, is the final frequency weighting matrix, is the prediction of our network in the frequency domain and is the weight that controls the contribution of the layer feature map to the output.
In the equation above, is given by summing over all the matrices for each layer . In our network, we do not require this sum to have independent weights since these can be absorbed, in a straightforward manner, into the matrices . Thus, for this layer, in practice, we only learn the matrix .
In Figure 2 we show the simplified diagram of our complete network. Note that the output of each layer is a frequency feature map which is the same size as the input. The output of each layer is then added and weighted by the additive layer. We then compute the spatial domain residual by applying the inverse transform to the frequency domain output of our network and add the predicted residual to the input image so as to compute the super-resolved output.
5 Implementation and Discussion
5.1 Hartley vs Fourier Transform
As mentioned earlier, the implementation of our network makes use of the Hartley transform  as an alternative to the Fourier transform. The reasons for this are twofold. Firstly, the Hartley transform is closely related to the Fourier transform and, hence, shares analogue properties. Secondly, the Hartley transform takes at input real numbers and delivers, at output, real numbers. This has the advantage that, by using the Hartley transform, we do not need to cater for complex numbers in our implementation.
Here, we have used the fast Hartley transform introduced by Bracewell . The Hartley transform can be expressed using the real and imaginary parts of the Fourier transform as follows
where we have used for the sake of consistency with previous sections. Moreover, the Hartley transform is an involution, that is, the inverse is given by itself, i.e. .
It is worth noting that, from Equation 8, its straightforward to show that, since the Fourier transform is linear as are the matrices , the weighting operation in our network applies, in a straightforward manner to the Hartley transform without any loss of generality. In the case of the Hartley transform, the convolution theorem has the same form as that of the Fourier transform , and, hence, we can write Equation 3.2 as follows
For training our network, we have used the mini-batch gradient descent method proposed by LeCun et al. . When training our network, the layers at the end, i.e. those closer to the final additive layer, tend to have a gradient that is small in magnitude as compared to the first layers. The reason being that, due to the architecture of our network, as shown in Figure 2, the first layers (those closer to the input) will accumulate the gradient contributions of the deeper layers in the network. This is a problem akin to that in , which was tackled by the VDSR 
approach by applying gradient clipping in the back propagation step. As a result, we follow  and normalize the gradient at each layer to a fixed range . Thus, the parameters at the layer can only change within a fixed range . In our implementation, we set and for all layers.
5.3 Residual and Choice of Loss Function
For our loss function, we consider a number of alternatives. These are the L1 (), L2 () and L2 with exponential decay () loss functions. These are defined as follows
where denotes the -norm under consideration, is a hyper-parameter, is the matrix corresponding to the ground-truth high resolution image in the frequency domain and accounts for the frequency domain image recovered by our network. This is yielded by the sum of the prediction of our network as given in Equation 7 and the input to our network in the frequency domain, i.e. , which can be expressed as
It is worth mentioning in passing that, in accordance with the observations made in , we also find that the use of residual learning improved the accuracy and converge speed of our network in the frequency domain. In Figure 3, we show a comparison of the three loss functions under consideration. In the figure, we show the loss values, normalized to unity, as a function of the low-resolution image scale factor of with respect to . In the figure, we have set . Note that the L2 loss is almost linear with respect to the scale factor. Moreover, in our experiments, we found that the L2 loss performed the best with respect to both, convergence and speed. As a result, all the experiments shown hereafter employ the L2 loss.
Recall that DRRN and VSDN use 200 images in Berkeley Segmentation Dataset combined with 91 images from Yang et al. . This set of images has also been used for training in other approaches [17, 25]. These methods often use techniques such as data augmentation, i.e. application of transformations such as rotation, scaling and flipping transformations to generate novel images so as to complement the ones in the dataset. It is important to note, however, that these rotation, scaling and flipping in the spatial domain become frequency shift and scaling operations. Moreover, the dataset above, comprised of 291 images and their augmentation is, in practice, too small to allow for cross-validation of parameters.
Thus, here we have opted for a two-stage process using 5000 randomly selected images from the Pascal VOC2012 dataset  for training and Set5 , Set14  and B100  for testing. The first stage corresponds to a cross-validation process of the parameters used in our network employing 800 images out of the 5000 in our dataset for training and Set14 for testing. Also, for cross-validation, we have resized the image to and used a scale factor of 2 on the dataset so as to obtain the images that are used as input to our network. After cross-validation, and once the parameters have been selected, the second stage is to proceed to train and test our network on the whole dataset and the three testing sets. In all our experiments, we have taken the color imagery and performed super-resolution on the Y channel in the YCbCr space .
6.2 Parameter Selection
We have selected, through cross-validation, the number of layers in the network, the size of the matrices and the number of weighting matrices per layer. In all our experiments we have used squared matrices ,i.e. , and chosen a base line network with , , . We have then progressively increased , and so as to explore the trade-off between timing and performance.
In Figure 4, we show the PSNR as a function of timing for the combinations of , and used in our cross-validation exercise. For the sake of clarity, the time axis is shown in a logarithmic scale. Note that, in general, the networks with or layers seem to deliver the best trade-off between performance and timing. In the figure, , and performs the best while , and also performs well. Thus, and bearing in mind that a deeper network is expected to perform better in a larger dataset, for all the experiments shown here onwards, we set , and .
6.3 Network Behavior
Now, we turn our attention to the behavior of the network in terms of the weighting and smoothing matrices. In Figure 5, we show the feature maps in the frequency domain as yielded by each of the network layers, i.e. the matrices . From the figure, we can appreciate that the feature map for the first layer mainly contains the low frequency information corresponding to the input image. As the layers go deeper, the feature maps become dominated by the high frequency components. This is expected since the aim of prediction in our net is given by the residual, which, in frequency domain, is mainly comprised by higher order frequencies.
In Figure 6, we show all the smoothing matrices in our network after the training has been completed. Surprisingly, note that matrices in each layer all appear to behave slightly different. In the first layer, the matrices are very much a delta in the frequency domain, while as the layer index increases, they develop non-null entries mainly along the central column and row. This follows the intuition that, for the first layer, the convolution would behave as a multiplicative identity removing lower-order frequency components, whereas, for further layers, the main contribution to its output is given by the central rows and columns.
Finally, we present our results on image super-resolution. To this end, we first show some qualitative results and then provide a quantitative evaluation of our network performance and testing timing.
In Figure 7, we show a detail of the results yielded by our network and a number of alternatives on the “Barbara”, “Lena” images and a sample image from the B100 dataset. In all cases, we have applied an upscale factor of 3 and show, on the left-hand panel the full ground truth image. Note that our method yields results that are quite comparable to the alternatives. Moreover, for the detail of the “Barbara” image, the aliasing does not have a detrimental effect on the results. This contrasts with LapSRN, where the scarf stripes are over enhanced.
In Table 1 we show the performance of our network as compared to the alternatives. Here, we have used the average per-image peak signal-to-noise ratio (PSNR)  and the structural similarity index (SSIM) . We have chosen these two image quality metrics due to a couple of reasons. Firstly, these have been used extensively for the evaluation of super-resolution results elsewhere in the literature. Secondly, the PSNR is a signal processing approach based upon the mean-squared error whereas the SSIM is a structural similarity measure. From the table, we can observe that despite LapSRN is the best performer, our method is often no more than 2 decibels below LapSRN in terms of the PSNR and within a 0.05 difference in the SSIM.
Further, in Figure 8, we show the average per-image testing time, in milliseconds, for the three test datasets under consideration and three upscale factors, i.e. , and . For all testing datasets our network far outperforms the alternatives, being approximately an order of magnitude faster than LapSRN and more than two orders of magnitude faster than SRCNN.
In this paper, we have presented a computationally efficient frequency domain neural network for super-resolution. The network can be viewed as a frequency domain analogue of spatial domain CNNs. To our knowledge, this is the first network of its kind, where rectifier units in the spatial domain are substituted by convolutions in the frequency domain and vice versa. Moreover, the network is quite general in nature and well suited for other applications in computer vision and image processing which are traditionally tackled in the frequency domain. We have presented results an comparison with alternatives elsewhere in the literature. In our experiments, our network is up to more than two orders of magnitude faster than the alternatives with an imperceptible loss of performance.
-  S. Baker and T. Kanade. Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1167–1183, 2002.
-  M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
-  T. Bishop, S. Zanetti, and P. Favaro. Light field superresolution. In IEEE International Conference on Computational Photography, 2009.
-  N. K. Bose, H. C. Kim, and H. M. Valenzuela. Recursive implementation of total least squares algorithm for image reconstruction from noisy, undersampled multiframes. In IEEE Conference on Acoustics, Speech and Signal Processing, volume 5, pages 269–272, 1993.
-  R. N. Bracewell. The fast hartley transform. Proceedings of the IEEE, 72(9):1010–1018, 1984.
-  C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2016.
-  P. E. Eren, M. I. Sezan, and A. M. Tekalp. Robust, object-based high resolution image reconstruction from low-resolution video. IEEE Transactions on Image Processing, 6(10):1446–1451, 1997.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
-  S. Farsiu, D. Robinson, M. Elad, and P. Milanfar. Fast and robust multi-frame super-resolution. IEEE Transactions on Image Processing, 13:1327–1344, 2003.
X. Glorot, A. Bordes, and Y. Bengio.
Deep sparse rectifier neural networks.
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
-  R. Hartley. A more symmetrical fourier analysis applied to transmission problems. 30:144 – 150, 04 1942.
J.-B. Huang, A. Singh, and N. Ahuja.
Single image super-resolution from transformed self-exemplars.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
-  A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
-  J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
-  S. P. Kim, N. K. Bose, and H. M. Valenzuela. Recursive reconstruction of high resolution image from noisy undersampled multiframes. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(6):1013–1027, 1990.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
-  W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
R. Pascanu, T. Mikolov, and Y. Bengio.
On the difficulty of training recurrent neural networks.In
International Conference on Machine Learning, pages 1310–1318, 2013.
-  C. Poynton. Digital video and HD: Algorithms and Interfaces. Elsevier, 2012.
-  M. Protter and M. Elad. Super resolution with probabilistic motion estimation. IEEE Transactions on Image Processing, 18(8):1899–1904, 2009.
-  O. Rippel, J. Snoek, and R. P. Adams. Spectral representations for convolutional neural networks. In Advances in Neural Information Processing Systems, pages 2449–2457, 2015.
-  D. Salomon. Data compression: the complete reference. Springer Science & Business Media, 2004.
-  S. Schulter, C. Leistner, and H. Bischof. Fast and accurate image upscaling with super-resolution forests. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3791–3799, 2015.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
-  Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network.
-  R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pages 111–126. Springer, 2014.
-  R. Y. Tsai and T. S. Huang. Multipleframe image restoration and registration. In Advances in Computer Vision and Image Processing, pages 317–339, 1984.
-  Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE transactions on image processing, 19(11):2861–2873, 2010.
-  R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International conference on curves and surfaces, pages 711–730. Springer, 2010.