
Single Image Internal Distribution Measurement Using Non-Local Variational Autoencoder

by Yeahia Sarker, et al.

Deep learning-based super-resolution methods have shown great promise, especially for single image super-resolution (SISR) tasks. Despite the performance gain, these methods are limited by their reliance on copious data for model training. In addition, supervised SISR solutions rely on local neighbourhood information, focusing only on feature learning processes for the reconstruction of low-dimensional images. Moreover, they fail to capitalize on global context due to their constrained receptive field. To combat these challenges, this paper proposes a novel image-specific solution, namely the non-local variational autoencoder (NLVAE), to reconstruct a high-resolution (HR) image from a single low-resolution (LR) image without the need for any prior training. To harvest maximum details for various receptive regions and high-quality synthetic images, NLVAE is introduced as a self-supervised strategy that reconstructs high-resolution images using disentangled information from the non-local neighbourhood. Experimental results from seven benchmark datasets demonstrate the effectiveness of the model. Moreover, our proposed model outperforms a number of baseline and state-of-the-art methods, as confirmed through extensive qualitative and quantitative evaluations.




I Introduction

Image super-resolution (SR) refers to the task of recovering a latent high-resolution (HR) image from a corresponding low-resolution (LR) image. This has been one of the most widely explored inverse problems in computer vision

[7, 35]. An LR image y is assumed to be modeled as the output of the following degradation:

y = D(x; δ)

where D defines a degradation mapping function, x is the corresponding HR image, and δ denotes the degradation parameter. The higher the degradation, the harder the task of reconstructing the HR image becomes [65]. The image super-resolution problem is also explored in the fields of remote sensing [30], surveillance imaging [49], and medical imaging [37].

While a number of approaches have been attempted to solve this problem, it remains ill-posed, particularly since any specific LR image may correspond to multiple HR counterparts. Three types of solutions are usually provided to solve the SR problem: interpolation-based [22], learning-based [17, 68, 66], and reconstruction-based [53, 62, 12]. Learning-based SR methods learn the non-linear mapping between HR and LR images using probabilistic generative models, random forests, linear or non-linear models, neighbor embedding [5], and sparse regression [26]. Interpolation-based methods utilize the adjacent pixels to calculate the interpolated pixels by using an interpolation kernel. Several types of existing interpolation-based models have been used to tackle the SR problem, such as bicubic [47], edge-directed estimation [63], and auto-regressive models [36]. Interpolation-based methods are computationally efficient and simpler than other architectures. However, these methods suffer from low accuracy compared to other methods because of poor representation learning capacity [52].
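To make the interpolation idea concrete, the following is a minimal bilinear upscaler: each output pixel is a weighted average of the four nearest input pixels, i.e. adjacent pixels combined through an interpolation kernel. This is an illustrative sketch (the function name and structure are ours, not from any of the cited methods).

```python
import numpy as np

def bilinear_upscale(img, scale):
    """Upscale a 2-D image by `scale` using bilinear interpolation.

    Each output pixel is a weighted average of the 4 nearest input
    pixels -- the adjacent pixels weighted by an interpolation kernel.
    """
    h, w = img.shape
    out_h, out_w = h * scale, w * scale
    # Map each output coordinate back into the input grid.
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Blend horizontally on the top and bottom rows, then vertically.
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy
```

Because the output is a convex combination of input pixels, no new high-frequency detail is created, which is why these methods lag learned approaches in accuracy.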

(a) HR Image
(b) LR Image
(c) DIP
(d) NLVAE
Fig. 1: Visual comparison of Deep Image Prior (DIP) (untrained) & NLVAE (untrained)

Learning-based solutions are widely used in the SR task. They can primarily be categorized into three types: code-based [13, 70, 68], CNN-based [24, 15], and regression-based [71]. In the past, learning-based solutions have shown great success for image super-resolution due to their robust feature learning capability [16, 25]. Regression-based solutions are much faster than any other methods but, compared to other learning-based methods, produce blurry images and a low peak signal-to-noise ratio due to poor representation learning [26, 66]. Learning-based solutions generally measure the similarity between LR & HR images. A number of methods have been proposed to solve this problem, but among them, CNN-based methods have superior performance because of their robust representation learning capabilities [4]. As a learning-based method, a hierarchical pyramid structure was developed using residual layers for image super-resolution [29].

Zhang et al. [72] introduced the residual dense block in order to learn hierarchical feature maps, utilizing a bottleneck layer at the end of the residual dense layer. The local connections within a block represent short-term memory, and skip connections provide long-term memory for representation learning. Lim et al. [33] proposed the Enhanced Deep Super-Resolution network (EDSR), achieving a large performance improvement by utilizing the residual structure. Despite the excellent performance, these methods rely on residual structures that are computationally expensive. In [19], a combination of channel-wise attention and spatial attention blocks was developed for single image super-resolution (SISR). Both of these blocks were then combined with residual connections, creating a robust SR method. This method captures sufficient representations from the feature space and suppresses irrelevant information. However, as this approach combines two kinds of attention blocks with a stacked neural network, it consumes large amounts of memory and slows down the training process.

In [39], an autoencoder-based SISR method was introduced with symmetric skip connections. Similarly, Tai et al. [55] featured a deep recursive residual network (DRRN) utilizing a memory block with residual representation learning. These solutions provide better results with the help of large-scale architectures. Park et al. [45] introduced a high dynamic range for the super-resolution task that is very lightweight and easy to implement, but decomposing the image causes loss of important information, which results in low peak signal-to-noise ratio (PSNR) values.

Many upsampling strategies have been observed in the literature. Efficient sub-pixel CNN (ESPCN) uses sub-pixel convolution for upsampling [50], storing channel information for extra points and then reorganizing those points for HR reconstruction. Fast super-resolution CNN (FSRCNN) utilizes a deconvolution operation to upsample [9]. Hua et al. also used a deconvolution operation for the upsampling process, featuring an arbitrary interpolation operator and a subsequent convolution operator [20]. It is to be noted that the deconvolution operation has two disadvantages: deconvolution is used at the end of the network, and the downsampling kernel is not known. Unknown input estimation consequently results in poor performance. To avoid these issues, we utilize linear upsampling of LR images so that we only focus on reconstruction quality rather than upsampling kernels.
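The sub-pixel rearrangement at the heart of ESPCN-style upsampling can be written as a pure array operation: each group of r² channels is unfolded into an r×r spatial neighbourhood of the upscaled output. The sketch below is a generic depth-to-space implementation in NumPy (channel-first layout assumed), not any particular paper's code.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r).

    This is the depth-to-space step at the end of ESPCN's sub-pixel
    convolution: channel information for the "extra points" is
    reorganized into an r x r spatial block of the HR output.
    """
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0, "channels must be divisible by r^2"
    c = c_r2 // (r * r)
    out = x.reshape(c, r, r, h, w)       # split channels into (C, r, r)
    out = out.transpose(0, 3, 1, 4, 2)   # reorder to (C, H, r, W, r)
    return out.reshape(c, h * r, w * r)  # merge into the upscaled grid
```

Output position (h·r + i, w·r + j) comes from channel i·r + j of the input, which is exactly the channel-to-space reorganization described above.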

From a theoretical perspective, it can be deduced that deeper neural architectures provide better results than shallow ones [44]. Keeping this in mind, Kim et al. [24] first proposed a very deep architecture for the SISR task. With 20 layers, this VGG-inspired network uses 3×3 kernels for all layers. Additionally, this method uses a high learning rate for faster convergence and utilizes gradient clipping to alleviate the gradient explosion problem. To learn short-term memory information, skip connections have been used in many tasks. Another work introduced a recursive topology with parameter reduction using a recursive convolution kernel [55]. However, these settings are risky in a self-supervised setting because a high learning rate will yield a shallow feature learning process, resulting in poor performance.

Several reconstruction-based SISR methods have been introduced to solve the SISR problem utilizing a shallow feature learning process [32, 10]. KernelGAN [1], consisting of a deep linear generator and a discriminator, supports blind SISR. The deep linear generator removes non-linear activation functions, but the overall loss function is not convex. The discriminator uses fully convolutional layers with no strides or pooling. Even though the overall structure means that the model converges faster, it is still difficult to obtain the global minimum. Our method utilizes non-linear activation functions in both the encoder and the decoder, making the network learn more intuitive information than KernelGAN. Shaham et al. [48] proposed a multidisciplinary generative model capable of performing multiple computer vision tasks. To our knowledge, this work was the first attempt to use an unconditional generative model for the ZSSR task. It utilizes an adversarial network as a reconstruction-based method, learning only abstract features from image patches. Due to the complexity of training an adversarial network, GAN models often suffer from convergence failure and mode collapse. Moreover, training adversarial networks takes longer than training discriminative models.

To alleviate these problems, we have devised an image-specific architecture, called probabilistic non-local variational neural autoencoder (NLVAE), which can generate high-quality images with a robust pixel learning capability. Our generative solution is specifically designed for ZSSR, storing more disentangled and intuitive features and learning from low-dimensional space.

Our specific contributions can be summarized as follows:

  • An unconventional internal method has been introduced for the ZSSR task where only one LR image and its corresponding HR image are required for the training process. The proposed method is completely unsupervised and does not require any prior training. It establishes a new state-of-the-art (SOTA) which outperforms currently available methods.

  • The proposed lightweight non-local feature extraction module harvests maximum representations from different receptive regions, boosting the super-resolution performance.

  • The proposed loss function aids in reconstructing high-quality images by controlling the Lagrange multiplier and marginal value.

The rest of the paper is organized as follows. Section II discusses work related to our proposed network structure. Section III describes the working principle of our method. In Section IV, we provide quantitative and qualitative results using our model. Section V provides some ablation studies demonstrating the robustness of our network, and Section VI discusses the limitations of our strategy along with similarities and dissimilarities with other methods. Section VII provides concluding remarks.

II Related Work

Generative Models. Generative models have been proven to reconstruct finer texture details and are able to generate more photo-realistic images than CNN-based methods. While shallow CNN-based SR methods provide detailed low-frequency information, GAN-based generative methods can discover high-frequency information. Super-Resolution GAN (SRGAN) [31] makes use of a perceptual loss as well as a residual dense network to generate high-resolution images. Wang et al. [64] proposed residual-in-residual blocks without batch normalization, producing HR images through adversarial training. Majdabad et al. [38] attached a capsule network as a complex network with a GAN for face super-resolution. In [46], a conditional GAN was introduced using the ground truth as a conditional variable for the discriminator. Similar to conditional GANs, conditional autoregressive generative models utilize maximum likelihood estimation depending on conditions. Based on these conditions, the generated HR images are reconstructed from previously learned pixels [60, 59]. However, these generative models suffer from mode collapse and convergence failure problems [14]. Moreover, these methods are computationally expensive, and integration with self-supervised training is quite difficult to implement, as GAN-based methods require more image data for training than learning-based methods [56, 64].

Non-Local Networks. Non-local networks usually comprise an attention module with non-local blocks. Wang et al. [61] proposed a deformable non-local attention module for video super-resolution. In [34], a non-local recurrent model was introduced for the SISR task, which can learn deep feature correlations among neighbouring patch locations. Zhang et al. [72] featured a residual network with non-local attention units for image super-resolution. Another work presents a cross-scale non-local attention module for learning intrinsic feature correlations of images [43].
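The core non-local operation behind these blocks computes the response at each position as a weighted sum over all positions, giving a global rather than local receptive field. The following is a generic embedded-Gaussian sketch of that operation on a flattened feature map; the projection matrices here are random placeholders, not the trained weights of any cited module.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local(x, w_theta, w_phi, w_g):
    """Embedded-Gaussian non-local operation.

    x: (N, C) features, one row per spatial position.
    Each output row aggregates ALL positions, weighted by pairwise
    affinities, so the receptive field is global.
    """
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    attn = softmax(theta @ phi.T, axis=-1)  # (N, N) affinity matrix
    return attn @ g                         # weighted sum over positions

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))        # a 4x4 map with 8 channels, flattened
weights = [rng.standard_normal((8, 4)) for _ in range(3)]
out = non_local(feats, *weights)            # (16, 4) globally-aggregated features
```

In a full block, a final 1×1 convolution would map the output back to the input channel count before it is merged with the rest of the network.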

Fig. 2: Overview of the non-local block used in our NLVAE network. The non-local block is composed of 3×3 kernels and 1×1 kernels. The initial convolution kernel is concatenated with the last feature transform to learn relative positional features. H×W defines the spatial size of the feature map, and the channel information is denoted as C.

Zero-Shot Super-Resolution Methods. Shocher et al. [51] introduced the term "ZSSR," presenting a shallow CNN model to learn the probability distribution of the LR and HR images. The major disadvantage of this network is that it extracts local features utilizing a simple, shallow CNN architecture, which also results in poor performance. Another internal method was proposed in [58], introducing Deep Image Prior (DIP) to build a bridge between a CNN and convolutional sparse coding. The solution takes the neural network output as the reconstruction from random input signals. This is the first approach which creates a bridge between a code-based method and a learning-based method for ZSSR. Untrained DIP basically focuses on smaller receptive fields for intuitive neural representation, but loses more context as feature extraction is limited to the smaller regions. Fig. 1 depicts that the DIP method shows very weak structural information compared to NLVAE. Due to a weak feature extraction process, this method suffers from low accuracy in terms of performance metrics [2].

Fig. 3: Network structure of the proposed non-local variational autoencoder (NLVAE) model. The probabilistic encoder and decoder are composed of non-local units and various convolution and upsampling layers. The reconstruction quality is controlled by the β operator. Global average pooling is used to calculate the mean and variance, leveraging global structural details during the reconstruction process.

III Network Structure

In this section, we present the structure of our proposed non-local block and the neural encoder and decoder. We also describe the posterior distribution and loss function, and we provide a full analysis of how the Lagrange multiplier controls the reconstruction quality of the generated image.

III-A Non-Local Encoder-Decoder

As shown in Fig. 3, our proposed NLVAE model consists of an encoder and a decoder built from non-local convolution blocks. Fig. 2 depicts an overview of the non-local block. The non-local block utilized in NLVAE can exploit spatial correlation among neighbourhood locations. To design a computationally effective spectral correlation module, we have omitted the residual structure. Each non-local block is composed of convolution blocks, and each convolution block comprises a 2D convolution, a point-wise convolution, and batch normalization [23] followed by the Leaky-ReLU activation function. The encoder encodes the input image x into the latent representation z, and the decoder reconstructs the representation back to an approximation of the original data. We assume that the low-resolution image is an input vector denoted by x, and z denotes the latent representation. The latent variables are controlled by a Gaussian distribution along with a diagonal covariance matrix. The latent space dimension is denoted by n. The output of the non-local convolutional encoder comprises a mean μ and a log of variances log σ². Through the reparameterization trick, a noise vector ε is obtained from the latent space [28]. The goal of the NLVAE model is to produce a high-resolution reconstructed image from the low-resolution image, exploiting the relationship between the input vector x and the prior distribution p(z).
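The reparameterization step mentioned above is the standard z = μ + σ ⊙ ε with ε ~ N(0, I): the randomness lives in ε, so sampling stays differentiable with respect to the encoder outputs. A minimal sketch (names ours):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Because the stochasticity is isolated in eps, gradients can flow
    through mu and log_var during training.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

As log σ² → −∞ the sample collapses onto the mean, which is a convenient sanity check for the implementation.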

III-B Posterior Distribution

We denote the low-resolution input data distribution as p(x) and the high-resolution reconstructed data distribution as p(x̂). The encoded and decoded data distributions are represented as q_φ(z|x) and p_θ(x|z) respectively, where φ and θ are the variables of the encoder and decoder networks. The distribution q_φ(z|x) tries to approximate the output prior p(z). A centered isotropic multivariate Gaussian, p(z) = N(0, I), was chosen as the prior over the latent variables [11]. The inference model is designed to output two individual variables μ and σ, and thus the posterior is q_φ(z|x) = N(μ, diag(σ²)). With this setting, to obtain the desired prior distribution, the non-local convolutional encoder and decoder are trained to optimize the reconstruction error (that is, the mean squared error). The loss function tries to approximate each patch of the HR image from the LR image over a fake minibatch of size M. The total number of data points is denoted as N.
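For a diagonal-Gaussian posterior and a standard-normal prior, the KL term used in such objectives has a well-known closed form, 0.5·Σ(σ² + μ² − 1 − log σ²); a sketch for reference (names ours):

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims.

    Closed form: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2).
    Zero exactly when the posterior already matches the prior.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
```

This closed form is what makes the KL term cheap to evaluate per minibatch, with no Monte Carlo sampling required.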

Input: a single LR image
Initialize network parameters θ, φ
while not converged do
       Generate pseudo-labels from the single image
       Sample the latent code from the prior p(z)
       Compute the total loss and update θ, φ with Adam
end while
Algorithm 1 Training the NLVAE model

The input of the decoder is sampled from q_φ(z|x) using the reparameterization trick z = μ + σ ⊙ ε, where ε ~ N(0, I). The aggregated posterior distribution is defined as:

q(z) = (1/N) Σ_{i=1}^{N} q_φ(z|x_i)
III-C Loss Function

It is useful to prepare the low-resolution input image by clustering in latent space, eradicating noise. The reconstruction loss is summed over the data points and averaged over the fake minibatch, which provides more weight to the reconstruction error and helps reduce potential model collapse:

L_rec = (1/M) Σ_{i=1}^{M} ||x_i − x̂_i||²
To obtain the desired prior distribution, KL divergence is utilized on the encoded variable to measure the probability distance between the LR and HR images. The KL divergence is calculated over the fake minibatch as

D_KL = (1/M) Σ_{i=1}^{M} D_KL( q_φ(z|x_i) ‖ p(z) )
Therefore, the total loss is calculated as

L = L_rec + β | D_KL − m |

where β denotes the Lagrangian multiplier and m denotes the marginal value. As the negation of the loss is the lower bound of the Lagrangian, minimization of the loss is equivalent to maximization of the Lagrangian, which is useful for our initial optimization problem. The β controls the quality of image reconstruction as an aid to the objective function. For β = 1, the working principle is the same as a traditional VAE. When β > 1, it applies a stronger constraint on the latent bottleneck and limits the representation capacity of z [6]. Maintaining disentanglement is the most effective representation for some of the conditionally independent generative factors [41].
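One way to read this Lagrangian-multiplier objective, with β weighting the deviation of the KL term from a margin m, is the following sketch. This is our illustrative interpretation, not a released implementation; all names are ours.

```python
import numpy as np

def total_loss(x, x_hat, mu, log_var, beta=150.0, margin=0.0):
    """Sketch of a beta-VAE-style objective with a KL margin:
    MSE reconstruction plus beta * |KL - m|.
    (One reading of the Lagrangian-multiplier formulation.)
    """
    rec = np.mean((x - x_hat) ** 2)                            # reconstruction error
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)  # closed-form KL
    return rec + beta * np.abs(kl - margin)
```

With β = 1 and m = 0 this reduces to the usual VAE objective up to the absolute value, which is inactive because the KL term is non-negative; larger β tightens the latent bottleneck as discussed above.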

III-D Lagrangian Multiplier Variation for Different Upscaling Factors

The addition of β in the VAE provides more disentangled information and sharper gradients compared to a traditional VAE [18]. A higher value of β provides more efficiently encoded latent vectors and further encourages disentanglement. However, too large a β may lead to poorer reconstruction quality, as it creates a trade-off with the extent of disentanglement. The reconstruction loss ensures the network captures useful information while forming the latent distribution. An increase in the number of latent variables reduces image quality; thus, through empirical evaluation, we selected a different Lagrangian multiplier for each upscaling factor. For our experimental settings, we have selected β = 150, 200, and 300 for the 3×, 4×, and 8× upscaling factors, respectively.

III-E Computational Efficiency of the Non-Local Block

In this subsection, the computational efficiency of the non-local block is briefly explained. The point-wise convolution is the core of the non-local block for calibrating spatial information. It also serves as the channel reduction technique in this network. For a convolution with kernel size K, input channels C_in, and output channels C_out, the number of weights can be calculated as:

W = K² · C_in · C_out

For the point-wise operation, K = 1. Then the equation above becomes:

W_pw = C_in · C_out

And the corresponding number of operations over an H × W feature map is therefore:

O_pw = H · W · C_in · C_out

For the standard convolution operation, the number of weights will be:

W_std = K² · C_in · C_out

And the corresponding number of operations is:

O_std = H · W · K² · C_in · C_out

Now, the reduction factors of weights and operations can be defined as:

R_W = W_pw / W_std = 1/K²,   R_O = O_pw / O_std = 1/K²

From the reduction factors of weights and operations, we can observe the reduction in computational cost due to the use of point-wise convolutions.
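The K² reduction factor can be verified with a few lines of counting code; the helper below is illustrative (names ours), assuming one multiply-accumulate per weight per output position.

```python
def conv_costs(h, w, c_in, c_out, k):
    """Weights and multiply-ops for a standard KxK conv vs a 1x1 conv
    over an h x w feature map with c_in -> c_out channels."""
    w_std = k * k * c_in * c_out          # standard conv weights
    o_std = h * w * k * k * c_in * c_out  # one MAC per weight per position
    w_pw = c_in * c_out                   # point-wise (1x1) conv weights
    o_pw = h * w * c_in * c_out
    return w_std, o_std, w_pw, o_pw

w_std, o_std, w_pw, o_pw = conv_costs(64, 64, 32, 32, 3)
assert w_std // w_pw == 9 and o_std // o_pw == 9  # both shrink by K^2 for K = 3
```

For the 3×3 kernels used in the block, the point-wise path therefore costs one-ninth of the weights and operations of a standard convolution with the same channel counts.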

Scale | Method | Set5 | Set14 | BSDS100 | Urban100 | Manga109
×3 | Bicubic | 30.40 / 0.8684 | 27.55 / 0.7743 | 27.19 / 0.7388 | 24.45 / 0.7358 | 26.95 / 0.8558
×3 | A+ [57] | 32.51 / 0.9080 | 29.10 / 0.8202 | 28.21 / 0.7829 | 25.86 / 0.7891 | 29.90 / 0.9101
×3 | SRCNN [8] | 32.75 / 0.9090 | 29.30 / 0.8215 | 28.28 / 0.7832 | 25.87 / 0.7888 | 30.56 / 0.9124
×3 | FSRCNN [9] | 33.17 / 0.9141 | 29.39 / 0.8240 | 28.59 / 0.7940 | 26.43 / 0.8075 | 31.05 / 0.9189
×3 | VDSR [24] | 33.67 / 0.9212 | 29.78 / 0.8318 | 28.83 / 0.7982 | 27.14 / 0.8280 | 32.07 / 0.9337
×3 | LapSRN [29] | 33.82 / 0.9227 | 29.84 / 0.8322 | 28.82 / 0.7982 | 27.07 / 0.8270 | 32.21 / 0.9342
×3 | MemNet [54] | 34.09 / 0.9248 | 30.00 / 0.8350 | 28.95 / 0.8001 | 27.53 / 0.8270 | 32.58 / 0.9382
×3 | SRGAN [31] | 33.73 / 0.9102 | 29.58 / 0.8215 | 28.62 / 0.7790 | 26.04 / 0.8168 | 31.56 / 0.9187
×3 | NLVAE (Proposed) | 34.10 / 0.9270 | 30.81 / 0.8398 | 29.05 / 0.7805 | 28.07 / 0.8402 | 33.19 / 0.9437

×4 | Bicubic | 28.43 / 0.8109 | 26.00 / 0.7026 | 25.95 / 0.6698 | 23.13 / 0.6598 | 24.89 / 0.7865
×4 | A+ [57] | 30.25 / 0.8601 | 27.21 / 0.7503 | 26.65 / 0.7103 | 24.19 / 0.7198 | 27.08 / 0.8519
×4 | SRCNN [8] | 30.48 / 0.8628 | 27.50 / 0.7513 | 26.90 / 0.7114 | 24.52 / 0.7221 | 27.60 / 0.8583
×4 | FSRCNN [9] | 30.72 / 0.8658 | 27.60 / 0.7538 | 26.95 / 0.7138 | 24.62 / 0.7280 | 27.86 / 0.8602
×4 | VDSR [24] | 31.35 / 0.8838 | 28.02 / 0.7682 | 27.29 / 0.7165 | 25.18 / 0.7530 | 28.87 / 0.8862
×4 | LapSRN [29] | 31.54 / 0.8860 | 28.16 / 0.7724 | 27.32 / 0.7161 | 25.21 / 0.7558 | 29.09 / 0.8890
×4 | MemNet [54] | 31.76 / 0.8893 | 28.26 / 0.7726 | 27.42 / 0.7280 | 25.50 / 0.7628 | 29.64 / 0.8938
×4 | SRGAN [31] | 29.37 / 0.8471 | 26.01 / 0.7396 | 25.13 / 0.6645 | 24.35 / 0.7331 | 28.39 / 0.8603
×4 | ESRGAN [64] | 30.47 / 0.8512 | 26.28 / 0.6987 | 25.32 / 0.6519 | 24.36 / 0.7337 | 28.44 / 0.8609
×4 | NLVAE (Proposed) | 31.96 / 0.8903 | 28.67 / 0.7776 | 27.86 / 0.7367 | 25.88 / 0.7751 | 30.11 / 0.8945

×8 | Bicubic | 24.42 / 0.6580 | 23.10 / 0.5660 | 23.65 / 0.5483 | 20.74 / 0.5160 | 21.55 / 0.6509
×8 | A+ [57] | 25.21 / 0.6875 | 23.48 / 0.5889 | 23.97 / 0.5605 | 21.02 / 0.5403 | 22.11 / 0.6813
×8 | SRCNN [8] | 25.33 / 0.6900 | 23.76 / 0.5910 | 24.13 / 0.5659 | 21.29 / 0.5438 | 22.40 / 0.6846
×8 | FSRCNN [9] | 20.13 / 0.5520 | 19.75 / 0.4820 | 24.21 / 0.5672 | 21.32 / 0.5379 | 22.39 / 0.6730
×8 | VDSR [24] | 25.95 / 0.7242 | 24.26 / 0.6140 | 24.37 / 0.5767 | 21.65 / 0.5704 | 23.16 / 0.7230
×8 | LapSRN [29] | 26.14 / 0.7384 | 24.35 / 0.6200 | 24.53 / 0.5865 | 21.81 / 0.5805 | 23.39 / 0.7533
×8 | MemNet [54] | 26.16 / 0.7414 | 24.38 / 0.6199 | 24.59 / 0.5843 | 21.88 / 0.5824 | 23.56 / 0.7386
×8 | SRGAN [31] | 25.88 / 0.7069 | 24.02 / 0.6015 | 24.41 / 0.5786 | 21.68 / 0.5614 | 24.61 / 0.7864
×8 | ESRGAN [64] | 26.30 / 0.7551 | 24.07 / 0.6011 | 24.64 / 0.5850 | 22.57 / 0.6279 | 24.75 / 0.7872
×8 | NLVAE (Proposed) | 27.23 / 0.7860 | 25.32 / 0.6469 | 25.31 / 0.5983 | 22.97 / 0.6353 | 25.12 / 0.8013
TABLE I: Benchmark results (PSNR / SSIM) for SISR methods at ×3, ×4, and ×8 upscaling. Best results are in bold. All the methods are trained on the DIV2K dataset.


IV Experimental Evaluations

IV-A Datasets

We have evaluated our proposed NLVAE model against seven different datasets: Set5 [3], Set14 [69], BSD100 [40], Manga109 [42], Urban100 [21], General100 [9], and T91 [67]. In the qualitative analysis, we have used the General100, Set14, Set5, and T91 datasets, while the quantitative analyses are performed using the Set5, Set14, BSD100, Urban100, and Manga109 datasets. We have compared our model against a number of baseline and SOTA models, reporting PSNR and SSIM metrics.
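The PSNR metric reported throughout follows the standard definition, 10·log₁₀(MAX²/MSE); a minimal sketch, assuming an 8-bit dynamic range (the function name is ours):

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)
```

Higher is better, and identical images yield infinite PSNR; SSIM, the other reported metric, instead compares local luminance, contrast, and structure statistics.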

IV-B Implementation Details & Training Settings

We make use of the TensorFlow framework with Python for all the experiments. The experiments are implemented on a machine with an Nvidia GeForce GTX Titan X and an Intel Xeon CPU at 2.40 GHz. All the images are resized to a fixed resolution. The Adam optimizer [27] is utilized with β1 = 0.9 and β2 = 0.999. Pseudo-labels are created for training purposes, as there exists only one single image. The model is trained for 2000 epochs. We use the L2 loss function for our solution. For our settings, the hyperparameters are selected empirically. We perform the experiments for three different scaling factors: 3×, 4×, and 8×. The value of is set to 500 for all the experiments.

As the proposed method utilizes a self-training strategy, it takes both the LR and HR versions of a single image as input to the training pipeline. It then tries to learn their relationship leveraging non-local attention blocks. Finally, the self-trained model generates a single image from the pre-trained weights. It is to be noted that the performance metric calculation is the mean over all single generated images for each dataset.

IV-C Results





Fig. 4: Visual comparison of reconstruction-based methods on 'tt17.png' from the T91 dataset, 'butterfly.png' from the Set5 dataset, and 'img_087.png' from the Urban100 dataset.



Fig. 5: Visual comparison of learning-based methods on ’im_078.png’ from General100 dataset and ’pepper.png’ from Set14 dataset.
(a) L1 & L2 loss functions, and various optimizers with respect to epochs.
(b) Various feature learning units with respect to epochs.
Fig. 6: Ablation study: Evaluation of loss functions and feature learning blocks on the Set5 dataset

IV-C1 Quantitative

Table I reports the quantitative results for the 3×, 4×, and 8× SR settings. Both learning-based methods (Bicubic, A+, Super-Resolution CNN (SRCNN), Fast SRCNN (FSRCNN), Very Deep Super-Resolution (VDSR), Laplacian Pyramid Super-Resolution Network (LapSRN), Memory Network (MemNet)) and reconstruction-based methods (SRGAN, ESRGAN) have been compared against the proposed framework. It is worth mentioning that deeper architectures perform better than shallower networks. We also note that larger scaling factors affect the performance of the existing external methods. Among learning-based methods, MemNet demonstrates good performance on large scaling factors because of its large architecture, but its performance drops when the scaling factor is relatively small. The reconstruction-based strategies yield higher structural similarity than other methods. Most importantly, our proposed NLVAE model outperforms the other reconstruction-based methods at all scaling factors, generating high-resolution photo-realistic images. This justifies the incorporation of the non-local convolutional block, which enables the model to perform better, specifically on smaller-scaled images. Moreover, the deeper architecture of the generative model enhances the performance at large scaling factors, leading to a robust zero-shot super-resolution network.

IV-C2 Qualitative

Fig. 5 and Fig. 4 depict the visualizations of learning-based methods and reconstruction-based methods, respectively. Samples from the Set14 and General100 datasets have been used to visualize the learning-based solutions. Fig. 5 confirms that our solution produces sharp edges and avoids undesirable artifacts. As can be seen, the featured generative solution learns better representations between an LR image and its corresponding HR image. For a fair comparison with reconstruction-based solutions, we utilized Set5, Urban100, and BSDS100 for the qualitative comparison among generative SISR models. The visual quality of the reconstructed images is superior to that of other methods because of the global contextual feature learning process. Fig. 4 demonstrates that our method can reduce blurring artifacts, presenting a powerful feature learning ability. Notably, the NLVAE model provides better details in regions of irregular structures. More detailed visualizations containing random HR samples from the generated sets and real datasets are provided in the supplementary material.

V Ablation Study

V-a Loss function & Optimizers

In Fig. 6(a), we have explored different loss functions and optimizers to evaluate the performance of our proposed model. We observe that the combination of the L2 loss and Adam converges more smoothly than any other combination. Among all optimizers, Adam converges faster than the others. SGD and RMSProp provide competitive results but yield slower solutions for a zero-shot process. Between the L1 and L2 loss functions, L2 provides faster training and finer reconstruction quality. All the hyperparameters are fixed for this ablation study.

V-B Feature Extraction Blocks

To verify the robustness of our non-local convolutional block, we explored various feature extraction units. Fig. 6(b) shows that the non-local convolutional unit performs better than the other feature learning units. We observe that the residual unit learns slightly better representations than the remaining units but carries a relatively larger computational burden. Compared against other traditional convolution operations (including depth-wise separable convolution, transposed convolution, and standard convolution), our non-local unit shows excellent performance with the lowest MSE between the LR & HR images.

V-C Non-Local Blocks

In this ablation, we study the importance of non-local blocks for image generation. In Table II, the PSNR values are reported against the number of non-local blocks in our proposed method. We note that increasing the number of non-local blocks yields more accurate images but also increases the computational cost. Moreover, we note unusual instability when using 5 or more non-local blocks with the Adam optimizer.

Convolutional Encoder PSNR Convolutional Decoder PSNR
Non-Local Block - 1 Unstable Non-Local Block - 5 31.83
Non-Local Block - 2 27.45 Non-Local Block - 6 32.29
Non-Local Block - 3 30.12 Non-Local Block - 7 33.81
Non-Local Block - 4 33.27 Non-Local Block - 8 33.97
Non-Local Block - 5 34.10 Non-Local Block - 9 34.10
TABLE II: Number of Non-local blocks for convolutional encoder & decoder on Set5 dataset

VI Discussions

In this section, we discuss the similarities, dissimilarities, and limitations of our method compared to other data-driven strategies. Table III shows that the input image is linearly upscaled before being processed for super-resolution. Similar to VDSR and DRCN, we also upscale the low-resolution image, but linearly. The reconstruction process in our method is progressive, as we combine both learning-based and reconstruction-based methods. Learning-based methods generally utilize direct reconstruction of HR images. We adopt the L2 loss function for faster convergence in order to maintain high reconstruction quality. As mentioned above, we do not use a residual representation learning process due to its computational cost in self-supervised settings. Our setup uses small modifications of the self-supervised setting. We do not use batches of images per epoch; instead, we utilize fake batches of a single image for every epoch. Moreover, we perform all these experiments on datasets of different sizes to explore structural variation. Experiments have been done on small datasets (Set5, Set14) as well as large datasets (Manga109, Urban100, BSD100) to justify the performance of our proposed solution.

Method   Residual Learning   Input          Reconstruction   Loss
SRCNN    No                  LR             Direct           L2
FSRCNN   No                  LR             Direct           L2
VDSR     Yes                 LR + Bicubic   Direct           L2
DRCN     Yes                 LR + Bicubic   Direct           L2
LapSRN   Yes                 LR             Progressive      L1
NLVAE    No                  LR + Linear    Progressive      L2
TABLE III: A comparison among various SISR methods defining the loss function, input types, reconstruction types and feature extraction modules.
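The "fake batch" strategy described above, which augments a single image into a training batch, can be sketched as follows. This is a hypothetical NumPy illustration, not the authors' code; the crop size and augmentations are assumptions:

```python
import numpy as np

def fake_batch(image, batch_size=8, crop=32, rng=None):
    """Build a 'fake batch' from one image: random crops plus random flips.

    image: (H, W, C) array; returns (batch_size, crop, crop, C).
    """
    rng = rng or np.random.default_rng()
    h, w, _ = image.shape
    patches = []
    for _ in range(batch_size):
        top = rng.integers(0, h - crop + 1)
        left = rng.integers(0, w - crop + 1)
        patch = image[top:top + crop, left:left + crop]
        if rng.random() < 0.5:            # random horizontal flip
            patch = patch[:, ::-1]
        patches.append(patch)
    return np.stack(patches)

img = np.random.default_rng(1).random((64, 64, 3))   # stand-in single image
batch = fake_batch(img, batch_size=8, crop=32)
print(batch.shape)                                   # (8, 32, 32, 3)
```

Each epoch can draw a fresh fake batch from the same image, which gives the optimizer batch-like gradient statistics without requiring an external dataset.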

VII Conclusions

We have presented NLVAE, an untrained generative model featuring a neural encoder-decoder framework capable of reconstructing high-resolution images. With the use of non-local convolutional modules, the model captures high-quality semantic information. In addition, the beta variational autoencoder provides more disentangled information for reconstructing high-resolution images. By combining learning-based and reconstruction-based methods, the proposed method generates sharp and photo-realistic images. The effectiveness of the model has been confirmed through extensive qualitative and quantitative experimentation against a number of SOTA methods on multiple benchmark datasets. Moreover, by leveraging the power of robust feature learning and generative modeling, the proposed model obviates the need for a large-scale dataset when performing SISR. It should be noted that our method relies on linear upsampling before the super-resolution task. Our future work will include further validation of the NLVAE model on more challenging data settings across various domains, a more powerful automatic upsampling strategy, a more extensive analysis of the model, and a more intuitive design of the objective function.
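For concreteness, the beta-VAE objective referred to above combines a reconstruction term with a beta-weighted KL divergence to a standard normal prior. The following is a minimal NumPy sketch under that standard formulation; the function name and the use of mean-squared error are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """beta-VAE objective: reconstruction error plus beta-weighted KL term.

    KL is between the diagonal Gaussian q(z|x) = N(mu, exp(log_var))
    and the standard normal prior N(0, I), in closed form.
    """
    recon = np.mean((x - x_hat) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + beta * kl

# When q(z|x) equals the prior and reconstruction is perfect,
# both terms vanish and the loss is zero:
mu, log_var = np.zeros(8), np.zeros(8)
x = np.ones(16)
print(beta_vae_loss(x, x, mu, log_var))   # 0.0
```

Setting beta > 1 penalizes deviation from the prior more heavily, which is what encourages the disentangled latent representation exploited here.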


  • [1] S. Bell-Kligler, A. Shocher, and M. Irani (2019) Blind super-resolution kernel estimation using an internal-gan. In Advances in Neural Information Processing Systems, pp. 284–293. Cited by: §I.
  • [2] Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §II.
  • [3] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. Cited by: §IV-A.
  • [4] A. Bhowmik, S. Shit, and C. S. Seelamantula (2017) Training-free, single-image super-resolution using a dynamic convolutional network. IEEE signal processing letters 25 (1), pp. 85–89. Cited by: §I.
  • [5] H. Chang, D. Yeung, and Y. Xiong (2004) Super-resolution through neighbor embedding. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), Vol. 1, pp. I–I. Cited by: §I.
  • [6] R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §III-C.
  • [7] R. Dian, L. Fang, and S. Li (2017) Hyperspectral image super-resolution via non-local sparse tensor factorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5344–5353. Cited by: §I.
  • [8] C. Dong, C. C. Loy, K. He, and X. Tang (2015) Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38 (2), pp. 295–307. Cited by: TABLE I.
  • [9] C. Dong, C. C. Loy, and X. Tang (2016) Accelerating the super-resolution convolutional neural network. In European conference on computer vision, pp. 391–407. Cited by: §I, TABLE I, §IV-A.
  • [10] X. Dou, C. Li, Q. Shi, and M. Liu (2020) Super-resolution for hyperspectral remote sensing images based on the 3d attention-srgan network. Remote Sensing 12 (7), pp. 1204. Cited by: §I.
  • [11] J. Durrieu, J. Thiran, and F. Kelly (2012) Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4833–4836. Cited by: §III-B.
  • [12] R. Fattal (2007) Image upsampling via imposed edge statistics. In ACM SIGGRAPH 2007 papers, pp. 95–es. Cited by: §I.
  • [13] W. T. Freeman, T. R. Jones, and E. C. Pasztor (2002) Example-based super-resolution. IEEE Computer graphics and Applications 22 (2), pp. 56–65. Cited by: §I.
  • [14] I. Goodfellow (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §II.
  • [15] H. Gupta, K. H. Jin, H. Q. Nguyen, M. T. McCann, and M. Unser (2018) CNN-based projected gradient descent for consistent ct image reconstruction. IEEE transactions on medical imaging 37 (6), pp. 1440–1453. Cited by: §I.
  • [16] V. K. Ha, J. Ren, X. Xu, S. Zhao, G. Xie, and V. M. Vargas (2018) Deep learning based single image super-resolution: a survey. In International Conference on Brain Inspired Cognitive Systems, pp. 106–119. Cited by: §I.
  • [17] L. He, H. Qi, and R. Zaretzki (2013) Beta process joint dictionary learning for coupled feature spaces with application to single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 345–352. Cited by: §I.
  • [18] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2016) Beta-vae: learning basic visual concepts with a constrained variational framework. Cited by: §III-D.
  • [19] Y. Hu, J. Li, Y. Huang, and X. Gao (2019) Channel-wise and spatial feature modulation network for single image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §I.
  • [20] Z. Hua, H. Zhang, and J. Li (2019) Image super resolution using fractal coding and residual network. Complexity 2019. Cited by: §I.
  • [21] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5197–5206. Cited by: §IV-A.
  • [22] K. Hung and W. Siu (2011) Robust soft-decision interpolation using weighted least squares. IEEE Transactions on Image Processing 21 (3), pp. 1061–1069. Cited by: §I.
  • [23] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §III-A.
  • [24] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1646–1654. Cited by: §I, §I, TABLE I.
  • [25] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1637–1645. Cited by: §I.
  • [26] K. I. Kim and Y. Kwon (2010) Single-image super-resolution using sparse regression and natural image prior. IEEE transactions on pattern analysis and machine intelligence 32 (6), pp. 1127–1133. Cited by: §I, §I.
  • [27] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-B.
  • [28] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §III-A.
  • [29] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632. Cited by: §I, TABLE I.
  • [30] C. Lanaras, E. Baltsavias, and K. Schindler (2015) Hyperspectral super-resolution by coupled spectral unmixing. In Proceedings of the IEEE international conference on computer vision, pp. 3586–3594. Cited by: §I.
  • [31] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §II, TABLE I.
  • [32] S. Lian, H. Zhou, and Y. Sun (2019) Fg-srgan: a feature-guided super-resolution generative adversarial network for unpaired image super-resolution. In International Symposium on Neural Networks, pp. 151–161. Cited by: §I.
  • [33] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: §I.
  • [34] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang (2018) Non-local recurrent network for image restoration. In Advances in Neural Information Processing Systems, pp. 1673–1682. Cited by: §II.
  • [35] H. Liu, S. Li, and H. Yin (2013) Infrared surveillance image super resolution via group sparse representation. Optics Communications 289, pp. 45–52. Cited by: §I.
  • [36] M. Lu, L. Huang, and Y. Xia (2017) Two dimensional autoregressive modeling-based interpolation algorithms for image super-resolution: a comparison study. In 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), pp. 1–6. Cited by: §I.
  • [37] D. Mahapatra, B. Bozorgtabar, and R. Garnavi (2019) Image super-resolution using progressive generative adversarial networks for medical image analysis. Computerized Medical Imaging and Graphics 71, pp. 30–39. Cited by: §I.
  • [38] M. M. Majdabadi and S. Ko (2020) Capsule gan for robust face super resolution. Multimedia Tools and Applications 79 (41), pp. 31205–31218. Cited by: §II.
  • [39] X. Mao, C. Shen, and Y. Yang (2016) Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2810–2818. Cited by: §I.
  • [40] D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vol. 2, pp. 416–423. Cited by: §IV-A.
  • [41] E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh (2019) Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pp. 4402–4412. Cited by: §III-C.
  • [42] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa (2017) Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications 76 (20), pp. 21811–21838. Cited by: §IV-A.
  • [43] Y. Mei, Y. Fan, Y. Zhou, L. Huang, T. S. Huang, and H. Shi (2020) Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5690–5699. Cited by: §II.
  • [44] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio (2014) On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932. Cited by: §I.
  • [45] J. S. Park, J. W. Soh, and N. I. Cho (2018) High dynamic range and super-resolution imaging from a single image. IEEE Access 6, pp. 10966–10978. Cited by: §I.
  • [46] J. Qiao, H. Song, K. Zhang, X. Zhang, and Q. Liu (2019) Image super-resolution using conditional generative adversarial network. IET Image Processing 13 (14), pp. 2673–2679. Cited by: §II.
  • [47] W. Ruangsang and S. Aramvith (2017) Efficient super-resolution algorithm using overlapping bicubic interpolation. In 2017 IEEE 6th Global Conference on Consumer Electronics (GCCE), pp. 1–2. Cited by: §I.
  • [48] T. R. Shaham, T. Dekel, and T. Michaeli (2019) Singan: learning a generative model from a single natural image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4570–4580. Cited by: §I.
  • [49] P. Shamsolmoali, M. Zareapoor, D. K. Jain, V. K. Jain, and J. Yang (2019) Deep convolution network for surveillance records super-resolution. Multimedia Tools and Applications 78 (17), pp. 23815–23829. Cited by: §I.
  • [50] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1874–1883. Cited by: §I.
  • [51] A. Shocher, N. Cohen, and M. Irani (2018) “Zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3118–3126. Cited by: §II.
  • [52] A. Shukla, S. Merugu, and K. Jain (2020) A technical review on image super-resolution techniques. Advances in Cybernetics, Cognition, and Machine Learning for Communication Technologies, pp. 543–565. Cited by: §I.
  • [53] J. Sun, J. Zhu, and M. F. Tappen (2010) Context-constrained hallucination for image super-resolution. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 231–238. Cited by: §I.
  • [54] Y. Tai, J. Yang, X. Liu, and C. Xu (2017) Memnet: a persistent memory network for image restoration. In Proceedings of the IEEE international conference on computer vision, pp. 4539–4547. Cited by: TABLE I.
  • [55] Y. Tai, J. Yang, and X. Liu (2017) Image super-resolution via deep recursive residual network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3147–3155. Cited by: §I, §I.
  • [56] N. Takano and G. Alaghband (2019) Srgan: training dataset matters. arXiv preprint arXiv:1903.09922. Cited by: §II.
  • [57] R. Timofte, V. De Smet, and L. Van Gool (2014) A+: adjusted anchored neighborhood regression for fast super-resolution. In Asian conference on computer vision, pp. 111–126. Cited by: TABLE I.
  • [58] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2018) Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454. Cited by: §II.
  • [59] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with pixelcnn decoders. Advances in neural information processing systems 29, pp. 4790–4798. Cited by: §II.
  • [60] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pp. 1747–1756. Cited by: §II.
  • [61] H. Wang, D. Su, C. Liu, L. Jin, X. Sun, and X. Peng (2019) Deformable non-local network for video super-resolution. IEEE Access 7, pp. 177734–177744. Cited by: §II.
  • [62] L. Wang, H. Wu, and C. Pan (2014) Fast image upsampling via the displacement field. IEEE Transactions on Image Processing 23 (12), pp. 5123–5135. Cited by: §I.
  • [63] L. Wang, S. Xiang, G. Meng, H. Wu, and C. Pan (2013) Edge-directed single-image super-resolution via adaptive gradient magnitude self-interpolation. IEEE Transactions on Circuits and Systems for Video Technology 23 (8), pp. 1289–1299. Cited by: §I.
  • [64] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy (2018) Esrgan: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §II, TABLE I.
  • [65] Z. Wang, J. Chen, and S. C. Hoi (2020) Deep learning for image super-resolution: a survey. IEEE transactions on pattern analysis and machine intelligence. Cited by: §I.
  • [66] J. Yang, Z. Lin, and S. Cohen (2013) Fast image super-resolution based on in-place example regression. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1059–1066. Cited by: §I, §I.
  • [67] J. Yang, J. Wright, T. S. Huang, and Y. Ma (2010) Image super-resolution via sparse representation. IEEE transactions on image processing 19 (11), pp. 2861–2873. Cited by: §IV-A.
  • [68] K. Zeng, J. Yu, R. Wang, C. Li, and D. Tao (2015) Coupled deep autoencoder for single image super-resolution. IEEE transactions on cybernetics 47 (1), pp. 27–37. Cited by: §I, §I.
  • [69] R. Zeyde, M. Elad, and M. Protter (2010) On single image scale-up using sparse-representations. In International conference on curves and surfaces, pp. 711–730. Cited by: §IV-A.
  • [70] K. Zhang, X. Gao, X. Li, and D. Tao (2010) Partially supervised neighbor embedding for example-based image super-resolution. IEEE Journal of Selected Topics in Signal Processing 5 (2), pp. 230–239. Cited by: §I.
  • [71] K. Zhang, X. Gao, D. Tao, and X. Li (2012) Single image super-resolution with non-local means and steering kernel regression. IEEE Transactions on Image Processing 21 (11), pp. 4544–4556. Cited by: §I.
  • [72] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu (2019) Residual non-local attention networks for image restoration. In International Conference on Learning Representations, External Links: Link Cited by: §I, §II.