Joint Dictionary Learning for Example-based Image Super-resolution

01/12/2017, by Mojtaba Sahraee-Ardakan, et al.

In this paper, we propose a new joint dictionary learning method for example-based image super-resolution (SR), using sparse representation. The low-resolution (LR) dictionary is trained from a set of LR sample image patches. Using the sparse representation coefficients of these LR patches over the LR dictionary, the high-resolution (HR) dictionary is trained by minimizing the reconstruction error of HR sample patches. The error criterion used here is the mean square error. In this way we guarantee that the HR patches have the same sparse representation over the HR dictionary as the LR patches have over the LR dictionary, and at the same time, these sparse representations reconstruct the HR patches well. Simulation results show the effectiveness of our method compared to state-of-the-art SR algorithms.







1 Introduction

Super-resolution is the problem of reconstructing a high resolution image (in this article, by resolution we mean spatial resolution) from one or several low resolution images [1]. It has many potential applications, such as enhancing the image quality of low-cost imaging sensors (e.g., cell phone cameras) and increasing the resolution of standard definition (SD) movies to display them on high definition (HD) TVs, to name a few.

Prior to SR methods, the usual way to increase the resolution of images was to use simple interpolation-based methods such as bilinear and bicubic interpolation and, more recently, the resampling method described in [2], among many others. However, all these methods suffer from blurring of the high-frequency details of the image, especially for large upscaling factors (the amount by which the resolution of the image is increased in each dimension). Thus, over the last few years, a large number of SR algorithms have been proposed [3]. These methods can be classified into two categories: multi-image SR and single-image SR.

Since the seminal work of Tsai and Huang [4] in 1984, many multi-image SR techniques have been proposed [5, 6, 7, 8]. In the conventional SR problem, multiple images of the same scene with subpixel motion are required to generate the HR image. However, the performance of these SR methods is acceptable only for small upscaling factors (usually smaller than 2). As the upscaling factor increases, the SR problem becomes severely ill-conditioned and a large number of LR images are needed to recover the HR image with acceptable quality.

To address this problem, example-based SR techniques were developed, which require only a single LR image as input [9]. In these methods, an external training database is used to learn the correspondence between the manifolds of LR and HR image patches. In some approaches, instead of using an external database, the patches extracted from the LR image itself across different resolutions are used [10]. In [9], Freeman et al. used a Markov network model for super-resolution. Inspired by the ideas in locally linear embedding (LLE) [11], the authors of [12] used the similarity between the manifolds of HR patches and LR patches to estimate HR image patches. Motivated by results in compressive sensing [13], Yang et al. in [14] and [15] used sparse representation for SR. In [16], they introduced coupled dictionary training, in which the sparse representation of LR image patches better reconstructs the HR patches.

Recently, joint and coupled learning methods have been utilized for efficient modeling of correlated sparsity structures [17, 15]. However, the joint and coupled learning methods proposed in [14, 15, 18] still do not guarantee that the sparse representation of HR image patches over the HR dictionary is the same as the sparse representation of LR patches over the LR dictionary. To address this problem, in this paper we propose a direct way to train the dictionaries that enforces the same sparse representation for LR and HR patches. Moreover, since the HR dictionary is trained by minimizing the final error in the reconstruction of HR patches, the reconstruction error in our method is smaller.

The rest of this paper is organized as follows. In section 2, Yang’s method for super-resolution via sparse representation is reviewed. In section 3, a flaw in Yang’s method is discussed, and our method to solve this problem is presented. Finally, section 4 is devoted to simulation results.

2 Review of Super-resolution via Sparse Representation

In SR via sparse representation we are given two sets of training data: a set of LR image patches, and a set of corresponding HR image patches. In other words, in the training data we have pairs of LR and HR image patches. The goal of SR is to use this database to increase the resolution of a given LR image.


Let X_l = {x_l^i, i = 1, …, n} be the set of LR patches (each patch is arranged into a column vector x_l^i) and X_h = {x_h^i, i = 1, …, n} be the set of corresponding HR patches. In SR using sparse representation, the problem is to train two dictionaries D_l and D_h for the set of LR patches (or a feature of these patches) and the HR patches, respectively, such that for any LR patch y_l, its sparse representation α over D_l reconstructs the corresponding HR patch y_h using D_h: y_h ≈ D_h α [15]. Towards this end, first the dictionary learning problem is briefly reviewed in section 2.1. Then the dictionary learning method for SR proposed in [15] is studied in section 2.2. Finally, in section 2.3, it is shown how these trained dictionaries can be used to perform SR on an LR image.

2.1 Dictionary learning

Given a set of signals X = {x_i, i = 1, …, n}, dictionary learning is the problem of finding a wide matrix D over which the signals have sparse representations [19]. This problem is highly related to subspace identification [20]; however, sparsity helps us turn subspace recovery into a well-defined problem. This approach has attracted a lot of attention in the last decade and has found diverse applications [21, 22, 23]. If we denote the sparse representation of x_i over D by α_i, the dictionary learning problem can be formulated as

    min_{D, {α_i}} Σ_i ||α_i||_0   subject to   ||x_i − D α_i||_2^2 ≤ ε,   i = 1, …, n        (1)
in which ||·||_0 is the ℓ0-norm, i.e., the number of nonzero components of a vector, and ε is a small constant that determines the maximum tolerable error in the sparse representations. Replacing the ℓ0-norm by the ℓ1-norm, Yang et al. in [15] used the following formulation for sparse coding instead of (1):

    min_{D, {α_i}} Σ_i ( ||x_i − D α_i||_2^2 + λ ||α_i||_1 )        (2)
By defining X = [x_1, …, x_n] and A = [α_1, …, α_n], it can be rewritten in matrix form as

    min_{D, A} ||X − D A||_F^2 + λ ||A||_1        (3)
in which ||·||_F stands for the Frobenius norm. Problems (2) and (1) are not equivalent, but they are closely related: (2) can be interpreted as minimizing the representation error of the signals over the dictionary while forcing the representations to be sparse by adding an ℓ1 regularization term to the error. Therefore λ can be used as a parameter that balances sparsity against error; a larger λ results in sparser representations with larger errors.
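To make the ℓ1 formulation concrete, the sketch below solves the sparse coding subproblem of (3) for a fixed dictionary using the iterative soft-thresholding algorithm (ISTA). ISTA is just one of several solvers one could use here, and the random dictionary and signals are illustrative stand-ins, not trained data:

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_ista(X, D, lam, n_iter=200):
    """Minimize ||X - D A||_F^2 + lam * ||A||_1 over A with ISTA.

    X: (m, n) signals as columns, D: (m, K) dictionary, returns A: (K, n).
    """
    L = 2 * np.linalg.norm(D, 2) ** 2       # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = 2 * D.T @ (D @ A - X)        # gradient of the quadratic term
        A = soft_threshold(A - grad / L, lam / L)
    return A

# Toy usage: a random overcomplete dictionary and a few random signals.
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 16))
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms
X = rng.standard_normal((8, 5))
A = sparse_code_ista(X, D, lam=0.5)
```

Since ISTA is initialized at A = 0 and decreases the objective monotonically, the returned codes never do worse than the all-zero representation.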

2.2 Dictionary learning for SR

Given the sets of LR and HR training patches X_l and X_h, and having (3) in mind, Yang et al. in [15] proposed the following joint dictionary learning problem to ensure that the sparse representation of the LR patches over D_l is the same as the sparse representation of the HR image patches over D_h:

    min_{D_l, D_h, A} ||X_l − D_l A||_F^2 + ||X_h − D_h A||_F^2 + λ ||A||_1        (4)
The key point here is that they have used the same matrix A for the sparse representations of both the LR and the HR patches, to make sure that the representation is the same over the dictionaries D_l and D_h. If we define the concatenated space of HR and LR patches,

    X_c = [X_l; X_h],   D_c = [D_l; D_h],

then joint dictionary training (4) can also be written equivalently as

    min_{D_c, A} ||X_c − D_c A||_F^2 + λ ||A||_1        (5)

This formulation is clearly the same as (3). In other words, in the concatenated space, joint dictionary learning is the same as conventional dictionary learning, and any dictionary learning algorithm can be used for joint dictionary learning.
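In code, this equivalence amounts to stacking the training matrices. The sketch below runs a minimal alternating-minimization dictionary learner (ISTA for the codes, a least-squares update for the dictionary) in the concatenated space and then splits the result; it is an illustrative stand-in for whichever dictionary learning algorithm one prefers, not Yang's exact training procedure:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def joint_dictionary_learning(Xl, Xh, K, lam=0.1, n_outer=20, n_inner=30):
    """Learn (Dl, Dh) by ordinary dictionary learning in the concatenated
    space Xc = [Xl; Xh], then splitting the learned dictionary.

    Alternates ISTA sparse coding with a least-squares dictionary update.
    """
    Xc = np.vstack([Xl, Xh])                  # concatenated training patches
    m, n = Xc.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((m, K))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((K, n))
    for _ in range(n_outer):
        # Sparse coding step (ISTA) for the current dictionary.
        L = 2 * max(np.linalg.norm(D, 2) ** 2, 1e-12)
        for _ in range(n_inner):
            A = soft_threshold(A - 2 * D.T @ (D @ A - Xc) / L, lam / L)
        # Dictionary update: least-squares fit D = Xc A^+.
        D = Xc @ np.linalg.pinv(A)
        norms = np.maximum(np.linalg.norm(D, axis=0), 1e-12)
        D /= norms
        A *= norms[:, None]                   # rescale codes so D A is unchanged
    ml = Xl.shape[0]
    return D[:ml], D[ml:], A                  # Dl, Dh, shared sparse codes
```

Splitting the learned D_c row-wise recovers D_l and D_h with a single shared code matrix A, exactly as in (4) and (5).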

2.3 Super-Resolution

After training the two dictionaries D_l and D_h, the input LR image can be super-resolved using the following steps:

  1. The input LR image is divided into a set of overlapping LR patches {y_l^i}.

  2. From each image patch y_l^i, subtract its mean m_i, and find the sparse representation α_i of the mean-removed patch over D_l.

  3. Using the sparse representation of each LR patch and its mean, the corresponding HR patch is estimated by y_h^i = D_h α_i + m_i.

  4. Combining the estimated HR image patches, the output HR image is generated.
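The four steps above can be sketched in Python as follows. The ISTA sparse coder, the patch size, overlap, and λ values are illustrative choices rather than the paper's actual settings, and overlapping HR pixels are combined by simple averaging:

```python
import numpy as np

def ista(x, D, lam, n_iter=100):
    """Sparse-code a single (mean-removed) patch x over dictionary D."""
    L = 2 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - 2 * D.T @ (D @ a - x) / L
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)
    return a

def super_resolve(lr_img, Dl, Dh, patch=3, scale=2, overlap=2, lam=0.1):
    """Patch-wise SR following the four steps above (toy sketch).

    Assumes Dl codes (patch x patch) LR patches and Dh reconstructs
    (scale*patch x scale*patch) HR patches.
    """
    H, W = lr_img.shape
    hp = patch * scale
    out = np.zeros((H * scale, W * scale))
    weight = np.zeros_like(out)
    step = patch - overlap
    for i in range(0, H - patch + 1, step):       # step 1: overlapping patches
        for j in range(0, W - patch + 1, step):
            p = lr_img[i:i+patch, j:j+patch].ravel()
            mu = p.mean()                         # step 2: remove the mean
            a = ista(p - mu, Dl, lam)             #         sparse code over Dl
            hr = (Dh @ a + mu).reshape(hp, hp)    # step 3: Dh @ a + mean
            out[i*scale:i*scale+hp, j*scale:j*scale+hp] += hr
            weight[i*scale:i*scale+hp, j*scale:j*scale+hp] += 1
    return out / np.maximum(weight, 1)            # step 4: average overlaps
```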

3 Our Proposed Method

Our method for SR improves the dictionary learning part of Yang's method, described in section 2.2. Once the dictionaries are trained, the rest of the method is the same as what is described in section 2.3.

As mentioned earlier, in SR the dictionaries should be trained in a way that the sparse representation of each LR patch reconstructs the corresponding HR patch well. Yang's method uses (4) to accomplish this: the same sparse representation matrix A is used for both the LR and the HR patches, so that each LR patch and its HR counterpart have the same sparse representation. However, as can be seen from (5), this joint dictionary learning is only optimal in the concatenated space of LR and HR patches; if we look at the spaces of LR and HR patches separately, we may find a sparser representation for some patches than the one found in the concatenated space.

To address this problem, note first that the SR method described in section 2.3 consists of two distinct operations: finding the sparse representation of the LR patch, and the reconstruction of the HR patch. Then, we note that the first operation uses only D_l, and the second operation uses only D_h. Therefore, instead of training the dictionaries jointly as in (4), we propose to train D_l on the LR patches alone, and then to train the HR dictionary by minimizing the reconstruction error when the sparse representations of the LR patches are used.

Mathematically, we propose to train the LR dictionary as

    (D_l, A) = argmin_{D, A} ||X_l − D A||_F^2 + λ ||A||_1        (6)
which is a conventional dictionary learning problem. After training the LR dictionary, for each LR patch its sparse representation over D_l is found (note that this step is already done during the dictionary training in (6)):

    α_i = argmin_α ||x_l^i − D_l α||_2^2 + λ ||α||_1,   i = 1, …, n        (7)
Using the sparse representations of the LR patches, A = [α_1, …, α_n], the HR dictionary is found such that the reconstruction error of the corresponding HR patches is minimized, that is,

    D_h = argmin_D ||X_h − D A||_F^2        (8)
This is an unconstrained quadratic optimization problem which has the following closed-form solution:

    D_h = X_h A^T (A A^T)^{−1} = X_h A^†        (9)

in which (·)^T and (·)^† represent the transpose and pseudo-inverse of a matrix, respectively.

Note that unlike Yang's method, in the proposed method D_h is not trained in a way that explicitly enforces the sparsity of the representation of HR patches over it; rather, it is trained to minimize the final reconstruction error.

4 Simulation Results

Figure 1: Results of Lena image magnified by a factor of 2 using: (b) Bicubic interpolation, (c) Yang’s method, (d) our proposed method. The original image is also given in (a) for comparison.

In this section we compare the performance of our method with Yang's method. The error criteria used here are the Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity (SSIM) index [24]. The PSNR criterion is defined as

    PSNR = 10 log10( 255^2 / MSE )  dB
where MSE is the mean square error, given by

    MSE = (1 / (M N)) Σ_{i=1}^{M} Σ_{j=1}^{N} ( X(i, j) − X̂(i, j) )^2
in which X is the original distortion-free image, X̂ is the super-resolved image produced by the SR algorithm, and M and N are the dimensions of the image in pixels.
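For reference, PSNR as defined above takes only a few lines to compute; the toy ramp image and the peak value of 255 below are illustrative:

```python
import numpy as np

def psnr(x, x_hat, peak=255.0):
    """PSNR = 10 log10(peak^2 / MSE) for images with pixel values in [0, peak]."""
    mse = np.mean((x.astype(float) - x_hat.astype(float)) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

# A distorted copy of a toy image; identical images would give infinite PSNR.
x = np.tile(np.arange(256, dtype=float), (4, 1))
x_noisy = np.clip(x + 10.0, 0, 255)
```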

For the definition of SSIM, refer to [24]. From these definitions it is clear that a higher PSNR means a smaller mean square error; however, it does not necessarily mean a better image quality as perceived by the human eye. Many other error criteria have been proposed to address this shortcoming of PSNR, and SSIM is one of them. Still, PSNR is widely used because of its simple mathematical form; this can be seen in (8), where MSE is used to train the HR dictionary. Nevertheless, in order to show the effectiveness of our method, here we use both SSIM and PSNR to compare the images produced by our method with those of Yang's method.

To make a fair comparison, the same set of 80000 training patches, sampled randomly from natural images, is used to train the dictionaries for both Yang's method and ours. The patches are magnified by a factor of 2 in each dimension, and the LR patches extracted from the input image have a 4 pixel overlap. The dictionary size is fixed at 1024, and the same regularization parameter λ is used for both methods as in [15].

In Fig. 1, simulation results of Yang's method and the proposed method on the Lena image can be seen. The original image and the image magnified using bicubic interpolation are also given as references. The PSNRs of these images are 32.79 dB, 34.73 dB and 34.86 dB for bicubic interpolation, Yang's method and ours, respectively (see Table 1). It is clear that the quality of the images magnified by SR is much better than that of the image magnified by bicubic interpolation, and the details are more visible, which results in sharper images. The difference between image (c) and image (d) is not noticeable visually, although the PSNR of image (d), which is super-resolved by our method, is about 0.13 dB higher.

Image       Metric   Bicubic   Yang     Proposed
Lena        PSNR     32.79     34.73    34.86
            SSIM     0.9012    0.9268   0.9283
Parthenon   PSNR     26.50     27.77    27.89
            SSIM     0.8334    0.8737   0.8762
Baboon      PSNR     24.66     25.30    25.39
            SSIM     0.9529    0.9872   0.9873
Barbara     PSNR     27.93     28.61    28.59
            SSIM     0.9609    0.9852   0.9852
Flower      PSNR     30.51     33.24    33.36
            SSIM     0.9230    0.9526   0.9538
Average     PSNR     28.47     29.93    30.02
            SSIM     0.9143    0.9451   0.9462
Table 1: PSNR (dB) and SSIM of some images magnified using bicubic interpolation, Yang's method and our proposed method. The average PSNRs and SSIMs are given in the last row.

In Table 1, the PSNRs and SSIMs of some images produced by our method are compared with those of Yang's method and bicubic interpolation. Almost all of the images recovered by our method have higher PSNRs than the images recovered by Yang's method. The average PSNRs given in the last row show that our method performs slightly better than Yang's method on average.

The SSIMs in Table 1 also confirm that our method performs better than Yang's method: the images super-resolved by the proposed method have, on average, a higher SSIM than the images recovered by Yang's method. Since SSIM is much more consistent than PSNR with image quality as perceived by the human eye, the higher SSIM of the images recovered by our method suggests that they also have better visual quality.

5 Conclusion and Future Works

In this paper, we presented a new dictionary learning algorithm for example-based SR. The dictionaries were trained from a set of sample LR and HR image patches in order to minimize the final reconstruction error. Simulation results on real images showed the effectiveness of our algorithm in super-resolving images with less error compared to Yang's method: the average PSNR and average SSIM of the images produced by our method were higher than those of the images recovered by Yang's method. In future work, this method can be extended by training the HR dictionary with a better error criterion than MSE. One of the advantages of our method is that the training of D_h in (8) is separated from the training of D_l in (6); we can therefore use another error criterion that better represents image quality, such as SSIM, in (8) without making the training of D_l more complex. Changing the error criterion in Yang's methods would make the optimizations in their algorithms much more complex.


  • [1] Sung Cheol Park, Min Kyu Park, and Moon Gi Kang, “Super-resolution image reconstruction: a technical overview,” Signal Processing Magazine, IEEE, vol. 20, no. 3, pp. 21–36, 2003.
  • [2] Lei Zhang and Xiaolin Wu, “An edge-guided image interpolation algorithm via directional filtering and data fusion,” Image Processing, IEEE Transactions on, vol. 15, no. 8, pp. 2226–2238, 2006.
  • [3] Peyman Milanfar, Super-resolution imaging, CRC Press, 2010.
  • [4] R.Y. Tsai and T.S. Huang, “Multiframe image restoration and registration,” Advances in Computer Vision and Image Processing, vol. 1, no. 2, pp. 317–339, 1984.
  • [5] Michael Elad and Arie Feuer, “Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images,” Image Processing, IEEE Transactions on, vol. 6, no. 12, pp. 1646–1658, 1997.
  • [6] Michael Elad and Arie Feuer, “Super-resolution reconstruction of image sequences,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 21, no. 9, pp. 817–834, 1999.
  • [7] David Capel and Andrew Zisserman, “Super-resolution from multiple views using learnt image models,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. IEEE, 2001, vol. 2, pp. II–627.
  • [8] Sina Farsiu, M Dirk Robinson, Michael Elad, and Peyman Milanfar, “Fast and robust multiframe super resolution,” Image processing, IEEE Transactions on, vol. 13, no. 10, pp. 1327–1344, 2004.
  • [9] William T Freeman, Thouis R Jones, and Egon C Pasztor, “Example-based super-resolution,” Computer Graphics and Applications, IEEE, vol. 22, no. 2, pp. 56–65, 2002.
  • [10] Daniel Glasner, Shai Bagon, and Michal Irani, “Super-resolution from a single image,” in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 349–356.
  • [11] Sam T Roweis and Lawrence K Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
  • [12] Hong Chang, Dit-Yan Yeung, and Yimin Xiong, “Super-resolution through neighbor embedding,” in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. IEEE, 2004, vol. 1, pp. I–275.
  • [13] Emmanuel J Candès, “Compressive sampling,” in Proceedings of the International Congress of Mathematicians: Madrid, August 22–30, 2006: invited lectures, 2006, pp. 1433–1452.
  • [14] Jianchao Yang, John Wright, Thomas Huang, and Yi Ma, “Image super-resolution as sparse representation of raw image patches,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
  • [15] Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma, “Image super-resolution via sparse representation,” Image Processing, IEEE Transactions on, vol. 19, no. 11, pp. 2861–2873, 2010.
  • [16] Jianchao Yang, Zhaowen Wang, Zhe Lin, Scott Cohen, and Thomas Huang, “Coupled dictionary training for image super-resolution,” Image Processing, IEEE Transactions on, vol. 21, no. 8, pp. 3467–3478, 2012.
  • [17] A. Taalimi, H. Qi, and R. Khorsandi, “Online multi-modal task-driven dictionary learning and robust joint sparse representation for visual tracking,” in Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on, Aug 2015, pp. 1–6.
  • [18] A. Taalimi, A. Rahimpour, C. Capdevila, Z. Zhang, and H. Qi, “Robust coupling in space of sparse codes for multi-view recognition,” in 2016 IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 3897–3901.
  • [19] Michael Elad, Sparse and redundant representations: from theory to applications in signal and image processing, Springer, 2010.
  • [20] Mostafa Rahmani and George Atia, “Innovation pursuit: A new approach to subspace clustering,” arXiv preprint arXiv:1512.00907, 2015.
  • [21] Shervin Minaee, Amirali Abdolrashidi, and Yao Wang, “Screen content image segmentation using sparse-smooth decomposition,” in 2015 49th asilomar conference on signals, systems and computers. IEEE, 2015, pp. 1202–1206.
  • [22] Mahdi Abavisani, Mohsen Joneidi, Shideh Rezaeifar, and Shahriar Baradaran Shokouhi, “A robust sparse representation based face recognition system for smartphones,” in 2015 IEEE Signal Processing in Medicine and Biology Symposium (SPMB). IEEE, 2015, pp. 1–6.
  • [23] M Joneidi, M Rahmani, HB Golestani, and M Ghanbari, “Eigen-gap of structure transition matrix: A new criterion for image quality assessment,” in Signal Processing and Signal Processing Education Workshop (SP/SPE), 2015 IEEE. IEEE, 2015, pp. 370–375.
  • [24] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli, “Image quality assessment: From error visibility to structural similarity,” Image Processing, IEEE Transactions on, vol. 13, no. 4, pp. 600–612, 2004.