Visual signals exhibit strong correlations across scales that can often be modeled and exploited to enhance image processing algorithms [2, 27]. An important example of this idea is the multi-scale coding of images using the wavelet-tree model, which provides both a sparse and a predictive model for the occurrence of non-zero wavelet coefficients across scales. Specifically, the wavelet-tree model arranges the wavelet coefficients of an image onto a tree such that nodes of the tree correspond to coefficients and each level corresponds to the coefficients associated with a particular scale. Under this organization, the dominant non-zero coefficients form a connected rooted sub-tree, i.e., children of a node with small wavelet coefficients are expected to take small values as well. The wavelet-tree model is central to many compression, sensing [7, 11], and processing algorithms.
Learnt dictionaries provide an alternative to wavelets for obtaining sparse representations. Given a large amount of data, many approaches exist for learning a dictionary such that the training dataset can be expressed as sparse linear combinations of the elements/atoms of the dictionary. The reliance on machine learning, as opposed to analytic constructions as in the case of wavelets, provides immense flexibility towards obtaining a dictionary that is tuned to the specifics of a particular signal class or application. Yet, in spite of a large body of work devoted to learning sparse representations, there is little work devoted to learning predictive models that exploit correlations across spatial, temporal, spectral, and angular scales.
In this paper, we propose a multi-scale dictionary model for visual signals that naturally enables cross-scale prediction. Our contributions are as follows.
Model. We propose a novel signal model that uses multi-scale sparsifying dictionaries to provide cross-scale prediction for a wide array of visual signals. Specifically, given the set of sparsifying dictionaries — one for each scale — the non-zero support patterns of a signal and its downsampled counterparts are constrained to exhibit only specific pre-determined patterns.
Computational speedups. We show that the proposed signal model, with its constrained support pattern across scales, naturally enables cross-scale prediction that can be used to speed up algorithms like OMP. We term our algorithm zero tree OMP, as the sparse representation of the signal forms a zero tree under the proposed model.
Learning. Given large collections of training data, we propose a simple training method, a modified form of the K-SVD algorithm, to obtain dictionaries that are consistent with our proposed model.
Validation. We verify the proposed model empirically through simulations on an array of visual signals, including images, videos, hyper-spectral images, and light fields.
The organization of this paper is as follows. In Section 2, we present related work in the area of sparse representation and dictionary learning. Section 3 introduces the proposed model, details the benefits of the model in terms of speedup, and finally, presents an approach to learn the model given training data. In Section 4, we validate our approach on a range of visual signals to verify our model.
2 Prior work
We denote vectors in bold font and matrices in capital letters. A vector is said to be $k$-sparse if it has at most $k$ non-zero entries. The list of indices of the non-zero entries of a sparse vector is termed its support; the support of a vector $\mathbf{x}$ is denoted as $\mathrm{supp}(\mathbf{x})$. The $\ell_0$-norm of a vector, $\|\mathbf{x}\|_0$, is the number of its non-zero entries. Finally, given a dictionary $D \in \mathbb{R}^{n \times N}$ and a support set $\mathcal{S}$, $D_{\mathcal{S}}$ refers to the matrix of size $n \times |\mathcal{S}|$ formed by selecting the columns of $D$ corresponding to the elements of $\mathcal{S}$; similarly, given a vector $\mathbf{x}$, $\mathbf{x}_{\mathcal{S}}$ refers to the $|\mathcal{S}|$-dimensional vector formed by selecting the entries of $\mathbf{x}$ corresponding to $\mathcal{S}$.
Sparse approximation problems arise in a wide range of settings. The broad problem definition is as follows: given a vector $\mathbf{y} \in \mathbb{R}^n$ and a matrix $D \in \mathbb{R}^{n \times N}$, we solve
$$\min_{\mathbf{x}} \|\mathbf{y} - D\mathbf{x}\|_2 \quad \mathrm{s.t.} \quad \|\mathbf{x}\|_0 \le k. \qquad \mathrm{(P0)}$$
There are many approaches to solving (P0) and its many variants. Of particular interest to this paper is orthogonal matching pursuit (OMP), a greedy approach to solving (P0). OMP recovers the support of the sparse vector one element at a time by finding the column of the dictionary that is most correlated with the current residue. Each iteration of the algorithm has three steps: first, the index of the atom that is closest in angle to the current residue is added to the support; second, a least-squares problem over the updated support is solved to obtain the current estimate; and third, the residue is updated by removing the contribution of the current estimate.
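The three steps above can be sketched in a short NumPy implementation. This is an illustrative, unoptimized version of OMP; the function and variable names (`omp`, `D`, `y`, `k`) are ours rather than from the paper, and unit-norm dictionary columns are assumed.

```python
import numpy as np

def omp(D, y, k):
    """Greedy k-sparse approximation of y in the dictionary D (columns
    assumed unit norm). Returns the sparse coefficient vector x."""
    residue, support = y.copy(), []
    for _ in range(k):
        # Step 1: index of the atom closest in angle to the residue.
        idx = int(np.argmax(np.abs(D.T @ residue)))
        if idx not in support:
            support.append(idx)
        # Step 2: least-squares fit over the updated support.
        c, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        # Step 3: remove the current estimate's contribution from y.
        residue = y - D[:, support] @ c
    x = np.zeros(D.shape[1])
    x[support] = c
    return x
```

For an orthonormal dictionary and an exactly sparse signal, this recovers the representation exactly in $k$ iterations.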
Speeding up OMP.
Obtaining sparse representations with high accuracy often requires dictionaries with a large number of redundant elements. Figure 2 shows timing vs. accuracy for a dictionary of image patches with a varying number of dictionary atoms, $N$. We observe that the increase in accuracy enabled by a dictionary with a larger number of atoms comes with increased computational time as well. A number of techniques have been devoted to speeding up different aspects of the problem. For problems in high dimensions, i.e., large $n$, one approach is to work with random projections of the dictionary. Specifically, as opposed to minimizing $\|\mathbf{y} - D\mathbf{x}\|_2$, we minimize $\|\Phi\mathbf{y} - \Phi D\mathbf{x}\|_2$, where $\Phi \in \mathbb{R}^{m \times n}$, with $m \ll n$, is a random matrix that preserves the geometry of the problem, thereby allowing us to do all computations in an $m$-dimensional space. In the context of high-dimensional data, it is typical to have dictionaries with a very large number of atoms, i.e., $N \gg n$. Here, the search for the atom closest to the residue becomes the most time-consuming step. One approach to speeding up OMP uses approximate nearest neighbors and shallow-tree matching [4, 15]. Another approach restricts the search space by imposing a tree structure on the sparse coefficients. Speedups in OMP have also been obtained through parallel implementations of the search for atoms, and through tweaking the least-squares step. However, such methods provide smaller improvements for very large problem sizes.
For signal classes that have no obvious sparsifying transforms, a promising approach is to learn a dictionary that provides sparse representations for the specific class of interest. Olshausen and Field, in a seminal contribution, showed that patches of natural images were sparsified by a dictionary containing Gabor-like atoms; this provided a connection between sparse coding and the receptive fields in the visual cortex. More recently, Aharon et al. proposed the K-SVD algorithm, which can be viewed as an extension of the k-means clustering algorithm to dictionary learning. Given a collection of training data $\{\mathbf{y}_i\}$, K-SVD aims to learn a dictionary $D$ such that $\mathbf{y}_i \approx D\mathbf{x}_i$ with each $\mathbf{x}_i$ being $k$-sparse.
Multi-scale dictionary models.
The idea of coupling multi-scale models and sparsifying dictionaries has been explored before. Thiagarajan et al. provide a multi-level representation of image patches where simple patches with little texture are captured at the early levels, while more complex textures are only resolved at the higher levels. This provides speedups when solving sparse approximation problems, since patches that occur more often are captured at the earlier levels. Jenatton et al. present a hierarchical dictionary learning mechanism that imposes a tree structure on the sparsity, which forces the dictionary atoms to cluster like a tree. While it does yield higher reconstruction accuracy, little is said about the speedups obtained. Mairal et al. learn a dictionary based on quad-tree models, where each patch is further sub-divided into four non-overlapping patches. While this method gives better accuracy, the algorithm is very slow, as the authors themselves note. None of these multi-scale learning algorithms exploit cross-scale coding, especially for visual signals.
The goal of this paper is to construct dictionaries endowed with structured sparse representations, similar to the wavelet-tree model, and enable computational speedups in solving sparse approximation problems.
Compressive sensing (CS).
An application of sparse representations is in CS, where we sense signals from far fewer measurements than their dimensionality. CS relies on a low-dimensional representation for the sensed signal; a sparse representation under a transform or a dictionary is one example. There is a rich body of work on applying compressive sensing to imaging and to sensing visual signals, including videos, light fields [23, 29], and hyperspectral images [8, 19, 20]. Most relevant to our paper is the video CS work of Hitomi et al., where a sparsifying dictionary on video patches is used to recover high-speed videos from low-frame-rate sensors. Hitomi et al. also demonstrated the accuracy enabled by very large dictionaries; specifically, they obtain remarkable results with a dictionary with a very large number of atoms for high-dimensional video patches.
Our proposed method is inspired by the multi-resolution representations and tree models enabled by wavelets. In particular, Baraniuk shows that the non-zero wavelet coefficients form a rooted sub-tree for signals that have trends (smooth variations) and anomalies (edges and discontinuities). Hence, piecewise-smooth signals enjoy a sparse representation with a structured support pattern in which the non-zero wavelet coefficients form a rooted sub-tree. Similar properties have also been shown for 2D images under the separable Haar basis. However, in spite of these elegant results for images, there are no obvious sparsifying bases for higher-dimensional visual signals like videos and light-field images. To address this, we build cross-scale predictive models, similar to the wavelet-tree model, by replacing a basis with an over-complete dictionary that is capable of providing sparse representations for a wide class of signals.
3 Proposed signal model
Proposed cross-scale predictive sparse model.
We propose a signal model that predicts the support of a signal across scales (see Figure 3). For simplicity, we first present the model for a two-scale scenario.
Given a collection of signals $\{\mathbf{y}_i\}$, with $\mathbf{y}_i \in \mathbb{R}^n$, our proposed signal model consists of two sparsifying dictionaries, a fine-scale dictionary $D_h \in \mathbb{R}^{n \times N_h}$ and a coarse-scale dictionary $D_\ell \in \mathbb{R}^{n_\ell \times N_\ell}$, that satisfy the following three properties.
Sparse approximation at the finer scale. A signal $\mathbf{y}$ enjoys a $k_h$-sparse representation in $D_h$, i.e., $\mathbf{y} \approx D_h \mathbf{x}_h$ with $\|\mathbf{x}_h\|_0 \le k_h$.
Sparse approximation at the coarser scale. Given $\mathbf{y}$ and a downsampling operator $L$, the downsampled signal $\mathbf{y}_\ell = L\mathbf{y}$ enjoys a $k_\ell$-sparse representation in $D_\ell$, i.e., $\mathbf{y}_\ell \approx D_\ell \mathbf{x}_\ell$ with $\|\mathbf{x}_\ell\|_0 \le k_\ell$. The downsampling operator is domain specific.
Cross-scale prediction. The support of $\mathbf{x}_h$ is constrained by the support of $\mathbf{x}_\ell$; specifically, $\mathrm{supp}(\mathbf{x}_h) \subseteq f(\mathrm{supp}(\mathbf{x}_\ell))$, where the mapping $f$ is known a priori.
We make a few observations.
Observation 1. We require $N_h \gg N_\ell$, since the fine-scale signal is of higher dimension. With the increase in dimension of the signal, more complex patterns emerge, which require a larger number of redundant elements. Empirically, we found that the number of atoms in a dictionary, for a given approximation accuracy, increases super-linearly with the dimension of the signal.
Observation 2. Recall that the computational time for OMP is proportional to the number of atoms in the dictionary since, at each iteration of the algorithm, we need to compute the inner products between the residue and the atoms of the dictionary. If we can constrain the search space by constraining the number of atoms, then we obtain computational speedups.
The proposed model obtains speedups by first solving a sparse approximation problem at the coarse scale and subsequently exploiting the cross-scale prediction property to constrain the support at the finer scale. The speedups rely on two intuitive ideas: first, solving a sparse approximation problem with fewer atoms (and in smaller dimensions) is faster, since OMP's runtime is linear in the number of atoms of the dictionary; and second, once we know the support of $\mathbf{x}_\ell$, we can simply discard all atoms of $D_h$ that do not belong to $f(\mathrm{supp}(\mathbf{x}_\ell))$, since the support of $\mathbf{x}_h$ is guaranteed to lie within it.
We use a simple strategy for the cross-scale mapping $f$. Let $b = N_h / N_\ell$ (assuming $N_h$ and $N_\ell$ are chosen to ensure that $b$ is an integer). The cross-scale prediction map is defined using the following simple rule: each element of the support at the coarser scale controls the inclusion/exclusion of a non-overlapping block of $b$ locations for the sparse vector at the finer scale, i.e., $f(\mathcal{S}) = \bigcup_{i \in \mathcal{S}} \{(i-1)b + 1, \ldots, ib\}$. As a consequence, the cardinality of $f(\mathrm{supp}(\mathbf{x}_\ell))$ is simply $b\,\|\mathbf{x}_\ell\|_0$.
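As a minimal illustration of this block rule, the mapping can be written as follows (we use 0-indexed blocks, a notational choice of ours; the function name is hypothetical):

```python
def cross_scale_support(coarse_support, b):
    """Admissible fine-scale indices: each coarse-scale index i unlocks
    the non-overlapping block of b consecutive fine-scale indices
    {i*b, ..., (i+1)*b - 1} (0-indexed)."""
    fine = set()
    for i in coarse_support:
        fine.update(range(i * b, (i + 1) * b))
    return sorted(fine)
```

For instance, with $b = 4$, a coarse support $\{0, 2\}$ unlocks fine-scale indices $\{0,1,2,3\} \cup \{8,9,10,11\}$, a set of cardinality $b$ times the coarse sparsity.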
Solving inverse problems under the proposed signal model.
We now detail the procedure for solving a sparse approximation problem using the proposed signal model (see Figure 3). Specifically, we seek to recover $\mathbf{y}$ from a set of linear measurements of the form
$$\mathbf{z} = A\mathbf{y} + \mathbf{e},$$
where $A$ is the measurement matrix and $\mathbf{e}$ is the measurement noise. As indicated earlier, we obtain $\mathbf{y}$ using a two-step procedure.
Step 1 — Sparse approximation at the coarse scale. We first solve the following sparse approximation problem:
$$\widehat{\mathbf{x}}_\ell = \arg\min_{\mathbf{x}} \|\mathbf{z} - A U D_\ell \mathbf{x}\|_2 \quad \mathrm{s.t.} \quad \|\mathbf{x}\|_0 \le k_\ell.$$
Here, $U$ is an upsampling operator such that $LU$ is an identity map on the coarse-scale signal space. In all our experiments, we used a uniform downsampler and a nearest-neighbour upsampler specific to the domain of the signal.
This first step recovers a low-resolution approximation to the signal, $\widehat{\mathbf{y}}_\ell = D_\ell \widehat{\mathbf{x}}_\ell$.
Step 2 — Sparse approximation at the finer scale. Armed with the support $\widehat{\mathcal{S}}_\ell = \mathrm{supp}(\widehat{\mathbf{x}}_\ell)$, we solve for $\mathbf{x}_h$:
$$\widehat{\mathbf{x}}_h = \arg\min_{\mathbf{x}} \|\mathbf{z} - A D_h \mathbf{x}\|_2 \quad \mathrm{s.t.} \quad \|\mathbf{x}\|_0 \le k_h, \ \ \mathrm{supp}(\mathbf{x}) \subseteq f(\widehat{\mathcal{S}}_\ell).$$
The sparse approximation problems in both steps are solved using OMP. The proposed cross-scale mapping of the sparse support forms a zero tree, in which a coefficient is zero whenever the corresponding coefficient at the coarser scale is zero. Hence, we refer to our algorithm as zero tree OMP.
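The two-step procedure can be sketched as follows for the denoising case (measurement matrix equal to the identity). The helper `omp_restricted` and all names here are our own; this is a simplified illustration under a 0-indexed block mapping, not the authors' implementation.

```python
import numpy as np

def omp_restricted(D, y, k, allowed):
    # OMP that searches only over the subset 'allowed' of atom indices.
    allowed = list(allowed)
    residue, support = y.copy(), []
    c = np.zeros(0)
    for _ in range(k):
        idx = allowed[int(np.argmax(np.abs(D[:, allowed].T @ residue)))]
        if idx not in support:
            support.append(idx)
        c, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residue = y - D[:, support] @ c
    x = np.zeros(D.shape[1])
    x[support] = c
    return x

def zero_tree_omp(Dl, Dh, L, y, kl, kh, b):
    # Step 1: small sparse problem on the downsampled signal.
    xl = omp_restricted(Dl, L @ y, kl, range(Dl.shape[1]))
    # Step 2: fine-scale OMP over only the atoms unlocked by the
    # coarse-scale support (the zero-tree constraint).
    allowed = [j for i in np.flatnonzero(xl) for j in range(i * b, (i + 1) * b)]
    return omp_restricted(Dh, y, kh, allowed)
```

The fine-scale search in Step 2 runs over at most $b\,k_\ell$ atoms rather than all $N_h$, which is the source of the speedup.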
We next provide precise expressions for the expected speedups over traditional single-scale OMP. Since any analysis of speedup has to account for the complexity of implementing $A$, we consider the denoising problem, where $A$ is the identity map.
Let $T(N, k)$ be the amount of time required to solve a sparse approximation problem using OMP for a dictionary with $N$ atoms and sparsity level $k$. Hence, obtaining $\widehat{\mathbf{x}}_h$ directly from $D_h$ would require $T(N_h, k_h)$ computations. In contrast, our proposed two-step solution using cross-scale prediction has a computational cost of $T(N_\ell, k_\ell) + T(b k_\ell, k_h)$.
To compute the dependence of $T$ on $N$ and $k$, recall that each iteration of the OMP algorithm requires $O(nN)$ operations for computing the inner products between the residue and the dictionary atoms, $O(N)$ operations to find the maximally aligned atom, and $O(nk^2)$ operations for the least-squares step. Thus,
$$T(N, k) = O\big(k(nN + N + nk^2)\big).$$
For dictionaries with a large number of atoms, i.e., large $N$, and small values of the sparsity level $k$, the linear dependence on $N$ dominates the total computation time. Here, the speedup provided by our algorithm is approximately $N_h / (N_\ell + b k_\ell)$.
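As a rough numerical illustration of this analysis, one can tabulate the per-problem cost and the implied speedup. The functional form below is our own simplification of the cost model described above; all names and the example parameter values are hypothetical.

```python
def omp_cost(N, k, n):
    """Simplified OMP cost model: k iterations, each with O(nN) inner
    products, O(N) max search, and O(n k^2) least squares."""
    return k * (n * N + N + n * k * k)

def predicted_speedup(Nh, Nl, kl, kh, n, nl):
    """Ratio of the single-scale cost T(Nh, kh) to the two-step
    zero-tree cost T(Nl, kl) + T(b * kl, kh), with b = Nh / Nl."""
    b = Nh // Nl
    two_step = omp_cost(Nl, kl, nl) + omp_cost(b * kl, kh, n)
    return omp_cost(Nh, kh, n) / two_step
```

For example, with a 4096-atom fine-scale dictionary, a 64-atom coarse-scale dictionary, and small sparsity levels, the model predicts an order-of-magnitude speedup.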
Learning cross-scale sparse models.
We can learn the dictionaries with a simple modification to the K-SVD algorithm.
Inputs. The inputs to the learning/training phase are the training dataset $\{\mathbf{y}_i\}$ and the values of the parameters $N_\ell$, $N_h$, $k_\ell$, and $k_h$.
Step 1 — Learning $D_\ell$. We learn the coarse-scale dictionary by applying K-SVD to the downsampled training dataset $\{L\mathbf{y}_i\}$. A by-product of learning the dictionary is the set of supports of the sparse approximations of the downsampled training data.
Step 2 — Learning $D_h$. We learn the fine-scale dictionary by solving
$$\min_{D_h, \{\mathbf{x}_i\}} \sum_i \|\mathbf{y}_i - D_h \mathbf{x}_i\|_2^2 \quad \mathrm{s.t.} \quad \|\mathbf{x}_i\|_0 \le k_h, \ \ \mathrm{supp}(\mathbf{x}_i) \subseteq f(\mathrm{supp}(\mathbf{x}_{\ell,i})).$$
The above optimization problem can be solved simply by modifying the sparse approximation step of K-SVD to restrict the support appropriately. Figure 4 shows an example of the learned low-resolution atoms and the corresponding high-resolution atoms. Observe that merely constraining the sparse support of the high-resolution approximation learns atoms that are very similar in appearance to the low-resolution atoms, which strongly supports our signal model. We also note that the time required for training the model was about the same as that required to learn a single high-resolution dictionary of the same specification using K-SVD.
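The modification to K-SVD can be sketched as follows: the only change relative to standard K-SVD is that the sparse-coding pass restricts each training signal's support to the set of atoms predicted from its coarse-scale approximation. The code below is our own illustrative sketch under that assumption, not the authors' implementation.

```python
import numpy as np

def restricted_omp(D, y, k, allowed):
    # OMP that searches only over the subset 'allowed' of atom indices.
    allowed = list(allowed)
    residue, support = y.copy(), []
    c = np.zeros(0)
    for _ in range(k):
        idx = allowed[int(np.argmax(np.abs(D[:, allowed].T @ residue)))]
        if idx not in support:
            support.append(idx)
        c, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residue = y - D[:, support] @ c
    x = np.zeros(D.shape[1])
    x[support] = c
    return x

def modified_ksvd(Y, D, allowed_per_signal, k, n_iters=5):
    """K-SVD in which the sparse-coding step restricts each signal's
    support to its predicted (cross-scale) set of admissible atoms.

    Y: (n, M) training signals; D: (n, N) initial dictionary;
    allowed_per_signal[i]: admissible atom indices for signal i.
    """
    for _ in range(n_iters):
        # Sparse coding with restricted supports: the only change vs. K-SVD.
        X = np.stack([restricted_omp(D, Y[:, i], k, allowed_per_signal[i])
                      for i in range(Y.shape[1])], axis=1)
        # Standard K-SVD dictionary update, one atom at a time.
        for j in range(D.shape[1]):
            users = np.flatnonzero(X[j, :])
            if users.size == 0:
                continue
            # Residual with atom j's own contribution added back in.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]            # rank-1 update: unit-norm atom
            X[j, users] = s[0] * Vt[0, :]
    return D, X
```

Since only the coding step changes, the per-iteration cost of the dictionary update is identical to that of standard K-SVD.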
The design parameters in the two-scale dictionary training are $k_\ell$ and $k_h$. $k_\ell$ can be chosen to fine-tune the accuracy at the coarser scale; we found that a small $k_\ell$ sufficed for high model-approximation accuracy. $k_h$ must be at least $k_\ell$, since at least one fine-scale atom corresponding to each coarse-scale atom enters the representation; $k_h$ can be increased further to improve approximation accuracy without much reduction in speedup. The value of $b$, the number of degrees of freedom for super-resolving each low-resolution atom, is highly dependent on the ratio $N_h / N_\ell$; in our experiments, $b$ was chosen to match this ratio.
4 Experimental results
We compare zero tree OMP using our proposed two-scale dictionaries against traditional OMP on dictionaries learnt using K-SVD. We compare both the run time and the approximation accuracy for images, videos, hyperspectral images, and light-field images. We quantify approximation accuracy using the recovered SNR (RSNR), defined as follows: given a signal $\mathbf{y}$ and its estimate $\widehat{\mathbf{y}}$, $\mathrm{RSNR} = 20 \log_{10} \left( \|\mathbf{y}\|_2 / \|\mathbf{y} - \widehat{\mathbf{y}}\|_2 \right)$.
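Taking recovered SNR to be the standard quantity $20 \log_{10}(\|\mathbf{y}\|_2 / \|\mathbf{y} - \widehat{\mathbf{y}}\|_2)$ (our reading of the stripped formula), it is computed as:

```python
import numpy as np

def recovered_snr(y, y_hat):
    """Recovered SNR in dB: 20 * log10(||y||_2 / ||y - y_hat||_2)."""
    return 20.0 * np.log10(np.linalg.norm(y) / np.linalg.norm(y - y_hat))
```

An estimate whose error norm is one-tenth of the signal norm yields an RSNR of 20 dB.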
We learn a two-scale dictionary on image patches; each patch was scaled down to form the coarse-scale data. The low-resolution dictionary consisted of 64 atoms and the high-resolution dictionary of 1024 atoms. We compared the results against a 1024-atom single-scale dictionary trained using the K-SVD algorithm. We performed image denoising with the learned dictionaries. Figures 9(a) and (e) show performance in terms of recovered SNR for denoising and inpainting, as compared against traditional OMP. Note that significant speedups are obtained even for this small problem setting.
Figure 1 shows demosaicing of the Bayer pattern using both methods. We trained a high-resolution dictionary on Kodak true-color RGB images and a low-resolution dictionary on the downscaled patches, and compared these against a single-scale dictionary. Training the two-scale dictionary took considerably less time than training the single-scale dictionary.
Figure 5 shows image denoising results. We perform denoising with the trained RGB dictionaries on overlapping patches. With hardly any reduction in accuracy, our method runs significantly faster.
We trained a high-resolution dictionary for video patches and a low-resolution dictionary for the downscaled patches, and compared them against a single-scale dictionary obtained using K-SVD. We maintained the same sparsity across all dictionaries. Figures 9(b) and (f) show the performance of our proposed method and conventional K-SVD+OMP for denoising and video compressive sensing, where we implemented the temporal sampling method proposed by Hitomi et al. A visualization of the recovered frames is shown in Figure 6. The reconstruction accuracy is similar for both methods.
We trained over-complete dictionaries from 32-channel (31 channels + 1 channel repeated for computational ease) hyperspectral images from the dataset of Chakrabarti and Zickler. A high-resolution dictionary and a low-resolution dictionary for the downscaled patches were trained using our proposed method, and compared against a dictionary learnt using K-SVD. We tested the dictionaries on denoising and image demosaicing; the results are shown in Figures 9(c) and (g), respectively. A visualization of the demosaiced images can be seen in Figure 7.
We trained over-complete dictionaries from light-field data. A 32768-atom high-resolution dictionary and a low-resolution dictionary for the uniformly downscaled patches were trained using the proposed method, and compared against a 32768-atom dictionary learned using K-SVD. We tested the dictionaries on denoising and on image reconstruction using compressive angle sampling of light-field data; the results are shown in Figures 9(d) and (h), respectively. Reconstruction from a compressively sampled light field on synthetic data is shown in Figure 8. We obtained a substantial speedup for reconstruction from compressive sampling.
Table 1 and Figure 9 quantify the performance of the proposed signal model and of models obtained using K-SVD for a wide range of parameters as well as signals. Across the board, we observe that the proposed framework provides approximations that are as good as those obtained with K-SVD, with speedups that are modest for small-sized problems and substantial for larger problems. The speedups obtained are comparable to prior results, with higher approximation accuracies for our proposed method.
As a result of the speedup of the sparse coding step, we also obtain significant speedups during the training phase using the modified K-SVD, which makes it feasible to deal with very large-scale problems.
Comparison of zero tree OMP vs. OMP-based processing for various applications. (a) Image denoising, (b) video denoising, (c) hyperspectral image denoising, (d) light-field denoising, (e) image inpainting (the varying parameter is the number of unknown pixel values per known pixel), (f) video compressive sensing using coded images (the varying parameter is the number of frames recovered from each coded image), (g) hyperspectral image demosaicing (the varying parameter is the number of spectral channels combined into one image), and (h) light-field compressive sensing using random angle sampling (the varying parameter is the number of sub-aperture images reconstructed from a single sub-aperture image). Observe that the accuracy curves for OMP and for our proposed algorithm are very comparable.
5 Conclusion and discussions
We presented a signal model that enables cross-scale prediction for visual signals. Our method is particularly appealing because of its simple extension to the existing OMP and K-SVD algorithms, while providing significant speedups at little or no loss in accuracy. The computational gains provided by our algorithm are especially significant for problems involving high-dimensional dictionaries with a large number of atoms.
Beyond two scales.
All our experiments are in the setting of two-scale dictionaries. Extending them to more scales should give significant speedups for very large dictionaries on high-dimensional problems. However, with increasing problem size, the size of the training dataset also grows significantly and can potentially become a bottleneck for training stable dictionaries.
Connections to super resolution using sparse representations.
Zeyde et al. learn a pair of low-resolution and high-resolution dictionaries that share the same sparsity pattern. Given a low-resolution patch $\mathbf{y}_\ell$, they solve the sparse approximation problem $\mathbf{y}_\ell \approx D_\ell \mathbf{x}$ and then super-resolve the patch as $\mathbf{y}_h = D_h \mathbf{x}$. In contrast, our method requires the high-resolution image, and uses the sparse representation of the downscaled image to predict the high-resolution sparse support. While the primary aim of their work is image-based super-resolution, our method can accommodate any inverse problem based on sparse approximation.
The authors gratefully acknowledge support from Intel Corporation.
-  Kodak lossless true color image suite. http://r0k.us/graphics/kodak/. Accessed: 2015-10-05.
-  E. Adelson, E. Simoncelli, and W. T. Freeman. Pyramids and multiscale representations. Representations and Vision, Gorea A.,(Ed.). Cambridge University Press, Cambridge, pages 3–16, 1991.
-  M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Processing, 54(11):4311–4322, 2006.
-  A. Ayremlou, T. Goldstein, A. Veeraraghavan, and R. G. Baraniuk. Fast sublinear sparse representation using shallow tree matching pursuit. arXiv preprint arXiv:1412.0680, 2014.
-  R. G. Baraniuk. Optimal tree approximation with wavelets. In SPIE Intl. Symp. Optical Science, Engineering, and Instrumentation, 1999.
-  R. G. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4), 2007.
-  R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Trans. Information Theory, 56(4):1982–2001, 2010.
-  J. Bieniarz, R. Muller, X. Zhu, and P. Reinartz. On the use of overcomplete dictionaries for spectral unmixing. In 4th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, 2012.
-  A. Chakrabarti and T. Zickler. Statistics of real-world hyperspectral images. In IEEE Conf. Computer Vision and Pattern Recognition, 2011.
-  D. G. Dansereau, O. Pizarro, and S. B. Williams. Decoding, calibration and rectification for lenselet-based plenoptic cameras. In IEEE Conf., Computer Vision and Pattern Recognition, 2013.
-  S. Deutsch, A. Averbuch, and S. Dekel. Adaptive compressed image sensing based on wavelet modeling and direct sampling. In SAMPTA, 2009.
-  M. Elad. Sparse and redundant representations: From theory to applications in signal and image processing. Springer, 2010.
-  Y. Fang, L. Chen, J. Wu, and B. Huang. GPU implementation of orthogonal matching pursuit for compressive sensing. In IEEE Intl. Conf on Parallel and Distributed Systems (ICPADS), 2011.
-  M. Gharavi-Alkhansari and T. S. Huang. A fast orthogonal matching pursuit algorithm. In IEEE Intl. Conf. Acoustics, Speech, Signal Processing, 1998.
-  R. Gribonval. Fast matching pursuit with a multiscale dictionary of Gaussian chirps. IEEE Trans. Signal Processing, 49(5), 2001.
-  Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, and S. K. Nayar. Video from a single coded exposure photograph using a learned over-complete dictionary. In IEEE Intl. Conf. Computer Vision, 2011.
-  R. Jenatton, J. Mairal, F. R. Bach, and G. R. Obozinski. Proximal methods for sparse hierarchical dictionary learning. In Intl. Conf., Machine Learning, 2010.
-  C. La and M. N. Do. Tree-based orthogonal matching pursuit algorithm for signal reconstruction. In IEEE Intl. Conf. Image Processing, 2006.
-  S. Li and H. Qi. Sparse representation based band selection for hyperspectral images. In IEEE Intl. Conf. Image Processing, 2011.
-  X. Lin, Y. Liu, J. Wu, and Q. Dai. Spatial-spectral encoded compressive hyperspectral imaging. ACM Trans. Graphics.
-  B. Mailhé, R. Gribonval, F. Bimbot, and P. Vandergheynst. A low complexity orthogonal matching pursuit for sparse signal approximation with shift-invariant dictionaries. In IEEE Intl. Conf. Acoustics, Speech, Signal Processing, 2009.
-  J. Mairal, G. Sapiro, and M. Elad. Multiscale sparse image representation with learned dictionaries. In IEEE Intl. Conf. Image Processing, 2007.
-  K. Marwah, G. Wetzstein, Y. Bando, and R. Raskar. Compressive light field photography using overcomplete dictionaries and optimized projections. ACM Trans. Graphics, 32:46, 2013.
-  B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision research, 37(23):3311–3325, 1997.
-  Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Asilomar Conf. Signals, Systems, Computers, 1993.
-  S. Pelletier. Acceleration methods for image super-resolution. PhD thesis, McGill University, 2009.
-  A. Secker and D. Taubman. Lifting-based invertible motion adaptive transform (LIMAT) framework for highly scalable video compression. IEEE Trans. Image Processing, 12(12):1530–1542, 2003.
-  J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans. Signal Processing, 41(12):3445–3462, 1993.
-  S. Tambe, A. Veeraraghavan, and A. Agrawal. Towards motion aware light field video for dynamic scenes. In IEEE Intl. Conf. Computer Vision, 2013.
-  J. J. Thiagarajan, K. N. Ramamurthy, and A. Spanias. Learning stable multilevel dictionaries for sparse representations. IEEE Trans. Neural Networks and Learning Systems, PP(99), 2014.
-  S. N. Vitaladevuni, P. Natarajan, and R. Prasad. Efficient orthogonal matching pursuit using sparse random projections for scene and video classification. In IEEE Intl. Conf. Computer Vision, 2011.
-  M. J. Wainwright, E. P. Simoncelli, and A. S. Willsky. Random cascades on wavelet trees and their use in analyzing and modeling natural images. Appl. Comp. Harmonic Analysis, 11(1):89–123, 2001.
-  R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In Curves and Surfaces, pages 711–730. Springer, 2012.