Compressive sensing and sparse approximation using redundant dictionaries are important tools for a wide range of imaging applications including image/video denoising [1, 2, 3], superresolution [4, 5], compressive sensing of videos [6, 7, 8], light-fields [9, 10], hyperspectral data, and even inference tasks such as face and object recognition.
In spite of this widespread adoption in research, their adoption in commercial and practical systems is still lacking. One of the principal reasons is the computational complexity of the algorithms needed: these algorithms are either linear or super-linear in both the data dimensionality and size of the dictionary. Common applications requiring dictionaries with over atoms require computation times that may exceed several days.
As an illustrative example, consider the problem of compressive video sensing using overcomplete dictionaries . In , an overcomplete dictionary consisting of video patches was learned and utilized for compressive sensing using orthogonal matching pursuit (OMP). Reconstruction for a single video (36 frames) using dictionary atoms takes more than a day, making these methods impractical for most applications and highlighting the need for significantly faster algorithms.
I-A Motivation and Related Work
Algorithms for sparse approximation can be broadly classified into two categories: those based on convex optimization, and greedy pursuit algorithms. Several attempts have been made to develop fast algorithms for sparse regularization based on convex optimization [14, 15, 16]. In spite of the progress made, these algorithms may still be slow when the size of the dictionary and data dimensionality become large.
Fast Matching Pursuit: Matching Pursuit (MP) and its many variants build sparse approximations by sequentially choosing atoms from a dictionary and adding them one at a time to the current ensemble. At each stage, the target vector is approximated using the current ensemble, and the approximation error or “residual” is measured. Next, the atom that best approximates the current residual is selected from the dictionary. The computational bottleneck of this process is finding the dictionary atom closest to the residual. Nearest Neighbor (NN) search methods face a similar bottleneck, which has been aggressively tackled using Approximate Nearest Neighbor (ANN) search algorithms [18, 19, 20]. Most ANN search methods organize data into a tree structure that enables fast retrieval at query time. Typically a very deep tree with binary branching at every level is learned.
Hierarchical/tree approaches have been used in many applications to speed up dictionary matching. In , Batch Tree Orthogonal Matching Pursuit (BTOMP) is used to build a feature hierarchy that yields better classification. The authors of  construct trees using “kernel descriptors” for the same application. Hierarchical methods for representing image patches are studied in , . Random projections for dimensionality reduction are used in  to build hierarchical dictionaries. In , a binary hierarchical structure and PCA (Principal Component Analysis) are combined to reduce the complexity of OMP.
Unfortunately, such a deep tree does not provide a beneficial trade-off between accuracy and speedup for dictionaries, since these atoms tend to be highly coherent. Further, because they require backtracking and branch-and-bound methods, typical ANN techniques such as kd-trees do not provide reliable runtime guarantees.
In contrast, we organize the dictionary using a shallow tree (typically 3 levels) as shown in Figure 1. Our tree construction scheme is such that the resulting tree represents a balanced, hierarchical clustering of the atoms. Finally, we devise a sublinear time search algorithm for identifying the support set that provides the user with precise control over the computational speedup achieved while retaining high fidelity approximations.
We propose an algorithm for balanced hierarchical clustering of dictionary atoms. We exploit the clustering to derive a sublinear time algorithm for sparse approximation. Our method has a single parameter that provides fine-scale control over the computational speedup achieved, enabling a natural trade-off between accuracy and computation. We perform extensive experiments spanning numerous applications, in which shallow trees achieve a 150-1000x speedup (with a loss in accuracy of 1 dB or less) compared to conventional methods.
II Problem Formulation
II-A Sparse Approximation using Dictionaries
Our approach to fast dictionary coding uses Matching Pursuit (MP), which is a greedy method for selecting the constituent dictionary elements that comprise a test vector. MP is a commonly used scheme for this application because dictionary representations of image patches are extremely sparse. For computing representations involving large numbers of atoms (e.g. for representing entire images rather than just patches) more complex pursuit algorithms have been proposed  that we do not consider here.
Matching Pursuit: MP is a stage-wise scheme that builds a signal representation one atom at a time. Algorithm 1 is initialized by declaring the “residual” to be equal to the test vector. This residual represents the component of the test vector that has not yet been accounted for by the sparse approximation. In each iteration of the main loop, an atom enters the representation. The atom is selected by computing inner products with all (normalized) columns of the dictionary and selecting the atom with the largest inner product. The residual is then updated by subtracting the contribution of the entering dictionary element.
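The stage-wise procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's Algorithm 1: the function name `matching_pursuit`, the argument names, and the toy dictionary are assumptions made here, and the atom is chosen by the largest inner product in magnitude, which is the usual MP convention.

```python
import numpy as np

def matching_pursuit(D, x, n_atoms):
    """Greedy MP sketch. D holds unit-norm atoms as columns; x is the test vector."""
    residual = x.astype(float).copy()      # residual starts as the test vector
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        inner = D.T @ residual             # one inner product per dictionary atom
        j = int(np.argmax(np.abs(inner)))  # atom with the largest inner product (in magnitude)
        coeffs[j] += inner[j]
        residual -= inner[j] * D[:, j]     # subtract the entering atom's contribution
    return coeffs, residual

# Toy example (hypothetical sizes): 64 random unit-norm atoms in dimension 16.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0)
x = 2.0 * D[:, 3] - 1.5 * D[:, 40]
coeffs, residual = matching_pursuit(D, x, n_atoms=5)
```

The line `D.T @ residual` is the brute-force matching step: it touches every atom on every iteration, which is the cost the rest of the paper aims to reduce.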
MP Computational Complexity: MP requires the computation of one inner product per dictionary atom on the “matching” stage of each iteration. Since the cost of each inner product is proportional to the data dimension and there is one stage per selected atom, the overall complexity is proportional to the product of the data dimension, the number of atoms in the dictionary, and the number of stages. Note that this complexity is dominated by the number of atoms in the dictionary. For most imaging applications, the dictionary is highly over-complete. A typical image denoising method may operate on image patches (), use atoms per patch, and require dictionary elements. For video or light field applications, the dictionary may be substantially larger. The computational burden of handling large dictionaries is a major roadblock for use in applications.
II-B Problem Definition and Goals
We consider variations on MP that avoid the brute-force matching of dictionary elements. Our method is based on a hierarchical clustering method that organizes an arbitrary dictionary into a tree. The tree can be traversed efficiently to approximately select the best atom to enter the representation on each stage of MP. Our method is conceptually related to ANN methods (such as k-d trees). However, unlike conventional ANN schemes, the proposed method is customized to the problem of dictionary matching pursuit, and so differs from conventional ANN methods in several ways. The most significant difference is that the proposed method uses “shallow” trees (i.e. trees with a very small number of layers), as opposed to most ANN methods, which use very deep trees with only two branches per level.
III Algorithm for Hierarchical Clustering
The proposed method relies on a hierarchical clustering that organizes dictionaries into trees. Each node of the tree represents a group of dictionary elements. As we traverse down the tree, these groups are decomposed into smaller sub-groups. To decompose groups of atoms into meaningful components, we use an algorithm based on k-means. To facilitate fast searching of the resulting tree, we require that each node be balanced, i.e., all nodes at the same level of the tree represent the same number of atoms. Conventional k-means, when applied to image dictionaries, tends to produce highly unbalanced clusters, sometimes with as many as 90% of atoms in a single cluster. For the purpose of tree search, this is clearly undesirable, as descending to this branch of the tree does not substantially reduce the number of atoms to choose from. For this reason, the proposed clustering uses “balanced” k-means.
III-A Balanced Clustering
We now consider the problem of uniformly breaking a set of elements into smaller groups. We begin with a collection of atoms to be decomposed into groups. We apply k-means to the atoms. We then examine only the “large” clusters that contain at least atoms and discard the rest. For each large cluster, we keep the nearest atoms to the centroid, and discard the remaining atoms. Suppose such clusters are identified. The algorithm is then repeated by applying k-means to the remaining unclustered atoms to form groups. Once again, groups of at least atoms are identified, and reduced to a cluster of exactly elements. This process is repeated until the number of remaining atoms is less than , at which point the remaining atoms form their own last cluster.
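The balanced clustering loop above can be sketched as follows. Several details are assumptions made for illustration because the original symbols did not survive: the size threshold for a “large” cluster is taken to be the number of atoms divided by the number of groups, the loop stops once all but one group has been formed, and a plain Lloyd-style `kmeans` helper stands in for whatever k-means variant the authors used. All function and variable names are hypothetical.

```python
import numpy as np

def kmeans(X, n_clusters, rng, n_iter=20):
    """Plain Lloyd iterations on the rows of X; returns labels and centroids."""
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

def balanced_clusters(X, n_groups, seed=0):
    """Split the rows of X into n_groups index sets of equal size (last takes the rest)."""
    rng = np.random.default_rng(seed)
    size = len(X) // n_groups           # assumed threshold for a "large" cluster
    remaining = np.arange(len(X))
    groups = []
    while len(groups) < n_groups - 1:
        k = n_groups - len(groups)      # groups still to be formed
        labels, centroids = kmeans(X[remaining], k, rng)
        accepted = []
        for c in range(k):
            members = remaining[labels == c]
            if len(members) >= size and len(groups) + len(accepted) < n_groups - 1:
                # keep only the `size` atoms nearest to the centroid
                d2 = ((X[members] - centroids[c]) ** 2).sum(axis=1)
                accepted.append(members[np.argsort(d2)[:size]])
        if not accepted:
            break
        groups.extend(accepted)
        remaining = np.setdiff1d(remaining, np.concatenate(accepted))
    groups.append(remaining)            # leftover atoms form the last cluster
    return groups

# Toy example: 200 unit-norm atoms (stored as rows) split into 10 groups.
rng = np.random.default_rng(1)
atoms = rng.standard_normal((200, 8))
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)
groups = balanced_clusters(atoms, n_groups=10)
```

Because the remaining atoms always number at least the threshold times the number of groups still to be formed, every k-means pass yields at least one cluster that is large enough, so the loop makes progress on each pass.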
III-B Hierarchical Clustering
The clustering method in Section III-A can be used to organize dictionaries into hierarchical trees. We begin with a parent node containing all dictionary atoms. The dictionary is then decomposed into balanced groups. Each such group is considered to be a “child” of the parent node. Each child node is examined, and the atoms it contains are partitioned into groups which become its children. This process is repeated until the desired level of granularity is attained in the bottom-level (leaf) nodes of the tree.
III-C Fast Matching using Shallow Tree
Using the tree representation of the dictionary, ANN matches can be found from a given test vector . The goal is to find the dictionary entry with the largest inner product with . The tradeoff between precision and speed is controlled by a parameter . The search algorithm begins by considering the top-level node in the tree. The test vector is compared to the centroid of the cluster that each child node represents. Using the notation of Section III-B, there are such clusters. We retain the clusters with the largest inner products with the test vector. The search process is then called recursively on these nearby clusters to approximately find the closest atom in each cluster. The resulting atoms are compared to , and the closest atom is returned.
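A minimal sketch of this recursive search is given below. It is not the paper's Algorithm 2: the node layout (a dict with `idx`, `centroid`, and `children`), the names `build_tree` and `search`, and the symbol `k` for the number of retained clusters are assumptions made here. The toy builder also splits atoms into equal contiguous chunks purely to obtain a balanced tree; it is a stand-in for the balanced k-means clustering of Section III-A.

```python
import numpy as np

def build_tree(D, idx, branching):
    """Toy builder: equal contiguous chunks stand in for balanced k-means."""
    node = {"idx": idx, "centroid": D[:, idx].mean(axis=1), "children": []}
    if branching:
        for chunk in np.array_split(idx, branching[0]):
            node["children"].append(build_tree(D, chunk, branching[1:]))
    return node

def search(node, D, x, k):
    """Return (atom index, inner product) of the approximately best atom under node."""
    if not node["children"]:
        scores = D[:, node["idx"]].T @ x           # compare x to every atom in the leaf
        best = int(np.argmax(scores))
        return int(node["idx"][best]), float(scores[best])
    sims = np.array([child["centroid"] @ x for child in node["children"]])
    keep = np.argsort(sims)[::-1][:k]              # the k clusters with the largest inner products
    candidates = [search(node["children"][c], D, x, k) for c in keep]
    return max(candidates, key=lambda t: t[1])     # closest atom among the candidates

# Toy example: 240 unit-norm atoms, 6 top-level clusters, 4 sub-clusters each.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 240))
D /= np.linalg.norm(D, axis=0)
x = rng.standard_normal(16)
tree = build_tree(D, np.arange(240), branching=(6, 4))
fast_idx, fast_score = search(tree, D, x, k=1)     # fastest, least accurate setting
full_idx, full_score = search(tree, D, x, k=6)     # visits every branch of this toy tree
```

With `k` at least as large as the branching factor at every level, the search degenerates to brute-force matching, which gives a convenient sanity check; smaller values of `k` trade accuracy for fewer inner products.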
Computational Complexity: At the first level of the tree, Algorithm 2 must compute inner products. On the second level of the tree, inner products are computed, and on the third, etc. In total, the number of inner products is given by As long as remains bounded and , this grows sub-linearly with . In particular, if we choose for all then the number of inner products is and the total complexity (including the cost of inner products) is Figure 2 shows the inner products needed to match an atom for a variety of parameter choices and dictionary sizes. Note the sublinear scaling with dictionary size.
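The growth of the comparison count can be illustrated with a small counting function. This sketch rests on one reading of the search procedure, namely that the same number of clusters (called `k` here, a hypothetical symbol) is retained at every level and that each retained node compares the test vector against all of its children. The function name and the example branching factors are illustrative, not the paper's notation.

```python
def inner_products(branching, k):
    """Count comparisons for one match when k children are kept at each level."""
    total, nodes = 0, 1
    for b in branching:
        total += nodes * b     # every active node scores all b of its children
        nodes *= min(k, b)     # only the k best children stay active
    return total

# Hypothetical three-level tree with 100, 10, and 10 branches per node.
counts = {k: inner_products((100, 10, 10), k) for k in (1, 2, 4)}
exhaustive = inner_products((100, 10, 10), k=100)
```

Under these assumptions the count stays in the low hundreds for small `k`, while visiting every branch of the same tree requires over ten thousand comparisons.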
Construction of Shallow Trees: For all experiments in this article we use trees with only 3 levels. We choose and Because we have chosen the number of inner products that are computed does not grow as we descend lower into the tree.
We call the proposed method a “shallow tree” algorithm because the hierarchical clustering generates trees with only 3 levels and 100 branches on the first node. This is in sharp contrast to conventional tree-based nearest neighbor methods (see e.g., ) that rely on very deep trees with only two branches per node. For use with image dictionaries, shallow trees appear to perform much better for patch matching than conventional off-the-shelf nearest neighbor methods.
IV Experimental Results
We compare Shallow Tree Matching Pursuit (STMP) with other algorithms in terms of both run-time and reconstruction quality for a variety of problems. The main conclusion from the experiments is that STMP provides a 100-1000x speedup compared to existing sparse regularization methods with less than loss in performance.
We compare to the following techniques:
STMP: Shallow Tree Matching Pursuit with three different values of , i.e., and . Lower results in faster run-time, while larger results in better approximations. Our implementation is in MATLAB.
OMP: Orthogonal Matching Pursuit (OMP) is a popular pursuit algorithm used in several vision applications. We use the MEX implementation available as part of the K-SVD software package.
SPGL1: A MATLAB solver for large scale regularized least squares problems. This code achieves sparse coding by solving basis pursuit denoising problems.
FPC-AS: A MATLAB solver for regularized least squares based on fixed point continuation. Due to impractically slow performance, FPC-AS is only tested in imaging problems.
GPSR: A MATLAB solver for sparse reconstruction using gradient projections.
KD-Tree: We use the built-in MATLAB function for fast approximate nearest neighbor search using kd-trees to speed up traditional matching pursuit.
ANN: We use the ANN C++ library for approximate nearest neighbor matching to speed up traditional matching pursuit .
Other Notes: We sometimes use a smaller randomly sub-sampled dictionary to test the variational methods in cases where runtimes were impractically long ( hours). All experiments begin by breaking datasets into patches using a sliding frame. A restored image is then synthesized by averaging together the individual restored patches.
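The sliding-frame decomposition and the averaging step mentioned in the notes above can be sketched as follows. The function names, the 16x16 toy image, and the patch size and shift used in the example are assumptions for illustration, and the sparse coding step that would sit between extraction and averaging is omitted.

```python
import numpy as np

def extract_patches(img, p, shift):
    """Return flattened p-by-p patches and their top-left corners."""
    H, W = img.shape
    corners = [(i, j) for i in range(0, H - p + 1, shift)
                      for j in range(0, W - p + 1, shift)]
    patches = np.stack([img[i:i + p, j:j + p].ravel() for i, j in corners])
    return patches, corners

def average_patches(patches, corners, shape, p):
    """Synthesize an image by averaging overlapping (restored) patches."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for v, (i, j) in zip(patches, corners):
        acc[i:i + p, j:j + p] += v.reshape(p, p)
        cnt[i:i + p, j:j + p] += 1
    return acc / np.maximum(cnt, 1)    # pixels never covered stay at zero

# Toy round trip: without any processing, averaging recovers the image.
rng = np.random.default_rng(0)
img = rng.random((16, 16))
patches, corners = extract_patches(img, p=8, shift=2)
restored = average_patches(patches, corners, img.shape, p=8)
```

In a full pipeline each row of `patches` would be replaced by its sparse approximation before `average_patches` is called.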
IV-A Imaging Experiments
Table (header only; body not recovered): column groups Denoising and Video Compressive Sampling.
Dictionary Construction: A general image dictionary is constructed from a set of 8 natural test images from the USC-SIPI Image Database (Barbara, Boat, Couple, Elaine, House, Lena, Man, and Peppers) using patches and a shift of 2 pixels. From each image, a dictionary of 5,000 atoms is learned using the K-SVD method. These dictionaries were merged to create a 40,000 atom dictionary. This dictionary was randomly sub-sampled to create a 4,000 element dictionary for use with FPC-AS and SPGL1. The dictionary was clustered using the hierarchical scheme of Section III with 100 equally sized clusters in the first level (), 10 equally sized clusters in the second level, and 10 equally sized clusters in the third (final) level (). Sparse coding is performed using Algorithm 2.
Three images were selected from the Berkeley Segmentation Dataset (image numbers 223061, 102061, and 253027). Each image was contaminated with Gaussian white noise to achieve an SNR of 10dB. Greedy recovery was performed using 10 dictionary atoms per patch. Sample denoising results are shown in Figure 4. Time trial results are shown in Table I.
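For readers reproducing the noise setup, the sketch below shows one common way to add white Gaussian noise at a target SNR and to measure the SNR of a result. The definition used here (ratio of signal energy to error energy, without mean removal) is an assumption, since the text does not state which SNR convention was used; the function names are hypothetical.

```python
import numpy as np

def add_noise(signal, target_snr_db, rng):
    """Add white Gaussian noise scaled so the expected SNR equals target_snr_db."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / 10 ** (target_snr_db / 10)
    return signal + np.sqrt(noise_power) * rng.standard_normal(signal.shape)

def snr_db(clean, estimate):
    """SNR of an estimate in dB: signal energy over error energy."""
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((clean - estimate) ** 2))

# Toy example on a random "image" with values in [0, 1).
rng = np.random.default_rng(0)
clean = rng.random((256, 256))
noisy = add_noise(clean, target_snr_db=10, rng=rng)
measured = snr_db(clean, noisy)
```

The same `snr_db` helper can be applied to a denoised result to report reconstruction quality on the scale used in the tables.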
Image Super-resolution: This experiment enhances the resolution of an image using information from a higher resolution dictionary. We use three test images from the Berkeley Segmentation Database. Low resolution images are broken into patches. The low resolution patches are mapped onto the dictionary patches for comparison, and then matched using sparse regularization algorithms 2 with sparsity . The reconstructed high-resolution patches are then averaged together to create a super-resolution image. Sample super-resolution reconstructions are shown in Figure 4. Time trials are displayed in Table I.
Dictionary Construction: We obtained the dictionaries and high speed videos used in  from the authors for this experiment. MP experiments were done using a dictionary of atoms, and variational experiments were done using a randomly sub-sampled dictionary of atoms. Video patches of size are extracted from video frames. The dictionary was clustered using the same parameters as the image dictionary. Sparse coding is performed using Algorithm 2 with .
Video Denoising: Video denoising proceeds similarly to image denoising. The original 18 frame videos were contaminated with Gaussian white noise to have an SNR of 10dB. Patches of size were extracted from the video to create test vectors of dimension . For the “dog” video, patches were generated with a shift of 1 pixel (35673 patches), while for the “truck” video a 3 pixel shift was used (14505 patches). Sparse coding and recovery were performed using 10 atoms per patch. Sample frames from denoised videos are shown in Figure 6 and runtimes are displayed in Table II.
Video Compressive Sampling: We emulate the video compressive sampling experiments in . This experiment simulates a pixel-wise coded exposure video camera much like . The pixel-wise coded exposure video camera operates at the frame-rate of the reconstructed video and therefore results in samples/measurements compared to the original video. For reconstruction, we closely follow the approach of  and reconstruct the video by using patch-wise sparse coding using the learned dictionary. Sample frames from reconstructed videos are shown in Figure 6 and runtimes are displayed in Table II.
IV-C Light Field Analysis
Dictionary Construction: A dictionary was created for light field patches using several sample light fields: synthetic light fields created from the “Barbara” test image as well as several urban scenes and light field data from the MIT Media Lab Synthetic Light Field Archive (Happy Buddha, Messerschmitt, Dice, Green Dragon, Mini Cooper, Butterfly, and Lucy). Dictionaries are learned on 4-dimensional patches that consist of an 8x8 pixel grid and a 5x5 view window (total dimensions per patch is 8x8x5x5 = 1600). By combining patches from all training data, a dictionary with atoms was built. Again we randomly sampled the dictionary to generate a small atom dictionary for methods that were intractably slow when using the full-sized dictionary.
The dictionary was subjected to hierarchical clustering using the same parameters as the image dictionary, with 100 equally sized clusters in the first level, 10 equally sized clusters in the second level, and 10 equally sized clusters in the third (final) level. Sparse coding is performed using Algorithm 2 with .
Light-Field Denoising: Denoising experiments were performed using the “Tarot Cards” and “Crystal Ball” datasets from the Stanford Light Field Archive. We add noise to the light field to achieve an SNR of 10dB. Patches are extracted with a 2 pixel shift (15625 patches). Because of the high dimensionality of light-field patches, sparse coding was done using a sparsity of 50. Results are displayed in Figure 9 and Table III.
Table (header only; body not recovered): column groups Denoising and Light-Field from Trinocular Stereo. Columns: Method, Size, then SNR (dB), Time (min), and Run Time for each group.
Light-Field from Trinocular Stereo: In this experiment, we attempt to reconstruct a light field with views from just three cameras (trinocular), much like . The Lego Knights light field dataset from the Stanford Light Field Archive was subsampled to retain only the top middle, bottom left, and bottom right views of the view grid at each pixel. Patches of size were then sampled with a 2 pixel shift. The observed patch data was mapped onto the corresponding entries for each dictionary atom, and used for sparse coding. This reduces the dimension of the test set and dictionary from 1600 to 1600x(3/25)= 192. Sparse coding was performed with 10 dictionary atoms per patch. Restored patches were then averaged to reconstruct the full light field with views. Results are displayed in Figure 9 and Table III.
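The dimension reduction from 1600 to 192 can be made concrete by restricting each dictionary atom to the rows that correspond to the three observed views. In the sketch below the memory layout of a patch (view row, view column, then the 64 pixels) and the renormalization of the restricted atoms are assumptions for illustration; the random dictionary is a placeholder for the learned one.

```python
import numpy as np

n_pix, grid = 8 * 8, 5                     # 8x8 pixels and 5x5 views per patch
views = [(0, 2), (4, 0), (4, 4)]           # top middle, bottom left, bottom right

# Boolean mask over the 1600 patch entries, assuming a view-major layout.
mask = np.zeros((grid, grid, n_pix), dtype=bool)
for r, c in views:
    mask[r, c, :] = True
rows = np.flatnonzero(mask.ravel())        # indices of the observed entries

# Placeholder dictionary with 50 atoms of dimension 1600.
rng = np.random.default_rng(0)
D = rng.standard_normal((grid * grid * n_pix, 50))
D_obs = D[rows]                            # atoms restricted to the observed entries
D_obs = D_obs / np.linalg.norm(D_obs, axis=0)   # renormalize (an assumption)
```

Sparse codes computed against `D_obs` can then be applied to the full atoms in `D` to synthesize all 25 views of each patch; if the restricted atoms are renormalized as above, the coefficients need to be rescaled accordingly.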
V Discussion and Conclusions
The high performance of shallow trees for dictionary matching seems to contradict the conventional intuition that deeper trees are better. For image dictionaries, it seems that atoms are naturally organized into a large number of separated clusters with fairly uniform separation. By exploiting this structure at a high level, shallow trees perform highly accurate matching using relatively few comparisons. In contrast, deep tree nearest neighbor searches require a smaller number of dot products to descend to the bottom of the tree. However, these approaches require branch-and-bound methods that backtrack up the tree and explore multiple branches in order to achieve an acceptable level of accuracy. For well clustered data such as the dictionaries considered here, the shallow tree approach achieves superior performance by avoiding the high cost of backtracking searches through the tree.
-  M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” Image Processing, IEEE Transactions on, vol. 15, no. 12, pp. 3736–3745, 2006.
-  M. Protter and M. Elad, “Image sequence denoising via sparse and redundant representations,” Image Processing, IEEE Transactions on, vol. 18, no. 1, pp. 27–35, 2009.
-  J. Mairal, G. Sapiro, and M. Elad, “Learning multiscale sparse representations for image and video restoration,” DTIC Document, Tech. Rep., 2007.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” Image Processing, IEEE Transactions on, vol. 19, no. 11, pp. 2861–2873, 2010.
-  D. Kong, M. Han, W. Xu, H. Tao, and Y. Gong, “Video super-resolution with scene-specific priors.” in BMVC, 2006, pp. 549–558.
-  M. Wakin, J. Laska, M. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. F. Kelly, and R. G. Baraniuk, “Compressive imaging for video representation and coding,” in Picture Coding Symposium, 2006.
-  D. Reddy, A. Veeraraghavan, and R. Chellappa, “P2c2: Programmable pixel compressive camera for high speed imaging,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 329–336.
-  Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, and S. K. Nayar, “Video from a single coded exposure photograph using a learned over-complete dictionary,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 287–294.
-  K. Marwah, G. Wetzstein, Y. Bando, and R. Raskar, “Compressive light field photography using overcomplete dictionaries and optimized projections,” ACM TRANSACTIONS ON GRAPHICS, vol. 32, no. 4, 2013.
-  B. Salahieh, A. Ashok, and M. Neifeld, “Compressive light field imaging using joint spatio-angular modulation,” in Computational Optical Sensing and Imaging. Optical Society of America, 2013.
-  M. Li, J. Shen, and L. Jiang, “Hyperspectral remote sensing images classification method based on learned dictionary,” in 2013 International Conference on Information Science and Computer Applications (ISCA 2013). Atlantis Press, 2013.
-  Z. Jiang, Z. Lin, and L. Davis, “Label consistent k-svd: Learning a discriminative dictionary for recognition,” IEEE, 2013.
-  Y. C. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Signals, Systems and Computers, 1993. 1993 Conference Record of The Twenty-Seventh Asilomar Conference on. IEEE, 1993, pp. 40–44.
-  M. A. Figueiredo, R. D. Nowak, and S. J. Wright, “Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems,” Selected Topics in Signal Processing, IEEE Journal of, vol. 1, no. 4, pp. 586–597, 2007.
-  E. van den Berg and M. P. Friedlander, “SPGL1: A solver for large-scale sparse reconstruction,” June 2007, http://www.cs.ubc.ca/labs/scl/spgl1.
-  E. T. Hale, W. Yin, and Y. Zhang, “Fixed-point continuation for ℓ1-minimization: Methodology and convergence,” SIAM Journal on Optimization, vol. 19, no. 3, pp. 1107–1130, 2008.
-  R. Gribonval, “Fast matching pursuit with a multiscale dictionary of gaussian chirps,” Signal Processing, IEEE Transactions on, vol. 49, no. 5, pp. 994–1001, 2001.
-  S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, “An optimal algorithm for approximate nearest neighbor searching fixed dimensions,” Journal of the ACM (JACM), vol. 45, no. 6, pp. 891–923, 1998.
-  A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on. IEEE, 2006, pp. 459–468.
-  M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration.” in VISAPP (1), 2009, pp. 331–340.
-  L. Bo, X. Ren, and D. Fox, “Hierarchical matching pursuit for image classification: Architecture and fast algorithms,” in Advances in Neural Information Processing Systems, 2011, pp. 2115–2123.
-  L. Bo, K. Lai, X. Ren, and D. Fox, “Object recognition with hierarchical kernel descriptors,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, June 2011, pp. 1729–1736.
-  K. Yu, Y. Lin, and J. Lafferty, “Learning image representations from the pixel level via hierarchical sparse coding,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, June 2011, pp. 1713–1720.
-  L. Bo, X. Ren, and D. Fox, “Multipath sparse coding using hierarchical matching pursuit,” in NIPS workshop on deep learning, 2012.
-  B. Chen, G. Polatkan, G. Sapiro, D. Blei, D. Dunson, and L. Carin, “Deep learning with hierarchical convolutional factor analysis,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1887–1901, Aug 2013.
-  Z. J. Xiang, H. Xu, and P. J. Ramadge, “Learning sparse representations of high dimensional data on large scale dictionaries,” in Advances in Neural Information Processing Systems, 2011, pp. 900–908.
-  J.-L. Lin, W.-L. Hwang, and S.-C. Pei, “Fast matching pursuit video coding by combining dictionary approximation and atom extraction,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 17, no. 12, pp. 1679–1689, Dec 2007.
-  D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–321, Apr. 2008. [Online]. Available: http://arxiv.org/abs/0803.2392
-  R. Rubinstein, M. Zibulevsky, and M. Elad, “Efficient implementation of the k-svd algorithm using batch orthogonal matching pursuit,” CS Technion, 2008.
-  D. Liu, J. Gu, Y. Hitomi, M. Gupta, T. Mitsunaga, and S. Nayar, “Efficient space-time sampling with pixel-wise coded exposure for high speed imaging,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. PP, no. 99, pp. 1–1, 2013.
-  K. Mitra and A. Veeraraghavan, “Light field denoising, light field superresolution and stereo camera based refocussing using a gmm light field patch prior,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. IEEE, 2012, pp. 22–28.