I Introduction
There are several ways to represent the low-dimensional structure of high-dimensional data, the best known being principal component analysis (PCA). However, PCA is based on a linear subspace model that is generally not capable of capturing the geometric structure of real-world datasets [1].
The sparse signal model is a nonlinear generalization of the linear subspace model that has been used in various signal and image processing tasks [2, 3, 4], as well as compressive sensing [5]. This model assumes that each data sample can be represented as a linear combination of a few elements (atoms) from a dictionary. Data-adaptive dictionary learning can lead to a much more compact representation than predefined dictionaries such as wavelets, and thus a central problem is finding a good data-adaptive dictionary.
Dictionary learning algorithms such as the method of optimal directions (MOD) [6] and the K-SVD algorithm [7] aim to learn a dictionary by minimizing the representation error of the data in an iterative procedure that alternates between two steps: sparse coding and dictionary update. The latter often requires ready access to the entire dataset at a central processing unit.
Due to increasing sizes of datasets, not only do algorithms take longer to run, but it may not even be feasible or practical to acquire and/or hold every data entry. In applications such as distributed databases, where data is typically distributed over an interconnected set of distributed sites [8], it is important to avoid communicating the entire data.
A promising way to address these issues is to take a compressive sensing approach, where we only have access to compressive measurements of the data. In fact, performing signal processing and data mining tasks on compressive versions of the data has been an important topic in the recent literature. For example, in [9], certain inference problems such as detection and estimation within the compressed domain have been studied. Several lines of work consider recovery of principal components [10, 11, 12, 13], spectral features [14], and change detection [15] from compressive measurements.
In this paper, we focus on the problem of dictionary learning based on compressive measurements. Our contributions are twofold. First, we show the connection between dictionary learning in the compressed domain and K-means clustering. Most standard dictionary learning algorithms are indeed generalizations of the K-means clustering algorithm [16], and reference to K-means is a common approach to analyzing the performance of these algorithms [7, 17]. This paper takes initial steps towards providing theoretical guarantees for recovery of the true underlying dictionary from compressive measurements. Moreover, our analysis applies to compressive measurements obtained by a general class of random matrices consisting of i.i.d. zero-mean entries and finite first four moments.
Second, we extend the prior work in [18], where compressive dictionary learning for random Gaussian matrices is considered. In particular, we propose a memory- and computation-efficient dictionary learning algorithm applicable to modern data settings. To do this, we learn a dictionary from very sparse random projections, i.e., projections of the data onto a few very sparse random vectors with Bernoulli-generated nonzero entries. These sparse random projections have been applied in many large-scale applications such as compressive sensing and object tracking [19, 20] and to efficient learning of principal components in the large-scale data setting [13]. To further improve the efficiency of our approach, we show how to share the same random matrix across blocks of data samples.
II Prior Work on Compressive Dictionary Learning
Several attempts have been made to address the problem of dictionary learning from compressive measurements. In three roughly contemporary papers [21, 22], and our work [18], three similar algorithms were presented to learn a dictionary based on compressive measurements. Each was inspired by the well-known K-SVD algorithm and closely followed its structure, except in that each aimed to minimize the representation error of the compressive measurements instead of that of the original signals. The exact steps of each algorithm have minor differences, but take a similar overall form.
However, none of these works explicitly aimed at designing the compressive measurements (sketches) to promote the computational efficiency of the resulting compressive K-SVD, so that it would be maximally practical for dictionary learning on large-scale data. Moreover, none of these works gave a theoretical performance analysis for such computationally efficient sketches.
In this paper, we extend the previous line of work on compressive dictionary learning by analyzing the scheme under assumptions that make it memory and computation efficient. The key to the efficiency of the new scheme is in considering a wider and more general class of random projection matrices for the sketches, including some very sparse ones. We further introduce an initial analysis of the theoretical performance of compressive dictionary learning under these more general random projections.
In this section, we review the general dictionary learning problem and the compressive K-SVD (CK-SVD) algorithm that was introduced in [18] for the case of random Gaussian matrices. (We note that the approaches of [22] and [21] are similar.) Given a set of training signals in , the dictionary learning problem is to find a dictionary that leads to the best representation under a strict sparsity constraint for each member in the set, i.e., minimizing
(1) 
where is the coefficient matrix and the pseudo-norm counts the number of nonzero entries of the coefficient vector . Moreover, the columns of the dictionary are typically assumed to have unit norm. Problem (1) is generally intractable, so we look for approximate solutions (e.g., via K-SVD [7]).
We then consider compressed measurements (sketches), where each measurement is obtained by taking inner products of the data sample with the columns of a matrix , i.e., with , , and . In [18], the entries of are i.i.d. from a zero-mean Gaussian distribution, which is an assumption we drop in the current paper.
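As a minimal sketch of the measurement step, the snippet below draws one random matrix per sample and forms each sketch from the inner products of the sample with the columns of that matrix. The names `sketch_samples`, `X`, `R`, `Y`, `p`, `n`, and `m` are illustrative assumptions, and Gaussian entries are used only because that is the setting of [18].

```python
import numpy as np

def sketch_samples(X, m, rng):
    """Draw one p-by-m matrix R_i per sample and return (R, Y) with y_i = R_i^T x_i."""
    p, n = X.shape
    R = rng.standard_normal((n, p, m))   # Gaussian entries (the setting of [18])
    Y = np.einsum("ipm,pi->mi", R, X)    # column i of Y holds the m inner products for x_i
    return R, Y

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))          # toy data: p = 8 dimensions, n = 5 samples
R, Y = sketch_samples(X, m=3, rng=rng)   # Y has shape (3, 5)
```

Only `Y` and the matrices `R` (or the seed that generates them) need to be kept; the original samples are never revisited.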
Given access only to the compressed measurements and not , we attempt to solve the following compressive dictionary learning problem:
(2) 
In the CK-SVD algorithm, the objective function in (2) is minimized in a simple iterative approach that alternates between sparse coding and dictionary update steps.
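For orientation, the full-data and compressive objectives can be written side by side in one common notation. The symbols used here ($x_i$ for a data sample, $D$ for the dictionary, $c_i$ for a coefficient vector, $T$ for the sparsity level, $R_i$ for the random matrix, $y_i$ for the sketch, $n$ for the number of samples) are assumptions chosen for illustration, following the usual K-SVD convention, and the exact form is a plausible reading of the description above rather than a verbatim restatement.

```latex
\begin{align*}
\text{full data:}\quad
  &\min_{D,\,C}\ \sum_{i=1}^{n}\bigl\|x_i - D c_i\bigr\|_2^2
  \quad\text{s.t.}\quad \|c_i\|_0 \le T \ \ \forall i,\\
\text{compressive:}\quad
  &\min_{D,\,C}\ \sum_{i=1}^{n}\bigl\|y_i - R_i^{T} D c_i\bigr\|_2^2
  \quad\text{s.t.}\quad \|c_i\|_0 \le T \ \ \forall i,
  \qquad y_i = R_i^{T} x_i .
\end{align*}
```

The only change in the compressive problem is that both the data and the dictionary are seen through $R_i^{T}$, which is why the two alternating steps keep the same structure as in K-SVD.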
II-A Sparse Coding
In the sparse coding step, the penalty term in (2) is minimized with respect to a fixed to find the coefficient matrix under the strict sparsity constraint. This can be written as
(3) 
where is a fixed equivalent dictionary for representation of . This optimization problem can be considered as distinct optimization problems for each compressive measurement. We can then use a variety of algorithms, such as OMP, to find the approximate solution [23].
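The sparse coding step can be illustrated with a plain orthogonal matching pursuit routine applied to the equivalent dictionary of one measurement. This is a minimal sketch, not the implementation used in the paper; the function name `omp` and the arguments `A` (equivalent dictionary), `y` (one compressive measurement), and `T` (sparsity level) are assumptions.

```python
import numpy as np

def omp(A, y, T):
    """Greedy T-sparse approximation of y using the columns of A."""
    norms = np.linalg.norm(A, axis=0)
    residual, support = y.copy(), []
    coef = np.zeros(0)
    for _ in range(T):
        corr = np.abs(A.T @ residual) / norms      # normalized correlations
        if support:
            corr[support] = 0.0                    # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        # Re-fit all selected coefficients jointly by least squares
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    c = np.zeros(A.shape[1])
    c[support] = coef
    return c
```

For sample `i`, one would call `omp(R_i.T @ D, y_i, T)`, so each measurement is coded independently against its own equivalent dictionary.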
II-B Dictionary Update
The approach is to update the dictionary atom and its corresponding coefficients while holding fixed for , and then repeat for , until . The penalty term in (2) can be written as
(4) 
where is the element of , is a set of indices of compressive measurements for which , and is the representation error for when the dictionary atom is removed. The penalty term in (4) is a quadratic function of and the minimizer is obtained by setting the derivative with respect to equal to zero. Hence,
(5) 
where and . Therefore, we get the closed-form solution , where denotes the Moore-Penrose pseudoinverse of . Once given the new (normalized to have unit norm), the optimal for each is given by least squares as . By design, the support of the coefficient matrix is preserved, just as in the K-SVD algorithm.
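The update of a single atom can be sketched as follows. Setting the derivative of the quadratic penalty to zero gives a linear system whose matrix is the coefficient-weighted sum of the products of each random matrix with its transpose; the code solves it with the pseudoinverse, normalizes the atom, and then refits the coefficients on the existing support. All names (`update_atom`, `D`, `C`, `R`, `Y`) are illustrative assumptions, and the layout of `R` as an array of shape (n, p, m) is a choice made for this example.

```python
import numpy as np

def update_atom(k, D, C, R, Y):
    """Update atom k of D (in place) from sketches y_i = R_i^T x_i; support of C is kept."""
    idx = np.flatnonzero(C[k])                     # samples whose code uses atom k
    if idx.size == 0:
        return
    p = D.shape[0]
    D_wo_k = D.copy()
    D_wo_k[:, k] = 0.0                             # dictionary with atom k removed
    # Representation error of each sketch when atom k is removed
    E = {i: Y[:, i] - R[i].T @ (D_wo_k @ C[:, i]) for i in idx}
    G, b = np.zeros((p, p)), np.zeros(p)
    for i in idx:
        G += C[k, i] ** 2 * R[i] @ R[i].T          # weighted sum of R_i R_i^T
        b += C[k, i] * R[i] @ E[i]
    d = np.linalg.pinv(G) @ b                      # closed-form minimizer
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return
    d /= norm                                      # unit-norm atom
    D[:, k] = d
    for i in idx:                                  # scalar least squares per sample
        a = R[i].T @ d
        C[k, i] = (a @ E[i]) / (a @ a)
```

Looping `update_atom` over all atoms, after a sparse coding pass, gives one full iteration of the alternating scheme described in this section.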
III Initial Theoretical Analysis: K-Means Case
In this section, we provide an initial theoretical analysis of the performance of the CK-SVD algorithm by restricting our attention to a special case of dictionary learning: K-means clustering. In this special case, we can provide theoretical guarantees on the performance of CK-SVD at every step, in relation to the steps of K-means. Moreover, these guarantees will hold for a very general class of projection matrices including very sparse random projections.
We consider a statistical framework to establish the connection between CK-SVD and K-means. K-means clustering can be viewed as a special case of dictionary learning in which each data sample is allowed to use one dictionary atom (cluster center), i.e. , and the corresponding coefficient is set to be . Therefore, we consider the following generative model
(6) 
where is the center of the cluster, and represent residuals in signal approximation and they are drawn i.i.d. from , so that the approximation error is . The set is an arbitrary partition of , with the condition that as . The random matrices are assumed to satisfy the following:
Assumption 1.
Each entry of the random matrices is drawn i.i.d. from a general class of zero-mean distributions with finite first four moments .
We will see that the distribution’s kurtosis is a key factor in our results. The kurtosis, defined as , is a measure of peakedness and heaviness of tail for a distribution.
We now show how, in this special case of K-means, CK-SVD would update the cluster centers. As mentioned before, in this case, we should set and the corresponding coefficients are set to be . This means that for all , we have , and for , and it leads to . Then, the update formula for the dictionary atom of CK-SVD given in (5) reduces to
(7) 
Hence, similar to K-means, the updates of the dictionary atoms become independent of each other. We can rewrite (7) as , where
(8) 
In [13], it is shown that . Thus, we see
(9)  
Therefore, when the number of samples is sufficiently large, using the law of large numbers, and converge to and . Hence, the updated dictionary atom in our CK-SVD is the original center of cluster, i.e. , exactly as in K-means. Note that in this case even one measurement per signal is sufficient.
The following theorem characterizes convergence rates for and based on various parameters such as the number of samples and the choice of random matrices.
Theorem 2.
(10) 
where
(11) 
Also, consider the probabilistic model given in (6) and compressive measurements . Then, defined in (8) converges to the center of original data and for any , we have
(12) 
where
(13) 
and the signal-to-noise ratio is defined as .
We see that for a fixed error bound , as increases, the error probability decreases at rate . Therefore, for any fixed , the error probability goes to zero as . Note that the shape of the distribution, specified by the kurtosis, is an important factor. For random matrices with heavy-tailed entries, the error probability increases. However, gives us an explicit tradeoff between , the measurement ratio, and anisotropy in the distribution. For example, the increase in kurtosis can be compensated by increasing . The convergence rate analysis for follows the same path. We further note that is a decreasing function of the signal-to-noise ratio and as SNR increases, gets closer to , where for the case that , then .
Let us consider an example to gain intuition on the choice of random matrices. We are interested in comparing dense random Gaussian matrices with very sparse random matrices, where each entry is drawn from with probabilities for (we refer to this distribution as a sparse-Bernoulli distribution with parameter ). , and , are generated with i.i.d. entries both for Gaussian and the sparse-Bernoulli distribution. In Fig. 1, we see that as increases, gets closer to the identity matrix . Also, for a fixed , as the sparsity of random matrices increases, the kurtosis increases. Therefore, based on Theorem 2, we expect that the distance between and increases. Note that for Gaussian and the sparse-Bernoulli with , we have .
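The effect of sparsity on the kurtosis, and on how quickly the averaged outer products approach the identity, can be explored numerically. The sketch below assumes the common parametrization of very sparse random projections, with entries in {-1, 0, +1} drawn with probabilities {1/(2s), 1 - 1/s, 1/(2s)}, and the excess-kurtosis convention E[r^4]/E[r^2]^2 - 3; both are assumptions, as are all function names. Under this parametrization the kurtosis equals s - 3, so it grows with sparsity and vanishes at s = 3, matching the Gaussian value of zero.

```python
import numpy as np

def sparse_bernoulli(shape, s, rng):
    """Entries in {-1, 0, +1} with probabilities {1/(2s), 1 - 1/s, 1/(2s)} (assumed form)."""
    return rng.choice([-1.0, 0.0, 1.0], size=shape, p=[0.5 / s, 1 - 1 / s, 0.5 / s])

def excess_kurtosis(r):
    """Sample version of E[r^4] / E[r^2]^2 - 3 for zero-mean entries."""
    return np.mean(r ** 4) / np.mean(r ** 2) ** 2 - 3.0

def distance_to_identity(R):
    """Frobenius distance between the normalized average of R_i R_i^T and the identity."""
    n, p, m = R.shape
    H = np.einsum("ipm,iqm->pq", R, R) / (n * m * np.mean(R ** 2))
    return np.linalg.norm(H - np.eye(p))

rng = np.random.default_rng(0)
for s in (3, 10, 50):
    R = sparse_bernoulli((2000, 20, 4), s, rng)    # n = 2000 matrices of size 20 x 4
    print(s, excess_kurtosis(R), distance_to_identity(R))
```

Varying the number of matrices, the number of columns, and `s` in the loop gives a quick feel for the tradeoff between sample size, measurement ratio, and kurtosis discussed above.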
As a final note, our theoretical analysis gives us valuable insight about the number of distinct random matrices required. Based on Theorem 2, there is an inherent tradeoff between the accuracy and the number of distinct random matrices used. For example, if we only use one random matrix, we are not able to recover the true dictionary as observed in [24]. Also, increasing the number of distinct random matrices improves the accuracy, as mentioned in [21]. Hence, we can reduce the number of distinct random matrices in large-scale problems where with controlled loss in accuracy.
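The role of distinct random matrices can be seen in a small simulation of the K-means case with a single cluster. The sketch below estimates a cluster center from one measurement per sample, once with a different random vector for every sample and once with the same vector shared by all samples. The sizes, noise level, and names are assumptions chosen for illustration.

```python
import numpy as np

def center_from_sketches(R, Y):
    """Solve (sum_i R_i R_i^T) d = sum_i R_i y_i using the pseudoinverse."""
    G = np.einsum("ipm,iqm->pq", R, R)
    b = np.einsum("ipm,mi->p", R, Y)
    return np.linalg.pinv(G) @ b

rng = np.random.default_rng(0)
p, n, m = 10, 2000, 1                               # one measurement per sample
center = rng.standard_normal(p)
X = center[:, None] + 0.05 * rng.standard_normal((p, n))

R_distinct = rng.standard_normal((n, p, m))         # a new matrix for every sample
R_shared = np.broadcast_to(rng.standard_normal((p, m)), (n, p, m))  # one matrix reused

Y_distinct = np.einsum("ipm,pi->mi", R_distinct, X)
Y_shared = np.einsum("ipm,pi->mi", R_shared, X)

err_distinct = np.linalg.norm(center_from_sketches(R_distinct, Y_distinct) - center)
err_shared = np.linalg.norm(center_from_sketches(R_shared, Y_shared) - center)
print(err_distinct, err_shared)
```

With a single shared matrix, the summed outer products have rank at most `m`, so only the component of the center inside the range of that matrix can be recovered. With distinct matrices the sum becomes full rank and the estimate approaches the true center as the number of samples grows.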
IV Memory and Computation Efficient Dictionary Learning
Now, we return our attention to general dictionary learning. Inspired by the generality of the projection matrices in Theorem 2, we sketch using very sparse random matrices, and furthermore reduce the number of distinct random matrices to increase the efficiency of our approach.
Assume that the original data samples are divided into blocks , where represents the block. Let , , represent the random matrix used for the block. Then, we have
(14) 
where is the sketch of . Each entry of is distributed on with probabilities . Here, the parameter controls the sparsity of random matrices such that each column of has nonzero entries, on average. We are specifically interested in choosing and such that the compression factor . Thus, the cost to acquire each compressive measurement is , , vs. the cost for collecting every data entry .
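A minimal sketch of the block-wise acquisition follows. It splits the samples into blocks, draws one very sparse random matrix per block, and sketches all samples of a block with that matrix. The entry values {-1, 0, +1} with probabilities {1/(2s), 1 - 1/s, 1/(2s)} are an assumed parametrization, and `sketch_blocks`, `num_blocks`, `m`, and `s` are hypothetical names.

```python
import numpy as np

def sketch_blocks(X, num_blocks, m, s, rng):
    """Split the columns of X into blocks and sketch block j with one shared sparse R_j."""
    p, n = X.shape
    blocks = np.array_split(np.arange(n), num_blocks)   # column indices of each block
    R = rng.choice([-1.0, 0.0, 1.0], size=(num_blocks, p, m),
                   p=[0.5 / s, 1 - 1 / s, 0.5 / s])
    Y = [R[j].T @ X[:, cols] for j, cols in enumerate(blocks)]
    return R, Y, blocks

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 1000))
R, Y, blocks = sketch_blocks(X, num_blocks=10, m=16, s=8, rng=rng)
print(np.mean(R != 0))                                  # fraction of nonzeros, near 1/s
```

Under this parametrization each column of a random matrix has about p/s nonzero entries, so a sparse-matrix product would touch only that many entries of a sample per measurement. The dense product above is used only to keep the example short.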
Similarly, we aim to minimize the representation error as
(15) 
where represents the sample in the block of the coefficient matrix . As before, the penalty term in (15) is minimized in a simple iterative approach involving two steps. The first step, sparse coding, is the same as in the CK-SVD algorithm previously described, except that we can take advantage of the block structure and use Batch-OMP [25] in each block, which is significantly faster than running OMP for each separately.
IV-A Dictionary Update
The goal is to update the dictionary atom for , while assuming that , , is fixed. The penalty term in (15) can be written as
(16) 
where is the element of , and is the representation error for the compressive measurement when the dictionary atom is removed. The objective function in (16) is a quadratic function of and the minimizer is obtained by setting the derivative of the objective function with respect to equal to zero. First, let us define as a set of indices of compressive measurements in the block using . Therefore, we get the following expression
(17) 
where is defined as the sum of squares of all the coefficients related to the dictionary atom in the block, i.e. .
Note that can be computed efficiently: concatenate in a matrix , and define the diagonal matrix as
(18) 
where represents a square diagonal matrix with the elements of vector on the main diagonal. Then, we have .
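The computational shortcut described here can be checked numerically. The sketch below compares a block-by-block weighted sum of outer products with a single product built from the concatenated random matrices and a diagonal weighting in which each block weight is repeated once per column of that block. The function names and the (N, p, m) layout of `R` are assumptions made for this illustration.

```python
import numpy as np

def block_energies(C, k, blocks):
    """Sum of squared coefficients of atom k over the samples in each block."""
    return np.array([np.sum(C[k, cols] ** 2) for cols in blocks])

def gram_by_sum(R, weights):
    """Weighted sum of R_j R_j^T, accumulated block by block."""
    return sum(w * Rj @ Rj.T for w, Rj in zip(weights, R))

def gram_by_concat(R, weights):
    """Same matrix computed as R_cat diag(w) R_cat^T with R_cat = [R_1, ..., R_N]."""
    N, p, m = R.shape
    R_cat = np.concatenate(list(R), axis=1)        # shape (p, N * m)
    w = np.repeat(weights, m)                      # each block weight repeated m times
    return (R_cat * w) @ R_cat.T                   # column scaling equals right-multiplying by diag(w)
```

Since the concatenated matrix is fixed across iterations and very sparse, only the diagonal weights change from one atom update to the next.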
Given the updated , the optimal , for all , is given by least squares as
V Experimental Results
We examine the performance of our dictionary learning algorithm on a synthetic dataset. Our proposed method is compared with the fast and efficient implementation of K-SVD known as Approximate K-SVD (AK-SVD) [25] that requires access to the entire data. We generate dictionary atoms in , , drawn from the uniform distribution and normalized to have unit norm. A set of data samples is generated where each sample is a linear combination of three distinct atoms, i.e. , and the corresponding coefficients are chosen i.i.d. from the Gaussian distribution . Then, each data sample is corrupted by Gaussian noise drawn from .
CK-SVD is applied on the set of compressive measurements obtained by very sparse random matrices for various values of the compression factor . We set the number of blocks and . Performance is evaluated by the magnitude of the inner product between learned and true atoms. A value greater than is counted as a successful recovery. Fig. 2 shows the results of CK-SVD averaged over independent trials. In practice, when is small, the updates for are nearly decoupled, and we may delay updating until after all updates of . For , the accuracy results are indistinguishable.
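The evaluation criterion can be sketched as a small helper that, for every true atom, finds the best-matching learned atom by the magnitude of the inner product and counts it as recovered when that value exceeds a threshold. The function name is hypothetical, and the threshold is left as an explicit argument because its value is not reproduced here.

```python
import numpy as np

def recovery_rate(D_true, D_learned, threshold):
    """Fraction of true atoms matched by some learned atom with |inner product| > threshold."""
    A = D_true / np.linalg.norm(D_true, axis=0)        # normalize columns to unit norm
    B = D_learned / np.linalg.norm(D_learned, axis=0)
    best = np.abs(A.T @ B).max(axis=1)                 # best match for each true atom
    return float(np.mean(best > threshold))
```

Taking the absolute value makes the score insensitive to the sign of each atom, and taking the maximum over learned atoms makes it insensitive to their ordering, which are the two ambiguities inherent in dictionary learning.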
In Fig. 2, we see that our method is able to eventually reach high accuracy even for , achieving substantial savings in memory/data access. However, there is a tradeoff between memory and computation savings vs. accuracy. Our method is efficient in memory/computation and, at the same time, accurate for and , where it outperforms AK-SVD if the time of each iteration is factored in. We compare with AK-SVD to give an idea of our efficiency, but note that AK-SVD and our CK-SVD are not completely comparable. In our example, both methods reach accuracy eventually but in general they may give different levels of accuracy. The main advantage of CK-SVD appears as the dimensions grow, since then memory/data access is a dominant issue.
Acknowledgment
This material is based upon work supported by the National Science Foundation under Grant CCF-1117775. This work utilized the Janus supercomputer, which is supported by the National Science Foundation (award number CNS-0821794) and the University of Colorado Boulder. The Janus supercomputer is a joint effort of the University of Colorado Boulder, the University of Colorado Denver and the National Center for Atmospheric Research.
References
 [1] R. Baraniuk, V. Cevher, and M. Wakin, “Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective,” Proceedings of the IEEE, vol. 98, no. 6, pp. 959–971, 2010.
 [2] M. Elad, M. Figueiredo, and Y. Ma, “On the role of sparse and redundant representations in image processing,” Proceedings of the IEEE, vol. 98, no. 6, pp. 972–982, 2010.
 [3] J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 791–804, 2012.
 [4] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. Bach, “Supervised dictionary learning,” Advances in neural information processing systems, pp. 1033–1040, 2009.
 [5] E. Candès and M. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, 2008.
 [6] K. Engan, S. Aase, and J. Husoy, “Method of optimal directions for frame design,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1999, pp. 2443–2446.
 [7] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
 [8] H. Raja and W. Bajwa, “Cloud K-SVD: Computing data-adaptive representations in the cloud,” in 51st Annual Allerton Conf. on Communication, Control, and Computing (Allerton), 2013, pp. 1474–1481.
 [9] M. Davenport, P. Boufounos, M. Wakin, and R. Baraniuk, “Signal processing with compressive measurements,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 445–460, 2010.
 [10] J. Fowler, “Compressive-projection principal component analysis,” IEEE Transactions on Image Processing, vol. 18, no. 10, pp. 2230–2242, 2009.
 [11] H. Qi and S. Hughes, “Invariance of principal components under low-dimensional random projection of the data,” in IEEE International Conference on Image Processing (ICIP), 2012, pp. 937–940.
 [12] F. Pourkamali-Anaraki and S. Hughes, “Efficient recovery of principal components from compressive measurements with application to Gaussian mixture model estimation,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 2314–2318.
 [13] F. Pourkamali-Anaraki and S. Hughes, “Memory and computation efficient PCA via very sparse random projections,” in Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, pp. 1341–1349.
 [14] A. Gilbert, J. Park, and M. Wakin, “Sketched SVD: Recovering spectral features from compressive measurements,” arXiv preprint arXiv:1211.0361, 2012.
 [15] G. Atia, “Change detection with compressive measurements,” IEEE Signal Processing Letters, vol. 22, no. 2, pp. 182–186, 2015.
 [16] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning. Springer, 2009, vol. 2, no. 1.
 [17] K. Skretting and K. Engan, “Recursive least squares dictionary learning algorithm,” IEEE Transactions on Signal Processing, vol. 58, no. 4, pp. 2121–2130, 2010.
 [18] F. Pourkamali-Anaraki and S. Hughes, “Compressive K-SVD,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 5469–5473.
 [19] T. Do, L. Gan, N. Nguyen, and T. Tran, “Fast and efficient compressive sensing using structurally random matrices,” IEEE Transactions on Signal Processing, vol. 60, no. 1, pp. 139–154, 2012.
 [20] K. Zhang, L. Zhang, and M. Yang, “Fast compressive tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 10, pp. 2002–2015, 2014.
 [21] J. Silva, M. Chen, Y. Eldar, G. Sapiro, and L. Carin, “Blind compressed sensing over a structured union of subspaces,” arXiv preprint arXiv:1103.2469, Tech. Rep., 2011.
 [22] C. Studer and R. Baraniuk, “Dictionary learning from sparsely corrupted or compressed signals,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 3341–3344.
 [23] J. Tropp and S. Wright, “Computational methods for sparse solution of linear inverse problems,” Proceedings of the IEEE, vol. 98, no. 6, pp. 948–958, 2010.
 [24] S. Gleichman and Y. Eldar, “Blind compressed sensing,” IEEE Transactions on Information Theory, vol. 57, no. 10, pp. 6958–6975, 2011.
 [25] R. Rubinstein, M. Zibulevsky, and M. Elad, “Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit,” CS Technion, 2008.