Linear encoding of sparse vectors is widely popular, but it is most commonly data-independent, missing any possible extra (but a priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used ℓ_1 decoder. The convex ℓ_1 decoder prevents gradient propagation as needed in standard autoencoder training. Our method is based on the insight that unfolding the convex decoder into T projected subgradient steps can address this issue. Our method can be seen as a data-driven way to learn a compressed sensing measurement matrix. Our experiments show that there is indeed additional structure beyond sparsity in several real datasets. Our autoencoder is able to discover and exploit this structure to create excellent reconstructions with fewer measurements than previous state-of-the-art methods.
Assume we have some unknown data vector x ∈ R^d. We can observe only a few (m ≪ d) linear equations of its entries, and we would like to design these projections by creating a measurement matrix A ∈ R^{m×d} such that the projections y = Ax allow exact (or near-exact) recovery of the original vector x.
If m < d, this is an ill-posed problem in general: we are observing m linear equations with d unknowns, so any vector on the subspace {x' : Ax' = y} satisfies our observed measurements. In this high-dimensional regime, the only hope is to make a structural assumption on x, so that unique reconstruction is possible. A natural approach is to assume that the data vector is sparse. The problem of designing measurement matrices and reconstruction algorithms that recover sparse vectors from linear observations is called Compressed Sensing (CS), Sparse Approximation or Sparse Recovery Theory (Donoho, 2006; Candès et al., 2006).
A natural way to recover x is to search for the sparsest solution that satisfies the linear measurements:
min_x ||x||_0 subject to Ax = y.    (1)
Unfortunately this problem is NP-hard, and for that reason the ℓ_0 norm is relaxed into an ℓ_1-norm minimization^{1}^{1}1Other examples are greedy (Tropp & Gilbert, 2007) and iterative algorithms, e.g., CoSaMP (Needell & Tropp, 2009), IHT (Blumensath & Davies, 2009), and AMP (Donoho et al., 2009).
x̂ = argmin_x ||x||_1 subject to Ax = y.    (2)
Remarkably, if the measurement matrix A satisfies some properties (e.g., the Restricted Isometry Property (RIP) (Candès, 2008) or the nullspace property (NSP) (Rauhut, 2010)), it is possible to prove that the ℓ_1-minimization in (2) produces the same output as the intractable ℓ_0-minimization in (1). However, it is NP-hard to test whether a matrix satisfies RIP (Bandeira et al., 2013).
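To make the ℓ_1 relaxation in (2) concrete, the sketch below solves it as a linear program via the standard variable-splitting trick |x_i| ≤ t_i. All dimensions and data here are made up for illustration, and `scipy.optimize.linprog` stands in for whatever LP solver one prefers.

```python
# Sketch: solving the l1-minimization in (2) as a linear program.
# min sum(t)  s.t.  x - t <= 0, -x - t <= 0, A x = y, over variables [x; t].
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, m, k = 50, 25, 3                      # ambient dim, measurements, sparsity
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.uniform(0.5, 1.0, k)
A = rng.normal(size=(m, d)) / np.sqrt(m)  # random Gaussian measurement matrix
y = A @ x_true

c = np.concatenate([np.zeros(d), np.ones(d)])          # minimize sum of t
A_ub = np.block([[np.eye(d), -np.eye(d)],
                 [-np.eye(d), -np.eye(d)]])            # |x_i| <= t_i
b_ub = np.zeros(2 * d)
A_eq = np.hstack([A, np.zeros((m, d))])                # A x = y
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d)
x_hat = res.x[:d]
print(np.linalg.norm(x_hat - x_true))   # small when m is large enough
```

By LP optimality, the returned x_hat is feasible (A x_hat = y) and its ℓ_1 norm never exceeds that of the true sparse vector.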
In this paper we are interested in vectors that are not only sparse but also have additional structure in their support. This extra structure may not be known or a priori specified. We propose a data-driven algorithm that learns a good linear measurement matrix A from data samples. Our linear measurements are subsequently decoded with the ℓ_1-minimization in (2) to estimate the unknown vector x.
Many real-world sparse datasets have additional structure beyond simple sparsity. For example, in a demographic dataset with (one-hot encoded) categorical features, a person's income level may be related to their education. Similarly, in a text dataset with bag-of-words representation, it is much more likely for two related words (e.g., football and game) to appear in the same document than two unrelated words (e.g., football and biology). A third example: in a genome dataset, certain groups of gene features may be correlated. In this paper, the goal is to
learn a measurement matrix A that leverages such additional structure.
Our method is an autoencoder for sparse data, with a linear encoder (the measurement matrix A) and a complex non-linear decoder that solves an optimization problem. The latent code is the measurement y = Ax ∈ R^m, which forms the bottleneck of the autoencoder and makes the representation interesting. A popular data-driven dimensionality reduction method is Principal Component Analysis (PCA) (see, e.g., (Hotelling, 1933; Boutsidis et al., 2015; Wu et al., 2016; Li et al., 2017) and the references therein). PCA is also an autoencoder, one where both the encoder and decoder are linear and learned from samples. Given a data matrix X (each row is a sample), PCA projects each data point onto the subspace spanned by the top right-singular vectors of X. As is well known, PCA provides the lowest mean-squared error when used with a linear decoder. However, when the data is sparse, non-linear recovery algorithms like (2) can yield significantly better recovery performance.
In this paper, we focus on learning a linear encoder for sparse data. Compared to non-linear embedding methods such as kernel PCA (Mika et al., 1999), a linear method enjoys two broad advantages: 1) it is easy to compute, as it only needs a matrix-vector multiplication; 2) it is easy to interpret, as every column of the encoding matrix can be viewed as a feature embedding. Interestingly, Arora et al. (2018) recently observed that pre-trained word embeddings such as GloVe and word2vec (Mikolov et al., 2013; Pennington et al., 2014) form a good measurement matrix for text data. Those measurement matrices, when used with ℓ_1-minimization, need fewer measurements than random matrices to achieve near-perfect recovery.
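As a point of comparison for the PCA baseline above, the following minimal numpy sketch (with made-up dimensions) implements PCA as a linear autoencoder: a truncated SVD gives the top right-singular vectors, the encoder is a projection onto them, and the decoder is the transposed projection.

```python
# Sketch: PCA as a linear autoencoder via truncated SVD (toy rank-m data).
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 200, 30, 5
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, d))  # data in an m-dim subspace
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:m].T                       # top-m right singular vectors, shape d x m
Z = X @ V                          # linear encoder: m-dim latent codes
X_hat = Z @ V.T                    # linear decoder
print(np.linalg.norm(X - X_hat))   # ~0 here, since the data is exactly rank m
```

For genuinely sparse data the reconstruction error of this linear decoder is what the non-linear ℓ_1 decoder in (2) improves upon.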
Given a sparse dataset with additional (but unknown) structure, our goal is to learn a good measurement matrix A ∈ R^{m×d}, when the recovery algorithm is the ℓ_1-minimization in (2). More formally, given sparse samples x_1, …, x_n ∈ R^d, our problem of finding the best A can be formulated as

min_A Σ_{i=1}^n ||x_i − x̂(A, A x_i)||_2^2.
Here x̂(A, y) denotes the decoder defined in (2). Unfortunately, there is no easy way to compute the gradient of x̂ with respect to A, due to the optimization problem defining x̂. Our main insight, which will be elaborated on in Section 3.1, is that replacing the ℓ_1-minimization with a T-step projected subgradient update of it results in gradients that are (approximately) computable. In particular, consider the approximate decoder x̂_T(A, y) defined as
x^1 = A^T y;  x^{t+1} = x^t − α_t (I − A^T A) sign(x^t), t = 1, …, T − 1;  x̂_T(A, y) = x^T.    (3)
As we will show, it is now possible to compute the gradients of x̂_T with respect to A. This idea is sometimes called unrolling and has appeared in various other applications, as we discuss in the related work section. To the best of our knowledge, we are the first to use unrolling to learn a measurement matrix for compressed sensing.
Our contributions can be summarized as follows:
We design a novel autoencoder, called ℓ_1-AE, to learn an efficient and compressed representation for structured sparse vectors. Our autoencoder is easy to train and has only two tuning hyper-parameters associated with the network architecture: the encoding dimension m and the network depth T. The architecture of ℓ_1-AE is inspired by the projected subgradient method for solving the ℓ_1-minimization in (2). While the exact projected subgradient method requires computing a pseudoinverse, we circumvent this by observing that the expensive pseudoinverse operation can be replaced by a simple transpose (see Lemma 1).
The most surprising result in this paper is that we can learn a linear encoder using an unfolded T-step projected subgradient decoder, and the learned measurement matrix still performs well with the original ℓ_1-minimization decoder. We empirically compare 10 algorithms over 6 sparse datasets (3 synthetic and 3 real). As shown in Figure 2, using the measurement matrix learned by our autoencoder, we can compress the sparse vectors (in the test set) to a dimension lower by a factor of 1.1-3x than with random Gaussian matrices, while still being able to perfectly recover the original sparse vectors (see also Table 2). This demonstrates the superior ability of our autoencoder to learn and adapt to the additional structure in the given data.
Although our autoencoder is specifically designed for the ℓ_1-minimization decoder, the learned measurement matrix also performs well (and can perform even better) with the model-based decoder (Baraniuk et al., 2010) (Figure 3). This further demonstrates the benefit of learning a measurement matrix from data. As a baseline algorithm, we slightly modify the original model-based CoSaMP algorithm by adding a positivity constraint without changing the theoretical guarantee (Appendix C), which could be of independent interest.
Besides the application to compressed sensing, one interesting direction for future research is to use the proposed autoencoder ℓ_1-AE in other supervised tasks. We illustrate a potential application of ℓ_1-AE in extreme multi-label learning: we show that ℓ_1-AE can be used to learn label embeddings for multi-label datasets. We compare the resulting method with one of the state-of-the-art embedding-based methods, SLEEC (Bhatia et al., 2015), over two benchmark datasets. Our method achieves better or comparable precision scores than SLEEC (see Table 3).
We briefly review the relevant work, and highlight the differences compared to our paper.
Model-based compressed sensing (CS). Model-based CS (Baraniuk et al., 2010; Hegde et al., 2014) extends the conventional compressed sensing theory by considering more realistic structured models than simple sparsity. It requires knowing the sparsity structure a priori, which is not always possible in practice. Our approach, by contrast, does not require a priori knowledge about the sparsity structure.
Learning-based measurement design. Most theoretical results in CS are based on random measurement matrices. Several approaches have been proposed to learn a measurement matrix from training data. One approach is to learn a near-isometric embedding that preserves pairwise distances (Hegde et al., 2015; Bah et al., 2013). This approach usually requires computing the pairwise difference vectors, and hence is computationally expensive when both the number of training samples and the ambient dimension are large (which is the setting we are interested in). Another approach restricts the form of the measurement matrix, e.g., to matrices formed by a subset of rows of a given basis matrix. The learning problem then becomes selecting the best subset of rows (Baldassarre et al., 2016; Li & Cevher, 2016; Gözcü et al., 2018). In Figure 2, we compare our method with the learning-based subsampling method proposed in (Baldassarre et al., 2016), and show that our method needs fewer measurements to recover the original sparse vectors.
Adaptive CS. In adaptive CS (Seeger & Nickisch, 2011; Arias-Castro et al., 2013; Malloy & Nowak, 2014), each new measurement is designed based on the previous measurements in order to maximize the gain of new information. This is in contrast to our setting, where we are given a set of training samples. Our goal is to learn a good measurement matrix that leverages additional structure in the given samples.
Dictionary learning. Dictionary learning (Aharon et al., 2006; Mairal et al., 2009) is the problem of learning an overcomplete set of basis vectors from data so that every datapoint (presumably dense) can be represented as a sparse linear combination of the basis vectors. By contrast, this paper focuses on learning a good measurement matrix for structured sparse data.
Sparse coding. The goal of sparse coding (Olshausen & Field, 1996; Donoho & Elad, 2003) is to find the sparse representation of a dense vector, given a fixed family of basis vectors (aka a dictionary). Training a deep neural network to compute the sparse codes has recently become popular (Gregor & LeCun, 2010; Sprechmann et al., 2015; Wang et al., 2016). Several recent papers (see, e.g., (Kulkarni et al., 2016; Xin et al., 2016; Shi et al., 2017; Jin et al., 2017; Mardani et al., 2017; Mousavi et al., 2017, 2019, 2015; Mousavi & Baraniuk, 2017; He et al., 2017; Zhang & Ghanem, 2017; Lohit et al., 2018) and the references therein) propose new convolutional neural network architectures for image reconstruction from low-dimensional measurements. Note that some architectures, e.g., (Lohit et al., 2018; Shi et al., 2017; Mousavi et al., 2015, 2017, 2019), also have an image sensing component, and hence the overall architecture becomes an autoencoder. Sparse coding differs from this paper because we focus on learning a good measurement/encoding matrix rather than learning a good recovery/decoding algorithm.
Autoencoders.
An autoencoder is a popular neural network architecture for unsupervised learning. It has applications in dimensionality reduction (Hinton & Salakhutdinov, 2006), pre-training (Erhan et al., 2010), image compression and recovery (Lohit et al., 2018; Mousavi et al., 2015, 2017, 2019), denoising (Vincent et al., 2010), and generative modeling (Kingma & Welling, 2014). In this paper we design a novel autoencoder, ℓ_1-AE, to learn a compressed sensing measurement matrix for the ℓ_1-minimization decoder. We focus on high-dimensional sparse (non-image) data such as categorical data and bag-of-words data (see Table 1).
Unrolling. The idea of unfolding an iterative algorithm (e.g., gradient descent for an optimization problem) into a neural network structure is a natural way of making the algorithm trainable (see, e.g., (Gregor & LeCun, 2010; Hershey et al., 2014; Sprechmann et al., 2015; Xin et al., 2016; Wang et al., 2016; Shi et al., 2017; Jin et al., 2017; Mardani et al., 2017; He et al., 2017; Zhang & Ghanem, 2017) and references therein). The main difference between the previous papers and this paper is that the previous papers seek a trained neural network to replace the original optimization-based algorithm, whereas we design an autoencoder based on the unrolling idea and, after training, show that the learned measurement matrix still performs well with the original ℓ_1-minimization decoder.
Extreme multi-label learning (XML).
The goal of XML is to learn a classifier that identifies (for each datapoint) a subset of relevant labels from an extremely large label set. Different approaches have been proposed for XML, e.g., embedding-based (Bhatia et al., 2015; Yu et al., 2014; Mineiro & Karampatziakis, 2015; Tagami, 2017), tree-based (Prabhu & Varma, 2014; Jain et al., 2016; Jasinska et al., 2016; Prabhu et al., 2018a), 1-vs-all (Prabhu et al., 2018b; Babbar & Schölkopf, 2017; Yen et al., 2016, 2017; Niculescu-Mizil & Abbasnejad, 2017; Hariharan et al., 2012), and deep learning (Jernite et al., 2017; Liu et al., 2017). Here we focus on the embedding-based approach. In Section 5 we show that the proposed autoencoder can be used to learn label embeddings for multi-label datasets, and the resulting method achieves better or comparable precision scores than SLEEC (Bhatia et al., 2015) over two benchmark datasets.
Our goal is to learn a measurement matrix A from the given sparse vectors. This is done by training a novel autoencoder, called ℓ_1-AE. In this section, we describe the key idea behind the design of ℓ_1-AE. We focus on vectors that are sparse in the standard basis and also nonnegative^{2}^{2}2Extending our method to more general settings is left for future research.. This is a natural setting for many real-world datasets, e.g., categorical data and bag-of-words data.
Our design is strongly motivated by the projected subgradient method used to solve an ℓ_1-minimization problem. Consider the following ℓ_1-minimization problem:
min_x ||x||_1 subject to Ax = b,    (4)
where A ∈ R^{m×d} and b ∈ R^m. We assume that m < d and that A has rank m. One approach^{3}^{3}3Another approach is via linear programming. to solving (4) is the projected subgradient method. The update is given by

x^{t+1} = Π(x^t − α_t sign(x^t)),    (5)
where Π denotes the (Euclidean) projection onto the convex set {x : Ax = b}, sign(x^t) is the coordinate-wise sign function, i.e., a subgradient of the objective function at x^t, and α_t is the step size at the t-th iteration. Since A has full row rank, Π has a closed-form solution given by

Π(z) = argmin_{x: Ax=b} ||x − z||_2    (6)
     = z − A^T (A A^T)^{-1} (Az − b)    (7)
     = z − A^† (Az − b),    (8)
where A^† = A^T (A A^T)^{-1} is the Moore-Penrose inverse of A. Substituting (8) into (5), and using the fact that Ax^t = b, we get the following update equation:

x^{t+1} = x^t − α_t (I − A^† A) sign(x^t),    (9)

where I ∈ R^{d×d} is the identity matrix. We use x^1 = A^† b (which satisfies the constraint Ax^1 = b) as the starting point.

As mentioned in the Introduction, our main idea is to replace the solution of the ℓ_1 decoder given in (4) by a T-step projected subgradient update given in (9). One technical difficulty^{4}^{4}4One approach is to replace A^† by a trainable matrix B. This approach performs worse than ours (see Figure 3). in simulating (9) using neural networks is backpropagating through the pseudoinverse A^†. Fortunately, Lemma 1 shows that it is possible to replace A^† by A^T.

Lemma 1. For any vector b ∈ R^m and any matrix A ∈ R^{m×d} (m ≤ d) with rank m, there exists an Ã ∈ R^{m×d} with all singular values equal to one, and a corresponding b̃ ∈ R^m, such that the following two ℓ_1-minimization problems have the same solution:

min_x ||x||_1 subject to Ax = b,    (10)
min_x ||x||_1 subject to Ãx = b̃.    (11)
Furthermore, the projected subgradient update of (11) is

x^{t+1} = x^t − α_t (I − Ã^T Ã) sign(x^t).    (12)

A natural choice for Ã is Ã = U (A A^T)^{-1/2} A, where U ∈ R^{m×m} can be any unitary matrix.
Lemma 1 essentially says two things: 1) Instead of searching over all matrices A (of size m-by-d with rank m), it is enough to search over the subset of matrices whose singular values are all equal to one. This is because A and Ã have the same recovery performance for ℓ_1-minimization (this holds as long as A and Ã have the same null space). 2) The key benefit of searching over matrices with all singular values equal to one is that the corresponding projected subgradient update has a simpler form: the annoying pseudoinverse term A^† in (9) is replaced by a simple matrix transpose Ã^T in (12).
As we will show in the next section, our decoder is designed to simulate (12) instead of (9): the only difference is that the pseudoinverse term A^† is replaced by the matrix transpose A^T. Ideally, we would train our ℓ_1-AE while enforcing the constraint that the learned matrix has all singular values equal to one. In practice, we do not enforce this constraint during training. We empirically observe that the learned measurement matrix is not far from the constraint set (see Appendix D.4).
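The following numpy sketch (with made-up dimensions) makes Lemma 1 concrete: it builds Ã = (A A^T)^{-1/2} A (taking U = I), checks that all its singular values are one, and runs the transpose-only update (12). Since Ã Ã^T = I, each update stays on the constraint set {x : Ãx = b̃}.

```python
# Sketch of Lemma 1 and update (12): precondition A so the projection step
# needs only a transpose instead of a pseudoinverse.
import numpy as np

rng = np.random.default_rng(2)
m, d, k, T = 20, 100, 3, 50
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.uniform(0.5, 1.0, k)
A = rng.normal(size=(m, d))

# A~ = (A A^T)^{-1/2} A via the eigendecomposition of the PSD matrix A A^T
w, Q = np.linalg.eigh(A @ A.T)
At = Q @ np.diag(w ** -0.5) @ Q.T @ A
assert np.allclose(np.linalg.svd(At, compute_uv=False), 1.0)  # unit singular values

b = At @ x_true
x = At.T @ b                       # feasible start: A~ (A~^T b) = b since A~ A~^T = I
for t in range(1, T):
    alpha_t = 0.1 / t              # "square summable but not summable" step sizes
    x = x - alpha_t * (np.eye(d) - At.T @ At) @ np.sign(x)
print(np.allclose(At @ x, b), np.linalg.norm(x - x_true))
```

The step-size constant 0.1 and the iteration count are arbitrary choices for illustration; the point is that no pseudoinverse appears inside the loop.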
As shown in Figure 1, ℓ_1-AE has a simple linear encoder and a non-linear decoder. When a data point x ∈ R^d arrives, it is encoded as y = Ax, where A ∈ R^{m×d} is the encoding matrix that will be learned from data. A decoder is then used to recover the original vector x from its embedding y.
The decoder network of ℓ_1-AE consists of T − 1 blocks connected in a feedforward manner: the output vector of the t-th block is the input vector to the (t+1)-th block. The network structure inside each block is identical. Let x^1 = A^T y. For t = 1, …, T − 1, if x^t is the input to the t-th block, then its output vector is

x^{t+1} = x^t − α_t (I − A^T A) sign(x^t),    (13)

where α_1, …, α_{T−1} are scalar variables to be learned from data. We empirically observe that regularizing the step sizes to have the form^{5}^{5}5This form is sometimes called square summable but not summable (Boyd, 2014). α_t = α/t for t = 1, …, T − 1 improves test accuracy. Here, α is the only scalar variable to be learned from data. We also add a standard batch normalization (BN) layer (Ioffe & Szegedy, 2015) inside each block, because empirically it improves the test accuracy (see Figure 3). After the T − 1 blocks, we use rectified linear units (ReLU) in the last layer^{6}^{6}6This makes sense as we focus on nonnegative sparse vectors in this paper. Nonnegativity is a natural setting for many real-world sparse datasets, e.g., categorical data and text data. to obtain the final output x̂ = ReLU(x^T).

It is worth noting that the low-rank structure of the weight matrix I − A^T A in (13) is essential for reducing the computational complexity. A fully-connected layer would require a weight matrix of size d × d, which is intractable for large d.
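A minimal forward pass of the unrolled decoder can be sketched as follows. Batch normalization is omitted here for brevity, and the dimensions are made up. Note how computing A^T (A s) keeps the per-block cost at O(md) rather than forming the d × d matrix I − A^T A explicitly, which is the low-rank trick described above.

```python
# Sketch: forward pass of the unrolled decoder (13), without batch norm.
import numpy as np

def l1ae_decode(A, y, alpha, T):
    """x1 = A^T y, then T-1 subgradient blocks with alpha_t = alpha / t,
    and a final ReLU for nonnegative outputs."""
    x = A.T @ y
    for t in range(1, T):
        s = np.sign(x)
        x = x - (alpha / t) * (s - A.T @ (A @ s))  # never materializes I - A^T A
    return np.maximum(x, 0.0)                      # ReLU output layer

rng = np.random.default_rng(3)
m, d = 20, 100
A = rng.normal(size=(m, d)) / np.sqrt(m)
x0 = np.zeros(d); x0[[3, 40, 77]] = [0.5, 0.9, 0.7]   # a toy sparse input
x_hat = l1ae_decode(A, A @ x0, alpha=0.1, T=10)
print(x_hat.shape)
```

In the actual model, A, α (and the BN parameters) are trained end-to-end; this sketch only shows the computation graph that gradients flow through.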
Given n unlabeled training examples x_1, …, x_n ∈ R^d, we train ℓ_1-AE to minimize the average squared distance between x_i and its reconstruction x̂_i:

min_{A, α} (1/n) Σ_{i=1}^n ||x_i − x̂_i||_2^2.    (14)
We implement ℓ_1-AE in TensorFlow 1.4. Our code can be found at https://github.com/wushanshan/L1AE. The goal of this section is to demonstrate that ℓ_1-AE is able to learn a good measurement matrix for structured sparse datasets, when ℓ_1-minimization is used for decoding^{7}^{7}7We use Gurobi (a commercial optimization solver) to solve it..

Dataset | Dimension | Avg. no. of nonzeros | Train / Valid / Test Size | Description
---|---|---|---|---|
Synthetic1 | 1000 | 10 | 6000 / 2000 / 2000 | 1-block sparse with block size 10 |
Synthetic2 | 1000 | 10 | 6000 / 2000 / 2000 | 2-block sparse with block size 5 |
Synthetic3 | 1000 | 10 | 6000 / 2000 / 2000 | Power-law structured sparsity |
Amazon | 15626 | 9 | 19661 / 6554 / 6554 | 1-hot encoded categorical data |
Wiki10-31K | 30398 | 19 | 14146 / 3308 / 3308 | Extreme multi-label data |
RCV1 | 47236 | 76 | 13889 / 4630 / 4630 | Text data with TF-IDF features |
Synthetic datasets. As shown in Table 1, we generate three synthetic datasets: two satisfy the block sparsity model^{8}^{8}8A signal x is called K-block sparse with block size J if it satisfies: 1) x can be reshaped into a matrix X of size J × N, where JN equals the dimension of x; 2) every column of X acts as a group, i.e., the entire column is either zero or nonzero; 3) X has K nonzero columns, and hence x has sparsity KJ. (Baraniuk et al., 2010), and one follows power-law structured sparsity (the probability that the i-th coordinate is nonzero decays as a power law in i). Each sample is generated as follows: 1) choose a random support set satisfying the sparsity model; 2) set the nonzeros to be uniformly random.
Real datasets. Our first dataset is from the Kaggle "Amazon Employee Access Challenge"^{9}^{9}9https://www.kaggle.com/c/amazon-employee-access-challenge. Each training example contains 9 categorical features. We use one-hot encoding to transform each example. Our second dataset, Wiki10-31K, is a multi-label dataset downloaded from the repository (XML, 2017). We only use the label vectors to train our autoencoder. Our third dataset is RCV1 (Lewis et al., 2004), a popular text dataset. We use scikit-learn to fetch the training set and randomly split it into train/validation/test sets.
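A generator for the "1-block sparse" synthetic data in Table 1 might look like the following sketch. The aligned-block placement and the nonzero value range are assumptions made for illustration.

```python
# Hypothetical generator for 1-block sparse vectors (d = 1000, block size 10):
# one aligned block of coordinates is nonzero, values drawn uniformly.
import numpy as np

def block_sparse_sample(d=1000, block=10, rng=None):
    rng = rng or np.random.default_rng(4)
    x = np.zeros(d)
    start = rng.integers(0, d // block) * block     # pick one aligned block
    x[start:start + block] = rng.uniform(0.5, 1.0, block)  # strictly positive
    return x

x = block_sparse_sample()
print(int((x != 0).sum()))   # prints 10: one contiguous block of nonzeros
```

The 2-block variant would pick two distinct blocks of size 5; the power-law variant would instead sample each coordinate's support independently with a probability decaying in its index.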
Training. We use stochastic gradient descent to train the autoencoder. Before training, every sample is normalized to have unit norm. The parameters are initialized as follows: A is initialized as a random Gaussian matrix, and α is initialized as 1.0. Other hyper-parameters are given in Appendix B. A single NVIDIA Quadro P5000 GPU is used in the experiments. We set the decoder depth T per dataset^{10}^{10}10Although the subgradient method (Boyd, 2014) has a slow worst-case convergence rate, in practice we found that a small value of T seemed to be good enough (see Table 2).. Training an ℓ_1-AE takes several minutes for small-scale datasets and around an hour for large-scale datasets.
We compare 10 algorithms in terms of their recovery performance. The results are shown in Figure 2. All the algorithms follow a two-step "encoding + decoding" process.
ℓ_1-AE + ℓ_1-min pos (our algorithm): After training an ℓ_1-AE, we use the encoder matrix A as the measurement matrix. To decode, we use Gurobi (a commercial optimization solver) to solve the following ℓ_1-minimization problem with a positivity constraint (denoted as "ℓ_1-min pos"):

min_x ||x||_1 subject to Ax = y, x ≥ 0.    (15)

Since we focus on nonnegative sparse vectors in this paper, adding a positivity constraint improves the recovery performance^{11}^{11}11The sufficient and necessary condition (Theorem 3.1 of (Khajehnejad et al., 2011)) for exact recovery using (15) is weaker than the nullspace property (NSP) (Rauhut, 2010) for (4). (see Appendix D.3).
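With the positivity constraint, (15) reduces to a plain LP: for x ≥ 0, ||x||_1 = Σ x_i, so no variable splitting is needed. A minimal sketch with made-up data (the paper uses Gurobi; `scipy.optimize.linprog` is a stand-in here):

```python
# Sketch: "l1-min pos" (15) as an LP -- minimize sum(x) s.t. Ax = y, x >= 0.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
d, m, k = 60, 25, 3
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.uniform(0.5, 1.0, k)
A = rng.normal(size=(m, d)) / np.sqrt(m)
y = A @ x_true

res = linprog(np.ones(d), A_eq=A, b_eq=y, bounds=[(0, None)] * d)
x_hat = res.x
print(res.status, np.linalg.norm(x_hat - x_true))
```

Compared with the general ℓ_1 LP, this formulation has half the variables and no inequality block, which is one practical payoff of the nonnegativity assumption.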
Gauss + ℓ_1-min pos / model-based CoSaMP pos: A random Gaussian matrix with i.i.d. entries is used as the measurement matrix^{12}^{12}12Additional experiments with random partial Fourier matrices (Haviv & Regev, 2017) can be found in Appendix D.2.. We experiment with two decoding methods: 1) solve the optimization problem given in (15); 2) use the model-based CoSaMP algorithm^{13}^{13}13The model-based method needs the explicit sparsity model, and hence is not applicable to RCV1, Wiki10-31K, and Synthetic3. (Algorithm 1 in (Baraniuk et al., 2010)) with an additional positivity constraint (see Appendix C).
PCA or PCA + ℓ_1-min pos: We perform truncated singular value decomposition (SVD) on the training set. Let V ∈ R^{d×m} contain the top-m right singular vectors. For PCA, every vector x in the test set is estimated as x̂ = V V^T x. We can also use "ℓ_1-min pos" (15) as the decoder.
Simple AE or Simple AE + ℓ_1-min pos: We train a simple autoencoder: for an input vector x, the output is a linear decoding of the linear encoding Ax, where both the encoding matrix A and the decoding matrix are learned from data. We use the same loss function as our autoencoder. After training, we use the learned A as the measurement matrix. Decoding is performed either by the learned decoder or by solving "ℓ_1-min pos" (15).
LBCS + ℓ_1-min pos / model-based CoSaMP pos: We implement the learning-based compressive subsampling (LBCS) method of (Baldassarre et al., 2016). The idea is to select a subset (of size m) of coordinates (in the transformed space) that preserves the most energy. We use a Gaussian matrix as the basis matrix and "ℓ_1-min pos" as the decoder^{14}^{14}14We tried four variations of LBCS: two different basis matrices (random Gaussian matrix and DCT matrix) and two different decoders (ℓ_1-minimization and a linear decoder). The combination of Gaussian and ℓ_1-minimization performs the best (see Appendix D.5).. Decoding with "model-based CoSaMP pos" is shown in Figure 3.
4-layer AE: We train a standard 4-layer autoencoder (not counting the input layer), whose encoder network (and decoder network) consists of two feedforward layers with ReLU activations. The dimensions of the 1st (and 3rd) layers are tuned based on performance on the validation set.
The experimental results are shown in Figure 2. Two performance metrics are compared. The first one is the fraction of exactly recovered test samples: we say that a vector x is exactly recovered by an algorithm if it produces an x̂ whose distance to x is below a small threshold. The second metric is the root mean-squared error (RMSE) over the test set^{15}^{15}15Note that in Figure 2, the test RMSE has a similar scale across all datasets, because the vectors are normalized to have unit norm..
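One plausible implementation of these two metrics is sketched below; the exact-recovery tolerance and the entry-wise RMSE definition are assumptions, chosen to match the footnote's observation that RMSE scales with the unit-norm samples.

```python
# Sketch: fraction of (near-)exactly recovered samples and test RMSE.
import numpy as np

def recovery_metrics(X, X_hat, tol=1e-6):
    err = np.linalg.norm(X - X_hat, axis=1)          # per-sample l2 error
    exact_frac = float(np.mean(err <= tol))
    rmse = float(np.sqrt(np.mean((X - X_hat) ** 2)))  # entry-wise RMSE
    return exact_frac, rmse

X = np.eye(4)                                        # 4 toy unit-norm samples
X_hat = X.copy(); X_hat[0, 0] += 0.1                 # perturb one sample
frac, rmse = recovery_metrics(X, X_hat)
print(frac, rmse)   # fraction 0.75, RMSE close to 0.025
```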
As shown in Figure 2, our algorithm "ℓ_1-AE + ℓ_1-min pos" outperforms all the other baselines over all datasets. By learning a data-dependent linear encoding matrix, our method requires fewer measurements to achieve perfect recovery.
Learned decoder versus ℓ_1-min decoder. We now compare two methods: ℓ_1-AE and ℓ_1-AE + ℓ_1-min pos. They have a common encoder but different decoders. As shown in Figure 2, "ℓ_1-AE + ℓ_1-min pos" almost always gives a smaller RMSE. In fact, as shown in Table 2, "ℓ_1-min pos" is able to achieve reconstruction errors on the order of 1e-15, which is impossible for a neural network. The strength of optimization-based decoders over neural network-based decoders has been observed before, e.g., in Figure 1 of (Bora et al., 2017)^{16}^{16}16As indicated by Figure 1 in (Bora et al., 2017), LASSO gives better reconstruction than a GAN-based method when given enough Gaussian measurements.. Nevertheless, a neural network-based decoder usually has the advantage that it can handle nonlinear encoders, for which the corresponding optimization problem may become non-convex and hard to solve exactly.
Dataset | Synthetic1 | Amazon | ||||
---|---|---|---|---|---|---|
# measurements | 10 | 30 | 50 | 40 | 80 | 120 |
ℓ_1-AE | 0.465 | 0.129 | 0.118 | 0.638 | 0.599 | 0.565
ℓ_1-AE + ℓ_1-min pos (ours) | 0.357 | 9.9e-3 | 1.4e-15 | 0.387 | 0.023 | 2.8e-15
# measurements | 20 | 30 |
---|---|---|
0.097 | 9.9e-3 | |
0.063 | 1.2e-15 | |
9.6e-16 | 1.1e-15 |
Model-based decoder versus learned encoder. It is interesting to see that our algorithm outperforms the model-based method (Baraniuk et al., 2010), even though the model-based decoder has more information about the given data than the ℓ_1-minimization decoder. For the Amazon dataset, compared to "Gauss + model-based CoSaMP pos", our method reduces the number of measurements needed for exact recovery by about 2x. This is possible because the model-based decoder only knows that the input vector comes from one-hot encoding, which is a coarse model of the underlying sparsity structure. By contrast, ℓ_1-AE learns a measurement matrix directly from the given training data.
A natural question to ask is whether the measurement matrix learned by ℓ_1-AE can improve the recovery performance of the model-based decoding algorithm. As shown in Figure 3, the recovery performance of "ℓ_1-AE + model-based CoSaMP pos" is better than that of "Gauss + model-based CoSaMP pos". This further demonstrates the benefit of learning a data-adaptive measurement matrix.
Variations of ℓ_1-AE. We now examine how slightly varying the decoder structure affects performance. We make the following changes to the decoder structure: 1) remove the batch normalization layer; 2) remove the ReLU operation in the last layer; 3) replace the sign nonlinearity with a different activation; 4) replace the A^T term in the decoder network by a matrix B that is learned from data; 5) use a one-layer neural network as the decoder. Each change is applied in isolation. As shown in Figure 3, our design gives the best recovery performance among all the variations for the Amazon dataset.
Decoder depth of ℓ_1-AE. The decoder depth T is a tuning parameter of ℓ_1-AE. Empirically, the performance of the learned matrix improves as T increases (see Table 2). On the other hand, the training time also increases with T. The parameter is tuned as follows: we start with a small T and stop increasing it once the validation performance improvement falls below a threshold or T reaches a preset maximum. The hyper-parameters used for training ℓ_1-AE, including the depth chosen for each dataset, are given in Appendix B. The autoencoder is trained on a single GPU; training takes a few minutes for small datasets and around an hour for large datasets.
We have proposed a novel autoencoder, ℓ_1-AE, to learn a compressed sensing measurement matrix for high-dimensional sparse data. Besides the application to compressed sensing, one interesting direction for future research is to use ℓ_1-AE in other machine learning tasks. Here we illustrate a potential application of ℓ_1-AE in extreme multi-label learning (XML). For each data point, the goal of XML is to predict a subset of relevant labels from an extremely large label set. As a result, the output label vector is high-dimensional, sparse, and nonnegative (with 1's denoting the relevant labels and 0's otherwise).
Many approaches have been proposed for XML (see (XML, 2017) for the benchmark algorithms and datasets). Here we focus on the embedding-based approach^{17}^{17}17Other approaches include tree-based, 1-vs-all, etc.. One of the state-of-the-art embedding-based approaches^{18}^{18}18AnnexML (Tagami, 2017) is a graph-embedding approach for XML. Some of its techniques (such as better partitioning of the input data points) can potentially be used with our method. is SLEEC (Bhatia et al., 2015). Given training samples (x_i, y_i), i = 1, …, n, where x_i are the feature vectors and y_i are the label vectors, we use X and Y to denote the stacked feature matrix and label matrix. SLEEC works in two steps. In Step 1, SLEEC reduces the dimension of the labels by learning a low-dimensional embedding z_i for each training sample. Let Z (whose embedding dimension is much smaller than the label dimension) denote the learned embedding matrix. Note that only the label matrix Y is used for learning in this step. In Step 2, SLEEC learns a linear mapping V such that z_i ≈ V x_i. To predict the label vector for a new sample x, SLEEC uses a nearest-neighbors method: it first computes the embedding V x, identifies a few nearest neighbors (from the training set) in the embedding space, and uses the sum of their label vectors as the prediction.
The method that we propose follows SLEEC's two-step procedure. The main difference is that in Step 1, we train an autoencoder ℓ_1-AE to learn embeddings for the labels y_i. Note that in XML, the label vectors are high-dimensional, sparse, and nonnegative. Let A be the measurement matrix learned for the labels; then the embedding matrix is Z = Y A^T. In Step 2, we use the same subroutine as SLEEC to learn a linear mapping from the features to the embeddings. To predict the label vector for a new sample, we compare three methods in our experiments: 1) use the nearest-neighbors method (same as SLEEC); 2) use the decoder of the trained ℓ_1-AE (which maps from the embedding space to the label space); 3) use an average of the label vectors obtained from 1) and 2). The three methods are denoted as "ℓ_1-AE 1/2/3" in Table 3.
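The nearest-neighbor prediction step (method 1) can be sketched as follows, with made-up shapes and data. The function names and the choice of summing (rather than averaging) the neighbors' label vectors follow the description above; everything else is an assumption for illustration.

```python
# Sketch of nearest-neighbor label prediction in the embedding space.
import numpy as np

def knn_predict(V, Z_train, Y_train, x_test, num_nn=3):
    z = V @ x_test                                   # embed the test point
    dist = np.linalg.norm(Z_train - z, axis=1)       # distances to train embeddings
    nn = np.argsort(dist)[:num_nn]                   # indices of nearest neighbors
    return Y_train[nn].sum(axis=0)                   # summed label vectors as scores

rng = np.random.default_rng(6)
n, p, m, L = 50, 8, 4, 12        # samples, features, embedding dim, labels
V = rng.normal(size=(m, p))      # learned feature-to-embedding mapping
X = rng.normal(size=(n, p))
Y = (rng.random((n, L)) < 0.1).astype(float)         # sparse 0/1 label matrix
scores = knn_predict(V, X @ V.T, Y, X[0])
print(scores.shape)
```

Ranking the entries of `scores` gives the predicted top labels, from which precision scores such as P@1 are computed.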
In Table 3, we compare the precision score P@1 on two benchmark datasets. The second row gives the number of models combined in the ensemble. According to the XML benchmark website (XML, 2017), SLEEC achieves a precision score of 0.7926 for EURLex-4K and 0.8588 for Wiki10-31K, which is consistent with what we obtained by running their code (after combining 5 models in the ensemble). The embedding dimensions are for EURLex-4K and for Wiki10-31K. We set for $\ell_1$-AE. As shown in Table 3, our method achieves a higher score than SLEEC for EURLex-4K. For Wiki10-31K, a single model of our method achieves a higher score than SLEEC; when 3 or 5 models are ensembled, our method has a precision score comparable to SLEEC's. More results can be found in Appendix D.6.
Dataset | Feature dimension | Label dimension | Train/Test | # labels/point
---|---|---|---|---
EURLex-4K | 5000 | 3993 | 15539/3809 | 5.31
Wiki10-31K | 101938 | 30938 | 14146/6616 | 18.64
Dataset | EURLex-4K | | | Wiki10-31K | |
---|---|---|---|---|---|---
# models | 1 | 3 | 5 | 1 | 3 | 5
SLEEC | 0.7600 | 0.7900 | 0.7944 | 0.8356 | 0.8603 | 0.8600
$\ell_1$-AE 1 | 0.7655 | 0.7928 | 0.7931 | 0.8529 | 0.8564 | 0.8597
$\ell_1$-AE 2 | 0.7949 | 0.8033 | 0.8070 | 0.8560 | 0.8579 | 0.8581
$\ell_1$-AE 3 | 0.8062 | 0.8151 | 0.8136 | 0.8617 | 0.8640 | 0.8630
Combining ideas from compressed sensing, convex optimization, and deep learning, we proposed a novel unsupervised learning framework for high-dimensional sparse data. The proposed autoencoder $\ell_1$-AE is able to learn an efficient measurement matrix by adapting to the sparsity structure of the given data. The learned measurement matrices can subsequently be used in other machine learning tasks such as extreme multi-label learning. We expect that the learned $\ell_1$-AE can lead to useful representations in various supervised learning pipelines, for datasets that are well represented by large sparse vectors. Investigating the relation between the training data and the learned matrix (see Appendix D.1 for a toy example) is an interesting direction for future research.
For convenience, we re-state Lemma 1 here and then give the proof.
For any vector $x \in \mathbb{R}^n$, and any matrix $A \in \mathbb{R}^{m \times n}$ ($m \le n$) with rank $m$, there exists an $\tilde{A} \in \mathbb{R}^{m \times n}$ with all singular values being ones, such that the following two $\ell_1$-norm minimization problems have the same solution:

$$\min_z \|z\|_1 \quad \text{s.t. } Az = Ax, \qquad (16)$$

$$\min_z \|z\|_1 \quad \text{s.t. } \tilde{A}z = \tilde{A}x. \qquad (17)$$
Furthermore, the projected subgradient update of (17) is given as

$$x^{t+1} = x^t - \alpha_t \left(I - \tilde{A}^\top \tilde{A}\right) \mathrm{sign}(x^t).$$

Let $A = U \Sigma V^\top$ be the singular value decomposition of $A$, with $V \in \mathbb{R}^{n \times m}$. A natural choice for $\tilde{A}$ is $\tilde{A} = W V^\top$, where $W \in \mathbb{R}^{m \times m}$ can be any unitary matrix.
To prove that (16) and (17) give the same solution, it suffices to show that their constraint sets are equal, i.e.,

$$\{z : Az = Ax\} = \{z : \tilde{A}z = \tilde{A}x\}. \qquad (18)$$

Since $\{z : Az = Ax\} = x + \mathrm{Null}(A)$ and $\{z : \tilde{A}z = \tilde{A}x\} = x + \mathrm{Null}(\tilde{A})$, it then suffices to show that $A$ and $\tilde{A}$ have the same nullspace:

$$\mathrm{Null}(A) = \mathrm{Null}(\tilde{A}). \qquad (19)$$

If $z$ satisfies $Az = 0$, then $U \Sigma V^\top z = 0$; since $U$ and $\Sigma$ are full-rank, this implies $V^\top z = 0$, and hence $\tilde{A}z = W V^\top z = 0$. Conversely, we suppose that $\tilde{A}z = 0$. Since $W$ is unitary, $V^\top z = 0$, which implies that $Az = U \Sigma V^\top z = 0$. Therefore, (19) holds. ∎
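As a numerical sanity check of the lemma, the following sketch (our illustration, taking the particular choice $W = I$, so $\tilde{A} = V^\top$) verifies that $\tilde{A}$ has unit singular values, shares the nullspace of $A$, and that one projected subgradient step preserves feasibility.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 10
A = rng.normal(size=(m, n))              # rank m with probability 1

# SVD A = U diag(s) V^T; take A_tilde = W V^T with W = I
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_tilde = Vt

# All singular values of A_tilde are ones
assert np.allclose(np.linalg.svd(A_tilde, compute_uv=False), 1.0)

# Same nullspace: columns of null_basis span Null(A)
null_basis = np.linalg.svd(A)[2][m:].T
assert np.allclose(A_tilde @ null_basis, 0)

# One projected subgradient step for min ||z||_1 s.t. A_tilde z = A_tilde x
x_true = np.zeros(n); x_true[:3] = rng.random(3)
b = A_tilde @ x_true
x = A_tilde.T @ b                         # feasible starting point
alpha = 0.1
x = x - alpha * (np.eye(n) - A_tilde.T @ A_tilde) @ np.sign(x)
assert np.allclose(A_tilde @ x, b)        # the step preserves feasibility
```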
Dataset | Depth | Batch size | Learning rate | Max epochs | Validation interval | Patience
---|---|---|---|---|---|---
Toy | 10 | 128 | 0.01 | 2e4 | 10 | 5
Synthetic1 | 10 | 128 | 0.01 | 2e4 | 10 | 5
Synthetic2 | 5 | 128 | 0.01 | 2e4 | 10 | 1
Synthetic3 | 5 | 128 | 0.01 | 2e4 | 10 | 1
Amazon | 60 | 256 | 0.01 | 2e4 | 1 | 1
Wiki10-31K | 10 | 256 | 0.001 | 5e3 | 10 | 1
RCV1 | 10 | 256 | 0.001 | 1e3 | 1 | 50
Table 4 lists the parameters used to train $\ell_1$-AE in our experiments. We explain the parameters as follows.
Depth: the number of blocks in the decoder, i.e., the number $T$ of projected subgradient steps shown in Figure 1.
Batch size: the number of training samples in a batch.
Learning rate: the learning rate for SGD.
Max epochs: the maximum number of training epochs.
Validation interval: the validation error is computed once per this many epochs; it is used for early stopping.
Patience: training is stopped if the validation error does not improve for this many consecutive validation checks.
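The early-stopping rule described by these parameters can be sketched as follows. The function name and the choice of counting the patience in validation checks (rather than raw epochs) are our assumptions.

```python
def train_with_early_stopping(train_epoch, validate,
                              max_epochs, val_interval, patience):
    """Generic SGD loop matching the parameters above: validate every
    `val_interval` epochs, stop after `patience` consecutive validation
    checks without improvement. Returns the best validation error seen."""
    best, bad_checks = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(epoch)                 # one SGD pass over the data
        if epoch % val_interval == 0:
            err = validate()
            if err < best:
                best, bad_checks = err, 0  # improvement: reset patience
            else:
                bad_checks += 1
                if bad_checks >= patience:
                    break                  # early stop
    return best
```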
The CoSaMP algorithm (Needell & Tropp, 2009) is a simple iterative greedy algorithm for recovering a $K$-sparse vector from linear measurements. The model-based CoSaMP algorithm (Algorithm 1 of (Baraniuk et al., 2010)) is a modification of CoSaMP that uses prior knowledge about the support of the $K$-sparse vector, which is assumed to follow a predefined structured sparsity model. In this section we slightly modify the model-based CoSaMP algorithm to ensure that the output vector follows the given sparsity model and is also nonnegative.
To present the pseudocode, we need a few definitions. We begin with formal definitions of the structured sparsity model and the sparse approximation algorithm $\mathbb{M}$. For a vector $x \in \mathbb{R}^n$, let $x_\Omega$ denote the entries of $x$ indexed by the set $\Omega \subseteq \{1, \dots, n\}$, and let $\Omega^c$ denote the complement of $\Omega$.
A structured sparsity model $\mathcal{M}_K$ is defined as the union of $m_K$ canonical $K$-dimensional subspaces

$$\mathcal{M}_K = \bigcup_{i=1}^{m_K} \mathcal{X}_i, \qquad (23)$$

where $\{\Omega_1, \dots, \Omega_{m_K}\}$ is the set containing all allowed supports, with $|\Omega_i| = K$ for each $i$, and each subspace $\mathcal{X}_i$ contains all signals $x$ with $x_{\Omega_i^c} = 0$.
We define $\mathbb{M}(x, K)$ as the algorithm that obtains the best $K$-term structured sparse approximation of $x$ in the union of subspaces $\mathcal{M}_K$:

$$\mathbb{M}(x, K) = \arg\min_{\bar{x} \in \mathcal{M}_K} \|x - \bar{x}\|_2. \qquad (24)$$
We next define an enlarged set of subspaces and the associated sparse approximation algorithm.
The $B$-order sum for the set $\mathcal{M}_K$, with $B$ an integer, is defined as

$$\mathcal{M}_K^B = \Big\{ x = \textstyle\sum_{r=1}^{B} x^{(r)}, \text{ with } x^{(r)} \in \mathcal{M}_K \Big\}. \qquad (25)$$

We define $\mathbb{M}_B(x, K)$ as the algorithm that obtains the best approximation of $x$ in the union of subspaces $\mathcal{M}_K^B$:

$$\mathbb{M}_B(x, K) = \arg\min_{\bar{x} \in \mathcal{M}_K^B} \|x - \bar{x}\|_2. \qquad (26)$$
Algorithm 1 presents model-based CoSaMP with a positivity constraint. Comparing Algorithm 1 with the original model-based CoSaMP algorithm (Algorithm 1 of (Baraniuk et al., 2010)), the only difference is the extra Step 6: we apply a ReLU operation to the intermediate estimate $b$ so that the signal estimate produced in Step 7 is always nonnegative.
Inputs: measurement matrix $\Phi$, measurements $y$, structured sparse approximation algorithm $\mathbb{M}$
Output: $K$-sparse approximation $\hat{x}$ to the true signal $x$, which is assumed to be nonnegative
$\hat{x}_0 = 0$, $r = y$; $i = 0$ {initialize}
while halting criterion false do
1. $i \leftarrow i + 1$
2. $e \leftarrow \Phi^\top r$ {form signal residual estimate}
3. $\Omega \leftarrow \mathrm{supp}(\mathbb{M}_2(e, K))$ {prune residual estimate according to structure}
4. $T \leftarrow \Omega \cup \mathrm{supp}(\hat{x}_{i-1})$ {merge supports}
5. $b_T \leftarrow \Phi_T^\dagger y$, $b_{T^c} \leftarrow 0$ {form signal estimate by least-squares}
6. $b \leftarrow \mathrm{ReLU}(b)$ {set the negative entries to be zero}
7. $\hat{x}_i \leftarrow \mathbb{M}(b, K)$ {prune signal estimate according to structure}
8. $r \leftarrow y - \Phi \hat{x}_i$ {update measurement residual}
end while
return $\hat{x} \leftarrow \hat{x}_i$
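A minimal runnable sketch of Algorithm 1, using plain $K$-sparsity as the structured model (so $\mathbb{M}$ and $\mathbb{M}_2$ reduce to keeping the largest entries); the function names are ours, not from the paper's code.

```python
import numpy as np

def top_k(x, k):
    """Best k-term approximation (plain sparsity as the model M)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def cosamp_pos(Phi, y, k, n_iter=20):
    """Model-based CoSaMP with the extra ReLU step (Step 6)."""
    n = Phi.shape[1]
    x_hat = np.zeros(n)
    r = y.copy()
    for _ in range(n_iter):
        e = Phi.T @ r                                 # 2. residual estimate
        Omega = np.flatnonzero(top_k(e, 2 * k))       # 3. prune (M_2)
        T = np.union1d(Omega, np.flatnonzero(x_hat))  # 4. merge supports
        b = np.zeros(n)
        b[T] = np.linalg.lstsq(Phi[:, T], y, rcond=None)[0]  # 5. least-squares
        b = np.maximum(b, 0)                          # 6. ReLU
        x_hat = top_k(b, k)                           # 7. prune (M)
        r = y - Phi @ x_hat                           # 8. update residual
    return x_hat
```

For a structured model, `top_k` would be replaced by the model's own approximation algorithms $\mathbb{M}$ and $\mathbb{M}_2$.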
We now show that Algorithm 1 has the same performance guarantee as the original model-based CoSaMP algorithm for structured sparse signals. Specifically, we will show that Theorem 4 of (Baraniuk et al., 2010) also applies to Algorithm 1. In (Baraniuk et al., 2010), the proof of Theorem 4 is based on six lemmas (Appendix D), among which the only lemma related to Steps 6-7 is Lemma 6. It then suffices to prove that this lemma still holds for Algorithm 1 under the constraint that the true vector is nonnegative.
The pruned approximation $\hat{x}_i = \mathbb{M}(b, K)$, where $b = \mathrm{ReLU}(b')$ and $b'$ is the least-squares estimate from Step 5, is such that

$$\|x - \hat{x}_i\|_2 \le 2 \|x - b'\|_2. \qquad (27)$$

Since $\hat{x}_i$ is the $K$-best approximation of $b$ in $\mathcal{M}_K$, and $x \in \mathcal{M}_K$, we have $\|\hat{x}_i - b\|_2 \le \|x - b\|_2$, so

$$\|x - \hat{x}_i\|_2 \le \|x - b\|_2 + \|b - \hat{x}_i\|_2 \le 2\|x - b\|_2 \le 2\|x - b'\|_2, \qquad (28)$$

where the last inequality follows from the facts that $x$ is nonnegative, and hence $|x_j - \mathrm{ReLU}(b')_j| \le |x_j - b'_j|$ for every coordinate $j$. ∎
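The entrywise inequality used in the last step (clipping an estimate at zero can only move it toward a nonnegative target) can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(1000)             # nonnegative true signal
b = rng.normal(size=1000)        # arbitrary estimate
relu_b = np.maximum(b, 0)

# Entrywise: for x_j >= 0, |x_j - ReLU(b)_j| <= |x_j - b_j|
assert np.all(np.abs(x - relu_b) <= np.abs(x - b))
# Hence also in l2 norm
assert np.linalg.norm(x - relu_b) <= np.linalg.norm(x - b)
```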
The above lemma matches Lemma 6 of (Baraniuk et al., 2010), which is used there to prove Theorem 4. Since the other lemmas (Lemmas 1-5 in Appendix D of (Baraniuk et al., 2010)) still hold for Algorithm 1, we conclude that the performance guarantee for structured sparse signals (Theorem 4 of (Baraniuk et al., 2010)) also holds for Algorithm 1.
In Figure 4, we compare the recovery performance of two decoding algorithms: 1) model-based CoSaMP algorithm (Algorithm 1 of (Baraniuk et al., 2010)) and 2) model-based CoSaMP algorithm with positivity constraint (indicated by “Model-based CoSaMP pos” in Figure 4). We use random Gaussian matrices as the measurement matrices. Since our sparse datasets are all nonnegative, adding the positivity constraint to the decoding algorithm is able to improve the recovery performance.
We use a simple example to illustrate that the measurement matrix learned by our autoencoder adapts to the training samples. The toy dataset is generated as follows: each vector has 5 nonzeros randomly located in the first 20 dimensions; the nonzeros are random values in [0,1]. We train $\ell_1$-AE on a training set with 6000 samples, using the parameters listed in the Toy row of Table 4 (depth 10, batch size 128, learning rate 0.01). A validation set with 2000 samples is used for early stopping. After training, we plot the learned matrix in Figure 5. The entries with large values are concentrated in the first 20 dimensions, which agrees with the specific structure of the toy dataset.
Figure 6 is a counterpart of Figure 2; the only difference is that in Figure 6 we use random partial Fourier matrices in place of random Gaussian matrices. A random partial Fourier matrix is obtained by choosing rows uniformly and independently with replacement from the discrete Fourier transform (DFT) matrix, and then scaling each entry to have the same absolute value (Haviv & Regev, 2017). Because the DFT matrix is complex, to obtain real measurements we draw random rows from a DFT matrix to form the partial Fourier matrix. A random partial Fourier matrix is a Vandermonde matrix. According to (Donoho & Tanner, 2005), one can exactly recover a $k$-sparse nonnegative vector from $O(k)$ measurements using a Vandermonde matrix. However, Vandermonde matrices are numerically unstable in practice (Pan, 2016), which is consistent with our empirical observation. Comparing Figure 6 with Figure 2, we see that the recovery performance of a random partial Fourier matrix has larger variance than that of a random Gaussian matrix.
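One plausible construction of a real random partial Fourier measurement matrix is sketched below. The row-stacking and normalization conventions are our assumptions; the paper's exact recipe may differ.

```python
import numpy as np

def real_partial_fourier(m, n, seed=0):
    """Real m x n measurement matrix from a random partial DFT: draw m/2
    rows of the n x n DFT matrix uniformly with replacement, then stack
    their real and imaginary parts. Assumes m is even. The 1/sqrt(m)
    scaling is one normalization convention (complex DFT entries have
    modulus 1, so their real/imaginary parts vary in magnitude)."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, n, size=m // 2)
    F = np.exp(-2j * np.pi * np.outer(rows, np.arange(n)) / n)
    return np.vstack([F.real, F.imag]) / np.sqrt(m)
```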
We compare the recovery performance of solving the $\ell_1$-min (4) and the $\ell_1$-min with positivity constraint (15). The results are shown in Figure 7. We experiment with two measurement matrices: 1) the one obtained by training our autoencoder, and 2) random Gaussian matrices. As shown in Figure 7, adding a positivity constraint to the $\ell_1$-minimization improves the recovery performance for nonnegative input vectors.
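The positivity-constrained $\ell_1$-minimization reduces to a linear program, since for $z \ge 0$ the $\ell_1$ norm is simply the sum of the entries. A sketch using SciPy (the helper name is ours):

```python
import numpy as np
from scipy.optimize import linprog

def l1_min_pos(A, y):
    """Solve min ||z||_1 s.t. A z = y, z >= 0.

    With the nonnegativity constraint, ||z||_1 = sum(z), so this is a
    plain LP with equality constraints and lower bounds of zero."""
    n = A.shape[1]
    res = linprog(c=np.ones(n), A_eq=A, b_eq=y, bounds=[(0, None)] * n)
    return res.x
```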
We have shown that the measurement matrix obtained by training our autoencoder captures the sparsity structure of the training data. We now look at these data-dependent measurement matrices more closely. Table 5 shows that they have singular values close to one. Recall from Section 3.1 that matrices with all singular values equal to one admit the simple projected subgradient update (12). Our decoder is designed based on this simple update rule. Although we do not explicitly enforce this constraint during training, Table 5 indicates that the learned matrices are not far from the constraint set.
Dataset | Largest singular value (mean ± std) | Smallest singular value (mean ± std)
---|---|---
Synthetic1 | 1.117 ± 0.003 | 0.789 ± 0.214
Synthetic2 | 1.113 ± 0.006 | 0.929 ± 0.259
Synthetic3 | 1.162 ± 0.014 | 0.927 ± 0.141
Amazon | 1.040 ± 0.021 | 0.804 ± 0.039
Wiki10-31K | 1.097 ± 0.003 | 0.899 ± 0.044
RCV1 | 1.063 ± 0.016 | 0.784 ± 0.034
We experimented with four variations of LBCS: two different basis matrices (random Gaussian matrix and DCT matrix) and two different decoders ($\ell_1$-minimization and the linear decoder). As shown in Figure 8, the combination of Gaussian basis and $\ell_1$-minimization performs best.
Table 6 compares the precision scores (P@1, P@3, P@5) on two benchmark datasets. For SLEEC, the precision scores we obtained by running their code (combining 5 models in the ensemble) are consistent with those reported on the benchmark website (XML, 2017). Compared to SLEEC, our method (which learns label embeddings via training an autoencoder $\ell_1$-AE) achieves better or comparable precision scores. For our method, we experimented with three prediction approaches (denoted as "$\ell_1$-AE 1/2/3" in Table 6): 1) the nearest-neighbor method (same as SLEEC); 2) the decoder of the trained $\ell_1$-AE (which maps from the embedding space to the label space); 3) an average of the label vectors obtained from 1) and 2). As indicated in Table 6, the third prediction approach performs best.
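The P@k metric reported in these tables can be computed as follows; this is the standard definition, and the helper name is ours.

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """P@k for one sample: the fraction of the k highest-scored labels
    that are actually relevant. `y_true` is the binary label vector and
    `y_score` the real-valued predicted label vector."""
    top = np.argsort(y_score)[::-1][:k]   # indices of the k largest scores
    return y_true[top].sum() / k
```

The reported numbers are averages of this quantity over all test samples.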
P@1:

Dataset | EURLex-4K | | | Wiki10-31K | |
---|---|---|---|---|---|---
# models in the ensemble | 1 | 3 | 5 | 1 | 3 | 5
SLEEC | 0.7600 | 0.7900 | 0.7944 | 0.8356 | 0.8603 | 0.8600
$\ell_1$-AE 1 | 0.7655 | 0.7928 | 0.7931 | 0.8529 | 0.8564 | 0.8597
$\ell_1$-AE 2 | 0.7949 | 0.8033 | 0.8070 | 0.8560 | 0.8579 | 0.8583
$\ell_1$-AE 3 | 0.8062 | 0.8151 | 0.8136 | 0.8617 | 0.8640 | 0.8630
P@3:

Dataset | EURLex-4K | | | Wiki10-31K | |
---|---|---|---|---|---|---
# models in the ensemble | 1 | 3 | 5 | 1 | 3 | 5
SLEEC | 0.6116 | 0.6403 | 0.6444 | 0.7046 | 0.7304 | 0.7357
$\ell_1$-AE 1 | 0.6094 | 0.6347 | 0.6360 | 0.7230 | 0.7298 | 0.7323
$\ell_1$-AE 2 | 0.6284 | 0.6489 | 0.6575 | 0.7262 | 0.7293 | 0.7296
$\ell_1$-AE 3 | 0.6500 | 0.6671 | 0.6693 | 0.7361 | 0.7367 | 0.7373
P@5:

Dataset | EURLex-4K | | | Wiki10-31K | |
---|---|---|---|---|---|---
# models in the ensemble | 1 | 3 | 5 | 1 | 3 | 5
SLEEC | 0.4965 | 0.5214 | 0.5275 | 0.5979 | 0.6286 | 0.6311
$\ell_1$-AE 1 | 0.4966 | 0.5154 | 0.5209 | 0.6135 | 0.6198 | 0.6230
$\ell_1$-AE 2 | 0.5053 | 0.5315 | 0.5421 | 0.6175 | 0.6245 | 0.6268
$\ell_1$-AE 3 | 0.5353 | 0.5515 | 0.5549 | 0.6290 | 0.6322 | 0.6341