 # The Sparse Recovery Autoencoder

Linear encoding of sparse vectors is widely popular, but is most commonly data-independent -- missing any possible extra (but a-priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used ℓ_1 decoder. The convex ℓ_1 decoder prevents gradient propagation as needed in standard autoencoder training. Our method is based on the insight that unfolding the convex decoder into T projected gradient steps can address this issue. Our method can be seen as a data-driven way to learn a compressed sensing matrix. Our experiments show that there is indeed additional structure beyond sparsity in several real datasets. Our autoencoder is able to discover it and exploit it to create excellent reconstructions with fewer measurements compared to the previous state of the art methods.

## Code Repositories

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Assume we have some unknown data vector . We can observe only a few () linear equations of its entries and we would like to design these projections by creating a measurement matrix such that the projections allow exact (or near-exact) recovery of the original vector .

If , this is an ill-posed problem in general: we are observing linear equations with unknowns, so any vector on the subspace satisfies our observed measurements. In this high-dimensional regime, the only hope is to make a structural assumption on , so that unique reconstruction is possible. A natural approach is to assume that the data vector is sparse. The problem of designing measurement matrices and reconstruction algorithms that recover sparse vectors from linear observations is called Compressed Sensing (CS), Sparse Approximation or Sparse Recovery Theory (Donoho, 2006; Candès et al., 2006).

A natural way to recover is to search for the sparsest solution that satisfies the linear measurements:

 argminx′∈Rd∥x′∥0s.t. Ax′=y. (1)

Unfortunately this problem is NP-hard and for that reason the norm is relaxed into an -norm minimization111Other examples are greedy (Tropp & Gilbert, 2007), and iterative algorithms, e.g., CoSaMP (Needell & Tropp, 2009), IHT (Blumensath & Davies, 2009), and AMP (Donoho et al., 2009).

 D(A,y):=argminx′∈Rd∥x′∥1s.t. Ax′=y. (2)

Remarkably, if the measurement matrix satisfies some properties (e.g. Restricted Isometry Property (RIP) (Candès, 2008) or the nullspace condition (NSP) (Rauhut, 2010)) it is possible to prove that the minimization in (2) produces the same output as the intractable minimization in (1). However, it is NP-hard to test whether a matrix satisfies RIP (Bandeira et al., 2013).

In this paper we are interested in vectors that are not only sparse but have additional structure in their support. This extra structure may not be known or a-priori specified. We propose a data-driven algorithm that learns a good linear measurement matrix from data samples. Our linear measurements are subsequently decoded with the -minimization in (2

) to estimate the unknown vector

.

Many real-world sparse datasets have additional structures beyond simple sparsity. For example, in a demographic dataset with (one-hot encoded) categorical features, a person’s income level may be related to their education. Similarly, in a text dataset with bag-of-words representation, it is much more likely for two related words (e.g., football and game) to appear in the same document than two unrelated words (e.g., football and biology). A third example is that in a genome dataset, certain groups of gene features may be correlated. In this paper, the goal is to

learn a measurement matrix to leverage such additional structure.

Our method is an autoencoder for sparse data, with a linear encoder (the measurement matrix) and a complex non-linear decoder that solves an optimization problem. The latent code is the measurement

which forms the bottleneck of the autoencoder that makes the representation interesting. A popular data-driven dimensionality reduction method is Principal Components Analysis (PCA) (see e.g.,

(Hotelling, 1933; Boutsidis et al., 2015; Wu et al., 2016; Li et al., 2017) and the references therein). PCA is also an autoencoder where both the encoder and decoder are linear and learned from samples. Given a data matrix (each row is a sample), PCA projects each data point onto the subspace spanned by the top right-singular vectors of . As is well-known, PCA provides the lowest mean-squared error when used with a linear decoder. However, when the data is sparse, non-linear recovery algorithms like (2) can yield significantly better recovery performance.

In this paper, we focus on learning a linear encoder for sparse data. Compared to non-linear embedding methods such as kernel PCA (Mika et al., 1999), a linear method enjoys two broad advantages: 1) it is easy to compute, as it only needs a matrix-vector multiplication; 2) it is easy to interpret, as every column of the encoding matrix can be viewed as a feature embedding. Interestingly, Arora et al. (2018) recently observe that the pre-trained word embeddings such as GloVe and word2vec (Mikolov et al., 2013; Pennington et al., 2014) form a good measurement matrix for text data. Those measurement matrices, when used with -minimization, need fewer measurements than the random matrices to achieve near-perfect recovery.

Given a sparse dataset that has additional (but unknown) structure, our goal is to learn a good measurement matrix , when the recovery algorithm is the -minimization in (2). More formally, given sparse samples , our problem of finding the best can be formulated as

 minA∈Rm×df(A), where f(A):=n∑i=1∥xi−D(A,Axi)∥22.

Here is the decoder defined in (2). Unfortunately, there is no easy way to compute the gradient of with respect to , due to the optimization problem defining . Our main insight, which will be elaborated on in Section 3.1, is that replacing the -minimization with a -step projected subgradient update of it, results in gradients being (approximately) computable. In particular, consider the approximate function defined as

 ~f(A):= n∑i=1∥xi−^xi∥22, where (3) ^xi= T-step projected subgradient of D(A,Axi), for i=1,⋯,n.

As we will show, now it is possible to compute the gradients of with respect to . This is idea is sometimes called unrolling and has appeared in various other applications as we discuss in the related work section. To the best of our knowledge, we are the first to use unrolling to learn a measurement matrix for compressed sensing.

Our contributions can be summarized as follows:

• We design a novel autoencoder, called -AE , to learn an efficient and compressed representation for structured sparse vectors. Our autoencoder is easy to train and has only two tuning hyper-parameters associated with the network architecture: the encoding dimension and the network depth . The architecture of -AE is inspired by the projected subgradient method of solving the -minimization in (2). While the exact projected subgradient method requires computing the pseudoinverse, we circumvent this by observing that it is possible to replace the expensive pseudoinverse operation by a simple transpose (see Lemma 1).

• The most surprising result in this paper is that we can learn a linear encoder using an unfolded -step projected subgradient decoder and the learned measurement matrix still performs well for the original -minimization decoder. We empirically compare 10 algorithms over 6 sparse datasets (3 synthetic and 3 real). As shown in Figure 2, using the measurement matrix learned from our autoencoder, we can compress the sparse vectors (in the test set) to a lower dimension (by a factor of 1.1-3x) than random Gaussian matrices while still being able to perfectly recover the original sparse vectors (see also Table 2). This demonstrates the superior ability of our autoencoder in learning and adapting to the additional structures in the given data.

• Although our autoencoder is specifically designed for -minimization decoder, the learned measurement matrix also performs well (and can perform even better) with the model-based decoder (Baraniuk et al., 2010) (Figure 3). This further demonstrates the benefit of learning a measurement matrix from data. As a baseline algorithm, we slightly modify the original model-based CoSaMP algorithm by adding a positivity constraint without changing the theoretical guarantee (Appendix C), which could be of independent interest.

• Besides the application in compressed sensing, one interesting direction for furture research is to use the proposed autoencoder -AE in other supervised tasks. We illustrate a potential application of -AE in extreme multi-label learning. We show that -AE can be used to learn label embeddings for multi-label datasets. We compare the resulted method with one of the state-of-the-art embedding-based methods SLEEC (Bhatia et al., 2015) over two benchmark datasets. Our method is able to achieve better or comparable precision scores than SLEEC (see Table 3).

## 2 Related work

We briefly review the relevant work, and highlight the differences compared to our paper.

Model-based compressed sensing (CS). Model-based CS (Baraniuk et al., 2010; Hegde et al., 2014) extends the conventional compressed sensing theory by considering more realistic structured models than simple sparsity. It requires to know the sparsity structure a priori, which is not always possible in practice. Our approach, by contrast, does not require a priori knowledge about the sparsity structure.

Learning-based measurement design. Most theoretical results in CS are based on random measurement matrices. There are a few approaches proposed to learn a measurement matrix from training data. One approach is to learn a near-isometric embedding that preserves pairwise distance (Hegde et al., 2015; Bah et al., 2013). This approach usually requires computing the pairwise difference vectors, and hence is computationally expensive if both the number of training samples and the dimension are large (which is the setting that we are interested in). Another approach restricts the form of the measurement matrix, e.g., matrices formed by a subset of rows of a given basis matrix. The learning problem then becomes selecting the best subset of rows (Baldassarre et al., 2016; Li & Cevher, 2016; Gözcü et al., 2018). In Figure 2, we compare our method with the learning-based subsampling method proposed in (Baldassarre et al., 2016), and show that our method needs fewer measurements to recover the original sparse vector.

Adaptive CS. In adaptive CS (Seeger & Nickisch, 2011; Arias-Castro et al., 2013; Malloy & Nowak, 2014), the new measurement is designed based on the previous measurements in order to maximize the the gain of new information. This is in contrast to our setting, where we are given a set of training samples. Our goal is to learn a good measurement matrix to leverage additional structure in the given samples.

Dictionary learning. Dictionary learning (Aharon et al., 2006; Mairal et al., 2009) is the problem of learning an overcomplete set of basis vectors from data so that every datapoint (presumably dense) can be represented as a sparse linear combination of the basis vectors. By contrast, this paper focuses on learning a good measurement matrix for structured sparse data.

Sparse coding. The goal of sparse coding (Olshausen & Field, 1996; Donoho & Elad, 2003)

is to find the sparse representation of a dense vector, given a fixed family of basis vectors (aka a dictionary). Training a deep neural network to compute the sparse codes becomes popular recently

(Gregor & LeCun, 2010; Sprechmann et al., 2015; Wang et al., 2016). Several recent papers (see, e.g., (Kulkarni et al., 2016; Xin et al., 2016; Shi et al., 2017; Jin et al., 2017; Mardani et al., 2017; Mousavi et al., 2017, 2019, 2015; Mousavi & Baraniuk, 2017; He et al., 2017; Zhang & Ghanem, 2017; Lohit et al., 2018)

and the references therein) proposes new convolutional neural network architectures for image reconstruction from low-dimensional measurements. Note that some architectures such as

(Lohit et al., 2018; Shi et al., 2017; Mousavi et al., 2015, 2017, 2019) also have an image sensing component, and hence the overall architecture becomes an autoencoder. Sparse coding is different from this paper, because we focus on learning a good measurement/encoding matrix rather than learning a good recovery/decoding algorithm.

Autoencoders.

An autoencoder is a popular neural network architecture for unsupervised learning. It has applications in dimensionality reduction

(Hinton & Salakhutdinov, 2006), pre-training (Erhan et al., 2010), image compression and recovery (Lohit et al., 2018; Mousavi et al., 2015, 2017, 2019), denoising (Vincent et al., 2010), and generative modeling (Kingma & Welling, 2014). In this paper we design a novel autoencoder -AE to learn a compressed sensing measurement matrix for the decoder. We focus on high-dimensional sparse (non-image) data such as the categorical data and bag-of-words data (see Table 1).

Unrolling. The idea of unfolding an iterative algorithm (e.g., gradient descent of an optimization problem) into a neural network structure is a natural way of making the algorithm trainable (see, e.g., (Gregor & LeCun, 2010; Hershey et al., 2014; Sprechmann et al., 2015; Xin et al., 2016; Wang et al., 2016; Shi et al., 2017; Jin et al., 2017; Mardani et al., 2017; He et al., 2017; Zhang & Ghanem, 2017) and references therein). The main difference between the previous papers and this paper is that, the previous papers seek a trained neural network as a replacement of the original optimization-based algorithm, while in this paper we design an autoencoder based on the unrolling idea, and after training, we show that the learned measurement matrix still performs well with the original -minimization decoder.

Extreme multi-label learning (XML).

The goal of XML is to learn a classifier to identify (for each datapoint) a subset of relevant labels from a extreme large label set. Different approaches have been proposed for XML, e.g., embedding-based

(Bhatia et al., 2015; Yu et al., 2014; Mineiro & Karampatziakis, 2015; Tagami, 2017), tree-based (Prabhu & Varma, 2014; Jain et al., 2016; Jasinska et al., 2016; Prabhu et al., 2018a), 1-vs-all (Prabhu et al., 2018b; Babbar & Schölkopf, 2017; Yen et al., 2016, 2017; Niculescu-Mizil & Abbasnejad, 2017; Hariharan et al., 2012)

, and deep learning

(Jernite et al., 2017; Liu et al., 2017). Here we focus on the embedding-based approach. In Section 5 we show that the proposed autoencoder can be used to learn label embeddings for multi-label datasets and the resulted method achieves better or comparable precision scores as SLEEC (Bhatia et al., 2015) over two benchmark datasets.

## 3 Our algorithm

Our goal is to learn a measurement matrix from the given sparse vectors. This is done via training a novel autoencoder, called -AE. In this section, we will describe the key idea behind the design of -AE. In this paper we focus on the vectors that are sparse in the standard basis and also nonnegative222Extending our method to more general settings is left for future research.. This is a natural setting for many real-world datasets, e.g., categorical data and bag-of-words data.

### 3.1 Intuition

Our design is strongly motivated by the projected subgradient method used to solve an -minimization problem. Consider the following -minimization problem:

 minx′∈Rd∥x′∥1% s.t. Ax′=y, (4)

where and . We assume that and that has rank . One approach333

Another approach is via linear programming.

to solving (4) is the projected subgradient method. The update is given by

 x(t+1)=Π(x(t)−αtg(t)), where g(t)=% sign(x(t)) (5)

where denotes the (Euclidean) projection onto the convex set , is the sign function, i.e., the subgradient of the objective function at , and is the step size at the -th iteration. Since has full row rank, has a closed-form solution given by

 Π(z) =argminh∥h−z∥22s.t. Ah=y (6) =z+argminh′∥h′∥22s.t. Ah′=y−Az (7) =z+A†(y−Az), (8)

where is the Moore-Penrose inverse of matrix . Substituting (8) into (5), and using the fact that , we get the following update equation

 x(t+1)=x(t)−αt(I−A†A)sign(x(t)), (9)

where

is the identity matrix. We use

(which satisfies the constraint ) as the starting point.

As mentioned in the Introduction, our main idea is to replace the solution of an decoder given in (4) by a -step projected subgradient update given in (9). One technical difficulty444One approach is to replace by a trainable matrix . This approach performs worse than ours (see Figure 3). in simulating (9

) using neural networks is backpropagating through the pesudoinverse

. Fortunately, Lemma 1 shows that it is possible to replace by .

###### Lemma 1.

For any vector , and any matrix () with rank , there exists an

with all singular values being ones, such that the following two

-minimization problems have the same solution:

 P1:minx′∈Rd∥x′∥1s.t. Ax′=Ax. (10) P2:minx′∈Rd∥x′∥1s.t. ~Ax′=~Ax. (11)

Furthermore, the projected subgradient update of is

 x(t+1)=x(t)−αt(I−~AT~A)sign(x(t)),x(1)=~AT~Ax. (12)

A natural choice for is , where

can be any unitary matrix.

Lemma 1 essentially says that: 1) Instead of searching over all matrices (of size -by- with rank ), it is enough to search over a subset of matrices , whose singular values are all ones. This is because and has the same recovery performance for -minimization (this is true as long as and have the same null space). 2) The key benefit of searching over matrices with singular values being all ones is that the corresponding projected subgradient update has a simpler form: the annoying pesudoinverse term in (9) is replaced by a simple matrix transpose in (12).

As we will show in the next section, our decoder is designed to simulate (12) instead of (9): the only difference is that the pesudoinverse term is replaced by matrix transpose . Ideally we should train our -AE by enforcing the constraint that the matrices have singular values being ones. In practice, we do not enforce that constraint during training. We empirically observe that the learned measurement matrix is not far from the constraint set (see Appendix D.4).

### 3.2 Network Structure of ℓ1-Ae

As shown in Figure 1, -AE has a simple linear encoder and a non-linear decoder. When a data point comes, it is encoded as , where is the encoding matrix that will be learned from data. A decoder is then used to recover the original vector from its embedding .

The decoder network of -AE consists of blocks connected in a feedforward manner: the output vector of the -th block is the input vector to the -th block. The network structure inside each block is identical. Let . For , if is the input to the -th block, then its output vector is

 x(t+1)=x(t)−αt(I−ATA)sign(x(t)), (13)

where are scalar variables to be learned from data. We empirically observe that regularizing to have the following form555This form is sometimes called square summable but not summable (Boyd, 2014). for improves test accuracy. Here,

is the only scalar variable to be learned from data. We also add a standard batch normalization (BN) layer

(Ioffe & Szegedy, 2015) inside each block, because empirically it improves the test accuracy (see Figure 3). After

blocks, we use rectified linear units (ReLU) in the last layer

666This makes sense as we focus on nonnegative sparse vectors in this paper. Nonnegativity is a natural setting for many real-world sparse datasets, e.g., categorical data and text data. to obtain the final output : .

It is worth noting that the low-rank structure of the weight matrix in (13) is essential for reducing the computational complexity. A fully-connected layer requires a weight matrix of size , which is intractable for large .

Given unlabeled training examples , we will train an -AE to minimize the average squared distance between and :

 minA∈Rm×d,β∈R1nn∑i=1∥xi−^xi∥22. (14)

## 4 Experiments

We implement

-AE in Tensorflow 1.4. Our code can be found at

https://github.com/wushanshan/L1AE. The goal of this section is to demonstrate that -AE is able to learn a good measurement matrix for the structured sparse datasets, when we use -minimization for decoding777We use Gurobi (a commercial optimization solver) to solve it..

### 4.1 Datasets and training Figure 2: Best viewed in color. Recovery performance over the test set: fraction of exactly recovered data points (the 1st row and the left half of the 3rd row); reconstruction error (the 2nd row and the right half of the 3rd row). We only plot the test RMSE for the following baselines: simple AE, 4-layer AE, ℓ1-AE, and PCA. This is because they cannot produce a perfect reconstruction ^x that satisfies ∥x−^x∥2≤10−10 (see also Table 2). For synthetic data, we plot the mean and standard deviation (indicated by the error bars) across 10 randomly generated datasets. Note that model-based CoSaMP decoder is not applicable for the Synthetic3, Wiki10-31K, and the RCV1 datasets. Although the model-based CoSaMP decoder has more information about the given data (such as block sparsity and one-hot sparsity) than the ℓ1-minimization decoder, our “ℓ1-AE + ℓ1-min pos” still gives the best recovery performance across all datasets.

Synthetic datasets. As shown in Table 1, we generate three synthetic datasets: two satisfy the block sparsity model888A signal is called -block sparse with block size if it satisfies: 1) can be reshaped into a matrix of size , where ; 2) every column of acts as a group, i.e., the entire column is either zero or nonzero; 3) has nonzero columns, and hence has sparsity . (Baraniuk et al., 2010), and one follows the power-law structured sparsity (The

-th coordinate is nonzero with probability proportional to

). Each sample is generated as follows: 1) choose a random support set satisfying the sparsity model; 2) set the nonzeros to be uniform in .

Real datasets. Our first dataset is from Kaggle “Amazon Employee Access Challenge”. Each training example contains 9 categorical features. We use one-hot encoding to transform each example. Our second dataset Wiki10-31K is a multi-label dataset downloaded from this repository (XML, 2017). We only use the label vectors to train our autoencoder. Our third dataset is RCV1(Lewis et al., 2004), a popular text dataset. We use scikit-learn to fetch the training set and randomly split it into train/validation/test sets.

Training.

We use stochastic gradient descent to train the autoencoder. Before training, every sample is normalized to have unit

norm. The parameters are initialized as follows: is a random Gaussian matrix with standard deviation ; is initialized as 1.0. Other hyper-parameters are given in Appendix B. A single NVIDIA Quadro P5000 GPU is used in the experiments. We set the decoder depth for most datasets101010Although the subgradient method (Boyd, 2014) has a convergence rate, in practice, we found that a small value of (e.g., ) seemed to be good enough (see Table 2).. Training an -AE can be done in several minutes for small-scale datasets and in around an hour for large-scale datasets.

### 4.2 Algorithms

We compare 10 algorithms in terms of their recovery performance. The results are shown in Figure 2. All the algorithms follow a two-step “encoding decoding” process.

-AE + -min pos (our algorithm): After training an -AE , we use the encoder matrix as the measurement matrix. To decode, we use Gurobi (a commercial optimization solver) to solve the following -minimization problem with positivity constraint (denoted as “-min pos”):

 minx′∈Rd∥x′∥1% s.t. Ax′=y,x′≥0. (15)

Since we focus on nonnegative sparse vectors in this paper, adding a positivity constraint improves the recovery performance111111The sufficient and necessary condition (Theorem 3.1 of (Khajehnejad et al., 2011)) for exact recovery using (15) is weaker than the the nullspace property (NSP) (Rauhut, 2010) for (4). (see Appendix D.3).

Gauss + -min pos / model-based CoSaMP pos: A random Gaussian matrix with i.i.d. entries is used as the measurement matrix121212Additional experiments with random partial Fourier matrices (Haviv & Regev, 2017) can be found in Appendix D.2.. We experiment with two decoding methods: 1) Solve the optimization problem given in (15); 2) Use the model-based CoSaMP algorithm131313Model-based method needs the explicit sparsity model, and hence is not applicable for RCV1, Wiki10-31K, and Synthetic3. (Algorithm 1 in (Baraniuk et al., 2010)) with an additional positivity constraint (see Appendix C).

PCA or PCA + -min pos

: We perform truncated singular value decomposition (SVD) on the training set. Let

be the top- singular vectors. For PCA, every vector in the test set is estimated as . We can also use “-min pos” (15) as the decoder.

Simple AE or Simple AE + -min pos: We train a simple autoencoder: for an input vector , the output is , where both and

are learned from data. We use the same loss function as our autoencoder. After training, we use the learned

matrix as the measurement matrix. Decoding is performed either by the learned decoder or solving “-min pos” (15).

LBCS + -min pos / model-based CoSaMP pos: We implement the learning-based compressive subsampling (LBCS) method in (Baldassarre et al., 2016). The idea is to select a subset (of size ) of coordinates (in the transformed space) that preserves the most energy. We use Gaussian matrix as the basis matrix and “-min pos” as the decoder141414We tried four variations of LBCS: two different basis matrices (random Gaussian matrix and DCT matrix), two different decoders (-minimization and linear decoder). The combination of Gaussian and -minimization performs the best (see Appendix D.5).. Decoding with “model-based CoSaMP pos” is in Figure 3.

4-layer AE: We train a standard 4-layer autoencoder (we do not count the input layer), whose encoder network (and decoder network) consists of two feedforward layers with ReLU activation. The dimension of the 1st (and 3rd) layer is tuned based on the performance on the validation set.

### 4.3 Results and analysis Figure 3: Best viewed in color. Left three plots: Although ℓ1-AE is designed for the ℓ1-min decoder, the matrix learned from ℓ1-AE also improves the recovery performance when the decoder is model-based CoSaMP. Right plot: Recovery performance of “ℓ1-AE + ℓ1-min pos” on the Amazon test set when we slightly change the decoder structure. Each change is applied in isolation.

The experimental results are shown in Figure 2. Two performance metrics are compared. The first one is the fraction of exactly recovered test samples. We say that a vector is exactly recovered by an algorithm if it produces an that satisfies . The second metric is the root mean-squared error (RMSE) over the test set151515Note that in Figure 2, test RMSE has similar scale across all datasets, because the vectors are normalized to have unit norm..

As shown in Figure 2, our algorithm “-AE + -min pos” outperforms the rest baselines over all datasets. By learning a data-dependent linear encoding matrix, our method requires fewer measurements to achieve perfect recovery.

Learned decoder versus -min decoder. We now compare two methods: -AE and -AE + -min pos. They have a common encoder but different decoders. As shown in Figure 2, “-AE + -min pos” almost always gives a smaller RMSE. In fact, as shown in Table 2, “-min pos” is able to achieve reconstruction errors in the order of 1e-15, which is impossible for a neural network. The strength of optimization-based decoder over neural network-based decoder has been observed before, e.g., see Figure 1 in (Bora et al., 2017)161616As indicated by Figure 1 in (Bora et al., 2017), LASSO gives better reconstruction than GAN-based method when given enough Gaussian measurements.. Nevertheless, neural network-based decoder usually has an advantage that it can handle nonlinear encoders, for which the corresponding optimization problem may become non-convex and hard to solve exactly.

Model-based decoder versus learned encoder. It is interesting to see that our algorithm even outperforms model-based method (Baraniuk et al., 2010), even though the model-based decoder has more information about the given data than -minimization decoder. For the Amazon dataset, compared to “Gauss + model-based CoSaMP pos”, our method reduces the number of measurements needed for exact recovery by about 2x. This is possible because the model-based decoder only knows that the input vector comes from one-hot encoding, which is a coarse model for the underlying sparsity model. By contrast, -AE learns a measurement matrix directly from the given training data.

A natural question to ask is whether the measurement matrix learned by -AE can improve the recovery performance of the model-based decoding algorithm. As shown in Figure 3, the recovery performance of “-AE + model-based CoSaMP pos” is better than “Gauss + model-based CoSaMP pos”. This further demonstrates the benefit of learning a data-adaptive measurement matrix.

Variations of -AE. We now examine how slightly varying the decoder structure would affect the performance. We make the following changes to the decoder structure: 1) remove the Batch Normalization layer; 2) remove the ReLU operation in the last layer; 3) change the nonlinearity from sign to ; 4) replace the term in the decoder network by a matrix that is learned from data; 5) use one-layer neural network as the decoder, i.e., set in -AE . Each change is applied in isolation. As shown in Figure 3, our design gives the best recovery performance among all the variations for the Amazon dataset.

Decoder depth of -AE. The decoder depth is a tuning parameter of -AE . Empirically, the performance of the learned matrix improves as increases (see Table 2). On the other hand, the training time increases as increases. The parameter is tuned as follows: we start with , and stop if the validation performance improvement is smaller than a threshold or if equals . The hyper-parameters used for training -AE are given in Appendix B. We set for Synthetic2 and Synthetic3, for Synthetic1, RCV1, and Wiki10-31K, and for Amazon dataset. The autoencoder is trained on a single GPU. Training an -AE takes a few minutes for small datasets and around an hour for large datasets.

## 5 Application in extreme multi-label learning

We have proposed a novel autoencoder -AE to learn a compressed sensing measurement matrix for high-dimensional sparse data. Besides the application of -AE in compressed sensing, one interesting direction for future research is to use

-AE in other machine learning tasks. Here we illustrate a potential application of

-AE in extreme multi-label learning (XML). For each data point, the goal of XML is to predict a subset of relevant labels from a extremely large label set. As a result, the output label vector is high-dimensional, sparse, and nonnegative (with 1’s denoting the relevant labels and 0’s otherwise).

Many approaches have been proposed for XML(see (XML, 2017) for the benchmark algorithms and datasets). Here we focus on the embedding-based approach171717Other approaches include tree-based, 1-vs-all, etc., and one of the state-of-the-art embedding-based approaches181818AnnexML (Tagami, 2017) is a graph-embedding approach for XML. Some of its techniques (such as better partition of the input data points) can be potentially used with our method. is SLEEC (Bhatia et al., 2015). Given training samples , , where , , we use and to denote the stacked feature matrix and label matrix. SLEEC works in two steps. In Step 1, SLEEC reduces the dimension of the labels by learning an low-dimension embedding for each training sample. Let (where ) denote the learned embedding matrix. Note that only the matrix is used for learning in this step. In Step 2, SLEEC learns a linear mapping such that . To predict the label vector for a new sample , SLEEC uses nearest neighbors method: first computes the embedding , identifies a few nearest neighbors (from the training set) in the embedding space, and uses the sum of their label vectors as prediction.

The method that we propose follows SLEEC’s two-step procedure. The main difference is that in Step 1, we train an autoencoder -AE to learn embeddings for the labels . Note that in XML, the label vectors are high-dimensional, sparse, and nonnegative. Let be the learned measurement matrix for , then the embedding matrix is . In Step 2, we use the same subroutine as SLEEC to learn a linear mapping from to . To predict the label vector for a new sample, we compared three methods in our experiments: 1) use the nearest neighbor method (same as SLEEC); 2) use the decoder of the trained -AE (which maps from the embedding space to label space); 3) use an average of the label vectors obtained from 1) and 2). The three methods are denoted as “-AE 1/2/3” in Table 3.

In Table 3, we compare the precision score P@1 over two benchmark datasets. The second row is the number of models combined in the ensemble. According to the XML benchmark website (XML, 2017), SLEEC achieves a precision score 0.7926 for EURLex-4K and 0.8588 for Wiki10-31K, which are consistent with what we obtained by running their code (after combining 5 models in the ensemble). The embedding dimensions are for EURLex-4K and for Wiki10-31K. We set for -AE. As shown in Table 3, our method has higher score than SLEEC for EURLex-4K. For Wiki10-31K, a single model of our method has higher score than SLEEC. When 3 or 5 models are ensembled, our method has comparable precision score to SLEEC. More results can be found in Appendix D.6.

## 6 Conclusion

Combining ideas from compressed sensing, convex optimization and deep learning, we proposed a novel unsupervised learning framework for high-dimensional sparse data. The proposed autoencoder -AE is able to learn an efficient measurement matrix by adapting to the sparsity structure of the given data. The learned measurement matrices can be subsequently used in other machine learning tasks such as extreme multi-label learning. We expect that the learned

-AE can lead to useful representations in various supervised learning pipelines, for datasets that are well represented by large sparse vectors. Investigating the relation between the training data and the learned matrix (see Appendix

D.1 for a toy example) is an interesting direction for future research.

## References

• Aharon et al. (2006) Aharon, M., Elad, M., and Bruckstein, A. k-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
• Arias-Castro et al. (2013) Arias-Castro, E., Candès, E. J., and Davenport, M. A. On the fundamental limits of adaptive sensing. IEEE Transactions on Information Theory, 59(1):472–481, 2013.
• Arora et al. (2018) Arora, S., Khodak, M., Nikunj, S., and Vodrahalli, K.

A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and lstms.

In International Conference on Learning Representations (ICLR), 2018.
• Babbar & Schölkopf (2017) Babbar, R. and Schölkopf, B. Dismec: distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 721–729. ACM, 2017.
• Bah et al. (2013) Bah, B., Sadeghian, A., and Cevher, V. Energy-aware adaptive bi-lipschitz embeddings. In Proceedings of the 10th International Conference on Sampling Theory and Applications, 2013.
• Baldassarre et al. (2016) Baldassarre, L., Li, Y.-H., Scarlett, J., Gözcü, B., Bogunovic, I., and Cevher, V. Learning-based compressive subsampling. IEEE Journal of Selected Topics in Signal Processing, 10(4):809–822, 2016.
• Bandeira et al. (2013) Bandeira, A. S., Dobriban, E., Mixon, D. G., and Sawin, W. F. Certifying the restricted isometry property is hard. IEEE transactions on information theory, 59(6):3448–3450, 2013.
• Baraniuk et al. (2010) Baraniuk, R. G., Cevher, V., Duarte, M. F., and Hegde, C. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
• Bhatia et al. (2015) Bhatia, K., Jain, H., Kar, P., Varma, M., and Jain, P. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, pp. 730–738, 2015.
• Blumensath & Davies (2009) Blumensath, T. and Davies, M. E. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
• Bora et al. (2017) Bora, A., Jalal, A., Price, E., and Dimakis, A. G. Compressed sensing using generative models. In International Conference on Machine Learning, pp. 537–546, 2017.
• Boutsidis et al. (2015) Boutsidis, C., Garber, D., Karnin, Z., and Liberty, E. Online principal components analysis. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms (SODA), pp. 887–901. Society for Industrial and Applied Mathematics, 2015.
• Boyd (2014) Boyd, S. Subgradient methods. Notes for EE364b, Stanford University, Spring 2013–14, 2014.
• Candès (2008) Candès, E. J. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589–592, 2008.
• Candès et al. (2006) Candès, E. J., Romberg, J., and Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
• Donoho (2006) Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
• Donoho & Elad (2003) Donoho, D. L. and Elad, M. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
• Donoho & Tanner (2005) Donoho, D. L. and Tanner, J. Sparse nonnegative solution of underdetermined linear equations by linear programming. Proceedings of the National Academy of Sciences of the United States of America, 102(27):9446–9451, 2005.
• Donoho et al. (2009) Donoho, D. L., Maleki, A., and Montanari, A. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
• Erhan et al. (2010) Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
• Gözcü et al. (2018) Gözcü, B., Mahabadi, R. K., Li, Y.-H., Ilıcak, E., Çukur, T., Scarlett, J., and Cevher, V. Learning-based compressive mri. arXiv preprint arXiv:1805.01266, 2018.
• Gregor & LeCun (2010) Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 399–406, 2010.
• Hariharan et al. (2012) Hariharan, B., Vishwanathan, S., and Varma, M. Efficient max-margin multi-label classification with applications to zero-shot learning. Machine learning, 88(1-2):127–155, 2012.
• Haviv & Regev (2017) Haviv, I. and Regev, O. The restricted isometry property of subsampled fourier matrices. In Geometric Aspects of Functional Analysis, pp. 163–179. Springer, 2017.
• He et al. (2017) He, H., Xin, B., Ikehata, S., and Wipf, D. From bayesian sparsity to gated recurrent nets. In Advances in Neural Information Processing Systems (NIPS), pp. 5560–5570, 2017.
• Hegde et al. (2014) Hegde, C., Indyk, P., and Schmidt, L. Approximation-tolerant model-based compressive sensing. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms (SODA), pp. 1544–1561. SIAM, 2014.
• Hegde et al. (2015) Hegde, C., Sankaranarayanan, A. C., Yin, W., and Baraniuk, R. G. Numax: A convex approach for learning near-isometric linear embeddings. IEEE Transactions on Signal Processing, 63(22):6109–6121, 2015.
• Hershey et al. (2014) Hershey, J. R., Roux, J. L., and Weninger, F. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574, 2014.
• Hinton & Salakhutdinov (2006) Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
• Hotelling (1933) Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
• Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448–456, 2015.
• Jain et al. (2016) Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–944. ACM, 2016.
• Jasinska et al. (2016) Jasinska, K., Dembczynski, K., Busa-Fekete, R., Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016.
• Jernite et al. (2017) Jernite, Y., Choromanska, A., and Sontag, D. Simultaneous learning of trees and representations for extreme classification and density estimation. In Proceedings of the 34th International Conference on Machine Learning, pp. 1665–1674, 2017.
• Jin et al. (2017) Jin, K. H., McCann, M. T., Froustey, E., and Unser, M. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509–4522, 2017.
• Khajehnejad et al. (2011) Khajehnejad, M. A., Dimakis, A. G., Xu, W., and Hassibi, B. Sparse recovery of nonnegative signals with minimal expansion. IEEE Transactions on Signal Processing, 59(1):196–208, 2011.
• Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In The International Conference on Learning Representations (ICLR), 2014.
• Kulkarni et al. (2016) Kulkarni, K., Lohit, S., Turaga, P., Kerviche, R., and Ashok, A. Reconnet: Non-iterative reconstruction of images from compressively sensed measurements. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

, pp. 449–458, 2016.
• Lewis et al. (2004) Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361–397, 2004.
• Li et al. (2017) Li, C. J., Wang, M., and Zhang, T. Diffusion approximations for online principal component estimation and global convergence. In Advances in Neural Information Processing Systems (NIPS), 2017.
• Li & Cevher (2016) Li, Y.-H. and Cevher, V. Learning data triage: linear decoding works for compressive mri. In ICASSP, 2016.
• Liu et al. (2017) Liu, J., Chang, W.-C., Wu, Y., and Yang, Y. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124. ACM, 2017.
• Lohit et al. (2018) Lohit, S., Kulkarni, K., Kerviche, R., Turaga, P., and Ashok, A. Convolutional neural networks for non-iterative reconstruction of compressively sensed images. IEEE Transactions on Computational Imaging, 2018.
• Mairal et al. (2009) Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 689–696. ACM, 2009.
• Malloy & Nowak (2014) Malloy, M. L. and Nowak, R. D. Near-optimal adaptive compressed sensing. IEEE Transactions on Information Theory, 60(7):4001–4012, 2014.
• Mardani et al. (2017) Mardani, M., Monajemi, H., Papyan, V., Vasanawala, S., Donoho, D., and Pauly, J. Recurrent generative adversarial networks for proximal learning and automated compressive image recovery. arXiv preprint arXiv:1711.10046, 2017.
• Mika et al. (1999) Mika, S., Schölkopf, B., Smola, A. J., Müller, K.-R., Scholz, M., and Rätsch, G. Kernel pca and de-noising in feature spaces. In Advances in neural information processing systems, pp. 536–542, 1999.
• Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
• Mineiro & Karampatziakis (2015) Mineiro, P. and Karampatziakis, N. Fast label embeddings via randomized linear algebra. In Joint European conference on machine learning and knowledge discovery in databases, pp. 37–51. Springer, 2015.
• Mousavi & Baraniuk (2017) Mousavi, A. and Baraniuk, R. G. Learning to invert: Signal recovery via deep convolutional networks. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017.
• Mousavi et al. (2015) Mousavi, A., Patel, A. B., and Baraniuk, R. G. A deep learning approach to structured signal recovery. In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pp. 1336–1343. IEEE, 2015.
• Mousavi et al. (2017) Mousavi, A., Dasarathy, G., and Baraniuk, R. G. Deepcodec: Adaptive sensing and recovery via deep convolutional neural networks. In 55th Annual Allerton Conference on Communication, Control and Computing, 2017.
• Mousavi et al. (2019) Mousavi, A., Dasarathy, G., and Baraniuk, R. G. A data-driven and distributed approach to sparse signal representation and recovery. In International Conference on Learning Representations, 2019.
• Needell & Tropp (2009) Needell, D. and Tropp, J. A. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.
• Niculescu-Mizil & Abbasnejad (2017) Niculescu-Mizil, A. and Abbasnejad, E. Label filters for large scale multilabel classification. In Artificial Intelligence and Statistics, pp. 1448–1457, 2017.
• Olshausen & Field (1996) Olshausen, B. A. and Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, 1996.
• Pan (2016) Pan, V. Y. How bad are vandermonde matrices? SIAM Journal on Matrix Analysis and Applications, 37(2):676–694, 2016.
• Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. Glove: Global vectors for word representation. In

Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)

, pp. 1532–1543, 2014.
• Prabhu & Varma (2014) Prabhu, Y. and Varma, M. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 263–272. ACM, 2014.
• Prabhu et al. (2018a) Prabhu, Y., Kag, A., Gopinath, S., Dahiya, K., Harsola, S., Agrawal, R., and Varma, M. Extreme multi-label learning with label features for warm-start tagging, ranking & recommendation. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 441–449. ACM, 2018a.
• Prabhu et al. (2018b) Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., and Varma, M. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 993–1002, 2018b.
• Rauhut (2010) Rauhut, H. Compressive sensing and structured random matrices. Theoretical Foundations and Numerical Methods for Sparse Recovery, 9:1–92, 2010.
• Seeger & Nickisch (2011) Seeger, M. W. and Nickisch, H.

Large scale bayesian inference and experimental design for sparse linear models.

SIAM Journal on Imaging Sciences, 4(1):166–199, 2011.
• Shi et al. (2017) Shi, W., Jiang, F., Zhang, S., and Zhao, D. Deep networks for compressed image sensing. In Multimedia and Expo (ICME), 2017 IEEE International Conference on, pp. 877–882. IEEE, 2017.
• Sprechmann et al. (2015) Sprechmann, P., Bronstein, A. M., and Sapiro, G. Learning efficient sparse and low rank models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1821–1833, 2015.
• Tagami (2017) Tagami, Y. Annexml: Approximate nearest neighbor search for extreme multi-label classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 455–464. ACM, 2017.
• Tropp & Gilbert (2007) Tropp, J. A. and Gilbert, A. C. Signal recovery from partial information via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.
• Vincent et al. (2010) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A.

Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.

Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
• Wang et al. (2016) Wang, Z., Ling, Q., and Huang, T. Learning deep l0 encoders. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
• Wu et al. (2016) Wu, S., Bhojanapalli, S., Sanghavi, S., and Dimakis, A. G. Single pass pca of matrix products. In Advances in Neural Information Processing Systems (NIPS), pp. 2585–2593, 2016.
• Xin et al. (2016) Xin, B., Wang, Y., Gao, W., Wipf, D., and Wang, B. Maximal sparsity with deep networks? In Advances in Neural Information Processing Systems (NIPS), pp. 4340–4348, 2016.
• XML (2017) XML. The extreme classification repository: Multi-label datasets and code.
• Yen et al. (2017) Yen, I. E., Huang, X., Dai, W., Ravikumar, P., Dhillon, I., and Xing, E. Ppdsparse: A parallel primal-dual sparse method for extreme classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 545–553. ACM, 2017.
• Yen et al. (2016) Yen, I. E.-H., Huang, X., Ravikumar, P., Zhong, K., and Dhillon, I. Pd-sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In International Conference on Machine Learning, pp. 3069–3077, 2016.
• Yu et al. (2014) Yu, H.-F., Jain, P., Kar, P., and Dhillon, I. Large-scale multi-label learning with missing labels. In International conference on machine learning, pp. 593–601, 2014.
• Zhang & Ghanem (2017) Zhang, J. and Ghanem, B. Ista-net: Iterative shrinkage-thresholding algorithm inspired deep network for image compressive sensing. arXiv preprint arXiv:1706.07929, 2017.

## Appendix A Proof of Lemma 1

For convenience, we re-state Lemma 1 here and then give the proof.

###### Lemma.

For any vector , and any matrix () with rank , there exists an with all singular values being ones, such that the following two -norm minimization problems have the same solution:

 P1:minx′∈Rd∥x′∥1s.t. Ax′=Ax. (16) P2:minx′∈Rd∥x′∥1s.t. ~Ax′=~Ax. (17)

Furthermore, the projected subgradient update of is given as

 x(t+1)=x(t)−αt(I−~AT~A)sign(x(t)),x(1)=~AT~Ax.

A natural choice for is , where can be any unitary matrix.

###### Proof.

To prove that and give the same solution, it suffices to show that their constraint sets are equal, i.e.,

 {x:Ax=Az}={x:~Ax=~Az}. (18)

Since and , it then suffices to show that and have the same nullspace:

 null(A)=null(~A). (19)

If satisfies , then , which implies . Conversely, we suppose that . Since is unitary, is full-rank, , which implies that . Therefore, (19) holds.

The projected subgradient of has the following update

 x(t+1) =x(t)−αt(I−~AT(~A~AT)−1~A)sign(x(t)), (20) x(1) =~AT(~A~AT)−1~Az (21)

Since , we have

 ~A~AT =U(AAT)−1/2AAT(AAT)−1/2UT =U(AAT)−1/2(AAT)1/2(AAT)1/2(AAT)−1/2UT =I. (22)

Substituting (22) into (21) gives the desired recursion:

 x(t+1)=x(t)−αt(I−~AT~A)sign(x(t)),x(1)=~AT~Az.

## Appendix B Training parameters

Table 4 lists the parameters used to train -AE in our experiments. We explain the parameters as follows.

• Depth: The number of blocks in the decoder, indicated by in Figure 1.

• Batch size: The number of training samples in a batch.

• Learning rate: The learning rate for SGD.

• : Maximum number of training epochs.

• : Validation error is computed every epochs. This is used for early-stopping.

• : Training is stopped if the validation error does not improve for epochs.

## Appendix C Model-based CoSaMP with additional positivity constraint

The CoSaMP algorithm (Needell & Tropp, 2009) is a simple iterative and greedy algorithm used to recover a -sparse vector from the linear measurements. The model-based CoSaMP algorithm (Algorithm 1 of (Baraniuk et al., 2010)) is a modification of the CoSaMP algorithm. It uses the prior knowledge about the support of the -sparse vector, which is assumed to follow a predefined structured sparsity model. In this section we slightly modify the model-based CoSaMP algorithm to ensure that the output vector follows the given sparsity model and is also nonnegative.

To present the pseudocode, we need a few definitions. We begin with a formal definition for the structured sparsity model and the sparse approximation algorithm . For a vector , let be entries of in the index set . Let be the complement of set .

###### Definition 1 ((Baraniuk et al., 2010)).

A structured sparsity model is defined as the union of canonical -dimensional subspaces

 MK=mK⋃m=1Xms.t. Xm={x:x|Ωm∈RK,x|ΩCm=0}, (23)

where is the set containing all allowed supports, with for each , and each subspace contains all signals with .

We define as the algorithm that obtains the best -term structured sparse approximation of in the union of subspaces :

 M(x,K)=argmin¯x∈MK∥x−¯x∥2. (24)

We next define an enlarged set of subspaces and the associated sparse approximation algorithm.

###### Definition 2 ((Baraniuk et al., 2010)).

The -order sum for the set , with an integer, is defined as

 MBK={B∑r=1x(r),with x(r)∈MK}. (25)

We define as the algorithm that obtains the best approximation of in the union of subspaces :

 MB(x,K)=argmin¯x∈MBK∥x−¯x∥2. (26)

Algorithm 1 presents the model-based CoSaMP with positivity constraint. Comparing Algorithm 1 with the original model-based CoSaMP algorithm (Algorithm 1 of (Baraniuk et al., 2010)), we note that the only different is that Algorithm 1 has an extra step (Step 6). In Step 6 we take a ReLU operation on to ensure that is always nonnegative after Step 7.

We now show that Algorithm 1 has the same performance guarantee as the original model-based CoSaMP algorithm for structured sparse signals. Speficially, we will show that Theorem 4 of (Baraniuk et al., 2010) also applies to Algorithm 1. In (Baraniuk et al., 2010), the proof of Theorem 4 is based on six lemmas (Appendix D), among which the only lemma that is related to Step 6-7 is Lemma 6. It then suffices to prove that this lemma is also true for Algorithm 1 under the constraint that the true vector is nonnegative.

###### Lemma (Prunning).

The pruned approximation is such that

 ∥x−^xi∥2≤2∥x−b∥2. (27)
###### Proof.

Since is the -best approximation of in , and , we have

 ∥x−^xi∥2≤∥x−^b∥2+∥^b−^xi∥2≤2∥x−^b∥2≤2∥x−b∥2, (28)

where the last inequality follows from that , and . ∎

The above lemma matches Lemma 6, which is used to prove Theorem 4 in (Baraniuk et al., 2010). Since the other lemmas (i.e., Lemma 1-5 in Appendix D of (Baraniuk et al., 2010)) still hold for Algorithm 1, we conclude that the performance guarantee for structured sparse signals (i.e., Theorem 4 of (Baraniuk et al., 2010)) is also true for Algorithm 1.

In Figure 4, we compare the recovery performance of two decoding algorithms: 1) model-based CoSaMP algorithm (Algorithm 1 of (Baraniuk et al., 2010)) and 2) model-based CoSaMP algorithm with positivity constraint (indicated by “Model-based CoSaMP pos” in Figure 4). We use random Gaussian matrices as the measurement matrices. Since our sparse datasets are all nonnegative, adding the positivity constraint to the decoding algorithm is able to improve the recovery performance. Figure 4: Incorporating the positivity constraint to the model-based CoSaMP algorithm improves its recovery performance.

## Appendix D Additional experimental results

### d.1 A toy experiment

We use a simple example to illustrate that the measurement matrix learned from our autoencoder is adapted to the training samples. The toy dataset is generated as follows: each vector has 5 nonzeros randomly located in the first 20 dimensions; the nonzeros are random values between [0,1]. We train -AE on a training set with 6000 samples. The parameters are , , and learning rate 0.01. A validation set with 2000 samples is used for early-stopping.After training, we plot the matrix in Figure 5. The entries with large values are concentrated in the first 20 dimensions. This agrees with the specific structure in the toy dataset. Figure 5: Visualization of the learned matrix A∈R10×100 on the toy dataset: a color map of the matrix (upper), the column-wise ℓ2 norm (lower). Every sample in the toy dataset has 5 nonzeros, located randomly in the first 20 dimensions.

### d.2 Random partial Fourier matrices

Figure 6 is a counterpart of Figure 2. The only difference is that in Figure 6 we use random partial Fourier matrices in place of random Gaussian matrices. A random partial Fourier matrix is obtained by choosing rows uniformly and independently with replacement from the

discrete Fourier transform (DFT) matrix. We then scale each entry to have absolute value

(Haviv & Regev, 2017). Because the DFT matrix is complex, to obtain real measurements, we draw random rows from a DFT matrix to form the partial Fourier matrix.

A random partial Fourier matrix is a Vandermonde matrix. According to (Donoho & Tanner, 2005), one can exactly recover a -sparse nonnegative vector from measurements using a Vandermonde matrix (Donoho & Tanner, 2005). However, the Vandermonde matrices are numerically unstable in practice (Pan, 2016), which is consistent with our empirical observation. Comparing Figure 6 with Figure 2

, we see that the recovery performance of a random partial Fourier matrix has larger variance than that of a random Gaussian matrix. Figure 6: Recovery performance of random partial Fourier matrices. Best viewed in color. Similar to Figure 2, the error bars represent the standard deviation across 10 randomly generated datasets. We see that the recovery performance of a random partial Fourier matrix (shown in this figure) has a larger variance than that of a random Gaussian matrix (shown in Figure 2).

### d.3 ℓ1-minimization with positivity constraint

We compare the recovery performance between solving an -min (4) and an -min with positivity constraint (15). The results are shown in Figure 7. We experiment with two measurement matrices: 1) the one obtained from training our autoencoder, and 2) random Gaussian matrices. As shown in Figure 7, adding a positivity constraint to the -minimization improves the recovery performance for nonnegative input vectors. Figure 7: A comparison of the recovery performance between ℓ1-min (4) and the ℓ1-min with positivity constraint (15). The sparse recovery performance is measured on the test set. Best viewed in color. We plot the mean and standard deviation (indicated by the error bars) across 10 randomly generated datasets. Adding a positivity constraint to the ℓ1-minimization gives better recovery performance than a vanilla ℓ1-minimization.

### d.4 Singular values of the learned measurement matrices

We have shown that the measurement matrix obtained from training our autoencoder is able to capture the sparsity structure of the training data. We are now interested in looking at those data-dependent measurement matrices more closely. Table 5 shows that those matrices have singular values close to one. Recall that in Section 3.1 we show that matrices with all singular values being ones have a simple form for the projected subgradient update (12). Our decoder is designed based on this simple update rule. Although we do not explicitly enforce this constraint during training, Table 5 indicates that the learned matrices are not far from the constraint set.

### d.5 Additional experiments of LBCS

We experimented with four variations of LBCS: two different basis matrices (random Gaussian matrix and DCT matrix), two different decoders (-minimization and linear decoder). As shown in Figure 8, the combination of Gaussian and -minimization performs the best. Figure 8: We compare four variations of the LBCS method proposed in (Baldassarre et al., 2016; Li & Cevher, 2016): two basis matrices (random Gaussian and DCT matrix); two decoders (ℓ1-minimization and linear decoding). The combination of “Gaussian + ℓ1-minimization” performs the best. Best viewed in color. For each method, we plot the mean and standard deviation (indicated by the error bars) across 10 randomly generated datasets.

### d.6 Precision score comparisons for extreme multi-label learning

Table 6 compares the precision scores (P@1, P@3, P@5) over two benchmark datasets. For SLEEC, the precision scores we obtained by running their code (and combining 5 models in the ensemble) are consistent with those reported in the benchmark website (XML, 2017). Compared to SLEEC, our method (which learns label embeddings via training an autoencoder -AE ) is able to achieve better or comparable precision scores. For our method, we have experimented with three prediction approaches (denoted as “-AE 1/2/3” in Table 6): 1) use the nearest neighbor method (same as SLEEC); 2) use the decoder of the trained -AE (which maps from the embedding space to label space); 3) use an average of the label vectors obtained from 1) and 2). As indicated in Table 6, the third prediction approach performs the best.