Simulation Experiments for Provable Dictionary Learning using ReLU Autoencoders
In "Dictionary Learning" one tries to recover incoherent matrices A^* ∈ R^{n × h} (typically overcomplete, with columns assumed to be normalized) and sparse vectors x^* ∈ R^h with a small support of size h^p for some 0 < p < 1, given access to observations y ∈ R^n where y = A^* x^*. In this work we undertake a rigorous analysis of whether gradient descent on the squared loss of an autoencoder can solve the dictionary learning problem. The "Autoencoder" architecture we consider is an R^n → R^n mapping with a single ReLU activation layer of size h. Under very mild distributional assumptions on x^*, we prove that the norm of the expected gradient of the standard squared loss function is asymptotically (in sparse code dimension) negligible for all points in a small neighborhood of A^*. This is supported with experimental evidence using synthetic data. We also conduct experiments to suggest that A^* is a local minimum. Along the way we prove that a layer of ReLU gates can be set up to automatically recover the support of the sparse codes. This property holds independent of the loss function. We believe that it could be of independent interest.
One of the fundamental themes in learning theory is to consider data being sampled from a generative model and to provide efficient methods to recover the original model parameters exactly or with tight approximation guarantees. Classic examples include learning a mixture of Gaussians, certain graphical models, full rank square dictionaries [35, 13] and overcomplete dictionaries [2, 7, 8, 9]. The problem is usually distilled down to a non-convex optimization problem whose solution can be used to obtain the model parameters. For these hard non-convex problems it has been difficult to find any universal view as to why gradient descent sometimes gives very good, and sometimes even exact, recovery. In recent times progress has been made towards achieving a geometric understanding of the landscape of such non-convex optimization problems. The corresponding question of parameter recovery for neural nets with one layer of activation has been solved in some special cases [17, 4, 21, 34, 24, 36, 43]. Almost all of these cases are in the supervised setting, where it has also been assumed that the labels are generated from a net of the same architecture as the one being trained. In contrast to these works we address an unsupervised learning problem and, possibly more realistically, we do not tie the data generation model (sensing of sparse vectors by an overcomplete incoherent dictionary) to the neural architecture being analyzed, except for assuming knowledge of a few parameters of the ground truth. In a related development, two of the authors have shown in a previous work that for two layer deep nets even the exact global minima can be found deterministically in time polynomial in the data size. This work continues that line of investigation and uses generative model assumptions to probe the power of a class of two layer deep nets with ReLU activation.
An autoencoder is a neural network that maps R^n → R^n with a single hidden layer of Rectified Linear Unit (ReLU) activations. These networks have been used extensively [11, 12, 33, 40, 41] in the past for unsupervised feature learning tasks, and have been found to be successful in generating discriminative features. A number of different autoencoder architectures and regularizers have been proposed which purportedly induce sparsity at the hidden layer [10, 16, 23, 29]. There has also been some investigation into what autoencoders learn about the data distribution.
We used TensorFlow to train two ReLU autoencoders mapping . These networks were trained on a subset of the MNIST dataset of handwritten digits. One of the nets had a single hidden layer of size and the other one had two hidden layers of size and (and a fixed identity matrix giving the output from the second layer of activations). In both cases the weights of the encoder and decoder were maintained as transposes of each other. We trained the autoencoders on the standard squared loss function using RMSProp. The training was done on images of the digits and from the MNIST dataset. In the following panel we show four pairs (two for each net) of "reconstructed" images, i.e., the output of the trained net when it is given the "actual" image as input.
In our opinion, the above figures add support to the belief that single and double layer ReLU activated networks can learn an implicit high dimensional structure of the handwritten digits dataset. In particular, this demonstrates that although adding more hidden layers helps enhance the reconstruction ability, single hidden layer autoencoders do hold within them significant power for unsupervised learning of representations. Unfortunately, analyzing the RMSProp update rule used in the above experiment is currently beyond our analytic means. However, we take inspiration from these experiments to devise a different mathematical set-up which is much more amenable to analysis, taking us towards a better understanding of the power of autoencoders.
For any , an autoencoder is a fully connected neural network with a single hidden layer of activations. We focus on networks that use the Rectified Linear Unit (ReLU) activation, which is the function mapping . In this case, the autoencoder can be seen as computing the following function:
Here is the input to the autoencoder, is the linear transformation implemented by the first layer, is the output of the layer of activations, is the bias vector and is the output of the autoencoder. Note that we impose the condition that the second layer of weights is simply the transpose of the first layer. We shall define the columns of (rows of ) as .
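The displayed definition did not survive extraction, so the following is only a minimal sketch of the weight-tied ReLU autoencoder described above, under our own reading. The names `W`, `eps`, `r` and `y_hat` are assumptions, and the sign convention (bias subtracted before the ReLU) is inferred from the later analysis, where gate i fires when its input Q_i exceeds ϵ.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: max(0, z)."""
    return np.maximum(z, 0.0)

def autoencoder_forward(W, eps, y):
    """Weight-tied ReLU autoencoder (assumed form).

    W   : (h, n) first-layer weights; the second layer is W.T (tied weights).
    eps : scalar or (h,) bias, subtracted before the ReLU (assumed sign convention).
    y   : (n,) input vector.
    Returns (r, y_hat): hidden activations (h,) and reconstruction (n,).
    """
    r = relu(W @ y - eps)
    y_hat = W.T @ r
    return r, y_hat

# Tiny hand-checkable example with n = 2 and h = 3.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([2.0, -1.0])
r, y_hat = autoencoder_forward(W, 0.5, y)
print(r)      # hidden activations
print(y_hat)  # reconstruction
```

Here the pre-activations are (1.5, -1.5, 0.5), so only the first and third gates fire.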
We assume that our signal is generated using sparse linear combinations of atoms/vectors of an overcomplete dictionary, i.e., , where is a dictionary, and is a non-negative sparse vector, with at most (for some ) non zero elements. The columns of the original dictionary (labeled as ) are assumed to be normalized and also satisfy the incoherence property that for some .
We assume that the sparse code is sampled from a distribution with the following properties. We fix a set of possible supports of , denoted by , where each element of has at most elements. We consider any arbitrary discrete probability distribution on such that the probability is independent of , and the probability is independent of . A special case is when is the set of all subsets of size , and is the uniform distribution on . For every there is a distribution, say on , which is supported on vectors whose support is contained in and which is uncorrelated for pairs of coordinates . Further, we assume that the distributions are such that each coordinate is compactly supported over an interval , where and are independent of both and but will be functions of . Moreover, , and are assumed to be independent of both and but allowed to depend on . For ease of notation, henceforth we will keep the dependence of these variables implicit and refer to them as and . All of our results will hold in the special case when are constants (no dependence on ).
First we prove the following theorem which precisely quantifies the sense in which a layer of ReLU gates is able to recover the support of the sparse code when the weight matrix of the deep net is close to the original dictionary. We recall that the size of the support of the sparse vector is for some . We also recall the parameters as defining the support of the marginal distribution of each coordinate of and is the expected value of this marginal distribution (recall that none of these depend on the coordinate or the actual support). These parameters will be referenced in the results below.
Let each column of be within a -ball of the corresponding column of , where for some , such that (where is the coherence parameter). We further assume that . Let the bias of the hidden layer of the autoencoder, as defined in (2) be . Let be the vector defined in (2). Then if , and if with probability at least (with respect to the distribution on ).
As long as is large, i.e., an increasing function of , we can interpret this as saying that the probability of the adverse event is small, and we have successfully achieved support recovery at the hidden layer in the limit of large sparse code dimension.
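As a rough numerical illustration of support recovery at the hidden layer, the sketch below checks whether the set of firing ReLU gates matches the support of the sparse code. Everything specific here is an assumption made for illustration: the sizes, the code distribution, and in particular the threshold `eps`, which is picked by hand and is not the bias prescribed by the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, k, N = 1000, 1200, 2, 200      # hypothetical sizes, chosen so coherence is small
delta, eps = 0.02, 0.6               # hand-picked; NOT the bias value from the theorem

# Incoherent dictionary with unit-norm columns.
A_star = rng.standard_normal((n, h))
A_star /= np.linalg.norm(A_star, axis=0, keepdims=True)

# Rows of W are the columns of A_star moved by exactly delta in a random direction.
D = rng.standard_normal((h, n))
W = A_star.T + delta * D / np.linalg.norm(D, axis=1, keepdims=True)

# Non-negative k-sparse codes with non-zero entries in [1.0, 1.2].
X = np.zeros((h, N))
for c in range(N):
    support = rng.choice(h, size=k, replace=False)
    X[support, c] = rng.uniform(1.0, 1.2, size=k)
Y = A_star @ X

R = np.maximum(W @ Y - eps, 0.0)     # hidden-layer output
recovered = R > 0                    # which ReLU gates fired
true_support = X > 0
match_fraction = np.mean(recovered == true_support)
print(match_fraction)
```

In this regime the on-support pre-activations stay well above the threshold and the off-support ones stay well below it, so the firing pattern coincides with the support.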
In this work we analyze the following standard squared loss function for the autoencoder,
In the above we continue to use the variables as defined in equation 2. If we consider a generative model in which is a square, orthogonal matrix and is a non-negative vector (not necessarily sparse), it is easily seen that the standard squared reconstruction error loss function for the autoencoder has a global minimum at . In our generative model, however, is an incoherent and overcomplete dictionary.
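The square orthogonal special case can be checked numerically. The sketch below assumes the loss has the form ½‖W^T ReLU(Wy − ϵ) − y‖² (the ½ factor is our assumption) and evaluates it at the natural candidate W = A^{*T} with zero bias, which is our reading of the minimizer whose formula was lost in extraction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Random orthogonal matrix via QR (hypothetical stand-in for a square, orthogonal A^*).
A_star, _ = np.linalg.qr(rng.standard_normal((n, n)))
x_star = rng.uniform(0.0, 1.0, size=n)   # non-negative code, not necessarily sparse
y = A_star @ x_star

def squared_loss(W, eps, y):
    """Squared reconstruction error of the weight-tied ReLU autoencoder."""
    r = np.maximum(W @ y - eps, 0.0)
    return 0.5 * np.sum((W.T @ r - y) ** 2)  # 1/2 factor is an assumed normalization

# At W = A_star^T and zero bias: W y = x_star >= 0, so the ReLU is the identity
# and the reconstruction A_star x_star equals y.
loss_at_truth = squared_loss(A_star.T, 0.0, y)
loss_at_zero = squared_loss(np.zeros((n, n)), 0.0, y)
print(loss_at_truth, loss_at_zero)
```

Since the loss is non-negative and vanishes (up to floating point error) at this point, it is a global minimum in this special case.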
(The Main Theorem) Assume that the hypotheses of Theorem 3.1 hold, and (and hence ). Further, assume the distribution parameters satisfy is superpolynomial in (which holds, for example, when are ). Then for ,
We present the proof of the support recovery result, i.e., Theorem 3.1, in Section 4. Section 5 gives the proof of our main result, Theorem 3.2. The argument rests on two critical lemmas (Lemmas 5.1 and 5.2), whose proofs appear in the Supplementary material. In Section 6, we run simulations to verify Theorem 3.2. We also run experiments that strongly suggest that the standard squared loss function has a local minimum in a neighborhood around .
Most sparse coding algorithms are based on an alternating minimization approach, where one iteratively finds a sparse code based on the current estimate of the dictionary, and then uses the estimated sparse code to update the dictionary. The analogue of the sparse coding step in an autoencoder is passing a certain affine transformation of the input vectors (which behaves as the current estimate of the dictionary) through the hidden layer of activations. We show that under certain stochastic assumptions, the hidden layer of ReLU gates in an autoencoder recovers with high probability the support of the sparse vector which corresponds to the present input.
From the model assumptions, we know that the dictionary is incoherent, and has unit norm columns. So, for all , and for all . This means that for ,
Otherwise for ,
where we use the fact that .
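Since the displayed bounds were lost in extraction, here is a hedged reconstruction of the kind of estimate this step needs. It assumes (our notation) unit-norm columns, incoherence in the form |⟨A^*_i, A^*_j⟩| ≤ μ/√n for i ≠ j, and ‖W_i − A^*_i‖ ≤ δ; both bounds then follow from the triangle and Cauchy-Schwarz inequalities.

```latex
% Assumptions (notation ours): \|A^*_i\| = 1,
% |\langle A^*_i, A^*_j \rangle| \le \mu/\sqrt{n} for i \ne j,
% and \|W_i - A^*_i\| \le \delta for every i.
\begin{align*}
\langle W_i, A^*_i \rangle
  &= \langle A^*_i, A^*_i \rangle + \langle W_i - A^*_i, A^*_i \rangle
  \in [\,1 - \delta,\; 1 + \delta\,], \\
|\langle W_i, A^*_j \rangle|
  &\le |\langle A^*_i, A^*_j \rangle| + |\langle W_i - A^*_i, A^*_j \rangle|
  \le \frac{\mu}{\sqrt{n}} + \delta \qquad (i \ne j).
\end{align*}
```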
Let and let be the support of . Then we define the input to the ReLU activation as
Q_i = ∑_{j ∈ S} ⟨W_i, A^*_j⟩ x^*_j = ⟨W_i, A^*_i⟩ x^*_i 1_{i ∈ S} + ∑_{j ∈ S∖{i}} ⟨W_i, A^*_j⟩ x^*_j = ⟨W_i, A^*_i⟩ x^*_i 1_{i ∈ S} + Z_i.
First we try to get bounds on when . From our assumptions on the distribution of we have, and for all in the support of . For ,
Plugging in the lower bound for and the proposed value for the bias, we get
For , we need:
Now plugging in the values for the various quantities, and and , if we have , then .
Now, for we would like to analyze the following probability:
We first simplify the quantity as follows
Pr[Q_i ≥ ϵ | i ∉ supp(x^*)] = Pr[Z_i ≥ ϵ]
= Pr[∑_{j ∈ S∖{i}} ⟨W_i, A^*_j⟩ x^*_j ≥ ϵ]
Using Chernoff's bound, we can obtain
where the second inequality follows from (4) and the fact that and are both nonnegative, and the third inequality follows from Hoeffding’s Lemma. Next, we also have
Finally, since and , we have
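For reference, the two standard tools invoked in this argument can be stated as follows. This is a generic sketch, not the paper's exact chain of inequalities (which was lost in extraction); the symbols t, c_j, a, b and m are ours, and the last line additionally assumes that the coordinates x_j are independent.

```latex
% Chernoff (exponential Markov) bound, valid for every t > 0:
\[
\Pr[Z \ge \epsilon] \;\le\; e^{-t\epsilon}\, \mathbb{E}\!\left[e^{tZ}\right].
\]
% Hoeffding's lemma: if a \le X \le b almost surely, then for every real t,
\[
\mathbb{E}\!\left[e^{t(X - \mathbb{E}X)}\right] \;\le\; \exp\!\left(\frac{t^2 (b-a)^2}{8}\right).
\]
% Combined, for Z = \sum_j c_j x_j with independent x_j \in [a, b], c_j \ge 0 and \mathbb{E}x_j = m:
\[
\Pr[Z \ge \epsilon] \;\le\; \exp\!\Big(-t\epsilon + t\,m \sum_j c_j + \frac{t^2 (b-a)^2}{8} \sum_j c_j^2\Big).
\]
```

Optimizing over t > 0 then gives an exponentially small tail whenever ϵ exceeds the mean of Z.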
It turns out that the expectation of the full gradient of the loss function (2) is difficult to analyze directly. Hence, corresponding to the true gradient with respect to the column of , we create a proxy, denoted by , by replacing in the expression for the true expectation every occurrence of the random variable by the indicator random variable . This proxy is shown to be a good approximant of the expected gradient in the following lemma.
Assume that the hypotheses of Theorem 3.1 hold and additionally let be bounded by a polynomial in . Then we have for each (indexing the columns of ),
This lemma has been proven in Section A of the Appendix. ∎
Assume that the hypotheses of Theorem 3.1 hold, and (and hence ). Then for each indexing the columns of , there exist real valued functions and , and a vector such that , and
This lemma has been proven in Section B of the Appendix. ∎
With the above asymptotic results, we are in a position to assemble the proof of Theorem 3.2.
Consider any indexing the columns of . Recall the definition of the proxy gradient at the beginning of this section. Let us define . Using and as defined in Lemma 5.2, we can write the expectation of the true gradient as, . Further, by Lemma 5.1,
Since is superpolynomial in , we obtain
We conduct some experiments on synthetic data in order to check whether the gradient norm is indeed small within the columnwise -ball of . We also make some observations about the landscape of the squared loss function, which has implications for being able to recover the ground-truth dictionary .
We generate random dictionaries () of size where , and and . The dictionary entries are drawn from a standard Gaussian, and the columns of the dictionary are then normalized. These dictionaries are incoherent, with high probability. For each , we generate a dataset containing sparse vectors with non-zero entries, where . In our experiments, the coherence parameter was approximately . We conduct experiments for values of that are at most . Here is the hidden layer dimension of the autoencoder and controls the sparsity of the data used to train the autoencoder. The support of each sparse vector is drawn uniformly from all sets of indices of size , and the non-zero entries in the sparse vectors are drawn from a uniform distribution between and . Once we have generated the sparse vectors, we collect them in a matrix and then compute the signals .
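The data generation procedure described above can be sketched as follows. The concrete values of `n`, `h`, `p`, the number of samples and the interval endpoints are hypothetical placeholders (the values used in the experiments were lost in extraction), and the coherence parameter is computed under the assumed convention μ = √n · max_{i≠j} |⟨A_i, A_j⟩|.

```python
import numpy as np

def make_dictionary(n, h, rng):
    """Gaussian dictionary with unit-norm columns."""
    A = rng.standard_normal((n, h))
    return A / np.linalg.norm(A, axis=0, keepdims=True)

def coherence_parameter(A):
    """mu such that max_{i != j} |<A_i, A_j>| = mu / sqrt(n) (assumed convention)."""
    n = A.shape[0]
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return np.sqrt(n) * G.max()

def sample_sparse_codes(h, k, num, low, high, rng):
    """Columns are k-sparse codes: uniform random support, uniform entries in [low, high)."""
    X = np.zeros((h, num))
    for c in range(num):
        support = rng.choice(h, size=k, replace=False)
        X[support, c] = rng.uniform(low, high, size=k)
    return X

rng = np.random.default_rng(0)
n, h, p = 50, 256, 0.3                 # hypothetical values
k = max(1, int(h ** p))                # at most h^p non-zero entries
A_star = make_dictionary(n, h, rng)
X_star = sample_sparse_codes(h, k, num=1000, low=1.0, high=2.0, rng=rng)
Y = A_star @ X_star                    # signals, one per column
mu = coherence_parameter(A_star)
print(k, Y.shape, mu)
```

Drawing the support with `replace=False` guarantees exactly `k` distinct indices per code.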
We set up the autoencoder as defined through equation 2. The bias parameter in the hidden layer is set to . Choosing this prefactor of does not violate Theorem 3.1 and it was chosen to have the ReLU layer of the autoencoder recover a large fraction of the support of . We analyze the squared loss function in (2) and its gradient with respect to a column of through their empirical averages over the signals in .
| 256 | (0.0137, 0.0041) | (0.0138, 0.0044) | (0.0126, 0.0052) | (0.0095, 0.0068) |
| 512 | (0.0058, 0.0021) | (0.0058, 0.0022) | (0.0054, 0.0027) | (0.0071, 0.0036) |
| 1024 | (0.0025, 0.0010) | (0.0024, 0.0011) | (0.0026, 0.0014) | (0.0079, 0.0020) |
| 2048 | (0.0011, 0.0005) | (0.0012, 0.0006) | (0.0025, 0.0007) | (0.0031, 0.0010) |
| 4096 | (0.0006, 0.0003) | (0.0012, 0.0003) | (0.0013, 0.0004) | (0.0026, 0.0006) |
Once we have generated the data, we compute the empirical average of the gradient of the loss function in (2) at random points which are columnwise away from . We average the gradient over the points which are all at the same distance from , and compare the average column norm of the gradient to . Our experiments show that the average column norm of the gradient is of the order of (and thus falling with for any fixed ) as expected from Theorem 3.2. Results for points sampled at are shown in Table 1.
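A minimal sketch of this measurement, assuming the loss is the empirical average of ½‖W^T ReLU(Wy − ϵ) − y‖² and using a gradient we derived by hand for that form. The sizes, `delta` and the bias `eps` are hypothetical, and this is not the authors' code.

```python
import numpy as np

def loss_and_grad(W, eps, Y):
    """Empirical squared loss (assumed 1/2 factor) and its gradient with respect to W.

    W : (h, n) weights, eps : scalar bias, Y : (n, N) signals as columns.
    Row i of the returned gradient is the gradient with respect to W_i.
    """
    Z = W @ Y - eps                 # (h, N) pre-activations
    R = np.maximum(Z, 0.0)          # hidden activations
    E = W.T @ R - Y                 # (n, N) reconstruction residuals
    N = Y.shape[1]
    loss = 0.5 * np.sum(E ** 2) / N
    mask = (Z > 0).astype(Y.dtype)  # ReLU derivative, taken as 0 at the kink
    grad = (R @ E.T + (mask * (W @ E)) @ Y.T) / N
    return loss, grad

def perturb_columnwise(A_star, delta, rng):
    """Return W (h, n) whose i-th row is exactly delta away from the i-th column of A_star."""
    D = rng.standard_normal(A_star.T.shape)
    D *= delta / np.linalg.norm(D, axis=1, keepdims=True)
    return A_star.T + D

rng = np.random.default_rng(0)
n, h, k, N = 20, 64, 3, 500         # hypothetical sizes
A_star = rng.standard_normal((n, h))
A_star /= np.linalg.norm(A_star, axis=0, keepdims=True)
X = np.zeros((h, N))
for c in range(N):
    X[rng.choice(h, size=k, replace=False), c] = rng.uniform(1.0, 2.0, size=k)
Y = A_star @ X

W = perturb_columnwise(A_star, delta=0.05, rng=rng)
loss, grad = loss_and_grad(W, eps=0.5, Y=Y)
avg_col_grad_norm = np.linalg.norm(grad, axis=1).mean()
print(loss, avg_col_grad_norm)
```

Repeating the last three lines for several random perturbations at the same `delta` and averaging `avg_col_grad_norm` would mimic the averaging over points described above.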
We also plot the squared loss of the autoencoder along a randomly chosen direction to see if is possibly a local minimum. More precisely, we draw a matrix from a standard normal distribution, and normalize its columns. We then plot , as well as the gradient norm averaged over all the columns. For purposes of illustration, we show these plots for , in figures 1 and 2, and those for , in figures 3 and 4.
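The one-dimensional slice of the loss can be sketched as below. We assume the plotted function is t ↦ L(A^* + tΔ) for a column-normalized Gaussian direction Δ; the formula itself was lost in extraction, and the sizes, bias and range of t are hypothetical.

```python
import numpy as np

def empirical_loss(W, eps, Y):
    """Average of 0.5 * ||W^T ReLU(W y - eps) - y||^2 over the columns of Y (assumed form)."""
    R = np.maximum(W @ Y - eps, 0.0)
    E = W.T @ R - Y
    return 0.5 * np.sum(E ** 2) / Y.shape[1]

rng = np.random.default_rng(1)
n, h, k, N = 20, 64, 3, 300          # hypothetical sizes
A_star = rng.standard_normal((n, h))
A_star /= np.linalg.norm(A_star, axis=0, keepdims=True)
X = np.zeros((h, N))
for c in range(N):
    X[rng.choice(h, size=k, replace=False), c] = rng.uniform(1.0, 2.0, size=k)
Y = A_star @ X

# Random direction with unit-norm columns.
Delta = rng.standard_normal((n, h))
Delta /= np.linalg.norm(Delta, axis=0, keepdims=True)

# Loss along the line A_star + t * Delta; these values are what one would plot against t.
ts = np.linspace(-0.5, 0.5, 21)
losses = np.array([empirical_loss((A_star + t * Delta).T, 0.5, Y) for t in ts])
print(ts[np.argmin(losses)])
```

A minimizer of `losses` near t = 0 across several random directions would be consistent with a local minimum close to the dictionary, though a single slice cannot establish it.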
In this paper we have undertaken a rigorous analysis of the squared loss function of an autoencoder when the data is assumed to be generated by sensing of sparse high dimensional vectors by an overcomplete dictionary. We have shown that the expected gradient of this loss function is very close to zero in a neighborhood of the generating overcomplete dictionary.
Our simulations complement this theoretical result by providing further empirical support. Firstly, they show that the gradient norm in this ball of indeed falls with and is of the same order as , as expected from our proof. Secondly, the experiments also strongly suggest ranges of values of and where is a local minimum of this loss function and that it has a neighborhood where the reconstruction error is low.
This suggests sparse coding problems can be solved by training autoencoders using gradient descent based algorithms. Further, recent investigations have led to the conjecture/belief that many important unsupervised learning tasks, e.g. recognizing handwritten digits, are sparse coding problems in disguise [25, 26]. Thus, our results could shed some light on the observed phenomenon that gradient descent based algorithms train autoencoders to low reconstruction error for natural data sets, like MNIST.
It remains to rigorously show whether a gradient descent algorithm can be initialized randomly (possibly far away from ) and still be shown to converge to this neighborhood of critical points around the dictionary. Towards that, it might be helpful to understand the structure of the Hessian outside this neighborhood. Since our analysis applies to the expected gradient, it also remains to analyze the sample complexities at which these results become prominent.
The possibility also remains open that this standard loss, or some other loss function for the autoencoder, has the provable property of having a global minimum at the ground truth dictionary. We have mentioned one such example in a special case (when is square orthogonal and is nonnegative), and even in this special case it remains open to find a provable optimization algorithm.
On the simulation front we have a couple of open challenges yet to be tackled. Firstly, it is left to find efficient implementations of the iterative update rule based on the exact gradient of the proposed loss function which has been given in (2). This would open up avenues for testing the power of this loss function on real data rather than the synthetic data used here. Secondly, a simulation of the main Theorem 3.2 that can probe deeper into its claim would need to be able to sample for different at a fixed value of the incoherence parameter . This sampling question of with these constraints is an unresolved one that is left for future work.
Autoencoders with more than one hidden layer have been used for unsupervised feature learning, and recently there has been an analysis of the sparse coding performance of convolutional neural networks with one layer and two layers of nonlinearities. The connections between neural networks and sparse coding have also been recently explored. It remains an exciting open avenue of research to carry out a study similar to this work to determine if and how deeper architectures under the same generative model might provide better means of doing sparse coding.
Akshay Rangamani and Peter Chin are supported by the AFOSR grant FA9550-12-1-0136. Amitabh Basu and Anirbit Mukherjee gratefully acknowledge support from the NSF grant CMMI1452820. We would like to thank Raman Arora (JHU) and Siva Theja Maguluri (Georgia Institute of Technology) for illuminating comments and discussion.
Journal of Machine Learning Research, 15(1):3563–3593, 2014.
Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 37–49, 2012.
Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011.
Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840, 2011.
Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.
To make it easy to present this argument let us abstractly think of the function (defined for any ) as where we have defined the random variable . It is to be noted that because of the ReLU term and its derivative this function has a dependency on even outside its dependency through . Let us define another random variable . Then we have,
In the last step above we have used the Cauchy-Schwarz inequality for random variables. We recognize that is precisely what we defined as the proxy gradient . Further for such as in this lemma the support recovery theorem (Theorem 3.1) holds and that is precisely the statement that the term, is small. So we can rewrite the above inequality as,
Recall that is a polynomial in because its dependency is through Frobenius norms of submatrices of and norms of projections of . But the norm of the training vectors (that is ) has been assumed to be bounded by . Also, we have the assumption that the columns of are within a ball of the corresponding columns of , which in turn is a dimensional matrix of bounded norm because all its columns are normalized. So summarizing we have,
The above inequality immediately implies the claimed lemma. ∎
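The Cauchy-Schwarz step used in this proof can be written out generically as follows. The symbols F, F' and B are our own shorthand (the original displays were lost): F is the gradient expression evaluated with the ReLU firing indicator, F' is the same expression with the support indicator substituted, and B is the event that the two indicators differ.

```latex
% F and F' coincide outside B, so F - F' = (F - F') \mathbf{1}_B.
\begin{align*}
\big\| \mathbb{E}[F] - \mathbb{E}[F'] \big\|
  &= \big\| \mathbb{E}\big[(F - F')\,\mathbf{1}_B\big] \big\|
   \;\le\; \mathbb{E}\big[\|F - F'\|\,\mathbf{1}_B\big] \\
  &\le \sqrt{\mathbb{E}\big[\|F - F'\|^2\big]}\;\sqrt{\mathbb{E}\big[\mathbf{1}_B^2\big]}
   \;=\; \sqrt{\mathbb{E}\big[\|F - F'\|^2\big]}\;\sqrt{\Pr[B]}.
\end{align*}
```

The first factor is polynomially bounded under the norm assumptions, while the support recovery theorem makes Pr[B] small, which is how the two ingredients combine.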
To recap we imagine being given as input signals (imagined as column vectors), which are generated from an overcomplete dictionary of a fixed incoherence. Let (imagined as column vectors) be the sparse code that generates . The model of the autoencoder that we now have is . is a matrix and the column of is to be denoted as the column vector .
Using the above notation the squared loss of the autoencoder is . But we introduce a dummy constant to be multiplied to because this helps read the complicated equations that would now follow. This marker helps easily spot those terms which depend on the sensing of (those with a factor of ) as opposed to the terms which are “purely” dependent on the neural net (those without the factor of ). Thus we think of the squared loss of our autoencoder as,
where we have defined as,
Then we have,
In the form of a derivative matrix this means,
This helps us write,
Now going over to the proxy gradient corresponding to this term we define the vector as,
Thus we have,
Now we invoke the distributional assumption about i.i.d sampling of the coordinates for a fixed support and the definition of and to write, for all and for , . Thus we get,