
# Sparse Coding and Autoencoders

In "Dictionary Learning" one tries to recover incoherent matrices $A^* \in \mathbb{R}^{n \times h}$ (typically overcomplete and whose columns are assumed to be normalized) and sparse vectors $x^* \in \mathbb{R}^h$ with a small support of size $h^p$ for some $0 < p < 1$ while having access to observations $y \in \mathbb{R}^n$ where $y = A^* x^*$. In this work we undertake a rigorous analysis of whether gradient descent on the squared loss of an autoencoder can solve the dictionary learning problem. The "Autoencoder" architecture we consider is a $\mathbb{R}^n \to \mathbb{R}^n$ mapping with a single ReLU activation layer of size $h$. Under very mild distributional assumptions on $x^*$, we prove that the norm of the expected gradient of the standard squared loss function is asymptotically (in sparse code dimension) negligible for all points in a small neighborhood of $A^*$. This is supported with experimental evidence using synthetic data. We also conduct experiments to suggest that $A^*$ is a local minimum. Along the way we prove that a layer of ReLU gates can be set up to automatically recover the support of the sparse codes. This property holds independent of the loss function. We believe that it could be of independent interest.

## Code Repositories

### autoencoders_dictionaries

Simulation Experiments for Provable Dictionary Learning using ReLU Autoencoders

## 1 Introduction

One of the fundamental themes in learning theory is to consider data being sampled from a generative model and to provide efficient methods to recover the original model parameters exactly or with tight approximation guarantees. Classic examples include learning a mixture of Gaussians [28], certain graphical models [5], full rank square dictionaries [35, 13] and overcomplete dictionaries [2, 7, 8, 9]. The problem is usually distilled down to a non-convex optimization problem whose solution can be used to obtain the model parameters. With these hard non-convex problems it has been difficult to find any universal view as to why gradient descent sometimes gives very good and sometimes even exact recovery. In recent times progress has been made towards achieving a geometric understanding of the landscape of such non-convex optimization problems [18], [27], [42]. The corresponding question of parameter recovery for neural nets with one layer of activation has been solved in some special cases [17, 4, 21, 34, 24, 36, 43]. Almost all of these cases are in the supervised setting where it has also been assumed that the labels are being generated from a net of the same architecture as is being trained. In contrast to these works we address an unsupervised learning problem, and possibly more realistically, we do not tie the data generation model (sensing of sparse vectors by an overcomplete incoherent dictionary) to the neural architecture being analyzed except for assuming knowledge of a few parameters about the ground truth. In a related development, it has been shown by two of the authors here in a previous work [6] that for two layer deep nets even the exact global minima can be found deterministically in time polynomial in the data size. This work continues that line of investigation to now make use of generative model assumptions to probe the power of a class of two layer deep nets with ReLU activation.

Here we specialize to the generative model of dictionary learning/sparse coding where one receives samples of vectors $y \in \mathbb{R}^n$ that have been generated as $y = A^* x^*$ where $A^* \in \mathbb{R}^{n \times h}$ and $x^* \in \mathbb{R}^h$. We typically assume that the number of non-zero entries in $x^*$ is no larger than some function of the dimension $h$ and that $A^*$ satisfies certain incoherence properties. The question now is to recover $A^*$ from samples of $y$. There have been renewed investigations into the hardness of this problem [38] and many former results have recently been reviewed in these lectures [19]. This question has been a cornerstone of learning theory ever since the ground-breaking paper by Olshausen and Field [31] (a recent review by the same authors can be found in [32]). Over the years many algorithms have been developed to solve this problem and a detailed comparison among these various approaches can be found in [13].

An autoencoder is a neural network that maps $\mathbb{R}^n \to \mathbb{R}^n$ with a single hidden layer of Rectified Linear Unit (ReLU) activations. These networks have been used extensively ([11, 12, 33, 40, 41]) in the past for unsupervised feature learning tasks, and have been found to be successful in generating discriminative features [15]. A number of different autoencoder architectures and regularizers have been proposed which purportedly induce sparsity at the hidden layer [10, 16, 23, 29]. There has also been some investigation into what autoencoders learn about the data distribution [3].

Olshausen and Field had already made the connection between sparse coding and training neural architectures, and in today's terminology this problem is very naturally reminiscent of the architecture of an autoencoder [30]. However, to the best of our knowledge, there has not been sufficient progress to rigorously establish whether autoencoders can do sparse coding. In this work, we present our progress towards bridging this mathematical gap. To the best of our knowledge, there is no theoretical evidence (even under the usual generative assumptions of sparse coding) that the stationary points of any of the usual squared loss functions (with or without any of the usual regularizers) have any resemblance to the original dictionary that is being sought to be learned. The main point of this paper is to rigorously prove that for autoencoders with ReLU activation, the standard squared loss function has a neighborhood around the dictionary $A^*$ where the norm of the expected gradient is very small (for large enough sparse code dimension $h$). Thus, all points in a neighborhood of $A^*$, including $A^*$ itself, are asymptotic critical points of this standard squared loss. We supplement our theoretical result with experimental evidence for it in Section 6, which also strongly suggests that the standard squared loss function has a local minimum in a neighborhood around $A^*$. We believe that our results provide theoretical and experimental evidence that the sparse coding problem can be tackled by training autoencoders.

### 1.1 A motivating experiment on MNIST using TensorFlow

We used TensorFlow [1] to train two ReLU autoencoders on a subset of the MNIST dataset of handwritten digits. One of the nets had a single hidden layer and the other had two hidden layers (with a fixed identity matrix giving the output from the second layer of activations). In both cases the weights of the encoder and decoder were maintained as transposes of each other. We trained the autoencoders on the standard squared loss function using RMSProp [37]. The training was done on images of two digit classes from the MNIST dataset. In the following panel we show four pairs (two for each net) of "reconstructed" images, i.e., the output of the trained net when it is given the "actual" photograph as input.

In our opinion, the above figures add support to the belief that a single and a double layer ReLU activated network can learn an implicit high dimensional structure about the handwritten digits dataset. In particular this demonstrates that though adding more hidden layers obviously helps enhance the reconstruction ability, single hidden layer autoencoders do hold within them significant power for unsupervised learning of representations. Unfortunately, analyzing the RMSProp update rule used in the above experiment is currently beyond our analytic means. However, we take inspiration from these experiments to devise a different mathematical set-up which is much more amenable to analysis, taking us towards a better understanding of the power of autoencoders.

## 2 Introducing the neural architecture and the distributional assumptions

An autoencoder is a fully connected neural network with a single hidden layer of $h$ activations. We focus on networks that use the Rectified Linear Unit (ReLU) activation, which is the function mapping $x \mapsto \max\{0, x\}$. In this case, the autoencoder computes the following function,

$$r = \mathrm{ReLU}(Wy - \epsilon), \qquad \hat{y} = W^\top r \tag{1}$$

Here $y \in \mathbb{R}^n$ is the input to the autoencoder, $W \in \mathbb{R}^{h \times n}$ is the linear transformation implemented by the first layer, $r \in \mathbb{R}^h$ is the output of the layer of activations, $\epsilon \in \mathbb{R}^h$ is the bias vector and $\hat{y} \in \mathbb{R}^n$ is the output of the autoencoder. Note that we impose the condition that the second layer of weights is simply the transpose of the first layer. We shall define the columns of $W^\top$ (rows of $W$) as $\{W_i\}_{i=1}^h$.
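The map in equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `relu` and `autoencoder_forward`, the dimensions and the bias value are assumptions chosen only for the example.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU, x -> max(0, x)."""
    return np.maximum(z, 0.0)

def autoencoder_forward(W, eps, y):
    """Tied-weight ReLU autoencoder of equation (1).

    W   : (h, n) first-layer weights (rows are the vectors W_i)
    eps : scalar or (h,) bias
    y   : (n,) input
    Returns the hidden code r and the reconstruction y_hat = W^T r.
    """
    r = relu(W @ y - eps)
    y_hat = W.T @ r
    return r, y_hat

# Hypothetical sizes and bias, for illustration only.
rng = np.random.default_rng(0)
n, h = 5, 8
W = rng.standard_normal((h, n))
y = rng.standard_normal(n)
r, y_hat = autoencoder_forward(W, 0.1, y)
```

The decoder reuses `W.T`, which is the tied-weight condition stated above.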

#### Assumptions on the dictionary and the sparse code.

We assume that our signal $y$ is generated using sparse linear combinations of atoms/vectors of an overcomplete dictionary, i.e., $y = A^* x^*$, where $A^* \in \mathbb{R}^{n \times h}$ is a dictionary, and $x^* \in \mathbb{R}^h$ is a non-negative sparse vector, with at most $k = h^p$ (for some $0 < p < 1$) non-zero elements. The columns of the original dictionary (labeled as $\{A_i^*\}_{i=1}^h$) are assumed to be normalized and also satisfy the incoherence property that $|\langle A_i^*, A_j^* \rangle| \le \frac{\mu}{\sqrt{n}}$ for all $i \neq j$.

We assume that the sparse code $x^*$ is sampled from a distribution with the following properties. We fix a set of possible supports of $x^*$, denoted by $\mathcal{S}$, where each element of $\mathcal{S}$ has at most $k$ elements. We consider any arbitrary discrete probability distribution on $\mathcal{S}$ such that the probability that a given index belongs to the sampled support is independent of the index, and the probability that a given pair of indices both belong to the sampled support is independent of the pair. A special case is when $\mathcal{S}$ is the set of all subsets of size $k$, and the distribution is the uniform distribution on $\mathcal{S}$. For every $S \in \mathcal{S}$ there is a distribution on $\mathbb{R}^h$ which is supported on vectors whose support is contained in $S$ and which is uncorrelated for pairs of coordinates $i \neq j \in S$. Further, we assume that these distributions are such that each coordinate $i \in S$ is compactly supported over an interval $[a, b]$, where $a$ and $b$ are independent of both $i$ and $S$ but will be functions of $h$. Moreover, $m_1 := \mathbb{E}[x_i^*]$ and $m_2 := \mathbb{E}[x_i^{*2}]$ are assumed to be independent of both $i$ and $S$ but allowed to depend on $h$. For ease of notation henceforth we will keep the dependence of these variables on $h$ implicit and refer to them as $a$, $b$, $m_1$ and $m_2$. All of our results will hold in the special case when $a, b, m_1, m_2$ are constants (no dependence on $h$).

## 3 Main Results

### 3.1 Recovery of the support of the sparse code by a layer of ReLUs

First we prove the following theorem which precisely quantifies the sense in which a layer of ReLU gates is able to recover the support of the sparse code when the weight matrix of the deep net is close to the original dictionary. We recall that the size of the support of the sparse vector $x^*$ is $k = h^p$ for some $0 < p < 1$. We also recall the parameters $a, b$ as defining the support of the marginal distribution of each coordinate of $x^*$, and $m_1$ is the expected value of this marginal distribution (recall that none of these depend on the coordinate or the actual support). These parameters will be referenced in the results below.

###### Theorem 3.1.

Let each column of $W^\top$ be within a $\delta$-ball of the corresponding column of $A^*$, where for some , such that (where is the coherence parameter). We further assume that . Let the bias of the hidden layer of the autoencoder, as defined in (1), be $\epsilon = 2 m_1 k \left(\delta + \frac{\mu}{\sqrt{n}}\right)$. Let $r$ be the vector defined in (1). Then $r_i \neq 0$ if $i \in \mathrm{supp}(x^*)$, and $r_i = 0$ if $i \notin \mathrm{supp}(x^*)$, with probability at least $1 - \exp\left(-\frac{2 h^p m_1^2}{(b-a)^2}\right)$ (with respect to the distribution on $x^*$).

As long as $\frac{h^p m_1^2}{(b-a)^2}$ is large, i.e., an increasing function of $h$, we can interpret this as saying that the probability of the adverse event is small, and we have successfully achieved support recovery at the hidden layer in the limit of large sparse code dimension.
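Support recovery at the hidden layer can be checked numerically on a small instance. The sketch below is only an illustration under stated assumptions: the dictionary (identity stacked with a normalized Hadamard basis), the values of `n`, `k`, `a`, `b`, `delta` and the uniform code distribution are hypothetical choices, not the setup of the paper. The bias is the value $2 m_1 k (\delta + \mu/\sqrt{n})$ used in the proof in Section 4.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
n, k = 256, 2                  # assumed signal dimension and support size
a, b = 1.0, 2.0                # assumed range of the non-zero code entries
m1 = 0.5 * (a + b)             # mean of Uniform[a, b]
delta = 0.02                   # assumed columnwise distance from A*

# Assumed incoherent, overcomplete dictionary with unit-norm columns.
A_star = np.hstack([np.eye(n), hadamard(n) / np.sqrt(n)])   # shape (n, 2n)
h = A_star.shape[1]
coh = np.abs(A_star.T @ A_star - np.eye(h)).max()           # plays the role of mu / sqrt(n)

# Perturb every column by a vector of norm exactly delta; rows of W are the W_i.
U = rng.standard_normal((n, h))
U /= np.linalg.norm(U, axis=0, keepdims=True)
W = (A_star + delta * U).T

eps = 2 * m1 * k * (delta + coh)   # bias from the proof of Theorem 3.1

trials, failures = 200, 0
for _ in range(trials):
    S = rng.choice(h, size=k, replace=False)
    x = np.zeros(h)
    x[S] = rng.uniform(a, b, size=k)
    r = np.maximum(W @ (A_star @ x) - eps, 0.0)
    if not np.array_equal(np.flatnonzero(r), np.sort(S)):
        failures += 1
```

With these particular values $b < 2 m_1$, so the off-support pre-activations can never exceed the bias and recovery holds on every draw; for distributions with a larger spread the guarantee is only the high-probability one stated in the theorem.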

### 3.2 Asymptotic Criticality of the Autoencoder around A∗

In this work we analyze the following standard squared loss function for the autoencoder,

$$L = \frac{1}{2}\|\hat{y} - y\|^2 \tag{2}$$

In the above we continue to use the variables as defined in equation (1). If we consider a generative model in which $A^*$ is a square, orthogonal matrix and $x^*$ is a non-negative vector (not necessarily sparse), it is easily seen that the standard squared reconstruction error loss function for the autoencoder has a global minimum at $W = A^{*\top}$. In our generative model, however, $A^*$ is an incoherent and overcomplete dictionary.
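The square orthogonal special case is easy to verify numerically. The following sketch assumes a zero bias (the text does not state the bias for this special case) and a hypothetical dimension `n = 6`; with $W = A^{*\top}$ the hidden layer returns $x^*$ exactly and the loss (2) vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                                # hypothetical dimension

# Random orthogonal dictionary from a QR factorization.
A_star, _ = np.linalg.qr(rng.standard_normal((n, n)))
x_star = rng.uniform(0.0, 1.0, size=n)               # non-negative, not sparse
y = A_star @ x_star

W = A_star.T                                         # candidate global minimum
r = np.maximum(W @ y, 0.0)                           # bias assumed to be zero
y_hat = W.T @ r
loss = 0.5 * np.sum((y_hat - y) ** 2)
```

Since $A^{*\top} A^* = I$ and $x^* \ge 0$, the ReLU acts as the identity on $A^{*\top} y = x^*$, which is why the reconstruction is exact.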

###### Theorem 3.2.

(The Main Theorem) Assume that the hypotheses of Theorem 3.1 hold, and (and hence ). Further, assume the distribution parameters satisfy that $\exp\left(\frac{h^p m_1^2}{2(b-a)^2}\right)$ is superpolynomial in $h$ (which holds, for example, when are ). Then for each $i$ indexing the columns of $W^\top$,

$$\left\|\mathbb{E}\left[\frac{\partial L}{\partial W_i}\right]\right\|_2 \le o\left(\frac{\max\{m_1^2, m_2\}}{h^{1-p}}\right).$$

We present the proof of the support recovery result, i.e., Theorem 3.1, in Section 4. Section 5 gives the proof of our main result, Theorem 3.2. The argument rests on two critical lemmas (Lemmas 5.1 and 5.2), whose proofs appear in the Supplementary material. In Section 6, we run simulations to verify Theorem 3.2. We also run experiments that strongly suggest that the standard squared loss function has a local minimum in a neighborhood around $A^*$.

## 4 A Layer of ReLU Gates can Recover the Support of the Sparse Code (Proof of Theorem 3.1)

Most sparse coding algorithms are based on an alternating minimization approach, where one iteratively finds a sparse code based on the current estimate of the dictionary, and then uses the estimated sparse code to update the dictionary. The analogue of the sparse coding step in an autoencoder is the passing through the hidden layer of activations of a certain affine transformation of the input vectors ($W^\top$ behaves as the current estimate of the dictionary). We show that under certain stochastic assumptions, the hidden layer of ReLU gates in an autoencoder recovers with high probability the support of the sparse vector which corresponds to the present input.

###### Proof of Theorem 3.1.

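From the model assumptions, we know that the dictionary is incoherent, and has unit norm columns. So, $|\langle A_i^*, A_j^* \rangle| \le \frac{\mu}{\sqrt{n}}$ for all $i \neq j$, and $\|A_i^*\|_2 = 1$ for all $i$. This means that for $i \neq j$,

$$|\langle W_i, A_j^* \rangle| = |\langle W_i - A_i^*, A_j^* \rangle + \langle A_i^*, A_j^* \rangle| \le \|W_i - A_i^*\|_2 \|A_j^*\|_2 + \frac{\mu}{\sqrt{n}} \le \delta + \frac{\mu}{\sqrt{n}} \tag{3}$$

Otherwise for $i = j$,

$$\langle W_i, A_i^* \rangle = \langle W_i - A_i^*, A_i^* \rangle + \langle A_i^*, A_i^* \rangle = \langle W_i - A_i^*, A_i^* \rangle + 1,$$

and thus,

$$1 - \delta \le \langle W_i, A_i^* \rangle \le 1 + \delta, \tag{4}$$

where we use the fact that $|\langle W_i - A_i^*, A_i^* \rangle| \le \|W_i - A_i^*\|_2 \|A_i^*\|_2 \le \delta$.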
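Inequalities (3) and (4) follow from the Cauchy-Schwarz inequality alone, so they can be checked on any unit-norm dictionary. The sketch below uses a random Gaussian dictionary with hypothetical sizes `n`, `h` and radius `delta`; these values are assumptions made only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, delta = 40, 120, 0.05          # hypothetical sizes and perturbation radius

# Unit-norm dictionary columns A*_i.
A = rng.standard_normal((n, h))
A /= np.linalg.norm(A, axis=0)

# Columns W_i at distance exactly delta from the matching A*_i.
U = rng.standard_normal((n, h))
U /= np.linalg.norm(U, axis=0)
Wt = A + delta * U

gram = A.T @ A
coh = np.abs(gram - np.diag(np.diag(gram))).max()   # max_{i != j} |<A*_i, A*_j>|, i.e. mu / sqrt(n)

C = Wt.T @ A                                        # C[i, j] = <W_i, A*_j>
off_diag_max = np.abs(C - np.diag(np.diag(C))).max()
diag = np.diag(C)
```

Here `off_diag_max` is bounded by `delta + coh` as in (3), and every entry of `diag` lies in `[1 - delta, 1 + delta]` as in (4).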

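Let $y = A^* x^*$ and let $S$ be the support of $x^*$. Then we define the input to the ReLU activation as

$$Q_i = \sum_{j \in S} \langle W_i, A_j^* \rangle x_j^* = \langle W_i, A_i^* \rangle x_i^* \mathbf{1}_{i \in S} + \sum_{j \in S \setminus \{i\}} \langle W_i, A_j^* \rangle x_j^* = \langle W_i, A_i^* \rangle x_i^* \mathbf{1}_{i \in S} + Z_i.$$

First we try to get bounds on $Q_i$ when $i \in S$. From our assumptions on the distribution of $x^*$ we have $a \le x_i^* \le b$ and $\mathbb{E}[x_i^*] = m_1$ for all $i$ in the support of $x^*$. For $i \in S$,

$$Q_i = \langle W_i, A_i^* \rangle x_i^* + Z_i \implies Q_i \ge (1 - \delta) a + Z_i$$

where we use (4). Using (3), $Z_i$ has the following bounds:

$$-bk\left(\delta + \frac{\mu}{\sqrt{n}}\right) \le Z_i \le bk\left(\delta + \frac{\mu}{\sqrt{n}}\right)$$

Plugging in the lower bound for $Z_i$ and the proposed value for the bias, we get

$$Q_i - \epsilon \ge (1 - \delta) a - bk\left(\delta + \frac{\mu}{\sqrt{n}}\right) - 2 m_1 k\left(\delta + \frac{\mu}{\sqrt{n}}\right)$$

For $Q_i - \epsilon \ge 0$, we need:

$$a \ge \frac{(b + 2 m_1)\left(\delta + \frac{\mu}{\sqrt{n}}\right) k}{1 - \delta}$$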

Now plugging in the values for the various quantities, and and , if we have , then .

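Now, for $i \notin \mathrm{supp}(x^*)$ we would like to analyze the following probability:

$$\Pr\left[Q_i - \epsilon \ge 0 \mid i \notin \mathrm{supp}(x^*)\right]$$

We first simplify the quantity as follows

$$\Pr\left[Q_i \ge \epsilon \mid i \notin \mathrm{supp}(x^*)\right] = \Pr\left[Z_i \ge \epsilon\right] = \Pr\Big[\sum_{j \in S \setminus \{i\}} \langle W_i, A_j^* \rangle x_j^* \ge \epsilon\Big]$$

Using the Chernoff bound, we can obtain

$$\Pr[Z_i \ge \epsilon] \le \inf_{t \ge 0} e^{-t\epsilon} \prod_{j \in S \setminus \{i\}} \mathbb{E}\left[e^{t \langle W_i, A_j^* \rangle x_j^*}\right] \le \inf_{t \ge 0} e^{-t\epsilon} \left(\mathbb{E}\left[e^{t\left(\delta + \frac{\mu}{\sqrt{n}}\right) x_j^*}\right]\right)^k$$

where the second inequality follows from (3) and the fact that $t$ and $x_j^*$ are both nonnegative. Applying Hoeffding's Lemma to each factor and then optimizing over $t$, we also have

$$\Pr[Z_i \ge \epsilon] \le \inf_{t \ge 0} \exp\left(-t\left(\epsilon - k\left(\delta + \tfrac{\mu}{\sqrt{n}}\right) m_1\right) + \frac{t^2 k}{8}\left(\delta + \tfrac{\mu}{\sqrt{n}}\right)^2 (b-a)^2\right) = \exp\left(-\frac{2\left(\epsilon - k\left(\delta + \frac{\mu}{\sqrt{n}}\right) m_1\right)^2}{k\left(\delta + \frac{\mu}{\sqrt{n}}\right)^2 (b-a)^2}\right).$$

Finally, since $k = h^p$ and $\epsilon = 2 m_1 k\left(\delta + \frac{\mu}{\sqrt{n}}\right)$, we have

$$\exp\left(-\frac{2\left(\epsilon - k m_1\left(\delta + \frac{\mu}{\sqrt{n}}\right)\right)^2}{h^p\left(\delta + \frac{\mu}{\sqrt{n}}\right)^2 (b-a)^2}\right) = \exp\left(-\frac{2 h^p m_1^2}{(b-a)^2}\right)$$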
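The last two steps are plain arithmetic: minimizing the quadratic exponent over $t$, then substituting the bias. The sketch below checks both numerically. The values of `h`, `p`, `a`, `b`, `m1` and `d` (which stands for $\delta + \mu/\sqrt{n}$) are hypothetical and chosen only so the numbers are easy to follow.

```python
import numpy as np

h, p = 1024, 0.5
k = h ** p                     # support size k = h^p
a, b = 1.0, 2.0                # assumed range of the code entries
m1 = 1.5                       # assumed mean of a code entry
d = 0.01                       # stands for delta + mu / sqrt(n)

eps = 2 * m1 * k * d           # bias from the theorem
c = eps - k * d * m1           # linear coefficient in the exponent

def exponent(t):
    """Exponent of the Chernoff/Hoeffding bound as a function of t."""
    return -t * c + t ** 2 * k * d ** 2 * (b - a) ** 2 / 8

# Closed-form minimizer and minimum of the quadratic in t.
t_star = 4 * c / (k * d ** 2 * (b - a) ** 2)
closed_form = -2 * c ** 2 / (k * d ** 2 * (b - a) ** 2)

# Brute-force minimum on a grid around t_star, for comparison.
ts = np.linspace(0.0, 2 * t_star, 20001)
grid_min = exponent(ts).min()

# Exponent after substituting k = h^p and the bias.
final = -2 * h ** p * m1 ** 2 / (b - a) ** 2
```

All three quantities `closed_form`, `grid_min` and `final` agree, which is the chain of equalities displayed above.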

## 5 Criticality of a neighborhood of A∗ (Proof of Theorem 3.2)

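It turns out that the expectation of the full gradient of the loss function (2) is difficult to analyze directly. Hence, corresponding to the true gradient with respect to the $i$-th column of $W^\top$, we create a proxy, denoted by $\hat{\nabla}_i L$, by replacing in the expression for the true expectation $\mathbb{E}\left[\frac{\partial L}{\partial W_i}\right]$ every occurrence of the random variable $\mathbf{1}_{r_j \neq 0}$, which indicates whether the $j$-th ReLU gate is active, by the indicator random variable $\mathbf{1}_{j \in \mathrm{supp}(x^*)}$. This proxy is shown to be a good approximant of the expected gradient in the following lemma.

###### Lemma 5.1.

Assume that the hypotheses of Theorem 3.1 hold and additionally let $b$ be bounded by a polynomial in $h$. Then we have for each $i$ (indexing the columns of $W^\top$),

$$\left\|\hat{\nabla}_i L - \mathbb{E}\left[\frac{\partial L}{\partial W_i}\right]\right\|_2 \le \mathrm{poly}(h) \exp\left(-\frac{h^p m_1^2}{2(b-a)^2}\right)$$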
###### Proof.

This lemma has been proven in Section A of the Appendix. ∎

###### Lemma 5.2.

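Assume that the hypotheses of Theorem 3.1 hold, and (and hence ). Then for each $i$ indexing the columns of $W^\top$, there exist real valued functions $\alpha_i$ and $\beta_i$, and a vector $e_i$ such that $\hat{\nabla}_i L = \alpha_i W_i - \beta_i A_i^* + e_i$, and

$$\begin{aligned}
\alpha_i &= \Theta(m_2 h^{p-1}) + o(m_1^2 h^{p-1}) \\
\beta_i &= \Theta(m_2 h^{p-1}) + o(m_1^2 h^{p-1}) \\
\alpha_i - \beta_i &= o(\max\{m_1^2, m_2\} h^{p-1}) \\
\|e_i\|_2 &= o(\max\{m_1^2, m_2\} h^{p-1})
\end{aligned}$$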
###### Proof.

This lemma has been proven in Section B of the Appendix.∎

With the above asymptotic results, we are in a position to assemble the proof of Theorem 3.2.

###### Proof of Theorem 3.2.

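Consider any $i$ indexing the columns of $W^\top$. Recall the definition of the proxy gradient $\hat{\nabla}_i L$ at the beginning of this section. Let us define $\gamma_i = \hat{\nabla}_i L - \mathbb{E}\left[\frac{\partial L}{\partial W_i}\right]$. Using $\alpha_i$, $\beta_i$ and $e_i$ as defined in Lemma 5.2, we can write the expectation of the true gradient as $\mathbb{E}\left[\frac{\partial L}{\partial W_i}\right] = \alpha_i W_i - \beta_i A_i^* + e_i - \gamma_i$. Further, by Lemma 5.1,

$$\|\gamma_i\| \le \mathrm{poly}(h) \exp\left(-\frac{h^p m_1^2}{2(b-a)^2}\right).$$

Since $\exp\left(\frac{h^p m_1^2}{2(b-a)^2}\right)$ is superpolynomial in $h$, we obtain

$$\begin{aligned}
\left\|\mathbb{E}\left[\frac{\partial L}{\partial W_i}\right]\right\|_2 &= \|\alpha_i W_i - \beta_i A_i^* + e_i - \gamma_i\|_2 \\
&= \|\alpha_i (W_i - A_i^*) + (\alpha_i - \beta_i) A_i^* + e_i - \gamma_i\|_2 \\
&\le |\alpha_i| \, \|W_i - A_i^*\|_2 + |\alpha_i - \beta_i| + \|e_i - \gamma_i\|_2 \\
&\le \frac{\Theta(m_2 h^{p-1})}{h^{2p + \theta^2}} + o(\max\{m_1^2, m_2\} h^{p-1}) + o(\max\{m_1^2, m_2\} h^{p-1}) \\
&= o(\max\{m_1^2, m_2\} h^{p-1})
\end{aligned}$$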

## 6 Simulations

We conduct some experiments on synthetic data in order to check whether the gradient norm is indeed small within the columnwise $\delta$-ball of $A^*$. We also make some observations about the landscape of the squared loss function, which has implications for being able to recover the ground-truth dictionary $A^*$.

#### Data Generation Model:

We generate random dictionaries ($A^*$) of size $n \times h$ where , and and . The dictionary entries are drawn from a standard Gaussian, and the columns of the dictionary are then normalized. These dictionaries are incoherent, with high probability. For each $h$, we generate a dataset containing sparse vectors with $h^p$ non-zero entries, where . In our experiments, the coherence parameter was approximately . We conduct experiments for values of that are at most . Here $h$ is the hidden layer dimension of the autoencoder and $p$ controls the sparsity of the data used to train the autoencoder. The support of each sparse vector is drawn uniformly from all sets of indices of size $h^p$, and the non-zero entries in the sparse vectors are drawn from a uniform distribution between and . Once we have generated the sparse vectors, we collect them in a matrix and then compute the corresponding signals by multiplying with $A^*$.
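The data generation procedure described above can be sketched as follows. This is a minimal illustration, not the code of the `autoencoders_dictionaries` repository: the helper names and every numeric value (`n`, `h`, `p`, the number of samples, and the range `[a, b]` of the non-zero entries) are hypothetical, and the support size is taken as $\lfloor h^p \rfloor$ by assumption.

```python
import numpy as np

def make_dictionary(n, h, rng):
    """Gaussian dictionary with columns normalized to unit norm."""
    A = rng.standard_normal((n, h))
    return A / np.linalg.norm(A, axis=0, keepdims=True)

def coherence(A):
    """Return mu, where mu / sqrt(n) = max_{i != j} |<A_i, A_j>|."""
    n = A.shape[0]
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return np.sqrt(n) * G.max()

def sample_codes(h, k, num, a, b, rng):
    """Sparse codes: support uniform over size-k subsets, entries Uniform[a, b)."""
    X = np.zeros((h, num))
    for t in range(num):
        S = rng.choice(h, size=k, replace=False)
        X[S, t] = rng.uniform(a, b, size=k)
    return X

rng = np.random.default_rng(0)
n, h, p = 64, 256, 0.3                  # hypothetical sizes and sparsity exponent
k = int(np.floor(h ** p))               # support size, assumed to be floor(h^p)

A_star = make_dictionary(n, h, rng)
X_star = sample_codes(h, k, num=1000, a=1.0, b=2.0, rng=rng)
Y = A_star @ X_star                     # one signal per column
mu = coherence(A_star)
```

Computing `mu` on the sampled dictionary gives the empirical coherence parameter for that draw, which is the quantity the text reports for its own dictionaries.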

We set up the autoencoder as defined through equation (1). The bias parameter in the hidden layer is set to . Choosing this prefactor of does not violate Theorem 3.1 and it was chosen to have the ReLU layer of the autoencoder recover a large fraction of the support of $x^*$. We analyze the squared loss function in (2) and its gradient with respect to a column of $W^\top$ through their empirical averages over the signals in the dataset.

#### Results:

Once we have generated the data, we compute the empirical average of the gradient of the loss function in (2) at random points which are columnwise $\delta$ away from $A^*$. We average the gradient over the points which are all at the same distance from $A^*$, and compare the average column norm of the gradient to $h^{p-1}$. Our experiments show that the average column norm of the gradient is of the order of $h^{p-1}$ (and thus falling with $h$ for any fixed $p$) as expected from Theorem 3.2. Results for points sampled at are shown in Table 1.

We also plot the squared loss of the autoencoder along a randomly chosen direction to see if $A^*$ is possibly a local minimum. More precisely, we draw a matrix from a standard normal distribution, normalize its columns, and plot the loss along the line through $A^*$ in that direction, as well as the gradient norm averaged over all the columns. For purposes of illustration, we show these plots for , in figures 1 and 2, and those for , in figures 3 and 4.

From the first four plots, we can observe that the loss function value and the gradient norm keep decreasing as we get close to $A^*$. Since the direction is randomly chosen, this suggests that $A^*$ is a local minimum for the squared loss function. The plots we show here are in the log-scale along the y-axis, which is why it seems as though there is a sharp decrease in the function value. Viewed in normal scale, the function seems to decrease smoothly to a local minimum at $A^*$.

In figures 5 and 6 we plot the function and gradient norm for and . This value of is much larger than the coherence parameter , and hence outside the region where the support recovery result, Theorem 3.1 is valid. We suspect that is now in a region where , which means the function is flat in a small neighborhood of .

## 7 Conclusion

In this paper we have undertaken a rigorous analysis of the squared loss of an autoencoder when the data is assumed to be generated by sensing of sparse high dimensional vectors by an overcomplete dictionary. We have shown that the expected gradient of this loss function is very close to zero in a neighborhood of the generating overcomplete dictionary.

Our simulations complement this theoretical result by providing further empirical support. Firstly, they show that the gradient norm in this ball of $A^*$ indeed falls with $h$ and is of the same order as $h^{p-1}$, as expected from our proof. Secondly, the experiments also strongly suggest ranges of values of $h$ and $p$ where $A^*$ is a local minimum of this loss function and that it has a neighborhood where the reconstruction error is low.

This suggests sparse coding problems can be solved by training autoencoders using gradient descent based algorithms. Further, recent investigations have led to the conjecture/belief that many important unsupervised learning tasks, e.g. recognizing handwritten digits, are sparse coding problems in disguise [25, 26]. Thus, our results could shed some light on the observed phenomenon that gradient descent based algorithms train autoencoders to low reconstruction error for natural data sets, like MNIST.

It remains to rigorously show whether a gradient descent algorithm can be initialized randomly (possibly far away from $A^*$) and still be shown to converge to this neighborhood of critical points around the dictionary. Towards that it might be helpful to understand the structure of the Hessian outside this neighborhood. Since our analysis applies to the expected gradient, it remains to analyze the sample complexities at which these results become prominent.

The possibility also remains open that this standard loss or some other loss function exists for the autoencoder with the provable property of having a global minimum at the ground truth dictionary. We have mentioned one example of such in a special case (when $A^*$ is square orthogonal and $x^*$ is nonnegative) and even in this special case it remains open to find a provable optimization algorithm.

On the simulation front we have a couple of open challenges yet to be tackled. Firstly, it is left to find efficient implementations of the iterative update rule based on the exact gradient of the proposed loss function which has been given in (2). This would open up avenues for testing the power of this loss function on real data rather than the synthetic data used here. Secondly, a simulation of the main Theorem 3.2 that can probe deeper into its claim would need to be able to sample $A^*$ for different $h$ at a fixed value of the incoherence parameter. This question of sampling $A^*$ under these constraints is an unresolved one that is left for future work.

Autoencoders with more than one hidden layer have been used for unsupervised feature learning [22] and recently there has been an analysis of the sparse coding performance of convolutional neural networks with one layer [20] and two layers of nonlinearities [39]. The connections between neural networks and sparse coding have also been recently explored in [14]. It remains an exciting open avenue of research to try to do a similar study as in this work to determine if and how deeper architectures under the same generative model might provide better means of doing sparse coding.

## Acknowledgements

Akshay Rangamani and Peter Chin are supported by the AFOSR grant FA9550-12-1-0136. Amitabh Basu and Anirbit Mukherjee gratefully acknowledge support from the NSF grant CMMI1452820. We would like to thank Raman Arora (JHU) and Siva Theja Maguluri (Georgia Institute of Technology) for illuminating comments and discussion.

## Appendix A The proxy gradient is a good approximation of the true expectation of the gradient (Proof of Lemma 5.1)

###### Proof.

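To make it easy to present this argument let us abstractly think of the function $\frac{\partial L}{\partial W_i}$ (defined for any $i$) as $f(y, W, X)$, where $X$ is the random variable recording which ReLU gates are active. It is to be noted that because of the ReLU term and its derivative this function has a dependency on $y$ even outside its dependency through $X$. Let us define another random variable $Y$, the indicator of the support of $x^*$. Then we have,

$$\begin{aligned}
\left\|\mathbb{E}_{x^*}[f(y,W,X)] - \mathbb{E}_{x^*}[f(y,W,Y)]\right\|_2 &\le \mathbb{E}_{x^*}\left[\|f(y,W,X) - f(y,W,Y)\|_2\right] \\
&\le \mathbb{E}_{x^*}\left[\|f(y,W,X)(\mathbf{1}_{X=Y} + \mathbf{1}_{X \neq Y}) - f(y,W,Y)(\mathbf{1}_{X=Y} + \mathbf{1}_{X \neq Y})\|_2\right] \\
&\le \mathbb{E}_{x^*}\left[\|f(y,W,X) - f(y,W,Y)\|_2 \, \mathbf{1}_{X \neq Y}\right] \\
&\le \sqrt{\mathbb{E}_{x^*}\left[\|f(y,W,X) - f(y,W,Y)\|_2^2\right]} \sqrt{\mathbb{E}_{x^*}\left[\mathbf{1}_{X \neq Y}\right]}
\end{aligned}$$

In the last step above we have used the Cauchy-Schwarz inequality for random variables. We recognize that $\mathbb{E}_{x^*}[f(y,W,Y)]$ is precisely what we defined as the proxy gradient $\hat{\nabla}_i L$. Further, for such $W$ as in this lemma the support recovery theorem (Theorem 3.1) holds, and that is precisely the statement that the term $\mathbb{E}_{x^*}[\mathbf{1}_{X \neq Y}]$ is small. So we can rewrite the above inequality as,

$$\left\|\mathbb{E}_{x^*}\left[\frac{\partial L}{\partial W_i}\right] - \hat{\nabla}_i L\right\|_2 \le \sqrt{\mathbb{E}_{x^*}\left[\|f(y,W,X) - f(y,W,Y)\|_2^2\right]} \exp\left(-\frac{h^p m_1^2}{2(b-a)^2}\right)$$

We remember that $f$ is a polynomial in $h$ because its $h$ dependency is through Frobenius norms of submatrices of $W$ and norms of projections of $y$. But the norm of the training vectors $y$ has been assumed to be bounded by $\mathrm{poly}(h)$. Also we have the assumption that the columns of $W^\top$ are within a $\delta$-ball of the corresponding columns of $A^*$, which in turn is an $n \times h$ dimensional matrix of bounded norm because all its columns are normalized. So summarizing we have,

$$\left\|\mathbb{E}_{x^*}\left[\frac{\partial L}{\partial W_i}\right] - \hat{\nabla}_i L\right\|_2 \le \mathrm{poly}(h) \exp\left(-\frac{h^p m_1^2}{2(b-a)^2}\right)$$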

The above inequality immediately implies the claimed lemma. ∎

## Appendix B The asymptotics of the coefficients of the gradient of the squared loss (Proof of Lemma 5.2)

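To recap, we imagine being given as input signals $y \in \mathbb{R}^n$ (imagined as column vectors), which are generated from an overcomplete dictionary $A^* \in \mathbb{R}^{n \times h}$ of a fixed incoherence. Let $x^* \in \mathbb{R}^h$ (imagined as column vectors) be the sparse code that generates $y$. The model of the autoencoder that we now have is $\hat{y} = W^\top \mathrm{ReLU}(Wy - \epsilon)$. $W$ is an $h \times n$ matrix and the $i$-th column of $W^\top$ is to be denoted as the column vector $W_i$.

### B.1 Derivative of the standard squared loss of a ReLU autoencoder

Using the above notation the squared loss of the autoencoder is $\frac{1}{2}\|\hat{y} - y\|^2$. But we introduce a dummy constant $D = 1$ to be multiplied to $y$ because this helps read the complicated equations that follow. This marker helps easily spot those terms which depend on the sensing of $x^*$ (those with a factor of $D$) as opposed to the terms which are "purely" dependent on the neural net (those without the factor of $D$). Thus we think of the squared loss of our autoencoder as,

$$L = \frac{1}{2}\|\hat{y} - Dy\|^2 = \frac{1}{2}\left(W^\top \mathrm{ReLU}(Wy - \epsilon) - Dy\right)^\top \left(W^\top \mathrm{ReLU}(Wy - \epsilon) - Dy\right) = \frac{1}{2} f^\top f$$

where we have defined $f$ as,

$$f = W^\top \mathrm{ReLU}(Wy - \epsilon) - Dy$$

Then we have, writing $\mathrm{Th}$ for the unit step function (the derivative of ReLU away from the origin),

$$J_{W_i}(f)_{ab} = \frac{\partial f_a}{\partial W_{ib}} = \mathrm{ReLU}(W_i^\top y - \epsilon_i)\,\delta_{ab} + \mathrm{Th}(W_i^\top y - \epsilon_i)\, W_{ia} y_b$$

In the form of a derivative matrix this means,

$$J_{W_i}(f) = \left[\frac{\partial f_a}{\partial W_{ib}}\right] = \mathrm{ReLU}(W_i^\top y - \epsilon_i)\, I + \mathrm{Th}(W_i^\top y - \epsilon_i)\, W_i y^\top$$

This helps us write,

$$\begin{aligned}
\frac{\partial L}{\partial W_i} &= \left(J_{W_i}(f)\right)^\top f \\
&= \left(\mathrm{ReLU}(W_i^\top y - \epsilon_i)\, I + \mathrm{Th}(W_i^\top y - \epsilon_i)\, W_i y^\top\right)^\top \left[W^\top \mathrm{ReLU}(Wy - \epsilon) - Dy\right] \\
&= \mathrm{Th}(W_i^\top y - \epsilon_i)\left[(W_i^\top y - \epsilon_i)\, I + y W_i^\top\right]\left(\sum_{j=1}^{h} \mathrm{ReLU}(W_j^\top y - \epsilon_j)\, W_j - Dy\right)
\end{aligned}$$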
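The closed form for $\partial L / \partial W_i$ can be sanity-checked against central finite differences. The sketch below takes $D = 1$ and a scalar bias; the function names, the sizes `h`, `n` and the bias value are assumptions for illustration only.

```python
import numpy as np

def loss(W, eps, y):
    """Squared loss 0.5 * ||W^T ReLU(W y - eps) - y||^2 (with D = 1)."""
    r = np.maximum(W @ y - eps, 0.0)
    f = W.T @ r - y
    return 0.5 * f @ f

def grad_row(W, eps, y, i):
    """Closed-form gradient with respect to W_i (row i of W)."""
    q = W @ y - eps                        # pre-activations W_j^T y - eps_j
    f = W.T @ np.maximum(q, 0.0) - y
    th = 1.0 if q[i] > 0 else 0.0          # Th(W_i^T y - eps_i)
    # Th * [(q_i) I + y W_i^T] f
    return th * (q[i] * f + y * (W[i] @ f))

rng = np.random.default_rng(1)
h, n = 7, 4                                # hypothetical sizes
W = rng.standard_normal((h, n))
y = rng.standard_normal(n)
eps = 0.05                                 # hypothetical scalar bias

i = int(np.argmax(W @ y - eps))            # gate with the largest pre-activation
g = grad_row(W, eps, y, i)

# Central finite differences over the entries of row i.
step = 1e-6
g_fd = np.zeros(n)
for col in range(n):
    E = np.zeros_like(W)
    E[i, col] = step
    g_fd[col] = (loss(W + E, eps, y) - loss(W - E, eps, y)) / (2 * step)
```

The comparison is only meaningful away from the kinks of the ReLU, i.e. when no pre-activation sits within the finite-difference step of zero, which holds for generic random draws.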

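Now going over to the proxy gradient corresponding to this term we define the vector $\hat{\nabla}_i L$ as,

$$\begin{aligned}
\hat{\nabla}_i L &= \mathbb{E}_{S \in \mathcal{S}}\left[\mathbf{1}_{i \in S} \times \mathbb{E}_{x^*_S}\left[\left[(W_i^\top y - \epsilon_i)\, I + y W_i^\top\right]\left(\sum_{j \in S}(W_j^\top y - \epsilon_j)\, W_j - Dy\right)\right]\right] \\
&= \mathbb{E}_{S \in \mathcal{S}}\left[\mathbf{1}_{i \in S} \times G_i\right]
\end{aligned}$$

Thus we have,

$$\begin{aligned}
G_i &= \mathbb{E}_{x^*_S}\left[\left[(W_i^\top A^* x^* - \epsilon_i)\, I + (A^* x^*) W_i^\top\right]\left(\sum_{j \in S}(W_j^\top A^* x^* - \epsilon_j)\, W_j - D A^* x^*\right)\right] \\
&= \underbrace{\mathbb{E}_{x^*_S}\left[(W_i^\top A^* x^* - \epsilon_i)\left(\sum_{j \in S}(W_j^\top A^* x^* - \epsilon_j)\, W_j - D A^* x^*\right)\right]}_{\text{Term 1}} \\
&\quad + \underbrace{\mathbb{E}_{x^*_S}\left[(A^* x^*) W_i^\top\left(\sum_{j \in S}(W_j^\top A^* x^* - \epsilon_j)\, W_j - D A^* x^*\right)\right]}_{\text{Term 2}} \\
&= \underbrace{\mathbb{E}_{x^*_S}\left[\sum_{j \in S}\epsilon_i \epsilon_j W_j - \sum_{j,k \in S}\epsilon_i (W_j^\top A_k^*) W_j x_k^* - \sum_{j,k \in S}\epsilon_j (W_i^\top A_k^*) W_j x_k^* + \sum_{j,k,l \in S}(W_i^\top A_k^*)(W_j^\top A_l^*) W_j x_l^* x_k^*\right]}_{\text{From Term 1}} \\
&\quad + \underbrace{\mathbb{E}_{x^*_S}\left[-D\sum_{j,k \in S}(W_i^\top A_k^*) A_j^* x_k^* x_j^* + D\sum_{j \in S}\epsilon_i A_j^* x_j^*\right]}_{\text{From Term 1}} + \underbrace{\mathbb{E}_{x^*_S}\left[-D\sum_{j,k \in S}(A_k^{*\top} W_i) A_j^* x_k^* x_j^*\right]}_{\text{From Term 2}} \\
&\quad + \underbrace{\mathbb{E}_{x^*_S}\left[-\sum_{j,k \in S}\epsilon_j A_k^* (W_i^\top W_j) x_k^*\right]}_{\text{From Term 2}} + \underbrace{\mathbb{E}_{x^*_S}\left[\sum_{j,k,l \in S}(W_i^\top W_j)(W_j^\top A_l^*) A_k^* x_k^* x_l^*\right]}_{\text{From Term 2}}
\end{aligned}$$

Now we invoke the distributional assumption about i.i.d. sampling of the coordinates for a fixed support and the definition of $m_1$ and $m_2$ to write $\mathbb{E}[x_k^*] = m_1$ and $\mathbb{E}[x_k^{*2}] = m_2$ for all $k \in S$, and, for $k \neq l$, $\mathbb{E}[x_k^* x_l^*] = m_1^2$. Thus we get,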

 Gi =∑j∈SϵiϵjWj−m1∑j,k∈S(W⊤jA∗k)Wjϵi−m1∑j,k∈Sϵj(W⊤iA∗k)WjG1i From Term 1 +m2∑j,k∈S(W⊤iA∗k)(W⊤jA∗k)Wj+m21∑j,k,l∈Sk≠l(W⊤iA∗k)(W⊤jA∗l)WjG2i From Term 1 +⎡