# On Optimality Conditions for Auto-Encoder Signal Recovery

Auto-Encoders are unsupervised models that aim to learn patterns from observed data by minimizing a reconstruction cost. The useful representations learned are often found to be sparse and distributed. On the other hand, compressed sensing and sparse coding assume a data generating process, where the observed data is generated from some true latent signal source, and try to recover the corresponding signal from measurements. Looking at auto-encoders from this signal recovery perspective enables us to have a more coherent view of these techniques. In this paper, in particular, we show that the true hidden representation can be approximately recovered if the weight matrices are highly incoherent with unit $\ell_2$ row length and the bias vectors take a value (approximately) equal to the negative of the data mean. The recovery also becomes more and more accurate as the sparsity in hidden signals increases. Additionally, we empirically demonstrate that auto-encoders are capable of recovering the data generating dictionary when only data samples are given.


## 1 Introduction

* Equal contribution

Recovering a hidden signal from measurement vectors (observations) is a long-studied problem in compressed sensing and sparse coding, with many successful applications. On the other hand, auto-encoders (AEs) (Bourlard and Kamp, 1988) are useful for unsupervised representation learning for uncovering patterns in data. AEs focus on learning a mapping $x \mapsto h \mapsto \hat{x}$, where the reconstructed vector $\hat{x}$ is desired to be as close to $x$ as possible for the entire data distribution. What we show in this paper is that if we consider that $x$ is actually generated from some true sparse signal $h$ by some process (see section 3), then switching our perspective on AEs to analyze $h \mapsto x \mapsto \hat{h}$ shows that an AE is capable of recovering the true signal that generated the data, and yields useful insights into the optimality of model parameters of auto-encoders in terms of signal recovery. In other words, this perspective lets us look at AEs from a signal recovery point of view where forward propagating $x$ recovers the true signal $h$. We analyze the conditions under which the encoder part of an AE recovers the true $h$ from $x$, while the decoder part acts as the data generating process. Our main result shows that the true sparse signal $h$ (with mild distribution assumptions) can be approximately recovered by the encoder of an AE with high probability under certain conditions on the weight matrix and bias vectors. Additionally, we empirically show that in a practical setting when only data is observed, optimizing the AE objective leads to the recovery of both the data generating dictionary $W$ and the true sparse signal $h$, which together is not well studied in the auto-encoder framework, to the best of our knowledge.

## 2 Sparse Signal Recovery Perspective

It is known, both empirically and theoretically, that useful features learned by AEs are usually sparse (Memisevic et al., 2014; Nair and Hinton, 2010; Arpit et al., 2016). An important question that has not been answered yet is whether AEs are capable of recovering sparse signals in general. This is an important question for sparse coding, which entails recovering the sparsest $h$ that approximately satisfies $x = W^T h$, for any given data vector $x$ and overcomplete weight matrix $W$. However, since this problem is NP-hard (Amaldi and Kann, 1998), it is usually relaxed to solving an expensive optimization problem (Candes et al., 2006; Candes and Tao, 2006),

$$\arg\min_{h} \|x - W^T h\|^2 + \lambda \|h\|_1 \tag{1}$$

where $W \in \mathbb{R}^{m \times n}$ is a fixed overcomplete ($m > n$) dictionary, $\lambda$ is the regularization coefficient, $x \in \mathbb{R}^n$ is the data and $h \in \mathbb{R}^m$ is the signal to recover. For this special case, Makhzani and Frey (2013) analyzed the condition under which linear AEs can recover the support of the hidden signal.
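To make the cost of eq. 1 concrete, the following is a minimal sketch of one standard way to solve it, the iterative shrinkage-thresholding algorithm (ISTA). ISTA is not a method proposed in this paper; it is swapped in here only as an illustration, and the function names `soft_threshold`, `ista` and `objective` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(x, W, h, lam):
    """Value of eq. 1: ||x - W^T h||^2 + lam * ||h||_1."""
    return np.sum((x - W.T @ h) ** 2) + lam * np.sum(np.abs(h))

def ista(x, W, lam, n_iter=500):
    """Approximately minimise eq. 1 over h for a fixed dictionary W (m x n)."""
    h = np.zeros(W.shape[0])
    # Lipschitz constant of the gradient of the quadratic term: 2 * sigma_max(W)^2
    L = 2.0 * np.linalg.norm(W, 2) ** 2
    for _ in range(n_iter):
        grad = 2.0 * W @ (W.T @ h - x)          # gradient of ||x - W^T h||^2
        h = soft_threshold(h - grad / L, lam / L)
    return h

# Hypothetical toy problem: overcomplete dictionary with m = 8 atoms in n = 5 dimensions.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))
h_true = np.zeros(8)
h_true[[1, 6]] = [1.0, 0.5]
x = W.T @ h_true
h_est = ista(x, W, lam=0.1)
print("objective at zero:", objective(x, W, np.zeros(8), 0.1))
print("objective at ISTA solution:", objective(x, W, h_est, 0.1))
```

Each data vector requires its own iterative solve, which is the expense the text refers to; an AE encoder instead produces its estimate in a single forward pass.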

The general AE objective, on the other hand, minimizes the expected reconstruction cost

$$J_{AE} = \mathbb{E}_{x}\left[\mathcal{L}\left(x,\; s_d\left(W^T s_e(Wx + b_e) + b_d\right)\right)\right] \tag{2}$$

for some reconstruction cost $\mathcal{L}$, encoding and decoding activation functions $s_e(\cdot)$ and $s_d(\cdot)$, and bias vectors $b_e$ and $b_d$. In this paper we consider a linear decoding activation $s_d$ because it is a more general case. Notice, however, that in the case of auto-encoders the activation functions can be non-linear in general, in contrast to the sparse coding objective. In addition, in the case of AEs we do not have a separate parameter for the hidden representation corresponding to every data sample individually. Instead, the hidden representation for every sample is a parametric function of the sample itself. This is an important distinction between the optimization in eq. 1 and our problem: the identity of $h$ in eq. 1 is only well defined in the presence of $\ell_1$ regularization due to the overcompleteness of the dictionary. However, in our problem, we assume a true signal $h$ generates the observed data $x$ as $x = W^T h + b_d$, where the dictionary $W$ and bias vector $b_d$ are fixed. Hence, what we mean by recovery of sparse signals in an AE framework is the following: if we generate data using the above generation process, can the estimate $\hat{h} = s_e(Wx + b_e)$ indeed recover the true $h$ for some activation function $s_e$ and bias vector $b_e$? And if so, what properties of $W$ and $b_e$ lead to good recovery? When given an $x$ and the true overcomplete $W$, the solution $h$ to $x = W^T h$ is not unique, so the question arises whether recovering such an $h$ is possible at all. However, as we show, recovery using the AE mechanism is strongest when the signal $h$ is the sparsest possible one, which, from compressed sensing theory, guarantees uniqueness of $h$ if $W$ is sufficiently incoherent. (Coherence is defined as $\max_{i \neq j} \frac{|W_i^T W_j|}{\|W_i\| \|W_j\|}$.)

## 3 Data Generation Process

We consider the following data generation process:

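$$x = W^T h + b_d + e \tag{3}$$

where $x \in \mathbb{R}^n$ is the observed data, $b_d \in \mathbb{R}^n$ is a bias vector, $e \in \mathbb{R}^n$ is a noise vector, $W \in \mathbb{R}^{m \times n}$ is the weight matrix and $h \in \mathbb{R}^m$ is the true hidden representation (signal) that we want to recover. Throughout our analysis, we assume that the signal $h$ belongs to the following class of distributions.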

###### Assumption 1.

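Bounded Independent Non-negative Sparse (BINS): Every hidden unit $h_j$ is an independent random variable with the following density function:

$$f(h_j) = \begin{cases} (1 - p_j)\,\delta_0(h_j) & \text{if } h_j = 0 \\ p_j\, f_c(h_j) & \text{if } h_j \in (0, l_{\max_j}] \end{cases} \tag{4}$$

where $f_c(\cdot)$ can be any arbitrary normalized distribution bounded in the interval $(0, l_{\max_j}]$ with mean $\mu_{h_j}$, and $\delta_0(\cdot)$ is the Dirac delta function at zero. As a shorthand, we say that $h_j$ follows the distribution BINS($p$, $f_c$, $\mu_h$, $l_{\max}$). Notice that $\mathbb{E}_{h_j}[h_j] = p_j \mu_{h_j}$.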
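As an illustration of assumption 1, the sketch below draws samples from a BINS distribution under the assumption that $f_c$ is uniform on $(0, l_{\max_j}]$, so that $\mu_{h_j} = l_{\max_j}/2$. The helper name `sample_bins` and the parameter values are hypothetical and chosen only for demonstration.

```python
import numpy as np

def sample_bins(p, l_max, size, rng):
    """Draw `size` signals; unit j is 0 with prob. 1 - p[j], else Uniform(0, l_max[j]]."""
    p = np.asarray(p, dtype=float)
    l_max = np.asarray(l_max, dtype=float)
    m = p.shape[0]
    active = rng.random((size, m)) < p
    # rng.random() returns values in [0, 1), so 1 - U lies in (0, 1]
    values = (1.0 - rng.random((size, m))) * l_max
    return np.where(active, values, 0.0)

rng = np.random.default_rng(0)
p = np.array([0.1, 0.5])          # hypothetical activation probabilities
l_max = np.array([1.0, 2.0])      # hypothetical upper bounds
H = sample_bins(p, l_max, size=200_000, rng=rng)

# Empirical mean versus the closed form E[h_j] = p_j * mu_j with mu_j = l_max_j / 2
print("empirical mean:", H.mean(axis=0))
print("p * mu        :", p * l_max / 2.0)
```

The empirical means should agree with $p_j \mu_{h_j}$ up to sampling noise, which is the identity used later when the bias is rewritten in terms of the data mean.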

The above distribution assumption fits naturally with sparse coding, when the intended signal is non-negative sparse. From the AE perspective, it is also justified based on the following observation. In neural networks with ReLU activations, hidden unit pre-activations have a Gaussian-like symmetric distribution (Hyvärinen and Oja, 2000; Ioffe and Szegedy, 2015). If we assume these distributions are mean centered, then the hidden units' distribution after ReLU has a large mass at $0$ while the rest of the mass concentrates in $(0, l_{\max}]$ for some finite positive $l_{\max}$, because the pre-activations concentrate symmetrically around zero. (Mean centering happens, for instance, as a result of the Batch Normalization technique (Ioffe and Szegedy, 2015), which leads to significantly faster convergence. It is thus good practice to have a mean-centered pre-activation distribution.) As we show in the next section, ReLU is indeed capable of recovering such signals. On a side note, the distribution from assumption 1 can take shapes similar to those of the Exponential or Rectified Gaussian distributions, depending on the distribution $f_c(\cdot)$ (these are generally used for modeling biological neurons), but is simpler to analyze. This is because we allow $f_c(\cdot)$ to be any arbitrary normalized distribution. The only restriction in assumption 1 is that $f_c(\cdot)$ be bounded. However, this does not change the representative power of this distribution significantly because: a) the distributions used for modeling neurons have very small tail mass; b) in practice, we are generally interested in signals with upper bounded values.

The generation process considered in this section (i.e. eq. 3 and assumption 1) is justified because:
1. This data generation model finds applications in a number of areas (Yang et al., 2009; Kavukcuoglu et al., 2010; Wright et al., 2009). Notice that while $x$ is the measurement vector (observed data), which can in general be noisy, $h$ denotes the actual signal (internal representation) because it reflects the combination of dictionary ($W^T$) atoms involved in generating the observed samples and hence serves as the true identity of the data.
2. Sparse distributed representation (Hinton, 1984) is both observed and desired in hidden representations. It has been empirically shown that representations that are truly sparse (i.e. with a large number of true zeros) and distributed usually yield better linear separability and performance (Glorot et al., 2011; Wright et al., 2009; Yang et al., 2009).

Decoding bias ($b_d$): Consider the data generation process (excluding noise for now) $x = W^T h + b_d$. Here $b_d$ is a bias vector which can take any arbitrary value but, similar to $W$, is fixed for any particular data generation process. However, the following remark shows that if an AE can recover the sparse code ($h$) from a data sample generated as $x = W^T h$, then it is also capable of recovering the sparse code from the data generated as $x = W^T h + b_d$, and vice versa.

###### Remark 1.

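Let $x_1 = W^T h$ where $x_1 \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$ and $h \in \mathbb{R}^m$. Let $x_2 = W^T h + b_d$ where $b_d \in \mathbb{R}^n$ is a fixed vector. Let $\hat{h}_1 = s_e(W x_1 + b)$ and $\hat{h}_2 = s_e(W x_2 + b - W b_d)$. Then $\hat{h}_1 = h$ iff $\hat{h}_2 = h$.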

Thus, without any loss of generality, we will assume our data is generated from $x = W^T h + e$.
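The argument behind remark 1 is that a decoding bias can be absorbed into the encoding bias: shifting the encoder bias by $-W b_d$ cancels the shift that $b_d$ introduces in the pre-activation. A minimal numerical sketch, assuming a ReLU encoder and arbitrary (hypothetical) random parameters:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
m, n = 6, 4                                  # hypothetical sizes
W = rng.standard_normal((m, n))
h = relu(rng.standard_normal(m))             # some non-negative signal
b = rng.standard_normal(m)                   # arbitrary encoding bias
b_d = rng.standard_normal(n)                 # arbitrary decoding bias

x1 = W.T @ h                                 # data generated without a decoding bias
x2 = W.T @ h + b_d                           # data generated with a decoding bias

h1_hat = relu(W @ x1 + b)
h2_hat = relu(W @ x2 + b - W @ b_d)          # encoding bias shifted by -W b_d

print("max difference between the two estimates:", np.abs(h1_hat - h2_hat).max())
```

The two estimates coincide because $W x_2 + b - W b_d = W x_1 + b$, so whichever one recovers $h$, the other does too.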

## 4 Signal Recovery Analysis

We analyse two separate classes of signals in this category: continuous sparse and binary sparse signals that follow BINS. For notational convenience, we will drop the subscript of $b_e$ and simply refer to this parameter as $b$ since it is the only bias vector (we are not considering the other bias due to remark 1). The Auto-Encoder signal recovery mechanism that we analyze throughout this paper is defined as follows.

###### Definition 1.

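Let a data sample $x \in \mathbb{R}^n$ be generated by the process $x = W^T h + e$ where $W \in \mathbb{R}^{m \times n}$ is a fixed matrix, $e$ is noise and $h \in \mathbb{R}^m$. Then we define the Auto-Encoder signal recovery mechanism as $\hat{h}_{s_e}(x; W, b)$ that recovers the estimate $\hat{h} = s_e(Wx + b)$ where $s_e(\cdot)$ is an activation function.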

### 4.1 Binary Sparse Signal Analysis

First we consider the noiseless case of data generation,

###### Theorem 1.

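(Noiseless Binary Signal Recovery): Let each element of $h$ follow BINS($p$, $\delta_1$, $\mu_h$, $l_{\max}$) and let $\hat{h}_{s_e}(x; W, b)$ be an auto-encoder signal recovery mechanism with Sigmoid activation function and bias $b$ for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h$. If we set $b_i = -\sum_j a_{ij} p_j \;\; \forall i \in [m]$, then $\forall \delta \in (0, 1)$,

$$\Pr\left(\frac{1}{m}\|\hat{h} - h\|_1 \le \delta\right) \ge 1 - \sum_{i=1}^{m}\left((1 - p_i)\exp\left(-2\,\frac{(\delta' + p_i a_{ii})^2}{\sum_{j=1, j \ne i}^{m} a_{ij}^2}\right) + p_i \exp\left(-2\,\frac{(\delta' + (1 - p_i) a_{ii})^2}{\sum_{j=1, j \ne i}^{m} a_{ij}^2}\right)\right) \tag{5}$$

where $a_{ij} = W_i^T W_j$, $\delta' = \ln\left(\frac{\delta}{1 - \delta}\right)$ and $W_i$ is the $i$-th row of the matrix $W$ cast as a column vector.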

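Analysis: We first analyse the properties of the weight matrix $W$ that result in a strong recovery bound. Notice the terms $(\delta' + p_i a_{ii})^2$ and $(\delta' + (1 - p_i) a_{ii})^2$ need to be as large as possible, while simultaneously, the term $\sum_{j \ne i} a_{ij}^2$ needs to be as close to zero as possible. For the sake of analysis, let us set $\delta' = 0$ (achieved when $\delta = 0.5$). Setting $\delta = 0.5$ is not such a bad choice after all because, for binary signals, we can recover the exact true signal with high probability by simply binarizing the signal recovered by Sigmoid with some threshold. Then our problem reduces to maximizing the ratio $\frac{a_{ii}^2}{\sum_{j \ne i} a_{ij}^2} = \frac{\|W_i\|^4}{\sum_{j \ne i} \|W_i\|^2 \|W_j\|^2 \cos^2\theta_{ij}} = \frac{\|W_i\|^2}{\sum_{j \ne i} \|W_j\|^2 \cos^2\theta_{ij}}$, where $\theta_{ij}$ is the angle between $W_i$ and $W_j$. From the property of coherence, if the rows of the weight matrix are highly incoherent, then $\cos\theta_{ij}$ is close to $0$. Again, for ease of analysis, let us replace each $\cos\theta_{ij}$ with a small positive number $\epsilon$. Then $\frac{a_{ii}^2}{\sum_{j \ne i} a_{ij}^2} \approx \frac{\|W_i\|^2}{\epsilon^2 \sum_{j \ne i} \|W_j\|^2}$. Finally, since we would want this term to be maximized for each hidden unit equally, the obvious choice for each weight length $\|W_i\|$ ($i \in [m]$) is to set it to $1$.

Finally, let us analyse the bias vector. Notice we have instantiated each element of the encoding bias $b_i$ to take the value $-\sum_j a_{ij} p_j$. Since $p_j$ is essentially the mean of each binary hidden unit $h_j$, we can say that $b_i = -\sum_j a_{ij}\, \mathbb{E}_{h_j}[h_j] = -W_i^T\, \mathbb{E}_h[x]$.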

Signal recovery is strong for binary signals when the recovery mechanism is given by

$$\hat{h}_i \triangleq \text{Sigmoid}\left(W_i^T (x - \mathbb{E}_h[x])\right) \tag{6}$$

where the rows of $W$ are highly incoherent, each hidden weight vector has unit length ($\|W_i\|_2 = 1$), and each dimension of the data $x$ is approximately uncorrelated (see theorem 3).
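The mechanism of eq. 6, followed by the thresholding step mentioned in the analysis, can be sketched as below. The sizes, the activation probability and the threshold of 0.5 are hypothetical choices for illustration, and `recover_binary` is not a name from the paper. Data are stored row-wise, so a batch of signals `H` (N x m) generates `X = H @ W`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recover_binary(X, W, x_mean, threshold=0.5):
    """Eq. 6 applied to every row of X (N x n), then binarised with a threshold."""
    h_soft = sigmoid((X - x_mean) @ W.T)
    return (h_soft > threshold).astype(float)

rng = np.random.default_rng(0)
N, m, n, p = 2000, 40, 30, 0.05               # hypothetical configuration
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit-length rows
H = (rng.random((N, m)) < p).astype(float)     # binary sparse signals
X = H @ W                                      # row-wise form of x = W^T h
x_mean = np.full(m, p) @ W                     # E[x] = W^T E[h], with E[h_j] = p

H_hat = recover_binary(X, W, x_mean)
print("fraction of correctly recovered units:", (H_hat == H).mean())
```

In the idealised case where the rows of $W$ are exactly orthonormal, the pre-activation equals $h_i - p_i$, which is positive exactly when $h_i = 1$, so thresholding recovers the signal without error; for an overcomplete $W$ the cross terms $a_{ij}$ act as interference.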

Now we state the recovery bound for the noisy data generation scenario.

###### Proposition 1.

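(Noisy Binary Signal Recovery): Let each element of $h$ follow BINS($p$, $\delta_1$, $\mu_h$, $l_{\max}$) and let $\hat{h}_{s_e}(x; W, b)$ be an auto-encoder signal recovery mechanism with Sigmoid activation function and bias $b$ for a measurement vector $x = W^T h + e$ where $e$ is any noise vector independent of $h$. If we set $b_i = -\sum_j a_{ij} p_j - W_i^T\, \mathbb{E}_e[e] \;\; \forall i \in [m]$, then $\forall \delta \in (0, 1)$,

\begin{align}
\Pr\left(\frac{1}{m}\|\hat{h} - h\|_1 \le \delta\right) \ge 1 - \sum_{i=1}^{m}\Bigg( & (1 - p_i)\exp\left(-2\,\frac{\left(\delta' - W_i^T(e - \mathbb{E}_e[e]) + p_i a_{ii}\right)^2}{\sum_{j=1, j \ne i}^{m} a_{ij}^2}\right) \tag{7} \\
& + p_i \exp\left(-2\,\frac{\left(\delta' - W_i^T(e - \mathbb{E}_e[e]) + (1 - p_i) a_{ii}\right)^2}{\sum_{j=1, j \ne i}^{m} a_{ij}^2}\right)\Bigg) \tag{8}
\end{align}

where $a_{ij} = W_i^T W_j$, $\delta' = \ln\left(\frac{\delta}{1 - \delta}\right)$ and $W_i$ is the $i$-th row of the matrix $W$ cast as a column vector.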

We have not assumed any distribution on the noise random variable $e$, and this term has no effect on recovery (compared to the noiseless case) if the noise distribution is orthogonal to the hidden weight vectors. Again, the same properties of $W$ lead to better recovery as in the noiseless case. In the case of bias, we have set each element of the bias $b_i = -\sum_j a_{ij} p_j - W_i^T\, \mathbb{E}_e[e]$. Notice from the definition of BINS that $\mathbb{E}_{h_j}[h_j] = p_j$. Thus, in essence, $b_i = -\sum_j a_{ij}\, \mathbb{E}_{h_j}[h_j] - W_i^T\, \mathbb{E}_e[e]$. Expanding $a_{ij}$, we get $b_i = -W_i^T W^T\, \mathbb{E}_h[h] - W_i^T\, \mathbb{E}_e[e] = -W_i^T\, \mathbb{E}_x[x]$. Thus the expression of the bias is unaffected by error statistics as long as we can compute the data mean.

In this section, we will first consider the case when data ($x$) is generated by the linear process $x = W^T h + e$, and show that if $W$ and the encoding bias $b$ have certain properties, then the signal recovery bound (on $\|\hat{h} - h\|_1$) is strong. We will then consider the case when data generated by a non-linear process (for a certain class of functions $s_d(\cdot)$) can be recovered as well by the same mechanism. For deep non-linear networks, this means that forward propagating data to hidden layers, such that the network parameters satisfy the required conditions, implies each hidden layer recovers the true signal that generated the corresponding data. We have moved all the proofs to the appendix for better readability.

### 4.2 Continuous Sparse Signal Recovery

###### Theorem 2.

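(Noiseless Continuous Signal Recovery): Let each element of $h$ follow the BINS($p$, $f_c$, $\mu_h$, $l_{\max}$) distribution and let $\hat{h}_{s_e}(x; W, b)$ be an auto-encoder signal recovery mechanism with Rectified Linear activation function (ReLU) and bias $b$ for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h$. If we set $b_i = -\sum_j a_{ij} p_j \mu_{h_j} \;\; \forall i \in [m]$, then $\forall \delta \ge 0$,

$$\Pr\left(\frac{1}{m}\|\hat{h} - h\|_1 \le \delta\right) \ge 1 - \sum_{i=1}^{m}\left(\exp\left(-2\,\frac{\left(\delta + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0, a_{ij})\right)^2}{\sum_j a_{ij}^2\, l_{\max_j}^2}\right) + \exp\left(-2\,\frac{\left(\delta + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0, -a_{ij})\right)^2}{\sum_j a_{ij}^2\, l_{\max_j}^2}\right)\right) \tag{9}$$

where the $a_i$ are vectors such that

$$a_{ij} = \begin{cases} W_i^T W_j & \text{if } i \ne j \\ W_i^T W_i - 1 & \text{if } i = j \end{cases} \tag{10}$$

and $W_i$ is the $i$-th row of the matrix $W$ cast as a column vector.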

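Analysis: We first analyze the properties of the weight matrix that result in a strong recovery bound. We find that for strong recovery, the terms $\left(\delta + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0, a_{ij})\right)^2$ and $\left(\delta + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0, -a_{ij})\right)^2$ should be as large as possible, while simultaneously, the term $\sum_j a_{ij}^2\, l_{\max_j}^2$ needs to be as close to zero as possible. First, notice the term $(1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j})$. Since $\mu_{h_j} \le l_{\max_j}$ by definition, both terms containing $(1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j})$ are positive and contribute towards stronger recovery if $p_j$ is less than $0.5$ (sparse), and the contribution becomes stronger as the signal becomes sparser (smaller $p_j$).

Now if we assume the rows of the weight matrix $W$ are highly incoherent and that each row of $W$ has unit length, then it is safe to assume each $a_{ij}$ ($\forall i, j$) is close to $0$ from the definition of $a_{ij}$ and the properties of $W$ we have assumed. Then for any small positive value of $\delta$, we can approximately say $\exp\left(-2\,\frac{\left(\delta + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0, a_{ij})\right)^2}{\sum_j a_{ij}^2\, l_{\max_j}^2}\right) \approx \exp\left(-2\,\frac{\delta^2}{\sum_j a_{ij}^2\, l_{\max_j}^2}\right)$ where each $a_{ij}$ is very close to zero. The same argument holds similarly for the other term. Thus we find that a strong signal recovery bound would be obtained if the weight matrix is highly incoherent and all hidden vectors are of unit length.

In the case of bias, we have set each element of the bias $b_i = -\sum_j a_{ij} p_j \mu_{h_j}$. Notice from the definition of BINS that $\mathbb{E}_{h_j}[h_j] = p_j \mu_{h_j}$. Thus, in essence, $b_i = -\sum_j a_{ij}\, \mathbb{E}_{h_j}[h_j]$. Expanding $a_{ij}$, we get $b_i = -W_i^T W^T\, \mathbb{E}_h[h] + \mathbb{E}_{h_i}[h_i] = -W_i^T\, \mathbb{E}_h[x] + \mathbb{E}_{h_i}[h_i]$.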

The recovery bound is strong for continuous signals when the recovery mechanism is set to

$$\hat{h}_i \triangleq \text{ReLU}\left(W_i^T (x - \mathbb{E}_x[x]) + \mathbb{E}_{h_i}[h_i]\right) \tag{11}$$

and the rows of $W$ are highly incoherent and each hidden weight vector has unit length ($\|W_i\|_2 = 1$).
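A minimal sketch of the mechanism in eq. 11 follows. It assumes $f_c$ is uniform, uses empirical means in place of the expectations, and picks hypothetical sizes; `recover_relu` is an illustrative name, and the random Gaussian $W$ is only moderately incoherent.

```python
import numpy as np

def recover_relu(X, W, x_mean, h_mean):
    """Eq. 11 applied to every row of X: ReLU(W_i^T (x - E[x]) + E[h_i])."""
    return np.maximum((X - x_mean) @ W.T + h_mean, 0.0)

rng = np.random.default_rng(0)
N, m, n = 2000, 40, 30                          # hypothetical sizes (overcomplete: m > n)
p, l_max = 0.05, 1.0                            # hypothetical BINS parameters
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-length rows

# BINS signals with f_c uniform on (0, l_max)
H = np.where(rng.random((N, m)) < p, rng.random((N, m)) * l_max, 0.0)
X = H @ W                                       # row-wise form of x = W^T h

H_hat = recover_relu(X, W, X.mean(axis=0), H.mean(axis=0))
print("mean absolute recovery error:", np.abs(H_hat - H).mean())
```

Rerunning the sketch with larger `p` is a quick way to see the dependence on sparsity that the bound predicts, although the exact numbers depend on the random draw.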

Now we state the recovery bound for the noisy data generation scenario.

###### Proposition 2.

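(Noisy Continuous Signal Recovery): Let each element of $h$ follow the BINS($p$, $f_c$, $\mu_h$, $l_{\max}$) distribution and let $\hat{h}_{s_e}(x; W, b)$ be an auto-encoder signal recovery mechanism with Rectified Linear activation function (ReLU) and bias $b$ for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h + e$ where $e$ is any noise random vector independent of $h$. If we set $b_i = -\sum_j a_{ij} p_j \mu_{h_j} - W_i^T\, \mathbb{E}_e[e] \;\; \forall i \in [m]$, then $\forall \delta \ge 0$,

$$\Pr\left(\frac{1}{m}\|\hat{h} - h\|_1 \le \delta\right) \ge 1 - \sum_{i=1}^{m}\left(\exp\left(-2\,\frac{\left(\delta - W_i^T(e - \mathbb{E}_e[e]) + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0, a_{ij})\right)^2}{\sum_j a_{ij}^2\, l_{\max_j}^2}\right) + \exp\left(-2\,\frac{\left(\delta - W_i^T(e - \mathbb{E}_e[e]) + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0, -a_{ij})\right)^2}{\sum_j a_{ij}^2\, l_{\max_j}^2}\right)\right) \tag{12}$$

where the $a_i$ are vectors such that

$$a_{ij} = \begin{cases} W_i^T W_j & \text{if } i \ne j \\ W_i^T W_i - 1 & \text{if } i = j \end{cases} \tag{13}$$

and $W_i$ is the $i$-th row of the matrix $W$ cast as a column vector.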

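Notice that we have not assumed any distribution on the variable $e$, which denotes the noise. Also, this term has no effect on recovery (compared to the noiseless case) if the noise distribution is orthogonal to the hidden weight vectors. On the other hand, the same properties of $W$ lead to better recovery as in the noiseless case. In the case of bias, we have set each element of the bias $b_i = -\sum_j a_{ij} p_j \mu_{h_j} - W_i^T\, \mathbb{E}_e[e]$. From the definition of BINS, $\mathbb{E}_{h_j}[h_j] = p_j \mu_{h_j}$. Thus $b_i = -\sum_j a_{ij}\, \mathbb{E}_{h_j}[h_j] - W_i^T\, \mathbb{E}_e[e]$. Expanding $a_{ij}$, we get $b_i = -W_i^T W^T\, \mathbb{E}_h[h] - W_i^T\, \mathbb{E}_e[e] + \mathbb{E}_{h_i}[h_i] = -W_i^T\, \mathbb{E}_x[x] + \mathbb{E}_{h_i}[h_i]$. Thus the expression of the bias is unaffected by error statistics as long as we can compute the data mean (i.e. the recovery is the same as shown in eq. 11).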

### 4.3 Properties of Generated Data

Since the data we observe results from the hidden signal given by $x = W^T h$, it would be interesting to analyze the distribution of the generated data. This would provide us more insight into what kind of pre-processing would ensure stronger signal recovery.

###### Theorem 3.

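(Uncorrelated Distribution Bound): If data is generated as $x = W^T h$ where $h \in \mathbb{R}^m$ has covariance matrix $\text{diag}(\zeta)$ ($\zeta \in \mathbb{R}_{+}^{m}$), and $W \in \mathbb{R}^{m \times n}$ ($m > n$) is such that each row of $W$ has unit length and the rows of $W$ are maximally incoherent, then the covariance matrix of the generated data is approximately spherical (uncorrelated), satisfying

$$\min_{\alpha} \|\Sigma - \alpha I\|_F \le \sqrt{\frac{1}{n}\left(m \|\zeta\|_2^2 - \|\zeta\|_1^2\right)} \tag{14}$$

where $\Sigma$ is the covariance matrix of the generated data.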

Analysis: Notice that for any vector $v \in \mathbb{R}^m$, $m \|v\|_2^2 \ge \|v\|_1^2$, and the equality holds when each element of the vector is identical.
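One way to see this inequality, and why equality requires identical elements, is the following standard identity (stated here as a supporting step rather than as part of the original proof):

$$m \|v\|_2^2 - \|v\|_1^2 = m \sum_{i=1}^{m} v_i^2 - \Big(\sum_{i=1}^{m} |v_i|\Big)^2 = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \left(|v_i| - |v_j|\right)^2 \ge 0$$

The right-hand side vanishes exactly when all $|v_i|$ are equal. Applied to $\zeta$ in eq. 14, whose entries are non-negative variances, the bound is zero when all hidden units have the same variance.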

Data $x$ generated using a maximally incoherent dictionary $W$ (with unit row length) as $x = W^T h$ is guaranteed to be highly uncorrelated if $h$ is uncorrelated with near-identity covariance. This would ensure the hidden units at the following layer are also uncorrelated during training. Further, the covariance matrix of $x$ is proportional to the identity if all hidden units have equal variance.

This analysis acts as a justification for data whitening, where data is processed to have zero mean and identity covariance matrix. Notice that although the generated data does not have zero mean, the recovery process (eq. 11) subtracts the data mean and hence this does not affect recovery.
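The quantities in eq. 14 are easy to compute numerically. The sketch below evaluates the left-hand side using the fact that $\min_\alpha \|\Sigma - \alpha I\|_F$ is attained at $\alpha = \text{tr}(\Sigma)/n$, and the right-hand side from $\zeta$. The helper names are hypothetical, and the random Gaussian $W$ used in the demonstration is not maximally incoherent, so the theorem's premise is only approximated there.

```python
import numpy as np

def sphericity_gap(Sigma):
    """Return (min_alpha ||Sigma - alpha I||_F, minimising alpha = trace(Sigma) / n)."""
    n = Sigma.shape[0]
    alpha = np.trace(Sigma) / n
    return np.linalg.norm(Sigma - alpha * np.eye(n), "fro"), alpha

def theorem3_bound(zeta, n):
    """Right-hand side of eq. 14 for hidden-unit variances zeta (length m)."""
    zeta = np.asarray(zeta, dtype=float)
    m = zeta.shape[0]
    val = (m * np.sum(zeta ** 2) - np.sum(np.abs(zeta)) ** 2) / n
    return np.sqrt(max(val, 0.0))     # guard against tiny negative round-off

rng = np.random.default_rng(0)
m, n = 40, 30                          # hypothetical sizes
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)
zeta = rng.uniform(0.05, 0.15, size=m)          # hypothetical hidden-unit variances
Sigma = W.T @ np.diag(zeta) @ W                 # covariance of x = W^T h

gap, alpha = sphericity_gap(Sigma)
print("min_alpha ||Sigma - alpha I||_F:", gap, "at alpha =", alpha)
print("right-hand side of eq. 14      :", theorem3_bound(zeta, n))
```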

### 4.4 Connections with existing work

Auto-Encoders (AE): Our analysis reveals the conditions on parameters of an AE that lead to strong recovery of $h$ (for both the continuous and binary case), which ultimately implies low data reconstruction error.

However, the above arguments hold for AEs from a recovery point of view. Training an AE on data may lead to learning of the identity function. Thus AEs are usually trained along with a bottleneck to make the learned representation useful. One such bottleneck is the de-noising criterion given by,

$$J_{DAE} = \min_{W, b} \|x - W^T s_e(W \tilde{x} + b)\|^2 \tag{15}$$

where $s_e(\cdot)$ is the activation function and $\tilde{x}$ is a corrupted version of $x$. It has been shown that the Taylor expansion of the DAE objective (Theorem 3 of Arpit et al., 2016) contains a regularization term that depends on the inner products between pairs of weight vectors. If we constrain the weight vectors to have fixed length, then this regularization term minimizes a weighted sum of the cosines of the angles between every pair of weight vectors. As a result, the weight vectors become increasingly incoherent. Hence we achieve both our goals by adding one additional constraint to the DAE: constraining weight vectors to have unit length. Even if we do not apply an explicit constraint, we can expect the weight lengths to be upper bounded from the basic AE objective itself, which would explain the learning of incoherent weights due to the DAE regularization. On a side note, our analysis also justifies the use of tied weights in auto-encoders.

Sparse Coding (SC): SC involves minimizing $\|x - W^T h\|^2$ using the sparsest possible $h$. The analysis after theorem 2 shows that signal recovery using the AE mechanism becomes stronger for sparser signals (as also confirmed experimentally in section 5). In other words, for any given data sample and weight matrix, as long as the conditions on the weight matrix and bias are met, the AE recovery mechanism recovers the sparsest possible signal, which justifies using auto-encoders for recovering sparse codes (see Henaff et al., 2011; Makhzani and Frey, 2013; Ng, 2011, for work along this line).

Independent Component Analysis (ICA) (Hyvärinen and Oja, 2000): ICA assumes we observe data generated by the process $x = W^T h$ where all elements of $h$ are independent and $W$ is a mixing matrix. The task of ICA is to recover both $W$ and $h$ given data. This data generating process is precisely what we assumed in section 3. Based on this assumption, our results show 1) the properties of $W$ that can recover such independent signals $h$; and 2) that auto-encoders can be used for recovering such signals and the weight matrix $W$.

k-Sparse AEs: Makhzani and Frey (2013) propose to zero out all the values of hidden units smaller than the top-k values for each sample during training. This is done to achieve sparsity in the learned hidden representation. This strategy is justified from the perspective of our analysis as well. This is because the PAC bound (theorem 2) derived for signal recovery using the AE signal recovery mechanism shows we recover a noisy version of the true sparse signal. Since the noise in each recovered signal unit is roughly proportional to the original value, de-noising such recovered signals can be achieved by thresholding the hidden unit values (exploiting the fact that the signal is sparse). This can be done either by using a fixed threshold or by picking the top k values.
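The two de-noising options mentioned above (a fixed threshold or keeping the top k values) can be sketched as follows. This is an illustrative implementation with hypothetical function names, not the code of Makhzani and Frey (2013).

```python
import numpy as np

def keep_top_k(H, k):
    """Zero all but the k largest entries in each row of H (N x m)."""
    H = np.asarray(H, dtype=float)
    if k <= 0:
        return np.zeros_like(H)
    k = min(k, H.shape[1])
    # Indices of the k largest entries per row (order inside the top-k is irrelevant)
    idx = np.argpartition(H, -k, axis=1)[:, -k:]
    out = np.zeros_like(H)
    np.put_along_axis(out, idx, np.take_along_axis(H, idx, axis=1), axis=1)
    return out

def fixed_threshold(H, tau):
    """Zero every entry of H that does not exceed the threshold tau."""
    H = np.asarray(H, dtype=float)
    return np.where(H > tau, H, 0.0)

H_noisy = np.array([[0.10, 0.90, 0.30, 0.05],
                    [0.70, 0.02, 0.01, 0.60]])
print(keep_top_k(H_noisy, 2))
print(fixed_threshold(H_noisy, 0.2))
```

Top-k selection fixes the number of active units per sample, while a fixed threshold lets that number vary with the sample; which is preferable depends on whether the sparsity level is known in advance.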

Data Whitening: Theorem 3 shows that data generated from BINS signals and incoherent weight matrices is roughly uncorrelated. Thus recovering such signals using auto-encoders would be easier if we pre-process the data to have uncorrelated dimensions.

## 5 Empirical Verification

We empirically verify the fundamental predictions made in section 4, which serves both to justify the assumptions we have made and to confirm our results. We verify the following: a) the optimality of the rows of a weight matrix $W$ having unit length and being highly incoherent for AE signal recovery; b) the effect of sparsity on AE signal recovery; and c) that, in practice, an AE can recover not only the true sparse signal $h$, but also the dictionary $W$ that was used to generate the data.

### 5.1 Optimal Properties of Weights and Bias

Our analysis on signal recovery in section 4 (eq. 11) shows that the signal recovery bound is strong when a) the data generating weight matrix $W$ has rows of unit length; b) the rows of $W$ are highly incoherent; c) each bias vector element is set to the negative expectation of the pre-activation; d) the signal $h$ has each dimension independent. In order to verify this, we generate signals $h$ from BINS($p$, $f_c$ = uniform, $\mu_h$, $l_{\max}$), with $f_c(\cdot)$ set to the uniform distribution for simplicity. We then generate the corresponding data samples $x = W^T h$ using an incoherent weight matrix $W$ (each element sampled from a zero-mean Gaussian, the columns then orthogonalized, and each row rescaled to unit length; notice the rows cannot be orthogonal). We then recover each signal using,

$$\hat{h}_i \triangleq \text{ReLU}\left(c\, W_i^T (x - \mathbb{E}_h[x]) + \mathbb{E}_{h_i}[h_i] + \Delta b\right) \tag{16}$$

where $c$ and $\Delta b$ are scalars that we vary over a range of values. We also generate signals $h$ from BINS($p$, $\delta_1$, $\mu_h$, $l_{\max}$) with $f_c(\cdot)$ set to the Dirac delta function at 1. We then generate the corresponding data samples following the same procedure as for the continuous signal case. The signal is recovered using

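$$\hat{h}_i \triangleq \sigma\left(c\, W_i^T (x - \mathbb{E}_h[x]) + \Delta b\right) \tag{17}$$

where $\sigma$ is the sigmoid function. For the recovered signals, we calculate the Average Percentage Recovery Error (APRE) as,

$$\text{APRE} = \frac{100}{N m} \sum_{i=1, j=1}^{N, m} w_{h_{ij}}\, \mathbb{1}\left(|\hat{h}_{ij} - h_{ij}| > \epsilon\right) \tag{18}$$

where the tolerance $\epsilon$ is set separately for continuous signals and for the binary case, $\mathbb{1}(\cdot)$ is the indicator operator, $\hat{h}_{ij}$ denotes the $j$-th dimension of the recovered signal corresponding to the $i$-th true signal and,

$$w_{h_{ij}} = \begin{cases} \dfrac{0.5}{p} & \text{if } h_{ij} > 0 \\[2ex] \dfrac{0.5}{1 - p} & \text{if } h_{ij} = 0 \end{cases} \tag{19}$$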

The error is weighted with $w_{h_{ij}}$ so that the recovery errors for zero and non-zero $h_{ij}$ are penalized equally. This is especially needed in this case, because $h$ is sparse and a low error can also be achieved by trivially setting all the recovered $\hat{h}_{ij}$ to zero. Along with the incoherent weight matrix, we also generate data separately using a highly coherent weight matrix that we get by sampling each element randomly from a uniform distribution and scaling each row to unit length. According to our analysis, we should get the least error for $c = 1$ and $\Delta b = 0$ for the incoherent matrix, while the coherent matrix should yield both higher recovery error and a different (and unpredictable) optimal choice of $c$ and $\Delta b$. The error heat maps for both continuous and binary recovery are shown in fig. 2 (we use 0.55 as the threshold to binarize the signal recovered using the sigmoid function). For the incoherent weight matrix, we see that the empirical optimum is precisely $c = 1$ and $\Delta b = 0$ (exactly as predicted), with 0.21 and 0.0 APRE for continuous and binary recovery, respectively. It is interesting to note that the binary recovery is quite robust to the choice of $c$ and $\Delta b$, which is because 1) the recovery is denoised through thresholding, and 2) the binary signal inherently contains less information and thus is easier to recover. For the coherent weight matrix, we get 45.75 and 32.63 APRE instead (see fig. 5).
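The metric of eqs. 18 and 19 can be written in a few lines. The sketch below assumes a single scalar activation probability `p` in $(0, 1)$ shared by all hidden units, as in eq. 19; the function name `apre` and the toy input are hypothetical.

```python
import numpy as np

def apre(H_hat, H, p, eps):
    """Average Percentage Recovery Error (eq. 18) with the class weights of eq. 19."""
    H_hat = np.asarray(H_hat, dtype=float)
    H = np.asarray(H, dtype=float)
    N, m = H.shape
    # Non-zero targets get weight 0.5 / p, zero targets get weight 0.5 / (1 - p)
    weights = np.where(H > 0, 0.5 / p, 0.5 / (1.0 - p))
    wrong = np.abs(H_hat - H) > eps
    return 100.0 / (N * m) * np.sum(weights * wrong)

# Toy check: one signal with 2 of 10 units active, so p = 0.2
H = np.zeros((1, 10))
H[0, :2] = 1.0
print("perfect recovery :", apre(H, H, p=0.2, eps=0.0))
print("all-zero estimate:", apre(np.zeros_like(H), H, p=0.2, eps=0.0))
```

In this toy case the trivial all-zero estimate scores 50 rather than 20, which is the effect the weighting is designed to have: missing every active unit costs half of the total error budget regardless of how sparse the signal is.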

We also experiment on the noisy recovery case, where we generate the data using an incoherent weight matrix with $c = 1$ and $\Delta b = 0$. For each data dimension we add independent Gaussian noise with mean 100 and a varying standard deviation. Both signal recovery schemes are quite robust against noise (see fig. 2). In particular, the binary signal recovery is very robust, which conforms with our previous observation.

### 5.2 Effect of Sparsity on Signal Recovery

We analyze the effect of sparsity of signals on their recovery using the mechanism shown in section 4. In order to do so, we generate incoherent matrices using two different methods: Gaussian and orthogonal (Saxe et al., 2013). (Gaussian and Xavier (Glorot and Bengio, 2010) initializations become identical after weight length normalization.) In addition, all the generated weight matrices are normalized to have unit row length. We then sample signals and generate data using the same configurations as mentioned in section 5.1; only this time, we fix $c = 1$ and $\Delta b = 0$, vary the hidden unit activation probability $p$, and duplicate the generated data while adding noise to the copy, which we sample from a Gaussian distribution. According to our analysis, the noise mean should have no effect on recovery; only the standard deviation affects recovery. We find that for all weight matrices, recovery error reduces with increasing sparsity (decreasing $p$, see fig. 4). Additionally, we find that both recovery schemes are robust against noise. We also find that the recovery error is almost always lower for orthogonal weight matrices, especially when the signal is sparse. (Notice that the rows of $W$ are not orthogonal for overcomplete filters; rather, the columns are orthogonalized, unless $W$ is undercomplete.) Recall that theorem 2 suggests stronger recovery for more incoherent matrices. So we look into the row coherence of $W$ sampled from the Gaussian and orthogonal methods, with one dimension fixed and the other varied. We found that orthogonally initialized matrices have significantly lower coherence even though the orthogonalization is done column-wise (see fig. 6). This explains the significantly lower recovery error for orthogonal matrices in fig. 4.
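The coherence comparison described above can be reproduced in outline as follows. This is a sketch under assumptions: the orthogonal method is approximated here by a QR decomposition of a Gaussian matrix (orthonormal columns) followed by row normalization, the sizes are hypothetical, and the helper names are not from the paper.

```python
import numpy as np

def unit_rows(W):
    """Rescale every row of W to unit Euclidean length."""
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def coherence(W):
    """Largest absolute cosine between two distinct rows of W."""
    U = unit_rows(np.asarray(W, dtype=float))
    G = U @ U.T
    np.fill_diagonal(G, 0.0)          # ignore each row's similarity with itself
    return np.max(np.abs(G))

def gaussian_matrix(m, n, rng):
    return unit_rows(rng.standard_normal((m, n)))

def column_orthogonal_matrix(m, n, rng):
    """Gaussian matrix with orthonormalised columns (needs m >= n), then unit rows."""
    Q, _ = np.linalg.qr(rng.standard_normal((m, n)))   # Q is m x n, Q^T Q = I
    return unit_rows(Q)

rng = np.random.default_rng(0)
m, n = 200, 180                        # hypothetical overcomplete sizes
print("Gaussian coherence         :", coherence(gaussian_matrix(m, n, rng)))
print("column-orthogonal coherence:", coherence(column_orthogonal_matrix(m, n, rng)))
```

Note that this measures the maximum pairwise cosine, matching the definition of coherence used in section 2; an average over pairs would be a different (softer) statistic.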

### 5.3 Recovery of Data Dictionary

We showed the conditions on $W$ and $b$ for good recovery of the sparse signal $h$. In practice, however, one does not in general have access to $W$. Therefore, in this section, we empirically demonstrate that an AE can indeed recover both $W$ and $h$ through optimizing the AE objective. We generate signals $h$ with the same BINS distribution as in section 5.1. The data are then generated as $x = W^T h$ using an incoherent weight matrix $W$ (same as in section 5.1). We then recover the data dictionary by:

$$\hat{W} = \arg\min_{W} \mathbb{E}_x\left[\|x - W^T s_e(W(x - \mathbb{E}_x[x]))\|^2\right], \quad \text{where } \|W_i\|_2^2 = 1 \;\; \forall i \tag{20}$$

Notice that although, given the sparse signal $h$, the data dictionary $W$ is unique (Hillar and Sommer, 2015), there are $m!$ equivalent solutions for $\hat{W}$, since we can permute the dimensions of $h$ in the AE. To check if the original data dictionary is recovered, we therefore pair up the rows of $W$ and $\hat{W}$ by greedily selecting the pairs that result in the highest dot product value. We then measure the goodness of the recovery by looking at the values of all the paired dot products. In addition, since we know the pairing, we can calculate APRE to evaluate the quality of the recovered hidden signal. As can be observed from fig. 4, by optimizing the AE objective we can recover the original data dictionary (almost all of the cosine similarities are 1). The final APRE achieved for continuous and binary signal recovery is slightly worse than what we achieved in section 5.1. However, one should note that for this set of experiments we only observed data, and no other information regarding $W$ is exposed. Not surprisingly, we again observed that the binary signal recovery is more robust than its continuous counterpart, which may be attributed to its lower information content. We also ran experiments on noisy data and achieved similar performance as in section 5.1 when the noise is less significant (see supplementary materials for more details). These results strongly suggest that AEs are capable of recovering the true hidden signal in practice.
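The greedy pairing step described above can be sketched as follows. The exact procedure used in the experiments is not spelled out beyond "greedily selecting the pairs that result in the highest dot product", so this is one plausible reading, with a hypothetical function name.

```python
import numpy as np

def greedy_pair(W, W_hat):
    """Greedily match rows of W to rows of W_hat by largest dot product.

    Returns a list of (i, j, dot) where row i of W is paired with row j of W_hat
    and every row index is used at most once.
    """
    S = np.asarray(W, dtype=float) @ np.asarray(W_hat, dtype=float).T
    pairs = []
    for _ in range(min(S.shape)):
        i, j = np.unravel_index(np.argmax(S), S.shape)
        pairs.append((int(i), int(j), float(S[i, j])))
        S[i, :] = -np.inf      # row i of W is now taken
        S[:, j] = -np.inf      # row j of W_hat is now taken
    return pairs

# Toy check: W_hat is a row permutation of W, so every paired dot product should be 1
W = np.eye(3)
perm = [2, 0, 1]
W_hat = W[perm]                # row j of W_hat equals row perm[j] of W
for i, j, dot in greedy_pair(W, W_hat):
    print(f"W row {i} <-> W_hat row {j}, dot product {dot:.2f}")
```

Greedy matching is not guaranteed to find the pairing with the largest total similarity; when the recovered rows are close to the true ones, as reported above, the two coincide in practice.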

## 6 Conclusion

In this paper we looked at the sparse signal recovery problem from the Auto-Encoder perspective and provided novel insights into conditions under which AEs can recover such signals. In particular, 1) from the signal recovery standpoint, if we assume that the observed data is generated from some sparse hidden signals according to the assumed data generating process, then the true hidden representation can be approximately recovered if a) the weight matrices are highly incoherent with unit $\ell_2$ row length, and b) the bias vectors are as described in equation 11 (theorem 2); for binary recovery, the bias is as described in equation 6. The recovery also becomes more and more accurate with increasing sparsity in hidden signals. 2) From the data generation perspective, we found that data generated from such signals (assumption 1) have the property of being roughly uncorrelated (theorem 3), and thus pre-processing the data to have uncorrelated dimensions may encourage stronger signal recovery. 3) Given only measurement data, we empirically show that the AE reconstruction objective recovers the data generating dictionary $W$, and hence the true signal $h$. 4) These conditions and observations allow us to view various existing techniques, such as data whitening and independent component analysis, in a more coherent picture when considering signal recovery.

## References

• Amaldi and Kann (1998) Edoardo Amaldi and Viggo Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1):237–260, 1998.
• Arpit et al. (2016) Devansh Arpit, Yingbo Zhou, Hung Ngo, and Venu Govindaraju. Why regularized auto-encoders learn sparse representation? In ICML, 2016.
• Bourlard and Kamp (1988) H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988. ISSN 0340-1200.
• Candes and Tao (2006) Emmanuel J Candes and Terence Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? Information Theory, IEEE Transactions on, 52(12):5406–5425, 2006.
• Candes et al. (2006) Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on pure and applied mathematics, 59(8):1207–1223, 2006.
• Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics, pages 249–256, 2010.
• Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, pages 315–323, 2011.
• Henaff et al. (2011) Mikael Henaff, Kevin Jarrett, Koray Kavukcuoglu, and Yann LeCun. Unsupervised learning of sparse features for scalable audio classification. ISMIR, 11:445, 2011.
• Hillar and Sommer (2015) Christopher J Hillar and Friedrich T Sommer. When can dictionary learning uniquely recover sparse data from subsamples? IEEE Transactions on Information Theory, 61(11):6290–6297, 2015.
• Hinton (1984) Geoffrey E Hinton. Distributed representations. 1984.
• Hyvärinen and Oja (2000) Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4):411–430, 2000.
• Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Proceedings, pages 448–456. JMLR.org, 2015.
• Kavukcuoglu et al. (2010) Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467, 2010.
• Makhzani and Frey (2013) Alireza Makhzani and Brendan Frey. k-sparse autoencoders. CoRR, abs/1312.5663, 2013. URL http://arxiv.org/abs/1312.5663.
• Memisevic et al. (2014) Roland Memisevic, Kishore Reddy Konda, and David Krueger. Zero-bias autoencoders and the benefits of co-adapting features. In ICLR, 2014.
• Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
• Ng (2011) Andrew Ng. Sparse autoencoder. CSE294 Lecture notes, 2011.
• Saxe et al. (2013) Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
• Wright et al. (2009) J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE TPAMI, 31(2):210–227, Feb. 2009.
• Yang et al. (2009) Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801, 2009.

## 1 Proofs

###### Remark 1.

Let where , and . Let where is a fixed vector. Let and . Then iff .

Proof: Let . Thus . On the other hand, . The other direction can be proved similarly.

###### Theorem 1.

Let each element of follow BINS() and let be an auto-encoder signal recovery mechanism with Sigmoid activation function and bias for a measurement vector such that . If we set , then ,

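$$\Pr\left(\frac{1}{m}\lVert \hat{h} - h \rVert_1 \le \delta\right) \ge 1 - \sum_{i=1}^{m}\left((1-p_i)\, e^{-2\frac{(\delta' + p_i a_{ii})^2}{\sum_{j=1, j\ne i}^{m} a_{ij}^2}} + p_i\, e^{-2\frac{(\delta' + (1-p_i) a_{ii})^2}{\sum_{j=1, j\ne i}^{m} a_{ij}^2}}\right) \tag{21}$$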

where , and is the row of the matrix cast as a column vector.

###### Proof.

Notice that,

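$$\Pr\left(|\hat{h}_i - h_i| \ge \delta\right) = \Pr\left(|\hat{h}_i - h_i| \ge \delta \mid h_i = 0\right)\Pr(h_i = 0) + \Pr\left(|\hat{h}_i - h_i| \ge \delta \mid h_i = 1\right)\Pr(h_i = 1) \tag{22}$$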

and from definition 1,

$$\hat{h}_i = \sigma\Big(\sum_j a_{ij} h_j + b_i\Big) \tag{23}$$

Thus,

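$$\begin{aligned}\Pr\left(|\hat{h}_i - h_i| \ge \delta\right) ={}& (1-p_i)\Pr\Big(\sigma\Big(\sum_j a_{ij} h_j + b_i\Big) \ge \delta \,\Big|\, h_i = 0\Big) \\ &+ p_i \Pr\Big(\sigma\Big(-\sum_j a_{ij} h_j - b_i\Big) \ge \delta \,\Big|\, h_i = 1\Big)\end{aligned} \tag{24}$$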
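The step from equations 22 and 23 to equation 24 can be spelled out case by case. The following is a short sketch; the shorthand $u_i$ is introduced here only for illustration and is not notation from the text. It uses that $h_i \in \{0, 1\}$ with $\Pr(h_i = 1) = p_i$ (as read off equation 24), that the Sigmoid output lies in $(0, 1)$, and the identity $1 - \sigma(u) = \sigma(-u)$.

```latex
% u_i is a hypothetical shorthand for the pre-activation in equation 23
u_i = \sum_j a_{ij} h_j + b_i, \qquad \hat{h}_i = \sigma(u_i) \in (0, 1)

% Case h_i = 0: the error is the Sigmoid output itself
|\hat{h}_i - h_i| = \sigma(u_i)

% Case h_i = 1: the error is the complement of the Sigmoid output
|\hat{h}_i - h_i| = 1 - \sigma(u_i)
                  = \frac{e^{-u_i}}{1 + e^{-u_i}}
                  = \frac{1}{1 + e^{u_i}}
                  = \sigma(-u_i)
```

Substituting these two cases into equation 22, together with $\Pr(h_i = 0) = 1 - p_i$ and $\Pr(h_i = 1) = p_i$, gives equation 24.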

Notice that . Let and . Then, setting , using Chernoff’s inequality, for any ,

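$$\begin{aligned}\Pr\left(z_i \ge \delta' \mid h_i = 0\right) &\le \frac{\mathbb{E}_h\left[e^{t z_i}\right]}{e^{t\delta'}} = \frac{\mathbb{E}_h\left[e^{t\sum_{j\ne i} a_{ij}(h_j - p_j) - t p_i a_{ii}}\right]}{e^{t\delta'}} \\ &= \frac{\mathbb{E}_h\left[\prod_{j\ne i} e^{t a_{ij}(h_j - p_j)}\right]}{e^{t(\delta' + p_i a_{ii})}} = \frac{\prod_{j\ne i}\mathbb{E}_{h_j}\left[e^{t a_{ij}(h_j - p_j)}\right]}{e^{t(\delta' + p_i a_{ii})}}\end{aligned} \tag{25}$$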

Let $T_j = \mathbb{E}_{h_j}\left[e^{t a_{ij}(h_j - p_j)}\right]$. Then,

$$T_j = (1-p_j)\, e^{-t p_j a_{ij}} + p_j\, e^{t(1-p_j) a_{ij}} = e^{-t p_j a_{ij}}\left(1 - p_j + p_j e^{t a_{ij}}\right) \tag{26}$$

Let $g(t) = \ln T_j$, thus,

$$g(t) = -t p_j a_{ij} + \ln\left(1 - p_j + p_j e^{t a_{ij}}\right) \implies g(0) = 0 \tag{27}$$
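As a numerical sanity check of equations 26 and 27, the sketch below evaluates the moment generating function of $a_{ij}(h_j - p_j)$ for a Bernoulli $h_j$ with $\Pr(h_j = 1) = p_j$ in both forms given in equation 26, and confirms that $g(0) = 0$. The function names are illustrative only. The comparison against $e^{t^2 a_{ij}^2 / 8}$ is Hoeffding's lemma, a standard result that is consistent with the exponent in equation 21; the portion of the proof shown here does not state that it is the tool used, so treat that link as an assumption.

```python
import numpy as np

def mgf_direct(t, a, p):
    # E[exp(t * a * (h - p))] for h ~ Bernoulli(p): sum over h = 0 and h = 1.
    return (1 - p) * np.exp(-t * p * a) + p * np.exp(t * (1 - p) * a)

def mgf_factored(t, a, p):
    # Right-hand side of equation 26 after factoring out exp(-t * p * a).
    return np.exp(-t * p * a) * (1 - p + p * np.exp(t * a))

def g(t, a, p):
    # Log of the moment generating function, as in equation 27.
    return -t * p * a + np.log(1 - p + p * np.exp(t * a))

def hoeffding_bound(t, a):
    # Hoeffding's lemma for a zero-mean variable on an interval of length |a|.
    return np.exp(t**2 * a**2 / 8)

ts = np.linspace(0.0, 5.0, 51)
for a in (-1.0, 0.3, 1.0):
    for p in (0.05, 0.2, 0.5):
        direct = mgf_direct(ts, a, p)
        gap = np.max(np.abs(direct - mgf_factored(ts, a, p)))
        within = bool(np.all(direct <= hoeffding_bound(ts, a) * (1 + 1e-12)))
        print(f"a={a:+.1f} p={p:.2f} max|direct-factored|={gap:.2e} "
              f"g(0)={g(0.0, a, p):.1e} within Hoeffding bound: {within}")
```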