Representation Learning and Recovery in the ReLU Model

03/12/2018 ∙ by Arya Mazumdar, et al. ∙ 0

Rectified linear units, or ReLUs, have become the preferred activation function for artificial neural networks. In this paper we consider two basic learning problems assuming that the underlying data follow a generative model based on a ReLU-network, i.e., a neural network with ReLU activations. As a primarily theoretical study, we limit ourselves to a single-layer network. The first problem we study corresponds to dictionary learning in the presence of nonlinearity (modeled by the ReLU functions). Given a set of observation vectors y^i ∈ R^d, i = 1, 2, ..., n, we aim to recover the d×k matrix A and the latent vectors {c^i} ⊂ R^k under the model y^i = ReLU(Ac^i + b), where b ∈ R^d is a random bias. We show that it is possible to recover the column space of A within an error of O(d) (in Frobenius norm) under certain conditions on the probability distribution of b. The second problem we consider is that of robust recovery of the signal in the presence of outliers, i.e., large but sparse noise. In this setting we are interested in recovering the latent vector c from its noisy nonlinear sketches of the form v = ReLU(Ac) + e + w, where e ∈ R^d denotes the outliers with sparsity s and w ∈ R^d denotes the dense but small noise. This problem has recently been studied (Soltanolkotabi, 2017) without the presence of outliers. For this problem, we show that a generalized LASSO algorithm is able to recover the signal c ∈ R^k within an ℓ_2 error of O(√((k+s) log d/d)) when A is a random Gaussian matrix.




1 Introduction

The Rectified Linear Unit (ReLU) is a basic nonlinear function defined as $\mathrm{ReLU}(x) := \max(0, x)$. For any matrix $X$, $\mathrm{ReLU}(X)$ denotes the matrix obtained by applying the ReLU function to each of the coordinates of the matrix $X$. ReLUs are building blocks of many nonlinear data-fitting problems based on deep neural networks (see, e.g., Soltanolkotabi (2017) for a good exposition).

Let $\mathcal{Y} \subset \mathbb{R}^d$ be a collection of message vectors that are of interest to us. Depending on the application at hand, the message vectors, i.e., the constituents of $\mathcal{Y}$, may range from images, speech signals, and network access patterns to user-item rating vectors and so on. We assume that the message vectors satisfy a generative model, where each message vector can be approximated by a map $g: \mathbb{R}^k \to \mathbb{R}^d$ from the latent space to the ambient space, i.e., for each $y \in \mathcal{Y}$,

$$y \approx g(c) \quad \text{for some } c \in \mathbb{R}^k. \tag{1}$$

Motivated by the recent results on developing generative models for various real-life signals (see, e.g., Goodfellow et al. (2014); Kingma and Welling (2014); Bora et al. (2017)), the non-linear maps that take the following form warrant special attention.

$$g(c) = h\big(A_l\, h(A_{l-1} \cdots h(A_1 c) \cdots)\big), \tag{2}$$

i.e., $g$ is the function corresponding to an $l$-layer neural network with the activation function $h$. Here, for $i \in [l]$, $A_i \in \mathbb{R}^{d_i \times d_{i-1}}$, with $d_0 = k$ and $d_l = d$, denotes the weight matrix for the $i$-th layer of the network. In the special case where the activation function $h$ is the ReLU function, the message vectors of interest satisfy the following.

$$y \approx \mathrm{ReLU}\big(A_l\, \mathrm{ReLU}(A_{l-1} \cdots \mathrm{ReLU}(A_1 c + b_1) \cdots + b_{l-1}) + b_l\big), \tag{3}$$

where, for $i \in [l]$, $b_i \in \mathbb{R}^{d_i}$ denotes the biases of the neurons (or output units) at the $i$-th layer of the network.
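As a minimal sketch of the generative map in (3), the following NumPy snippet composes ReLU layers. The helper names `relu` and `generate`, the layer sizes, and the Gaussian weights and biases are illustrative assumptions and are not prescribed by the model.

```python
import numpy as np

def relu(x):
    """Apply max(x, 0) to every coordinate."""
    return np.maximum(x, 0.0)

def generate(c, weights, biases):
    """Map a latent vector c through the layers h -> ReLU(A h + b)."""
    h = c
    for A, b in zip(weights, biases):
        h = relu(A @ h + b)
    return h

rng = np.random.default_rng(0)
dims = [3, 5, 8]  # assumed sizes: d_0 = k = 3, d_1 = 5, d_2 = d = 8
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]

c = rng.standard_normal(dims[0])   # latent vector in R^k
y = generate(c, weights, biases)   # message vector in R^d

# Tiny hand-checkable case: identity weights and zero bias reduce to ReLU itself.
tiny = generate(np.array([1.0, -1.0]), [np.eye(2)], [np.zeros(2)])
```

With a single entry in `weights` and `biases`, the same function gives the single-layer model studied in the rest of the paper.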

The specific generative model in (3) raises multiple interesting questions that play a fundamental role in understanding the underlying data and in designing systems and algorithms for processing the data. The two most basic such questions are as follows:

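  1. Learning the representation: Given the observations $\{y^1, y^2, \ldots, y^n\} \subset \mathbb{R}^d$ from the model (cf. (3)), recover the parameters of the model, i.e., $\{\hat{A}_t\}_{t \in [l]}$, $\{\hat{b}_t\}_{t \in [l]}$, and $\{c^1, c^2, \ldots, c^n\} \subset \mathbb{R}^k$ such that

    $$y^i \approx \mathrm{ReLU}\big(\hat{A}_l\, \mathrm{ReLU}(\cdots \mathrm{ReLU}(\hat{A}_1 c^i + \hat{b}_1) \cdots) + \hat{b}_l\big) \quad \forall\, i \in [n]. \tag{4}$$

    Note that this question is different from training the model, in which case the set $\{c^1, c^2, \ldots, c^n\}$ is known (and possibly chosen accordingly).

  2. Recovery of the signal in the presence of errors: Given the erroneous (noisy) version of a vector generated by the model (cf. (3)), denoise the observation or recover the latent vector. Formally, given

    $$v = y + e + w = \mathrm{ReLU}\big(A_l\, \mathrm{ReLU}(\cdots \mathrm{ReLU}(A_1 c + b_1) \cdots) + b_l\big) + e + w \tag{5}$$

    and the knowledge of the model parameters, obtain $\hat{y} \in \mathbb{R}^d$ or $\hat{c} \in \mathbb{R}^k$ such that $\|y - \hat{y}\|$ or $\|c - \hat{c}\|$ is small, respectively. In (5), $e$ and $w$ correspond to outliers, i.e., large but sparse errors, and (dense but small) noise, respectively.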
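To make the observation model in (5) concrete for a single layer, the sketch below builds a corrupted vector from a clean ReLU output, a sparse outlier vector, and small dense noise. The dimensions, the outlier magnitude of 10, and the uniform noise range are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, s = 200, 5, 10               # assumed ambient dim, latent dim, number of outliers

A = rng.standard_normal((d, k))    # weight matrix
b = rng.standard_normal(d)         # bias vector
c = rng.standard_normal(k)         # latent vector
y = np.maximum(A @ c + b, 0.0)     # clean single-layer ReLU output

# Outliers: large values on a random support of size s, zero elsewhere.
e = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
e[support] = 10.0 * rng.choice([-1.0, 1.0], size=s)

# Dense but small noise, bounded in every coordinate.
w = rng.uniform(-0.01, 0.01, size=d)

v = y + e + w                      # corrupted observation
```

The recovery question asks for an estimate of `c` (or of `y`) from `v` and the model parameters alone.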

Apart from being closely related, one of our main motivations behind studying these two problems together comes from the recent work on associative memory Karbasi et al. (2014); Mazumdar and Rawat (2015, 2017). An associative memory consists of a learning phase, where a generative model is learned from a given dataset; and a recovery phase, where given a noisy version of a data point generated by the generative model, the noise-free version is recovered with the help of the knowledge of the generative model.

There has been a recent surge of interest in learning ReLUs, and the above two questions are of basic interest even for a single-layer network (i.e., a nonlinearity comprising a single ReLU function). It is conceivable that understanding the behavior of a single-layer network would allow one to use some 'iterative peeling off' technique to develop a theory for multiple layers. In Goel et al. (2017), the problem of recovering the ReLU-model under the Reliable Agnostic learning model of Kalai et al. (2012) is considered. Informally speaking, under very general distributional assumptions (the rows of $A$ are sampled from some distribution), given $A$ and $y = \mathrm{ReLU}(Ac)$, Goel et al. (2017) propose an algorithm that recovers a hypothesis which has an error rate (under some natural loss function defined therein) of $\epsilon$ with respect to the true underlying ReLU-model. Moreover, the algorithm runs in time polynomial in $d$ and exponential in $1/\epsilon$. As opposed to this, given $A$ and the corresponding output of the ReLU-network $y = \mathrm{ReLU}(Ac + b)$, we focus on the problem of recovering $c$ itself. Here, we note that the model considered in Goel et al. (2017) does not consider the presence of outliers.

Soltanolkotabi (2017) obtained further results on this model under somewhat different learning guarantees. Assuming the entries of the matrix $A$ to be i.i.d. Gaussian, Soltanolkotabi (2017) shows that with high probability a gradient descent algorithm recovers $c$ within some precision in terms of $\ell_2$-loss: the relative error decays exponentially with the number of steps in the gradient descent algorithm. The obtained result is more general as it extends to constrained optimization in the presence of some regularizers (for example, $c$ can be restricted to be a sparse vector, etc.).

However, neither of these works considers the presence of outliers (sparse but large noise) in the observation. Sparse noise is quite natural to assume, as many times only partial observations of a signal vector are obtained. The ReLU model with outliers as considered in this paper can be thought of as a nonlinear version of the problem of recovering $c$ from linear observations of the form $v = Ac + e$, with $e$ denoting the outliers. This problem with linear observations was studied in the celebrated work of Candès and Tao (2005). We note that the technique of Candès and Tao (2005) does not extend to the case when there is a dense (but bounded) noise component present. Our result in this case is a natural generalization of, and complementary to, the one in Soltanolkotabi (2017) in that 1) we present a recovery method which is robust to outliers and 2) instead of analyzing gradient descent we directly analyze the performance of the minimizer of our optimization program (a generalized LASSO) using the ideas from Plan and Vershynin (2016); Nguyen and Tran (2013).

On the other hand, to the best of our knowledge, the representation learning problem for single-layer networks has not been studied as such. The representation learning problem for single-layer ReLUs bears some similarity to matrix completion problems, a fact we greatly exploit later. In low-rank matrix completion, a matrix is visible only partially, and the task is to recover the unknown entries by exploiting the fact that it is low rank. In the case of (4), we are more likely to observe the positive entries of the matrix $AC$, which, unlike in the majority of the matrix completion literature, creates a dependence between the matrix and the sampling procedure.

Main result for representation learning. We assume to have observed the $d \times n$ matrix $Y = \mathrm{ReLU}(AC + b \otimes \mathbf{1}^T)$, where $A$ is a $d \times k$ matrix, $C$ is a $k \times n$ matrix, both unknown, $b \in \mathbb{R}^d$ is a random i.i.d. bias, and $\otimes$ denotes the Kronecker product (this ensures that the bias is random, but does not change over different observations of the data samples). We show that a relaxed maximum-likelihood method guarantees, with high probability, the recovery of the matrix $AC$ with bounded error in Frobenius norm (see Theorem 3 for the formal statement). Then, leveraging a known result for recovering the column space of a perturbed matrix (see Theorem 5 in the appendix), we show that it is also possible to recover the column space of $A$ with a similar guarantee.

The main technique that we use to obtain this result is inspired by the recent work on matrix completion by Davenport et al. (2014). One of the main challenges that we face in recovery here is that while an entry of the matrix $AC + b \otimes \mathbf{1}^T$ is a random variable (since $b$ is a random bias), whether that entry is observed or cut off by the ReLU function (for being negative) depends on the value of the entry itself. In the general matrix completion literature, the entries of the matrix being observed are sampled i.i.d. (see, for example, Candès and Recht (2009); Keshavan et al. (2010); Chatterjee (2015) and references therein). For the aforementioned reason we cannot use most of these results off the shelf. However, a similar predicament is (partially) present in Davenport et al. (2014), where entries are quantized while being observed.

Similar to Davenport et al. (2014), the tools that prove helpful in this situation are the symmetrization trick and the contraction inequality Ledoux and Talagrand (2013). However, there is a crucial difference between our model and that of Davenport et al. (2014): in our case the bias vector, while random, does not change over observations. This translates to less freedom during the transformation of the original matrix to the observed matrix, leading to dependence among the elements in a row. Furthermore, the analysis becomes notably different since the positive observations are not quantized.

Main result for noisy recovery. We plan to recover $c \in \mathbb{R}^k$ from observations $v = \mathrm{ReLU}(Ac + b) + e + w$, where $A$ is a standard i.i.d. Gaussian matrix, $e \in \mathbb{R}^d$ is the vector containing outliers (sparse noise) with $\|e\|_0 \le s$, and $w \in \mathbb{R}^d$ is bounded dense noise. To recover $c$ (and $e$) we employ the LASSO algorithm, which is inspired by the work of Plan and Vershynin (2016) and Nguyen and Tran (2013). In particular, Plan and Vershynin (2016) recently showed that a signal can be provably recovered (up to a constant multiple) from its nonlinear Gaussian measurements via the LASSO algorithm by treating the measurements as linear observations. In the context of the ReLU model, for outlier-free measurements $v = \mathrm{ReLU}(Ac + b) + w$, it follows from Plan and Vershynin (2016) that the LASSO algorithm outputs $\mu c$ as the solution with $\mu = \mathbb{E}[g \cdot \mathrm{ReLU}(g + b)]$, where $g$ is a Gaussian random variable and $b$ is a random variable denoting the bias associated with the ReLU function. We show that this approach guarantees, with high probability, recovery of $c$ within a small $\ell_2$ error even when the measurements are corrupted by outliers $e$. This is achieved by jointly minimizing the square loss over $(c, e)$ after treating our measurements as linear measurements and adding an $\ell_1$ regularizer to the loss function to promote the sparsity of the solution for $e$ (we also recover $e$; see Theorem 4 for a formal description).
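The scaling constant that appears when nonlinear Gaussian measurements are treated as linear ones can be estimated numerically. The sketch below assumes the Plan and Vershynin form $\mu = \mathbb{E}[g \cdot \mathrm{ReLU}(g + b)]$ with $g$ standard Gaussian and $b$ independent of $g$; the function name `estimate_mu` and the sample size are illustrative choices. For zero bias the closed form is $\mathbb{E}[g^2 \mathbf{1}\{g > 0\}] = 1/2$.

```python
import numpy as np

def estimate_mu(bias_sampler, num_samples=1_000_000, seed=0):
    """Monte Carlo estimate of E[g * ReLU(g + b)] for g ~ N(0, 1).

    bias_sampler(rng, n) must return n samples of the bias variable b,
    drawn independently of g (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(num_samples)
    b = bias_sampler(rng, num_samples)
    return float(np.mean(g * np.maximum(g + b, 0.0)))

# Zero bias: the estimate should be close to 1/2.
mu_no_bias = estimate_mu(lambda rng, n: np.zeros(n))

# A constant positive bias keeps more of the signal active, so mu grows towards 1.
mu_positive_bias = estimate_mu(lambda rng, n: np.full(n, 1.0))
```

Since the linear fit returns a multiple of the latent vector, an estimate of this constant is what allows the scale of $c$ to be undone.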

Organization. The paper is organized as follows. In Section 2, we describe some of the notation used throughout the paper and introduce some technical tools that are useful in proving our main results. In the same section (Subsection 2.3), we provide the formal models of the problems we are studying. In Section 3, we provide detailed proofs for our main results on the representation learning problem (see Theorem 3). Section 4 contains the proofs and the techniques used for the recovery problem in the presence of outliers (see Theorem 4).

2 Notations and Technical Tools

2.1 Notation

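For any positive integer $n$, define $[n] = \{1, 2, \ldots, n\}$. Given a matrix $M \in \mathbb{R}^{d \times n}$, for $(i, j) \in [d] \times [n]$, $M_{i,j}$ denotes the $(i,j)$-th entry of $M$. For $i \in [d]$, $m_i$ denotes the vector containing the elements of the $i$-th row of the matrix $M$. Similarly, for $j \in [n]$, $m^j$ denotes the $j$-th column of the matrix $M$. Recall that the function $\mathrm{ReLU}$ takes the following form.

$$\mathrm{ReLU}(x) = \max(x, 0). \tag{6}$$

For a matrix $X$, we use $\mathrm{ReLU}(X)$ to denote the matrix obtained by applying the ReLU function to each of the entries of the matrix $X$. For two matrices $X$ and $Z$, we use $X \otimes Z$ to represent the Kronecker product of $X$ and $Z$.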

Given a matrix $M$, $\|M\|_F$ denotes its Frobenius norm. Also, let $\|M\|$ denote the operator norm of $M$, i.e., the maximum singular value of $M$. We let $\|M\|_*$ denote the nuclear norm of $M$. Similar to Davenport et al. (2014), we define a flatness parameter associated with a function $f$:


quantifies how flat can be in the interval . We also define a Lipschitz parameter for a function as follows:


2.2 Techniques to bound the supremum of an empirical process

In the course of this paper, namely in the representation learning part, we use the key tools of symmetrization and contraction to bound the supremum of an empirical process following the lead of Davenport et al. (2014) and the analysis of generalization bounds in the statistical learning literature. In particular, we need the following two statements.

Theorem 1 (Symmetrization of expectation).

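Let $X_1, X_2, \ldots, X_n$ be independent RVs taking values in $\mathcal{X}$ and $\mathcal{F}$ be a class of $\mathbb{R}$-valued functions on $\mathcal{X}$. Furthermore, let $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ be independent Rademacher RVs. Then, for any $t \ge 1$,

$$\mathbb{E}\Big[\sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^{n} \big(f(X_i) - \mathbb{E} f(X_i)\big)\Big|^t\Big] \le 2^t\, \mathbb{E}\Big[\sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^{n} \varepsilon_i f(X_i)\Big|^t\Big].$$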

Theorem 2 (Contraction inequality Ledoux and Talagrand (2013)).

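Let $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ be independent Rademacher RVs and $\Phi: \mathbb{R}_+ \to \mathbb{R}_+$ be a convex and increasing function. Let $\varphi_i: \mathbb{R} \to \mathbb{R}$ be $L$-Lipschitz functions, i.e.,

$$|\varphi_i(a) - \varphi_i(b)| \le L\,|a - b| \quad \text{for all } a, b \in \mathbb{R},$$

which satisfy $\varphi_i(0) = 0$. Then, for any $T \subseteq \mathbb{R}^n$,

$$\mathbb{E}\,\Phi\Big(\frac{1}{2} \sup_{t \in T} \Big|\sum_{i=1}^{n} \varepsilon_i \varphi_i(t_i)\Big|\Big) \le \mathbb{E}\,\Phi\Big(L \cdot \sup_{t \in T} \Big|\sum_{i=1}^{n} \varepsilon_i t_i\Big|\Big).$$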


2.3 System Model

We focus on the problems of learning the representation and recovery of the signal in the presence of errors when the signal is assumed to be generated using a single-layer ReLU-network. The models for learning representations and for recovery are described below.

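Model for learning representations. We assume that a signal vector of interest $y \in \mathbb{R}^d$ satisfies

$$y = \mathrm{ReLU}(Ac + b),$$

where $A \in \mathbb{R}^{d \times k}$ and $b \in \mathbb{R}^d$ correspond to the weight (generator) matrix and the bias vector, respectively.

As for the problem of representation learning, we are given $n$ message vectors that are generated from the underlying single-layer model. For $i \in [n]$, the $i$-th signal vector is defined as follows.

$$y^i = \mathrm{ReLU}(Ac^i + b) \in \mathbb{R}^d.$$

We define the $d \times n$ observation matrix

$$Y = \big[\, y^1 \;\; y^2 \;\; \cdots \;\; y^n \,\big].$$

Similarly, we define the $k \times n$ coefficient matrix

$$C = \big[\, c^1 \;\; c^2 \;\; \cdots \;\; c^n \,\big].$$

With this notation, we can concisely represent the $n$ observation vectors as

$$Y = \mathrm{ReLU}(AC + b \otimes \mathbf{1}^T) = \mathrm{ReLU}(M + b \otimes \mathbf{1}^T), \quad M := AC, \tag{15}$$

where $\mathbf{1} \in \mathbb{R}^n$ denotes the all-ones vector.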
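The matrix form of the observation model can be checked against the column-by-column definition. In the sketch below the sizes and the Gaussian draws for the weight matrix, coefficient matrix, and bias are illustrative assumptions; `np.kron` is used so that the code mirrors the Kronecker-product notation, although plain broadcasting would give the same result.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 20, 3, 50                      # assumed sizes with k < min(d, n)

A = rng.standard_normal((d, k))          # weight matrix
C = rng.standard_normal((k, n))          # coefficient matrix, one latent vector per column
b = rng.standard_normal(d)               # bias shared by all n observations

M = A @ C                                # low-rank matrix hidden behind the ReLU
bias_matrix = np.kron(b.reshape(d, 1), np.ones((1, n)))   # b (x) 1^T, shape (d, n)
Y = np.maximum(M + bias_matrix, 0.0)     # observation matrix

# Same matrix assembled one observation vector at a time.
Y_columns = np.stack(
    [np.maximum(A @ C[:, i] + b, 0.0) for i in range(n)], axis=1
)

rank_M = np.linalg.matrix_rank(M)
```

Every column of `bias_matrix` equals `b`, which is exactly the statement that the bias does not change across observations.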

We assume that the bias vector $b$ is a random vector comprising i.i.d. coordinates, with each coordinate being a copy of a random variable distributed according to a probability density function $p(\cdot)$.


Model for recovery. For the recovery problem, we are given a vector $v \in \mathbb{R}^d$, which is obtained by adding noise to a valid signal vector $y \in \mathbb{R}^d$ that is well modeled by a single-layer ReLU-network, with the matrix $A \in \mathbb{R}^{d \times k}$ and bias $b \in \mathbb{R}^d$. In particular, for some $c \in \mathbb{R}^k$ we have

$$v = y + e + w = \mathrm{ReLU}(Ac + b) + e + w,$$

where $w \in \mathbb{R}^d$ denotes the (dense) noise vector with bounded norm. On the other hand, the vector $e \in \mathbb{R}^d$ contains (potentially) large corruptions, also referred to as sparse errors or outliers (we assume $\|e\|_0 \le s$). The robust recovery problem in ReLU-networks corresponds to obtaining an estimate $\hat{c}$ of the true latent vector $c$ from the corrupt observation vector $v$ such that the distance between $\hat{c}$ and $c$ is small. A related problem of denoising in the presence of outliers only focuses on obtaining an estimate $\hat{y}$ which is close to the true signal vector $y$. For this part, we focus on the setting where the weight matrix $A$ is a random matrix with i.i.d. entries, where each entry of the matrix is distributed according to the standard Gaussian distribution. Furthermore, another crucial assumption is that the Hamming error $e$ is oblivious in its nature, i.e., the error vector is not picked in an adversarial manner given the knowledge of $A$ and $c$ (see footnote 2).

Footnote 2: It is an interesting problem to extend our results to a setting with adversarial errors. However, we note that this problem is an active area of research even in the case of linear measurements, see, e.g., Bhatia et al. (2015, 2017). We plan to explore this problem in future work.

3 Representation learning in a single-layer ReLU-network

In this paper, we employ the natural approach to learn the underlying weight matrix $A$ from the observation matrix $Y$. As the network maps a lower-dimensional coefficient vector $c \in \mathbb{R}^k$ to obtain a signal vector

$$y = \mathrm{ReLU}(Ac + b)$$

in dimension $d > k$, the matrix $M = AC$ (cf. (15)) is a low-rank matrix as long as $k < \min\{d, n\}$. In our quest of recovering the weight matrix $A$, we first focus on estimating the matrix $M$, when given access to $Y$. This task can be viewed as estimating a low-rank matrix from its partial (randomized) observations. Our work is inspired by the recent work of Davenport et al. (2014) on 1-bit matrix completion. However, as we describe later, the crucial difference between our model and the model of Davenport et al. (2014) is that the bias vector does not change over observations in our case. Nonetheless, we describe the model and main results of 1-bit matrix completion below to underscore the key ideas.

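1-bit matrix completion. In Davenport et al. (2014), the following observation model is considered. Given a low-rank matrix $M \in \mathbb{R}^{d \times n}$ and a differentiable function $f: \mathbb{R} \to [0, 1]$, the matrix $Y$ is assumed to be generated as follows (the authors assume that the entries of $Y$ take values in the set $\{+1, -1\}$; in this paper we state the equivalent model where the binary alphabet is $\{0, 1\}$).

$$Y_{i,j} = \begin{cases} 1 & \text{with probability } f(M_{i,j}), \\ 0 & \text{with probability } 1 - f(M_{i,j}). \end{cases} \tag{18}$$

Furthermore, one has access to only those entries of $Y$ that are indexed by the set $\Omega \subseteq [d] \times [n]$, where the set $\Omega$ is generated by including each $(i, j)$ with a certain probability. Given the observations $Y_\Omega$, the log-likelihood function associated with a matrix $X$ takes the following form (throughout this paper, $\log$ represents the natural logarithm).

$$\mathcal{L}_{\Omega, Y}(X) = \sum_{(i,j) \in \Omega} \Big( \mathbf{1}\{Y_{i,j} = 1\} \log f(X_{i,j}) + \mathbf{1}\{Y_{i,j} = 0\} \log\big(1 - f(X_{i,j})\big) \Big). \tag{19}$$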


Now, in order to estimate the matrix with bounded entries from , it is natural to maximize the log-likelihood function (cf. (19)) under the constraint that the matrix has rank .

subject to

where the last constraint is introduced to model the setting where (w.h.p.) the observations are assumed to have bounded coordinates. We note that such assumptions indeed hold in many observations of interests, such as images. Note that the formulation in (20) is clearly non-convex due to the rank constraint. Thus, Davenport et al. (2014) propose the following program.

subject to

Note that the nuclear-norm constraint is a convex relaxation of the non-convex rank constraint, which is required to ensure that the program in (21) outputs a low-rank matrix. Let $\hat{M}$ be the output of the program in (21). Davenport et al. (2014) obtain the following result to characterize the quality of the obtained solution $\hat{M}$.

Proposition 1 ((Davenport et al., 2014, Theorem A.1)).

Assume that and . Let be as defined in (18). Then, for absolute constants and , with probability at least , the solution of (21) satisfies the following:


where the constant depends on the flatness and steepness of the function
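One way to see why a nuclear-norm constraint can stand in for a rank constraint in (21): for any $d \times n$ matrix $X$ of rank at most $r$ with entries bounded by $\alpha$ in absolute value, $\|X\|_* \le \sqrt{r}\,\|X\|_F \le \alpha\sqrt{r d n}$. The sketch below checks this chain numerically. The symbols $r$ and $\alpha$, the sizes, and the specific form of the bound follow the standard argument in 1-bit matrix completion and are assumptions here, not values taken from this section.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, r, alpha = 30, 40, 4, 1.0          # assumed sizes, rank, and entry bound

# Random rank-r matrix, rescaled so that its largest entry has magnitude alpha.
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))
X *= alpha / np.max(np.abs(X))

nuclear = np.linalg.norm(X, ord="nuc")   # sum of singular values
frobenius = np.linalg.norm(X, ord="fro")
bound = alpha * np.sqrt(r * d * n)

# Cauchy-Schwarz on the r nonzero singular values gives the first inequality;
# the entrywise bound gives the second.
chain_holds = (nuclear <= np.sqrt(r) * frobenius + 1e-9) and (
    frobenius <= alpha * np.sqrt(d * n) + 1e-9
)
```

The set of matrices satisfying the nuclear-norm bound is convex and contains every bounded rank-$r$ matrix, which is what makes the relaxed program tractable.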

Learning in a single layer and 1-bit matrix completion: Main differences. Note that the problem of estimating the matrix $M = AC$ from $Y$ is related to the problem of 1-bit matrix completion as defined above. Similar to the 1-bit matrix completion setup, the observation matrix $Y$ is obtained by transforming the original matrix $M$ in a probabilistic manner, which is dictated by the underlying distribution of the bias vector $b$. In particular, we get to observe the entire observation matrix, i.e., $\Omega = [d] \times [n]$.

However, there is a key difference between the two aforementioned setups. The 1-bit matrix completion setup studied in Davenport et al. (2014) (in fact, most of the literature on non-linear matrix completion Ganti et al. (2015)) assumes that each entry of the original matrix $M$ is independently transformed to obtain the observation matrix $Y$. In contrast, such independence is absent in the single-layer ReLU-network. In particular, for $i \in [d]$, the $i$-th row of the matrix $Y$ is obtained from the corresponding row of $M$ by utilizing the shared randomness defined by the bias $b_i$. Note that the bias associated with a coordinate of the observed vector in our generative model should not vary across observation vectors. This prevents us from applying the known results to the problem of estimating $M$ from $Y$. However, as we show in the remainder of this paper, the well-behaved nature of the ReLU function allows us to deal with the dependence across the entries of a row in $Y$ and obtain recovery guarantees that are similar to those described in Proposition 1.

3.1 Representation learning from rectified observations

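We now focus on the task of recovering the matrix $M = AC$ from the observation matrix $Y$. Recall that, under the single-layer ReLU-network, the observation matrix $Y$ depends on the matrix $M$ as follows.

$$Y = \mathrm{ReLU}(M + b \otimes \mathbf{1}^T).$$

For $i \in [d]$, we define $N_Y(i) \subseteq [n]$ as the set of positive coordinates of the $i$-th row of the matrix $Y$, i.e.,

$$N_Y(i) = \{ j \in [n] : Y_{i,j} > 0 \}.$$

Note that, for $i \in [d]$, the original matrix $M$ needs to satisfy the following requirements.

$$M_{i,j} + b_i = Y_{i,j} \quad \text{for } j \in N_Y(i), \tag{25}$$

$$M_{i,j} + b_i \le 0 \quad \text{for } j \in [n] \setminus N_Y(i). \tag{26}$$

Given the original matrix $M$, for $i \in [d]$ and $j \in [n]$, let $M_{i,(j)}$ denote the $j$-th largest element of the $i$-th row of $M$, i.e., for $i \in [d]$, $M_{i,(1)} \ge M_{i,(2)} \ge \cdots \ge M_{i,(n)}$.

It is straightforward to verify from (25) and (26) that $N_Y(i)$ denotes the indices of the $|N_Y(i)|$ largest entries of the $i$-th row of $M$. Furthermore, whenever $|N_Y(i)| = s_i \ge 1$, we have

$$b_i = Y_{i,(1)} - M_{i,(1)} = \cdots = Y_{i,(s_i)} - M_{i,(s_i)}. \tag{27}$$

Similarly, it follows from (26) that whenever we have $|N_Y(i)| = 0$, then $b_i$ satisfies the following.

$$b_i \le -\max_{j \in [n]} M_{i,j} = -M_{i,(1)}. \tag{28}$$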
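The row-wise constraints linking the bias to the observed and original matrices can be illustrated on a small example. The sketch below assumes access to both matrices, which only happens in a simulation, and the function name `bias_information` is hypothetical. A row with at least one positive observation pins the bias down exactly, while an all-zero row only yields an upper bound.

```python
import numpy as np

def bias_information(Y, M):
    """For each row i, return ('exact', b_i) if row i of Y has a positive
    entry, and ('upper_bound', -max_j M[i, j]) otherwise."""
    info = []
    for y_row, m_row in zip(Y, M):
        positive = np.flatnonzero(y_row > 0)      # positive coordinates of the row
        if positive.size > 0:
            # Every positive coordinate satisfies Y_ij - M_ij = b_i.
            info.append(("exact", float(np.mean(y_row[positive] - m_row[positive]))))
        else:
            # All coordinates were cut off, so M_ij + b_i <= 0 for every j.
            info.append(("upper_bound", float(-m_row.max())))
    return info

M = np.array([[1.0, -2.0, 0.5],
              [0.3, 0.1, -0.4]])
b = np.array([0.5, -1.0])
Y = np.maximum(M + b[:, None], 0.0)   # rows: [1.5, 0.0, 1.0] and [0.0, 0.0, 0.0]

info = bias_information(Y, M)
```

Here the first row recovers its bias of 0.5 exactly, and the second row only reveals that its bias is at most -0.3.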


Based on these observations, we define the set of matrices as


Recall that $p$ denotes the probability density function of each bias RV. We can then write the likelihood that a matrix results in the observation matrix $Y$ as follows.


where, for ,


By using the notation and , we can rewrite (31) as follows.


Therefore the log-likelihood of observing given that is the original matrix takes the following form.


In what follows, we work with a slightly modified quantity

In order to recover the matrix from the observation matrix , we employ the natural maximum likelihood approach which is equivalent to the following.


Define to be such that for all with . In what follows, we simply refer this quantity as as and are clear from context. The following result characterizes the performance of the program proposed in (34).

Theorem 3.

Assume that and the observation matrix is related to according to (15). Let be the solution of the program specified in (34), and the bias density function is differentiable with bounded derivative. Then, the following holds with probability at least .


where, is a constant. The quantities and depend on the distribution of the bias and are defined in (7) and (8), respectively.

The proof of Theorem 3 crucially depends on the following lemma.

Lemma 1.

Given the observation matrix which is related to the matrix according to (15), let be as defined in (29). Then, for any , we have


The proof of this lemma is delegated to the appendix. Now we are ready to prove Theorem 3.

Proof of Theorem 3.

Let be the solution of the program in (34). In what follows, we use as a short hand notation for . We have,


which means,


We now employ Lemma 1 to obtain that


We now proceed to upper bound the right hand side of (39). It follows from the standard symmetrization trick Devroye et al. (2013) that, for any integer , we have


where are i.i.d. Rademacher random variables. Note that, for ,


At this point, we can combine the contraction principle with (3.1) to obtain the following.


where and follow from the Cauchy-Schwarz inequality and the fact that, for , , respectively. Now using Markov’s inequality, it follows from (3.1) that


where follows from (3.1); and follows by setting and . ∎

3.2 Recovering the network parameters

As established in Theorem 3, the program proposed in (34) recovers a matrix such that


Let us denote the recovered matrix as $\hat{M} = M + E$, where $E$ denotes the perturbation matrix that has bounded Frobenius norm (cf. (43)). Now the task of recovering the parameters of the single-layer ReLU-network is equivalent to solving for $A$ given

$$\hat{M} = M + E = AC + E.$$


In our setting where we have $A \in \mathbb{R}^{d \times k}$ and $C \in \mathbb{R}^{k \times n}$ with $d > k$ and $n > k$, $M$ is a low-rank matrix with its column space spanned by the columns of $A$. Therefore, as long as the generative model ensures that the matrix $M$ has its singular values sufficiently bounded away from $0$, we can resort to standard results from matrix-perturbation theory and output the top $k$ left singular vectors of $\hat{M}$ as a candidate for the orthonormal basis for the column space of $M$ or $A$. In particular, we can employ the result from Yu et al. (2015) which is stated in Appendix A. Let $U$ and $\hat{U}$ be the top $k$ left singular vectors of $M$ and $\hat{M}$, respectively. Note that, even without the perturbation, we could only hope to recover the column space of $A$ (or the column space of $M$) and not the exact matrix $A$. Let $\sigma_k(M)$, the smallest non-zero singular value of $M$, be at least $\delta > 0$. Then, it follows from Theorem 5 (cf. Appendix A) and (43) that there exists an orthogonal matrix $O \in \mathbb{R}^{k \times k}$ such that


which is a guarantee that the column space of $A$ is recovered within an error of $O(d)$ in Frobenius norm by the column space of $\hat{M}$.
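The column-space step can be sketched as follows. The snippet assumes that an estimate of the low-rank matrix is available as the true matrix plus a small perturbation; here that perturbation is simulated with Gaussian noise of scale 0.01 as a stand-in for the estimation error of the maximum-likelihood program, and the helper names are hypothetical. The alignment by an orthogonal matrix is computed with the orthogonal Procrustes solution, which is one standard way to compare two orthonormal bases.

```python
import numpy as np

def top_left_singular_vectors(X, k):
    """Orthonormal basis (d x k) spanned by the top-k left singular vectors of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]

def aligned_subspace_error(U_hat, U):
    """min over orthogonal O of ||U_hat @ O - U||_F (orthogonal Procrustes)."""
    P, _, Qt = np.linalg.svd(U_hat.T @ U)
    O = P @ Qt
    return float(np.linalg.norm(U_hat @ O - U, ord="fro"))

rng = np.random.default_rng(4)
d, k, n = 40, 3, 200                       # assumed sizes with d > k and n > k
A = rng.standard_normal((d, k))
C = rng.standard_normal((k, n))
M = A @ C
E = 0.01 * rng.standard_normal((d, n))     # simulated perturbation (assumption)

U = top_left_singular_vectors(M, k)        # basis for the column space of A
U_hat = top_left_singular_vectors(M + E, k)
err = aligned_subspace_error(U_hat, U)

# Projecting A onto span(U) should leave it unchanged, since span(U) = span(A).
projection_residual = float(np.linalg.norm(A - U @ (U.T @ A)))
```

The error shrinks as the perturbation gets smaller relative to the smallest non-zero singular value of the low-rank matrix, which is the dependence the perturbation bound makes precise.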

4 Robust recovery in single-layer ReLU-network

We now explore the second fundamental question that arises in the context of reconstructing a signal vector belonging to the underlying generative model from its erroneous version. Recall that we are given a vector $v \in \mathbb{R}^d$, which is obtained by adding noise to a valid message vector $y \in \mathbb{R}^d$ that is well modeled by a single-layer ReLU-network, i.e.,

$$v = y + e + w = \mathrm{ReLU}(Ac + b) + e + w.$$

Here, $w$ denotes the (dense) noise vector with bounded norm. On the other hand, the vector $e$ contains (potentially) large corruptions, also referred to as outliers. We assume the number of outliers to be bounded above by $s$. The robust recovery problem in ReLU-networks corresponds to obtaining an estimate $\hat{c}$ of the true representation $c$ from the corrupt observation vector $v$ such that the distance between $\hat{c}$ and $c$ is small. A related problem of denoising in the presence of outliers only focuses on obtaining an estimate $\hat{y}$ which is close to the true message vector $y$. In the remainder of this paper, we focus on the setting where the weight matrix $A$ is a random matrix with i.i.d. entries, where each entry is distributed according to the standard Gaussian distribution. Furthermore, another crucial assumption is that the outlier vector is oblivious in its nature, i.e., the error vector is not picked in an adversarial manner given the knowledge of $A$ and $c$ (see footnote 5).

Footnote 5: It is an interesting problem to extend our results to a setting with adversarial errors. However, we note that this problem is an active area of research even in the case of linear measurements, see, e.g., Bhatia et al. (2015, 2017). We plan to explore this problem in future work.

Note that Soltanolkotabi (2017) studies a problem which is equivalent to recovering the latent vector $c$ from the observation vector generated from a single-layer ReLU-network without the presence of outliers. In that sense, our work is a natural generalization of the work in Soltanolkotabi (2017) and presents a recovery method which is robust to errors as well. However, our approach significantly differs from that in Soltanolkotabi (2017), where the author analyzes the convergence of the gradient descent method to the true representation vector $c$. In contrast, we rely on the recent work of Plan and Vershynin (2016) to employ the LASSO method to recover the representation vector $c$ (and the Hamming error vector $e$).

Given $v$, which corresponds to the corrupted non-linear observations of $c$, we try to fit a linear model to these observations by solving the following optimization problem (see footnote 6).

Footnote 6: Note that this paper deals with a setup where the number of observations is greater than the dimension of the signal that needs to be recovered, i.e., $d > k$. Therefore, we do not necessarily require the vector $c$ to belong to a restricted set, as is done in other versions of the robust LASSO methods for linear measurements (see, e.g., Nguyen and Tran (2013)).


In the aforementioned formulation, the $\ell_1$ regularizer is included to encourage sparsity in the estimated outlier vector. The following result characterizes the performance of our proposed program (cf. (47)) in recovering the representation and the corruption vector.
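A minimal sketch of such a joint fit is given below. It assumes an objective of the form $\tfrac{1}{2}\|v - Ac - e\|_2^2 + \lambda \|e\|_1$; the normalization, the value of $\lambda$, the zero bias, and the use of alternating minimization (a least-squares step for the signal, a soft-thresholding step for the outliers) are all assumptions of this illustration and are not claimed to match the program or solver analyzed in the paper. Because the objective is jointly convex, alternating exact minimization over the two blocks is a reasonable simple solver.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1, applied coordinate-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def robust_lasso(A, v, lam, num_iters=100):
    """Alternating minimization of 0.5 * ||v - A c - e||^2 + lam * ||e||_1."""
    d, k = A.shape
    c = np.zeros(k)
    e = np.zeros(d)
    for _ in range(num_iters):
        c, *_ = np.linalg.lstsq(A, v - e, rcond=None)   # exact minimizer in c
        e = soft_threshold(v - A @ c, lam)              # exact minimizer in e
    return c, e

rng = np.random.default_rng(5)
d, k, s = 1000, 10, 30                         # assumed sizes with d > k
A = rng.standard_normal((d, k))                # i.i.d. standard Gaussian matrix
c_true = rng.standard_normal(k)
c_true /= np.linalg.norm(c_true)               # unit-norm latent vector
e_true = np.zeros(d)
e_true[rng.choice(d, size=s, replace=False)] = 10.0   # sparse, large outliers
w = rng.uniform(-0.05, 0.05, size=d)           # dense, small noise
v = np.maximum(A @ c_true, 0.0) + e_true + w   # corrupted ReLU measurements

c_hat, e_hat = robust_lasso(A, v, lam=2.0)
cosine = float(c_hat @ c_true / np.linalg.norm(c_hat))
scale = float(np.linalg.norm(c_hat))
```

The estimate is expected to align with a scaled copy of the latent vector rather than with the vector itself; with zero bias the scaling constant of the linearized model is $1/2$, so `scale` should land near that value.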

Theorem 4.

Let $A \in \mathbb{R}^{d \times k}$ be a random matrix with i.i.d. standard Gaussian random variables as its entries, and let $v \in \mathbb{R}^d$ satisfy


where , and . Let be defined as , where is a standard Gaussian random variable and is a random variable that represents the bias in a coordinate in (48). Let be the outcome of the program described in (47). Then, with high probability, we have